Replication Data

Diversity in Software Engineering Research


One of the goals of software engineering research is to achieve generality: Are the phenomena found in a few projects reflective of others? Will a technique perform as well on projects other than the projects it is evaluated on? While it is common sense to select a sample that is representative of a population, the importance of diversity is often overlooked, yet it is just as important. In this paper, we combine ideas from representativeness and diversity and introduce a measure called sample coverage, defined as the percentage of projects in a population that are similar to a given sample. We introduce algorithms to compute the sample coverage for a given set of projects and to select the projects that increase the coverage the most. We demonstrate our technique on research presented over two years at ICSE and FSE with respect to a population of 20,000 active open source projects. Knowing the coverage of a sample enhances our ability to reason about the findings of a study. Furthermore, we propose reporting guidelines for research: in addition to coverage scores, papers should discuss the target population of the research (universe) and the dimensions that can potentially influence the outcomes of the research (space).
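To make the definition of sample coverage concrete, here is a minimal Python sketch. It is not the code shipped in this package. The similarity rule is an assumption made for illustration: numeric dimensions must lie within half an order of magnitude, and categorical dimensions must match exactly. The function names, the `tol` parameter, and the toy projects are all hypothetical.

```python
import math

def similar(a, b, numeric_dims, categorical_dims, tol=0.5):
    """Assumed similarity rule (illustrative only): every numeric dimension is
    within `tol` orders of magnitude and every categorical dimension is equal."""
    for d in numeric_dims:
        if abs(math.log10(a[d] + 1) - math.log10(b[d] + 1)) > tol:
            return False
    return all(a[d] == b[d] for d in categorical_dims)

def covered(universe, sample, numeric_dims, categorical_dims):
    """Indices of universe projects similar to at least one sample project."""
    return {
        i for i, p in enumerate(universe)
        if any(similar(p, s, numeric_dims, categorical_dims) for s in sample)
    }

def sample_coverage(universe, sample, numeric_dims, categorical_dims):
    """Percentage of the universe that is similar to the sample."""
    hits = covered(universe, sample, numeric_dims, categorical_dims)
    return 100.0 * len(hits) / len(universe)

def greedy_select(universe, k, numeric_dims, categorical_dims):
    """Greedily pick up to k projects, each adding the most newly covered projects."""
    neighborhoods = [
        covered(universe, [p], numeric_dims, categorical_dims) for p in universe
    ]
    chosen, covered_so_far = [], set()
    for _ in range(k):
        best = max(range(len(universe)),
                   key=lambda i: len(neighborhoods[i] - covered_so_far))
        if not neighborhoods[best] - covered_so_far:
            break  # no remaining project adds coverage
        chosen.append(universe[best])
        covered_so_far |= neighborhoods[best]
    return chosen

# Toy universe (hypothetical values, not taken from the replication data).
universe = [
    {"name": "A", "loc": 1_000, "lang": "Java"},
    {"name": "B", "loc": 2_000, "lang": "Java"},
    {"name": "C", "loc": 1_000_000, "lang": "Java"},
    {"name": "D", "loc": 1_500, "lang": "C"},
]

print(sample_coverage(universe, [universe[0]], ["loc"], ["lang"]))
print([p["name"] for p in greedy_select(universe, 2, ["loc"], ["lang"])])
```

The greedy loop is one simple way to choose projects that increase the coverage the most at each step. It precomputes each project's neighborhood, which is quadratic in the size of the universe, so it is meant as a sketch for small inputs and not as a tuned implementation.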


If you found this replication package helpful or used it for your own project, please consider citing our original paper:

title = "Diversity in Software Engineering Research",
author = "Meiyappan Nagappan and Thomas Zimmermann and Christian Bird",
year = "2013",
month = "August",
booktitle = "Proceedings of the 9th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering",
location = "Saint Petersburg, Russia",
publisher = "ACM",

A preprint of the paper is available here.

Data and Source Code

(last modified June 2nd 2013)

Source Code: (1.32 MB)

Raw Data: (63 MB)

Masterdata: masterdata.txt (6.3 MB)

Masterdata: Conferences-Masterdata.txt (53 kB)

Spreadsheets with the conference data: Conferences-Masterdata.xlsx (41 kB)