Replication Data

The Impact of Classifier Configuration and Combination on Bug Localization

Stephen W. Thomas, Meiyappan Nagappan, Dorothea Blostein, and Ahmed E. Hassan.
Submitted to IEEE Transactions on Software Engineering.


Bug localization is the task of determining the source code entities that are relevant to a new bug report. Manual bug localization is labor intensive, since a developer must consider hundreds or thousands of source code entities. Current research builds bug localization classifiers, based on information retrieval models, to locate entities that are textually similar to a given bug report. This research, however, does not consider the effect of classifier configuration, i.e., the full set of parameter values that specify the exact behavior of the classifier. As such, it is unknown how important each parameter is, or which particular parameter values lead to the best overall bug localization performance. In this paper, we empirically investigate the effectiveness of a large space of classifier configurations, 3,172 in total. Further, we introduce a framework for combining the results of multiple classifier configurations, a technique that has shown promise in many other domains. Through a detailed case study on over 8,000 bug reports from three real-world systems, we determine (a) that the parameters of a classifier have a significant impact on its performance, and therefore practitioners and researchers must consider them carefully, and (b) that combining multiple classifiers improves the performance of even the best individual classifiers, often by significant amounts. Our results substantially improve the state of the art in bug localization.
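
To make the idea of combining classifier results concrete, the sketch below merges several ranked lists of source code entities using a Borda count, one simple rank-aggregation scheme. It is an illustration only: the function name borda_combine, the example file names, and the choice of Borda count are assumptions made here, and the combination framework in the paper may differ.

```python
from collections import defaultdict


def borda_combine(rankings):
    """Combine ranked entity lists with a Borda count (illustrative sketch).

    rankings: list of lists; each inner list holds entity names ordered
    from most to least relevant according to one classifier configuration.
    Returns all entities sorted by total Borda points, highest first.
    """
    # Collect every entity returned by at least one classifier.
    all_entities = set()
    for ranking in rankings:
        all_entities.update(ranking)
    n = len(all_entities)

    # An entity at 0-based position p in a ranking earns n - p points.
    # Entities a classifier did not return earn nothing from it.
    scores = defaultdict(int)
    for ranking in rankings:
        for position, entity in enumerate(ranking):
            scores[entity] += n - position

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(all_entities, key=lambda e: (-scores[e], e))


if __name__ == "__main__":
    # Hypothetical outputs of three classifier configurations for one bug report.
    example = [
        ["Parser.java", "Lexer.java", "Util.java"],
        ["Lexer.java", "Parser.java", "Ast.java"],
        ["Lexer.java", "Util.java", "Parser.java"],
    ]
    print(borda_combine(example))
```

An entity that several configurations rank highly accumulates more points than one that a single configuration ranks first, which is the intuition behind combining classifiers.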



Data and Scripts