Replication Data

Modeling the Evolution of Topics in Source Code Histories

Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
MSR 2011


Studying the evolution of topics (collections of co-occurring words) in a software repository is an emerging technique to automatically shed light on how the repository is changing over time: which topics are becoming more actively developed, which ones are dying down, or which topics are lately more error-prone and hence require more testing. Existing techniques for modeling the evolution of topics in software repositories suffer from issues of data duplication, i.e., when the repository contains multiple copies of the same document, which is typical in source code histories. To address this issue, we propose the Diff model, which applies a topic model only to the changes, or diffs, of the documents in each version instead of to the whole document at each version.
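To make the idea of modeling only the diffs concrete, the following is a minimal sketch and not the scripts from this package. It assumes each version of a document is available as plain text and uses Python's standard difflib to keep only the lines that changed between two versions. The function name delta_documents and the sample texts are hypothetical, chosen for illustration.

```python
import difflib


def delta_documents(old_text, new_text):
    """Return (added, deleted) lines between two versions of one document.

    Hypothetical helper: only these changed lines, rather than the whole
    new version, would be passed on to a topic model.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    added, deleted = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])    # line present only in the new version
        elif line.startswith("- "):
            deleted.append(line[2:])  # line present only in the old version
        # Unchanged lines ("  ") and hint lines ("? ") are ignored.
    return added, deleted


# Two made-up versions of the same file.
v1 = "open file\nread data\nclose file"
v2 = "open file\nread data\nparse data\nclose file"
added, deleted = delta_documents(v1, v2)
```

In this sketch the unchanged lines never reach the topic model, so a file that is copied forward unmodified across many versions contributes no duplicate text.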


The camera-ready copy can be found here.
The publisher's webpage can be found here.


If you found this replication package helpful or used it for your own project, please consider citing our original paper:

   title = {Modeling the evolution of topics in source code histories},
   booktitle = {Proceedings of the 8th Working Conference on Mining Software Repositories},
   author = {S. W. Thomas and B. Adams and A. E. Hassan and D. Blostein},
   pages = {173--182},
   year = {2011},

Data and Scripts

Here we supply the data and results of our studies thus far.