Replication Data

Modeling the Evolution of Topics in Source Code Histories

Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
MSR 2011


Studying the evolution of topics (collections of co-occurring words) in a software repository is an emerging technique to automatically shed light on how the repository is changing over time: which topics are becoming more actively developed, which ones are dying down, or which topics are lately more error-prone and hence require more testing. Existing techniques for modeling the evolution of topics in software repositories suffer from issues of data duplication, i.e., when the repository contains multiple copies of the same document, which is typical in source code histories. To address this issue, we propose the Diff model, which applies a topic model only to the changes, or diffs, of the documents in each version instead of to the whole document at each version.
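As a rough illustration of the idea behind the Diff model, the sketch below reduces each version of a document to the lines that changed relative to the previous version, so that unchanged text is not passed to a topic model again. It is not the authors' implementation: the function name `delta_documents`, the line-level granularity, and the choice to keep added and removed lines separately are all assumptions made for this example, and the topic-modeling step itself is omitted.

```python
import difflib


def delta_documents(versions):
    """Reduce each version of a document to its changes.

    `versions` is a list of versions of one document, each given as a
    list of lines. The result has one (added, removed) pair per version,
    computed against the preceding version. The first version is
    compared with an empty document, so all of its lines count as added.

    Hypothetical helper for illustration only; the name and the return
    shape are assumptions, not taken from the paper.
    """
    deltas = []
    previous = []
    for current in versions:
        added, removed = [], []
        # ndiff prefixes each output line with "+ ", "- ", "  " or "? ".
        for line in difflib.ndiff(previous, current):
            if line.startswith("+ "):
                added.append(line[2:])
            elif line.startswith("- "):
                removed.append(line[2:])
        deltas.append((added, removed))
        previous = current
    return deltas


if __name__ == "__main__":
    history = [
        ["int a;", "int b;"],  # version 1
        ["int a;", "int b;"],  # version 2: exact copy of version 1
        ["int a;", "int c;"],  # version 3: one line changed
    ]
    for number, (added, removed) in enumerate(delta_documents(history), 1):
        print(number, "added:", added, "removed:", removed)
```

In this toy history the second version is an exact copy of the first, so its delta is empty and contributes nothing further, which is the duplication effect the abstract describes. Under these assumptions, the non-empty deltas would be the input to the topic model in place of the full versions.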


The camera-ready copy can be found here.
The publisher's webpage can be found here.


Data and Scripts

Here we supply the data and results of our studies thus far.