Replication Data

Studying Software Logging Using Topic Models

Heng Li, Tse-Hsun Chen, Weiyi Shang, and Ahmed E. Hassan
Submitted to the Empirical Software Engineering journal.


Abstract

Software developers insert logging statements in their source code to record important runtime information; such logged information is valuable for understanding system usage in production and debugging system failures. However, providing proper logging statements remains a manual and challenging task. Missing an important logging statement may increase the difficulty of debugging a system failure, while too much logging can increase system overhead and mask the truly important information. Intuitively, the actual functionality of a software component is one of the major drivers behind logging decisions. For instance, a method maintaining network communications is more likely to be logged than getters and setters. In this paper, we use automatically computed topics of a code snippet to approximate its functionality. We study the relationship between the topics of a code snippet and the likelihood of the snippet being logged (i.e., containing a logging statement). Our driving intuition is that certain topics in the source code are more likely to be logged than others. To validate our intuition, we conduct a case study on six open source systems, and we find that i) there exists a small number of “log-intensive” topics that are more likely to be logged than other topics; ii) each pair of the studied systems shares 12% to 62% common topics, and their likelihood of logging such common topics has a statistically significant correlation of 0.35 to 0.62; and iii) our topic-based metrics help explain the likelihood of a code snippet being logged, providing an improvement of 3% to 13% on AUC and 6% to 16% on balanced accuracy over a set of baseline metrics that capture the structural information of a code snippet. Our findings highlight that topics contain valuable information that can help guide and drive developers' logging decisions.

BibTeX

(Under submission)

Replication Package

In our replication package, we share instructions for running topic modeling (LDA) as in our study, our topic model output data, and the baseline metrics extracted from the six studied systems, so that interested researchers can use the shared data to replicate or improve on our study.
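
For researchers who re-evaluate models on the shared data, the two measures reported in the abstract (AUC and balanced accuracy) can be computed along the following lines. This is only a minimal sketch using the standard definitions of both measures; it is not taken from the replication package. The function names, the 0.5 decision threshold, and the example labels and scores are all assumptions made for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example data (an assumption, not from the study):
# label 1 = the snippet contains a logging statement, 0 = it does not.
labels = [1, 0, 1, 0, 0, 1]
# Hypothetical predicted probabilities of being logged.
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.8]
# Assumed decision threshold of 0.5 for the binary prediction.
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(balanced_accuracy(labels, predicted))
print(auc(labels, scores))
```

Balanced accuracy averages the accuracy on logged and unlogged snippets, so it is not inflated when most snippets contain no logging statement. AUC is independent of the decision threshold.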