Replication Data

Studying Software Logging Using Topic Models

Heng Li, Tse-Hsun Chen, Weiyi Shang, and Ahmed E. Hassan
Submitted to Empirical Software Engineering Journal.

Abstract

Software developers insert logging statements in their source code to record important runtime information; such logged information is valuable for understanding system usage in production and debugging system failures. However, providing proper logging statements remains a manual and challenging task. Missing an important logging statement may increase the difficulty of debugging a system failure, while too much logging can increase system overhead and mask the truly important information. Intuitively, the actual functionality of a software component is one of the major drivers behind logging decisions. For instance, a method maintaining network communications is more likely to be logged than getters and setters. In this paper, we use automatically-computed topics of a code snippet to approximate the functionality of a code snippet. We study the relationship between the topics of a code snippet and the likelihood of a code snippet being logged (i.e., to contain a logging statement). Our driving intuition is that certain topics in the source code are more likely to be logged than others. To validate our intuition, we conduct a case study on six open source systems, and we find that i) there exists a small number of “log-intensive” topics that are more likely to be logged than other topics; ii) each pair of the studied systems share 12% to 62% common topics, and their likelihood of logging such common topics has a statistically significant correlation of 0.35 to 0.62; and iii) our topic-based metrics help explain the likelihood of a code snippet being logged, providing an improvement of 3% to 13% on AUC and 6% to 16% on balanced accuracy over a set of baseline metrics that capture the structural information of a code snippet. Our findings highlight that topics contain valuable information that can help guide and drive developers' logging decisions.

BibTeX

(Under submission)

Replication Package

In our replication package, we share how to run topic modeling (LDA) in our study, our topic model output data, and our baseline metrics that are extracted from the six studied systems, so that interested researchers can use the shared data to replicate or improve on our study.

Running topic modeling. First, we use a lightweight source code preprocessor to preprocess the source code of a subject system. Then, we use the MALLET toolkit to run topic modeling on the preprocessed data. Our script for running the MALLET tool can be downloaded at run_mallet.sh.
Topic model output data. We share the topic model output data for the individual systems, which is used in RQ1 and RQ3, as well as the topic model output data for the combined systems, which is used in RQ2. The shared data include the document-topic files, which show the topic distribution of each document, and the topic-word files, which list the most probable words in each topic. Both types of output files are used in our analysis.
Baseline metrics data. We share the data of all our baseline metrics that are used in RQ3 as well as the number of logging statements in each method.
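
As a starting point for working with the document-topic files, the following Python sketch reads topic proportions and groups documents by topic. It is only an illustration and not the analysis code used in the study. It assumes a tab-separated, dense layout (document index, document name, then one proportion per topic), which may differ from the shared files depending on the MALLET version. The function names, the sample rows, and the 0.1 membership threshold are all hypothetical.

```python
from collections import defaultdict


def parse_doc_topics(lines):
    """Parse rows of the assumed form: <index> TAB <name> TAB <p_0> TAB ... <p_K-1>."""
    docs = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and an optional header line starting with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name = fields[1]
        docs[name] = [float(value) for value in fields[2:]]
    return docs


def topic_membership(docs, threshold=0.1):
    """Map each topic index to the documents whose proportion is >= threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    members = defaultdict(list)
    for name, proportions in docs.items():
        for topic, proportion in enumerate(proportions):
            if proportion >= threshold:
                members[topic].append(name)
    return dict(members)


# Made-up sample rows; real input would come from a shared document-topic file.
sample = [
    "#doc name topic proportions",
    "0\tFoo.connect\t0.80\t0.15\t0.05",
    "1\tFoo.getName\t0.05\t0.05\t0.90",
]
docs = parse_doc_topics(sample)
members = topic_membership(docs)
```

To read a real file, pass an open file handle in place of `sample`, since the parser only iterates over lines. The resulting topic-to-document mapping could then be joined with the per-method logging statement counts from the baseline metrics data.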
