Abstract

Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance.
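To make the job-level phenomena concrete, the sketch below is a minimal, hypothetical illustration (not the paper's analysis scripts) of how a build's reported outcome can disagree with its job-level outcomes. The Job record, its fields, and the category labels are assumptions about how job data might be represented; the "allowed to fail" flag loosely mirrors mechanisms such as Travis CI's allow_failures.

from dataclasses import dataclass

@dataclass
class Job:
    """One job (build variant) within a CI build matrix. Hypothetical schema."""
    status: str            # "passed", "failed", or "skipped"
    failure_allowed: bool  # True if the config marks this job as allowed to fail

def classify_build(build_status: str, jobs: list[Job]) -> str:
    """Label a build using its reported outcome plus its job-level outcomes.

    The categories loosely mirror the abstract: a "passing" build may hide
    ignored failures, and a "broken" build may still contain passing jobs
    (breakage local to a subset of variants).
    """
    failed = [j for j in jobs if j.status == "failed"]
    passed = [j for j in jobs if j.status == "passed"]

    if build_status == "passed":
        if any(j.failure_allowed for j in failed):
            return "passed with actively ignored failure"   # noisy "pass"
        if failed or any(j.status == "skipped" for j in jobs):
            return "passed with passively ignored or skipped job"
        return "clean pass"

    # Broken build: is the breakage local to a subset of build variants?
    return "locally broken" if passed else "fully broken"

# Example: a build reported as "passed" although one allowed-to-fail job failed.
jobs = [Job("passed", False), Job("failed", True), Job("skipped", False)]
print(classify_build("passed", jobs))  # -> "passed with actively ignored failure"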

Tools and Data

Preprint

BibTeX

@inproceedings{gallaba2018ase,
  Author = {Keheliya Gallaba and Christian Macho and Martin Pinzger and Shane McIntosh},
  Title = {Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI},
  Year = {2018},
  Booktitle = {Proc. of the International Conference on Automated Software Engineering (ASE)},
  Pages = {87--97}
}