Empirical Research on Software Engineering

A collection of evidence-based software engineering studies.


Results:
76% of the studies identified a significant increase in code coverage, and 88% identified a meaningful increase in external software quality. 44% of the studies found that TDD takes more time.
Method:
Systematic literature review of articles on TDD published between 1999 and 2014. 57% of the studies were experiments and 32% were case studies. In 12 of the 27 studies the participants were professionals; most studies used Java. Effects are considered regardless of the metrics used to measure them.
Results:
CBO (coupling between objects), WMC (weighted methods per class), and LOC are linked to reliability. WMC, RFC (response for a class), and LOC are linked to maintainability. Inheritance measures are only weakly linked to reliability and maintainability.
Method:
Systematic literature review of 99 peer-reviewed empirical studies published between 1996 and 2012. Most studies analyzed Java open source projects.
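
As a rough illustration of what these metrics capture (not the tooling used in the reviewed studies, which typically rely on parser-based tools such as CKJM), here is a regex-based sketch of LOC, WMC, and CBO proxies for a Java class:

```python
import re

def rough_ck_metrics(java_source: str) -> dict:
    """Illustrative, regex-based proxies for LOC, WMC, and CBO.
    Real studies use parser-based tools; regexes miss many Java constructs."""
    loc = sum(1 for line in java_source.splitlines() if line.strip())
    # WMC proxy: one unit of complexity per method plus one per branching keyword.
    methods = len(re.findall(r"\b(?:public|protected|private)\s+[\w<>\[\],\s]+\s+\w+\s*\(", java_source))
    branches = len(re.findall(r"\b(?:if|for|while|case|catch)\b", java_source))
    # CBO proxy: number of distinct imported types the class may be coupled to.
    imports = set(re.findall(r"^import\s+([\w.]+);", java_source, re.MULTILINE))
    return {"LOC": loc, "WMC(approx)": methods + branches, "CBO(approx)": len(imports)}
```
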
Results:
ITL (iterative test-last) outperforms TDD to a small, non-statistically-significant extent, leading to the conclusion that TDD does not increase quality over ITL. There were differences between participants. In contrast, most case studies and surveys on TDD report that it outperforms ITL on external quality (measured as test cases passed from a battery of tests).
Method:
Participants received 6 hours of training on TDD; a 2.25-hour experiment then compared TDD against ITL at two companies, Bittium (C++) and F-Secure (Java).
Results:
Generated test suites detected 56% of the bugs. 63% of the undetected bugs were nevertheless covered by the test suite. 15% of the tests were flaky; Randoop had the most flaky tests and AgitarOne the most false positives. Coverage ranged from 40% to 80%.
Method:
Applied the Java unit test generation tools Randoop, EvoSuite, and AgitarOne to the buggy versions in the Defects4J dataset to check whether the generated test suites can detect the bugs.
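
To make the flakiness figure concrete, here is a minimal sketch (hypothetical, in Python/unittest rather than the Java of the study) of the kind of nondeterministic assertion that makes a test flaky:

```python
import time
import unittest

def do_work():
    # Hypothetical unit under test.
    return sum(range(100_000))

class TimingTest(unittest.TestCase):
    def test_completes_quickly(self):
        start = time.monotonic()
        do_work()
        # Flaky: the assertion depends on wall-clock time and machine load,
        # so the same suite can pass on one run and fail on the next.
        self.assertLess(time.monotonic() - start, 0.01)

if __name__ == "__main__":
    unittest.main()
```
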
Results:
Statistically significant correlation between code coverage and bugs found.
Method:
Used Randoop to generate test suites with varying levels of coverage (0.2%, 0.5%, 1%, 5%, 10%, 100%) and ran them on Apache HttpClient and Mozilla Rhino.
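
A sketch of the statistical test behind such a claim; the coverage levels come from the study design above, while the detection counts are invented for illustration:

```python
from scipy.stats import spearmanr

coverage = [0.002, 0.005, 0.01, 0.05, 0.10, 1.00]  # suite coverage levels from the design
bugs_found = [1, 2, 2, 5, 7, 12]                   # hypothetical detection counts
rho, p_value = spearmanr(coverage, bugs_found)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # a small p supports the correlation
```
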
Results:
The density of useful comments is 65%. Experience with the code base is an important factor in increasing the density of useful comments in code reviews. Review effectiveness decreases as the number of files in the change set grows.
Method:
Interviewed developers to learn which comments they found useful, built an automated comment classifier based on the interviews, and applied it to 5 Microsoft projects.
Results:
Review coverage is negatively associated with the incidence of post-release defects. However, several defect-prone components have high coverage rates, suggesting that other properties of the code review process are at play. Lack of participation in code review has a negative impact on software quality: frequent reviews without sufficient discussion are associated with higher post-release defect counts, and components with many changes that involve no subject-matter expert as author or reviewer tend to be prone to post-release defects.
Method:
Case study of the Qt, VTK, and ITK projects. Extracted code review data from Gerrit and linked it to patches.
Results:
Only 9% of Nova and 14% of Neutron review comments are related to software design. 73% of the design-related comments also provide suggestions for addressing the concerns.
Method:
Trained and evaluated classifiers that automatically label review comments as design-related or not, using code reviews from two OpenStack projects: Nova and Neutron.
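
A minimal sketch of such a comment classifier, assuming a TF-IDF plus logistic regression pipeline (the study's exact features and models may differ, and the labeled comments below are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented examples; the study labeled real Nova and Neutron review comments.
comments = [
    "this logic belongs in the manager, not the driver",
    "consider extracting an interface to decouple these modules",
    "typo: recieve -> receive",
    "please add a space after the comma",
]
is_design_related = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
print(cross_val_score(clf, comments, is_design_related, cv=2, scoring="f1").mean())
```
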
Results:
Test failures are the most common reason for build failure. 30% of errors occur in the first half of the build runtime; the second half is dominated by test failures (70%). PRs tend to cause failures more frequently than changes pushed directly into a branch, and successfully building a PR does not guarantee the success of the subsequent merge.
Method:
Collected build data from CI servers and change data from GitHub for 14 well-known Java open source projects that use GitHub and Travis CI, each with more than 400 builds and 50 committers.
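
One way such failure reasons can be extracted, sketched as a heuristic log classifier. The markers below are typical of Maven output; the study's actual analysis of Travis CI logs may well differ:

```python
import re

# Heuristic classification of a Maven build log by failure reason.
FAILURE_PATTERNS = [
    (re.compile(r"Tests run: \d+, Failures: [1-9]"), "test failure"),
    (re.compile(r"COMPILATION ERROR"), "compile error"),
    (re.compile(r"Could not resolve dependencies"), "dependency error"),
]

def classify_build_log(log_text: str) -> str:
    for pattern, label in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return label
    return "other/unknown"
```
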
Results:
50% of the vulnerabilities had been in the code base for 14 months or more before being discovered. Severity score is not correlated with shorter lifetimes. In 21% of cases the vulnerability was disclosed before it was patched, while over 50% of vulnerabilities were patched more than a week before public disclosure.
Method:
Collected vulnerabilities from the NVD dataset as of December 2016 and linked them to the git commits that fixed them in open source projects.
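
A sketch of the patch-to-disclosure measurement, assuming the fixing commit has already been identified (linking CVEs to fix commits is the hard, largely manual part of such studies):

```python
import subprocess
from datetime import datetime, timezone

def patch_to_disclosure_days(repo: str, fix_sha: str, disclosed_iso: str) -> float:
    """Days between the fixing commit and public disclosure (NVD date).
    Positive means the patch landed before disclosure."""
    committed = subprocess.check_output(
        ["git", "-C", repo, "log", "-1", "--format=%cI", fix_sha], text=True
    ).strip()
    fixed_at = datetime.fromisoformat(committed)
    disclosed_at = datetime.fromisoformat(disclosed_iso).replace(tzinfo=timezone.utc)
    return (disclosed_at - fixed_at.astimezone(timezone.utc)).total_seconds() / 86400
```
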
Results:
80% of the bugs depend on workload conditions and 20% on the environment. For 58% of the bugs, manifestation requires a specific request type. 45% require two conditions to surface; bugs requiring only one condition are too trivial to survive early testing. In 25% of cases a sequence of requests is needed to trigger the bug.
Method:
Two authors analyzed 568 reported bugs in Apache2 and 98 in MySQL 5.1 to determine what triggers each bug in production.
Results:
Common robustness problems can be fixed without sacrificing program performance. 2 out of 4 surveyed professional developers could not predict the exception-handling characteristics of their software systems, and may not be able to write robust software systems without better training.
Method:
Applied the Ballista tool to test software modules for robustness to exceptional input conditions: POSIX calls on several Unix systems, the math library, and the SFIO library. Also examined how C++ and Java developers in two industrial projects handle exceptions.
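
A Ballista-style probe, sketched here in Python (Ballista itself targeted C APIs such as POSIX calls): feed exceptional inputs to a function and record whether it returns, raises cleanly, or would crash the process:

```python
import math

EXCEPTIONAL_INPUTS = [float("nan"), float("inf"), -1.0, 0.0, None, "x"]

def probe(func):
    """Record how func responds to each exceptional input."""
    outcomes = {}
    for value in EXCEPTIONAL_INPUTS:
        try:
            outcomes[repr(value)] = f"returned {func(value)!r}"
        except Exception as exc:
            # A clean, documented exception counts as robust behaviour;
            # a crash or hang of the whole process would not.
            outcomes[repr(value)] = f"raised {type(exc).__name__}"
    return outcomes

for value, outcome in probe(math.sqrt).items():
    print(value, "->", outcome)
```
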
Results:
With each vulnerability discovered within a bounty program, the probability of finding the next vulnerability decreases more rapidly than the corresponding payoff increases, so security researchers switch to newly launched bounty programs. The number of bugs discovered is a super-linear function of the number of enrolled security researchers.
Method:
Investigated a public dataset of 35 bug bounty programs from the HackerOne website.
Results:
The number of faults per line of code has been decreasing. The fault rate of the drivers directory is now below that of other directories (unlike 10 years ago). The kinds of faults considered 10 years ago are still relevant. The lifespan of faults is under 2 years. Fault-finding tools are used regularly, but they have had a small impact. Larger files and files with higher churn have higher fault rates, as do new files. Files compiled with allyesconfig (which enables as many options as possible) have a lower fault rate.
Method:
Applied Coccinelle and Herodotos to Linux versions 2.6.0 through 2.6.33 to find possible faults in the source code. Fault kinds include "always check bounds of array indices" and "check potentially NULL pointers returned from routines".
Results:
"Modern programming practices" are not so much miraculous productivity aids as they are sound management practices. Thus, their use will reduce the risk and margin of error in predicting and controlling project outcomes. "Modern programming practices" compare to other projects did not lead to reduced effort compared to software of similar size, but productivity was consistent. Had similar error rates and category of errors. Program size was 211K SLOC took 170K hours or 1133 person/month. Testing took 17% of effort.
Method:
Assessed the impact of modern programming practices in two software development projects spanning 2 years.
Results:
Apps with a higher average rating in Google Play generally use more stable and less fault-prone APIs.
Method:
Analyzed the relationship between the average Google Play ratings of 7,097 free Android apps and the stability and fault-proneness of the Android APIs they use. Fault-proneness was measured by the number of bugs fixed in the API; stability by the number of method-level changes.
Results:
Most of the implementation faults were related to well-known programming slips: the first value of a loop forgotten, the last value of a loop forgotten, or the initialisation of a variable omitted.
Method:
Participants wrote C programs to determine the maximum cycle length over any two numbers in the 3n+1 problem. Of the 29,012 programs submitted, 24% were correct and 35% wrong; the 39% with fatal errors were ignored.
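
For reference, the task and its most common slip, rendered here as a hypothetical Python version of the C exercise: `range(lo, hi)` silently drops the last value, reproducing the "last value of loop forgotten" fault from the results above:

```python
def cycle_length(n: int) -> int:
    """Steps to reach 1 in the 3n+1 iteration, counting n itself."""
    count = 1
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        count += 1
    return count

def max_cycle_length(i: int, j: int) -> int:
    lo, hi = min(i, j), max(i, j)
    # The interval is inclusive: range(lo, hi) would skip hi, which is
    # exactly the "last value of loop forgotten" slip.
    return max(cycle_length(n) for n in range(lo, hi + 1))
```
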
Results:
Around 25% of bugs result from reward-program contributions. The cost of a program is comparable to the cost of just one member of the browser security team, while the benefit outweighs what a single security researcher could provide.
Method:
Analyzed data from 2010 to 2013 on critical/high vulnerabilities affecting stable releases, and the bounties paid for them, in the Google Chrome and Mozilla Firefox vulnerability reward programs.