1. Model-based Fault Classification for Automotive Software (arXiv)
Abstract : Intensive testing using model-based approaches is the standard way of demonstrating the correctness of automotive software. Unfortunately, state-of-the-art techniques leave a crucial and labor-intensive task to the test engineer: identifying bugs in failing tests. Our contribution is a model-based classification algorithm for failing tests that assists the engineer when identifying bugs. It consists of three steps. (i) Fault localization replays the test on the model to identify the moment when the two diverge. (ii) Fault explanation then computes the reason for the divergence. The reason is a subset of actions from the test that is sufficient for divergence. (iii) Fault classification groups together tests that fail for similar reasons. Our approach relies on machinery from formal methods: (i) symbolic execution, (ii) Hoare logic and a new relationship between the intermediary assertions constructed for a test, and (iii) a new relationship among Hoare proofs. A crucial aspect in automotive software is timing requirements, for which we develop appropriate Hoare logic theory. We also briefly report on our prototype implementation for the CAN bus Unified Diagnostic Services in an industrial project.
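To make step (i) concrete, here is a minimal Java sketch of fault localization, assuming a simple request/response view of the system: replay each recorded test step on the model and report the first index where the model's response differs from the observed one. The Model and TestStep types are hypothetical placeholders for illustration, not taken from the paper.

```java
import java.util.List;

// Minimal sketch of fault localization: replay a recorded trace on a model of
// the system and report the first step where model and test diverge.
// Model and TestStep are hypothetical placeholder types.
public class FaultLocalizer {

    public interface Model {
        // Response the model produces for a given request
        // (e.g., a UDS request on the CAN bus).
        String respond(String request);
    }

    public record TestStep(String request, String observedResponse) {}

    /** Returns the index of the first diverging step, or -1 if test and model agree. */
    public static int locateDivergence(Model model, List<TestStep> trace) {
        for (int i = 0; i < trace.size(); i++) {
            TestStep step = trace.get(i);
            String expected = model.respond(step.request());
            if (!expected.equals(step.observedResponse())) {
                return i; // the moment where the failing test and the model diverge
            }
        }
        return -1;
    }
}
```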
2. Predicting Flaky Tests Categories using Few-Shot Learning (arXiv)
Abstract : Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers’ time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests free of flakiness. Recently, the research community has been pushing forward the detection of flaky tests by suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not, and even when high performance is reported, it remains challenging to understand the cause of flakiness. Understanding the cause is crucial for researchers and developers who aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages a Siamese network-based few-shot learning method to train a multi-class classifier with little data. We train and evaluate FlakyCat on a set of 343 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with a weighted F1 score of 70%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections, and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat’s predictions, we present a new technique for CodeBERT-based model interpretability that highlights the code statements influencing the categorization.
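To give an intuition for the few-shot idea (and only the idea, not FlakyCat's actual Siamese architecture), the Java sketch below assigns a test to the flaky-test category whose prototype, the mean embedding of a few labelled examples, is closest by cosine similarity; in FlakyCat the embeddings would come from a code model such as CodeBERT.

```java
import java.util.List;
import java.util.Map;

// Illustrative nearest-prototype classifier over code embeddings: each category
// is represented by the mean of its few labelled example embeddings, and a
// query test is assigned to the most similar prototype by cosine similarity.
public class FewShotCategoriser {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    static double[] mean(List<double[]> vectors) {
        double[] m = new double[vectors.get(0).length];
        for (double[] v : vectors)
            for (int i = 0; i < m.length; i++) m[i] += v[i] / vectors.size();
        return m;
    }

    /** support: category name -> a few labelled embeddings; query: embedding to classify. */
    static String classify(Map<String, List<double[]>> support, double[] query) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, List<double[]>> entry : support.entrySet()) {
            double sim = cosine(mean(entry.getValue()), query);
            if (sim > bestSim) {
                bestSim = sim;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```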
3. How Readable is Model-generated Code? Examining Readability and Visual Inspection of GitHub Copilot (arXiv)
Author : Naser Al Madi
Background: Recent advancements in large language models have motivated the practical use of such models in code generation and program synthesis. However, little is known about the effects of such tools on code readability and visual attention in practice. Objective: In this paper, we focus on GitHub Copilot to address the issues of readability and visual inspection of model-generated code. Readability and low complexity are vital aspects of good source code, and visual inspection of generated code is important in light of automation bias.
Method: Through a human experiment (n=21), we compare model-generated code to code written completely by human programmers. We use a combination of static code analysis and human annotators to assess code readability, and we use eye tracking to assess the visual inspection of code.
Results: Our results suggest that model-generated code is comparable in complexity and readability to code written by human pair programmers. At the same time, eye tracking data suggests, to a statistically significant level, that programmers direct less visual attention to model-generated code.
Conclusion: Our findings highlight that reading code is more important than ever, and programmers should beware of complacency and automation bias with model-generated code.
4. Mining Android API Usage to Generate Unit Test Cases for Pinpointing Compatibility Issues (arXiv)
Abstract : Despite being one of the largest and most popular projects, the official Android framework has only provided test cases for less than 30% of its APIs. Such a poor test case coverage rate has led to many compatibility issues that can cause apps to crash at runtime on specific Android devices, resulting in poor user experiences for both apps and the Android ecosystem. To mitigate this impact, various approaches have been proposed to automatically detect such compatibility issues. Unfortunately, these approaches have only focused on detecting signature-induced compatibility issues (i.e., a certain API does not exist in certain Android versions), leaving other equally important types of compatibility issues unresolved. In this work, we propose a novel prototype tool, JUnitTestGen, to fill this gap by mining existing Android API usage to generate unit test cases. After locating Android API usage in given real-world Android apps, JUnitTestGen performs inter-procedural backward data-flow analysis to generate a minimal executable code snippet (i.e., test case). Experimental results on thousands of real-world Android apps show that JUnitTestGen is effective in generating valid unit test cases for Android APIs. We show that these generated test cases are indeed helpful for pinpointing compatibility issues, including ones involving semantic code changes.
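For intuition only, a generated test of the kind described might look like the JUnit sketch below; the chosen API (ConnectivityManager#getActiveNetwork) and the test body are hypothetical examples, not actual JUnitTestGen output.

```java
import android.content.Context;
import android.net.ConnectivityManager;
import androidx.test.core.app.ApplicationProvider;
import org.junit.Test;

// Hypothetical example of a minimal executable snippet such a tool could emit:
// the API under test is invoked the way a real app used it, so running the test
// on a particular device/API level exposes compatibility problems at that call site.
public class ConnectivityManagerGetActiveNetworkTest {

    @Test
    public void invokeApiAsUsedInApp() {
        Context context = ApplicationProvider.getApplicationContext();
        ConnectivityManager cm =
                (ConnectivityManager) context.getSystemService(Context.CONNECTIVITY_SERVICE);
        // getActiveNetwork() was introduced in API level 23; invoking it on older
        // devices fails at runtime, which is the kind of issue such a test surfaces.
        cm.getActiveNetwork();
    }
}
```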