User study observations from how humans debug errors in large data flows
“Where are these errors in my output coming from?!” Finding errors in the outputs of large data flows is a common issue. Large data flows ingest several datasets, which are processed through many code modules, potentially written by multiple developers.
However, tracing the error back to its origin in the data flow is difficult:
- It is usually performed through trial and error.
- Each intermediate output must be analyzed because the error propagates throughout the data flow.
- A code module that is not the source of the error but propagates the error may be misidentified as the source of the error.
Given these challenges, we built a visual and interactive tool called WhyFlow that helps users identify the modules in the data flow that are propagating errors. After building WhyFlow, we ran a user study and we asked users to identify errors in small toy data flows.
Interestingly, even with a visual tool, a small toy data flow, and simple errors, the users, all of whom were advanced computer science students, found these debugging tasks time-consuming to complete.
In this article, we go in depth on the specific tasks we gave the users and show why we believe debugging data flows remains challenging even today.
A data flow contains code modules that process, transform, and join the data. The input data sources flow through the data flow from left to right. This data flow joins two database tables, Customer and Store, and then, towards the end, counts the number of people per location.
The Customer module reads in the customer table with their personal information such as location and the Store module reads in the store table that contains specific store information. The data flow cleans location details in the customer table with the ValidateLocation module and then the InnerJoinOnLocation module joins both tables by the current location field. Finally, the workflow counts how many matches exist within a certain location with the GroupbyLocationCount module.
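To make this concrete, here is a minimal sketch of such a data flow in plain Python. The table contents and module implementations are our own illustrative assumptions, not the actual study data flows:

```python
from collections import Counter

# Hypothetical input tables (illustrative only).
customers = [
    {"name": "Ana",  "location": " NYC "},
    {"name": "Ben",  "location": "boston"},
    {"name": "Cara", "location": "nyc"},
]
stores = [
    {"store_id": 1, "location": "nyc"},
    {"store_id": 2, "location": "boston"},
]

def validate_location(rows):
    """Clean the location field: trim whitespace and lowercase."""
    return [{**r, "location": r["location"].strip().lower()} for r in rows]

def inner_join_on_location(left, right):
    """Keep only customer/store pairs whose locations match."""
    return [{**l, **r} for l in left for r in right
            if l["location"] == r["location"]]

def groupby_location_count(rows):
    """Count joined records per location."""
    return Counter(r["location"] for r in rows)

joined = inner_join_on_location(validate_location(customers), stores)
counts = groupby_location_count(joined)
# counts: {"nyc": 2, "boston": 1}
```

Each function stands in for one code module; in a real deployment each module could be authored by a different developer.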
Users who consume the final outputs of a data flow for other purposes may catch erroneous outputs. Errors appear in a data flow due to
- a faulty code module or
- an incorrect input, which is due to an upstream faulty code module.
Suppose a store manager notices an erroneous record with an incorrect store location. The manager informs the developers so that the module introducing the error can be fixed. In this example, the data flow has an error in the join module, where a left join occurred instead of an inner join.
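This bug can be sketched as follows, again with illustrative tables of our own: a left join keeps customers whose location matches no store, producing records with missing store fields that then inflate downstream counts.

```python
customers = [
    {"name": "Ana", "location": "nyc"},
    {"name": "Dan", "location": "austin"},  # no store in austin
]
stores = [{"store_id": 1, "location": "nyc"}]

def inner_join_on_location(left, right):
    """Intended behavior: keep only matching pairs."""
    return [{**l, **r} for l in left for r in right
            if l["location"] == r["location"]]

def buggy_left_join_on_location(left, right):
    """Faulty module: behaves like a left join instead of an inner join."""
    out = []
    for l in left:
        matches = [r for r in right if l["location"] == r["location"]]
        if matches:
            out.extend({**l, **m} for m in matches)
        else:
            out.append({**l, "store_id": None})  # unmatched row survives
    return out

correct = inner_join_on_location(customers, stores)
buggy = buggy_left_join_on_location(customers, stores)
# The buggy output contains an extra record with store_id = None,
# which is what the store manager would notice downstream.
```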
This data flow is tiny, with only 5 code modules and fewer than 50 data points in total, so we would expect users to easily debug errors that manifest in it.
However, when we gave humans these debugging tasks over such data flows, we found them to be time-consuming, even though we provided automatically generated explanations of the error datapoints and easy-to-understand visualizations of the datapoints to help users in the debugging process.
In the following, we describe the user study we conducted, in which we asked users to play the role of a developer who needs to identify the source of an error propagated through a data flow. Our observations show that debugging errors in even small data flows is time-consuming and challenging.
We recruited 10 participants and asked each to play the role of a debugger who has to find the errors in the data flows.
Our users were presented with two data flows. These two data flows were reproduced from the TPC Decision Support (TPC-DS) benchmark queries and data set. The TPC-DS benchmark is widely used in systems research and industry to evaluate general-purpose decision support systems and big data systems.
At the beginning of the experiment, we presented participants with data flows W1 and W2 without any errors in them and asked them to describe each code module’s expected operation and identify the columns it operated on. This exercise helped the participants familiarize themselves with the data flows.
All Possible Code Modules
All of the users had taken an introductory database applications course. While the participants were familiar with the standard relational operators, they were not familiar with how those operators were implemented in the data flows. We therefore described the space of all possible code modules beforehand, during the tutorial.
We found this necessary because, in an initial pilot study where we did not describe the space of possible code modules, some users took more than 4 hours to complete a single debugging task. User study sessions are generally advised not to exceed 2 hours per participant.
Error classes and possible errors
We also provided the space of possible errors that could have been introduced, which further helped us limit each session to 2 hours per user.
Examples of errors include duplication errors, an incorrect filter predicate, incorrect handling of null values, an incorrect string operation or string constant operand, or an inner join swapped with a left join. Specifically, the following are the possible errors introduced into the data flows W1 and W2:
1. Filter Errors: These errors modify a filter operation such that it selects both correct and incorrect values into the output.
2. Join Type Errors: These errors modify a join type replacing inner joins with full outer joins.
3. Join Column Errors: These errors modify the parameters of the join operation causing the join to operate on incorrect columns.
Errors we do not cover are those in group-by and aggregation code modules, and errors that involve dropping data: we are limited to a particular type of historical data lineage, called the why-provenance of the data flow, which does not maintain information on datapoints that were dropped.
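As an illustration of how such errors can be injected, each error class is a small mutation of a module's parameters. This is our own sketch, not the study's actual fault-injection code:

```python
# Hypothetical fault injectors for the three error classes (illustrative).

def filter_rows(rows, column, value):
    """Correct filter: keep rows where column equals value."""
    return [r for r in rows if r[column] == value]

def faulty_filter_rows(rows, column, value):
    """Filter error: the predicate is loosened, so the output
    contains both correct and incorrect values."""
    return [r for r in rows if r[column] == value or r[column] is None]

def join_on(left, right, left_col, right_col, how="inner"):
    """Join with a configurable type and columns. A join type error
    replaces how="inner" with how="full_outer"; a join column error
    passes the wrong left_col/right_col."""
    out = [{**l, **r} for l in left for r in right
           if l[left_col] == r[right_col]]
    if how == "full_outer":
        out += [dict(l) for l in left
                if not any(l[left_col] == r[right_col] for r in right)]
        out += [dict(r) for r in right
                if not any(l[left_col] == r[right_col] for l in left)]
    return out

rows = [{"state": "CT"}, {"state": "NY"}, {"state": None}]
clean = filter_rows(rows, "state", "CT")          # 1 row
faulty = faulty_filter_rows(rows, "state", "CT")  # 2 rows: the None row leaks in
```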
When an error occurs in a code module, that error propagates to the other code modules downstream. The error then manifests differently at each downstream code module. We refer to this as label propagation.
To simplify the experiment and have users focus on debugging rather than on labeling, we pre-labeled the datapoints and explained to the users that “All the datapoints in the data flows have been labeled by a domain expert who has identified the incorrect datapoints.”
Comparing traditional manual debugging against debugging with the help of error explanations
To understand how humans debug errors in data flows, we set up the experiment as a comparative user study. We compared users debugging with the help of error explanations (a tool we call WhyFlow) against users debugging manually (the Baseline tool). In the Baseline tool, the user is limited to clicking modules in a simple data flow visualization to view each code module’s output datapoints in a tabular format.
WhyFlow is different from the Baseline in that it enables users to visually examine the data flows. Its core feature is the error explanations generated at different code modules from the incorrect datapoints. Explanations of the errors are predicates that capture datapoints that contribute to the errors at the final outputs. WhyFlow generates explanations for every code module. Explanations help the user quickly identify the shared characteristics of the erroneous datapoints. For instance, if a code module has an error where a left join occurred instead of an inner join, then the error datapoints would contain unequal values in the columns where the inner join was intended to match. To learn more about how explanations are generated, see the detailed paper linked at the end of this article.
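The idea behind predicate explanations can be sketched as follows. This is our simplified illustration, not WhyFlow's actual algorithm: enumerate simple column predicates and rank them by how well they separate error datapoints from correct ones.

```python
def candidate_explanations(rows):
    """Enumerate simple equality predicates (column == value) and score
    each by precision and recall over the expert-provided error labels."""
    columns = [c for c in rows[0] if c != "is_error"]
    total_errors = sum(r["is_error"] for r in rows)
    explanations = []
    for col in columns:
        for val in {r[col] for r in rows}:
            covered = [r for r in rows if r[col] == val]
            errors_covered = sum(r["is_error"] for r in covered)
            precision = errors_covered / len(covered)
            recall = errors_covered / total_errors
            explanations.append((f"{col} == {val!r}", precision, recall))
    # The best explanations capture all errors and no correct datapoints.
    return sorted(explanations, key=lambda e: (e[1], e[2]), reverse=True)

# Illustrative labeled datapoints at one code module.
rows = [
    {"state": "CT", "is_error": True},
    {"state": "CT", "is_error": True},
    {"state": "NY", "is_error": False},
]
top = candidate_explanations(rows)[0]
# Top-ranked explanation: "state == 'CT'" with precision 1.0 and recall 1.0
```

A real explanation language would also support inequality and cross-column predicates (e.g. two join columns being unequal), which is how a left-join error would surface.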
We hypothesized that with WhyFlow users would identify errors and select appropriate explanations in less time and with more accuracy than with Baseline.
At the beginning of the experiment, we trained each participant on the tools with a quick tutorial and some mock debugging tasks. We then presented the participants with correct executions of data flows W1 and W2 in WhyFlow and asked them to describe each code module’s expected operation and identify the columns it operated on.
We then introduced errors into these data flows W1 and W2 and asked users to debug the errors and identify the source of the error. If they were using WhyFlow, we asked them, “Select an explanation over the columns that describe the data in which the error manifests.” And if they were using Baseline, we asked them, “Provide an explanation for the error observed. Describe the data in which the error manifests, e.g. include error columns and the characteristics the incorrect data shares.” In other words, to complete a debugging task, users had to either select an explanation for the error in WhyFlow or provide one for Baseline.
For each condition, WhyFlow and Baseline, each participant was given three debugging tasks for different error classes. We randomized the order of conditions they started with and the order of tasks within each condition.
We also asked users to think aloud and verbalize their thoughts when they believed they had identified the faulty module, determined the error class, or could express some intuition on the nature of the error. We recorded when each of these events occurred and what the users verbalized during the course of the experiment.
During the experiment, we observed how users debugged errors in data flows when using WhyFlow. Their workflow breaks down into the following steps:
- High-level data flow analysis: The user first selects a suspicious module. WhyFlow displays the error and correct datapoints of the code module.
- Code module analysis: The user then analyzes the datapoints in the data panel.
- Hypothesizing and identifying the error class: The user identifies the error class by analyzing the datapoints, e.g. “It seems like an incorrect filter predicate caused the values to equal CT”.
- Error explanation: The user confirms their hypothesized error class by looking at the generated explanations on the error datapoints. In this case, the user attempts to find an explanation predicate that essentially says, e.g. “All the error datapoints have value CT”.
At each step, we recorded the timestamp when the user completed it.
Most users took too much time to complete the debugging task (Steps 3 and 4)
We conducted a two-way repeated-measures ANOVA on the overall time for users to complete the debugging task. We found no significant interaction effects and no significant main effect of tool-used (WhyFlow vs Baseline). In other words, the auto-generated explanations of the errors in WhyFlow did not make a significant difference in helping users quickly complete the debugging task.
But users in WhyFlow were able to quickly identify the error class (Step 3)
We conducted a two-way repeated-measures ANOVA on the duration of time for users to verbalize the error class, with error class and tool-used (WhyFlow vs Baseline) as independent factors. There was neither a significant interaction effect nor a significant main effect of error class. We did find a significant main effect of tool-used (F1,9 = 10.15, p = 0.01). In other words, users spent less time identifying the error class with WhyFlow. Aside from join type errors, their initial intuition about the error was also more accurate than with Baseline.
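For a within-subjects factor with only two levels, such as tool-used, the repeated-measures ANOVA F statistic for the main effect equals the square of the paired t statistic, so the computation can be sketched in plain Python. The timing numbers below are made up for illustration, not the study's data:

```python
from math import sqrt
from statistics import mean, stdev

def rm_anova_two_level(cond_a, cond_b):
    """Repeated-measures ANOVA main effect for one two-level
    within-subjects factor: F(1, n-1) is the squared paired t statistic."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, (1, n - 1)

# Hypothetical per-participant times (minutes) to verbalize the error class.
baseline = [12, 15, 10, 18, 14, 16, 11, 13, 17, 15]
whyflow = [8, 10, 7, 12, 9, 11, 6, 9, 12, 10]
f_stat, df = rm_anova_two_level(baseline, whyflow)
# df == (1, 9); a large F indicates a significant main effect of tool-used.
```

The full two-way analysis (with error class as a second factor) would normally be run with a statistics package rather than by hand.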
The remaining bulk of the time to complete the debugging task was taken up by selecting an explanation for the error (Step 4)
In other words, most users spent a long time selecting an explanation in WhyFlow. Specifically, we observed that users spent as much time selecting an explanation in WhyFlow as identifying the error class. Three users commented that, “there [were] a lot of explanations.” One user said, “Seeing 1,000 explanations made it seem like 1,000 different types of errors.” Some modules had as many as 60 explanations generated. Four users found it difficult to discern the quality of the suggested explanations since many of them seemed similar to each other.
Ultimately, any gains that WhyFlow provided in helping users quickly determine the error class were lost due to the time spent searching for and selecting an appropriate explanation. We approximated the time to select an explanation in WhyFlow or provide one in Baseline as the difference between the total time to complete a debugging task (Steps 3 and 4) and the time to verbalize the error class (Step 3).
The figure above shows this approximated explanation time for each user. We again conducted a two-way repeated-measures ANOVA on this time and found a significant main effect of tool-used (F1,9 = 6.790, p = 0.028).
Most users selected the correct explanation (Step 4) with the help of explanations generated by WhyFlow
We found that users selected correct explanations more frequently with WhyFlow than with Baseline. Five users initially misidentified the join type error class, i.e., they thought the error was of some class other than a join error. But those same five users selected the correct explanation from WhyFlow’s generated explanations, e.g. “The wrong datapoints have storeIDs that are not equal”, even though the module performs a join on storeID, so the values should be equal.
Overall our quantitative results indicate that WhyFlow can be an effective data flow debugging tool in that users select more accurate explanations with WhyFlow.
WhyFlow targets end-users who are debugging data flows and trying to identify the cause of the error datapoints in the final outputs. WhyFlow explains errors with predicates and visualizes the data flow, the provenance of outputs and differences between erroneous and correct datapoints. Our results indicate that WhyFlow enables most users to accurately localize and communicate faults.
But why is debugging errors in data flows still challenging? On one hand, we discovered that providing automatically generated explanations of the errors helps users identify the error class, though not necessarily select an appropriate explanation of the error. Debugging data flows is a time-consuming task even with explanations of errors on small data flows. The data flows in the user study were small, containing no more than 40 code modules, and the errors introduced were fairly simple. But users still took a long time to debug the error.
Additionally, our user study was limited. Real-world deployments tend to have aspects that our study did not cover:
- Debugging Multiple Errors. In our evaluations, WhyFlow was tested against data flows with only one error introduced. Even then, our user study shows that debugging is a time-consuming process: WhyFlow users on average took about an hour in total to debug fairly simple errors in three data flows. While debugging even one error in a data flow is challenging, realistically a data flow may have multiple errors.
- Supported Explainable Errors. WhyFlow’s current language does not capture aggregation errors. However, errors in modules that perform aggregation may occur. Additionally, there are errors that may drop desired datapoints. Datapoints that are missing from the outputs cannot be debugged with the help of WhyFlow.
There is definitely a lot of opportunity to improve how humans debug errors in data flows.
Detailed paper: https://openreview.net/forum?id=uO07BC54cW