Finding pixel-wise annotation errors in MIT ADE20K
TL;DR: Bad labels are a major problem in AI. I introduce the first technique for finding annotation errors in semantic segmentation datasets. On MIT ADE20K I find over 50 label issues, with confirmed errors on 7% of total pixels. For some rarer label classes, I triple the number of annotated pixels. I’m building a company to improve computer vision datasets; if you’d like to find errors in your dataset, contact me. Click here for similar results on MS COCO object detection.
In the minds of many ML academics, the typical model development path looks something like this:
However, those who have spent time building models in industry know that the real challenge lies in dealing with scenarios like this.
In conversations with over 75 ML teams, this comes up all the time. It doesn’t matter how big your dataset is or how fancy your model: without clean, high-quality labels you’re not going to get good performance.
Unfortunately, while tweaking hyperparameters is easy, finding broken labels is not. Semantic segmentation is particularly hard, as you need to search over every pixel in every image. As a consequence, the current state of the art for finding errors in semantic segmentation datasets is to look through every image manually, which is very expensive. To repeat myself, no techniques exist, in industry or academia, to find errors in semantic segmentation datasets.
Until now, that is! In this post, we extend FIXER to find pixel-wise errors in semantic segmentation datasets. FIXER uses novel explainable AI techniques to flag arbitrary image patches for manual review. On the MIT ADE20K dataset, it identifies over 50 distinct issues, finding¹ confirmed errors in 7% of total labels. In some of the rarer classes, such as “pillow”, the number of annotated pixels is tripled by FIXER.
- I am developing Breakpoint, a no-code UI for exploring and improving computer vision datasets using FIXER (without the need to share data). If you would like to be a design partner, or be placed on the waitlist, please sign up here.
- If you would like to use FIXER on your computer vision dataset, please contact me. I provide a consulting service: send me your dataset and I’ll send you a cleaned version back.
We also handle object detection and image classification models, and are actively adding new model types.
Finding errors in MIT ADE20K
MIT ADE20K is one of the most widely used semantic segmentation datasets, with over 20,000 images. Each pixel in each image is labelled into one of 150 classes, ranging from “floor” to “radiator”.
In total, FIXER finds confirmed errors in 7% of ADE20K labels, with 48% of the discovered errors falling across 52 specific issues and the other 52% being general errors. FIXER’s output both selects a patch of pixels with a particular label and suggests a corrected class for that patch. Some of the different error types FIXER captures are listed below (more examples are provided in the appendix).
1. General errors (not in a specific issue). These are clear mistakes made by the labeler that don’t follow any particular pattern.
2. Ambiguous labels. These are situations where labels are inconsistently applied and the ground truth is unclear (these often require a judgement call).
3. False negatives. These occur where a rare class (e.g. cushion) is missed in favor of a more common class (e.g. couch).
In this post, I introduce FIXER — the first technique for finding pixel-level errors in semantic segmentation datasets. On MIT ADE20K, it finds over 50 different label issues, covering general errors, ambiguous labels, and false negatives. While I focused on the outputs of FIXER in this post, I intend to present the underlying methodology in future work.
In object detection, FIXER previously found nearly 300,000 errors in MS COCO, and an upcoming post will showcase some surprising results on image classification.
If you’d like to hear about future posts, consider following me on Medium, Twitter or LinkedIn. If this problem intrigues you, I’d love to chat: firstname.lastname@example.org. We’re also actively looking for design partners and waitlist sign-ups for Breakpoint (our no-code UI for improving models), consulting clients (you share your data, we send back a cleaned version), and founding engineers.
¹ We estimated our error numbers by randomly choosing 25 flagged image patches from each label, for a total of 25*150=3750 patches, and checking each one by hand. We then used the estimated accuracy to compute the expected number of errors.
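One way to read the estimate described in this footnote is as a per-class precision estimate that is then scaled up to everything FIXER flagged. The sketch below illustrates that reading only; it is not FIXER’s actual code. The `flagged` mapping, the `is_error` callback (standing in for the manual check), and the choice to weight patches by pixel count are all assumptions made for illustration.

```python
import random


def estimate_error_rate(flagged, total_pixels, is_error, per_class=25, seed=0):
    """Estimate the fraction of all pixels that are labelled incorrectly.

    flagged:      dict mapping class name -> list of (patch_id, n_pixels)
                  for every patch flagged under that label (hypothetical format).
    total_pixels: number of labelled pixels in the whole dataset.
    is_error:     callable(patch_id) -> bool, standing in for a human reviewer.
    per_class:    how many flagged patches to review per label (25 in the post).
    """
    rng = random.Random(seed)
    expected_error_pixels = 0.0
    for label, patches in flagged.items():
        if not patches:
            continue
        # Review a random sample of this label's flagged patches.
        sample = rng.sample(patches, min(per_class, len(patches)))
        confirmed = sum(1 for patch_id, _ in sample if is_error(patch_id))
        precision = confirmed / len(sample)
        # Scale the sampled precision up to all flagged pixels of this label.
        flagged_pixels = sum(n_pixels for _, n_pixels in patches)
        expected_error_pixels += precision * flagged_pixels
    return expected_error_pixels / total_pixels
```

Sampling a fixed number of patches per label, rather than sampling uniformly over all flagged patches, keeps rare classes from being drowned out by common ones during review.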
While “7% of labels are incorrect” is simple and easy to understand, it actually misses a lot. For instance, in this dataset the “blind, screen” class accounts for 0.1% of total labels, and the FIXER-corrected labels more than triple that. That is a very significant change, but it is effectively a rounding error when looking at the percentage of labels corrected. Collectively, there are 8 classes that account for only 0.7% of total labels, whose size increases by almost 250% after FIXER. This is the type of highly material change that should be measured.
To fix this, I argue that we should evaluate labels the same way we evaluate model predictions. That is, we can treat the original labels as “predictions” of the corrected labels, and use the same metrics used to evaluate our models, in this case mean Intersection over Union (mIoU). These metrics are already designed to handle things like the rare class issue above.
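To make the idea concrete, here is a minimal sketch of scoring the original labels as if they were predictions of the corrected labels. It uses the standard per-class IoU definition (intersection over union of the two pixel sets, averaged over classes) and assumes both label maps have been flattened into equal-length sequences of class ids. The function name and input format are illustrative, not part of FIXER.

```python
from collections import Counter


def mean_iou(original, corrected):
    """mIoU of `original` labels scored as predictions of `corrected` labels.

    Both arguments are flat, equal-length sequences of per-pixel class ids.
    Classes that appear in neither sequence are ignored.
    """
    if len(original) != len(corrected):
        raise ValueError("label sequences must have the same length")
    if not original:
        raise ValueError("label sequences must be non-empty")

    original_count = Counter(original)
    corrected_count = Counter(corrected)
    # Intersection: pixels where both label maps agree on the class.
    intersection = Counter(o for o, c in zip(original, corrected) if o == c)

    ious = []
    for cls in set(original_count) | set(corrected_count):
        union = original_count[cls] + corrected_count[cls] - intersection[cls]
        ious.append(intersection[cls] / union)
    return sum(ious) / len(ious)


# Toy example: 1000 pixels, class 0 is common and class 1 is rare.
# After correction class 1 triples from 10 to 30 pixels.
corrected = [0] * 970 + [1] * 30
original = [0] * 970 + [0] * 20 + [1] * 10

changed = sum(o != c for o, c in zip(original, corrected)) / len(original)
score = mean_iou(original, corrected)
```

In this toy case only 2% of pixels change, yet the rare class has an IoU of 10/30, which pulls the mean down to roughly 0.66. A class-averaged metric surfaces exactly the kind of change that a single “percentage of labels corrected” figure hides.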
Under this metric, we estimate that the original labels score 82.7 mIoU when measured against the FIXER-corrected labels. As a comparison point, the current SOTA is 62.8 mIoU.
While there are currently no methods to benchmark against in this space, this point is worth noting as we continue to improve on FIXER. It is also noteworthy that the error rates of SOTA models are not that much higher than the error rates of the original labels.