To our surprise, label errors are pervasive across 10 popular benchmark test sets used in most machine learning research, destabilizing benchmarks.

It’s well known that ML datasets are not perfectly labeled, but there has been little research that systematically quantifies, at scale, how error-ridden the most commonly used ML datasets are. Prior work has focused on errors in the train sets of ML datasets. No study has looked at systematic error across the most-cited ML test sets, the sets we rely on to benchmark the progress of the field of machine learning.

We algorithmically identified and human-validated pervasive label errors in ten of the most-cited test sets, then studied how they affect the stability of ML benchmarks. Here we summarize our findings along with key takeaways for ML practitioners.

This post overviews the paper Pervasive Label Errors in Test Sets Destabilize ML Benchmarks authored by Curtis G. Northcutt (ChipBrain & MIT), Anish Athalye (MIT), and Jonas W. Mueller (Amazon).

Errors in Highly-Cited Benchmark Test Sets

Browse all label errors across all 10 ML datasets at (demos below):

Key Takeaways of Pervasive Label Errors

How pervasive are errors in ML test sets?

  • We estimate an average of 3.4% errors across the 10 datasets, where for example 2,916 label errors comprise 6% of the ImageNet validation set and ~390,000 label errors comprise ~4% of the Amazon Reviews dataset. Even the MNIST dataset, assumed to be error-free and benchmarked in tens of thousands of peer-reviewed ML publications, contains 15 (human-validated) label errors in the test set.

Of the 10 ML datasets you looked at, which had the most errors?

  • The QuickDraw test set contains over 5 million errors comprising about 10% of the test set. View the QuickDraw label errors.

How did you find label errors in vision, text, and audio datasets?

  • In all 10 datasets, label errors are identified algorithmically using confident learning and then human-validated via crowd-sourcing (54% of the algorithmically flagged candidates are indeed erroneously labeled). The confident learning framework is not coupled to a specific data modality or model, allowing us to find label errors in many kinds of datasets.
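
To make the algorithmic flagging step more concrete, here is a minimal pure-Python sketch of the core idea behind confident learning, under simplifying assumptions. For each class, the threshold is the average predicted probability of that class among examples carrying that label. An example is flagged when the most probable class that clears its own threshold differs from the given label. The function names (`class_thresholds`, `flag_label_issues`) and the toy data are hypothetical, and this is not the cleanlab implementation, which includes further steps.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for given, probs in zip(labels, pred_probs):
        sums[given] += probs[given]
        counts[given] += 1
    # Assumption: every class appears at least once among the given labels.
    return [s / c for s, c in zip(sums, counts)]


def flag_label_issues(labels, pred_probs):
    """Return (index, given_label, suggested_label) for candidate label errors."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    flagged = []
    for idx, (given, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so no evidence either way
        best = max(confident, key=lambda k: probs[k])
        if best != given:
            flagged.append((idx, given, best))
    return flagged


# Toy data (made up): out-of-sample predicted probabilities for 6 examples.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model confidently predicts 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

candidates = flag_label_issues(labels, pred_probs)
print(candidates)
```

Nothing here depends on images, text, or audio: the only inputs are the given labels and a model's predicted probabilities, which is why the approach carries across modalities. The flagged candidates would then go to human review rather than being treated as confirmed errors.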

What are the implications of pervasive test set label errors?

  • Higher capacity/complex models (e.g. ResNet-50) perform better on the original incorrectly labeled test data (i.e. what one traditionally measures), but lower capacity models (e.g. ResNet-18) yield higher accuracy on corrected labels (i.e. what one cares about in practice, but cannot measure without the manually corrected test data we provide). This likely occurs because higher capacity models overfit to the train set label noise during training and/or overfit to the validation/test set by tuning hyper-parameters on the test set (even though the test set is supposedly unseen).

How much noise can destabilize ImageNet and CIFAR benchmarks?

  • On ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of mislabeled test examples increases by 5%.

Is a cleaned version of each test set available?

  • Yes, [cleaned ML test sets here]. In these cleaned sets, humans corrected a large fraction of the label errors. We hope future research on these benchmarks will use this improved test data instead of the original erroneous labels.

Can I interact with the label errors in each dataset?

Are the label errors 100% accurate?

  • Results are not perfect. In some cases, Mechanical Turk workers agree on the wrong label. We still likely only capture a lower bound on the error given that we only validated a small fraction of the datasets for errors. Although our corrected labels are not 100% accurate, on inspection, they appear vastly superior to the original labels.

What should ML practitioners do differently?

Traditionally, ML practitioners choose which model to deploy based on test accuracy — our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. We provide two recommendations for ML practitioners:

  1. Correct your test set labels (e.g. using our approach)
    • to measure the real-world accuracy you care about in practice.
    • to find out whether your dataset suffers from destabilized benchmarks.
  2. Consider using simpler/smaller models for datasets with noisy labels
    • especially for applications trained/evaluated with labeled data that may be noisier than gold-standard ML benchmark datasets.
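
As a minimal sketch of the first recommendation, the snippet below scores two candidate models against both the original and the corrected test labels and checks whether the ranking changes. The model names, predictions, and labels are made-up toy values, not results from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def rank_models(model_predictions, labels):
    """Model names sorted from highest to lowest accuracy on `labels`."""
    return sorted(
        model_predictions,
        key=lambda name: accuracy(model_predictions[name], labels),
        reverse=True,
    )


# Toy test set: items 2 and 6 were mislabeled and later corrected.
original_labels = [0, 1, 1, 0, 1, 0, 0, 1]
corrected_labels = [0, 1, 0, 0, 1, 0, 1, 1]

# Hypothetical models: the larger one reproduces the noisy labels,
# the smaller one agrees with the corrected labels.
model_predictions = {
    "large_model": [0, 1, 1, 0, 1, 0, 0, 0],
    "small_model": [0, 1, 0, 0, 1, 0, 1, 0],
}

original_ranking = rank_models(model_predictions, original_labels)
corrected_ranking = rank_models(model_predictions, corrected_labels)
benchmark_is_unstable = original_ranking != corrected_ranking

print(original_ranking, corrected_ranking, benchmark_is_unstable)
```

If the two rankings disagree, model selection based on the original labels alone may pick the model that performs worse in practice.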

Finding Label Errors

Label error counts and percentages across 10 popular ML datasets
Test set errors are prominent across common benchmark datasets. Errors are estimated using confident learning (CL) and validated by human workers on Mechanical Turk.

Human-validation of label errors

Mechanical Turk Interface used to verify label errors
Mechanical Turk worker interface showing an example from CIFAR-100 (with given label “cat”). For each data point algorithmically identified as a potential label error, the interface presents the data point, along with examples belonging to the given class. The interface also shows data points belonging to the confidently predicted class. Either the given class is shown as option (a) and predicted is shown as option (b), or vice versa (chosen randomly). The worker is asked whether the image belongs to class (a), (b), both, or neither.
Mechanical Turk Validation of Label errors in their various forms
Mechanical Turk validation confirming the existence of pervasive label errors and categorizing the types of label issues.
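
The review flow described above can be sketched roughly as follows. Because the given and predicted classes are randomly assigned to options (a) and (b), each raw answer is first mapped back to "given" or "predicted", and the answers for one data point are then combined by agreement. The category names, the number of reviewers per item, and the default agreement threshold are assumptions made for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def decode_vote(raw_vote, given_is_option_a):
    """Map a raw answer ('a', 'b', 'both', 'neither') back to
    'given', 'predicted', 'both', or 'neither'."""
    if raw_vote in ("both", "neither"):
        return raw_vote
    if raw_vote not in ("a", "b"):
        raise ValueError(f"unexpected vote: {raw_vote!r}")
    picked_a = raw_vote == "a"
    return "given" if picked_a == given_is_option_a else "predicted"


def categorize(votes, agreement_threshold=3):
    """Combine decoded votes for one data point into a single category."""
    counts = Counter(votes)
    winners = [choice for choice, n in counts.items() if n >= agreement_threshold]
    if len(winners) != 1:
        return "non-agreement"  # no single answer reached the threshold
    return {
        "given": "non-error",        # reviewers side with the original label
        "predicted": "correctable",  # reviewers side with the predicted label
        "both": "multi-label",       # both classes apply
        "neither": "neither",        # neither class applies
    }[winners[0]]


# Toy example: five hypothetical reviewers; the given class was shown as option (b).
raw_votes = ["a", "a", "b", "a", "both"]
decoded = [decode_vote(v, given_is_option_a=False) for v in raw_votes]
print(decoded, categorize(decoded))
```

Randomizing which class appears as (a) or (b) keeps reviewers from systematically favoring one position, and decoding afterward restores the meaning of each answer.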

Effects of Test Label Errors on Benchmarks

Benchmark rankings of CIFAR and ImageNet on corrected test labels
Benchmark ranking comparison of 34 models pre-trained on ImageNet and 13 pre-trained on CIFAR-10 (more details in the paper). Benchmarks are unchanged by removing label errors (a), but change drastically on the subset of the test set identified as having labels that should be corrected, in particular when comparing accuracy on the original (erroneous) labels versus corrected labels, e.g. Nasnet: 1/34 → 29/34, ResNet-18: 34/34 → 1/34.

Instability of ML Benchmarks

Noise prevalences where ImageNet benchmark rankings change
ImageNet top-1 original accuracy (top panel) and corrected accuracy (bottom panel) vs Noise Prevalence (N). Vertical lines indicate noise levels at which the ranking of two models changes (in terms of original/corrected accuracy). The left-most noise prevalence (N = 6%) on the x-axis is the (rounded) original estimated noise prevalence from [this table]. The leftmost vertical dotted line in the bottom panel should be read as, "The ResNet-50 and ResNet-18 benchmarks cross at noise prevalence N = 11.4%, implying ResNet-18 outperforms ResNet-50 when N increases by around 6 percentage points relative to the original test data (N = 5.83%)."
Noise prevalences where CIFAR benchmark rankings change
CIFAR-10 top-1 original accuracy (top panel) and corrected accuracy (bottom panel) vs Noise Prevalence (with agreement threshold = 3). Vertical lines indicate noise levels at which the ranking of two models changes (in terms of original/corrected accuracy).
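
One simple way to reason about crossing points like those in the plots above is to assume that a model's accuracy at noise prevalence N is a weighted average of its accuracy on the unchanged examples and its accuracy on the corrected (originally mislabeled) examples. Under that assumption, two models tie at a single N that can be solved for directly. The sketch below uses hypothetical accuracy values, not numbers from the paper, and the linear mixture is a simplification rather than the paper's exact procedure.

```python
def mixed_accuracy(acc_unchanged, acc_corrected, noise_prevalence):
    """Expected accuracy when a fraction `noise_prevalence` of the test set
    comes from the corrected (originally mislabeled) subset."""
    return (1 - noise_prevalence) * acc_unchanged + noise_prevalence * acc_corrected


def crossing_prevalence(model_a, model_b):
    """Noise prevalence at which two models tie, or None if they do not
    tie anywhere in [0, 1].

    Each model is a hypothetical (acc_unchanged, acc_corrected) pair.
    """
    gap_unchanged = model_a[0] - model_b[0]
    gap_corrected = model_b[1] - model_a[1]
    denominator = gap_unchanged + gap_corrected
    if denominator == 0:
        return None  # the two accuracy lines are parallel
    n_star = gap_unchanged / denominator
    return n_star if 0 <= n_star <= 1 else None


# Hypothetical numbers, NOT results from the paper.
large_model = (0.90, 0.40)  # better on unchanged examples, worse on corrected ones
small_model = (0.85, 0.65)

n_star = crossing_prevalence(large_model, small_model)
print(f"rankings flip at N = {n_star:.1%}")
```

Below the crossing point the larger model ranks first, and above it the smaller model does, which mirrors how the vertical lines in the plots should be read.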

Learn more

A detailed discussion of this work is available in [our arXiv paper].

These results build upon a wealth of work done at MIT in creating confident learning, a sub-field of machine learning that looks at datasets to find and quantify label noise. For this project, confident learning is used to algorithmically identify all of the label errors prior to human verification.

We made it easy for other researchers to replicate our results and find label errors in their own datasets using cleanlab, an open-source Python package for machine learning with noisy labels.

  1. Introduction to Confident Learning: [view this post]
  2. Introduction to cleanlab Python package for ML with noisy labels: [view this post]


This work was supported in part by funding from the MIT-IBM Watson AI Lab, MIT Quanta Lab, and the MIT Quest for Intelligence. We thank Jessy Lin for her help with early versions of this work (accepted as a workshop paper at NeurIPS 2020 Workshop on Dataset Curation and Security).