Technology: Machine Learning
On datasets and machine learning
From Microsoft’s Tay chatbot, yanked from Twitter after a tirade of sexist and racist utterances, to Meta’s recent BlenderBot 3, which likewise spewed racist output, we’ve heard a lot about technology learning harmful bias from us. Less prominent in the conversation is our own role: collecting information about the world and formatting it in a way machines can learn from, or that we can use to evaluate a machine’s performance.
Today, I thought I’d do a three-minute dive into an excellent survey on the topic of datasets, “Data and its (dis)contents: A survey of dataset development and use in machine learning research,” to look at some of the concerns around data collection and use. Obviously, I can’t cover everything in the paper in three minutes, but I’ll highlight a few points that struck me.
Datasets can carry what Kate Crawford calls “representational harms,” which can manifest as an under-representation of people who are not white, Western men. Stereotypes and other dubious representations (such as offensive image labels), as well as artifacts that lead to spurious correlations (models can ‘learn’ to detect pneumonia simply by recognizing hospital-specific marks on chest X-rays), make datasets problematic as training material, not to mention as benchmarks for model performance.
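To make the chest X-ray example concrete, here’s a minimal sketch of shortcut learning. Everything in it is invented for illustration: a synthetic “hospital marker” feature co-occurs with the label 90% of the time, the classifier leans on it, and accuracy collapses once the marker disappears at deployment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Invented setup: a "hospital marker" (e.g., a portable-scanner tag) co-occurs
# with the pneumonia label 90% of the time; the genuine signal is weak and noisy.
y = rng.integers(0, 2, n)                          # 1 = pneumonia
marker = np.where(rng.random(n) < 0.9, y, 1 - y)   # spurious shortcut feature
signal = y + rng.normal(0.0, 2.0, n)               # weak real pathology signal
X = np.column_stack([marker, signal])

model = LogisticRegression().fit(X, y)
print("accuracy with marker present:", model.score(X, y))

# Deployment shift: same patients, but the marker is gone (a new hospital).
X_shifted = np.column_stack([np.zeros(n), signal])
print("accuracy with marker removed:", model.score(X_shifted, y))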
Several common practices used in building datasets are problematic as well: scraping information from the internet (leaving human subjects unaware that their Flickr photos are being used in facial-analysis research, for example), and failing to recognize that annotation is interpretive work (annotators bring their own values and biases to the task).
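One common way researchers quantify that interpretive gap is inter-annotator agreement. The survey doesn’t prescribe a metric, but Cohen’s kappa, a standard chance-corrected agreement statistic, is a typical choice; here is a small sketch over invented labels from two hypothetical annotators.

```python
from collections import Counter

# Invented labels from two hypothetical annotators on the same 10 items.
annotator_a = ["toxic", "ok", "ok", "toxic", "ok", "toxic", "ok", "ok", "toxic", "ok"]
annotator_b = ["toxic", "ok", "toxic", "toxic", "ok", "ok", "ok", "toxic", "toxic", "ok"]

def cohens_kappa(x, y):
    """Chance-corrected agreement: 1.0 = perfect, ~0.0 = no better than chance."""
    n = len(x)
    p_observed = sum(xi == yi for xi, yi in zip(x, y)) / n
    counts_x, counts_y = Counter(x), Counter(y)
    # Agreement expected if both labeled independently at their own base rates.
    p_expected = sum(counts_x[k] * counts_y[k]
                     for k in counts_x.keys() | counts_y.keys()) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.2f}")  # 0.40: real disagreement
```

A kappa well below 1.0, as here, signals that the two annotators are genuinely interpreting the task differently, not just making random slips.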
Identifying and correcting for bias can be hard because datasets are large, but the survey authors highlight a number of efforts, both human (Pipkin spent hours watching MIT’s “Moments in Time” video dataset to identify disturbing footage, for example) and algorithmic, such as Wang et al.’s REVISE tool, which looks at images and their annotations to find biases. (Gender biases are of particular interest to me, as is REVISE’s finding that people too small to see clearly in images tend to be labeled as ‘man’.)
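REVISE’s actual interface is more involved, but the flavor of the audit is easy to sketch: group person annotations by how visible the person is, then compare the label distributions. The data, threshold, and function below are all invented for illustration, not REVISE’s API.

```python
from collections import Counter

# Invented annotations: (gender label, person-box area as a fraction of the image).
annotations = [
    ("man", 0.02), ("man", 0.01), ("woman", 0.30), ("man", 0.25),
    ("woman", 0.28), ("man", 0.03), ("man", 0.015), ("woman", 0.22),
]

def labels_by_visibility(annos, threshold=0.05):
    """Split labels into hard-to-see (small box) vs. clearly visible people."""
    small = Counter(label for label, area in annos if area < threshold)
    large = Counter(label for label, area in annos if area >= threshold)
    return small, large

small, large = labels_by_visibility(annotations)
print("hard-to-see figures:", small)      # a skew toward 'man' echoes REVISE's finding
print("clearly visible figures:", large)
```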
Particularly fascinating is the discussion around measuring model performance and the machine learning community’s love of benchmark datasets (which, as we’ve already seen, are problematic). Practitioners are rewarded with top leaderboard positions when models perform well against benchmarks, but is this enough? Some argue that current evaluation criteria are too narrow (and should include “reports of energy consumption, model size, fairness metrics, and more”); others believe that the current focus has the “potential to stunt the development of new ideas.”
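If leaderboards did broaden along those lines, a row might look less like a single number and more like a report card. A toy sketch, with hypothetical field names and made-up values, following the quoted suggestions:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Hypothetical 'leaderboard row' with more than a single benchmark score."""
    model_name: str
    accuracy: float              # the usual headline number
    params_millions: float       # model size
    training_kwh: float          # energy consumption
    worst_group_accuracy: float  # one simple fairness metric

# All values invented for illustration.
row = EvalReport("baseline-model", accuracy=0.87, params_millions=340.0,
                 training_kwh=1250.0, worst_group_accuracy=0.71)
print(row)
```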
Issues around privacy and data reuse are also a concern, and the studies that illustrate harms are each well worth thinking about: from Joanna Radin’s history of the Pima Indians Diabetes Dataset (PIDD), used to train algorithms that had “nothing to do with diabetes or even to do with bodies,” to Peng, who found that even “after certain problematic face datasets were removed, hundreds of researchers continued to cite and make use of copies of this dataset months later.”
The survey also discusses labor and legal perspectives, but I’m running out of time! You can read more in the paper itself.
Read more on machine learning:
Kraft, Amy. “Microsoft shuts down AI chatbot after it turned into a Nazi.” CBSnews.com. (2016).
Paullada, Amandalynne, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. “Data and its (dis)contents: A survey of dataset development and use in machine learning research.” Patterns 2, no. 11 (2021): 100336.
Silva, Christianna. “It took just one weekend for Meta’s new AI chatbot to become racist.” Mashable.com. (2022).