Technology: Machine Learning
When is a model ready to deploy? Who’s deciding? And why?
“Overwhelmingly, the subjects selected a model with the highest accuracy — even though it exhibited the largest gender disparity.”
The subjects were engineers, and the model was a machine learning model trained on real-world medical images. The example comes from a fascinating study by Jessica Zosa Forde and colleagues on how machine learning models are chosen for deployment.
Noting that model developers often train several models with different optimizers or hyperparameters, and then choose one to deploy, the study’s authors point out that human judgement plays a role in deciding which of the models is best.
If you’re reading this and thinking, “simple! the best model is the one with the highest accuracy,” you stand with the majority of test subjects. However, as the authors point out, the model with the highest overall accuracy is not always equally accurate for different sub-populations. A model that detects skin cancer can be better at detecting it in men than in women, for example, and even a small difference can have a huge impact when a technology is deployed at scale, or when the incidence of cancer is higher in one sub-population than in another.
“We therefore advocate examining sub-population performance variability as an essential component of performing fair model selection,” the authors write, adding that exposing the model-selection decisions and rationales also helps build more accountable systems.
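To make that idea concrete, here is a minimal sketch of what examining sub-population performance during model selection could look like. This is not the study’s code: the validation data, the group labels, and the two candidate models below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def subgroup_accuracies(y_true, y_pred, groups):
    """Accuracy computed separately for each sub-population label."""
    return {g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
            for g in np.unique(groups)}

def summarize(name, y_true, y_pred, groups):
    """Overall accuracy plus the largest accuracy gap between sub-populations."""
    per_group = subgroup_accuracies(y_true, y_pred, groups)
    return {"model": name,
            "overall_accuracy": float(np.mean(y_pred == y_true)),
            "per_group_accuracy": per_group,
            "max_gap": max(per_group.values()) - min(per_group.values())}

def simulate_predictions(y_true, groups, acc_by_group):
    """Hypothetical predictions that hit a chosen accuracy for each group."""
    correct = np.array([rng.random() < acc_by_group[g] for g in groups])
    return np.where(correct, y_true, 1 - y_true)

# Hypothetical validation set: a binary label plus a recorded sub-population.
y_true = rng.integers(0, 2, size=5000)
groups = rng.choice(["men", "women"], size=5000)

# Two candidate models from the same project, differing only in hyperparameters:
# model_a has the higher overall accuracy but a large gap between groups.
model_a = simulate_predictions(y_true, groups, {"men": 0.95, "women": 0.82})
model_b = simulate_predictions(y_true, groups, {"men": 0.90, "women": 0.89})

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(summarize(name, y_true, preds, groups))
```

A selection rule could then weigh the overall number against the gap rather than simply taking the top accuracy, and recording both figures makes the decision, and its rationale, auditable.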
Even outside a medical context, where a false diagnosis can be a matter of life or death, gender bias can have a substantial impact. And one of the more remarkable findings concerning current NLP models is that few people are actually testing for gender bias before deploying the models.
Karolina Stańczak and Isabelle Augenstein, who surveyed 304 papers on gender bias in natural language processing, write:
Despite a myriad of papers on gender bias in NLP methods, we find that most of the newly developed algorithms do not test their models for bias and disregard possible ethical considerations of their work.
In addition to the failure to test for gender bias, the authors note that much of the existing research on gender bias treats gender as binary, which erases non-binary identities and leads to other harms, such as misgendering. The bulk of the research has also focused on English and/or lacks baselines and evaluation metrics.
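For a sense of what even a rudimentary pre-deployment check might look like, here is a minimal sketch assuming the Hugging Face transformers library. The model name and the handful of templates are illustrative choices only; the literature surveyed above covers far more careful metrics and benchmarks.

```python
from transformers import pipeline

# A tiny illustration of a pre-deployment probe: compare how strongly a masked
# language model associates occupations with gendered pronouns.
# bert-base-uncased and these templates are example choices, not a standard benchmark.
fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The doctor said that [MASK] would be late for the surgery.",
    "The nurse said that [MASK] would be late for the shift.",
    "The engineer explained that [MASK] had fixed the bug.",
]

for template in templates:
    # Restrict scoring to the two pronouns and compare their probabilities.
    predictions = fill(template, targets=["he", "she"])
    scores = {p["token_str"]: round(p["score"], 4) for p in predictions}
    print(template, scores)
```

Large, systematic gaps between the two pronouns across many such templates would be one signal worth investigating before deployment; and, as noted above, a binary he/she probe is itself a limited view of gender.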
The gender gap in the NLP community itself is also worth mentioning. Saif M. Mohammad, who looked at tens of thousands of NLP papers and their citations, found that only 29% of first authors and 25% of last authors are female, and that, on average, male first authors are cited “markedly more” than female ones.
Ultimately, humans are making a lot of decisions about the technologies we build, how we evaluate those technologies, and what we deploy and when — and the people most negatively impacted by gender bias are also underrepresented among those decision makers. The next time you come across a machine learning system spewing sexist content, perhaps you will think of this, too.
Read more about gender bias and technology:
Forde, Jessica Zosa, A. Feder Cooper, Kweku Kwegyir-Aggrey, Chris De Sa, and Michael Littman. “Model Selection’s Disparate Impact in Real-World Deep Learning Applications.” arXiv preprint arXiv:2104.00606 (2021).
Mohammad, Saif M. “Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations.” arXiv preprint arXiv:2005.00962 (2020).
Stańczak, Karolina, and Isabelle Augenstein. “A Survey on Gender Bias in Natural Language Processing.” arXiv preprint arXiv:2112.14168 (2021).