Do we need a new paradigm to train efficient ML models?
In the era of Big Data, data has become an invaluable asset for advanced machine learning and artificial intelligence, and much of this data is sourced from human interpretation. Data are categorized, and meaningful patterns, correlations, associations, and implications are then found through empirical analysis, a process akin to semantic interpretation. Various techniques for empirical analysis have been developed and have been standard in Natural Language Processing (NLP) for more than 20 years. One of these techniques is using human annotation to produce a ground truth, or gold standard.
Most of the work in NLP is still based on the presumption that natural language (NL) expressions have a single, unambiguous interpretation in a particular context.
This assumption is only a convenient idealization: every project focused on large-scale annotation has discovered that genuine disagreements are commonplace. These discrepancies can occasionally be attributed to miscommunication or to inadequately described annotation schemes. More often, however, the interpretation of the text turns out to be inherently ambiguous or unclear.
Today the world is advancing in Artificial Intelligence (AI) and Cognitive Science, primarily because of huge, freely available annotated datasets. Not to mention, a majority of these AI-driven supervised learning tasks are built on the notion of a ground truth, also referred to as the ‘gold label’.
What are gold labels?
In the context of machine learning, gold labels are the labels assumed to be true, the ones you train your supervised machine learning algorithm on. They serve as the targets during model training.
What is the myth?
In most machine learning tasks, multiple opinions are flattened, i.e., aggregated using a majority-vote approach, to generate a single ground truth, totally disregarding the disagreement among the opinions (a contrast sketched in the code below). This is rational only when the entire population agrees on a single truth.
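To make the contrast concrete, here is a minimal sketch in Python, with made-up sentiment labels, of what majority vote throws away and what a soft label preserves:

```python
# Minimal sketch: majority vote vs. keeping the full annotation distribution.
# The annotations below are hypothetical.
from collections import Counter

# Five crowd annotators label the same piece of text for sentiment.
annotations = ["positive", "positive", "negative", "positive", "neutral"]

# Majority vote collapses everything into one "gold" label ...
gold_label = Counter(annotations).most_common(1)[0][0]
print(gold_label)  # "positive" -- the 40% who disagreed are discarded

# ... while a soft label keeps the disagreement as a distribution.
soft_label = {lab: n / len(annotations) for lab, n in Counter(annotations).items()}
print(soft_label)  # {"positive": 0.6, "negative": 0.2, "neutral": 0.2}
```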
To give an example, consider two questions we could pose to a classifier expected to identify a cat: ‘Is this a cat?’ versus ‘To what extent does this look like a cat?’
Here the focus shifts from ‘To what extent does the classifier agree with the single truth?’ to ‘To what extent does the classifier agree with the population distribution?’
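Under this reframing, evaluation compares distributions rather than single labels. Here is a hedged sketch of one way to do that, assuming a three-class problem with made-up numbers, using the KL divergence as implemented in SciPy:

```python
# Sketch: compare the classifier's output distribution to the population
# (annotator) distribution instead of to a single gold label.
# All numbers are hypothetical.
import numpy as np
from scipy.stats import entropy

human_dist = np.array([0.6, 0.2, 0.2])  # share of annotators choosing each class
model_dist = np.array([0.7, 0.2, 0.1])  # classifier's softmax output

# entropy(p, q) computes the KL divergence D_KL(p || q);
# 0 would mean the model reproduces the population distribution exactly.
print(entropy(human_dist, model_dist))  # ~0.046 nats
```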
Just as science evolves with every passing day, so do the humanities, and so should the collection of human-annotated data. Now, come to think of it …
Do we really think that machine learning models (especially NLP models) trained on single ground-truth ‘gold labels’ will be able to capture human subjectivity?
So, what’s next?
Given that human interpretation is subjective, collecting multiple crowd annotations of the same object in a dataset provides a wider perspective on subjectivity and interpretation. Studies have shown that when the same text annotation task is given to different people, there is substantial disagreement in the results. Until now, rather than being accepted as a natural property of semantic interpretation, disagreement has been treated as a sign of poor quality in the annotation task, attributed either to a poorly defined task or to insufficiently trained annotators.
However, that’s not always the case. Disagreement is not noise but signal: it gives us information.
So, the big questions to ponder are:
- How can we leverage this disagreement in crowd-annotated data to build powerful NLP models by debunking the notion of the ‘gold label’?
- What kinds of evaluation metrics do we need to evaluate a model trained on a dataset with more than one annotation per object? One candidate is sketched below.
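As a starting point for that second question, here is one candidate metric, sketched under the assumption that every item carries a human label distribution rather than a single gold label: the average cross-entropy between the human distribution and the model’s predicted distribution (the function name soft_cross_entropy is hypothetical):

```python
# Candidate metric: mean cross-entropy H(human, model) over a set of items.
# Lower is better; it is minimized when the model matches the human distribution.
import numpy as np

def soft_cross_entropy(human_dists, model_dists, eps=1e-12):
    """Average cross-entropy between human label distributions and model outputs."""
    human = np.asarray(human_dists)
    model = np.clip(np.asarray(model_dists), eps, 1.0)  # avoid log(0)
    return float(-(human * np.log(model)).sum(axis=1).mean())

# Hypothetical batch: two items, three classes each.
human = [[0.6, 0.2, 0.2], [0.1, 0.8, 0.1]]
model = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]]
print(soft_cross_entropy(human, model))
```

Accuracy against a majority-vote label would score both items above as a simple hit or miss; a soft score like this one instead rewards a model for matching the shape of the human disagreement.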