With all the advancements made thus far in Natural Language Processing (NLP) and linguistics, you’d think intelligent systems would be capable of identifying the right meaning of a word used in a given context. However, that is not the case, at least not in low-resource languages like Hindi and code-mixed Hinglish, where much work is yet to be done. This opens up a lot of opportunities for research and application-oriented work in word sense disambiguation. Hence, my co-author, Mirza Yusuf, and I decided to build a Python library that would act as a helpful starting point for those interested in research on word sense disambiguation for Hindi and code-mixed Hinglish. This article is a brief overview of our thought process, our work, and the library.
Word Sense Disambiguation
Put simply, word sense disambiguation refers to the identification of the correct meaning of a word used in a certain context. Here’s an example:
“I hit the ball with a bat.”
A very simple and straightforward sentence, and I’m sure most of us know that “bat” here refers to the wooden stick-like object used in cricket, baseball, etc. The other meaning of “bat” would be the flying mammal that hangs upside down from a tree, but we know that you cannot use it to hit a ball (or so I hope).
So this, essentially, was our problem statement. We carried out research to devise optimal ways of teaching our systems to disambiguate words based on the context in which they are used. While a decent amount of work has been done in high-resource languages like English, low-resource languages still have a long way to go.
Being able to read, write, and understand Hindi fairly well, my co-author and I decided to tackle this problem for both Hindi and code-mixed Hinglish text.
A lot of our initial time went into looking for relevant datasets and pre-existing Python libraries we could build on, along with extensive literature surveys. In the process, we came across WordNet, a lexical database of semantic relations between words, and eventually the IndoWordNet, a linked version of the WordNet covering prominent Indian languages.
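If you want to poke at the IndoWordNet from Python yourself, the third-party pyiwn package exposes it directly. Here’s a minimal sketch (we mention pyiwn purely for illustration; check its documentation for the exact API, as the calls below are based on its published examples):

import pyiwn

pyiwn.download()             # fetch the IndoWordNet data (first run only)
iwn = pyiwn.IndoWordNet()    # defaults to Hindi
# 'सोना' is a classic ambiguous Hindi word: 'gold' vs. 'to sleep'
for synset in iwn.synsets('सोना'):
    print(synset)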
An important decision we had to make was which kind of learning methods/models to use to tackle this problem. Usually, one would look at supervised or unsupervised methods involving neural network architectures such as LSTMs, GRUs, BERT, etc. However, we found the existing supervised and unsupervised methods rather difficult to work with, and the dataset formats were not in a very desirable form either. This prompted us to explore dictionary-based methods which, while experimenting, we found to be faster than supervised or unsupervised methods and a perfect fit for the characteristics of the WordNet.
The Lesk Algorithm
A rather simple but effective algorithm, the Lesk algorithm leverages the structure and characteristics of WordNets to carry out word sense disambiguation.
In essence, it takes in the word to be disambiguated along with the context in which it is used, and then uses the WordNet to determine its most likely meaning.
This is what the pseudo-code looks like:
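A minimal Python sketch of the simplified Lesk idea, assuming a toy dictionary-based sense inventory (a gloss and example sentences per sense) in place of the full IndoWordNet:

def simplified_lesk(word, sentence, sense_inventory):
    """Pick the sense of `word` whose gloss and example words
    overlap most with the words of `sentence`."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, 0
    for sense in sense_inventory.get(word, []):
        # Signature = gloss words + example-sentence words for this sense
        signature = set(sense["gloss"].lower().split())
        for example in sense["examples"]:
            signature |= set(example.lower().split())
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense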
Based on the context (the words surrounding the ambiguous word) and the words the IndoWordNet associates with the ambiguous word (its glosses and example sentences), an overlap score is calculated. This score is computed for each meaning of the word present in the IndoWordNet, and the meaning with the highest overlap is taken to be the correct one.
Here’s an example to provide a better understanding:
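Plugging the sentence from earlier into the simplified_lesk sketch above, with a toy English sense inventory for “bat” (the entries below are purely illustrative; the library itself works over the IndoWordNet):

sense_inventory = {
    "bat": [
        {"gloss": "a club used for hitting the ball in games such as cricket or baseball",
         "examples": ["he swung the bat and hit the ball over the fence"]},
        {"gloss": "a nocturnal flying mammal that hangs upside down to rest",
         "examples": ["a bat flew out of the cave at dusk"]},
    ]
}

best = simplified_lesk("bat", "I hit the ball with a bat", sense_inventory)
print(best["gloss"])
# The cricket/baseball sense wins: 'hit', 'ball', and 'bat' all overlap
# with its signature, while the mammal sense shares fewer words.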
However, there was one major issue we noticed with this algorithm: it provides only an extremely generalized solution. Because the same support words (context) tend to appear across several different meanings, the Lesk algorithm frequently failed to identify the correct meaning of a word in a given sentence.
Custom Lesk Algorithm
This led us to explore other approaches, both supervised and unsupervised, only to realize that datasets in this area were particularly scarce. Hence, we narrowed our problem statement: we decided to enhance word sense disambiguation for 20 of the most commonly used ambiguous Hindi words. This also allowed us to curate a special dataset with the various meanings of these 20 words and use it to enhance our Lesk algorithm.
We proceeded by modifying the Lesk algorithm so that the words present in our custom dataset would also play a part in determining the sense of the word.
We added an extra parameter called ‘intersection’ to enhance the algorithm. Here’s the pseudo-code for our version, and despite it being only a small change, we observed a great difference in the results!
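In sketch form, the modification might look something like the following. This is a minimal illustration under our own assumptions (the custom_senses mapping of word → {gloss: support-word set} and the way the two scores are combined are illustrative, not the library’s exact internals):

def custom_lesk(word, sentence, sense_inventory, custom_senses):
    """Simplified Lesk plus an 'intersection' score computed against
    support words from a hand-curated dataset."""
    context = set(sentence.lower().split())
    best_sense, best_score = None, 0
    for sense in sense_inventory.get(word, []):
        # Standard Lesk overlap: context vs. gloss + example words
        signature = set(sense["gloss"].lower().split())
        for example in sense["examples"]:
            signature |= set(example.lower().split())
        overlap = len(context & signature)
        # Extra signal: intersection of context with curated support words
        support = custom_senses.get(word, {}).get(sense["gloss"], set())
        intersection = len(context & support)
        score = overlap + intersection
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

The curated support words simply add weight to senses that co-occur with words actually observed in our dataset, which is what sharpens the otherwise generalized Lesk scores.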
With a small tweak and a more focused approach, we were able to greatly enhance the accuracy of our dictionary-based Lesk algorithm.
We were able to double the accuracy on the 20 chosen words and the 2 most commonly used meanings of each of these words. More importantly, we were able to create a framework that can be used in the future for the purpose of research or simply as a convenient tool for a project!
While this may not be the most robust solution, and it would surely be time-consuming to extend to a whole language, our hope is that our work here will encourage more research and dataset curation in the future, so that supervised and unsupervised methods can be applied in a straightforward manner and provide quick predictions.
For more information, you can check out our Python package here, or simply:
pip install hindiwsd
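Once installed, the package can be imported like any other Python library. The snippet below is only a hypothetical usage sketch to give a feel for it (the module and function names are placeholders; see the README on our repository for the actual API):

# Hypothetical usage sketch; function names here are placeholders,
# so check the package README for the real API.
from hindiwsd import wsd  # assumed module name

sentence = "mujhe sona pasand hai"  # code-mixed Hinglish; 'sona' is ambiguous
print(wsd.hindi_wsd(sentence))      # assumed call: disambiguate words in context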
Feel free to interact with us on our GitHub repository as well; we are more than happy to help!