An encapsulation of my work and experience throughout Google Summer of Code 2022 at OWASP Maryam
Hey there! I’m Harsh Malani, a final-year CS undergrad. In this post, I’ll describe the work I did for OWASP’s Maryam project and briefly share my experience as a Google Summer of Code 2022 mentee. All credit to my mentors Saeed Dehqan and Kaushik Sivashankar 🙂
I still remember going through the proposed projects in 2021 and being so overwhelmed that I didn’t even apply. 2022 was a less intimidating experience, and I worked up the courage to apply. Now here I am, completing the last bits of the project, when I once thought I would never even get in!
It was my friends who pushed me to apply to orgs in the first place, because I didn’t have enough confidence in myself. Taking their advice, I started looking for orgs and projects that use ML/DL/NLP. Along the way, I stumbled across OWASP’s Maryam project, and it immediately caught my attention. Doing something uncommon and not very popular was what kept me going. Getting the acceptance notification, days after submitting my proposal just before the deadline, was probably one of my happiest moments.
My project was to implement a deep neural network for clustering, topic modelling and document retrieval. For such unsupervised tasks, NLP combined with neural networks has long been the way to go, thanks to their ability to identify complex patterns even in data that is not directly separable.
Maryam has a module named Iris that performs clustering and sentiment analysis, and also includes an NLP-based meta-search engine. My task was to implement this NLP module using neural networks. The network could be of any type, such as DCEN, DCC, DAC, or even transfer learning techniques. We ended up creating a novel embeddings model, not perfect yet, implemented using TensorFlow 2.
Before starting the actual implementation, I had only a basic idea of how to proceed, since I had little experience with unsupervised tasks. My mentors recommended an amazing book:
Kudos to its authors; it gave me a deep and clear understanding of NLP concepts and fundamentals. The next phase, or rather a simultaneous one, involved publications and research papers, of course. If I had to count, I went through and understood 15–20 papers on related work.
Yes, reading, theory, and research are important aspects of any machine learning project, but so is actually implementing the theory. I recall spending hours and hours on reading without actually getting hands-on, which was, and is, a fatal mistake.
After spending weeks on this, we decided to implement our own custom deep embeddings model, which can and will be used for clustering. The model was designed to incorporate as much metadata as possible from the raw data to generate better embeddings.
Here comes the best part. This section has all the details on how the new model was built, which features were included, and so on. Skip it if you don’t want to go deep into the technical details.
Context and word2vec
As my mentor put it, NLP and word embeddings are about how well we can represent a word given a corpus. So why not use more information from the data, like the distances between words, word probabilities, or frequencies? Better yet, why not use all of that information!
One embeddings model that almost everyone involved with ML knows is word2vec. In essence, this model:
- Takes either partial/full context into consideration.
- Uses one-hot encoded vectors to represent the context words.
- Uses these context words to predict the target word, or conversely, the target word to predict its context.
- The weights between the input and the hidden layer are used as word embeddings.
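To make the steps above concrete, here is a minimal sketch (not Maryam’s implementation) of how CBOW-style word2vec training pairs can be built from a corpus: one-hot context vectors paired with a one-hot target. The sentence, vocabulary ordering, and window size are illustrative assumptions.

```python
def one_hot(index, size):
    """Return a one-hot vector with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def cbow_pairs(tokens, vocab, window=2):
    """Yield (context one-hots, target one-hot) pairs for each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((
            [one_hot(vocab[w], len(vocab)) for w in context],
            one_hot(vocab[target], len(vocab)),
        ))
    return pairs

tokens = "the quick brown fox".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
pairs = cbow_pairs(tokens, vocab, window=2)
```

Training a network to predict the target from such pairs, and then reading off the input-to-hidden weights, is what yields the classic word2vec embeddings.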
Generating context map
This is how we processed and generated a new type of context map that:
- Takes the full context window (the number of words considered before and after the source/target word) into account.
- Records the distance of each context word from the source word; this distance is updated if the context word occurs multiple times in the window.
- If the target word itself occurs more than once in a line of the corpus (outside the context window), the distance is updated as well.
- Takes the frequency of each context word’s occurrence into account.
- Takes the probability of each context word (both global and within the context window) into account. Note: this feature is yet to be implemented; a constant is assigned in the meantime.
This leads to the consideration of 3 additional features or information. The context data is generated as a dictionary and stored as JSON. For example, let’s take the following sentence in our corpus:
The quick brown fox jumps over lazy dog
Assuming a window size of 3 (the window represents how many words are considered before and after the target word for context, excluding the target word itself), the context map for the target word “fox” will look like this:
"the" : [dist,freq,prob],
"quick" : [dist,freq,prob],
"brown" : [dist,freq,prob],
"jumps" : [dist,freq,prob],
"over" : [dist,freq,prob],
"lazy" : [dist,freq,prob]
Such a context map is generated for every word, i.e., “The”, “quick” …. “fox” …. “dog”, and the same process is repeated throughout the corpus.
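The context-map construction described above can be sketched roughly as follows. This is an illustration, not the exact Maryam/Iris code: the update rule for repeated context words (minimum distance here) and the probability placeholder `PROB` are assumptions for demonstration.

```python
PROB = 1.0  # placeholder: the probability feature is not implemented yet

def context_map(tokens, target_index, window=3):
    """Map each context word to [distance, frequency, probability]."""
    ctx = {}
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for j in range(lo, hi):
        if j == target_index:
            continue
        word = tokens[j].lower()
        dist = abs(j - target_index)
        if word in ctx:
            # Repeated context word: update distance (minimum is one
            # plausible rule) and bump the frequency count.
            ctx[word][0] = min(ctx[word][0], dist)
            ctx[word][1] += 1
        else:
            ctx[word] = [dist, 1, PROB]
    return ctx

tokens = "The quick brown fox jumps over lazy dog".split()
fox_ctx = context_map(tokens, tokens.index("fox"), window=3)
```

For “fox” this yields entries for exactly the six words shown in the sample above, each with its distance from the target.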
One-hot vectors are also generated for every word in the corpus, so that every word has a unique representation. A sample of the one-hot vectors for the same sentence used above:
"The" : [1,0,0,0,0,0,0,0,0]
"fox" : [0,0,0,1,0,0,0,0,0]
"dog" : [0,0,0,0,0,0,0,0,1]
Preparing final data
Once the context map and one-hot vectors are ready, they are multiplied element-wise. For example, when the target word is “fox”:
- The words in the context window before and after the target are taken. If not enough words are available, the context is padded with a fixed string that has its own context map and one-hot encoding.
- Each of these context words is multiplied with its context-map entry with respect to the target word.
- Here, with “fox” as an example, the context will be: [“brown”, “jumps”, “quick”, “over”, “The”, “lazy”]. Each of these words is replaced with its context array ([dist, freq, prob]) with respect to “fox” and then multiplied with its one-hot vector.
- For each target word, the input consists of the entire context, i.e., window × 2 words.
- Another thing to note is the order in which the context words appear: the words at distance 1 on both sides come first, followed by distance 2, and so on until the window ends.
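As a sketch of the multiplication step: one plausible reading of “multiplied element-wise” is scaling the context word’s one-hot vector by each of its [dist, freq, prob] features, i.e., an outer product. The vectors and features below are hypothetical, and the real model may combine them differently.

```python
def encode_context_word(one_hot_vec, features):
    """Scale the one-hot vector by each feature value (outer product)."""
    return [[f * x for x in one_hot_vec] for f in features]

# Hypothetical inputs for one context word ("brown", distance 1 from "fox"):
brown_one_hot = [0, 0, 1, 0, 0, 0, 0, 0]
brown_features = [1, 1, 1.0]  # [dist, freq, prob placeholder]
encoded = encode_context_word(brown_one_hot, brown_features)
```

Repeating this for all window × 2 context words, in the distance-ordered sequence described above, gives the input for one target word.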
The neural network
Currently, a simple neural network is used, with an input layer, 2 hidden layers and an output layer. The size, or dimension, of the embedding is fixed; it also determines the number of nodes in one of the hidden layers.
All the above-generated training data is passed through the input layer, followed by hidden layers 1 and 2, and finally the output layer. The intermediate weights w2 and biases b2 are used to get the final word embeddings.
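A toy numpy sketch of that architecture is shown below. The layer sizes, activations, and random weights are illustrative stand-ins for the trained TensorFlow 2 parameters; the point is only where the embedding is read off.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden1, embed_dim = 8, 16, 10  # embed_dim = embedding dimension

# Randomly initialised weights stand in for trained parameters.
w1, b1 = rng.normal(size=(vocab_size, hidden1)), np.zeros(hidden1)
w2, b2 = rng.normal(size=(hidden1, embed_dim)), np.zeros(embed_dim)
w3, b3 = rng.normal(size=(embed_dim, vocab_size)), np.zeros(vocab_size)

def forward(x):
    h1 = np.tanh(x @ w1 + b1)
    h2 = np.tanh(h1 @ w2 + b2)  # w2/b2 produce the embedding layer
    out = h2 @ w3 + b3
    return h1, h2, out

x = np.eye(vocab_size)[3]       # one-hot input for, say, "fox"
h1, embedding, logits = forward(x)
# `embedding` is the 10-dimensional deep word embedding for the input word
```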
The result is a deep word embedding for each word: an array of floating-point values, of size equal to the embedding dimension, that best represents the target word. A sample of an embedding with dimension 10:
"fox" : tf.Tensor([ 3.307526 1.2745522 0.6221365 1.0608096 -1.3911397 -2.7389529 -0.03785023 -1.6968094 0.61216784 0.995255 ], shape=(10,), dtype=float32)
The embeddings can then be used as required, be it for classification, clustering, or training other NLP models!
So, what lies ahead? Certainly a lot. Although only 3 features are considered in this model, more will follow, because the more features and information, the better the embeddings.
Along with that, the structure of the neural network can be changed as desired. More specialised cells like LSTMs, or maybe even CNNs, could be integrated. Transformers are the current state of the art, so an encoder-decoder structure could also be adopted. The scope is endless, which is pretty normal for any ML/DL application!
This concludes my work for now, but I’ll keep working on this model until it becomes a benchmark! If you’re interested in the implementation, check out this repo. For any suggestions or feedback, feel free to reach out via LinkedIn or Twitter.