Embedding Myself to Generate Custom Embeddings Model- GSoC’22 | by Harsh | Sep, 2022

September 12, 2022


An encapsulation of my work and experience throughout Google Summer of Code 2022 at OWASP Maryam

Hey there! Today's read is from Harsh Malani, a final-year CS undergrad. In this post, I'll describe the work I did for OWASP's Maryam project and my experience as a Google Summer of Code 2022 mentee. All credit to my mentors, Saeed Dehqan and Kaushik Sivashankar 🙂

I still remember going through the proposed projects in 2021 and how overwhelmed I was; I didn't even apply. 2022 was a less intimidating experience, and I worked up the courage to apply. Now here I am, completing the last bits of a project I once thought I would never even get into!

It was my friends who pushed me to apply to organizations in the first place, because I didn't have enough confidence myself. Taking their advice, I started looking for organizations and projects that use ML/DL/NLP. Along the way, I stumbled across OWASP's Maryam project, and it immediately caught my attention. Doing something uncommon and not very popular was what kept me going. Getting the acceptance notification, days after submitting my proposal just before the deadline, was probably one of my happiest moments.

My project was to implement a deep neural network for clustering, topic modelling and document retrieval. For such unsupervised tasks, NLP combined with neural nets has always been the way to go, thanks to their ability to identify complex patterns even in data that is not directly separable.

Maryam has a module named Iris, which handles clustering and sentiment analysis and also includes a meta-search engine, all using NLP. My task was to implement this NLP module using neural networks. The network could be of any type, like DCEN, DCC, DAC, or even transfer learning techniques. We ended up creating a novel embeddings model, not perfect yet, implemented with TensorFlow 2.

Before starting the actual implementation, I had only a basic idea of how to proceed, since I had little experience with unsupervised tasks. My mentors recommended an amazing book:

Natural Language Processing in Action by Lane, Howard and Hapke

Kudos to the authors; it gave me a deep and clear understanding of NLP concepts and fundamentals. The next phase of research, or rather a simultaneous one, of course involved publications and research papers. Off the top of my head, I recall going through and understanding 15–20 papers on related work.

Yeah, reading, theory, and research are an important part of any Machine Learning project, but so is actually implementing the theory. I recall spending hours and hours just reading without getting hands-on, which was, and is, a fatal mistake.

After spending weeks on this, we decided to implement our own custom deep embeddings model, which can and will be used for clustering. The idea was for this model to incorporate as much metadata from the raw data as possible, in order to generate better embeddings.

Here comes the best part. This section has all the details of how the new model was built and which features were included. Skip it if you don't want to go into too much technical detail.

Context and word2vec

As my mentor put it, NLP and word embeddings are about how well we can represent a word given a corpus. So why not use more information from the data, like the distances between words, word probabilities, or frequencies? Better yet, why not use all of it!

One embeddings model that almost everyone involved with ML knows about is word2vec. This model basically:

  • Takes either partial or full context into consideration.
  • Uses one-hot encoded vectors to represent the context words.
  • Uses these context vectors to predict the target/source word for that particular context.
  • Takes the weights between the input and the hidden layer as the word embeddings.
Simple visualization of word2vec. (source: https://bit.ly/3L5WDUh)
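
For intuition, here's a minimal word2vec-style sketch in TensorFlow 2. This is not the Maryam implementation; the vocabulary size, embedding dimension and layer names are placeholders I picked for illustration.

import tensorflow as tf

vocab_size = 8       # placeholder vocabulary size
embedding_dim = 10   # placeholder embedding dimension

# One-hot context word in, probability distribution over target words out.
inputs = tf.keras.Input(shape=(vocab_size,))
hidden = tf.keras.layers.Dense(embedding_dim, use_bias=False, name="embedding")(inputs)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# After training on (one-hot context, one-hot target) pairs, the weights between
# the input and the hidden layer are the word embeddings, one row per word.
embeddings = model.get_layer("embedding").get_weights()[0]  # shape: (vocab_size, embedding_dim)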

Generating context map

Here is how we processed the data and generated a new type of context map. The map:

  • Takes a full context window (the number of words considered before and after the source/target word) into consideration.
  • Records the distance of each context word from the source word; this distance is updated if the context word occurs multiple times in a window.
  • Also updates the distance if the target word itself occurs more than once in a line of the corpus (outside the context window).
  • Takes the frequency of each context word's occurrences into consideration.
  • Takes the probability of each context word (global as well as within the context window) into consideration. Note: this feature is yet to be implemented; a constant is assigned in the meantime.

This gives us 3 additional pieces of information, or features. The context data is generated as a dictionary and stored as JSON. For example, let's take the following sentence in our corpus:

The quick brown fox jumps over lazy dog

Assuming a window size of 3 (the window here is the number of words considered before and after the target word for context, excluding the target word itself), the context map will look like this:

{
  ...
  "fox": {
    "the": [dist, freq, prob],
    "quick": [dist, freq, prob],
    "brown": [dist, freq, prob],
    "jumps": [dist, freq, prob],
    "over": [dist, freq, prob],
    "lazy": [dist, freq, prob]
  },
  ...
}

Such a context is generated for every word, i.e., "The", "quick", …, "fox", …, "dog", and the same process is repeated throughout the corpus.
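
To make this concrete, here's a simplified sketch of how such a context map could be built. The exact update rules in the Maryam implementation may differ; in this sketch the distance is the positional offset (kept as the minimum when a context word repeats), the frequency counts occurrences in the window, and the probability is a constant placeholder since that feature isn't implemented yet.

import json

def build_context_map(sentence, window=3, prob_placeholder=1.0):
    words = sentence.lower().split()
    context_map = {}
    for i, target in enumerate(words):
        entry = context_map.setdefault(target, {})
        start, end = max(0, i - window), min(len(words), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue
            ctx, dist = words[j], abs(j - i)
            if ctx in entry:
                prev_dist, freq, prob = entry[ctx]
                entry[ctx] = [min(prev_dist, dist), freq + 1, prob]   # update on repeats
            else:
                entry[ctx] = [dist, 1, prob_placeholder]              # [dist, freq, prob]
    return context_map

cmap = build_context_map("The quick brown fox jumps over lazy dog")
print(json.dumps(cmap["fox"], indent=2))   # six context words, as in the map above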

One-hot vectors

One-hot vectors are also generated for every word in the corpus, so that every word has a unique representation. A sample of the one-hot vectors for the same sentence used above:

"The" : [1,0,0,0,0,0,0,0,0]
"fox" : [0,0,0,1,0,0,0,0,0]
"dog" : [0,0,0,0,0,0,0,0,1]

Preparing final data

Once the context map and one-hot vectors are ready, they are multiplied element-wise. For example, when my target word is "fox":

  • The words in the context window before and after the target are taken. If there are not enough words, the context is padded with a fixed string, which has its own context map and one-hot encoding.
  • The one-hot vectors of these words are multiplied with their context arrays with respect to the target word.
  • Here, with "fox" as the example, the context will be: ["brown", "jumps", "quick", "over", "The", "lazy"]. Each of these words is replaced with its context array ([dist, freq, prob]) with respect to "fox" and then multiplied with its one-hot vector.
  • For each target word, the input consists of the entire context, i.e., window*2 words.
  • Another thing to note is the order in which the context words appear: the words at distance 1 on both sides are taken first, followed by distance 2, and so on until the window ends. A sketch of this step follows the list.
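
Here's the promised sketch of this input-preparation step. The post doesn't pin down exactly how the [dist, freq, prob] arrays are combined with the one-hot vectors, so this sketch interprets the multiplication as an outer product of each context word's one-hot vector with its context array, flattened into one input vector per target word; the padding handling is also simplified (in the real module, the padding string has its own context map and one-hot encoding).

import numpy as np

def prepare_input(target, words, context_map, one_hot, window=3, pad="<pad>"):
    vocab_size = len(next(iter(one_hot.values())))
    i = words.index(target)
    # Order the context by distance: words at distance 1 on both sides first, then 2, ...
    ordered = []
    for d in range(1, window + 1):
        for j in (i - d, i + d):
            ordered.append(words[j] if 0 <= j < len(words) else pad)
    blocks = []
    for ctx in ordered:
        feats = np.array(context_map.get(target, {}).get(ctx, [0.0, 0.0, 0.0]))  # [dist, freq, prob]
        vec = one_hot.get(ctx, np.zeros(vocab_size, dtype=int))                  # simplified padding
        blocks.append(np.outer(vec, feats))     # one (vocab_size, 3) block per context word
    return np.concatenate([b.ravel() for b in blocks])

# Usage, reusing `cmap` and `one_hot` from the sketches above:
# words = "the quick brown fox jumps over lazy dog".split()
# x = prepare_input("fox", words, cmap, one_hot)   # length = window*2 * vocab_size * 3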

The neural network

Currently, a simple enough neural network is used, with an input layer, 2 hidden layers and an output layer. The size, or dimension, of the embedding is fixed; it also determines the number of nodes in one of the hidden layers.

The neural net used to get embeddings

All the training data generated above is passed through the input layer, followed by hidden layers 1 and 2, and finally the output layer, as seen above. The intermediate weights w2 and b2 are used to get the final word embeddings.
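
Here's a minimal TensorFlow 2 sketch of such a network. The hidden-layer width, activations, loss and input size are assumptions on my part; the point is that the embedding-sized hidden layer's weights and bias (the w2 and b2 above) are what yield the embeddings.

import tensorflow as tf

input_dim = 6 * 8 * 3    # placeholder: (window*2) context words x vocab_size x 3 features
embedding_dim = 10       # fixed embedding dimension, also the width of hidden layer 2
vocab_size = 8           # placeholder output size (one-hot target word)

inputs = tf.keras.Input(shape=(input_dim,))
h1 = tf.keras.layers.Dense(64, activation="relu")(inputs)                 # hidden layer 1
h2 = tf.keras.layers.Dense(embedding_dim, name="embedding_layer")(h1)     # hidden layer 2
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(h2)     # output layer
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# After training, the weights and bias of hidden layer 2 (w2, b2) are extracted
# and used to compute each word's final embedding.
w2, b2 = model.get_layer("embedding_layer").get_weights()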

The result is a deep word embedding for each word: an array of floating-point values whose size equals the embedding dimension and which best represents the target word. A sample of how an embedding looks with dimension 10:

"fox" : tf.Tensor([ 3.307526    1.2745522   0.6221365   1.0608096  -1.3911397  -2.7389529 -0.03785023 -1.6968094   0.61216784  0.995255  ], shape=(10,), dtype=float32)

The embeddings can then be used as required, be it for classification, clustering, or training other NLP models!
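
As one example of a downstream use, here's a hedged sketch of clustering the embeddings with k-means via scikit-learn. The embedding matrix below is a random stand-in for the learned embeddings, and the number of clusters is arbitrary.

import numpy as np
from sklearn.cluster import KMeans

word_embeddings = np.random.rand(100, 10)    # stand-in for the learned (n_words, 10) embeddings
cluster_labels = KMeans(n_clusters=5, n_init=10).fit_predict(word_embeddings)
print(cluster_labels[:10])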

Certainly a lot lies ahead. Although only 3 features are considered in this model for now, more will follow. After all, the more features and information, the better the embeddings.

Along with that, the structure of the neural network can also be changed as desired. More specialised cells like LSTMs, or maybe even CNNs, can be integrated here. Transformers are the new state of the art, so even an encoder-decoder structure could be adopted. The scope is endless, which is pretty normal for any ML/DL application!


This concludes my work for now, but I’ll keep working on this model until it becomes a benchmark! If you are interested in the implementation, check this repo out. For any suggestions or feedback, feel free to reach out via LinkedIn or Twitter.

Special thanks to Shubham Palriwala, Aarush Bhat and Hemanth Krishna 🙂




