
## How are they better than Word2Vec?

Table of Contents

- 🧤 GloVe
  - ⚙️ The Basics
  - 🧮 Cost Function Derivation
  - 🔮 Final Prediction
  - 🪙 Advantages & Limitations
- ⏩ fastText
  - 📚 Skip-gram reviewed
  - 📈 Improving Skip-gram
  - 🆚 fastText vs Word2Vec
- 🚀 Summary

In most cases, a **Word2Vec** embedding is better than the bag-of-words representation of texts: it lets you customize the length of the feature vectors and captures the relationships between tokens using **(center, context)** word pairs.

Yet, there are still some limitations to Word2Vec, four of which are:

- Word2Vec relies on local information about words, i.e. the context of a word depends only on its neighbors.
- The word embedding is a byproduct of training a neural network, so the linear relationships between feature vectors are (somewhat of) a black box.
- Word2Vec cannot handle out-of-vocabulary (OOV) words, i.e. words not present in the training data. You could assign a UNK token to all OOV words, or use other models that are robust to OOV words.
- By assigning a distinct vector to each word, Word2Vec ignores the **morphology** of words. For example, *eat*, *eats*, and *eaten* are treated as completely different words by Word2Vec, even though they share the same root, *eat*, which might carry useful information.

In this story, we will introduce embedding models that in theory could resolve these limitations: **GloVe** and **fastText**. From now on, we will call a single observation of text by **document** and a collection of documents by **corpus**.

## 🧤 GloVe

## ⚙️ The Basics

Unlike Word2Vec, GloVe (Global Vectors) is designed to find linear structures of feature vectors *directly*. In doing so, it captures global corpus statistics, taking advantage of the vast amount of repetition in the data. GloVe tackles limitations 1 and 2 of Word2Vec.

First, some notation! Let *X* denote the word-word co-occurrence counts matrix, where *Xᵢⱼ* is the number of times word *j* occurs in the context of word *i*. Let *Xᵢ* = Σₖ *Xᵢₖ* be the number of times any word appears in the context of word *i*, and let *Pᵢⱼ* = *Xᵢⱼ* / *Xᵢ* be the probability that word *j* appears in the context of word *i*.

As an example, let’s say you have a corpus of one document that says “The quick brown fox jumps over the lazy dog”. Then with a co-occurrence window size of 1, *Xᵢⱼ* counts how often words *i* and *j* appear next to each other. It’s easy to see that *X* is symmetric.
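The window-1 counts for that sentence can be computed with a short sketch (the function name and the dict-of-pairs representation of *X* are my own choices for illustration, not from any particular library):

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=1):
    """Count how often each (word, neighbor) pair co-occurs within the window."""
    X = defaultdict(int)
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[(center, tokens[j])] += 1
    return dict(X)

tokens = "the quick brown fox jumps over the lazy dog".split()
X = cooccurrence_matrix(tokens, window=1)
print(X[("the", "quick")])  # 1
print(X[("lazy", "the")] == X[("the", "lazy")])  # True — X is symmetric
```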

The idea of GloVe is to convert *X* into feature matrices *W* (whose rows are populated by *wᵢ* or *wⱼ*) and *V* (whose rows are populated by *vₖ*). The most general GloVe model takes the form

*F*(*wᵢ*, *wⱼ*, *vₖ*) = *Pᵢₖ* / *Pⱼₖ*

where *F* relates *W* and *V* to *X*. Let’s see how *F* distinguishes relevant words from irrelevant words:

- If word *k* is related to both words *i* and *j* (*Pᵢₖ* and *Pⱼₖ* are both large), or unrelated to both (*Pᵢₖ* and *Pⱼₖ* are both small), then the value of *F* will be close to 1.
- If word *k* is related to *exactly one* of the words *i* and *j* (one of *Pᵢₖ* and *Pⱼₖ* is small, the other large), then the value of *F* will be far from 1.

## 🧮 Cost Function Derivation

There are many algebraic manipulations of feature vectors that satisfy *F*. To find an appropriate cost function, let’s shrink down the number of possibilities for *F* so that it is calculable but still makes sense. This is an 8-step process:

**(1).** Since vector spaces are inherently linear structures, the most natural way to encode the information present in the ratio *Pᵢₖ* / *Pⱼₖ* is with vector differences. Hence, you could restrict *F* to be

*F*(*wᵢ* − *wⱼ*, *vₖ*) = *Pᵢₖ* / *Pⱼₖ*

**(2).** Note that *F* converts vectors to a scalar. To maintain the linear structure of the vectors, restrict *F* one more time so that it receives the dot product of its arguments:

*F*((*wᵢ* − *wⱼ*)ᵀ *vₖ*) = *Pᵢₖ* / *Pⱼₖ*

**(3).** There is nothing special about *i* and *j*; they point to arbitrary words in the corpus. Hence, you could flip the roles of *i* and *j* in equation (2) to obtain

*F*((*wⱼ* − *wᵢ*)ᵀ *vₖ*) = *Pⱼₖ* / *Pᵢₖ*

Defining *u* = *wᵢ* − *wⱼ*, the two equations together give

*F*(*u*ᵀ *vₖ*) · *F*(−*u*ᵀ *vₖ*) = 1

**(4).** The distinction between a center word and a context word in *X* is arbitrary and you are free to exchange the two roles. To do so consistently, you must not only exchange *W* ↔ *V* but also *X* ↔ *Xᵀ*.

The model should be invariant under this relabeling, but equation (2) is not: the RHS is invariant thanks to the symmetry of *X*, while the LHS isn’t. To achieve invariance, *F* is further assumed to be a **group homomorphism** from (ℝ, +) to (ℝ⁺, ×). Yeah, we restrict *F* one more time. Hence

*F*((*wᵢ* − *wⱼ*)ᵀ *vₖ*) = *F*(*wᵢ*ᵀ *vₖ*) / *F*(*wⱼ*ᵀ *vₖ*)

**(5).** One solution to equation (4) is

*F*(*wᵢ*ᵀ *vₖ*) = *Pᵢₖ* = *Xᵢₖ* / *Xᵢ*

The group homomorphism requirement on *F* is satisfied by *F* = exp, so

*wᵢ*ᵀ *vₖ* = log(*Pᵢₖ*) = log(*Xᵢₖ*) − log(*Xᵢ*)

**(6).** Since the term log(*Xᵢ*) doesn’t depend on *k*, it can be absorbed into a bias term *bᵢ* for *wᵢ*. To restore the exchange symmetry, also add a bias *aₖ* for *vₖ*. The final model becomes

*wᵢ*ᵀ *vₖ* + *bᵢ* + *aₖ* = log(*Xᵢₖ*)

**(7).** The model in part (6) is called the log-bilinear regression model. The optimal values for parameters *wᵢ*, *vₖ*, *bᵢ*, and *aₖ* can be found using the least squares method.

A weighting function *f*(*Xᵢₖ*) is introduced to compensate whenever *Xᵢₖ* = 0, since log(*Xᵢₖ*) is undefined at that point, and also to balance the contributions of frequent and infrequent words to the model. The cost function of the model becomes

*J* = Σᵢ Σₖ *f*(*Xᵢₖ*) (*wᵢ*ᵀ *vₖ* + *bᵢ* + *aₖ* − log(*Xᵢₖ*))²

where both sums run from 1 to `vocab_size`, the size of the vocabulary.

**(8).** What is a good choice of *f*? First, for the reason explained in part (7), *f* should satisfy *f*(0) = 0; if *f* is continuous, it should approach zero as *x* → 0 fast enough that *f*(*x*) log²(*x*) stays finite. Second, *f*(*x*) should be non-decreasing so that rare co-occurrences (small *x*) are not overweighted. Third, *f*(*x*) should be relatively small for large values of *x*, so that frequent co-occurrences are not overweighted either. One good, simple choice of *f* is

*f*(*x*) = (*x* / *x*ₘₐₓ)^*α* if *x* < *x*ₘₐₓ, and *f*(*x*) = 1 otherwise

where *x*ₘₐₓ and *α* are hyperparameters. This *f* rises smoothly from 0 and caps at 1 once *x* reaches *x*ₘₐₓ.
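A minimal implementation of this weighting function, using *x*ₘₐₓ = 100 and *α* = 0.75 as illustrative defaults (these are commonly used values, but any hyperparameters of your choice work):

```python
def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: grows as (x / x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(weight(0.0))    # 0.0 — zero co-occurrences contribute nothing to the cost
print(weight(100.0))  # 1.0 — frequent co-occurrences are capped
```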

## 🔮 Final Prediction

So, you already:

- calculated the word-word co-occurrence counts matrix *X*, and
- initialized the parameters *wᵢ*, *vₖ*, *bᵢ*, and *aₖ* randomly.

After some iterations, you obtained optimized *wᵢ* and *vₖ* that build up word embeddings *W* and *V*, respectively. Now, what? Which one do you choose as your final word embedding?

Since *X* is symmetric, *W* and *V* are equivalent and differ only as a result of their random initialization. To reduce overfitting, the final word embedding is set to *W* + *V*.
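Putting steps (6)–(8) together, here is a minimal numpy sketch of GloVe training by SGD that returns *W* + *V* as the final embedding. The hyperparameters and the toy co-occurrence matrix are illustrative only, and `C` plays the role of *V* in the text:

```python
import numpy as np

def train_glove(X, dim=4, epochs=100, lr=0.05, x_max=100.0, alpha=0.75):
    """Sketch of GloVe: SGD on f(X_ik) * (w_i·v_k + b_i + a_k - log X_ik)^2."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n, dim))  # center-word vectors (W)
    C = rng.normal(scale=0.1, size=(n, dim))  # context-word vectors (V)
    b = np.zeros(n)                           # center biases b_i
    a = np.zeros(n)                           # context biases a_k
    nz = np.argwhere(X > 0)                   # only nonzero co-occurrences matter
    for _ in range(epochs):
        for i, k in nz:
            x = X[i, k]
            f = (x / x_max) ** alpha if x < x_max else 1.0
            err = W[i] @ C[k] + b[i] + a[k] - np.log(x)
            g = 2.0 * f * err
            dW, dC = g * C[k], g * W[i]       # gradients before updating
            W[i] -= lr * dW
            C[k] -= lr * dC
            b[i] -= lr * g
            a[k] -= lr * g
    return W + C  # final embedding: W + V, as described above

X = np.array([[0.0, 2.0], [2.0, 0.0]])  # toy symmetric counts matrix
emb = train_glove(X)
print(emb.shape)  # (2, 4)
```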

## 🪙 Advantages & Limitations

Advantages:

- Since GloVe tries to capture linear structures of feature vectors directly, it performs better than Word2Vec on **word analogy** tasks. It also outperforms related models on **similarity** tasks and **named entity recognition**.
- GloVe gives proper weights to word pairs so that no single word dominates the training process.

Limitations:

- GloVe is trained on a word-word co-occurrence counts matrix, which takes a lot of memory for storage, especially if you are working on a very large corpus.
- Just like Word2Vec, GloVe doesn’t recognize OOV words and ignores the morphology of words.

## ⏩ fastText

As the name suggests, fastText is a fast-to-train word representation based on the Word2Vec skip-gram model; it can be trained on more than one billion words in less than ten minutes using a standard multicore CPU.

fastText can address limitations 3 and 4 of the vanilla Word2Vec mentioned at the beginning of this story. The model learns word representations while also taking into account morphology, which is captured by considering subword units (character *n*-grams).

## 📚 Skip-gram reviewed

The objective of skip-gram is to find word representations that are useful for predicting the surrounding (context) words in a document. More formally, given a training corpus represented as a sequence of words *w₁*, …, *wₘ*, the objective of skip-gram is to maximize the following log-likelihood:

Σᵢ₌₁…ₘ Σⱼ∈𝒞ᵢ log *p*(*wⱼ* | *wᵢ*)

where 𝒞ᵢ is the set of indices of context words surrounding the center word *wᵢ*. One possible choice for the probability *p* is the softmax over the output layer of a dense neural network.

The diagram below explains the training process of skip-gram. Here, ∑ represents the weighted sum of the output of the network’s previous layer, and S means softmax.


This formulation is impractical because of the large computational cost. However, the problem of predicting context words can instead be framed as a set of independent binary classification tasks, where the goal is to independently predict the presence (or absence) of context words.

For a center word at index *i* in the corpus, all corresponding context words are treated as positive examples, and negative examples (words other than the context words) are sampled at random from the corpus. Using the binary logistic loss, skip-gram minimizes the following objective function:

Σᵢ₌₁…ₘ [ Σⱼ∈𝒞ᵢ log(1 + e^(−*s*(*wᵢ*, *wⱼ*))) + Σₙ∈𝒩ᵢ,ⱼ log(1 + e^(*s*(*wᵢ*, *n*))) ]

where 𝒩ᵢ,ⱼ is the set of indices of negative examples sampled from the corpus and *s* is a scoring function that maps **(center, context)** pairs to ℝ. For skip-gram, *s* is computed by a dense neural network.
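The binary logistic loss ℓ(*x*) = log(1 + e^(−*x*)) in this objective can be sketched as follows, applied to the score for positive pairs and to the negated score for negative samples:

```python
import math

def logistic_loss(score, positive):
    """Binary logistic loss log(1 + e^(-x)); the score is negated for negatives."""
    x = score if positive else -score
    return math.log1p(math.exp(-x))

# A high score on a true context word gives a small loss,
# while the same high score on a negative sample gives a large loss.
print(logistic_loss(5.0, positive=True) < logistic_loss(5.0, positive=False))  # True
```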

## 📈 Improving Skip-gram

Since skip-gram uses a distinct feature vector for each word, it ignores the internal structures of words. fastText attempts to solve this by treating each word as the aggregation of its character *n*-grams. The vector for a word is simply taken to be the sum of all vectors of its character *n*-grams.

But how are the character *n*-grams created? Let’s say you have the word *write* with *n* = 3. Then the character 3-grams of *write* would be

`<wr`, `wri`, `rit`, `ite`, `te>`

There are five character 3-grams and one special sequence, `<write>`, for the whole word. The symbols `<` and `>` at the beginning and end of words allow us to distinguish prefixes and suffixes from other character sequences.
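The extraction of character *n*-grams described above can be sketched in a few lines (a simplified version; real implementations typically extract a range of *n* values at once):

```python
def char_ngrams(word, n=3):
    """Character n-grams of '<word>', plus the bracketed whole word itself."""
    w = f"<{word}>"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    grams.append(w)  # special sequence for the whole word
    return grams

print(char_ngrams("write"))
# ['<wr', 'wri', 'rit', 'ite', 'te>', '<write>']
```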

After creating character *n*-grams with an *n* of your choice, you’d have a dictionary of *n*-grams of size *G*. Let 𝒢ᵢ ⊂ {1, …, *G*} be the set of *n*-grams appearing in word *wᵢ*, and let *zg* be the vector representation of *n*-gram *g*.

Define the scoring function of a **(center, context)** pair for fastText as the sum of the dot products of the vector representations of the center word’s *n*-grams and the context word, i.e.,

*s*(*wᵢ*, *wⱼ*) = Σg∈𝒢ᵢ *zg*ᵀ *vⱼ*

This simple model allows sharing representations across words, which lets it learn reliable representations even for rare words. Hence, fastText modeling really is just solving the same negative-sampling objective as skip-gram, with this subword-based scoring function in place of the neural network.

Since working with *n*-grams greatly multiplies the number of units to store, you need to address the memory issue. You could use a hashing function that maps *n*-grams to integers in 1 to *K*, where *K* is a large number of your choice.
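A sketch of such a hashing trick, using an FNV-1a-style hash (the hash function and the value of *K* here are illustrative choices, not necessarily what any particular implementation uses):

```python
def ngram_bucket(ngram, K=2_000_000):
    """Map an n-gram to an integer bucket in 1..K via an FNV-1a-style hash."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h % K + 1

# The same n-gram always lands in the same bucket, so only K vectors are stored.
print(1 <= ngram_bucket("<wr") <= 2_000_000)  # True
```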

## 🆚 fastText vs Word2Vec

fastText does significantly better on **syntactic** tasks compared to the vanilla Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms fastText on **semantic** tasks though. The differences grow smaller as the size of the training corpus increases.

Unlike Word2Vec, fastText can obtain vectors even for OOV words, by summing up vectors for its component character *n*-grams, provided at least one of the character *n*-grams was present in the training data.
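That OOV behavior can be sketched as follows, assuming a hypothetical dictionary `ngram_vecs` of trained *n*-gram vectors (the values here are made up for illustration):

```python
import numpy as np

def oov_vector(word, ngram_vecs, n=3, dim=4):
    """Sum the vectors of the word's character n-grams that were seen in training."""
    w = f"<{word}>"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)] + [w]
    vecs = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

# hypothetical trained n-gram vectors (made-up values for illustration)
ngram_vecs = {"<wr": np.ones(4), "wri": np.ones(4)}
print(oov_vector("writes", ngram_vecs))  # [2. 2. 2. 2.] — shares '<wr', 'wri' with 'write'
```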

## 🚀 Summary

Congrats on reaching the end!

In this story, you were introduced to two methods that could potentially improve on Word2Vec:

- **GloVe** finds linear structures of feature vectors directly, by also using global information about words.
- **fastText** learns morphology through character *n*-grams of words and can estimate feature vectors for OOV words.

GloVe performs better than Word2Vec for **word analogy** tasks. It also outperforms related models on **similarity** tasks and **named entity recognition**. fastText does significantly better on **syntactic** tasks than Word2Vec while Word2Vec slightly outperforms fastText on **semantic** tasks.