
Co-author: **Jacob Bamberger**

Our story begins in the 4th century BC, in Athens. At that time, one of the greatest minds of Western philosophy, Plato, was sitting alone in his favourite tavern near the *acropolis*, sipping wine. He was thinking about the problem of *existence*. He wanted to understand what nature, humanity and the universe *are* and why they *are*. “What are celestial bodies made of?” wondered the philosopher. “Clearly, they cannot be made of rock, otherwise they would already have fallen! Nor of water, otherwise it would already have dripped down onto the Earth… Air is invisible, so we shall exclude it as well. How about fire? Flames do rise towards the sky: this could be a relevant clue! I just need to find a more convincing argument, and then I’ll write my scientific article about it.” That work would be known to posterity as the *Timaeus*. One of its achievements was to associate nature’s basic elements (earth, water, air, fire) with regular geometric shapes: Plato was able to *deduce, from simple geometric principles*, that the stars and the moon were made neither of earth, nor water, nor air, nor fire! Rather, they had to be constituted by a fifth element, lighter than fire, which he would call *Aether*.

How did Plato reach such a conclusion? Well, he (or perhaps the mathematicians of his circle, such as Theaetetus) proved that there are only five regular solids: those now known as the platonic solids. The hypothesis was then to associate with each of them an element of the physical world: earth, water, air and fire. But what about the fifth? And, most importantly, where do we find it? To answer this question, he concluded that celestial bodies must be made of the fifth element.

What does the above story have to do with natural language? Well, what we are exploring in this article is the **relationship between geometry and language**: in particular, if we assume that natural language does have a geometric structure, can we exploit it to deduce a geometric definition of the *meaning of a word*? In a way, we are trying to mimic Plato’s reasoning, hoping to discover an interesting new viewpoint on natural language.

**NOTE**: In the next section we are going to show concrete results of word disambiguation: the full notebook with working code can be found here.

Let’s focus on the particular problem of word *disambiguation*. This problem consists of understanding, from the context, whether two homographic words have the same meaning or not.

Let’s consider a couple of examples:

“One of the properties is a print history. This will detail the most recent printing activity concerning the file (although it only records the printing requests, it is not absolute proof that a print was actually produced). Note that it only allows a maximum of 250 characters and may therefore not provide a complete history; in the event of the history exceeding this volume, it can only guarantee details of the most recent activity.”

and

“Lacking absolute pitch, most of us can’t make that connection, labelling a note as “D”, for example. But do the connections and labels get hammered in during music lessons, or are some babies just born with a flair for identifying pitch? That’s a hard question to answer, since musical parents often pass a passion for music, as well as their genes, on to their children.”

In the examples above, the word “note” has two different meanings: it is either a *verb* or a *sound with a well-defined frequency*. Can machine learning and deep learning algorithms grasp this difference? Well, the standard tools based on word embeddings simply cannot: the reason is that word embeddings take into account only the *grapheme* (i.e. how the word is written) when mapping a word to a vector space! Hence, whether “note” is a verb or a musical pitch, the grapheme is always the same, and so is the representing vector [2].

## The issue with word embeddings

Word embeddings are an NLP technique that transforms a *token* (a word stripped of some endings) into a numerical array (this transformation is called *vectorisation*). This step is needed to feed natural language into machine learning models (which require numbers, or arrays of numbers, as inputs). One of the most famous vectorisation algorithms is **word2vec**. The structure of **word2vec** is that of a simple neural network, which takes a word as input and outputs an associated word embedding. The algorithm is trained on the task of guessing a missing word in a sentence from the neighbouring words of that sentence. For more details, you can have a look here.
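To make this behaviour concrete, here is a toy sketch (not real word2vec: just a deterministic lookup standing in for a trained embedding table) showing that a static embedding assigns the very same vector to every occurrence of a token, whatever its context:

```python
import zlib

import numpy as np


def static_embedding(token: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for a word2vec lookup table: every occurrence of a
    token maps to the SAME vector, derived only from its spelling."""
    seed = zlib.crc32(token.encode("utf-8"))  # deterministic per token
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)


# "note" as a verb vs "note" as a musical pitch: the lookup cannot tell.
vec_verb = static_embedding("note")
vec_pitch = static_embedding("note")
print(np.array_equal(vec_verb, vec_pitch))  # True: context never enters the lookup
```

The same limitation holds for real trained embeddings: the lookup is keyed by the grapheme alone, so context can only enter through the *neighbouring* vectors, which is exactly the idea pursued below.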

Despite achieving impressive results, this method has a major flaw: **the algorithm always associates the same numerical array to the same token**. Hence, word disambiguation cannot be solved simply by looking at the *vectorisation* of the word itself: what if we look at the neighbouring words instead?

Looking at the neighbouring vectorised tokens of a given word is the approach we take here: how exactly can we describe, in numerical terms, the shape of the neighbourhood of a word? Concepts from algebraic topology come in handy when we want to describe shapes algebraically: in particular, we are going to exploit the notion of *local homology*.

Local homology is the study of the shape of the neighbourhood of a point embedded in a topological space. Let’s consider **a point on the intersection of two lines**, and **a point in the middle of a plane.** These two points, from the viewpoint of local homology, are very different. Why? Well, because the neighbourhoods of two such points look very different!

We can easily generalise the computations to any dimension we like: e.g. a point lying on a sphere, a point lying on the tip of a double cone… But, how can we compute local homology?

The exact algorithmic procedure for computing local homology is a bit too technical to describe in detail here; the gist is to **compute the standard persistent homology of the coned-off relative set**. “Coning off” means exactly what you see in the picture below: after you define the relative set of interest (in the picture, the intersection of two lines and the disk), you glue together the boundary of your relative set.

From the picture above we can make one observation: from the local neighbourhood of a point we obtain some very standard topological objects: spheres, circles, bouquets of circles, etc. Once these standard shapes are obtained (via the coning-off procedure shown above), we can compute the usual homology of such topological spaces. Put differently, the usual homology of the coned-off local neighbourhood of a point is the local homology of that point. In conclusion, we have what we were looking for: a tool to describe the local neighbourhood of a point in a topological space algebraically.
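As a simplified numerical stand-in for this idea (not the actual coning-off algorithm), one can count the connected components of a thin annulus around a point: near a crossing of two lines the annulus meets four branches, near an interior point of a single line only two. A minimal NumPy sketch:

```python
import numpy as np


def annulus_components(points, center, r_in, r_out, eps):
    """Count connected components of the points lying in the annulus
    r_in <= dist(p, center) <= r_out, linking points closer than eps.
    This approximates the local structure around `center`."""
    d = np.linalg.norm(points - center, axis=1)
    ring = points[(d >= r_in) & (d <= r_out)]
    n = len(ring)
    parent = list(range(n))  # union-find forest over annulus points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(ring[i] - ring[j]) < eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


t = np.linspace(-1, 1, 201)
cross = np.concatenate([np.c_[t, t], np.c_[t, -t]])  # two lines crossing at 0
line = np.c_[t, np.zeros_like(t)]                    # a single line

print(annulus_components(cross, np.zeros(2), 0.3, 0.5, 0.1))  # 4 branches
print(annulus_components(line, np.zeros(2), 0.3, 0.5, 0.1))   # 2 branches
```

A point in the middle of a plane would give a single component (the full circle), again distinguishing it from the two cases above.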

One step remains: how do we bridge these local homology computations to natural language, and apply them to disambiguation?

The first observation is that *sentences* are transformed into *point clouds* by **word2vec**. The word we want to disambiguate, say the word “note”, is a point in this point cloud. Hence the question: can we build a meaningful topological space out of a point cloud (which is discrete!) so that we can straightforwardly apply the local homology tool?
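As a minimal sketch of this step, one can extract the local point cloud around the target word; taking its k nearest neighbours in embedding space is our illustrative choice here, not necessarily the exact construction used in the notebook:

```python
import numpy as np


def local_cloud(cloud, target_idx, k=10):
    """Return the k nearest neighbours (in embedding space) of the
    target word's vector within the sentence's point cloud."""
    target = cloud[target_idx]
    dists = np.linalg.norm(cloud - target, axis=1)
    order = np.argsort(dists)[1:k + 1]  # skip the target itself (dist 0)
    return cloud[order]


# Stand-in for a vectorised sentence: 40 token vectors of dimension 50.
rng = np.random.default_rng(0)
cloud = rng.standard_normal((40, 50))
neigh = local_cloud(cloud, target_idx=5, k=10)
print(neigh.shape)  # (10, 50): the local neighbourhood of the target word
```

It is the shape of this small cloud, not the single vector of the target word, that the topological machinery below will describe.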

## Persistent homology

The answer to the above question is a big **YES!** We can actually build many topological spaces out of a point cloud, and the next animation gives an intuitive idea of how we can do it:

For a more detailed introduction to topological data analysis (TDA) and its techniques for computing homology, have a look here.

So, depending on which scale we consider (corresponding to the radius of the balls in the animation above), we can build a **simplicial complex** (i.e. a higher-dimensional generalisation of a graph, containing vertices, edges, faces, …): *a simplicial complex is a topological space*, and hence homology computations can be performed on it.
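The scale-dependent construction can be sketched in a few lines; here we build only the 1-skeleton (the edges) of a Vietoris–Rips complex, whereas a library such as giotto-tda builds the full complex and tracks its homology across all scales at once:

```python
from itertools import combinations

import numpy as np


def rips_edges(points, radius):
    """1-skeleton of the Vietoris-Rips complex at a given scale: connect
    two points whenever their balls of that radius overlap."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) < 2 * radius]


pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(len(rips_edges(pts, 0.4)))  # 0: at a small scale no balls overlap yet
print(len(rips_edges(pts, 0.8)))  # 3: the three nearby points link up
```

Sweeping the radius and recording when features (components, loops, …) appear and disappear is exactly what persistent homology does.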

## Visualising word embeddings

Below we show the word embeddings of the example sentences above; to reduce them to two dimensions, we used the UMAP algorithm. Notice that there is indeed a difference between the local shapes around the word “note”, represented as a yellow point, in the two pictures: can we detect this difference with local homology?

## Results

By applying local homology to the embeddings above, we are able to distinguish the two meanings! Check the pictures below, taken directly from the associated notebook:

There are some parameters to tweak in the algorithm, and a full benchmark analysis on many homographic words has yet to be performed. But that task is left for future work!

The use of local homology to disambiguate words seems a very promising direction. There are published preprints and papers on the topic that all follow the same basic intuition: see [1] and [4].

While a generic algorithm is available in giotto-tda [3], a systematic study and benchmark is still lacking: hence, we encourage the interested reader to step up and push forward the boundary of knowledge, to better understand natural language and its geometric interpretation.

[1] Jakubowski A. et al., Topology of Word Embeddings: Singularities Reflect Polysemy, arXiv:2011.09413

[2] Mikolov T. et al., Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781

[3] Tauzin G. et al., giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration, arXiv:2004.02551

[4] Temčinas T., Local Homology of Word Embeddings, arXiv:1810.10136
