Not so long ago in natural language processing, words were often represented as points in a high-dimensional space. This format lent itself well to numerical analysis, but it was less than ideal for practical applications.
What is a King?
A famous example of words-as-points tries to define the concept of a king. Because the representations were numerical, we could do math directly with words:
King − Man + Woman = Queen
This formula gained a certain amount of traction in academic settings, but in practice it was far less useful than what we now see from transformers and deep learning. The main problem is that words are never single points: the meaning of a word depends on context and intent.
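To make this concrete, here is a minimal sketch of that arithmetic in Python. The three-dimensional vectors are invented purely for illustration; a real model like word2vec learns vectors with hundreds of dimensions from data:

```python
import numpy as np

# Toy embeddings: every number here is made up for illustration.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(best)  # queen
```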
Two types of ambiguity
Ambiguity in natural language can take two forms:
two words can mean the same thing (synonymy) 🟡
one word can have multiple meanings (polysemy) 🟠
Neither of these cases can be represented by a single point. There have been many iterations from then (points) until now (self-attention) but we will skip those intermediate steps in this article.
What shall we do with all the words in between?
One of the main contributions of the transformer's self-attention mechanism is how naturally it becomes context-sensitive in all the right ways. When considering a full sentence, it isn't always correct to immediately zoom in on named entities. Often it is just as important to consider the grammatical placement of a word to derive the correct meaning.
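To give a flavor of the mechanism, here is a minimal sketch of single-head self-attention in plain numpy. To keep it short it assumes identity projections for queries, keys, and values; a trained transformer learns a separate weight matrix for each:

```python
import numpy as np

def self_attention(X):
    # X: (n_tokens, d) array of token vectors.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # every token scores every other token
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                              # each output mixes the whole sentence

tokens = np.random.default_rng(0).normal(size=(7, 16))  # 7 tokens, 16 dims each
print(self_attention(tokens).shape)  # (7, 16): one context-mixed vector per token
```

Every output row is a weighted blend of the entire sentence, which is exactly where the context-sensitivity comes from.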
Consider a sentence such as
We all live on a yellow submarine.
There are many ways to read this sentence mechanically, and it is all about order. Just a few strategies include:
Start from the beginning
Start from the end
Start from the rarest word
Start from the most common word
It is never clear which strategy is ultimately superior because natural language has so many strange edge cases. We used to think that grammar could perhaps be organized into tree-like structures and that humans innately possessed the ability to hear and speak language with some common patterns. However, we now know that human language use is far too free-form for rigid rules like these.
The Hydra
One of the techniques used by transformers is something called multi-head attention. This complicated-sounding term just means that we pay attention to the entire sentence in several ways at once. This allows the model to use all of the above strategies to parse grammar, which is highly practical.
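Here is a rough numpy sketch of the idea. The projection matrices are random for illustration (a real transformer learns them during training), but the structure is the same: each head attends over the whole sentence independently, and the results are concatenated and mixed back together:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads=4):
    # X: (n_tokens, d); d must be divisible by n_heads.
    n, d = X.shape
    d_head = d // n_heads
    heads = []
    for _ in range(n_heads):
        # Each head gets its own Q/K/V projections (random here, learned in
        # practice), so each head can attend with a different "strategy".
        Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    # Concatenate all heads and project back to the model dimension.
    Wo = rng.normal(size=(d, d))
    return np.concatenate(heads, axis=-1) @ Wo

tokens = rng.normal(size=(7, 16))          # 7 tokens, 16 dims each
print(multi_head_attention(tokens).shape)  # (7, 16)
```

Because every head sees every token, no single reading order is privileged; the heads can effectively pursue different strategies in parallel.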
Given the above sentence
We all live on a yellow submarine.
We might incrementally explore tokens with a very cherry-picked strategy:
— on —
— live on —
we — live on —
we — live on a —
we — live on a — submarine
we — live on a yellow submarine
we all live on a yellow submarine
This strategy closely aligns with some of the psychological quirks that we know about in human reading. For example, did you know that you can scramble most of the interior letters in a word, as long as the first and last letters stay in place? Not only can people read scrambled words, but most people will barely notice the change.
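The scrambling trick is easy to try for yourself; here is a few-line Python version (the sentence and random seed are just for demonstration):

```python
import random

def scramble(word):
    # Shuffle the interior letters, keeping the first and last in place.
    if len(word) <= 3:
        return word
    interior = list(word[1:-1])
    random.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]

random.seed(7)
sentence = "We all live on a yellow submarine"
print(" ".join(scramble(w) for w in sentence.split()))
# e.g. "We all lvie on a yoellw srmuabine"
```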
A context to disambiguate
Not only does this attention mechanism solve the problem of parsing complex grammars, but it also happens to help us determine the meaning of ambiguous words. Consider the following sentences
I wore a watch to the baseball game.
I watch baseball often on the television.
These sentences are quite similar; however, the word watch is used differently in each. In the first sentence it is a noun, while in the second it is a verb. Context helps us determine the correct meaning of the word in these simple examples.
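We can watch this disambiguation happen in a pretrained contextual model. The sketch below (assuming the Hugging Face transformers package and torch are installed) compares the vector that bert-base-uncased assigns to watch in each sentence; the two embeddings should come out noticeably different:

```python
# Requires: pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Return the contextual vector for `word` within `sentence`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

noun = embedding_of("I wore a watch to the baseball game.", "watch")
verb = embedding_of("I watch baseball often on the television.", "watch")
print(torch.cosine_similarity(noun, verb, dim=0).item())  # noticeably below 1.0
```

The same surface word gets two different vectors because self-attention has mixed in the surrounding context.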
Warning: the transformer model is not without flaws. In moving from points to neurons, we have not solved every problem for free. We still don't fully understand the weaknesses of transformers in NLP. Scaling networks to trillions of parameters might hide some of the problems, but it does not eliminate fundamental flaws.