Word Embeddings

In the previous lecture, we started a review of a high-level task: predicting the next word given the previous context. The solution to this task is a probabilistic model called a language model (LM). We covered two approaches to building LMs: count-based methods and a neural network method. Language models play an important practical role as a module in many NLP tasks. They are also important for two additional reasons: (1) they illustrate the importance of gradable, probabilistic judgments on language; (2) as a side-effect of the neural LM construction, we naturally obtain an artifact of critical importance: word embeddings.

One-hot vs. Dense Word Representations in Deep Learning

Consider a supervised classification problem on text documents: we want to learn a function f: Docs → Labels on the basis of example pairs (doc,label).

A simple deep learning model for this task consists of modeling documents as bags of words (BoW) and then passing the BoW representation to an affine layer followed by a non-linear activation function. Consider the dimensions of the representations manipulated in such a model:

Given a set of L distinct labels
and documents containing words from a vocabulary of V distinct words

Encode words as one-hot vectors of dimension V.
A document is encoded as the sum of the word vectors for the words it contains - obtaining a vector of dimension V.
Scores on the labels are represented as vectors of dimension L.

The model optimizes the parameters of the linear transformation A and b:

scores(X_doc) = h(A X_doc + b)

Where:
	A is a matrix of dimension L x V
	b is a vector of dimension L
	h is a non-linear activation function such as sigmoid, tanh or ReLU.

The loss function is:
	Loss(doc, label) = cross-entropy-loss( softmax[scores(X_doc)], one-hot(label) )
		= - log[softmax(scores(X_doc))_label]

	where X_doc is the V dimension BoW encoding of doc
	and   label ∈ [1..L]
	and   one-hot(label) is the L dimension one-hot encoding of label
Such a model is implemented in the PyTorch implementation of the Logistic Bag of Words model, which illustrates the task of classifying documents as either English or Spanish.
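
As an illustration, here is a minimal sketch of such a logistic bag-of-words classifier in PyTorch. The tiny vocabulary, the two training examples, and the hyperparameters are invented for illustration only; as in the tutorial model, the sketch applies log-softmax directly to the affine scores.

# Minimal sketch of a logistic bag-of-words classifier (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

word_to_ix = {"hello": 0, "world": 1, "hola": 2, "mundo": 3}   # toy vocabulary, V = 4
label_to_ix = {"ENGLISH": 0, "SPANISH": 1}                     # L = 2
V, L = len(word_to_ix), len(label_to_ix)

class BoWClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size):
        super().__init__()
        self.linear = nn.Linear(vocab_size, num_labels)        # A (L x V) and b (L)

    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)      # log-probabilities over the labels

def make_bow_vector(words):
    vec = torch.zeros(V)
    for w in words:
        vec[word_to_ix[w]] += 1                                # sum of the one-hot word vectors
    return vec.view(1, -1)

model = BoWClassifier(L, V)
loss_fn = nn.NLLLoss()                                         # with log_softmax, this is the cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = [(["hello", "world"], "ENGLISH"), (["hola", "mundo"], "SPANISH")]
for epoch in range(50):
    for words, label in data:
        model.zero_grad()
        log_probs = model(make_bow_vector(words))
        target = torch.tensor([label_to_ix[label]])
        loss_fn(log_probs, target).backward()
        optimizer.step()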

This word-based encoding is not an effective way to address the specific task of language identification - it would be much more effective to use a character-level representation of documents to solve this task (as is shown, for example, in the Character-based RNN Classification Tutorial in PyTorch). We use this task as a motivating example only.

The learned classifier is now capable of predicting the language of a given document: first encode the document as a V dimension BoW vector, pass this vector to the trained function scores(X_doc), and, given the output vector (score_english, score_spanish), predict the class with the higher score.

Let us now consider the trained parameter A. It is a matrix of dimension L x V (in our case L=2 for the 2 labels English and Spanish). We can think of this matrix as a collection of V column vectors of dimension L. A side-effect of training the classifier was to construct this matrix A. Given a word w, the prediction of whether this word is English or Spanish is computed from the column of A corresponding to w, a vector of dimension L. Once normalized through the softmax function, this vector e_w = (w_eng, w_spa) represents the prediction that w is English or Spanish. Let us call these normalized vectors e_w the task-specific embeddings of the words (a short sketch of reading them off A follows the list below).

These embedding vectors have the following properties:
  1. They have much lower dimension than the original one-hot encodings (L vs. V)
  2. They are dense encodings - that is, floating-point values that are generally non-zero along all L dimensions, as opposed to the discrete, sparse values of the original one-hot encoding.
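
As a sketch of the above, the task-specific embeddings can be read off the columns of the trained matrix A (this reuses the model and word_to_ix names assumed in the previous sketch):

# Reading task-specific word embeddings off the trained weight matrix A
# (reuses `model` and `word_to_ix` from the sketch above).
import torch.nn.functional as F

A = model.linear.weight.detach()          # shape (L, V): one column of size L per word

def task_embedding(word):
    column = A[:, word_to_ix[word]]       # the L-dimensional column for this word
    return F.softmax(column, dim=0)       # normalized e_w = (w_eng, w_spa)

print(task_embedding("hola"))             # should lean towards the SPANISH dimension after training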

Task-Specific Word Similarity using Embeddings

Consider the task of determining whether two words w1 and w2 are similar with respect to the task of language identification. We can use the learned representation of these words - ew1 and ew2 - to determine this similarity.

To assess the similarity between the two words encoded as vectors, we use the cosine similarity:

cosine(w1, w2) = w1 · w2 / (|w1| × |w2|)

where the numerator is the scalar product of the two vectors:
w1 · w2 = Σ_{i=1..d} w1_i w2_i
The cosine measure reflects the angle between the two vectors - it is close to 0 when the vectors are nearly orthogonal, and close to 1 when they are aligned. It satisfies cosine(w,w) = 1 for all w, and for vectors with non-negative components (such as our softmax-normalized embeddings) cosine(x,y) lies between 0 and 1. cosine(x,y) is 0 when the vectors are orthogonal (this is the case, for example, for any two distinct one-hot vectors).

In our example, where the two dimensions of the vector indicate the belief that a word is English or Spanish, the cosine will be close to 1 when the two vectors are both large along the same dimension and small along the other, and close to 0 when one vector is large along the English dimension while the other is large along the Spanish dimension:

cosine(e1, e2) = (e1_eng × e2_eng + e1_spa × e2_spa) / (|e1| × |e2|)
This similarity will be close to 1 when the two words are classified in the same language, and close to 0 when they are classified as different languages. This captures our intuition that the embeddings are a good tool to measure the task-specific similarity between words.
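
A minimal sketch of this similarity computation, reusing the task_embedding function assumed in the previous sketch:

# Cosine similarity between two embedding vectors
# (reuses `task_embedding` from the sketch above).
import torch

def cosine(u, v):
    return torch.dot(u, v) / (u.norm() * v.norm())

e1 = task_embedding("hello")
e2 = task_embedding("hola")
print(cosine(e1, e2))          # low when the two words are classified as different languages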

Language Modeling and Word Embeddings

Consider again the simple neural language model we introduced in Lecture 3. The task we address is to predict a distribution over a vocabulary of V words given the previously observed k words in a sequence of words.
The input of this model is an n-gram w_{1:k} of k words and the output is a probability distribution over the vocabulary V.

The words in the n-gram are first mapped to one-hot vectors of size V.
These vectors are then mapped to encoding vectors of size d_w:
v(w_i) = E w_i   where E is a matrix of size d_w x V

The input vector is the concatenation of these embedding vectors:
x = [v(w_1); ...; v(w_k)].

The vector x is then pushed to an MLP with one hidden layer:
y = p(w_{k+1} | w_{1:k}) = LM(w_{1:k}) = softmax(W2 h + b2)
h = g(W1 x + b1)

To train this model, we iterate over a corpus of text and feed in every observed pair (k-gram w_{1:k}, next word w_{k+1}) as a training example.

The loss function is cross-entropy - which compares the predicted distribution over V words with the observed word w_{k+1}.
Consider that as part of the process of learning a language model, we produce a matrix E of size d_w x V. This matrix has V columns of size d_w. Multiplying this matrix by a one-hot vector w_i is equivalent to looking up the i-th column from the matrix. In other words, the matrix E is the concatenation of V word embedding vectors of size d_w. Typical values for d_w range between 50 and 2000 - 300 is a commonly used value.
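
A minimal sketch of this k-gram language model in PyTorch (the class name, vocabulary size, and hidden size are illustrative). Note that nn.Embedding stores the transpose of E, of shape V x d_w, so a row lookup corresponds to the column lookup E w_i described above.

# Minimal sketch of the k-gram neural LM described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramLM(nn.Module):
    def __init__(self, vocab_size, d_w=300, k=3, hidden=128):
        super().__init__()
        self.E = nn.Embedding(vocab_size, d_w)       # stores E^T, shape (V, d_w)
        self.W1 = nn.Linear(k * d_w, hidden)         # W1 and b1
        self.W2 = nn.Linear(hidden, vocab_size)      # W2 and b2

    def forward(self, context_ids):                  # context_ids: LongTensor of k word indices
        x = self.E(context_ids).view(1, -1)          # concatenation x = [v(w_1); ...; v(w_k)]
        h = torch.tanh(self.W1(x))                   # h = g(W1 x + b1)
        return F.log_softmax(self.W2(h), dim=1)      # log p(w_{k+1} | w_{1:k})

model = NGramLM(vocab_size=10000)
loss_fn = nn.NLLLoss()                               # cross-entropy against the observed next word w_{k+1}
log_probs = model(torch.tensor([12, 7, 42]))         # arbitrary indices of a 3-gram context
loss = loss_fn(log_probs, torch.tensor([99]))        # 99 stands for the index of the observed next word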

Learning this language model produced as a side effect a word embedding matrix for the words in the vocabulary. This word embedding matrix captures the left-context of observed words - so we expect that words that occur in similar left-contexts will have similar word embeddings. Similarity across word embeddings can be computed using the cosine similarity.

The difference between the task-specific word embeddings we started with above and the language-model-based word embeddings is that the latter have no direct interpretation in terms of simple task properties. Whereas we interpreted the 2 dimensions of the task-specific embeddings above as "word is English / Spanish", we have no clear interpretation of each of the d_w dimensions of the language model matrix E.

Instead, the common intuition prevalent among proponents of neural network models is that word embeddings are a specific form of "distributed representation" - in the sense that information is captured in these vectors as patterns of activation across their dimensions.

Distributional Similarity: Predicting a Word based on its Context

Consider now a different objective instead of the specific language model task: we want to learn a representation for words which captures their semantics. Remember that the one-hot encoding representation of words is particularly ill-adapted to this objective, because in a one-hot representation, all the words are orthogonal to each other (the cosine similarity of any two distinct words is 0).

One method to approach the semantic similarity of words is to adopt the so-called distributional hypothesis (see distributional semantics in Wikipedia for references, in particular to Harris's theory and to Firth's).

In this approach, the semantic similarity of two words is assessed by comparing the contexts in which the words tend to occur. To formalize this intuition, we model a task of predicting a word given its context. The technical approach is similar to the one we reviewed above for predicting the language of a document, but with a different task. It is in fact a slight variation of the Language Modeling task.

The following PyTorch program implements this approach, in which the model is called the Continuous Bag of Words (CBoW) model. We take as the context of a word the window of N/2 words on the left and N/2 words on the right of each occurrence of the word. Given the representation of these N context words, the task is to predict the word itself.

Other notions of context can be adopted in general - for example, we could take as a context all the words that occur in a document, or the syntactic context of a word obtained by exploiting the syntactic parse tree of sentences - or, as in the LM case, the k preceding tokens.
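
As a sketch, assuming a plain tokenized text and an illustrative window size, symmetric-window (context, word) training pairs can be extracted as follows:

# Extracting (context, word) pairs with a symmetric window (window size and text are illustrative).
N = 4                                                # N/2 = 2 context words on each side
tokens = "we are about to study the idea of a computational process".split()

pairs = []
for i in range(N // 2, len(tokens) - N // 2):
    context = tokens[i - N // 2 : i] + tokens[i + 1 : i + N // 2 + 1]
    pairs.append((context, tokens[i]))

print(pairs[0])                                      # (['we', 'are', 'to', 'study'], 'about')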

The CBoW model consists of the following elements:

The model optimizes the parameters of the linear transformation A and b:
scores(X_context) = h(A X_context + b)

Where:
	A is a matrix of dimension V x V
	b is a vector of dimension V
	h is a non-linear activation function such as sigmoid, tanh or ReLU.

The loss function is:
	Loss(context, word) = cross-entropy-loss( softmax[scores(X_context)], one-hot(word) )
		= - log[softmax(scores(X_context))_word]

	where X_context is the V dimension BoW encoding of context
	and   word ∈ [1..V]
	and   one-hot(word) is the V dimension one-hot encoding of word
The main computational cost of this method consists of computing the softmax function over the large vocabulary V. The algorithms mentioned below have focused on avoiding this computational bottleneck.

Unfortunately, such a model is unlikely to "generalize": the task of predicting exactly which word (represented as a one-hot encoding) occurs in a given context is extremely hard. Instead, we adopt the strategy of reducing dimensionality to force generalization: we map the sparse, discrete one-hot representations of dimension V to a dense space of dimension d, and we compare contexts and words in this embedding space. This leads to a slight modification of the model. The model optimizes the parameters of the linear transformation A and b:
embed(context) = A X_context + b
embed(w) = A one-hot(w) + b
scores(context, w) = cosine(embed(context), embed(w))

Where:
	A is a matrix of dimension d x V
	b is a vector of dimension d
	X_context is the V dimension BoW encoding of the context
	embed(context) is the d dimension CBoW encoding of the context
	word ∈ [1..V]
	embed(word) is the d dimension encoding of word
	
The loss function is:
	Loss(context, word) = cross-entropy-loss( softmax[scores(context, ·)], one-hot(word) )
		= - log[softmax(scores(context, ·))_word]

	where scores(context, ·) is the V dimension vector of cosine scores between the context and every candidate word w
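
A minimal sketch of this embedding-based scorer in PyTorch (the dimensions are illustrative, and the scoring over the full vocabulary is kept deliberately naive to match the formulation above):

# Sketch of the embedding-based CBoW scorer described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBoWScorer(nn.Module):
    def __init__(self, vocab_size, d=300):
        super().__init__()
        self.V = vocab_size
        self.A = nn.Linear(vocab_size, d)                    # A (d x V) and b (d), shared by words and contexts

    def forward(self, context_bow):                          # context_bow: (1, V) BoW vector of the context
        ctx = self.A(context_bow)                            # embed(context), shape (1, d)
        words = self.A(torch.eye(self.V))                    # embed(w) for every w in V, shape (V, d)
        sims = F.cosine_similarity(ctx.expand_as(words), words, dim=1)   # cosine(embed(context), embed(w))
        return F.log_softmax(sims, dim=0).unsqueeze(0)       # (1, V) log-distribution over the vocabulary

# Scoring every word via torch.eye(V) is exactly the softmax bottleneck discussed above,
# so this naive sketch is only workable for small vocabularies.
model = CBoWScorer(vocab_size=1000)
loss_fn = nn.NLLLoss()    # loss_fn(model(X_context), torch.tensor([word])) = -log softmax(scores)_word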

Pre-trained Word Representations vs. Task-specific Word Representations

The TensorFlow tutorial on Word2Vec provides a high-level review of the topic of word embeddings in a readable manner.

When we train this model, we optimize the embed() function which learns to represent words in a d-dimension space as dense vectors. This representation is such that two words that occur in similar contexts will have similar embeddings according to the cosine metric.

Such embeddings, when pre-trained on large corpora, capture interesting semantic similarity relations - and can be used as a general representation of words, instead of one-hot encodings, for many other tasks. For example, when we cluster word embedding vectors according to cosine similarity using an algorithm such as K-Means, we observe that many clusters correspond to groups of semantically or syntactically related words.
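
As a sketch (the embedding matrix below is a random placeholder, and the use of scikit-learn is an assumption not made in the text), clustering L2-normalized embeddings with K-Means approximates clustering by cosine similarity:

# Clustering L2-normalized embeddings with K-Means; unit-length rows make Euclidean
# distance behave like cosine similarity. The embedding matrix here is a placeholder.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

E = torch.randn(10000, 300)              # placeholder for a pre-trained (V x d) embedding matrix
E_normed = F.normalize(E, dim=1)         # each row scaled to unit length
cluster_ids = KMeans(n_clusters=100, n_init=10).fit_predict(E_normed.numpy())
# words whose rows share a cluster id tend to be semantically or syntactically related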

Another way to investigate the type of knowledge captured by pre-trained word embeddings is to perform vector arithmetic. The most famous example of such arithmetic is:

v(king) - v(man) + v(woman) ≈ v(queen).
That is, adding the vectors associated with the words king and woman while subtracting the vector for man yields a vector approximately equal to the vector associated with queen. The vector v(man) - v(woman) captures the gender relationship. In another example:
v(paris) - v(france) + v(italy) ≈ v(rome)
The vector v(paris) - v(france) captures the relation of "capital of a country".
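
A sketch of this analogy test via vector arithmetic; here `vectors` is assumed to be a dictionary mapping words to vectors loaded from some pre-trained embedding file:

# Analogy by vector arithmetic over a dictionary of pre-trained word vectors.
import torch
import torch.nn.functional as F

def analogy(a, b, c, vectors, topn=1):
    """Return the words whose vectors are closest (by cosine) to v(a) - v(b) + v(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    sims = {w: F.cosine_similarity(target, v, dim=0).item()
            for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# analogy("king", "man", "woman", vectors)   # expected to rank "queen" near the top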

Depending on the form of context used to construct the word embeddings (word co-occurrence, syntactic dependencies, larger windows of words), different types of semantic relations can be captured by the word embedding matrix. See "Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning" (Vylomova et al., 2016) for a discussion of the lexical relations captured by word embeddings.

Empirically, for many applications, initializing word representations with pre-trained word vectors prepared on large corpora and capturing distributional semantics is more effective than starting from one-hot word encodings. The pre-trained word vectors are optimized to capture semantic similarity.
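
As a sketch, in PyTorch a model's embedding layer can be initialized from such pre-trained vectors (the random tensor below is a placeholder for a real Word2Vec or GloVe matrix aligned with the model's vocabulary):

# Initializing an embedding layer from pre-trained vectors instead of training it from scratch.
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)                                  # placeholder for a real (V x d) matrix
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)    # freeze=False allows task-specific fine-tuning
vectors = embedding(torch.tensor([3, 17, 42]))                        # look up vectors for a few word indices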

The most popular pre-trained word representations have been Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Both algorithms improve over the LM-based way of training word embeddings by reducing the computational cost and by considering left-and-right context in contrast to left-only context: Word2Vec avoids the full softmax by using variants of the task based on Noise Contrastive Estimation (negative sampling), while GloVe factorizes global word co-occurrence counts.

In recent work, two significant models have been introduced which produce excellent pre-trained representations for many NLP tasks:

  1. ELMo
  2. BERT
These models improve upon Word2Vec and GloVe by:
  1. Considering sub-word units: characters for ELMo, word pieces for BERT. This is important in making the models more robust in the presence of previously unseen words.
  2. Using deep architectures: the models that learn the pre-trained encodings have many layers (12 to 24 Transformer layers for BERT), which capture more abstract relations.
  3. Using larger training datasets (3.3 billion tokens for BERT) and larger models (BERT-large has about 340M parameters).
  4. Using variants of the task of "guessing a word in context": BERT uses two tasks - masked word prediction and next-sentence prediction (deciding whether two sentences follow each other).

Last modified 02 Nov 2018