Word Embeddings

One-hot vs. Dense Word Representations in Deep Learning

Consider a supervised classification problem on text documents: we want to learn a function f: Docs → Labels on the basis of example pairs (doc,label).

A simple deep learning model for this task consists of representing documents as bags of words (BoW) and passing the BoW representation to an affine layer followed by a non-linear function. Consider the dimensions of the representations manipulated in such a model:

Given a set of L distinct labels
and documents containing words from a vocabulary of V distinct words

We encode words as one-hot vectors of dimension V.
A document is encoded as the sum of the word vectors for the words it contains - obtaining a vector of dim V.
Scores on the labels are represented as vectors of dimension L.

The model optimizes the parameters A and b of the affine transformation:

scores(Xdoc) = h(A Xdoc + b)

Where:
	A is a matrix of dimension LxV
	b is a vector of dimension L
	h is a non-linear function such as sigmoid, tanh or relu.

The loss function is:
	Loss(doc, label) = cross-entropy-loss( softmax[scores(Xdoc)], one-hot(label) )
		= - log[ softmax(scores(Xdoc))_label ]
		
	where Xdoc is the V dimension BoW encoding of doc
	and   label ∈ [1..L]
	and   one-hot(label) is the L dimension one-hot encoding of label
Such a model is implemented in the PyTorch Logistic Bag of Words example, which illustrates the task of classifying documents as either English or Spanish.
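
For concreteness, here is a minimal PyTorch sketch in the spirit of that example. The class name BoWClassifier and the helper make_bow_vector are illustrative, and the non-linearity h is taken to be the identity, as in the plain logistic-regression variant:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoWClassifier(nn.Module):
    """Logistic BoW classifier: an affine layer (A, b) followed by softmax."""

    def __init__(self, num_labels, vocab_size):
        super().__init__()
        # nn.Linear stores A (L x V) and b (L).
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vector):
        # Log-probabilities over the L labels; combined with NLLLoss this
        # gives the cross-entropy loss described above.
        return F.log_softmax(self.linear(bow_vector), dim=1)

def make_bow_vector(doc, word_to_ix):
    # Encode a document as the sum of its one-hot word vectors (dimension V).
    vec = torch.zeros(len(word_to_ix))
    for word in doc:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)

# Training loop (sketch): for each (doc, label) pair,
#   loss = nn.NLLLoss()(model(make_bow_vector(doc, word_to_ix)),
#                       torch.tensor([label_to_ix[label]]))
#   loss.backward(); optimizer.step()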

Note that this word-based encoding is not an effective way to address the specific task of language identification - it would be much more effective to use a character-level representation of documents to solve this task - we use this task as a motivating example only.

The learned classifier can now predict the language of a given document: we first encode the document as a V-dimension BoW vector, pass this vector to the trained function scores(Xdoc), and, given the output vector (score_english, score_spanish), predict the class with the higher score.
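
As a sketch, assuming model is a trained instance of the BoWClassifier above and that the words of the document appear in word_to_ix:

with torch.no_grad():
    bow_vec = make_bow_vector("me gusta la playa".split(), word_to_ix)
    log_probs = model(bow_vec)                            # shape (1, 2): one score per label
    prediction = torch.argmax(log_probs, dim=1).item()    # index of the higher-scoring label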

Let us now consider the trained parameter A. It is a matrix of dimension LxV (in our case L=2 for the two labels English and Spanish). We can think of this matrix as a collection of V column vectors of dimension L. A side-effect of training the classifier was to construct this matrix A. Given a word w, the prediction of whether this word is English or Spanish is computed from the vector of dimension L found in column w of the matrix A. This vector e_w = (w_eng, w_spa), once normalized through the softmax function, represents the prediction that w is English or Spanish. Let us call these normalized vectors e_w the task-specific embeddings of the words.

Note these embedding vectors have the following properties:
  1. They have much lower dimension than the original one-hot encodings (L vs. V)
  2. They are dense encodings - that is, floating-point values that are in general non-zero along all L dimensions, as opposed to the discrete, sparse values of the original one-hot encoding.
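
In PyTorch, these task-specific embeddings can be read directly out of the trained affine layer. A sketch, assuming model is the trained BoWClassifier above; the example word is hypothetical:

A = model.linear.weight                  # the trained matrix A, of dimension L x V
w_ix = word_to_ix["playa"]               # index of a word assumed to be in the vocabulary
e_w = F.softmax(A[:, w_ix], dim=0)       # column w of A, softmax-normalized: (w_eng, w_spa)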

Task-Specific Word Similarity using Embeddings

Consider the task of determining whether two words w1 and w2 are similar with respect to the task of language identification. We can use the learned representations of these words - e_w1 and e_w2 - to determine this similarity.

To assess the similarity between two words encoded as vectors, we use the cosine similarity:

cosine(w1, w2) = (w1 · w2) / (|w1| × |w2|)

where the numerator is the scalar product of the two vectors:
w1 · w2 = Σ_(i=1..d) w1_i × w2_i
This measure depends only on the angle between the two vectors: it is close to 0 when the vectors are orthogonal, and close to 1 when the vectors are aligned. Note that cosine(w,w) = 1 for all w, and that cosine(x,y) lies between 0 and 1 for vectors with non-negative components (which is the case for our softmax-normalized embeddings).

In our example, where the two dimensions of a vector indicate the belief that a word is English or Spanish, the cosine similarity will be close to 1 when the two vectors are large along the same dimension and small along the other, and close to 0 when one vector is large along the dimension where the other is small:

cosine(e1, e2) = (e1_eng × e2_eng + e1_spa × e2_spa) / (|e1| × |e2|)
This similarity will be close to 1 when the two words are classified in the same language, and close to 0 when they are classified as different languages. This captures our intuition that the embeddings are a good tool to measure task-specific similarity between words.
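
A sketch of this computation on the task-specific embeddings e_w1 and e_w2, extracted as shown above for two hypothetical words:

def cosine(u, v):
    # Scalar product divided by the product of the norms.
    return torch.dot(u, v) / (u.norm() * v.norm())

sim = cosine(e_w1, e_w2)    # close to 1 if w1 and w2 are predicted to be in the same language
# Equivalently: F.cosine_similarity(e_w1.unsqueeze(0), e_w2.unsqueeze(0)).item()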

Distributional Similarity: Predicting a Word based on its Context

Consider now a different objective: we want to learn a representation for words which captures their semantics. Remember that the one-hot encoding of words is particularly ill-adapted to this objective, because in a one-hot representation all the words are orthogonal to each other (the cosine similarity of any two distinct words is 0).

One method to approach the semantic similarity of words is to adopt the so-called distributional hypothesis (see distributional semantics in Wikipedia for references, in particular to Harris's theory and to Firth).

In this approach, the semantic similarity of two words is assessed by comparing the contexts in which the words tend to occur. To formalize this intuition, we model a task of predicting a word given its context. The technical approach is similar to the one we reviewed above for predicting the language of a document, but with a different task.

The following PyTorch program implements this idea, an approach known as the Continuous Bag of Words (CBoW) model. We take as the context of a word the window of N/2 words on the left and N/2 words on the right of each occurrence of the word. Given the representation of these N words, the task is to predict the word itself.

Other notions of context can be adopted in general - for example, we could take as a context all the words that occur in a document, or the syntactic context of a word by exploiting the syntactic parse tree of sentences and more.

In the CBoW model, the approach consists of:

Given a corpus of documents containing words from a vocabulary of V distinct words

We encode words as one-hot vectors of dimension V.

For each token in the corpus, we generate the n-gram context of the word as the N words around the token (N/2 on the left, N/2 on the right).

A context is encoded as the sum of the word vectors for the words it contains - obtaining a vector of dim V.
NOTE: a different encoding of vectors is possible and could be considered in some cases, for example giving more weight to closer words.

The model must predict the current token given the context.

Scores on the predicted words are represented as vectors of dimension V.

The model optimizes the parameters A and b of the affine transformation:

scores(Xcontext) = h(A Xcontext + b)

Where:
	A is a matrix of dimension VxV
	b is a vector of dimension V
	h is a non-linear function such as sigmoid, tanh or relu.

The loss function is:
	Loss(context, word) = cross-entropy-loss( softmax[scores(Xcontext)], one-hot(word) )
		= - log[ softmax(scores(Xcontext))_word ]
		
	where Xcontext is the V dimension BoW encoding of context
	and   word ∈ [1..V]
	and   one-hot(word) is the V dimension one-hot encoding of word
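
A minimal sketch of this data preparation step (the helper names are illustrative, N is assumed even, and make_context_bow mirrors make_bow_vector above):

N = 4   # context size: N/2 = 2 words on each side of the target token

def make_context_target_pairs(tokens, n=N):
    # For each position i, the context is the n words around tokens[i].
    half = n // 2
    pairs = []
    for i in range(half, len(tokens) - half):
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        pairs.append((context, tokens[i]))
    return pairs

def make_context_bow(context, word_to_ix):
    # Encode the context as the sum of its one-hot word vectors (dimension V).
    vec = torch.zeros(len(word_to_ix))
    for word in context:
        vec[word_to_ix[word]] += 1
    return vec
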
But the odds that such a model will "generalize" are very low: the task of predicting exactly which word, represented as a one-hot encoding over V, occurs given its context is extremely hard. Instead, we adopt the strategy of reducing dimensionality to force generalization. We map the sparse, discrete one-hot representations of dimension V to a dense space of dimension d, and we compare contexts and words in this embedding space. This leads to a slight modification of the model:
Given a corpus of documents containing words from a vocabulary of V distinct words

We encode words as one-hot vectors of dimension V.

For each token in the corpus, we generate the n-gram context of the word as the N words around the token (N/2 on the left, N/2 on the right).

A context is encoded as the sum of the word vectors for the words it contains - obtaining a vector of dim V.

We map the context encoding into a space of dimension d (embedding) by using an affine transformation
and the word itself to the same embedding space of dimension d using the word embedding transformation.
NOTE: the two transformations (for contexts and for words) need not be the same - but sharing them is a convenient choice.

The model must predict the encoding of the current token w given the context representation.

The model optimizes the parameters A and b of the affine transformation:

embed(context) = A Xcontext + b
embed(w) = A one-hot(w) + b
scores(context, w) = cosine(embed(context), embed(w))

Where:
	A is a matrix of dimension dxV
	b is a vector of dimension d
	Xcontext is the V dimension BoW encoding of the context
	embed(context) is the d dimension CBoW encoding of the context
	word ∈ [1..V]
	embed(word) is the d dimension encoding of word
	
The loss function is:
	Loss(context, word) = cross-entropy-loss( softmax[scores(context, w') for all words w' in the vocabulary], one-hot(word) )
		= - log[ softmax(scores(context, ·))_word ]
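
A minimal PyTorch sketch of this modified model, staying close to the formulation above (a single matrix A and bias b shared between contexts and words, and cosine scores against every word in the vocabulary). The class name is illustrative; in practice, CBoW implementations such as word2vec usually use an nn.Embedding table and a separate output layer instead:

class CBoW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # A is d x V and b is d; shared between context and word embeddings.
        self.A = nn.Parameter(0.01 * torch.randn(embed_dim, vocab_size))
        self.b = nn.Parameter(torch.zeros(embed_dim))

    def embed_context(self, bow_context):        # bow_context: (V,)
        return self.A @ bow_context + self.b      # (d,)

    def embed_words(self):
        # Embedding of every word w: column w of A, plus b.  Shape (V, d).
        return self.A.t() + self.b

    def forward(self, bow_context):
        c = self.embed_context(bow_context)                             # (d,)
        W = self.embed_words()                                          # (V, d)
        # Cosine similarity between the context embedding and every word embedding.
        scores = F.cosine_similarity(W, c.unsqueeze(0).expand_as(W), dim=1)   # (V,)
        return scores.unsqueeze(0)                                      # (1, V)

# Training (sketch): for each (context, word) pair,
#   loss = F.cross_entropy(model(make_context_bow(context, word_to_ix)),
#                          torch.tensor([word_to_ix[word]]))
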
The TensorFlow tutorial on Word2Vec provides a high-level review of the same ideas in a graphical and readable manner.

When we train this model, we optimize the embed() function, which learns to represent words as dense vectors in a d-dimension space. This representation is such that two words that occur in similar contexts will have similar embeddings according to the cosine metric.
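
For example, a simple nearest-neighbour query under the cosine metric (a sketch, assuming model is a trained instance of the CBoW sketch above and ix_to_word is the inverse of word_to_ix):

def most_similar(word, k=5):
    with torch.no_grad():
        E = model.embed_words()                                         # (V, d): embedding of every word
        q = E[word_to_ix[word]]
        sims = F.cosine_similarity(E, q.unsqueeze(0).expand_as(E), dim=1)
        best = torch.topk(sims, k + 1).indices.tolist()                 # the word itself ranks first
    return [ix_to_word[ix] for ix in best if ix != word_to_ix[word]][:k]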

Such embeddings, when pre-trained on large corpora, capture interesting semantic similarity relations and can be used as a general representation of words, instead of one-hot encodings, for many other tasks.


Last modified 22 Dec, 2017