p(x1 … xn, y1 … yn+1) = Π_{i=1..n+1} q(yi | yi−2, yi−1) × Π_{i=1..n} e(xi | yi), assuming y0 = y−1 = * and yn+1 = STOP.

Write down the independence assumptions which correspond to this equation.
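The equation above can be sketched directly in code. This is a minimal illustration, not part of the question: the parameter dictionaries q and e below are toy values invented for the example.

```python
# A minimal sketch of the trigram-HMM joint probability above.
# The dictionaries q and e are toy parameters invented for illustration.
def joint_prob(x, y, q, e):
    """x: words x1..xn, y: tags y1..yn; pads with y0 = y-1 = '*' and yn+1 = 'STOP'."""
    tags = ["*", "*"] + y + ["STOP"]
    p = 1.0
    # transition terms: q(yi | yi-2, yi-1) for i = 1 .. n+1
    for i in range(2, len(tags)):
        p *= q[(tags[i], tags[i - 2], tags[i - 1])]
    # emission terms: e(xi | yi) for i = 1 .. n
    for word, tag in zip(x, y):
        p *= e[(word, tag)]
    return p

# toy parameters (made up): a two-word sentence tagged D N
q = {("D", "*", "*"): 1.0, ("N", "*", "D"): 1.0, ("STOP", "D", "N"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4}
print(joint_prob(["the", "dog"], ["D", "N"], q, e))  # 0.5 * 0.4 = 0.2
```

Note how the code mirrors the two products: one transition factor per tag position including STOP, one emission factor per word.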
Input: a sentence x1 … xn, parameters q(s | u, v) and e(x | s).
Definitions: Define K to be the set of possible tags. Define K−1 = K0 = {*}, and Kk = K for k = 1 … n.
Initialization: Set π(0, *, *) = 1.
Algorithm:
For k = 1 … n:
    For u ∈ Kk−1, v ∈ Kk:
        π(k, u, v) = max_{w ∈ Kk−2} ( π(k − 1, w, u) × q(v | w, u) × e(xk | v) )
Return max_{u ∈ Kn−1, v ∈ Kn} ( π(n, u, v) × q(STOP | u, v) )

What is the complexity of this algorithm, which tags a sequence of size n with a label set of size K?
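The recursion above can be sketched as follows (max-probability only, without the backpointers needed to recover the tag sequence). The tag set K and the parameter dictionaries q and e are toy values invented for illustration.

```python
from itertools import product

# A sketch of the Viterbi recursion above (max probability only, no backpointers).
# K, q, e below are toy parameters invented for illustration.
def viterbi_max(x, K, q, e):
    n = len(x)
    Ks = lambda k: {"*"} if k <= 0 else K   # K_{-1} = K_0 = {*}, K_k = K otherwise
    pi = {(0, "*", "*"): 1.0}               # initialization
    for k in range(1, n + 1):
        for u, v in product(Ks(k - 1), Ks(k)):
            pi[(k, u, v)] = max(
                pi[(k - 1, w, u)] * q[(v, w, u)] * e[(x[k - 1], v)]
                for w in Ks(k - 2)
            )
    return max(
        pi[(n, u, v)] * q[("STOP", u, v)]
        for u, v in product(Ks(n - 1), Ks(n))
    )

# toy run: K = {D, N}, two words
K = {"D", "N"}
q = {("D", "*", "*"): 0.9, ("N", "*", "*"): 0.1,
     ("D", "*", "D"): 0.2, ("N", "*", "D"): 0.8,
     ("D", "*", "N"): 0.5, ("N", "*", "N"): 0.5,
     ("STOP", "D", "D"): 1.0, ("STOP", "D", "N"): 1.0,
     ("STOP", "N", "D"): 1.0, ("STOP", "N", "N"): 1.0}
e = {("the", "D"): 0.6, ("the", "N"): 0.1,
     ("dog", "D"): 0.1, ("dog", "N"): 0.5}
best = viterbi_max(["the", "dog"], K, q, e)
```

As a hint for the complexity question: at each of the n positions the update ranges over all (u, v) pairs and maximizes over w, i.e. three nested loops over the tag set.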
xt is the input at time step t: a one-hot encoding of the word at position t in the sentence. ht is the hidden state at time step t, with ht = ReLU(U xt + W ht−1) and h−1 initialized to all zeroes. ot is the output at step t; in a tagging problem, ot = softmax(A ht).

What is the basic property of sequences addressed by RNNs which could not be handled by the architectures we have discussed so far, such as CBoW? What is the most pressing computational problem RNN architectures must address to enable convergence of the training procedure?
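The forward pass of this tagging RNN can be sketched as below. The dimensions and the randomly initialized parameters U, W, A are assumptions chosen only for illustration.

```python
import numpy as np

# A minimal sketch of the tagging RNN above. The dimensions and the random
# parameters U, W, A are assumptions chosen for illustration only.
rng = np.random.default_rng(0)
V, H, T = 5, 4, 3                        # vocab size, hidden size, tag count
U = rng.normal(0, 0.1, (H, V))
W = rng.normal(0, 0.1, (H, H))
A = rng.normal(0, 0.1, (T, H))

def softmax(z):
    z = z - z.max()                      # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def tag(word_ids):
    h = np.zeros(H)                      # h_{-1} initialized to all zeroes
    outputs = []
    for w in word_ids:
        x = np.zeros(V)
        x[w] = 1.0                       # one-hot encoding of the word
        h = np.maximum(0, U @ x + W @ h) # h_t = ReLU(U x_t + W h_{t-1})
        outputs.append(softmax(A @ h))   # o_t = softmax(A h_t)
    return outputs

outs = tag([0, 3, 1])                    # one tag distribution per word
```

Note that, unlike CBoW, each h_t depends on the whole prefix of the sentence through h_{t−1}, so each output is conditioned on word order.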