p(x1 … xn, y1 … yn+1) = Π_{i=1..n+1} q(yi | yi−2, yi−1) × Π_{i=1..n} e(xi | yi), assuming y0 = y−1 = * and yn+1 = STOP.

Write down the independence assumptions which correspond to this equation.
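The equation above can be sketched directly in code. This is a minimal illustration, not part of the question: the parameter dictionaries q and e below are toy values invented for the example.

```python
# A minimal sketch of the trigram-HMM joint probability above.
# The dictionaries q and e are toy parameters invented for illustration.
def joint_prob(x, y, q, e):
    """x: words x1..xn, y: tags y1..yn; pads with y0 = y-1 = '*' and yn+1 = 'STOP'."""
    tags = ["*", "*"] + y + ["STOP"]
    p = 1.0
    # transition terms: q(yi | yi-2, yi-1) for i = 1 .. n+1
    for i in range(2, len(tags)):
        p *= q[(tags[i], tags[i - 2], tags[i - 1])]
    # emission terms: e(xi | yi) for i = 1 .. n
    for word, tag in zip(x, y):
        p *= e[(word, tag)]
    return p

# toy parameters (made up): a two-word sentence tagged D N
q = {("D", "*", "*"): 1.0, ("N", "*", "D"): 1.0, ("STOP", "D", "N"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4}
print(joint_prob(["the", "dog"], ["D", "N"], q, e))  # 0.5 * 0.4 = 0.2
```

Note how the code mirrors the two products: one transition factor per tag position including STOP, one emission factor per word.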
Input: a sentence x1 … xn, parameters q(s | u, v) and e(x | s).
Definitions: Define K to be the set of possible tags. Define K−1 = K0 = {*}, and Kk = K for k = 1 … n.
Initialization: Set π(0, *, *) = 1.
Algorithm:
For k = 1 … n:
    For u ∈ Kk−1, v ∈ Kk:
        π(k, u, v) = max_{w ∈ Kk−2} ( π(k − 1, w, u) × q(v | w, u) × e(xk | v) )
Return max_{u ∈ Kn−1, v ∈ Kn} ( π(n, u, v) × q(STOP | u, v) )

What is the complexity of this algorithm, which tags a sequence of size n with a label set of size K?
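The recursion above can be sketched as follows (max-probability only, without the backpointers needed to recover the tag sequence). The tag set K and the parameter dictionaries q and e are toy values invented for illustration.

```python
from itertools import product

# A sketch of the Viterbi recursion above (max probability only, no backpointers).
# K, q, e below are toy parameters invented for illustration.
def viterbi_max(x, K, q, e):
    n = len(x)
    Ks = lambda k: {"*"} if k <= 0 else K   # K_{-1} = K_0 = {*}, K_k = K otherwise
    pi = {(0, "*", "*"): 1.0}               # initialization
    for k in range(1, n + 1):
        for u, v in product(Ks(k - 1), Ks(k)):
            pi[(k, u, v)] = max(
                pi[(k - 1, w, u)] * q[(v, w, u)] * e[(x[k - 1], v)]
                for w in Ks(k - 2)
            )
    return max(
        pi[(n, u, v)] * q[("STOP", u, v)]
        for u, v in product(Ks(n - 1), Ks(n))
    )

# toy run: K = {D, N}, two words
K = {"D", "N"}
q = {("D", "*", "*"): 0.9, ("N", "*", "*"): 0.1,
     ("D", "*", "D"): 0.2, ("N", "*", "D"): 0.8,
     ("D", "*", "N"): 0.5, ("N", "*", "N"): 0.5,
     ("STOP", "D", "D"): 1.0, ("STOP", "D", "N"): 1.0,
     ("STOP", "N", "D"): 1.0, ("STOP", "N", "N"): 1.0}
e = {("the", "D"): 0.6, ("the", "N"): 0.1,
     ("dog", "D"): 0.1, ("dog", "N"): 0.5}
best = viterbi_max(["the", "dog"], K, q, e)
```

As a hint for the complexity question: at each of the n positions the update ranges over all (u, v) pairs and maximizes over w, i.e. three nested loops over the tag set.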
xt is the input at time step t: a one-hot encoding of the word at position t in the sentence. ht is the hidden state at time step t, with ht = ReLU(U xt + W ht−1) and h−1 initialized to all zeroes. ot is the output at step t; in a tagging problem, ot = softmax(A ht).

What is the basic property of sequences addressed by RNNs which could not be handled by the architectures we have discussed so far, such as CBoW? What is the most pressing computational problem RNN architectures must address to enable convergence of the training procedure?
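The forward pass of this tagging RNN can be sketched as below. The dimensions and the randomly initialized parameters U, W, A are assumptions chosen only for illustration.

```python
import numpy as np

# A minimal sketch of the tagging RNN above. The dimensions and the random
# parameters U, W, A are assumptions chosen for illustration only.
rng = np.random.default_rng(0)
V, H, T = 5, 4, 3                        # vocab size, hidden size, tag count
U = rng.normal(0, 0.1, (H, V))
W = rng.normal(0, 0.1, (H, H))
A = rng.normal(0, 0.1, (T, H))

def softmax(z):
    z = z - z.max()                      # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def tag(word_ids):
    h = np.zeros(H)                      # h_{-1} initialized to all zeroes
    outputs = []
    for w in word_ids:
        x = np.zeros(V)
        x[w] = 1.0                       # one-hot encoding of the word
        h = np.maximum(0, U @ x + W @ h) # h_t = ReLU(U x_t + W h_{t-1})
        outputs.append(softmax(A @ h))   # o_t = softmax(A h_t)
    return outputs

outs = tag([0, 3, 1])                    # one tag distribution per word
```

Note that, unlike CBoW, each h_t depends on the whole prefix of the sentence through h_{t−1}, so each output is conditioned on word order.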