Deep NLP Kata

Section 1. Linear Algebra Basics

These exercises help you familiarize yourself with the basic building blocks of neural network computation, such as vectors, matrices, and operations.

Kata 1
Calcualte $ 2 + 3 $ using your framework, i.e., not just using Python operations but using e.g., tf.Tensor and torch.Tensor.
Kata 2
Caluclate the inner product of two vectors: $ v_1 = [1 \ 3 \ -5] $ and $ v_2 = [4 \ -2 \ -1] $.
Kata 3
Calucalte the product of two matrices: $$ m_1 = \begin{bmatrix} -1 & 2 & 3 \\ 4 & -2 & 0 \end{bmatrix} $$ and $$ m_2 = \begin{bmatrix} -2 & -3 \\ 4 & 1 \\ 0 & 1 \end{bmatrix}. $$
Kata 4
Consider this linear model: $ y = xw + b $, where $ w = [2 \ -1]^T $ and $ b = 1 $.

What is the value of $ y $ when $x = [1 \ 1], [0 \ 1], [1 \ 0] $, respectively?
Kata 5
Consider the linear model $ y = xw + b $, where the values of $ w $ and $ b $ are unknown.

When the values of $ x $ are $x = [1 \ 1], [0 \ 1], [1 \ 0] $, the values of $ y $ are 2, 0, 3, respectively. Solve for $ w $ and $ b $.

First, compute the design matrix: $$ X = \begin{bmatrix} 1 & x^1_1 & x^1_2 \\ 1 & x^2_1 & x^2_2 \\ 1 & x^3_1 & x^3_2 \end{bmatrix}, $$ where $ x^i_j $ is the j-th dimension of i-th training instance. Then, the solution for $ b $ and $ w $ is given by: $$ (X^T X)^{-1} X^T y $$
Kata 6
Compute the mean squared loss between two vectors: $y_1 = [1 \ 1 \ 1]$ and $y_2 = [2 \ 0 \ 3]$.
Kata 7
Consider the linear model $ y = xw + b $, where $ w = [0 \ 0]^T $ and $ b = 0 $. For input $x = [1 \ 1], [0 \ 1], [1 \ 0] $, compute the output ($ y $) of the model. What is the mean squared loss between $ y $ and desired output $ y = [2 \ 0 \ 3] $?

Also, compute the gradient of the loss w.r.t $ w $ and $ b $.
Kata 8
Consider the logistic regression model $ y = \sigma(xw + b) $ where $ w = [0 \ 0]^T $, $ b = 1 $, and $ \sigma $ is a sigmoid function.

What are the values of $ y $ for input $x = [1 \ 1], [0 \ 1], [1 \ 0] $, respectively?
Kata 9
Compute the cross entropy loss between the output of Kata 8 and desied output $ \hat{y} = [1 \ 0 \ 0] $.
Kata 10
Compute the gradient of the cross entropy loss from Kata 9 w.r.t $ w $ and $ b $.

Section 2. Feed-forward Neural Networks

In this section, we'll start building feed-forward neural networks using the building blocks we learned previously.

Kata 11
Let's generate some fake data for regression. Let $ \hat{w} = [2 \ -1] $ and $ \hat{b} = 1 $. These are the parameters you want your neural network to learn. Going forward, pretend you don't know the "true" values of these parameters.

First, generate a number (e.g., 100) of two dimensional vectors whose values are drawn randomly from a standard normal distribution (Gaussian distribution with $ \mu = 0 $ and $\sigma = 1$). Let this be $ \mathbf{x} $. Second, generate the desired output $ \mathbf{\hat y} $ s.t. $ \mathbf{\hat y} = \mathbf{x} \hat{w} + \hat{b} $.
Kata 12
Build a feed-forward neural network that has one densely connected linear layer. It takes $ \mathbf{x} $ as input and output $ \mathbf{y} = \mathbf{x} w + b $. Initialize $ w = [0 \ 0] $ and $ b = 0 $. What does $ \mathbf{y} $ look like?

Optionally, you can mini-batch the input $ \mathbf{x} $.
Kata 13
Compute the mean squared loss between $ \mathbf{y} $ and $ \mathbf{\hat y} $.
Kata 14
Back-propagate the gradient of the loss function and update the model parameters (connection weights and bias of the linear layer) so that the loss is minimized. Iterate this over the training data (i.e., $ \mathbf{x} $ and $ \mathbf{\hat y} $) multiple times. Log the values of the loss function. If your model and training pipeline is working correctly, those values should decrease as you iterate.
Kata 15
Build another feed-forward neural network that has one two-dimensional "hidden" layer. In other words, this model has two sets of connections:
- input(2) →(linear)→ hidden(2) and
- hidden(2)→(linear)→output(1),
where the numbers in ( ) are dimensions of each layer.
Kata 16
As you did in Kata 14, train the model you just created in Kata 15 by back-propagating the loss and updating the model parameters.

What does the loss decrease look like in comparison to the previous Kata?
Kata 17
Add ReLU (Rectified Linear Unit) as the activation function (non-linearity) to the hidden layer, and train the model.

Also, try the Tanh and Sigmoid functions. Which activation functions perform better in terms of reducing the loss?
Kata 18
Now, let's turn to a classification problem and generate some fake data for it. Let $ \hat{w} = [2 \ -1] $ and $ \hat{b} = 1 $.

First, as we did in Kata 11, let's generate a bunch (maybe 100) of two-dimensional vectors from a standard Gaussian distribution. Let this be $ \mathbf{x} $.

Second, let $ \mathbf{p} = \sigma(\mathbf{x} \hat{w} + \hat{b}) $, where $ \sigma() $ is the element-wise sigmoid function.

Finally, generate $ \mathbf{\hat y} $ s.t. $ y_i = 1 $ if $ p_i > 0.5$ and 0 otherwise, where $ y_i $ and $ p_i $ are the i-th element of $ \mathbf{\hat y} $ and $ \mathbf{p} $, respectively.
Kata 19
Create a feed-forward neural network that has one two-dimensional hidden layer with the Tanh activation function, as you did in Kata 17.

Add a sigmoid activation function on top of the output layer, so that the network can output a probability between 0 and 1 instead a real value.
Kata 20
Define a cross entropy loss between the output of the network $ \mathbf{y} $ and the target $ \mathbf{\hat y} $.

Back-propagate the gradient of the loss function and update the model parameters.

Section 3. Word Embeddings

In this section, we'll be building so-called word2vec models, namely skip-gram and CBOW (continous bag-of-words) models. In terms of architecture, they are very simple feed-forward neural networks.

Kata 21
First of all, let's prepare the data set for the word2vec models. Go to Matt Mahoney's page, find and download text8.zip, and unzip it. This file, consisting of 100M-character excerpts from cleaned English Wikipedia, is often used for testing word2vec algorithms.

Second, read the text, split into words by whitespace, and convert words into word indices. In doing so, you want to limit the size of the vocabulary to $ N $ (say, the most frequent 10,000 words). You can do this in the following fashion:
1. Go through the text (list of words) and count the word occurrences, i.e., number of times each word appears in the text. Python's Counter comes in handy here.
2. Sort the list of unique words by their occurrences, and retain the top $ N - 1 $ most frequently occurring words. (Why $ N - 1 $ not $ N $? See below.)
3. Assign an integer index from 1 to $ N - 1$ for each unique word in this word list. Now you created a look-up table that converts words to their indices.
4. Go through the text again while looking up in the word-index table.
5. Since the table only contains top-$N-1$ words, you'll encounter words that are not in the list. Replace such words with a special token UNK. Assign index=0 to UNK, that is, those "out-of-vocabulary" words are assigned an index of 0.
If you do this correctly, you'll have created a long list of word indices, which are integers ranging from 0 to $ N - 1 $.
Kata 22
The skip-gram model is trained so that the neural network can solve this "fake task": given a word in text, it predicts what types of words tend to appear in its context. Create the training data for this task as follows:
1. For each word (let this be $ w_t $) in the data set, define its context by taking $ c $ words before and $ c $ words after. $ c $ here is called window size.
2. Pick $ k $ (usually $ k $ = 2) words randomly from this "window", and let it be $ w_{t+j} $, where $ j $ is the integer offset from $ w_t $.
3. Now, pairs $ (w_t, w_{t+j}) $ for each $ j $ you picked is the training data.
In machine learning, we usually group such training pairs into minibatches. Continue this process for different values of $ t $ and $ j $ until you have $ b $ such pairs. Group them into a minibatch.
Kata 23
The key to word2vec models is word embeddings, which are semantic representations of words through dense, real-valued vectors. Word2vec models convert input/output words to embeddings and use them for prediction.

Conversion from words to embeddings is usually done via an embedding matrix, which works as a look-up table from word IDs to their corresponding embedding vectors. First, define an embedding matrix in your programming language/framework. Second, represent a word via a one-hot vector, which is a sparse, integer vector where i-th element (i = ID of the word) is 1 and everything else is 0. Finally, the corresponding word vector is obtained by multiplying this one-hot vector with the embedding matrix. (Note that most modern neural network frameworks have a built-in implementation for this whole pipeline).
Kata 24
Implement the skip-gram model. First, using the embedding matrix above, obtain the input word vector. Let this be $ v_{w_I} $. Second, using another embedding matrix (but with the same size and shape), obtain the output word vector $ v'_{w_O} $. Take the inner product between those two. Finally, normalize this by considering all words in the vocabulary using softmax:
$$ p(w_O | w_I) = \frac{\exp({v'_{w_O}}^T v_{w_I})}{\sum_{w \in V} \exp({v'_w}^T v_{w_I})) }, $$ where $ V $ is the vocabulary. The loss function is the sum of the logarithm of the probability above over all training pairs: $$ \sum_{(w_O, w_I) \in X} \log P(w_O | w_I). $$
Train your skip-gram model by maximizing the log probability above.
Kata 25
After training the skip-gram model, as a by-product you end up with the "optimized" embedding matrix, whose rows correspond to what you want as word vectors.

Pick one word arbitrarily from the vocabulary (say, "dog"), and compute the cosine similarity between this word vector and every other word vector in the vocabulary, and sort the words by this similarity. Do you see similar words to the one you picked?
Kata 26
You may have noticed that the model you just built is very slow to train. One reason is that the softmax loss (see Kata 24) considers every word in the vocabulary.
One alternative is to use Negative Sampling. The idea of Negative Sampling is similar to turning this problem of skip-gram into a classification problem. Specifically, instead of feeding positive output $ (w_t, w_{t+j}) $ you created from the corpus, you feed both correct and negative output, where negative output $ (w_t, w_i) $ is generated by pairing $ w_t $ with a randomly sampled word $ w_i $ from a unigram distribution.

Implement Negative Sampling. See Equation (4) in the original word2vec paper for more details of the loss function to use.
Kata 27
Another reason why a naive skip-gram model tends to be slow to train is because of high-frequency words such as "the" and "a". They can appear millions of times in a large corpus, while providing very little information for word embeddings. For example, a correct pair ("dog", "the"), which can appear in a corpus thousands of times, provides very little information about the word "dog", beside it's probably a noun, compared to other pairs such as ("dog", "white") and ("dog", "barked").

One way to deal with this issue is to subsample frequent words, which means simply to discard word pairs with frequent words. In the original word2vec paper, each word in the training corpus is discarded with a probability that is monotonically increasing with the word frequency.
Kata 28
At this point, your word embeddings might be starting to make some sense. One way to verify if word embeddings are making any sense is to visualize them.

Visualize the word embeddings in a 2D space using the t-SNE algorithm, which is a popular way to map high-dimensional vectors onto a low-dimensinoal space. If you are using Python, sklearn.manifold.TSNE is a good choice.

One of the easiest way to verify word embeddings is to look for small, closed word clusters such as days of week ("Sunday", "Monday", ...), pronouns ("I", "you", "she", ...), and numbers ("one", "two", "three", ...), because they tend to appear in similar context and their corresponding word embedding vectors tend to cluster together. Do you see related words in the vicinity?
Kata 29
Another algorithm in the word2vec family is Continuous Bag-of-Words (CBOW) model. CBOW flips the input and output of skip-gram, and predicts the target word from its context. Specifically, CBOW considers all words in a context window ($k$ words before and after the target), looks up their word embeddings, and then takes the sum of those vectors.

Implement the CBOW model by modifying the skip-gram algorithm.
Kata 30
There are numerous extensions proposed for word2vec so far. One of such extensions is called sentence2vec, which can be used to estimate not only word embeddings but also paragraph embeddings.

The sentence2vec model is very similar to word2vec models, except there is one additional input called paragraph vector. This is a vector obtained by looking up the paragraph ID in the embeddings matrix. For extending the CBOW model, it is added to one of the input word embeddings. To extend the skip-gram model, it is used to replace the target word embeddings. These two models are called the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), respectively.

About

Section 1. Linear Algebra Basics

Section 2. Feed-forward Neural Networks

Section 3. Word Embeddings

Section 4. Language Modeling

Section 5. Text Generation

Section 6. Text Classification

Section 7. Sequence-to-Sequence Models

Section 8. Attention

Section 9. Convolutional Neural Networks

Section 10. (TBD)