Practice Exercises for Deep Learning and Natural Language Processing
Deep NLP Kata is a set of practice exercises ("Katas") that help you familiarize with the basic concepts of deep learning for natural language processing. Each exercise is designed to be fun, practical, and bite-sized (i.e., can be completed in less than 30 minutes).
Those exercises do not depend on the language nor the framework you use. Before starting this exercises, make sure to install the programming language and the framework of your choice and set up the environment. We recommend Python + any of major neural network frameworks, such as TensorFlow, Keras, PyTorch, and Caffe.
Note that this it not an introductory course for machine learning, deep learning, nor programming. We assume you already know (or know how to Google) the basics of linear algebra, neural networks, and machine learning, as well as how to implement them.
If you have any feedback / comments / suggestions for this kata, visit the Github repo.
These exercises help you familiarize yourself with the basic building blocks of neural network computation, such as vectors, matrices, and operations.
Calcualte \( 2 + 3 \) using your framework, i.e., not just using Python operations but using e.g., tf.Tensor
and torch.Tensor
.
Caluclate the inner product of two vectors: \( v_1 = [1 \ 3 \ -5] \) and \( v_2 = [4 \ -2 \ -1] \).
Calucalte the product of two matrices: $$ m_1 = \begin{bmatrix} -1 & 2 & 3 \\ 4 & -2 & 0 \end{bmatrix} $$ and $$ m_2 = \begin{bmatrix} -2 & -3 \\ 4 & 1 \\ 0 & 1 \end{bmatrix}. $$
Consider this linear model: \( y = xw + b \), where \( w = [2 \ -1]^T \) and \( b = 1 \).
What is the value of \( y \) when \(x = [1 \ 1], [0 \ 1], [1 \ 0] \), respectively?
Consider the linear model \( y = xw + b \), where the values of \( w \) and \( b \) are unknown.
When the values of \( x \) are \(x = [1 \ 1], [0 \ 1], [1 \ 0] \), the values of \( y \) are 2, 0, 3, respectively. Solve for \( w \) and \( b \).
First, compute the design matrix: $$ X = \begin{bmatrix} 1 & x^1_1 & x^1_2 \\ 1 & x^2_1 & x^2_2 \\ 1 & x^3_1 & x^3_2 \end{bmatrix}, $$ where \( x^i_j \) is the j-th dimension of i-th training instance. Then, the solution for \( b \) and \( w \) is given by: $$ (X^T X)^{-1} X^T y $$
Compute the mean squared loss between two vectors: \(y_1 = [1 \ 1 \ 1]\) and \(y_2 = [2 \ 0 \ 3]\).
Consider the linear model \( y = xw + b \), where \( w = [0 \ 0]^T \) and \( b = 0 \). For input \(x = [1 \ 1], [0 \ 1], [1 \ 0] \), compute the output (\( y \)) of the model. What is the mean squared loss between \( y \) and desired output \( y = [2 \ 0 \ 3] \)?
Also, compute the gradient of the loss w.r.t \( w \) and \( b \).
Consider the logistic regression model \( y = \sigma(xw + b) \) where \( w = [0 \ 0]^T \), \( b = 1 \), and \( \sigma \) is a sigmoid function.
What are the values of \( y \) for input \(x = [1 \ 1], [0 \ 1], [1 \ 0] \), respectively?
Compute the cross entropy loss between the output of Kata 8 and desied output \( \hat{y} = [1 \ 0 \ 0] \).
Compute the gradient of the cross entropy loss from Kata 9 w.r.t \( w \) and \( b \).
In this section, we'll start building feed-forward neural networks using the building blocks we learned previously.
Let's generate some fake data for regression. Let \( \hat{w} = [2 \ -1] \) and \( \hat{b} = 1 \). These are the parameters you want your neural network to learn. Going forward, pretend you don't know the "true" values of these parameters.
First, generate a number (e.g., 100) of two dimensional vectors whose values are drawn randomly from a standard normal distribution (Gaussian distribution with \( \mu = 0 \) and \(\sigma = 1\)). Let this be \( \mathbf{x} \). Second, generate the desired output \( \mathbf{\hat y} \) s.t. \( \mathbf{\hat y} = \mathbf{x} \hat{w} + \hat{b} \).
Build a feed-forward neural network that has one densely connected linear layer. It takes \( \mathbf{x} \) as input and output \( \mathbf{y} = \mathbf{x} w + b \). Initialize \( w = [0 \ 0] \) and \( b = 0 \). What does \( \mathbf{y} \) look like?
Optionally, you can mini-batch the input \( \mathbf{x} \).
Compute the mean squared loss between \( \mathbf{y} \) and \( \mathbf{\hat y} \).
Back-propagate the gradient of the loss function and update the model parameters (connection weights and bias of the linear layer) so that the loss is minimized. Iterate this over the training data (i.e., \( \mathbf{x} \) and \( \mathbf{\hat y} \)) multiple times. Log the values of the loss function. If your model and training pipeline is working correctly, those values should decrease as you iterate.
Build another feed-forward neural network that has one two-dimensional "hidden" layer. In other words, this model has two sets of connections:
As you did in Kata 14, train the model you just created in Kata 15 by back-propagating the loss and updating the model parameters.
What does the loss decrease look like in comparison to the previous Kata?
Add ReLU (Rectified Linear Unit) as the activation function (non-linearity) to the hidden layer, and train the model.
Also, try the Tanh and Sigmoid functions. Which activation functions perform better in terms of reducing the loss?
Now, let's turn to a classification problem and generate some fake data for it. Let \( \hat{w} = [2 \ -1] \) and \( \hat{b} = 1 \).
First, as we did in Kata 11, let's generate a bunch (maybe 100) of two-dimensional vectors from a standard Gaussian distribution. Let this be \( \mathbf{x} \).
Second, let \( \mathbf{p} = \sigma(\mathbf{x} \hat{w} + \hat{b}) \), where \( \sigma() \) is the element-wise sigmoid function.
Finally, generate \( \mathbf{\hat y} \) s.t. \( y_i = 1 \) if \( p_i > 0.5\) and 0 otherwise, where \( y_i \) and \( p_i \) are the i-th element of \( \mathbf{\hat y} \) and \( \mathbf{p} \), respectively.
Create a feed-forward neural network that has one two-dimensional hidden layer with the Tanh activation function, as you did in Kata 17.
Add a sigmoid activation function on top of the output layer, so that the network can output a probability between 0 and 1 instead a real value.
Define a cross entropy loss between the output of the network \( \mathbf{y} \) and the target \( \mathbf{\hat y} \).
Back-propagate the gradient of the loss function and update the model parameters.
In this section, we'll be building so-called word2vec models, namely skip-gram and CBOW (continous bag-of-words) models. In terms of architecture, they are very simple feed-forward neural networks.
First of all, let's prepare the data set for the word2vec models. Go to Matt Mahoney's page, find and download text8.zip, and unzip it. This file, consisting of 100M-character excerpts from cleaned English Wikipedia, is often used for testing word2vec algorithms.
Second, read the text, split into words by whitespace, and convert words into word indices. In doing so, you want to limit the size of the vocabulary to \( N \) (say, the most frequent 10,000 words). You can do this in the following fashion:
Counter
comes in handy here.UNK
. Assign index=0 to UNK
, that is, those "out-of-vocabulary" words are assigned an index of 0.If you do this correctly, you'll have created a long list of word indices, which are integers ranging from 0 to \( N - 1 \).
The skip-gram model is trained so that the neural network can solve this "fake task": given a word in text, it predicts what types of words tend to appear in its context. Create the training data for this task as follows:
In machine learning, we usually group such training pairs into minibatches. Continue this process for different values of \( t \) and \( j \) until you have \( b \) such pairs. Group them into a minibatch.
The key to word2vec models is word embeddings, which are semantic representations of words through dense, real-valued vectors. Word2vec models convert input/output words to embeddings and use them for prediction.
Conversion from words to embeddings is usually done via an embedding matrix, which works as a look-up table from word IDs to their corresponding embedding vectors. First, define an embedding matrix in your programming language/framework. Second, represent a word via a one-hot vector, which is a sparse, integer vector where i-th element (i = ID of the word) is 1 and everything else is 0. Finally, the corresponding word vector is obtained by multiplying this one-hot vector with the embedding matrix. (Note that most modern neural network frameworks have a built-in implementation for this whole pipeline).
Implement the skip-gram model. First, using the embedding matrix above, obtain the input word vector. Let this be \( v_{w_I} \). Second, using another embedding matrix (but with the same size and shape), obtain the output word vector \( v'_{w_O} \). Take the inner product between those two. Finally, normalize this by considering all words in the vocabulary using softmax:
$$ p(w_O | w_I) = \frac{\exp({v'_{w_O}}^T v_{w_I})}{\sum_{w \in V} \exp({v'_w}^T v_{w_I})) }, $$ where \( V \) is the vocabulary. The loss function is the sum of the logarithm of the probability above over all training pairs: $$ \sum_{(w_O, w_I) \in X} \log P(w_O | w_I). $$Train your skip-gram model by maximizing the log probability above.
After training the skip-gram model, as a by-product you end up with the "optimized" embedding matrix, whose rows correspond to what you want as word vectors.
Pick one word arbitrarily from the vocabulary (say, "dog"), and compute the cosine similarity between this word vector and every other word vector in the vocabulary, and sort the words by this similarity. Do you see similar words to the one you picked?
You may have noticed that the model you just built is very slow to train. One reason is that the softmax loss (see Kata 24) considers every word in the vocabulary.
One alternative is to use Negative Sampling. The idea of Negative Sampling is similar to turning this problem of skip-gram into a classification problem. Specifically, instead of feeding positive output \( (w_t, w_{t+j}) \) you created from the corpus, you feed both correct and negative output, where negative output \( (w_t, w_i) \) is generated by pairing \( w_t \) with a randomly sampled word \( w_i \) from a unigram distribution.
Implement Negative Sampling. See Equation (4) in the original word2vec paper for more details of the loss function to use.
Another reason why a naive skip-gram model tends to be slow to train is because of high-frequency words such as "the" and "a". They can appear millions of times in a large corpus, while providing very little information for word embeddings. For example, a correct pair ("dog", "the"), which can appear in a corpus thousands of times, provides very little information about the word "dog", beside it's probably a noun, compared to other pairs such as ("dog", "white") and ("dog", "barked").
One way to deal with this issue is to subsample frequent words, which means simply to discard word pairs with frequent words. In the original word2vec paper, each word in the training corpus is discarded with a probability that is monotonically increasing with the word frequency.
At this point, your word embeddings might be starting to make some sense. One way to verify if word embeddings are making any sense is to visualize them.
Visualize the word embeddings in a 2D space using the t-SNE algorithm, which is a popular way to map high-dimensional vectors onto a low-dimensinoal space.
If you are using Python, sklearn.manifold.TSNE
is a good choice.
One of the easiest way to verify word embeddings is to look for small, closed word clusters such as days of week ("Sunday", "Monday", ...), pronouns ("I", "you", "she", ...), and numbers ("one", "two", "three", ...), because they tend to appear in similar context and their corresponding word embedding vectors tend to cluster together. Do you see related words in the vicinity?
Another algorithm in the word2vec family is Continuous Bag-of-Words (CBOW) model. CBOW flips the input and output of skip-gram, and predicts the target word from its context. Specifically, CBOW considers all words in a context window (\(k\) words before and after the target), looks up their word embeddings, and then takes the sum of those vectors.
Implement the CBOW model by modifying the skip-gram algorithm.
There are numerous extensions proposed for word2vec so far. One of such extensions is called sentence2vec, which can be used to estimate not only word embeddings but also paragraph embeddings.
The sentence2vec model is very similar to word2vec models, except there is one additional input called paragraph vector. This is a vector obtained by looking up the paragraph ID in the embeddings matrix. For extending the CBOW model, it is added to one of the input word embeddings. To extend the skip-gram model, it is used to replace the target word embeddings. These two models are called the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), respectively.