[Deep Learning] Practical limitations of end-to-end learning in text classification part 2

This is the second part of the comparison between artificial neural networks and traditional approaches in a binary chat classification task. In this post I will show that neural networks outperform traditional approaches like the ones shown in the previous post. Furthermore, I will present several “state of the art” approaches for text classification and explain why deep learning is not necessarily the way to go. This time there will be some code available.

Reminder

The data set consists of chat conversations, and the goal is to predict whether the participants scheduled a meeting and actually met. As a subtask on message level, we try to detect whether they are sharing location data.

  • Each message is labeled with TRUE if it contains information about a location (we skip that here and go for end-to-end learning).
  • Each thread is labeled with TRUE if the conversation ended up in a meeting.

Baseline

As shown before, the simple location prediction with BoW + SVM gives us a Matthews correlation coefficient (MCC) of 0.6 (out of 1.0). Using these outputs and some aggregated metadata with another SVM performs quite well: we achieved an MCC of 0.56 on the final prediction with cross-validation. This is our reference.

Deep Learning

Research shows that deep neural networks outperform traditional approaches, so let’s see whether we can apply some of them to increase the performance of the predictions.

Word embeddings and stacked recurrent neural networks

Word embeddings like word2vec, GloVe and fastText are well known in natural language processing. With these embeddings we represent words as fixed-size vectors, so a sentence becomes a sequence of word embeddings. As mentioned in the first part, sequences are more complicated to handle due to their variable length, but they have a major advantage: they preserve the order of the words.

To deal with sequences we use recurrent neural networks. There are several different RNNs, but the most successful ones are LSTMs and GRUs. The basic implementation is quite easy:
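A minimal sketch of such a stacked LSTM model, matching the description below (the unit count of 64 is an assumption; num_features and seq_length stand for the vocabulary size and the padded sequence length):

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

# word indices -> embeddings -> three stacked LSTMs -> one output neuron
model = Sequential()
model.add(Embedding(num_features, output_dim=32, input_length=seq_length))
model.add(LSTM(64, return_sequences=True))   # hidden state for every timestep
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64, return_sequences=False))  # only the last output is passed on
model.add(Dense(1, activation='sigmoid'))    # single neuron for the binary decision
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])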

But what does it actually mean?

Let's walk through it. We use a sequential model, which is easier but less powerful than the functional model we use later. The first layer is the embedding layer, which in our case takes word indices and transforms them into a vector representation. Here we learn the embeddings from scratch, but it would also be possible to use pretrained embeddings:

model.add(Embedding(num_features, output_dim=32, input_length=seq_length, weights=[embedding_matrix], trainable=False))

Another option would be to use pretrained embeddings as initialization but keep them trainable.
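That would be the same line as above, only with the weights left trainable:

model.add(Embedding(num_features, output_dim=32, input_length=seq_length, weights=[embedding_matrix], trainable=True))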

The stacked LSTM layers are responsible for learning the internal states. The difference between these three layers is return_sequences, which is necessary if we want to stack recurrent layers. With return_sequences set to True the LSTM returns its hidden state for each timestep; set to False it only returns the last one. In the last LSTM we only want the final result without any sequence data, so each cell contributes just its last output. All of these cell outputs feed into the last layer, which is a single neuron that makes the final classification. Compiling the model is the last step: Adam is the optimizer, binary_crossentropy is the loss function, and the metrics should be self-explanatory.

This approach gives us 0.58 MCC, slightly better than the baseline.

Adding attention

Let's try to improve the results with an attention mechanism, which is mostly used in seq2seq applications but should be useful here too. Attention is basically a mechanism that learns which parts of the text are more important than others in a specific context. The implementation of the Attention decoder can be found here.
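The linked AttentionDecoder is not reproduced here; as a rough illustration of the idea only, a much simpler attention layer that scores every timestep of an LSTM output (with return_sequences=True) and returns their weighted sum could look like this. This is a sketch, not the implementation behind the numbers below:

from keras import backend as K
from keras.layers import Layer

class SimpleAttention(Layer):
    """Learns one importance score per timestep and returns the weighted sum."""

    def build(self, input_shape):
        # one learnable weight vector used to score each timestep
        self.w = self.add_weight(name='att_weight',
                                 shape=(int(input_shape[-1]), 1),
                                 initializer='glorot_uniform',
                                 trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        # inputs: (batch, timesteps, features)
        scores = K.softmax(K.squeeze(K.dot(inputs, self.w), axis=-1))  # (batch, timesteps)
        return K.sum(inputs * K.expand_dims(scores), axis=1)           # (batch, features)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

Such a layer would sit between the last recurrent layer and the final Dense neuron.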

Attention increases the training time significantly (10x) and achieves 0.60 MCC.

Bidirectional recurrent layers

Another option to improve the model is to look not only at the previous words in the sentence but also at the following ones. This is called a bidirectional LSTM and is implemented by having two RNN cells, one for the forward direction and one for the backward pass. In Keras we can just add the Bidirectional layer wrapper, as sketched below.
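A minimal sketch of how this could look on top of the earlier model (the unit count, dropout rates and L2 factor are assumptions, not the values from the original snippet):

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional
from keras.regularizers import l2

model = Sequential()
model.add(Embedding(num_features, output_dim=32, input_length=seq_length))
# each Bidirectional wrapper runs one LSTM forward and one backward over the sequence
model.add(Bidirectional(LSTM(64, return_sequences=True,
                             dropout=0.2, recurrent_dropout=0.2,
                             kernel_regularizer=l2(0.001))))
model.add(Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2,
                             kernel_regularizer=l2(0.001))))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])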

This training is even more time consuming but increases the MCC to 0.61.

Regularization and dropout

You may notice a difference in the last snippet. With dropout and regularization added to our layers we can reduce overfitting during training. Regularization adds a term to the cost function that penalizes large weights, and dropout prevents single neurons from becoming too important for the network. This makes the network more robust, but on the other hand it often increases the number of epochs needed.

Word embeddings and convolution networks

Now there is another popular approach to solve the text classification problem, which is quite different from the previous ones. In this sequential convolution approach, we make use of convolutional layers, which are mostly applied in computer vision.

To understand how it works, there is a very good blog post about convolution in NLP. With this more complex architecture we can make use of the functional model definition, which is more powerful.

In general, the sequential convolution works exactly like the convolution in computer vision, with one difference: it only moves along one dimension, which in our case is the word sequence. One kernel looks at 4 words at once, one looks at 8 and one at 16. For every kernel size, 32 filters (word combinations) are learned. Then max pooling is applied to each feature map and the results are concatenated. After concatenation there is a dense layer for classification.
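A sketch of this multi-kernel architecture with the functional API (32 filters per kernel size of 4, 8 and 16, as described above; the pooling variant is an assumption, the original snippet is not reproduced here):

from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense, concatenate

inputs = Input(shape=(seq_length,))
embedded = Embedding(num_features, output_dim=32)(inputs)

# one branch per kernel size: 32 filters looking at 4, 8 and 16 words at a time
branches = []
for kernel_size in (4, 8, 16):
    conv = Conv1D(filters=32, kernel_size=kernel_size, activation='relu')(embedded)
    branches.append(GlobalMaxPooling1D()(conv))  # max pooling over each feature map

merged = concatenate(branches)                    # stick the three pooled vectors together
outputs = Dense(1, activation='sigmoid')(merged)  # dense layer for the classification

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])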

The ConvNet is the most stable one, faster to train, and the results are comparable with an MCC of 0.61.

Hyperparameters, time and architectures

You might ask whether my scores are reliable, because training neural networks is non-deterministic: you can train the same network twice and get different results. This is a major problem with neural networks; you must train them again and again to get reliable results. I know from scientists in this field that some of them rerun the same network until it outperforms another approach and then publish the results as a paper. Personally, I think this is crap; results should be verifiable.

To get reliable results, first try to find hyperparameters with cross-validation and grid search. This is already a problem, because training takes too long to run a grid search on big datasets. Some people say Bayesian optimization is the solution, but I made better progress with grid search and a reduced cross-validation with fewer folds, running several passes from coarse to fine. Hyperparameters include the network architecture itself, so you can easily end up with hundreds of different parameter sets for your network.

If you also include the number of epochs as a hyperparameter, it's over. What I actually did was train the network on a smaller dataset with a train/test split to see whether it converges, and use early stopping in Keras to speed up the training. Then I took the top N of these networks and ran a grid search with 3-fold cross-validation (3 runs for each parameter set), again with early stopping. For early stopping it is necessary to have a validation split inside each cross-validation step; you don't want to base early stopping on training scores, due to overfitting.
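To make this concrete, here is a rough sketch of one such cross-validation run with early stopping on a held-out validation split. X, y, build_model and evaluate_mcc are placeholders for your own data, model factory and MCC helper, not code from the original post:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from keras.callbacks import EarlyStopping

kfold = StratifiedKFold(n_splits=3, shuffle=True)
fold_scores = []
for train_idx, test_idx in kfold.split(X, y):
    model = build_model()  # hypothetical factory returning a freshly compiled model
    early = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    # the validation split for early stopping is carved out of the training fold,
    # so the test fold stays untouched
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=32,
              validation_split=0.1, callbacks=[early], verbose=0)
    fold_scores.append(evaluate_mcc(model, X[test_idx], y[test_idx]))  # hypothetical MCC helper
print(np.mean(fold_scores))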

Even this reduced approach takes a lot of time. Honestly, too much time if you weigh time and work against the performance gain.

Conclusion

Clearly deep learning outperformed the traditional approaches, but is it worth the effort in practice? What if you want to use your model in production environments, where the dataset is dynamic and the model must be retrained automatically from time to time? Is it worth building the infrastructure to train your models (GPU, AWS)? How do you monitor the performance? Is it worth investing the time and infrastructure to get this minor improvement?

In research such questions are not asked very often. But as an engineer in applied machine learning you must decide whether you want to deal with all the problems introduced by neural networks, TensorFlow and everything related. While this depends on the task and the potential outcome, it seems like the answer is often: no.
