[Deep Learning] Practical limitations of end-to-end learning in text classification part 1

Over the last few months I have been working on a text classification problem at the university. The problem is basically a binary classification of chat-like text data, where the training dataset consists of 500k messages belonging to ~150k threads. One part of this project was to evaluate whether artificial neural networks (deep learning) can outperform traditional methods. In these two posts I want to show how and why it is probably better to stay away from neural networks, even if they outperform traditional approaches. In this first post I will show how to build a simple text preprocessing pipeline and use stacked ensembles.


As mentioned, the data consists of messages, each belonging to a thread in a conversation between two people. The data is multilingual, but around 70% is English. As the data comes from real people, it is not free from mistakes, slang, abbreviations and other errors. For such communications there is no public dataset of this size available, and there probably never will be, due to privacy considerations. All the data is labeled with a binary attribute per message and a binary attribute per thread, where the message attributes strongly correlate with the thread attribute. Threads consist of a variable number of messages, and each message can also be of variable length (in words).

Let's assume the conversation is about finding a location for a meeting and we want to predict whether the meeting happened.
Each message is labeled with TRUE if it contains information about a location.
Each thread is labeled with TRUE if the conversation ended up in a meeting.

The most challenging part of this classification task is that we want to predict the label of the thread at any state of the conversation. At some point in the conversation it might seem likely that they will meet, but then in the last message one of them declines the meeting. This case should be predicted correctly as well.

The traditional approach

The first attempt to solve such a problem is to split the task into subtasks. In this case, one subtask might be the classification of each message and another the aggregation of these results. Because the thread classifier is stacked onto the message classifier, this approach is called stacking.
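The two stacked stages can be sketched in a few lines of Python; the models below are dummy stand-ins to show the control flow, not the actual classifiers used in the project:

```python
def classify_thread(messages, message_model, thread_model):
    # Stage 1: per-message probability that a location is mentioned
    probs = [message_model(m) for m in messages]
    # Stage 2: aggregate the message-level outputs into thread features
    features = [max(probs), len(messages)]
    return thread_model(features)

# Dummy stand-ins for the real classifiers
message_model = lambda m: 1.0 if "room" in m.lower() else 0.0
thread_model = lambda f: f[0] > 0.5  # positive if any message mentioned a location

print(classify_thread(["Hi Sam, where do we meet?", "I booked Room 43."],
                      message_model, thread_model))  # True
```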

Example thread:

Number  Message                                                                     Label-Loc  State-Thread
#1      Hello Linda, I would like to schedule a meeting for tomorrow after lunch.   0          0
#2      Hi Sam, sure! Where do we want to meet? 1pm should be fine.                 0          0
#3      I have booked the meeting room in building 12, Room 43.                     1          0
#4      Ok perfect, see you there!                                                  0          1
#5      Sorry, unfortunately I can not be there today!                               0          0

Label-Loc is what we want to predict in the first place, and the final goal is to say whether the meeting happened or not, which would be TRUE after message 4 and FALSE after message 5.

Message level classification

One simple approach to predict whether a location occurred in a message is to transform the text data into some vector representation, e.g. a bag-of-words representation or word embeddings. Additionally, it is necessary to preprocess the text data: the data is cleaned, filtered for stop words, and stemming is applied. To avoid building a dictionary over the whole text corpus, we may apply feature hashing. As we want to predict whether there is a location inside the message, we apply some regular expressions to find specific addresses, places, cities and countries and replace them with placeholders. This prevents our classifier from overfitting, because we don't want it to learn specific places, but abstract patterns. All of these features are then used in scikit-learn to predict our target label via cross-validation of several different models (example benchmark script). Such scripts give a rough direction, and it is a good idea to apply cross-validation for a set of classifiers and adjust hyperparameters.
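A minimal scikit-learn sketch of such a pipeline might look like this; the regular expressions, placeholder tokens and toy data are illustrative assumptions, and stemming is omitted for brevity:

```python
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def replace_entities(text):
    # Replace concrete entities with placeholder tokens so the model
    # learns abstract patterns instead of specific places.
    text = text.lower()
    text = re.sub(r"\broom \d+\b", "ROOMTOKEN", text)
    text = re.sub(r"\bbuilding \d+\b", "BUILDINGTOKEN", text)
    return text

pipeline = make_pipeline(
    HashingVectorizer(preprocessor=replace_entities,
                      stop_words="english",
                      n_features=2**18),  # feature hashing: no dictionary needed
    LogisticRegression(),
)

# Cross-validate on (message, label) pairs; this data is a toy stand-in.
messages = ["I booked Room 43 in Building 12.", "See you there!"] * 10
labels = [1, 0] * 10
scores = cross_val_score(pipeline, messages, labels, cv=5)
```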

For high-dimensional NLP problems, the common assumption that SVMs work very well is confirmed by the benchmark. With SVMs it is also possible to get probabilities instead of classes by using Platt scaling, which helps later in the aggregation part. A common approach is to use an SVM with bag-of-words features.
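In scikit-learn, Platt scaling is enabled by setting `probability=True` on the SVC; a minimal sketch with toy, linearly separable data:

```python
from sklearn.svm import SVC

# Toy data: the class is decided by the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 0, 1, 1] * 5

clf = SVC(kernel="linear", probability=True)  # probability=True enables Platt scaling
clf.fit(X, y)

proba = clf.predict_proba([[1, 1]])[0]  # [P(class 0), P(class 1)]
```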

Applying this approach gives us an accuracy of 95% and, more importantly, a Matthews correlation coefficient (MCC) of 0.6, which is a good start. The high accuracy can be explained by the imbalanced data set: about 90% of the messages are labeled FALSE (no location). The MCC, by contrast, can be understood as a weighted accuracy, which yields 1 if everything is correct, 0 if predictions are random and -1 if everything is wrong. MCC is my favorite score for binary classification, because it remains meaningful on imbalanced data sets.
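A small example shows why accuracy is misleading on imbalanced data while the MCC is not:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [0] * 90 + [1] * 10   # ~90% negative, like our message labels
y_pred = [0] * 100             # a "classifier" that always predicts FALSE

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(acc, mcc)  # high accuracy, but the MCC exposes the useless model
```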

You can find a Java library, which actually includes preprocessing for this problem, here.

Aggregation on threads

Now we have a probability for each message, which tells us whether a location was mentioned. Furthermore, we have some metadata available, for example the time a message was sent, the number of messages, and more.

While solving the message classification problem, we learned a lot about the data. Let's assume we know that in very short threads it is unlikely that a meeting happened, and in long threads it is very likely that a meeting happened. But there are some exceptions, like very long threads where the messages are just spam.

At some point it is a good idea to restrict the length of a thread. The reason is that a simple classifier like an SVM cannot deal with sequences of inputs the way hidden Markov models can; it can only deal with fixed-size input data. With the limitation of the thread length, we can pad our data into a fixed-size vector, looking at the last n messages. Even easier, we can just aggregate the data. In the case of the location feature, we use the maximum over all messages. Aggregating data in pandas is straightforward.
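A hedged sketch of such an aggregation with pandas (the column names are assumptions, not the project's actual schema):

```python
import pandas as pd

# Message-level results with per-thread ids
messages = pd.DataFrame({
    "thread_id": [1, 1, 1, 2, 2],
    "p_location": [0.1, 0.8, 0.2, 0.05, 0.1],  # message classifier output
    "n_words": [12, 9, 4, 7, 5],
})

threads = messages.groupby("thread_id").agg(
    p_location_max=("p_location", "max"),  # was a location mentioned anywhere?
    n_messages=("p_location", "size"),     # thread length as a metadata feature
    n_words_mean=("n_words", "mean"),
)
print(threads)
```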

Now we feed all the data from the message-level classification, together with the additional data, into our classifier validation pipeline. We can also just use our metadata and the aggregated data as a feature vector.

We have built a machine learning pipeline to classify our thread-like text data without any neural networks. What we gain is good knowledge about the data: we can extract feature importances from our classifiers. For example, we can take a look at the random forest decision trees, which confirm that our approach with the message-level classification of locations works very well, because it is the most important feature. Or we can look at the message-level classification feature importances (words) and get some understanding of which words are the most significant.
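Reading such feature importances from a random forest can be sketched like this; the data is a toy stand-in and the feature names are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

feature_names = ["p_location_max", "n_messages", "n_words_mean"]
# Toy threads: the label depends mostly on the location feature here.
X = [[0.9, 5, 8], [0.8, 7, 6], [0.1, 4, 7], [0.2, 20, 5]] * 10
y = [1, 1, 0, 0] * 10

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Print the learned importances, highest first
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.2f}")
```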

Let's go deep

Deep neural networks are proposed as end-to-end machine learning, where you simply define your target labels, put in everything you have, and the network learns everything itself. Sounds promising, but…

to be continued in part 2….


