[Deep Learning] Combining numerical and text features in (deep) neural networks

In this post I will show how to combine features from natural language processing with traditional features (meta data) in one single Keras model (end-to-end learning). The solution is a model with multiple inputs.

Problem definition

Scientific data sets are usually limited to one single kind of data, for example text, images or numerical data. This makes a lot of sense, as the goal is to compare new with existing models and approaches. In practice, however, ML models often combine more than one data source and therefore deal with different kinds of data. To use end-to-end learning with neural networks, instead of manually stacking models, we need to combine these different feature spaces inside the neural network.

Let's assume we want to solve a text classification problem and we have additional meta data for each of the documents in our corpus. In simple approaches, where the document is represented by a bag of words, we could just append our meta data to the BoW vector and be done. But when using word embeddings it is a bit more complicated.
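For the bag-of-words case, the combination really is just a vector concatenation. A minimal sketch (the vocabulary size, word indices and meta values are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: a 100-word vocabulary and 10 numerical meta features.
bow_vector = np.zeros(100)
bow_vector[[3, 17, 42]] = 1.0   # three vocabulary words occur in the document
meta_features = np.array([0.5, 1.2, 0.0, 3.4, 0.1, 0.0, 2.2, 0.9, 0.3, 1.0])

# Appending the meta data yields one combined feature vector for the classifier.
combined = np.concatenate([bow_vector, meta_features])
print(combined.shape)  # (110,)
```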

Special Embeddings

The easiest solution is to add our meta data as additional special embeddings. In this case we need to transform our data into categorical features, because an embedding can only be present or absent. This works if we increase the vocabulary size by the number of additional features and treat them as additional words.

Example: our dictionary contains 100 words and we have 10 additional features. In this case we add 10 additional words to the dictionary. The sequence of embeddings now always starts with the meta data features, so we must increase our sequence length by 10. Each of these 10 special embeddings represents one of the added features.
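As a sketch of this trick (the token ids and the example document are made up), the meta features become new "words" with ids beyond the original vocabulary, prepended to every sequence:

```python
import numpy as np

vocab_size = 100   # original dictionary
n_meta = 10        # additional meta features

# The meta features become the new words 100..109 in the enlarged vocabulary.
meta_token_ids = np.arange(vocab_size, vocab_size + n_meta)

doc_tokens = np.array([5, 23, 7, 64])   # a tokenized example document
# Each sequence starts with the 10 special tokens, so it grows by 10 positions.
sequence = np.concatenate([meta_token_ids, doc_tokens])
print(len(sequence))  # 14
```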

There are several drawbacks to this solution: we can only represent categorical features, not continuous values, and, even more importantly, our embedding space mixes up NLP and meta data.

Multiple input models

A much better option is a model that can handle continuous data and works as a classifier on NLP features and meta data at the same time. This is possible with multiple inputs in Keras. Example:

from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, concatenate
from keras.models import Model
from keras import regularizers

# Two separate inputs: the token sequence and the 10 meta features
nlp_input = Input(shape=(seq_length,), name='nlp_input')
meta_input = Input(shape=(10,), name='meta_input')
emb = Embedding(output_dim=embedding_size, input_dim=100, input_length=seq_length)(nlp_input)
nlp_out = Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01)))(emb)
# Concatenate the LSTM output with the meta data and classify
x = concatenate([nlp_out, meta_input])
x = Dense(classifier_neurons, activation='relu')(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[nlp_input, meta_input], outputs=[x])

We use a bidirectional LSTM model and combine its output with the meta data. To do so, we define two input layers and treat them in separate branches (nlp_input and meta_input). Our NLP data goes through the embedding transformation and the LSTM layer. The meta data is used as it is, so we can just concatenate it with the LSTM output (nlp_out). This combined vector is then classified in a dense layer and finally passed through a sigmoid to the output neuron.
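For training, the two inputs are simply passed as a list. A minimal, self-contained sketch with dummy data (all shapes, layer sizes and hyperparameters here are assumptions for illustration, smaller than in the post):

```python
import numpy as np
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, concatenate
from keras.models import Model

seq_length, embedding_size, classifier_neurons = 20, 8, 16

nlp_input = Input(shape=(seq_length,), name='nlp_input')
meta_input = Input(shape=(10,), name='meta_input')
emb = Embedding(output_dim=embedding_size, input_dim=100)(nlp_input)
nlp_out = Bidirectional(LSTM(16))(emb)
x = concatenate([nlp_out, meta_input])
x = Dense(classifier_neurons, activation='relu')(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[nlp_input, meta_input], outputs=[x])
model.compile(loss='binary_crossentropy', optimizer='adam')

# Dummy training data: 8 documents of length 20 plus 10 meta features each.
X_nlp = np.random.randint(0, 100, size=(8, seq_length))
X_meta = np.random.rand(8, 10).astype('float32')
y = np.random.randint(0, 2, size=(8, 1))
model.fit([X_nlp, X_meta], y, epochs=1, batch_size=4, verbose=0)
```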

This concept is usable for any other domain where sequence data from RNNs is mixed with non-sequence data. The output of an LSTM represents the sequence in an intermediate space; in that sense, the output of the LSTM is also a special kind of embedding.


10 comments

[…] is even possible to add meta data (numerical) to the model, as described in the previous post. In this metadata the information about the author of each message can be stored and furthermore […]

Christoffer Refsgaard

If you were to add another LSTM layer or another NN layer, how would you need to modify the code above?

digital-thinking

Hi Christoffer,
if you want to stack the LSTM layers, you can do this:

# return_sequences=True feeds the full output sequence into the second LSTM
layer_1 = Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01), return_sequences=True))(emb)
nlp_out = Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01)))(layer_1)

Hi,
Awesome post! I was wondering how we can use an LSTM to perform text classification using numeric data. For example, suppose I have a dataframe with 11 columns and 100 rows, and columns 1-10 are the features (all numeric) while column 11 has sentences (targets). I gave a thought over the multiple input model which is described in this post which is quite useful when the features are a mix of numeric and text, but in this case, it would be like using the numeric features to predict the sentences. Any minimal example regarding this is highly appreciated.

digital-thinking

What do you mean exactly? Do you want to predict the sentences in terms of generating sentences, or just use the sentence as a class label?

Hi. I would like to predict the sentences based on the numeric features. So it would be something like using these numeric features, and the model would have sentences as targets instead of just an integer label. I can make it more precise with some code I am trying, but I am not sure why I am always getting the same sentences as output. To make this more clear: I have a data frame with many rows and columns (dimensions 21392×1973). I would like to use columns 1-1972 (which are all numeric) as features, while column 1973 (the last column in the dataframe) has sentences within. So basically, by providing the model with new features from a test set, I want to predict the sentences. I am trying this approach but feel that I am doing something wrong, especially with a major confusion in choosing the loss and activation (as I’ve one hot encoded the sentences in the last column and am inverse transforming them back in the end). Any suggestions and guidance would be highly appreciated. I regularly read your blogs, and am looking for an approach for a data-to-text based LSTM :-)

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from keras.models import load_model

df = pd.read_pickle('my_data.pkl') # Here, I am reading the data which is completely structured
df_features = df.iloc[:,:-1] # These are the numeric features as inputs
outputs_df = df.iloc[:,-1] # These are outputs (all sentences)
outputs_df = outputs_df.values # Converting the outputs to an array

# center and scale to ensure numeric features have values between 0-1 (as all have different ranges)
print("Center and Scaling taking place....")
scaler = MinMaxScaler(feature_range=(0, 1))
df_features = scaler.fit_transform(df_features)

# # one-hot encode the outputs (which are the sentences)
print("One hot encoding the outputs for training....")
onehot_encoder = OneHotEncoder()
encode_categorical = outputs_df.reshape(len((outputs_df)), 1)
outputs_encoded = onehot_encoder.fit_transform(encode_categorical).toarray()
print('outputs_encoded.shape after One Hot Encode:', outputs_encoded.shape)

X_train, X_test, y_train, y_test = train_test_split(df_features,outputs_encoded,test_size=0.30) #Splitting data into train and test set
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed

n_timesteps = len(X_train) #No. of time steps for training data
n_timesteps_test = len(X_test) #No. of time steps for test data
X_train = X_train.reshape(1,n_timesteps,1972)
y_train = y_train.reshape(1,n_timesteps,26)

X_test = X_test.reshape(1,n_timesteps_test,1972)
y_test = y_test.reshape(1,n_timesteps_test,26)

# create a sequence classification instance
def get_sequence(n_timesteps, X, y):
X = X.reshape(1, n_timesteps, 1972)
y = y.reshape(1, n_timesteps, 26)
return X, y

# define LSTM
model_lstm = Sequential()
model_lstm.add(LSTM(100, input_shape=(None, 1972), return_sequences=True))
model_lstm.add(TimeDistributed(Dense(26, activation='sigmoid')))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

# fitting model
X, y = get_sequence(n_timesteps, X_train, y_train)

# training LSTM
model_lstm.fit(X, y, epochs=100, batch_size=128, verbose=2)

# evaluate LSTM on test data
yhat_lstm_test = model_lstm.predict(X_test, verbose=0)
yhat_squeezed_lstm_test = np.squeeze(yhat_lstm_test, axis=0)

yhat_sentences_lstm_test = onehot_encoder.inverse_transform(yhat_squeezed_lstm_test) # These are the sentences, but weirdly all are generated as same, which is not the case in the y_test (original test sentences)

So basically, I would like to predict sentences in the sense of language generation with the LSTM. Thanks.

digital-thinking

It would go beyond the scope here to go through this in detail, sorry. Maybe I will come back to your example later on.
Basically, I would say that the data representation does not meet your requirements. Your label is a bag of words vector with 26 words * n_timesteps, and you are trying to predict a new bag of words vector for each timestep. But your LSTM predicts the first BoW vector after seeing only the first timestep of the input sequence (return_sequences). I guess this is not what you want; you need some internal representation of the whole sequence first (a sequence embedding).
I would rather use word embeddings instead of one-hot encoding, and a sequence to sequence model (Encoder/Decoder). Maybe this post helps a bit.
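A rough sketch of what such an encoder/decoder setup could look like (all sizes, names and the vocabulary are placeholders; the numeric feature vector is compressed into the decoder's initial state, and the decoder would be trained with teacher forcing on shifted target tokens):

```python
import numpy as np
from keras.layers import Input, LSTM, Dense, Embedding
from keras.models import Model

n_features, vocab_size, max_words, latent_dim = 1972, 500, 12, 64

# Encoder: compress the numeric feature vector into the decoder's initial state.
enc_in = Input(shape=(n_features,))
state_h = Dense(latent_dim, activation='tanh')(enc_in)
state_c = Dense(latent_dim, activation='tanh')(enc_in)

# Decoder: generate the sentence word by word, conditioned on the encoder state.
dec_in = Input(shape=(max_words,))
dec_emb = Embedding(vocab_size, latent_dim)(dec_in)
dec_seq = LSTM(latent_dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
out = Dense(vocab_size, activation='softmax')(dec_seq)

model = Model([enc_in, dec_in], out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```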

Thanks for your reply again. I would really appreciate a blog article on something similar with my example scenario, as it will be very helpful for me and other people looking for something like this. I got your idea of using word embeddings for the outputs, instead of binary one hot encoding and a Seq2Seq model, and a coherent blog article would be much appreciated! Cheers.
