Sometimes your input data does not really fit into the standard approaches. In this post I will show how to build a Keras model that operates on sequences of sequences. To be precise, the model is built to process chat dialogs between people.

## Approach

Let's assume we have a bunch of threads, each consisting of messages, and we want to predict a label for the whole thread. In another post I wrote about a similar problem and used some simple techniques to merge the sequences together. But merging these sequences is not always possible, and it can even decrease model performance.

With some adjustments, it is possible to build a Keras model that can handle sequences of sequences of word embeddings. Every message consists of a sequence of word embeddings, and a thread consists of a sequence of messages. To handle that, we can just stack LSTMs.

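```python
from keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense
from keras.models import Model

# One sample is a thread: number_of_messages messages of seq_length word ids each
nlp_seq = Input(shape=(number_of_messages, seq_length), name='nlp_input')
emb = TimeDistributed(Embedding(input_dim=num_features, output_dim=embedding_size,
                                input_length=seq_length, mask_zero=True,
                                input_shape=(seq_length,)))(nlp_seq)
# Encode every message into one vector, then run an LSTM over the message vectors
msg_out = TimeDistributed(Bidirectional(LSTM(32)))(emb)
x = LSTM(32)(msg_out)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=nlp_seq, outputs=out)
```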

The most important layer here is the TimeDistributed layer. It applies the inner layer (e.g. the embedding layer) to every single timestep, which in this model means to every message. The key point is that the weights are shared, so the same layer is used for every timestep. It essentially does the same as iterating through all timesteps and applying the layer to each of them.
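To make the weight sharing concrete, here is a minimal NumPy sketch. It is not Keras code: the "inner layer" is a hypothetical dense layer with a single weight matrix `W`, and all sizes are made up for illustration. It only shows that looping over the timesteps with one shared layer gives the same result as applying that layer to the whole sequence at once.

```python
import numpy as np

rng = np.random.default_rng(0)
num_messages, seq_length, units = 4, 6, 3        # illustrative sizes (assumptions)

x = rng.normal(size=(num_messages, seq_length))  # one sample: 4 timesteps (messages)
W = rng.normal(size=(seq_length, units))         # ONE weight matrix, shared by all timesteps
b = np.zeros(units)

def inner_layer(step):
    """Hypothetical inner layer: a plain dense layer without activation."""
    return step @ W + b

# What TimeDistributed does conceptually: apply the same layer to each timestep
looped = np.stack([inner_layer(step) for step in x])

# Equivalent vectorized form: the shared weights are applied to all timesteps at once
vectorized = x @ W + b

print(looped.shape, np.allclose(looped, vectorized))
```

Because there is only one `W`, the number of parameters does not grow with the number of messages.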

The input data has to be 2D for each sample, i.e. a 3D tensor of shape (None, num_messages, seq_length) for a batch, to make this kind of prediction. The embedding layer converts the input data into a 4D tensor (None, num_messages, seq_length, embedding_size). The bidirectional LSTM is then applied to every message (TimeDistributed). Because it only returns its final output, the seq_length axis disappears and the result has shape (None, num_messages, 64). The 64 is the number of units (32) in the LSTM layer times two, because it is bidirectional (two LSTMs, one forward and one backward pass, whose outputs are concatenated). Now that we have an inner representation for every message, we can apply another LSTM to the message sequence and finally get our result.
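The shapes can be traced with a small NumPy sketch. The helper `pad_threads`, the toy word ids, and all sizes are assumptions made for illustration; padding with zeros at the end of a message is one possible convention, chosen here because 0 is the value that `mask_zero=True` treats as padding. The LSTM layers are not simulated, only their effect on the shape is written down.

```python
import numpy as np

def pad_threads(threads, num_messages, seq_length):
    """Pad or truncate nested lists of word ids to a 3D integer array (0 = padding)."""
    batch = np.zeros((len(threads), num_messages, seq_length), dtype=np.int64)
    for t, thread in enumerate(threads):
        for m, message in enumerate(thread[:num_messages]):
            ids = message[:seq_length]
            batch[t, m, :len(ids)] = ids
    return batch

# Two toy threads: the first has two messages, the second has one
threads = [[[4, 8, 15], [16, 23]], [[42]]]
x = pad_threads(threads, num_messages=3, seq_length=4)   # (batch, num_messages, seq_length)

# Embedding lookup adds one axis: every word id becomes a vector
embedding_size, lstm_units = 8, 32
table = np.random.default_rng(0).normal(size=(50, embedding_size))
embedded = table[x]                                      # (batch, num_messages, seq_length, embedding_size)

# Shape bookkeeping only: the per-message bidirectional LSTM drops the word axis,
# the second LSTM drops the message axis
msg_shape = embedded.shape[:2] + (2 * lstm_units,)       # (batch, num_messages, 64)
thread_shape = embedded.shape[:1] + (lstm_units,)        # (batch, 32)

print(x.shape, embedded.shape, msg_shape, thread_shape)
```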

This architecture is quite heavy and takes a long time to train. To make it faster, we can use convolutional instead of recurrent layers: we just replace the second LSTM with 1D convolutions.

```python
from keras.layers import Conv1D, GlobalMaxPooling1D, concatenate

# Three parallel convolutions over the message axis, one per kernel size
c1 = Conv1D(filter_size, kernel1, padding='valid', activation='relu')(msg_out)
p1 = GlobalMaxPooling1D()(c1)
c2 = Conv1D(filter_size, kernel2, padding='valid', activation='relu')(msg_out)
p2 = GlobalMaxPooling1D()(c2)
c3 = Conv1D(filter_size, kernel3, padding='valid', activation='relu')(msg_out)
p3 = GlobalMaxPooling1D()(c3)
# Merge the pooled features of all three branches
x = concatenate([p1, p2, p3])
```

The kernel size represents the number of messages to look at locally (“dependencies between messages”). The filter_size is the number of different filters (“dependency policies”).
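What kernel size and number of filters mean for the shapes can be sketched in plain NumPy. This is a simplified stand-in for a 'valid' 1D convolution with ReLU (no bias, random weights), not the Keras implementation; the kernel sizes 2, 3 and 4 and the sizes of `msg_out` are assumptions for illustration.

```python
import numpy as np

def conv1d_valid_relu(x, kernels):
    """x: (num_messages, features); kernels: (kernel_size, features, filters)."""
    kernel_size, _, filters = kernels.shape
    steps = x.shape[0] - kernel_size + 1          # 'valid' padding: no window runs past the ends
    out = np.empty((steps, filters))
    for i in range(steps):
        window = x[i:i + kernel_size]             # kernel_size neighbouring messages
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(0)
msg_out = rng.normal(size=(10, 64))               # one thread: 10 message vectors of size 64
filter_size = 8                                   # number of filters per convolution

conv_shapes, pooled = [], []
for kernel_size in (2, 3, 4):
    kernels = rng.normal(size=(kernel_size, msg_out.shape[1], filter_size))
    c = conv1d_valid_relu(msg_out, kernels)
    conv_shapes.append(c.shape)
    pooled.append(c.max(axis=0))                  # global max pooling over the message axis

x = np.concatenate(pooled)
print(conv_shapes, x.shape)
```

A larger kernel sees more neighbouring messages per window and therefore produces fewer windows. Global max pooling removes that difference, which is why the three branches can be concatenated into one vector of size 3 * filter_size.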

## Adding metadata

It is even possible to add numerical metadata to the model, as described in the previous post. The model then looks as follows.