Background: Paragram Embeddings
Paragram embeddings are a type of word embedding focused on the semantics of individual words. They were created by imposing semantic similarity constraints from the Paraphrase Database (PPDB) on GloVe vectors in order to improve their semantic content.
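As a rough illustration of the kind of objective behind them (the margin, regularizer weight, and negative-sampling scheme below are schematic, not the exact settings from the Paragram papers by Wieting et al.): start from GloVe vectors, pull PPDB paraphrase pairs together, push sampled non-paraphrases apart, and penalize drifting too far from the initialization.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def paragram_style_loss(W, W_init, pairs, negatives, delta=0.4, lam=1e-5):
    """Schematic Paragram-style objective.

    W        : current word vectors, shape [vocab, dim]
    W_init   : the GloVe vectors used for initialization
    pairs    : list of (i, j) indices of PPDB paraphrase pairs
    negatives: dict mapping a word index to a sampled non-paraphrase index
    """
    loss = 0.0
    for i, j in pairs:
        # Hinge terms: each word should be closer to its paraphrase than
        # to its sampled negative, by at least the margin delta.
        loss += max(0.0, delta - cosine(W[i], W[j]) + cosine(W[i], W[negatives[i]]))
        loss += max(0.0, delta - cosine(W[i], W[j]) + cosine(W[j], W[negatives[j]]))
    # Regularizer: stay close to the original GloVe initialization.
    return loss / len(pairs) + lam * float(np.sum((W - W_init) ** 2))
```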
What did you try?
Following last week, we finally made the shift to multi-turn dialogues, since that is the setting we care about most and want to optimize for.
Instead of passing only the Paragram embedding representation of the user utterance into the model, we wanted to see whether adding ELMo embeddings to the utterance representation would bring any performance improvements.

We focused on modifying the user utterance representation.
By adding ELMo, we hoped to take advantage of its contextual understanding of words within the dialogues the user provides. By doing so, we hoped the accuracy of our belief state tracker would increase, as the model would have a better understanding of the user's dialogue.
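As a minimal sketch of what producing those contextual embeddings looks like in a TensorFlow 1.x setup (assuming the public TF-Hub ELMo module; our pipeline wires the embeddings into the utterance encoder rather than running them standalone like this):

```python
import tensorflow as tf
import tensorflow_hub as hub

# trainable=True exposes the scalar-mix parameters (per-layer weights and
# the task-specific scaling factor) described below.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

utterances = tf.placeholder(tf.string, shape=[None])
# "elmo" is the weighted sum of the LM layers: [batch, max_tokens, 1024]
elmo_embeddings = elmo(utterances, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vecs = sess.run(
        elmo_embeddings,
        feed_dict={utterances: ["i am looking for a cheap hotel in the centre"]})
    print(vecs.shape)  # (1, num_tokens, 1024)
```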
To do so, we had to make several modifications to our pipeline to support a trainable ELMo (softmax-normalized weights over the hidden representations from the language model, plus the task-specific scaling factor):
- We had to modify the CNN model to take in the Paragram + ELMo representation of the user utterance (see the first sketch after this list)
- We changed the convolution filters to support the wider, concatenated representation of the user dialogues
- Configuring our pipeline
- To run multiple experiments, we extended the existing code to support configs for all the different modes we cared about (see the second sketch after this list):
- Single turn vs. multi turn
- ELMo vs. no ELMo
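Putting the model changes above together, here is a minimal sketch of the modified utterance encoder (all names and dimensions are illustrative assumptions, not the exact values in our code): the ELMo layers are combined with softmax-normalized weights and a task-specific scaling factor, concatenated with the Paragram representation, and fed through the widened convolution filters.

```python
import tensorflow as tf

# Illustrative dimensions only.
MAX_LEN, PARAGRAM_DIM, ELMO_DIM, NUM_FILTERS, WIDTH = 50, 300, 1024, 100, 3

# Pre-computed Paragram lookups and the three ELMo LM layers
# (char-CNN output plus two biLSTM layers) for one user utterance.
paragram_utt = tf.placeholder(tf.float32, [None, MAX_LEN, PARAGRAM_DIM])
lm_layers = tf.placeholder(tf.float32, [3, None, MAX_LEN, ELMO_DIM])

# Trainable ELMo: softmax-normalized weights over the LM layers plus a
# task-specific scaling factor gamma.
layer_w = tf.nn.softmax(tf.get_variable("elmo_layer_weights", shape=[3]))
gamma = tf.get_variable("elmo_gamma", shape=[],
                        initializer=tf.ones_initializer())
elmo_utt = gamma * tf.reduce_sum(
    tf.reshape(layer_w, [3, 1, 1, 1]) * lm_layers, axis=0)

# Concatenate the two token-level views and convolve over the wider
# representation; the filters now span PARAGRAM_DIM + ELMO_DIM channels.
utt = tf.concat([paragram_utt, elmo_utt], axis=-1)
features = tf.layers.conv1d(utt, filters=NUM_FILTERS, kernel_size=WIDTH,
                            padding="same", activation=tf.nn.relu)
utt_summary = tf.reduce_max(features, axis=1)  # [batch, NUM_FILTERS]
```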
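And a hypothetical example of the kind of experiment config we mean (the keys and names here are made up for illustration; the real code has its own format), covering the single-turn/multi-turn and ELMo/no-ELMo axes:

```python
# Hypothetical experiment configs covering the two axes we toggle.
EXPERIMENTS = {
    "single_turn_paragram": {"multi_turn": False, "use_elmo": False},
    "single_turn_elmo":     {"multi_turn": False, "use_elmo": True},
    "multi_turn_paragram":  {"multi_turn": True,  "use_elmo": False},
    "multi_turn_elmo":      {"multi_turn": True,  "use_elmo": True},
}
```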
Exciting Results

We ran our experiment comparing ELMo vs. no ELMo for 120 epochs in the multi-turn setting and noticed some interesting results.
- The model with ELMo trained better than the one with just the Paragram embeddings. That is, it reached a higher accuracy in fewer epochs.
- Unfortunately, we couldn't include a graph for this: ELMo takes longer to train and we weren't caching our model at each epoch, and re-running with per-epoch checkpoints would have taken too long.
- We also see some interesting behavior in the results:
- Our ELMo model reduces the number of false positives by ~28%
- Our ELMo model reduces the number of false negatives by ~14%
Issues we ran into
Engineering Efforts
- Running ELMo on the GPU
- There was also a lot of engineering effort involved in getting ELMo to run on the provided GPU machines: our TensorFlow version only supported CUDA 10.0, while the GPU machines had CUDA 10.1, so we had to install the entire CUDA 10.0 toolkit in our personal directory to make it work
- The GPU machines also lacked the correct cuDNN version, so we had to configure that as well
- One consequence of adding ELMo is that training the model takes much longer:
- Without ELMo: ~2 hours
- With ELMo: ~5 hours