Baselines

We converted the dataset so that it contains only single-turn dialogues. This allowed us to run our model on independent sentences.

To get an accurate evaluation of the "single turn" model, we trained and evaluated only on data that does not depend on a previous belief state. As a result, we took the WOZ dialogues and kept only the first "dialogue turn" of each dialogue instance.
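The filtering step described above can be sketched as follows. This is only an illustration: the field names ("dialogue", "transcript", "turn_label") and the toy data are assumptions about a WOZ-like JSON layout, not taken from our actual preprocessing code.

```python
def first_turns(dialogues, turns_key="dialogue"):
    """Return the opening turn of every dialogue that has at least one turn."""
    return [d[turns_key][0] for d in dialogues if d.get(turns_key)]


# Toy data in a hypothetical WOZ-like layout (all field names are assumptions).
toy_dialogues = [
    {"dialogue": [
        {"transcript": "cheap food in the north",
         "turn_label": [["price range", "cheap"], ["area", "north"]]},
        {"transcript": "any kind of food",
         "turn_label": [["food", "dontcare"]]},
    ]},
    {"dialogue": [
        {"transcript": "i want thai food",
         "turn_label": [["food", "thai"]]},
    ]},
    {"dialogue": []},  # empty dialogues are skipped
]

single_turn_data = first_turns(toy_dialogues)
```

Because the first turn has no dialogue history, its labels can be predicted from the utterance alone, which is what makes it usable for the single-turn setting.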

We tried multiple baseline approaches for single-turn prediction. Some of these baselines were discussed in the paper.

Baseline #1 - Constrained Update

The constrained update approach was discussed in the original paper about the neural belief tracker. The main idea is to constrain the weight matrix so that its components are not updated individually: the diagonal elements are updated together as one group, and the remaining (off-diagonal) elements are updated together as another. This gives the model fewer parameters and lets it be more confident in its beliefs given the decision maker's output.
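A minimal sketch of this weight tying, assuming each group shares a single scalar (a on the diagonal, b everywhere else). The function names and the example numbers are hypothetical and only serve to show the structure.

```python
def tied_matrix(n, a, b):
    """n x n matrix with `a` on the diagonal and `b` everywhere else."""
    return [[a if i == j else b for j in range(n)] for i in range(n)]


def matvec(W, y):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]


def constrained_scores(y, a, b):
    """Same product without building the matrix: a*y_i + b*(sum(y) - y_i)."""
    total = sum(y)
    return [a * v + b * (total - v) for v in y]


y = [2.0, 0.5, -1.0]   # hypothetical decision-maker outputs for three slot values
a, b = 1.5, -0.25      # hypothetical values of the two shared weights

full = matvec(tied_matrix(len(y), a, b), y)
short = constrained_scores(y, a, b)
```

Under this assumption the matrix has two free parameters regardless of how many values a slot has, and each value is scored with the same rule: its own evidence weighted by a, and the summed evidence for all other values weighted by b.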

Baseline #2 - Unconstrained Update

We tried the unconstrained approach to measure the difference that the constrained update makes in our model. The unconstrained update takes the output of the decision maker and passes it through a softmax layer to compute the beliefs for each slot model.
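As a rough illustration, this baseline amounts to normalizing the decision maker's scores directly. The scores below are made up, and the sketch assumes one score per candidate value of a slot.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


decision_scores = [2.0, 0.5, -1.0]   # hypothetical decision-maker outputs
beliefs = softmax(decision_scores)   # one probability per candidate value
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large scores.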

Baseline #3 - Fully Connected Update

The fully connected update approach was our own idea for a baseline. It applies an unconstrained weight matrix and a bias to the output of the decision maker. The result is then passed through a softmax layer to compute the beliefs for each slot model.
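The following sketch shows one way to read this baseline: an affine layer over the decision maker's scores, followed by a softmax. The weights, bias, and scores are invented for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected_update(y, W, bias):
    """Affine map W @ y + bias, then softmax to obtain beliefs."""
    logits = [sum(w * v for w, v in zip(row, y)) + c for row, c in zip(W, bias)]
    return softmax(logits)


def parameter_count(n):
    """Free parameters for n candidate values: n*n weights plus n biases."""
    return n * n + n


y = [2.0, 0.5, -1.0]                 # hypothetical decision-maker outputs
W = [[1.0, -0.5, 0.0],               # hypothetical, freely learned weights
     [0.2, 1.0, 0.1],
     [0.0, 0.3, 1.0]]
bias = [0.1, 0.0, -0.1]              # hypothetical bias vector

beliefs = fully_connected_update(y, W, bias)
```

With an identity matrix and a zero bias, this reduces to a plain softmax over the decision maker's scores; every other setting lets each value's belief depend on the evidence for all other values through its own weight.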

We tried different kinds of updates for the decision maker to compute the belief state and evaluated each model on single slot accuracy and joint slot accuracy. We trained the models for 30 epochs on the training set and evaluated them on the validation set. The results were as follows:

Error Analysis

The constrained model seems to produce the lowest error on joint slot prediction in the single-turn setting.
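For reference, the two metrics behind this comparison can be sketched as below. The sketch assumes the usual definitions: single slot accuracy is the fraction of turns in which one given slot is predicted correctly, and joint slot accuracy is the fraction of turns in which every slot is correct at once. The slot names and example predictions are hypothetical.

```python
def slot_accuracy(predictions, golds, slot):
    """Fraction of turns where the given slot matches the gold value."""
    correct = sum(p.get(slot) == g.get(slot) for p, g in zip(predictions, golds))
    return correct / len(golds)


def joint_accuracy(predictions, golds):
    """Fraction of turns where all slots match the gold values."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)


# Hypothetical gold labels and predictions for two turns.
golds = [
    {"food": "thai", "area": "north", "price range": "cheap"},
    {"food": "italian", "area": "south", "price range": "none"},
]
predictions = [
    {"food": "thai", "area": "north", "price range": "cheap"},
    {"food": "italian", "area": "north", "price range": "none"},
]

food_acc = slot_accuracy(predictions, golds, "food")
area_acc = slot_accuracy(predictions, golds, "area")
joint_acc = joint_accuracy(predictions, golds)
```

Joint accuracy can never exceed the lowest single slot accuracy, since a turn counts as correct only if no slot is wrong.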

Here is some error analysis for our best model (most of the other models got these examples wrong too):