Using a confusion matrix in Compass

Why use a confusion matrix?

  • To compare the labels Luminoso’s classifiers detected with the true labels (the tags on the training data)

  • To see which labels overlap and may be causing ‘confusion’ in the classifier

  • To fix those areas of confusion, so that the classifier is in a better state to push into production

Reading a confusion matrix

An example of how to read a column in the above screenshot of a confusion matrix:

  • In cell D2 you see the label ‘local host installs’.

  • Column D illustrates how classification for the label ‘local host installs’ performed.

  • You can see that out of 304 testing documents (B8), 268 (B3) were correctly classified.

  • 4 documents were classified with the label ‘installation’, which differed from their training data tag of ‘local host installs’.
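The column described above can be reproduced as a minimal sketch. The counts below mirror the screenshot (304 test documents tagged ‘local host installs’, 268 classified correctly, 4 classified as ‘installation’); the ‘other’ label standing in for the remaining misclassifications is a hypothetical placeholder, not a real label from the project.

```python
from collections import Counter

# Hypothetical test set mirroring the screenshot: 304 documents whose
# training-data tag is 'local host installs'.
true_labels = ["local host installs"] * 304
pred_labels = (
    ["local host installs"] * 268  # correctly classified (cell B3)
    + ["installation"] * 4         # confused with 'installation'
    + ["other"] * 32               # remaining misclassifications (placeholder)
)

# A confusion matrix is just a count of (true label, predicted label) pairs.
confusion = Counter(zip(true_labels, pred_labels))

print(confusion[("local host installs", "local host installs")])  # 268
print(confusion[("local host installs", "installation")])         # 4
```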

Determining correction needs

  • Generally, we recommend inspecting and potentially correcting a label if its accuracy is below 80%.

    • This accuracy is calculated by dividing the number of correctly classified documents by the total number of documents for that label; in the above screenshot, B3/B8 = accuracy (268/304, roughly 88%).

  • Also, if a single incorrect cell accounts for a large portion of the total incorrect classifications, investigate why that overlap/confusion exists. A ‘large portion’ is not an exact threshold, but 10-20% of the total classifications is a good rule of thumb.
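The two checks above can be sketched in a few lines. This assumes the confusion matrix is available as a mapping from (true label, predicted label) pairs to document counts; the labels and counts here are hypothetical, except for the ‘local host installs’ row taken from the screenshot.

```python
# Hypothetical confusion matrix: (true label, predicted label) -> count.
confusion = {
    ("local host installs", "local host installs"): 268,
    ("local host installs", "installation"): 4,
    ("local host installs", "other"): 32,
    ("installation", "installation"): 150,
    ("installation", "local host installs"): 50,
}

ACCURACY_THRESHOLD = 0.80  # inspect labels below 80% accuracy
LARGE_CELL_SHARE = 0.10    # flag cells holding 10%+ of a label's documents

labels = {true for true, _ in confusion}
for label in sorted(labels):
    # Total and correct counts for this label's row.
    total = sum(n for (t, _), n in confusion.items() if t == label)
    correct = confusion.get((label, label), 0)
    accuracy = correct / total
    if accuracy < ACCURACY_THRESHOLD:
        print(f"{label}: accuracy {accuracy:.0%}, inspect this label")
    # Flag any single incorrect cell taking a large share of the row.
    for (t, p), n in confusion.items():
        if t == label and p != label and n / total >= LARGE_CELL_SHARE:
            print(f"  {n} docs tagged '{t}' were classified as '{p}'")
```

With these numbers, ‘installation’ is flagged for accuracy (150/200 = 75%) and its confusion with ‘local host installs’ is a large cell, while ‘local host installs’ passes the accuracy check (268/304, roughly 88%).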