To compare what Luminoso’s classifiers detected with the true label (the label on the training data).
To see which labels overlap and may be causing ‘confusion’ in the classifier.
To fix those areas of confusion, so that the classifier is in a better state to push into production.
Reading a confusion matrix
An example of how to read a column in the above screenshot of a confusion matrix:
Cell D2 contains the label ‘local host installs’.
Column D shows how the classifier performed on documents labeled ‘local host installs’.
Of the 304 test documents (B8), 268 (B3) were classified correctly.
Four documents were classified as ‘installation’, which differs from their training data label, ‘local host installs’.
Determining correction needs
Generally, we recommend inspecting and potentially correcting a label if the accuracy is below 80%.
Accuracy is the number of correctly classified documents divided by the total number of test documents for that label. In the screenshot above, accuracy = B3/B8 = 268/304, or roughly 88%.
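The accuracy check can also be scripted. The following is a minimal Python sketch, not part of Luminoso’s product or API: the function name `label_accuracy` and the dictionary layout are assumptions made for illustration. It treats one column of the confusion matrix as a mapping from predicted label to document count. The counts 268 and 4 come from the example above; the remaining 32 documents are lumped into a hypothetical ‘other labels’ bucket because the screenshot’s other cells are not listed here.

```python
def label_accuracy(column, true_label):
    """Return correct / total for one true-label column of a confusion matrix.

    `column` maps each predicted label to the number of test documents
    that received that prediction (hypothetical layout for this sketch).
    """
    total = sum(column.values())
    if total == 0:
        raise ValueError("column has no test documents")
    return column.get(true_label, 0) / total


# Counts from the example: 268 correct, 4 classified as 'installation'.
# 'other labels' is a hypothetical bucket for the remaining 32 documents.
local_host_installs = {
    "local host installs": 268,
    "installation": 4,
    "other labels": 32,
}

accuracy = label_accuracy(local_host_installs, "local host installs")
needs_review = accuracy < 0.80  # the 80% guideline from the text

print(f"accuracy = {accuracy:.1%}, needs review: {needs_review}")
```

Here the label sits above the 80% guideline, so it would not be flagged for correction on accuracy alone.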
Also, if a single incorrect cell accounts for a large portion of a label’s incorrect classifications, investigate why that overlap or confusion exists. A ‘large portion’ is not an exact threshold, but 10-20% of the total classifications is a good rule of thumb.
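The rule of thumb for spotting a large area of confusion can be sketched the same way. This Python example is illustrative only: the function name `confused_cells`, the default threshold of 10%, and the sample labels and counts are all hypothetical. It also assumes that ‘total classifications’ means the total number of test documents for the label, which is one reasonable reading of the guideline.

```python
def confused_cells(column, true_label, threshold=0.10):
    """List incorrect cells whose share of the column total meets the threshold.

    `column` maps each predicted label to a document count for one true label.
    Returns (predicted_label, share) pairs, largest share first.
    """
    total = sum(column.values())
    if total == 0:
        return []
    flagged = []
    for predicted, count in column.items():
        if predicted == true_label:
            continue  # skip the correctly classified cell
        share = count / total
        if share >= threshold:
            flagged.append((predicted, share))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical column for a label called 'billing' (invented counts).
billing = {"billing": 75, "refunds": 20, "pricing": 5}

for predicted, share in confused_cells(billing, "billing"):
    print(f"'billing' is often classified as '{predicted}' ({share:.0%})")
```

In this invented column, only ‘refunds’ crosses the threshold, so that pair of labels would be the first place to look for overlapping training data.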