
I am not 100% clear on your question, but I will try to steer you in the right direction. The general workflow with machine learning is as follows:

1. Train your classifier on a training set of data.
2. Use a separate data set (the test set) to quantify the performance of your classifier on new data. A classifier can over-fit to its training data, so you want to get an idea of how it works on things it hasn't seen before.
3. Iterate using this loop until you achieve the desired performance.
4. Deploy the classifier in the wild.
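As a rough illustration of steps 1 and 2, here is a minimal sketch in plain Python. Everything in it is an assumption made up for the example: the synthetic "hue" feature, the `make_fruit` helper, and the nearest-mean classifier. It is not tied to any particular library; the only point is that the classifier is scored on data it was not trained on.

```python
import random

def make_fruit(n, seed):
    """Generate n hypothetical (hue, label) samples; apples get a lower hue."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.choice(["apple", "orange"])
        center = 10.0 if label == "apple" else 30.0
        samples.append((rng.gauss(center, 8.0), label))
    return samples

def train(samples):
    """Learn one mean hue per class (a nearest-mean classifier)."""
    sums, counts = {}, {}
    for hue, label in samples:
        sums[label] = sums.get(label, 0.0) + hue
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, hue):
    """Return the class whose mean hue is closest to the given hue."""
    return min(model, key=lambda label: abs(model[label] - hue))

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(model, hue) == label for hue, label in samples)
    return correct / len(samples)

data = make_fruit(400, seed=0)
train_set, test_set = data[:300], data[300:]  # disjoint training and test data

model = train(train_set)                      # step 1: train
print("train accuracy:", accuracy(model, train_set))
print("test accuracy: ", accuracy(model, test_set))  # step 2: score on unseen data
```

If the test accuracy is much lower than the training accuracy, that gap is the over-fitting mentioned in step 2.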

The confusion matrix is a tool that you use to reason about how your classifier performs. It tells you how often your classifier confuses each class with the others. Say you used color to train a classifier for apples and oranges. The confusion matrix tells you:

1. How many apples were labeled apples.
2. How many oranges were labeled oranges.
3. How many apples were labeled oranges.
4. How many oranges were labeled apples.

This would look something like the following (this matrix may or may not be transposed). The columns are the "truth" values, which we know from the labels that we gave everything. The rows are what each thing was classified as.

| Classified as | Truth: A | Truth: O | Meaning |
|---|---|---|---|
| A | 8 | 2 | We labeled 10 things apples; 8 are actually apples and 2 are really oranges |
| O | 3 | 7 | We labeled 10 things oranges; 3 were actually apples and the other 7 were oranges |


From this matrix we can infer that we tested 20 things: 11 were apples and 9 were oranges. We got (2+3)=5 things labeled incorrectly, meaning we were 25% incorrect. We got (8+7)=15 things labeled correctly, meaning we were 75% correct. Per class, 8 of the 11 apples and 7 of the 9 oranges were labeled correctly.
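The same arithmetic can be written out as a short sketch. The variable names are just illustrative, and the layout follows the matrix above: rows are what things were classified as, columns are the truth.

```python
# Rows are predicted labels, columns are true labels.
labels = ["apple", "orange"]
matrix = [
    [8, 2],  # classified as apple:  8 true apples, 2 true oranges
    [3, 7],  # classified as orange: 3 true apples, 7 true oranges
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))  # the diagonal
accuracy = correct / total

# Column sums give how many items of each true class were tested.
true_counts = [sum(row[j] for row in matrix) for j in range(len(labels))]

# Per class: correctly labeled items divided by the true items of that class.
per_class = {labels[j]: matrix[j][j] / true_counts[j] for j in range(len(labels))}

print(total, correct, accuracy)  # 20 15 0.75
print(true_counts)               # [11, 9]
print(per_class)
```

The per-class fractions come out to 8/11 for apples and 7/9 for oranges, which is often more informative than the single 75% figure.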

Now if you have a lot of oranges being labeled as apples, but not a lot of apples being labeled as oranges, you can infer that perhaps your apple classifier shows too much preference for orange colors, and that changing that threshold may give you better results.
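To make the threshold idea concrete, here is a small sketch with made-up hue values and a hypothetical rule: anything with a hue below the threshold is called an apple. The numbers are invented purely for illustration.

```python
def confusion(samples, threshold):
    """Rows: predicted (apple, orange); columns: truth (apple, orange)."""
    matrix = [[0, 0], [0, 0]]
    for hue, truth in samples:
        predicted = 0 if hue < threshold else 1  # 0 = apple, 1 = orange
        matrix[predicted][truth] += 1
    return matrix

# Hypothetical measurements as (hue, true class), with 0 = apple, 1 = orange.
samples = [(5, 0), (8, 0), (12, 0), (15, 0), (18, 0),
           (17, 1), (20, 1), (22, 1), (28, 1), (35, 1)]

loose = confusion(samples, threshold=25)  # accepts too many orange hues as apple
tight = confusion(samples, threshold=19)  # stricter apple rule

print(loose)  # [[5, 3], [0, 2]]
print(tight)  # [[5, 1], [0, 4]]
```

With the loose threshold, 3 oranges end up labeled as apples and no apples are labeled as oranges, which is the lopsided pattern described above. Tightening the threshold drops that to 1 without losing any apples on this toy data.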

To get back to your first question, this means that your test set must be labeled just like your training set. The labels matter because this set is used to test your classifier. That is to say, the test set needs to have a few (well, really a lot, usually hundreds) images in each of the class folders. There is no bound on how many images the classifier will assign to a particular class. Generally, if all of your images end up with the same label, you are doing something wrong, or your selection of features or classifiers sucks for that data (i.e. you need to re-evaluate your system).

The second question, I think, should be clear from this discussion. Generally, confusion matrices are additive, so you can add the two together element by element, but this is not a good idea as it doesn't really tell you anything more.
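For what it's worth, adding two confusion matrices is just an element-wise sum. A minimal sketch, where the second matrix is a made-up example and `add_matrices` is a hypothetical helper:

```python
def add_matrices(a, b):
    """Element-wise sum of two confusion matrices with the same layout."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

first = [[8, 2], [3, 7]]   # the matrix from the example above
second = [[9, 1], [2, 8]]  # hypothetical second matrix, same row/column order

combined = add_matrices(first, second)
print(combined)  # [[17, 3], [5, 15]]
```

The combined matrix only pools the counts. It hides how each individual run did, so you are usually better off looking at the two matrices separately.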