# LandscapeClassificationMatrix

I am running Windows Vista and the latest SimpleCV version, and I am currently running MLTestSuite with my own dataset. Since I started working more intensively with SimpleCV, this issue has come up:

I am training my dataset with the SVMClassifier in MLTestSuite, using three paths and three classes, e.g. beach, forest, waterbody:

    print('Train')
    classifierSVMP = SVMClassifier(extractors, props)
    classifierSVMP.train(path, classes, savedata="/Users/Arenzky/Desktop/testtab/", disp=display, subset=n)  # train


Then I have a test dataset. The test dataset has paths to images, but the images have no defined classes: they do not belong to a specific feature such as forest or waterbody, e.g. folder1, folder2, ...

QUESTION 1:

Let's say we have 15 images from the test path. How many images (from folder1, 2 or 3) can be classified as beach, how many as waterbody and how many as forest, based on my trained dataset?

QUESTION 2:

    print('Test')
    [pos, neg, confuse] = classifierSVMP.test(path, classes, savedata="/Users/Arenzky/Desktop/testtab/", disp=display, subset=n)
    files = []


Further, to my knowledge "confuse" holds a so-called confusion matrix. Is there a way to get a matrix combining the classes from the train dataset and the folders from the test dataset, such as:

            beach  forest  water_body   (train)
    folder1
    folder2
    folder3
    (test)


I am not sure whether I am on the right track, but MLTestSuite seemed to me the most feasible template to work with. Can anybody give me some help or a tip?

thanks Daniel



I am not 100% clear on your question, but I will try to steer you in the right direction. The general workflow with machine learning is as follows:

1. Train your classifier on a training set of data.
2. Use a separate data set to quantify the performance of your classifier on new data. What happens is that your classifier can over-fit itself to your training data, and you want to get an idea of how it works on things it hasn't seen before.
3. Iterate using this loop until you achieve the desired performance.
4. Deploy the classifier in the wild.
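The steps above can be sketched in plain Python. The "classifier" here is a deliberately trivial stand-in (it just remembers the most common training label), so none of this is SimpleCV-specific; the point is the train/evaluate loop structure:

```python
# Generic sketch of steps 1-3: train on one set, measure on held-out data.
# The trivial "model" below is a stand-in, not a real classifier.

def train(samples):
    # stand-in "training": remember the most common label in the training set
    labels = [label for _, label in samples]
    return max(set(labels), key=labels.count)

def evaluate(model, samples):
    # fraction of held-out samples the model labels correctly
    hits = sum(1 for _, label in samples if model == label)
    return hits / float(len(samples))

train_set = [(1, "apple"), (2, "apple"), (3, "orange")]
test_set = [(4, "apple"), (5, "orange")]

model = train(train_set)
accuracy = evaluate(model, test_set)
print(model, accuracy)  # apple 0.5
```

If the held-out accuracy is not good enough, you go back to step 1 with different features or parameters; that is the iteration in step 3.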

The confusion matrix is a tool that you use to reason about how your classifier performs. What the matrix tells you is how confusing each class is to your classifier. Say you used color to train a classifier for apples and oranges; the confusion matrix tells you:

1. How many apples were labeled apples.
2. How many oranges were labeled oranges.
3. How many apples were labeled oranges.
4. How many oranges were labeled apples.

This would look something like the following (this matrix may or may not be transposed):

        | A  | O  |   <-- columns are the "truth" values, what we know from the labels we gave everything
      A | 8  | 2  |   <-- we labeled 10 things apples: 8 are actually apples, 2 are really oranges
      O | 3  | 7  |   <-- we labeled 10 things oranges: 3 were actually apples, the other 7 were oranges

The rows are what each thing was classified as.


From this matrix we can infer that we tested 20 things: 11 were apples and 9 were oranges. We got (2+3)=5 things labeled incorrectly, meaning we were 25% incorrect, and (8+7)=15 things labeled correctly, meaning we were 75% correct. 8 of the 11 apples and 7 of the 9 oranges were labeled correctly.
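That arithmetic can be checked with a few lines of plain Python (no SimpleCV needed). Rows are the predicted label and columns the true label, matching the apples/oranges matrix above:

```python
# The apples/oranges confusion matrix: rows = predicted, columns = true.
confusion = {
    "apple":  {"apple": 8, "orange": 2},   # 10 things predicted "apple"
    "orange": {"apple": 3, "orange": 7},   # 10 things predicted "orange"
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)   # diagonal entries
accuracy = correct / float(total)
print(total, correct, accuracy)   # 20 15 0.75

# Per-class recall: of all true apples, how many did we label "apple"?
true_apples = sum(confusion[pred]["apple"] for pred in confusion)
apple_recall = confusion["apple"]["apple"] / float(true_apples)   # 8 of 11
```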

Now if you have a lot of oranges being labeled as apples, but not a lot of apples being labeled as oranges, you can infer that perhaps your apple classifier shows too much preference for orange colors, and if you change that threshold you can get better results.

To get back to your first question: your test set must be labeled just like your train set. The labels matter because this set is used to test your classifier. That is, the test set needs to have a few (really a lot, usually hundreds of) images in each of the class folders. There is no bound on how many items the classifier will assign to a particular class. Generally, if all of your images end up with the same label, you are doing something wrong, or your selection of features or classifiers is a poor fit for that data (i.e. you need to re-evaluate your system).
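Concretely, a labeled test set mirrors the training layout: one subfolder per class, named the same as the training classes. A small sketch (the root folder name "testset" is hypothetical):

```python
# Sketch of a labeled test-set layout: one subfolder per class, class names
# matching the training set. "testset" is a hypothetical root directory.
import os

classes = ["beach", "forest", "waterbody"]
test_root = "testset"
test_paths = [os.path.join(test_root, c) for c in classes]

# These (path, class) pairs are what a test call would then be given.
print(list(zip(test_paths, classes)))
```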

The second question I think should be clear from this discussion. Generally the confusion matrix should be additive, so you can ...


Now I will try to give a more detailed explanation: I have a dataset composed of images: images of trees (trees-folder), images of roses (roses-folder) and images of a house (house-folder).

The test dataset is composed of folders 1, 2, 3, 4 and 5. I know from my research that the images in the five folders belong to those object classes, but they are disordered: the images are not labeled with the feature they represent, they are labeled with numbers. For example, I do not know how many trees, roses or houses are in folder2. This step is very important, because

1. the images in the folders are grouped according to a specific geographic location, and
2. the task of identifying features in an image was originally defined empirically by an expert (a human being) for each of the test folders. In other words, I want SimpleCV to be my expert and thereby (partly) automate this classification process.

Just a suggestion: as you explained in your answer, does it make sense to label each folder in the test dataset in the same way as the trained dataset, regardless of the content of the folder?
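For the folder-by-class tally described here, one approach is to run the trained classifier over each unlabeled folder and count the predicted classes. The sketch below uses a stand-in `predict` function with made-up file names; with SimpleCV you would classify each loaded image with your trained classifier instead (the exact call is an assumption to check against your MLTestSuite version):

```python
# Tally how many images in each unlabeled test folder get each trained class.
# `predict` is a stand-in for a real per-image classification call; the file
# names and predictions below are hypothetical illustration data.
from collections import Counter

classes = ["beach", "forest", "waterbody"]

def predict(image_name):
    # stand-in predictions keyed by hypothetical file names
    fake = {"a.jpg": "beach", "b.jpg": "beach", "c.jpg": "forest",
            "d.jpg": "waterbody", "e.jpg": "forest"}
    return fake[image_name]

test_folders = {
    "folder1": ["a.jpg", "b.jpg"],
    "folder2": ["c.jpg", "d.jpg"],
    "folder3": ["e.jpg"],
}

# one Counter per folder: predicted class -> number of images
matrix = {folder: Counter(predict(f) for f in files)
          for folder, files in test_folders.items()}

for folder in sorted(matrix):
    print(folder, [matrix[folder][c] for c in classes])
# folder1 [2, 0, 0]
# folder2 [0, 1, 1]
# folder3 [0, 1, 0]
```

This gives exactly the folder (test) by class (train) matrix shape asked about in the original question, without needing the test folders to be labeled.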
