Thanks all for your replies. I completely agree on the importance of taking the statistical properties of the datasets into account and of using the F-measure.
In fact, what I'm interested in is precisely the case where the experiments are comparable (i.e. the statistical properties of the underlying classes are the same: same separability, etc.) but the number of classes differs.
The question I mentioned is the same whether you use recall or the F-measure.
Example: I use the same test set with the same algorithm and the same measure, but in one case I consider a two-class problem and in the other case a three-class problem; how do I compare the results?
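To make the setup concrete, here is a rough Python sketch (the labels are invented and sklearn is assumed; it only restates the question, not an answer): the same predictions are scored once as a three-class problem and once with two of the classes merged.

from sklearn.metrics import recall_score

# Invented example: the same test set scored as a 3-class problem
# and as a 2-class problem (classes 1 and 2 merged).
y_true_3 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
y_pred_3 = [0, 1, 2, 0, 2, 2, 1, 1, 2, 0]

merge = {0: 0, 1: 1, 2: 1}
y_true_2 = [merge[y] for y in y_true_3]
y_pred_2 = [merge[y] for y in y_pred_3]

print(recall_score(y_true_3, y_pred_3, average="macro"))  # 3-class macro recall
print(recall_score(y_true_2, y_pred_2, average="macro"))  # 2-class macro recall
# The chance levels differ (1/3 vs 1/2), so how should these two numbers be compared?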
Best regards
Geoffroy Peeters
Xavier Amatriain wrote:
Hi,
When evaluating classification methods, especially if the classes are
imbalanced, recognition rate is not a good measure.
Some common measures are recall = TP / (TP + FN) and precision = TP / (TP + FP),
with TP = true positives, FN = false negatives and FP = false positives.
Even better, the "F-measures" summarize both recall and precision into a single number.
You can find more on this, for instance, in the paper "Evaluating
metrics for Hard Classifiers"
www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps
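For reference, a minimal sketch of those formulas in Python (the counts are made-up and purely illustrative):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    # F_beta; beta = 1 gives the usual harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f_measure(tp=40, fp=10, fn=20))  # illustrative counts only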
Kris West wrote:
Hi Geoffroy,
My two pence:
The number of times better than random is a reasonable statistic in machine learning. However, it is never truly possible to compare classification experiments on completely different datasets, so people usually report accuracy statistics on a single dataset and use 'times better than random' to look at how powerful the learning technique was. Still, it is not a great statistic and can't really be used to compare systems across different datasets. If you have a number of measurements of it (across multiple algorithms, all tested on the same datasets), they can be used as data points to estimate the significance of the difference between algorithms and variants of them (e.g. I might do a Student's t-test to determine whether my variant of C4.5 was always significantly better than the standard version). However, you still have to measure the statistic on the same datasets, and you are really just using this stat as a normalisation of the accuracy scores.
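As a rough sketch of that workflow (the per-fold accuracies are invented and scipy is assumed; a paired t-test is one reasonable choice here since both algorithms see the same folds):

from scipy import stats

n_classes = 4
chance = 1.0 / n_classes

# Invented per-fold accuracies for two algorithms run on identical folds.
acc_c45     = [0.61, 0.58, 0.63, 0.60, 0.62]
acc_variant = [0.65, 0.62, 0.66, 0.63, 0.67]

# 'Times better than random' is just the accuracy normalised by the chance rate.
ratio_c45     = [a / chance for a in acc_c45]
ratio_variant = [a / chance for a in acc_variant]

t, p = stats.ttest_rel(ratio_variant, ratio_c45)
print(t, p)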
To better understand why you can't compare these scores across datasets, consider the situation where one dataset is close to linearly separable while the other is non-linearly separable. These properties may arise from different feature sets, different example tracks, or a combination of the two. The linearly separable case will get good results using linear classifiers (e.g. LDA or SMO with a first-order polynomial kernel). The second dataset might not be linearly separable but might contain nice contiguous regions of particular classes; hence a decision-tree model or lazy classifier might do really well there, while the linear classifiers do not. Such situations are possible in genre classification experiments, for example (a dataset of Classical, Electronic and Heavy Metal tracks might be linearly separable where one of Jazz, Blues and Country is not; alternatively, you might switch from an MFCC-based feature set to a beat histogram and achieve a similar effect). In this situation, using the scores for the linear classifiers on the first dataset, you would significantly overestimate their performance on the second dataset.
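A toy illustration of that effect on synthetic data (sklearn assumed; the generated datasets just stand in for the two hypothetical feature sets, not for any real genre data):

from sklearn.datasets import make_blobs, make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Stand-ins for a (nearly) linearly separable dataset and a non-linear one.
X_lin, y_lin = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=0)
X_non, y_non = make_moons(n_samples=300, noise=0.2, random_state=0)

for name, X, y in [("separable", X_lin, y_lin), ("non-separable", X_non, y_non)]:
    for clf in (LinearDiscriminantAnalysis(), DecisionTreeClassifier(random_state=0)):
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(name, clf.__class__.__name__, round(score, 2))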
So to summarise, you have to fix at least one variable to make comparisons, and the comparisons you can make depend on the variable fixed. This can mean fixing the dataset (and learning about algorithms), fixing the algorithm (and learning about datasets), or some other suitable combination. Hence, if you use the same algorithm in tests on two different datasets, 'times better than random' can tell you how hard the algorithm found each dataset... but not really much more than the accuracy told you. An algorithm A that did well in a small test might be outperformed by an algorithm B on the larger test; the only useful thing 'times better than random' can tell you here is whether it stayed constant despite the change in data size (hence whether the algorithm might keep scaling up with fairly constant performance).
K
Geoffroy Peeters wrote:
Dear all,
has anyone already dealt with comparing classification results coming from experiments using different numbers of classes?
In other words: how do you compare a recognition rate X coming from an experiment with N classes to a recognition rate Y coming from an experiment with M classes?
I guess one possibility is to compute, for both, the ratio of the obtained recognition rate to the random recognition rate (which depends on the number of classes).
Example:
- a recognition rate of 50% for 2 classes would give 1 (50%/50%)
- a recognition rate of 50% for 4 classes would give 2 (50%/25%)
So this would lead to the conclusion that the second system performs better.
However, this measure has the drawback that it favors experiments with a large number of classes:
a 2-class problem will never exceed a ratio of 2 (100%/50%)!
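In code the proposed ratio, and its ceiling for a 2-class problem, look like this (just the arithmetic above, nothing more):

def times_better_than_random(recognition_rate, n_classes):
    # Ratio of the observed recognition rate to the chance rate 1/n_classes.
    return recognition_rate / (1.0 / n_classes)

print(times_better_than_random(0.50, 2))  # 1.0  (50% / 50%)
print(times_better_than_random(0.50, 4))  # 2.0  (50% / 25%)
print(times_better_than_random(1.00, 2))  # 2.0  -- the maximum a 2-class problem can reach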
Thanks for any suggestions or references.
Best regards
Geoffroy Peeters