Re: [MUSIC-IR] Re: Re: How to compare classification results ("Bruno L. Giordano")


Subject: Re: [MUSIC-IR] Re: Re: How to compare classification results
From:    "Bruno L. Giordano"  <bruno.giordano@xxxxxxxx>
Date:    Wed, 2 Apr 2008 14:19:16 -0400
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Hello Geoffroy,

A late reply. I agree that it is important to take the chance classification rate into account. I do not agree, however, with the way in which you compute it (chance rate = 100% / number of classes).

One first thing you should take into account is the ratio of the number of objects to the number of classes. Think about the case where you have 10 objects to classify into 10 classes: any decent algorithm will do a perfect job of separating the classes. Clearly, a 100% classification rate in this case is not as good as a 100% classification rate when you have 10 objects and 2 classes.

Also, you should consider the complexity of your classifier: the more flexible the boundaries it draws (e.g., linear vs. quadratic), the more likely it is to achieve good classification performance.

What I would suggest is that you compute the chance classification rate using a resampling approach (message me for details), and adjust the observed classification rate so that 0 = chance and 1 = perfect. (A rough sketch of this idea appears after the quoted thread below.)

I am not an expert in machine learning, but I hope this helps.

Best,
    Bruno

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Bruno L. Giordano, Ph.D.
Music Perception and Cognition Laboratory
CIRMMT
Schulich School of Music, McGill University
555 Sherbrooke Street West
Montréal, QC H3A 1E3, Canada
Office: +1 514 398 4535 ext. 00900
http://www.music.mcgill.ca/~bruno/

Geoffroy Peeters wrote:
> Thanks all for your replies. I completely agree on the importance of taking
> into account the specific statistical properties of the data sets, and on
> the use of the F-measure.
>
> In fact, what I am interested in is precisely the case where the
> experiments are comparable (i.e., the statistical properties of the
> underlying classes are the same: same separability, ...) but the number
> of classes differs.
> The question I mentioned is the same whether you use recall or the F-measure.
>
> Example: I use the same test set with the same algorithm and the same
> measure, but in one case I consider a two-class problem and in the other
> case a three-class problem; how do I compare the results?
>
> Best regards,
> Geoffroy Peeters
>
> Xavier Amatriain wrote:
>> Hi,
>>
>> When evaluating classification methods, especially if the classes are
>> imbalanced, the recognition rate is not a good measure.
>> Two common measures are recall = TP / (TP + FN) and precision = TP / (TP + FP),
>> with TP = true positives, FN = false negatives and FP = false positives.
>>
>> Even better, the "F-measures" are able to summarize both recall and
>> precision in a single number.
>>
>> You can find more on this, for instance, in the paper "Evaluating
>> metrics for Hard Classifiers":
>>
>> www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps
>>
>> Kris West wrote:
>>> Hi Geoffroy,
>>>
>>> My two pence:
>>>
>>> The number of times better than random is a reasonable statistic in
>>> machine learning. However, it is never truly possible to compare
>>> classification experiments on completely different datasets; hence,
>>> people usually report accuracy statistics on a single dataset and use
>>> 'times better than random' to look at how powerful the learning
>>> technique was. It is not a great statistic, though, and cannot really
>>> be used to compare systems across different datasets. If you have a
>>> number of measurements of it (across multiple algorithms all tested on
>>> the same datasets), they can be used as data points to estimate the
>>> significance of the difference between algorithms and variants of them
>>> (i.e., I might do a Student's t-test to determine whether my variant of
>>> C4.5 was always significantly better than the standard version).
>>> However, you still have to measure the statistic on the same datasets,
>>> and you are really just using this statistic as a normalisation of the
>>> accuracy scores.
>>>
>>> To better understand why you can't compare these scores across
>>> datasets, consider the situation where one dataset is close to linearly
>>> separable, while the other is not. These properties may arise from
>>> different feature sets, different example tracks, or a combination of
>>> the two. The linearly separable case will get good results using linear
>>> classifiers (e.g., LDA or SMO with a first-order polynomial kernel).
>>> The second dataset might not be linearly separable but might contain
>>> nice contiguous regions of particular classes. Hence, a decision-tree
>>> model or a lazy classifier might do really well here, while the linear
>>> classifiers do not. Such situations are possible in genre
>>> classification experiments, for example (a dataset of Classical,
>>> Electronic and Heavy Metal tracks might be linearly separable, whereas
>>> one of Jazz, Blues and Country is not; alternatively, you might switch
>>> from an MFCC-based feature set to a beat histogram and achieve similar
>>> effects). In this situation, using the scores of the linear classifiers
>>> on the first dataset, you would significantly overestimate their
>>> performance on the second dataset.
>>>
>>> So, to summarise, you have to fix at least one variable to make
>>> comparisons, and the comparisons you can make depend on the variable
>>> fixed. This can mean fixing the dataset (and learning about
>>> algorithms), fixing the algorithm (and learning about datasets), or
>>> some other suitable setup. Hence, if you use the same algorithm in
>>> tests on two different datasets, 'times better than random' can tell
>>> you how hard the algorithm found each dataset... but not really much
>>> more than the accuracy told you. An algorithm A that did well in a
>>> small test might be outperformed by algorithm B on the larger test; the
>>> only useful thing 'times better than random' can tell you here is
>>> whether it stayed constant despite the change in data size (hence the
>>> algorithm might keep scaling up with fairly constant performance).
>>>
>>> K
>>>
>>> Geoffroy Peeters wrote:
>>>
>>>> Dear all,
>>>>
>>>> Has anyone already dealt with the comparison of classification results
>>>> coming from experiments using different numbers of classes?
>>>> In other words: how do you compare a recognition rate X coming from an
>>>> experiment with N classes to a recognition rate Y coming from an
>>>> experiment with M classes?
>>>>
>>>> I guess one possibility is to compute, for both, the ratio of the
>>>> obtained recognition rate to the random recognition rate (which
>>>> depends on the number of classes).
>>>> Example:
>>>> - a recognition rate of 50% for 2 classes would give 1 (50% / 50%)
>>>> - a recognition rate of 50% for 4 classes would give 2 (50% / 25%)
>>>> So this would lead to the conclusion that the second system performs
>>>> better.
>>>>
>>>> However, this measure has the drawback that it favors experiments with
>>>> a large number of classes:
>>>> a 2-class problem will never exceed a ratio of 2 (100% / 50%)!
>>>>
>>>> Thanks for any suggestions of references.
>>>>
>>>> Best regards
>>>> Geoffroy Peeters
>>>
>>
>
> --
> Geoffroy Peeters
> Ircam - R&D
> tel: +33/1/44.78.14.22
> email: peeters@xxxxxxxx
> <http://www.ircam.fr>
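P.S. For concreteness, here is a minimal sketch of the kind of resampling-based chance estimate and chance-corrected score mentioned above. It is only an illustration, not necessarily the exact procedure intended: it assumes Python with numpy and scikit-learn, uses label permutation as the resampling scheme, and the data, the linear SVM classifier and all parameter values are placeholders.

    # Sketch: estimate the chance classification rate by resampling
    # (label permutation) and rescale the observed accuracy so that
    # 0 = chance and 1 = perfect. All data and settings are placeholders.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 12))        # placeholder feature matrix
    y = rng.integers(0, 3, size=100)      # placeholder labels, 3 classes
    clf = LinearSVC()                     # placeholder classifier

    # Observed cross-validated classification rate.
    observed = cross_val_score(clf, X, y, cv=5).mean()

    # Chance level: rerun the same pipeline with shuffled labels. The mean
    # over permutations reflects the actual sample size, class balance and
    # classifier flexibility, rather than assuming 100% / number of classes.
    n_perm = 200
    perm_scores = np.empty(n_perm)
    for i in range(n_perm):
        perm_scores[i] = cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
    chance = perm_scores.mean()

    # Geoffroy's ratio-to-random, and the adjusted score (0 = chance, 1 = perfect).
    ratio_to_random = observed / chance
    adjusted = (observed - chance) / (1.0 - chance)
    print(observed, chance, ratio_to_random, adjusted)

scikit-learn's permutation_test_score implements a very similar label-permutation scheme (and also returns a p-value), if you prefer not to write the loop yourself.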
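Similarly, a tiny worked example of the recall / precision / F-measure definitions quoted above, with one class treated as "positive"; the TP/FP/FN counts are invented purely to show the arithmetic.

    # Sketch: recall, precision and F-measure from raw counts.
    # The counts below are invented for illustration.
    tp, fp, fn = 40, 10, 20               # hypothetical TP, FP, FN

    recall = tp / (tp + fn)               # 40 / 60 = 0.667
    precision = tp / (tp + fp)            # 40 / 50 = 0.800
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, about 0.727

    print(recall, precision, f1)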
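Finally, a sketch of the kind of paired comparison Kris describes: checking whether one algorithm's scores (accuracies, or 'times better than random' values) are consistently higher than another's when both are measured on the same datasets or folds. The score lists are invented, and scipy is assumed to be available.

    # Sketch: paired Student's t-test over per-dataset scores for two
    # variants of an algorithm evaluated on the SAME datasets.
    # All numbers are invented for illustration.
    from scipy.stats import ttest_rel

    scores_variant  = [0.81, 0.74, 0.69, 0.88, 0.77]   # e.g. a modified C4.5
    scores_baseline = [0.78, 0.71, 0.70, 0.84, 0.73]   # e.g. the standard C4.5

    t_stat, p_value = ttest_rel(scores_variant, scores_baseline)
    print(t_stat, p_value)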


This message came from the mail archive
http://www.auditory.org/postings/2008/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University