If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on; more on that later.)
The obvious way to compare two classification methods is to see how well they do against some ground truth (i.e., labelled data). But this is a one-parameter system: by changing the threshold of the classifier (or, if you like, translating the hyperplane around), you can trade off the false positive and false negative rates.
Now the more smug folks reading this are waiting with "ROC" and "AUC" at the tips of their tongues, and they'd be right! You can plot a curve of the false positive rate vs the false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.
For example, if the y-axis measured increasing false negatives and the x-axis measured increasing false positives, you'd want a curve that looked like an L with its apex at the origin, while a random classifier would look like the line x + y = 1. The AUC score would be zero for the good classifier and 0.5 for the random one (there are ways of scaling this to be between 0 and 1).
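Here's a minimal sketch of that picture, with made-up scores and labels and numpy only (in practice you'd reach for something like scikit-learn's roc_auc_score): it sweeps the threshold, traces the false-negative vs false-positive curve described above, and reports its area.

```python
# Minimal sketch, assuming synthetic labels/scores: area under the
# false-negative-rate vs false-positive-rate curve (0 = perfect, ~0.5 = random).
import numpy as np

def fn_fp_area(scores, labels):
    """Sweep the decision threshold and return the area under the FNR-vs-FPR curve."""
    thresholds = np.sort(np.unique(scores))[::-1]
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    fpr, fnr = [], []
    for t in np.concatenate(([np.inf], thresholds, [-np.inf])):
        pred = scores >= t                               # predict positive above the threshold
        fpr.append(((pred == 1) & (labels == 0)).sum() / neg)
        fnr.append(((pred == 0) & (labels == 1)).sum() / pos)
    order = np.argsort(fpr)                              # integrate with x increasing
    return np.trapz(np.array(fnr)[order], np.array(fpr)[order])

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
good = labels + 0.1 * rng.normal(size=1000)              # scores track the labels
random_scores = rng.normal(size=1000)                    # scores ignore the labels
print(fn_fp_area(good, labels))                          # close to 0
print(fn_fp_area(random_scores, labels))                 # close to 0.5
```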
The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and objective, seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.
But is it?
There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.
The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ($$ link) by David Hand (copy of paper here). His central result is a very surprising one:
The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.

To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (which could be different for the two kinds of misclassification). You can now write down an overall misclassification cost for any threshold used by a classifier, and you can compute the optimal threshold, the one that minimizes this cost. If you don't actually know the costs (as is typical), you can instead ask for the expected misclassification cost under some distribution over the costs.
If you run this computation through, what you end up with is a linear transformation of the AUC, in which the implied distribution over costs depends on the distribution of scores assigned by the classifier itself! In other words, as Hand puts it,
It is as if one measured person A's height using a ruler calibrated in inches and person B's using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used.

This is a rather devastating critique of the use of the AUC. While there has been pushback (a case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri, which is a very interesting read in its own right), the basic premise and argument are not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution over costs should be made explicit, rather than being left implicit (and thus dependent on the data and the classifiers).
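To make the explicit-cost idea concrete, here's a rough sketch (not Hand's derivation, and the Beta(2,2) weighting and all names are illustrative choices in the spirit of his H measure): fix a distribution over the cost ratio, and average the loss achieved at the cost-optimal threshold.

```python
# Rough sketch, assuming labels in {0,1}: expected minimum misclassification
# cost under an explicit, symmetric prior over the cost ratio c.
import numpy as np

def expected_min_cost(scores, labels, grid=201):
    pos, neg = labels == 1, labels == 0
    pi1, pi0 = pos.mean(), neg.mean()            # class priors
    thresholds = np.unique(scores)
    # error rates of "predict positive if score >= t" at each candidate threshold
    fnr = np.array([(scores[pos] < t).mean() for t in thresholds])
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
    cs = np.linspace(0.005, 0.995, grid)         # c = cost of a false negative, 1 - c = cost of a false positive
    min_loss = np.array([np.min(c * pi1 * fnr + (1 - c) * pi0 * fpr) for c in cs])
    weights = 6 * cs * (1 - cs)                  # Beta(2,2) density: one explicit choice of cost prior
    return np.trapz(min_loss * weights, cs) / np.trapz(weights, cs)
```

The point of Hand's result is that the AUC amounts to exactly this kind of average, except that the weights over c are induced by each classifier's own score distribution, so two classifiers end up being judged against different cost priors.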
What Parker does in his ICDM paper is take this further. In the first part of the paper, he extends Hand's analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that
- Nearly 50% of the time, measures disagree on which classifier in a pair is more effective (a toy illustration of such a disagreement follows this list). He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry-pick a measure that shows you in a good light.
- It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers.
- It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
- His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.
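Here's the promised toy illustration. The scores are fabricated and have nothing to do with Parker's actual experiments; they just show how two standard measures can rank the same pair of classifiers in opposite orders.

```python
# Toy example, assuming made-up scores: AUC and accuracy disagree about
# which of two classifiers is "better".
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Classifier A ranks perfectly but is badly calibrated around the 0.5 threshold.
scores_a = np.array([0.45, 0.46, 0.47, 0.48, 0.10, 0.20, 0.30, 0.44])
# Classifier B is better calibrated but ranks one positive below a negative.
scores_b = np.array([0.90, 0.80, 0.70, 0.20, 0.60, 0.10, 0.10, 0.10])

for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name,
          "AUC =", roc_auc_score(y, s),
          "accuracy@0.5 =", accuracy_score(y, (s >= 0.5).astype(int)))
# AUC says A wins (1.0 vs ~0.94); accuracy at the default threshold says B wins (0.5 vs 0.75).
```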
At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncrasies, hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context matters more than we as mathematicians/engineers are used to.
David Hand gave an invited talk on the subject at KDD 2009:
http://kdd09.crowdvine.com/talks/5027
It was great and generated a lot of discussion. But I don't think people actually got the point back then.
Also have a look beyond the "my classifier is better than yours" world of machine learning.
For example, at SIAM Data Mining SDM'2011, the following evaluation paper was published:
http://siam.omnibooksonline.com/2011datamining/data/papers/018.pdf
Interpreting and Unifying Outlier Scores
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Institut für Informatik, Ludwig-Maximilians-Universität München
It also goes beyond "ROC AUC": it uses an intuitive cost model, takes care of class imbalance, and tries to make the actual scores readable on a 0-to-1 scale.
Oh, and here's another way of cheating the results: with precision @ k, for example, you can choose k, which is essentially another input parameter. The same goes for any top-k method. It's also easy to choose bad parameters for the competing methods.
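To illustrate with made-up rankings (the numbers below are purely hypothetical):

```python
# Toy sketch: precision@k can favour either method depending on which k you report.
def precision_at_k(ranked_labels, k):
    return sum(ranked_labels[:k]) / k

# Relevance labels of the top-10 retrieved items for two hypothetical methods (1 = relevant).
method_x = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
method_y = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]

for k in (3, 5, 10):
    print(k, precision_at_k(method_x, k), precision_at_k(method_y, k))
# k=3 favours X (1.0 vs ~0.67), k=5 favours Y (0.6 vs 0.8), k=10 is a tie (0.5 each).
```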
All results should be published as source code, and independently evaluated.
The obvious way to compare two classification methods is to see how well they do against some ground truth (i.e., labelled data)
With no attention to efficiency whatsoever?
Are you suggesting a classifier taking 1sec with 80% precision on a ground truth is inferior to one taking 1yr with 80.5% precision?
At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me.
Exactly. We have a multiobjective optimization problem, with precision, training efficiency, and classification efficiency among the objectives. Users will select the most appropriate classifier depending on the cost of false positives and false negatives, the size of the training set, and the data bandwidth of the classification stream.
In many applications, the ideal situation is to use a composition of very fast, rather primitive classifiers with (nearly) zero false negative rate.
The composition produces a classifier with a low false positive rate (assuming the classifiers err relatively independently) while at the same time being rather efficient, since we get an exponential drop-off in the number of items still being classified at each subsequent round.
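A back-of-the-envelope sketch of that, with made-up per-stage error rates and assuming the stages err independently:

```python
# Illustrative cascade arithmetic with hypothetical per-stage rates: each stage
# is cheap, keeps almost all positives, and rejects most negatives, so false
# positives and work both shrink geometrically.
stages = [           # (false_positive_rate, false_negative_rate) per stage
    (0.30, 0.001),
    (0.20, 0.001),
    (0.10, 0.001),
]

surviving_negatives = 1.0   # fraction of negatives still labelled "positive" so far
detection_rate = 1.0        # fraction of true positives still retained
work = 0.0                  # expected stage evaluations per negative item
for fpr, fnr in stages:
    work += surviving_negatives            # only items still alive reach this stage
    surviving_negatives *= fpr
    detection_rate *= (1 - fnr)

print("combined false positive rate:", surviving_negatives)   # 0.3 * 0.2 * 0.1 = 0.006
print("combined detection rate:", detection_rate)             # ~0.997
print("expected stage evaluations per negative item:", work)  # 1 + 0.3 + 0.06 = 1.36
```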
Alex: sounds to me like you're describing boosting. Notice that the problem here is evaluation. I am not sure why boosting would be ideal for any situation and any objective.
Alex, unfortunately computational time is kind of a red herring as an absolute metric for learning. In some applications, if possible, it is worthwhile to throw years of CPU time into classifier training for an extra half percent in performance, while in others the amount of data that needs to be processed per day is so big that any increase in training time is prohibitive. Conversely, in some situations it's OK to have long test times, while in others submillisecond response times are necessary. I think this aspect of performance is better left in the hands of the practitioner, who can make the appropriate tradeoff.