Occam's Razor and Pattern Recognition

In the context of pattern recognition, Occam's Razor plays an important role. To build a pattern recognizer, we need to select appropriate decision rules. Usually this means we want to choose the classifier with the highest accuracy. How can we know which classifier is the best? Some people argue, citing Occam's Razor, that the simplest classifier or decision rule is the best. But what does this mean? It is not clear in general what is meant by "the simplest", or what criteria to use to select "the best". This confusion can result in misinterpretations of Occam's Razor.

Some researchers argue that the "simplest" classifier is not always the most successful, and conclude that Occam's Razor is therefore a poor way to select classifiers. Such arguments are often simplistic. No one would argue that we should always use only the simplest boundaries, e.g., linear discriminant functions; rather, the question is what role the preference for simple boundaries should play in choosing a decision rule.

A similar argument holds that a complex decision boundary can usually be found which does better than a given simple boundary, and therefore that simple boundaries are of little interest if we are trying to find the best possible boundary. This criticism falls into the common trap of overfitting: a complex boundary can usually be found which performs better on the training set, but it is not at all obvious that a complex boundary can always be found which will perform better on future data points, especially since the Bayes-optimal decision boundary is often simple in shape. As in the chilies vs. tomatoes example, when looking for a decision rule which will generalize well, there is a tradeoff between accuracy on the training set and simplicity of the boundary.
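To make the tradeoff concrete, here is a minimal sketch (our own illustration, with all parameters chosen arbitrarily for the example) that fits two polynomial decision rules, a line and a degree-9 curve, to noisy 1-D data whose optimal boundary is simply x = 0:

```python
# A line vs. a degree-9 curve on data whose Bayes-optimal boundary is x = 0.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.where(x > 0, 1.0, -1.0)   # true classes split at x = 0
    flip = rng.random(n) < 0.15      # 15% label noise
    y[flip] *= -1
    return x, y

x_train, y_train = make_data(40)
x_test,  y_test  = make_data(10_000)

for degree in (1, 9):
    # Least-squares polynomial fit to the +/-1 labels; predict by its sign.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_acc = np.mean(np.sign(np.polyval(coeffs, x_train)) == y_train)
    test_acc  = np.mean(np.sign(np.polyval(coeffs, x_test))  == y_test)
    print(f"degree {degree}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

On a typical run the flexible curve matches more of the noisy training labels, while the straight line does at least as well on the fresh sample: training-set fit alone does not predict future performance.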

Consider an example similar to that of Figure 1, but in which the distributions of the two classes are linearly separable.

Figure 2. Two linearly separable classes and two possible discriminant functions.

Classifiers A and B perform equally well on the existing data, so by Occam's Razor, B is preferable to A. However, suppose the data points are distributed differently, as shown in the following figure: the blue squares have a higher probability density along a straight line, and the red points are concentrated around a point. The simplest classifier would still appear to be a straight line such as B, but knowing the distributions, the best classifier is in fact a parabola such as A.

Figure 3. The red dots are concentrated around a point, and the blue squares are along a line.

It may seem that Occam's Razor is failing in this case, but this is due to a misinterpretation of what Occam's Razor is and how it should be applied. Simply put, B would be the correct choice if we knew nothing about the distributions of our data. Given that knowledge, however, A is the simplest function consistent with the probability density of the data.
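Here is a small sketch of this scenario (the distribution parameters and both boundaries are hypothetical choices made for illustration): the line B and the parabola A agree on a handful of training points, but once the underlying distributions come into play, A's error rate on fresh data is lower.

```python
# Figure 3 scenario: blue squares scattered along a line, red dots
# concentrated around a point.  Compare a straight-line boundary (B)
# with a parabolic one (A) on small and large samples.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # blue: uniform along the line y = -1, Gaussian scatter across it
    blue = np.column_stack([rng.uniform(-4, 4, n),
                            -1 + 0.4 * rng.normal(size=n)])
    # red: isotropic Gaussian concentrated around the point (0, 1)
    red = rng.normal([0.0, 1.0], 0.3, size=(n, 2))
    X = np.vstack([blue, red])
    y = np.array([0] * n + [1] * n)   # 0 = blue, 1 = red
    return X, y

def classify_B(X):
    # straight line: predict red when y > 0
    return (X[:, 1] > 0).astype(int)

def classify_A(X):
    # parabola: predict red when y > x**2 / 2
    return (X[:, 1] > X[:, 0] ** 2 / 2).astype(int)

for label, n in (("train", 10), ("test", 100_000)):
    X, y = sample(n)
    for name, f in (("A (parabola)", classify_A), ("B (line)", classify_B)):
        print(f"{label} error, {name}: {np.mean(f(X) != y):.4f}")
```

Both rules will usually classify the small training sample perfectly; only the larger sample drawn from the known distributions separates them.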

When we are trying to choose the best classifier, we are interested in its future performance. It may happen that a classifier performs very well on sample data, but is outperformed in general by another classifier with less impressive performance on the training set. (Sometimes the C student ends up getting paid more than the A student.) Still, we cannot dismiss the performance of a classifier on the training set, so we must be very careful when balancing simplicity against training set performance.

Simple decision rules and classifiers are preferable to complex ones. However, one must be careful when assessing classifiers to take into account all the available domain knowledge that affects the decisions, i.e., the probability density functions. We could require a perfect fit to our training data, but such a fit by itself does not guarantee that the classifier will make better predictions. One pragmatic reason to prefer a simple classifier is ease of implementation, but a more fundamental argument is that a "pattern", by definition, implies some regularity in the data.

It is important to note that measuring simplicity is non-trivial. The most common misapplication of Occam's Razor is due to invalid assumptions about what constitutes simplicity in the context of a particular problem. Some authors define the simplicity of a classifier as syntactic simplicity: the smallest function or program that implements the decision rule, in the spirit of the Kolmogorov-Chaitin complexity measure. This formulation is at least an improvement in concreteness over the vague term "simplicity", but it has more the character of an approximate heuristic than of a mathematically satisfying theory. For instance, some common functions, such as a sine curve or the natural logarithm, are mathematically simple but do not seem probable as decision boundaries. Should such boundaries really be preferred as "simpler" than, say, quadratic discriminant functions? We believe that a convincing mathematical theory of decision boundary simplicity is not yet available. Perhaps this is because the decision boundaries we prefer are based on our accumulated experience, which is essentially statistical rather than mathematical in nature. If this view is correct, then an exact general definition of the simplicity of a decision boundary may be too much to ask.
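As a toy illustration of the worry (the expressions are entirely hypothetical, and raw character count is only a crude stand-in for program length, since true Kolmogorov-Chaitin complexity is uncomputable):

```python
# Rank candidate boundary functions by the length of their defining
# expression -- a crude proxy for "syntactic simplicity".
candidates = {
    "line":      "y > a*x + b",
    "sine":      "y > sin(w*x)",
    "quadratic": "y > a*x**2 + b*x + c",
}
for name, expr in sorted(candidates.items(), key=lambda kv: len(kv[1])):
    print(f"{name:9s} length {len(expr):2d}: {expr}")
# The sine boundary comes out "simpler" than the quadratic one,
# even though a sine wave seems an improbable decision boundary.
```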

In summary, Occam's Razor captures the practically important intuition that simpler classifiers should be preferred over complex ones. The concept of "simplicity" is not easily defined, and Occam's Razor can be stated and interpreted in many different ways; this is why its usefulness in designing pattern recognizers so often comes under attack. We believe that such criticisms tend to rest on exaggerations or misinterpretations of the principle. However difficult it may be to arrive at a satisfying formalism of simplicity, even in its present makeshift form Occam's Razor is an important and unique principle in pattern recognition.

Copyright © 1999 Jacob Eliosoff and Ernesto Posse
