Consider a pattern recognition problem: we need to distinguish between chilies and tomatoes. Suppose the features we measure are size and color. We are given a training set of sample chilies and tomatoes, distributed like this:
Figure 1. The blue squares depict chilies, the red dots tomatoes.
The figure shows three possible boundaries we might draw to separate the two classes of chilies and tomatoes. Which is likely to do the best job of classifying future samples? Classifier A, a straight line, is in some sense the simplest, but its poor classification of chilies makes it unlikely to work well. Classifier C is overspecialized. It fits the existing data perfectly, but its tangles seem unlikely to capture any essential property of the data. Classifier B is not as simple as A, and doesn't fit the training set as well as C, but it seems more likely to do well on future data than the others. This shows us that simplicity is an important factor in choosing a decision rule. Which is the best classifier? The simplest one? The simplest one that fits known data? The simplest one that, though not perfect, approximately fits the data? These are the questions addressed in these pages.
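The tradeoff among classifiers A, B, and C can be made concrete with a small experiment. The sketch below is illustrative only, not from the text: it uses hypothetical one-dimensional toy data (size alone, drawn from two overlapping distributions), a single-threshold rule standing in for a simple classifier like A or B, and a memorizing nearest-neighbor rule standing in for an overspecialized classifier like C. The memorizer fits the training set perfectly, yet on fresh samples the simple threshold tends to do at least as well.

```python
import random

random.seed(0)

def sample(n):
    # Hypothetical toy data: the one feature is size; chilies ("c")
    # tend to be smaller than tomatoes ("t"), with overlap from noise.
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            data.append((random.gauss(4.0, 1.0), "c"))
        else:
            data.append((random.gauss(7.0, 1.0), "t"))
    return data

train, test = sample(30), sample(200)

# Simple rule (like classifier A or B): one size threshold,
# chosen to minimize error on the training set.
def fit_threshold(data):
    best_t, best_err = None, float("inf")
    for t, _ in data:
        err = sum((x <= t) != (y == "c") for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Overspecialized rule (like classifier C): memorize the training set
# and answer with the label of the nearest training point.
def nn_predict(data, x):
    return min(data, key=lambda p: abs(p[0] - x))[1]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

t = fit_threshold(train)
simple = lambda x: "c" if x <= t else "t"
memorize = lambda x: nn_predict(train, x)

print("memorizer training accuracy:", accuracy(memorize, train))  # always 1.0
print("simple rule test accuracy:  ", accuracy(simple, test))
print("memorizer test accuracy:    ", accuracy(memorize, test))
```

The memorizer's perfect training score, like classifier C's perfect fit, says nothing about future data; comparing the two test accuracies shows why the simpler rule is the safer bet.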
The preference for simple explanations is ancient. This principle has come to be known as Occam's Razor, after William of Ockham (also spelled Occam), who in the 14th century stated it as:
"Pluralitas non est ponenda sine necessitate."
(Plurality should not be posited without necessity.)
A modern rephrasing might be "keep things simple", or "simplicity is better". Of course these phrasings are vague. What exactly do we mean here by "simplicity"? How high a priority should it be, and why? The above example is intuitive, but in general the simplest choice may not be obvious. In fact, Occam's Razor can be interpreted in several ways, as we discuss below.
Occam's Razor plays a fundamental role in philosophy and science. In the field of pattern recognition it arises in the problem of choosing among different decision rules. That is, we are given a training set of data, where each element comes with measurements of certain features along with the category it belongs to. We want to use this data to choose a decision rule (or classifier), which will classify future elements based on their measurements. Given a particular training sample, which of the infinitely many possible decision rules is most likely to be accurate on future data? As seen in the chilies and tomatoes example above, Occam's Razor is an important factor in making this choice.
The following pages discuss these issues in greater detail.