However, suppose that we now let k = 2 for our k-nearest neighbor algorithm, so that new patient 2 would be classified according to the classification of the k = 2 points closest to it. One of these points is dark gray, and one is medium gray, so that our classifier would be faced with a decision between classifying new patient 2 for drugs B and C (dark gray) or drugs A and X (medium gray). How would the classifier decide between these two classifications? Voting would not help, since there is one vote for each of two classifications.
Voting would help, however, if we let k = 3 for the algorithm, so that new patient 2 would be classified based on the three points closest to it. Since two of the three closest points are medium gray, a classification based on voting would therefore choose drugs A and X (medium gray) as the classification for new patient 2. Note that the classification assigned for new patient 2 differed based on which value we chose for k.
Finally, consider new patient 3, who is 47 years old and has a Na/K ratio of 13.5. Figure 5.8 presents a close-up of the three nearest neighbors to new patient 3. For k = 1, the k-nearest neighbor algorithm would choose the dark gray (drugs B and C) classification for new patient 3, based on a distance measure. For k = 2, however, voting would not help, since the two nearest neighbors have different classifications. Nor would voting help for k = 3 in this case, since the three nearest neighbors to new patient 3 are of three different classifications.
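The voting behavior described for new patients 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name vote is hypothetical, the order of the two nearest neighbors of new patient 2 is assumed (the text says only that one is dark gray and one is medium gray), and the labels "second class" and "third class" are placeholders for the two other classifications among the neighbors of new patient 3.

```python
from collections import Counter

def vote(labels_nearest_first, k):
    """Return the majority label among the k nearest neighbors, or None on a tie."""
    counts = Counter(labels_nearest_first[:k]).most_common()
    # A tie occurs when the two most frequent labels have the same count.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# New patient 2: the order of the first two labels is an assumption.
patient2 = ["dark gray", "medium gray", "medium gray"]
# New patient 3: dark gray is nearest; the other two labels are placeholders.
patient3 = ["dark gray", "second class", "third class"]

for k in (2, 3):
    print("patient 2, k =", k, "->", vote(patient2, k))
for k in (1, 2, 3):
    print("patient 3, k =", k, "->", vote(patient3, k))
```

A result of None marks the cases where simple voting cannot decide: k = 2 for new patient 2, and both k = 2 and k = 3 for new patient 3.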
This example has shown us some of the issues involved in building a classifier using the k-nearest neighbor algorithm. These issues include:
- How many neighbors should we consider? That is, what is k?
- How do we measure distance?
- How do we combine the information from more than one observation?
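To make the first two questions concrete, here is a minimal sketch that ranks records by distance to new patient 3 (age 47, Na/K ratio 13.5) and keeps the k closest. It assumes Euclidean distance as the distance measure, which is one possible choice and not necessarily the one adopted later. The training records, the label "third class", and the function names euclidean and k_nearest are all invented for illustration, and no rescaling of the two attributes is attempted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    """Return the k (distance, label) pairs closest to query, nearest first."""
    ranked = sorted((euclidean(point, query), label) for point, label in train)
    return ranked[:k]

# Hypothetical records: ((age, Na/K ratio), label). Values are invented.
train = [
    ((46.0, 13.5), "dark gray"),
    ((47.0, 15.5), "medium gray"),
    ((50.0, 13.5), "third class"),
    ((30.0, 25.0), "medium gray"),
]

new_patient_3 = (47.0, 13.5)
nearest = k_nearest(train, new_patient_3, k=3)
for distance, label in nearest:
    print(round(distance, 2), label)
```

With these invented records the three nearest neighbors carry three different labels, so the third question, how to combine their information, still has to be answered separately.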
Later we consider other questions, such as:
- Should all points be weighted equally, or should some points have more influence than others?
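As a preview of what unequal weighting might look like, the sketch below lets closer neighbors count for more by giving each neighbor a weight of one over its squared distance. That weighting scheme is an assumption made for illustration, as are the function name weighted_vote and the neighbor distances, which are invented.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs with distance > 0.

    Returns the label with the largest total inverse-squared-distance weight.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 / distance ** 2
    return max(totals, key=totals.get)

# Invented distances: one close dark gray point, two farther medium gray points.
neighbors = [(1.0, "dark gray"), (2.0, "medium gray"), (2.5, "medium gray")]

# Unweighted voting would pick medium gray (2 votes to 1).
# Weighted totals: dark gray = 1.0, medium gray = 0.25 + 0.16 = 0.41.
print(weighted_vote(neighbors))
```

Under these invented distances the single close neighbor outweighs the two farther ones, so the weighted result differs from the simple majority.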
Taken from: Discovering Knowledge in Data: An Introduction to Data Mining