Archive for January, 2009
CHOOSING k
Saturday, January 31st, 2009
How should one go about choosing the value of k? In fact, there may not be an obvious best solution. Consider choosing a small value for k. Then it is possible that the classification or estimation may be unduly affected by outliers or unusual observations (noise). With small k (e.g., k = 1), the algorithm will simply return the target value of the nearest observation, a process that may lead the algorithm toward overfitting, tending to memorize the training data set at the expense of generalizability.
DATABASE CONSIDERATIONS
Friday, January 30th, 2009
For instance-based learning methods such as the k-nearest neighbor algorithm, it is vitally important to have access to a rich database full of as many different combinations of attribute values as possible. It is especially important that rare classifications be represented sufficiently, so that the algorithm does not only predict common classifications. Therefore, the data set would need to be balanced, with a sufficiently large percentage of the less common classifications. One method to perform balancing is to reduce the proportion of records with more common classifications.
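One way to reduce the proportion of common classifications is random undersampling. The sketch below is only an illustration under stated assumptions: the function name `undersample`, the `max_ratio` parameter, and the toy records are hypothetical and do not come from the text.

```python
import random
from collections import defaultdict

def undersample(records, label_of, max_ratio=1.0, seed=0):
    """Randomly drop records of common classes so that no class has more
    than max_ratio times as many records as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    smallest = min(len(group) for group in by_class.values())
    cap = max(1, int(smallest * max_ratio))

    balanced = []
    for group in by_class.values():
        if len(group) > cap:
            balanced.extend(rng.sample(group, cap))  # thin out a common class
        else:
            balanced.extend(group)                   # keep rare classes whole
    return balanced

# Toy data (hypothetical): 90 common records and 10 rare ones.
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
balanced = undersample(records, label_of=lambda rec: rec[0])

counts = defaultdict(int)
for label, _ in balanced:
    counts[label] += 1
print(dict(counts))
```

Undersampling discards data, so a larger `max_ratio` may be preferable when records are scarce.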
QUANTIFYING ATTRIBUTE RELEVANCE: STRETCHING THE AXES
Thursday, January 29th, 2009
Consider that not all attributes may be relevant to the classification. In decision trees (Chapter 6), for example, only those attributes that are helpful to the classification are considered. In the k-nearest neighbor algorithm, the distances are by default calculated on all the attributes. It is possible, therefore, for relevant records that are proximate to the new record in all the important variables, but distant from it in unimportant ways, to have a moderately large distance from the new record, and therefore not to be considered for the classification decision. Analysts may therefore consider restricting the algorithm to fields known to be important for classifying new records, or at least blinding the algorithm to known irrelevant fields.
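A minimal sketch of this idea, assuming each attribute is multiplied by a stretch factor before the Euclidean distance is taken (a factor of 0 blinds the algorithm to that field). The function name `stretched_distance`, the factors, and the sample records are hypothetical.

```python
import math

def stretched_distance(a, b, stretch):
    """Euclidean distance after multiplying each axis by its stretch factor.
    A factor of 0 removes the attribute from the distance entirely."""
    return math.sqrt(sum((z * (x - y)) ** 2 for x, y, z in zip(a, b, stretch)))

# Hypothetical normalized records: the first two attributes matter,
# the third is assumed to be irrelevant to the classification.
new_record = (0.5, 0.5, 0.1)
candidate = (0.5, 0.5, 0.9)

all_attributes = stretched_distance(new_record, candidate, (1, 1, 1))
blinded = stretched_distance(new_record, candidate, (1, 1, 0))
print(all_attributes, blinded)
```

With all attributes included, the candidate looks distant only because of the irrelevant third field; with that field blinded, it coincides with the new record on everything that matters.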
WEIGHTED VOTING
Wednesday, January 28th, 2009
One may feel that neighbors that are closer or more similar to the new record should be weighted more heavily than more distant neighbors. For example, in Figure 5.5, does it seem fair that the light gray record farther away gets the same vote as the dark gray record that is closer to the new record? Perhaps not. Instead, the analyst may choose to apply weighted voting, where closer neighbors have a larger voice in the classification decision than do more distant neighbors. Weighted voting also makes it much less likely for ties to arise.
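The sketch below shows one possible weighting scheme. It assumes each neighbor's vote is weighted by the inverse square of its distance, which is a common choice but is not stated in the excerpt above; the function name `weighted_vote` and the distances are hypothetical.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest records.
    Each vote is weighted by 1 / distance**2 (an assumed weighting scheme)."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if distance == 0:
            return label  # an identical record decides the vote outright
        totals[label] += 1.0 / distance ** 2
    return max(totals, key=totals.get)

# Hypothetical distances: one close dark gray record, two farther light gray ones.
neighbors = [(0.2, "dark gray"), (0.5, "light gray"), (0.6, "light gray")]
winner = weighted_vote(neighbors)
print(winner)
```

Under unweighted voting the two light gray records would win two votes to one; here the single close dark gray record outweighs them.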
COMBINATION FUNCTION
Tuesday, January 27th, 2009
Now that we have a method of determining which records are most similar to the new, unclassified record, we need to establish how these similar records will combine to provide a classification decision for the new record. That is, we need a combination function. The most basic combination function is simple unweighted voting.
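Simple unweighted voting can be sketched in a few lines: each of the k nearest records casts one vote for its own classification, and the most frequent classification wins. The function name `unweighted_vote` and the labels are illustrative only.

```python
from collections import Counter

def unweighted_vote(neighbor_labels):
    """Return the classification that occurs most often among the neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Hypothetical classifications of the k = 3 nearest records.
decision = unweighted_vote(["dark gray", "medium gray", "medium gray"])
print(decision)
```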
Solution of Merchant Termination
Tuesday, January 27th, 2009
Have you ever faced a merchant termination problem? How did you feel about it? Most of you may have felt disappointed or even stressed. This can be called the worst situation in a merchant's life, and many more problems follow from it. However, you can still solve this problem by asking a merchant assistant for help. Where, then, can you find an assistant to solve your merchant problem?
DISTANCE FUNCTION (2)
Monday, January 26th, 2009
For example, let's find an answer to our earlier question: Which patient is more similar to a 50-year-old male: a 20-year-old male or a 50-year-old female? Suppose that for the age variable, the range is 50, the minimum is 10, the mean is 45, and the standard deviation is 15. Let patient A be our 50-year-old male, patient B the 20-year-old male, and patient C the 50-year-old female. The original variable values, along with the min-max normalization (ageMMN) and Z-score standardization (ageZscore), are listed in Table 5.2.
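The two transformations can be checked with a short calculation. The sketch below uses the age statistics quoted above (range 50, minimum 10, mean 45, standard deviation 15); the function names are illustrative, and the printed values are computed here rather than copied from Table 5.2.

```python
AGE_MIN, AGE_RANGE = 10, 50   # statistics quoted in the text
AGE_MEAN, AGE_SD = 45, 15

def min_max(age):
    """Min-max normalization: (value - minimum) / range."""
    return (age - AGE_MIN) / AGE_RANGE

def z_score(age):
    """Z-score standardization: (value - mean) / standard deviation."""
    return (age - AGE_MEAN) / AGE_SD

ages = {"A": 50, "B": 20, "C": 50}
for patient, age in ages.items():
    print(patient, age, round(min_max(age), 2), round(z_score(age), 2))
```

For age 50 this gives 0.8 and about 0.33; for age 20 it gives 0.2 and about -1.67.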
DISTANCE FUNCTION
Sunday, January 25th, 2009
We have seen above how, for a new record, the k-nearest neighbor algorithm assigns the classification of the most similar record or records. But just how do we define similar? For example, suppose that we have a new patient who is a 50-year-old male. Which patient is more similar, a 20-year-old male or a 50-year-old female?
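One common way to make "similar" concrete is Euclidean distance. The sketch below is an assumption-laden illustration, not a definition taken from the excerpt: numeric attributes contribute their squared difference, and categorical attributes are assumed to contribute 0 when equal and 1 when different. The function name `distance` is hypothetical.

```python
import math

def distance(a, b):
    """Euclidean-style distance over mixed attributes.
    Numeric fields use the squared difference; categorical fields
    count 0 if equal and 1 if different (an assumed convention)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

new_patient = (50, "male")
d_young_male = distance(new_patient, (20, "male"))
d_same_age_female = distance(new_patient, (50, "female"))
print(d_young_male, d_same_age_female)
```

On raw values the age difference of 30 years swamps the difference in sex, which suggests why the scale of each attribute matters when distances are compared.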
k-NEAREST NEIGHBOR ALGORITHM (2)
Saturday, January 24th, 2009
However, suppose that we now let k = 2 for our k-nearest neighbor algorithm, so that new patient 2 would be classified according to the classification of the k = 2 points closest to it. One of these points is dark gray, and one is medium gray, so that our classifier would be faced with a decision between classifying new patient 2 for drugs B and C (dark gray) or drugs A and X (medium gray). How would the classifier decide between these two classifications? Voting would not help, since there is one vote for each of two classifications.
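The tie can be made explicit by returning every classification that shares the top vote count. This is a small sketch; the function name `vote_winners` is hypothetical, and the third neighbor in the second call is invented purely for illustration.

```python
from collections import Counter

def vote_winners(neighbor_labels):
    """Return all classifications tied for the highest number of votes."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)

# k = 2: one vote each, so the vote cannot decide.
tied = vote_winners(["dark gray", "medium gray"])

# k = 3 with a hypothetical third neighbor: the tie disappears.
resolved = vote_winners(["dark gray", "medium gray", "medium gray"])
print(tied, resolved)
```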
k-NEAREST NEIGHBOR ALGORITHM
Friday, January 23rd, 2009
The first algorithm we shall investigate is the k-nearest neighbor algorithm, which is most often used for classification, although it can also be used for estimation and prediction. k-Nearest neighbor is an example of instance-based learning, in which the training data set is stored, so that a classification for a new unclassified record may be found simply by comparing it to the most similar records in the training set. Let's consider an example.
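The idea of storing the training set and comparing a new record against it can be sketched as follows. This is a minimal illustration, assuming numeric attributes, Euclidean distance, and unweighted voting; the function name `knn_classify` and the toy training set are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k=3):
    """training: list of (attributes, label) pairs kept in memory.
    Classify new_point by majority vote among its k nearest records."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two numeric attributes per record.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.1), "B"),
]
label = knn_classify(training, (1.1, 1.0), k=3)
print(label)
```

There is no training step beyond storing the records; all of the work happens when a new record is classified.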


