DATA CLEANING

To illustrate the need to clean up data, let’s look, attribute by attribute, at some of the errors that can creep into even a tiny data set, such as that in Table 2.1. The customer ID variable seems to be fine. What about zip? Let’s assume that we are expecting all of the customers in the database to have the usual five-numeral U.S. zip code. Now, customer 1002 has this strange (to American eyes) zip code of J2S7K7. If we were not careful, we might be tempted to classify this unusual value as an error and toss it out, until we stop to think that not all countries use the same zip code format. Actually, this is the postal code of St. Hyacinthe, Quebec, Canada, so it probably represents real data from a real customer. What has evidently occurred is that a French-Canadian customer has made a purchase and put their home postal code down in the required field. Especially in this era of the North American Free Trade Agreement, we must be ready to expect unusual values in fields such as zip codes, which vary from country to country.
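The idea of flagging an unusual value for review rather than discarding it can be sketched in a few lines of Python. This is only an illustration: the function name classify_postal_code, the labels it returns, and the two patterns (five-digit U.S. zip codes and the letter-digit-letter digit-letter-digit Canadian layout) are assumptions made here, and a real database would likely need more formats.

```python
import re

# Assumed formats for illustration only; other countries use other layouts.
US_ZIP = re.compile(r"[0-9]{5}")
CA_POSTAL = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")


def classify_postal_code(value):
    """Label a raw postal code instead of rejecting anything unfamiliar."""
    text = str(value).strip()
    if US_ZIP.fullmatch(text):
        return "us_zip"
    if CA_POSTAL.fullmatch(text):
        return "canadian_postal_code"
    # Unrecognized values are kept and flagged, not thrown away.
    return "needs_review"


# Values taken from the discussion of customers 1002 and 1004.
for raw in ["J2S7K7", "6269"]:
    print(raw, "->", classify_postal_code(raw))
```

Under these assumptions, J2S7K7 is recognized as a plausible Canadian postal code, while 6269 is set aside for a closer look instead of being deleted.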

What about the zip code for customer 1004? We are unaware of any countries that have four-digit zip codes, such as the 6269 indicated here, so this must be an error, right? Probably not. Zip codes for the New England states begin with the numeral 0. Unless the zip code field is defined to be character (text) and not numeric, the software will probably chop off the leading zero, which is apparently what happened here. The zip code is probably 06269, which refers to Storrs, Connecticut, home of the University of Connecticut.
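A minimal Python sketch of repairing this kind of damage follows. The helper name restore_leading_zeros is hypothetical, and the function assumes that any all-digit value shorter than five characters is a U.S. zip code that lost its leading zeros; that assumption should be checked against the data before applying it widely.

```python
def restore_leading_zeros(zip_value, width=5):
    """Left-pad an all-digit zip code that was stored as a number.

    Assumes short digit-only values are U.S. zip codes with dropped zeros.
    """
    text = str(zip_value).strip()
    if text.isdigit() and len(text) < width:
        return text.zfill(width)
    # Anything else (full-length or non-numeric codes) is left untouched.
    return text


print(restore_leading_zeros(6269))      # the value stored for customer 1004
print(restore_leading_zeros("J2S7K7"))  # non-numeric codes pass through
```

Padding after the fact is a repair. Defining the field as text when the data is first loaded avoids the problem altogether.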

The next field, gender, contains a missing value for customer 1003. We detail methods for dealing with missing values later in the chapter.

Taken from: Discovering Knowledge in Data: An Introduction to Data Mining
