Missing data is a problem that continues to plague data analysis methods. Even as our analysis methods gain sophistication, we continue to encounter missing values in fields, especially in databases with a large number of fields. The absence of information is rarely beneficial. All things being equal, more data is almost always better. Therefore, we should think carefully about how we handle the thorny issue of missing data.
The age field has a couple of problems. Although all the other customers have numerical values for age, customer 1001's age of C probably reflects an earlier categorization of this man's age into a bin labeled C. The data mining software will definitely not like this categorical value in an otherwise numerical field, and we will have to resolve this problem somehow. How about customer 1004's age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000. More likely, the age of this person is missing and was coded as 0 to indicate this or some other anomalous condition (e.g., refused to provide the age information).
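The two age problems described above can be caught with a small cleaning step. The following is a minimal sketch, not the author's method: the function name `clean_age` is hypothetical, and treating an age of 0 as missing is an assumption that should be confirmed with whoever maintains the database.

```python
def clean_age(raw):
    """Return the age as an int, or None when the value cannot be trusted."""
    try:
        age = int(str(raw).strip())
    except ValueError:
        # Non-numeric entries, such as the bin label "C", are treated as missing.
        return None
    if age <= 0:
        # Assumption: 0 (or a negative number) is a missing-value code,
        # not a real age.
        return None
    return age


# Only the two problem values mentioned in the text are used here.
raw_ages = {1001: "C", 1004: 0}
cleaned = {customer_id: clean_age(value) for customer_id, value in raw_ages.items()}
print(cleaned)
```

Returning `None` rather than silently dropping the record keeps the customer in the data set, so the decision about how to handle the missing age can be made later.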
The income field, which we assume is measuring annual gross income, has three potentially anomalous values. First, customer 1003 is shown as having an income of $10,000,000 per year. Although entirely possible, especially when considering the customer's zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value. Certain statistical and data mining modeling techniques do not function smoothly in the presence of outliers; we examine methods of handling outliers later in the chapter.
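As a rough illustration of how such an extreme value might be flagged, here is a sketch using the common interquartile-range (IQR) rule. This is one widely used convention and not necessarily the method the chapter goes on to describe. Apart from the $10,000,000 figure, every income below is made up for the example, and the 1.5 multiplier is the conventional default rather than a value from the text.

```python
import statistics

# Hypothetical incomes; only 10_000_000 comes from the text.
incomes = [75_000, 40_000, 10_000_000, 50_000, 62_000, 58_000, 81_000]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, _median, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the first and third quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]
print(outliers)
```

Flagging a value is not the same as deleting it. As the text notes, an income of this size is entirely possible, so a flagged record is a prompt for inspection.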
To illustrate the need to clean up data, let's take a look at some of the types of errors that could creep into even a tiny data set, such as that in Table 2.1. Let's discuss, attribute by attribute, some of the problems that have found their way into the data set in Table 2.1. The customer ID variable seems to be fine. What about zip? Let's assume that we are expecting all of the customers in the database to have the usual five-numeral U.S. zip code. Now, customer 1002 has this strange (to American eyes) zip code of J2S7K7. If we were not careful, we might be tempted to classify this unusual value as an error and toss it out, until we ...
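One way to avoid tossing out an unusual code too hastily is to classify each value by its format before deciding what to do with it. The sketch below is an illustration under assumptions: the function name and labels are hypothetical, and the second pattern is the letter-digit-letter digit-letter-digit layout used by Canadian postal codes, which J2S7K7 happens to match. A format match alone does not prove that a code exists.

```python
import re

# Five digits, nothing else.
US_ZIP = re.compile(r"\d{5}")
# Letter-digit-letter, optional space, digit-letter-digit.
CA_POSTAL = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")


def classify_postal_code(value):
    """Label a postal code by its format instead of rejecting it outright."""
    text = str(value).strip()
    if US_ZIP.fullmatch(text):
        return "us_zip"
    if CA_POSTAL.fullmatch(text):
        return "canadian_format"
    return "unrecognized"


for code in ["90210", "J2S7K7"]:
    print(code, classify_postal_code(code))
```

Values labeled `unrecognized` can then be reviewed by hand before any record is discarded.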
WHY DO WE NEED TO PREPROCESS THE DATA?
Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy. For example, the databases may contain:
- Fields that are obsolete or redundant
- Missing values
- Outliers
- Data in a form not suitable for data mining models
- Values not consistent with policy or common sense.
December 7
PROFILING THE TOURISM MARKET USING k-MEANS CLUSTERING ANALYSIS [23] (2)
4. Modeling Phase
Clustering is a natural method for generating segment profiles. The researchers chose k-means clustering, since that algorithm is quick and efficient as long as you know the number of clusters you expect to find. They explored models with between two and six clusters before settling on a five-cluster solution as best reflecting reality. Brief profiles of the clusters are as follows:
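For readers unfamiliar with the algorithm, the following is a minimal sketch of k-means (Lloyd's iteration) in plain Python. It is not the researchers' implementation, and the two-dimensional points and starting centroids are invented for illustration; the survey data from the study is not reproduced here. The number of clusters is fixed by the number of starting centroids, which is why the method needs k to be chosen in advance.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups, so k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
```

Trying several values of k, as the researchers did with two through six, amounts to rerunning a routine like this with a different number of starting centroids and comparing how interpretable the resulting clusters are.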
December 6
PROFILING THE TOURISM MARKET USING k-MEANS CLUSTERING ANALYSIS [23]
1. Business/Research Understanding Phase
The researchers, Simon Hudson and Brent Ritchie, of the University of Calgary, Alberta, Canada, were interested in studying intraprovince tourist behavior in Alberta. They wanted to create profiles of domestic Albertan tourists based on the decision behavior of the tourists. The overall goal of the study was to form a quantitative basis for the development of an intraprovince marketing campaign, sponsored by Travel Alberta. Toward this goal, the main objectives were to determine which factors were important in choosing destinations in Alberta, to evaluate the domestic perceptions of the Alberta vacation product, and to attempt to comprehend the travel decision-making process.
Similar to a charitable remainder annuity trust, in a charitable gift annuity the donor contributes assets to a not-for-profit organization in exchange for a promise by the organization to pay a fixed amount over a specified period of time to the donor or to other third parties. It is important to note that the third parties are designated by the donor.
More specifically, footnote 4 of the 1998 financial statements stated that Elan had an equity venture with Axogen Limited and NeuroLab and that Elan had the option to purchase the rest of Axogen's shares and NeuroLab's shares. Apparently Elan's management team ignored the existence of the option when they performed their accounting tasks. Why? I could not find separate financial statements for Axogen or for NeuroLab, but my bet is that Elan loaded them up with debt and hoped to keep these items off the balance sheet, at least for a while.
Asking a where question is fairly straightforward (see the preceding section), but answering ayna questions isn't always as clear-cut. You can answer an ayna question in a number of different ways, ranging from the simple to the convoluted. In order to answer ayna questions, you have to understand the structure of the ayna question reply, which usually follows this format: subject, preposition, object.


