Archive for December, 2008

STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION

Wednesday, December 31st, 2008

DATA MINING TASKS IN DISCOVERING KNOWLEDGE IN DATA

In Chapter 1 we were introduced to the six data mining tasks:

  • Description
  • Estimation
  • Prediction
  • Classification
  • Clustering
  • Association (more…)

SUMMARY

Tuesday, December 30th, 2008

Let us consider some of the insights we have gained into the churn data set through the use of exploratory data analysis.

  • The four charge fields are linear functions of the minute fields, and should be omitted.
  • The area code field and/or the state field are anomalous, and should be omitted until further clarification is obtained.
  • The correlations among the remaining predictor variables are weak, allowing us to retain them all for any data mining model. (more…)
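The first insight above can be checked numerically: if a charge field is a linear function of a minute field, their correlation is essentially 1 and one of the two fields is redundant. The following is a minimal sketch only. The sample values and the per-minute rate are made up for illustration and are not taken from the churn data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, for illustration only.
day_minutes = [265.1, 161.6, 243.4, 299.4, 166.7]
RATE = 0.17  # assumed flat per-minute rate, not stated in the text

# A charge field that is a linear function of minutes (rounded to cents).
day_charge = [round(RATE * m, 2) for m in day_minutes]

r = pearson(day_minutes, day_charge)

# A correlation this close to 1 means the charge field adds no information.
redundant = abs(r) > 0.999
print(f"correlation = {r:.6f}, redundant = {redundant}")
```

The same function can be applied to every pair of remaining predictors to confirm that their correlations are weak.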

Greatest Web Information Center

Monday, December 29th, 2008

It is no longer a surprise that the internet has become the favorite source for people to get whatever they want. With it, people can easily browse websites, chat, and make video calls with their loved ones. Beyond that, people can get the files they need by downloading them from certain websites. However, problems still occur when the connection suddenly goes down because of high internet traffic.
You can find out for yourself what often causes this problem. To do so, you can look up traffic rankings by visiting alexa.com. This site is an online web information company that provides you with the most complete web traffic comparisons. On this site, you can find several kinds of websites that have high visitor numbers. All of that information is updated every 5 minutes. You can read the graph provided on the site if you want to see the web traffic at a glance.
The site also offers several features. One example is the Alexa toolbar, which works with Firefox and Internet Explorer. With this feature, you can easily see the traffic of any website you visit just by looking at the bottom of your web browser.

SELECTING INTERESTING SUBSETS OF THE DATA FOR FURTHER INVESTIGATION

Monday, December 29th, 2008

We may use scatter plots (or histograms) to identify interesting subsets of the data, in order to study these subsets more closely. In Figure 3.25 we see that customers with high day minutes and high evening minutes are more likely to churn. But how can we quantify this? Clementine allows the user to click and drag a select box around data points of interest, and select them for further investigation. Here we selected the records within the rectangular box in the upper right. (A better method would be to allow the user to select polygons besides rectangles.) (more…)
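Outside a graphical tool, the same rectangular selection can be expressed as a pair of threshold conditions. The sketch below assumes hypothetical field names and box boundaries, and uses a handful of made-up records. It only illustrates how the selected subset could be compared against the rest to quantify the difference in churn.

```python
# Made-up records for illustration; field names are assumptions.
records = [
    {"day_minutes": 310.0, "eve_minutes": 280.0, "churn": True},
    {"day_minutes": 295.0, "eve_minutes": 265.0, "churn": True},
    {"day_minutes": 270.0, "eve_minutes": 255.0, "churn": False},
    {"day_minutes": 180.0, "eve_minutes": 200.0, "churn": False},
    {"day_minutes": 150.0, "eve_minutes": 260.0, "churn": False},
    {"day_minutes": 260.0, "eve_minutes": 120.0, "churn": True},
]

# Hypothetical lower-left corner of the select box (upper right region).
DAY_MIN_THRESHOLD = 250.0
EVE_MIN_THRESHOLD = 250.0

def in_box(record):
    """True if the record lies inside the rectangular selection."""
    return (record["day_minutes"] >= DAY_MIN_THRESHOLD
            and record["eve_minutes"] >= EVE_MIN_THRESHOLD)

def churn_rate(rows):
    """Proportion of churners among the given rows (NaN if empty)."""
    if not rows:
        return float("nan")
    return sum(1 for r in rows if r["churn"]) / len(rows)

selected = [r for r in records if in_box(r)]
remaining = [r for r in records if not in_box(r)]

print(f"selected:  n={len(selected)}, churn rate={churn_rate(selected):.1%}")
print(f"remaining: n={len(remaining)}, churn rate={churn_rate(remaining):.1%}")
```

Comparing the two churn rates is one way to put a number on what the scatter plot suggests visually.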

EXPLORING MULTIVARIATE RELATIONSHIPS

Sunday, December 28th, 2008

We turn next to an examination of possible multivariate associations of numerical variables with churn, using two- and three-dimensional scatter plots. Figure 3.23 is a scatter plot of customer service calls versus day minutes (note Clementine's incorrect reversing of this order in the plot title; the y-variable should always be the first named). Consider the partition shown in the scatter plot, which indicates a high-churn area in the upper left section of the graph and another high-churn area in the right of the graph. (more…)

EXPLORING NUMERICAL VARIABLES (3)

Saturday, December 27th, 2008

Figure 3.20 shows a slight tendency for customers with higher evening minutes to churn. Based solely on the graphical evidence, however, we cannot conclude beyond a reasonable doubt that such an effect exists. Therefore, we shall hold off on formulating policy recommendations on evening cell-phone use until our data mining models offer firmer evidence that the putative effect is in fact present. (more…)

EXPLORING NUMERICAL VARIABLES (2)

Friday, December 26th, 2008

We turn next to graphical analysis of our numerical variables. For the variable customer service calls, we show three examples of histograms, which are useful for getting an overall look at the distribution of a numerical variable. Figure 3.16 is a histogram of customer service calls, with no overlay, indicating that the distribution is right-skewed, with a mode at one call. (more…)

EXPLORING NUMERICAL VARIABLES

Thursday, December 25th, 2008

Next, we turn to an exploration of the numerical predictive variables. We begin with numerical summary measures, including minimum and maximum; measures of center, such as mean, median, and mode; and measures of variability, such as standard deviation. Figure 3.14 shows these summary measures for some of our numerical variables. We see, for example, that the minimum account length is one month, the maximum is 243 months, and the mean and median are about the same, at around 101 months, which is an indication of symmetry. Notice that several variables show this evidence of symmetry, including all the minutes, charge, and call fields. (more…)
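The summary measures named above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up account lengths, not the values behind Figure 3.14. It also shows the informal symmetry check of comparing the mean with the median.

```python
import statistics

# Made-up account lengths in months, for illustration only.
account_length = [1, 60, 85, 101, 101, 117, 142, 201]

summary = {
    "min": min(account_length),
    "max": max(account_length),
    "mean": statistics.mean(account_length),
    "median": statistics.median(account_length),
    "mode": statistics.mode(account_length),
    "stdev": statistics.stdev(account_length),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")

# A mean close to the median is a rough indication of symmetry.
# The tolerance here (5% of the range) is an arbitrary assumption.
value_range = summary["max"] - summary["min"]
roughly_symmetric = abs(summary["mean"] - summary["median"]) <= 0.05 * value_range
print("roughly symmetric:", roughly_symmetric)
```

A mean well above the median would instead point to right skew, and a mean well below it to left skew.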

USING EDA TO UNCOVER ANOMALOUS FIELDS

Wednesday, December 24th, 2008

Exploratory data analysis will sometimes uncover strange or anomalous records or fields which the earlier data cleaning phase may have missed. Consider, for example, the area code field in the present data set. Although the area codes contain numerals, they can also be used as categorical variables, since they can classify customers according to geographical location. We are intrigued by the fact that the area code field contains only three different values for all the records (408, 415, and 510), all three of which are in California. (more…)
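A quick way to spot this kind of anomaly is to count the distinct values of a field that is supposed to be categorical. The sketch below uses a few made-up records with an assumed field name. Only the three area code values themselves come from the text.

```python
from collections import Counter

# Made-up records for illustration; the field name is an assumption.
records = [
    {"area_code": 415},
    {"area_code": 408},
    {"area_code": 415},
    {"area_code": 510},
    {"area_code": 510},
    {"area_code": 415},
]

# Frequency of each value in the field.
area_code_counts = Counter(r["area_code"] for r in records)
distinct_area_codes = sorted(area_code_counts)

print("distinct values:", distinct_area_codes)
for code, count in sorted(area_code_counts.items()):
    print(f"  {code}: {count} records")
```

A field with far fewer distinct values than expected is worth questioning before it is used in a model.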

EXPLORING CATEGORICAL VARIABLES (3)

Tuesday, December 23rd, 2008

Again, we may quantify this finding by using cross-tabulations, as in Figure 3.8. First of all, 842 + 80 = 922 customers have the VoiceMail Plan, while 2008 + 403 = 2411 do not. We then find that 403/2411 = 16.7% of those without the VoiceMail Plan are churners, compared to 80/922 = 8.7% of customers who do have the VoiceMail Plan. Thus, customers without the VoiceMail Plan are nearly twice as likely to churn as customers with the plan. (more…)
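The arithmetic above can be reproduced directly from the four cell counts quoted in the text. This is a small sketch in which the dictionary layout and the names are assumptions, while the counts are those given for the cross-tabulation.

```python
# Cell counts as quoted in the text: (non-churners, churners) per group.
counts = {
    "no_plan": (2008, 403),
    "has_plan": (842, 80),
}

def churn_rate(non_churners, churners):
    """Proportion of churners within one group."""
    return churners / (non_churners + churners)

rate_no_plan = churn_rate(*counts["no_plan"])    # 403 / 2411
rate_has_plan = churn_rate(*counts["has_plan"])  # 80 / 922

# How many times more likely customers without the plan are to churn.
ratio = rate_no_plan / rate_has_plan

print(f"without plan: {rate_no_plan:.1%}")
print(f"with plan:    {rate_has_plan:.1%}")
print(f"ratio:        {ratio:.2f}")
```

The ratio comes out a little under 2, which matches the statement that customers without the plan are nearly twice as likely to churn.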