Clustering is often performed as a preliminary step in a data mining process, with the resulting clusters being used as further inputs into a different technique downstream, such as neural networks. Due to the enormous size of many present-day databases, it is often helpful to apply clustering analysis first, to reduce the search space for the downstream algorithms. In this chapter, after a brief look at hierarchical clustering methods, we discuss in detail k-means clustering; in Chapter 9 we examine clustering using Kohonen networks, a structure related to neural networks.
Clustering refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within the cluster is maximized, and the similarity to records outside this cluster is minimized.
Next, we apply a neural network model using Insightful Miner on the same adult data set [3] from the UC Irvine Machine Learning Repository that we analyzed in Chapter 6. The Insightful Miner neural network software was applied to a training set of 24,986 cases, using a single hidden layer with eight hidden nodes. The algorithm iterated 47 epochs (runs through the data set) before termination. The resulting neural network is shown in Figure 7.8. The squares on the left represent the input nodes. For the categorical variables, there is one input node per class. The eight dark circles represent the hidden layer. The light gray circles represent the constant inputs. There is only a single output node, indicating whether or not the record is classified as having income less than $50,000.
One of the drawbacks of neural networks is their opacity. The same wonderful flexibility that allows neural networks to model a wide range of nonlinear behavior also limits our ability to interpret the results using easily formulated rules. Unlike decision trees, no straightforward procedure exists for translating the weights of a neural network into a compact set of decision rules.
Clearly, a momentum component will help to dampen the oscillations around optimality mentioned earlier, by encouraging the adjustments to stay in the same direction. But momentum also helps in the early stages of the algorithm, by increasing the rate at which the weights approach the neighborhood of optimality. This is because these early adjustments will probably all be in the same direction, so that the exponential average of the adjustments will also be in that direction. Momentum is also helpful when the gradient of SSE with respect to w is flat. If the momentum term is too large, however, the weight adjustments may again overshoot the minimum, due to the cumulative influences of many previous adjustments.
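The effect of the momentum term can be illustrated with a minimal sketch. It assumes a single weight w, a toy error surface (w - target)^2 in place of a real network's SSE, and the common update rule in which each adjustment is the gradient step plus alpha times the previous adjustment. The names eta, alpha, and descend_with_momentum are illustrative choices, not taken from any particular software.

```python
def descend_with_momentum(eta, alpha, w=0.0, target=3.0, steps=30):
    """Gradient descent on the toy error (w - target)^2 with momentum term alpha."""
    delta = 0.0  # previous weight adjustment, zero before the first step
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * (w - target)
        # New adjustment = gradient step + alpha * previous adjustment.
        delta = -eta * gradient + alpha * delta
        w = w + delta
        path.append(w)
    return path

plain = descend_with_momentum(eta=0.01, alpha=0.0)    # no momentum
boosted = descend_with_momentum(eta=0.01, alpha=0.8)  # moderate momentum
heavy = descend_with_momentum(eta=0.01, alpha=0.99)   # momentum term too large

for name, path in (("plain", plain), ("boosted", boosted), ("heavy", heavy)):
    print(f"{name}: final w = {path[-1]:.3f}, largest w seen = {max(path):.3f}")
```

In this toy setting the run with moderate momentum ends much closer to the minimum at w = 3 than the run without momentum, because the early adjustments all point the same way and accumulate. The run with alpha = 0.99 carries the weight well past the minimum.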
When the learning rate η is very small, the weight adjustments tend to be very small. Thus, if η is small when the algorithm is initialized, the network will probably take an unacceptably long time to converge. Is the solution therefore to use large values for η? Not necessarily. Suppose that the algorithm is close to the optimal solution and we have a large value for η. This large η will tend to make the algorithm overshoot the optimal solution.
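A small numerical sketch may make this trade-off concrete. It assumes a single weight and a toy error surface (w - target)^2 rather than a real network; the learning rate is called eta in the code, and the function name descend and the three rates tried are arbitrary illustrative choices.

```python
def descend(eta, w=0.0, target=3.0, steps=20):
    """Run gradient descent on the toy error (w - target)^2; return every w visited."""
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * (w - target)
        w = w - eta * gradient  # step against the gradient, scaled by the learning rate
        path.append(w)
    return path

for eta in (0.001, 0.4, 0.95):
    path = descend(eta)
    first_steps = [round(v, 3) for v in path[:4]]
    print(f"eta={eta}: first steps {first_steps}, final w = {path[-1]:.3f}")
```

With the very small rate the weight has barely left its starting value after 20 steps. With the large rate each step jumps across the minimum at w = 3, so the weight lands alternately above and below it.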
The neural network algorithm would then proceed to work through the training data set, record by record, adjusting the weights constantly to reduce the prediction error. It may take many passes through the data set before the algorithm's termination criterion is met. What, then, serves as the termination criterion, or stopping criterion? If training time is an issue, one may simply set the number of passes through the data, or the amount of real time the algorithm may consume, as termination criteria. However, what one gains in short training time is probably bought with degradation in model efficacy.
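The two simple termination criteria mentioned here, a cap on the number of passes and a cap on elapsed real time, might be sketched as follows. The callback update_epoch is hypothetical: it stands for one pass through the training data and returns the current SSE. The toy version used below simply shrinks a number by 10% per pass and is not a real training step.

```python
import time

def train(update_epoch, max_epochs=100, max_seconds=5.0):
    """Call update_epoch() once per pass until one of the stopping rules fires."""
    start = time.monotonic()
    history = []
    reason = "max_epochs"
    for _ in range(max_epochs):
        history.append(update_epoch())
        # Stop early if the real-time budget is used up.
        if time.monotonic() - start >= max_seconds:
            reason = "max_seconds"
            break
    return history, reason

def make_toy_epoch(sse=100.0, decay=0.9):
    """Hypothetical stand-in for a training pass: SSE falls by 10% each call."""
    state = {"sse": sse}
    def update_epoch():
        state["sse"] *= decay
        return state["sse"]
    return update_epoch

history, reason = train(make_toy_epoch(), max_epochs=10)
print(len(history), reason, round(history[-1], 2))
```

Both rules stop the loop regardless of how good the model has become, which is why a short training time may come at the cost of model efficacy.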
We must therefore turn to optimization methods, specifically gradient-descent methods, to help us find the set of weights that will minimize SSE. Suppose that we have a set (vector) of m weights w = (w0, w1, w2, ..., wm) in our neural network model and we wish to find the values for each of these weights that, together, minimize SSE. We can use the gradient descent method, which gives us the direction in which we should adjust the weights in order to decrease SSE.
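As a minimal sketch of the idea, the code below applies gradient descent to the SSE of a two-weight linear model, y_hat = w[0] + w[1] * x, which stands in for a neural network so that the gradient can be written out by hand. The toy data, the learning rate of 0.01, and the function names are all illustrative assumptions.

```python
def sse(w, data):
    """Sum of squared errors for the linear model y_hat = w[0] + w[1] * x."""
    return sum((y - (w[0] + w[1] * x)) ** 2 for x, y in data)

def sse_gradient(w, data):
    """Partial derivatives of SSE with respect to w[0] and w[1]."""
    g0 = sum(-2.0 * (y - (w[0] + w[1] * x)) for x, y in data)
    g1 = sum(-2.0 * (y - (w[0] + w[1] * x)) * x for x, y in data)
    return [g0, g1]

def gradient_descent(w, data, learning_rate=0.01, steps=2000):
    """Repeatedly move each weight a small step against its partial derivative."""
    for _ in range(steps):
        g = sse_gradient(w, data)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

# Toy data generated from y = 1 + 2x (an illustrative assumption).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w_start = [0.0, 0.0]
w_fit = gradient_descent(w_start, data)
print("SSE before:", sse(w_start, data))
print("SSE after: ", sse(w_fit, data))
print("weights:   ", [round(v, 4) for v in w_fit])
```

The gradient points in the direction of steepest increase in SSE, so subtracting a small multiple of it from the weights moves them in the direction that decreases SSE.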
Why use the sigmoid function? Because it combines nearly linear behavior, curvilinear behavior, and nearly constant behavior, depending on the value of the input. Figure 7.3 shows the graph of the sigmoid function y = f(x) = 1/(1 + e^(-x)), for -5 < x < 5 [although f(x) may theoretically take any real-valued input]. Through much of the center of the domain of the input x (e.g., -1 < x < 1), the behavior of f(x) is nearly linear. As the input moves away from the center, f(x) becomes curvilinear. By the time the input reaches extreme values, f(x) becomes nearly constant.
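These three regimes can be checked numerically with a short sketch that evaluates the sigmoid and its slope at a few inputs. The slope formula f'(x) = f(x)(1 - f(x)) is the standard derivative of the logistic function; the sample points are arbitrary.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x: float) -> float:
    """Derivative of the sigmoid, f'(x) = f(x) * (1 - f(x))."""
    y = sigmoid(x)
    return y * (1.0 - y)

for x in (-5, -1, 0, 1, 5):
    print(f"x={x:+d}  f(x)={sigmoid(x):.4f}  slope={sigmoid_slope(x):.4f}")
```

The slope is largest at x = 0, where it equals 0.25, and falls toward zero at x = -5 and x = 5, where f(x) is nearly constant.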
Let us examine the simple neural network shown in Figure 7.2. A neural network consists of a layered, feedforward, completely connected network of artificial neurons, or nodes. The feedforward nature of the network restricts the network to a single direction of flow and does not allow looping or cycling. The neural network is composed of two or more layers, although most networks consist of three layers: an input layer, a hidden layer, and an output layer. There may be more than one hidden layer, although most networks contain only one, which is sufficient for most purposes. The neural network is completely connected, meaning that every node in a given layer is connected to every node in the next layer, although not to other nodes in the same layer. Each connection between nodes has a weight (e.g., W1A) associated with it. At initialization, these weights are randomly assigned to values between zero and 1.
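The structure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the network of Figure 7.2: it uses three input nodes, two hidden nodes, and one output node, applies a sigmoid at each node, gives each node an extra weight on a constant input of 1, and feeds in a made-up record. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Logistic activation applied at each hidden and output node."""
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_inputs, n_nodes, rng):
    """One weight per (input, node) pair plus a constant-input weight, each in [0, 1)."""
    return [[rng.random() for _ in range(n_inputs + 1)] for _ in range(n_nodes)]

def forward_layer(inputs, weights):
    """Every node combines all values from the previous layer; no same-layer links."""
    extended = [1.0] + list(inputs)  # leading 1.0 is the constant input
    return [sigmoid(sum(w * x for w, x in zip(node_w, extended)))
            for node_w in weights]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden_weights = init_layer(n_inputs=3, n_nodes=2, rng=rng)
output_weights = init_layer(n_inputs=2, n_nodes=1, rng=rng)

record = [0.4, 0.2, 0.7]  # hypothetical input values
hidden = forward_layer(record, hidden_weights)   # input layer -> hidden layer
output = forward_layer(hidden, output_weights)   # hidden layer -> output layer
print("hidden:", hidden)
print("output:", output)
```

Values flow in one direction only, from the input layer through the hidden layer to the output node, which is what the feedforward restriction means in practice.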


