
Understanding Projection Pursuit Regression

The following article gives an overview of the paper “Projection Pursuit Regression” published by Friedman J. H. and Stuetzle W. You will need a basic background in machine learning and regression to follow this article. The algorithms and images are taken from the paper.

What is Regression?
Regression is a machine learning technique used to predict a response variable from one or more predictor variables (features). The main distinction is that the response to be predicted is a real value rather than a class or cluster label. So although regression resembles classification in that both make a prediction, it differs in what is being predicted.

A simple real-world regression problem is predicting the sale price of a particular house from its square footage, given data on similar houses sold in that area in the past. The regression solution is found by coming up with an equation of the form:
Sale_price = square_footage * price_per_sq_ft + some_base_price. 

It is the task of the algorithm to find the values of “price_per_sq_ft” and “some_base_price” that best fit the given data. This form of regression is called Linear Regression, because the response variable is assumed to be a linear combination of the input predictor variables.
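
As a minimal sketch of this fit, the snippet below estimates the two unknowns by ordinary least squares with NumPy. The square footages and prices are made-up numbers chosen for illustration, and the function name predict_sale_price is hypothetical.

```python
import numpy as np

# Hypothetical past sales (illustrative numbers only, not real market data)
square_footage = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
sale_price = np.array([130000.0, 150000.0, 170000.0, 200000.0, 250000.0])

# Each row is [square_footage, 1]; the column of ones carries the base price
design = np.column_stack([square_footage, np.ones_like(square_footage)])

# Ordinary least squares: minimise the squared error of design @ coef - sale_price
coef, *_ = np.linalg.lstsq(design, sale_price, rcond=None)
price_per_sq_ft, some_base_price = coef


def predict_sale_price(sq_ft):
    """Apply the fitted linear equation to a new square footage."""
    return sq_ft * price_per_sq_ft + some_base_price
```

On this toy data the fit recovers a price of about 100 per square foot and a base price of about 50000, because the numbers were generated from exactly that line.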

Projection Pursuit Regression

Although linear regression is simple to understand and implement, it has its own limitations. They stem from the assumption it makes: that the regression surface is a direct linear combination of the predictor variables. While this reduces the task to finding the right coefficients (or weights, or parameters), it comes at the cost of inaccurate predictions whenever the true relationship is not linear.

Projection Pursuit Regression (PPR) is a nonparametric regression algorithm designed to overcome this limitation. PPR starts with a more general assumption about the regression surface to be found. It then refines the model over successive iterations, finding at each one the parameters that fit best. PPR is better suited to real-world prediction problems where the nature of the regression surface, though unknown, is known not to be linear.


  • The regression surface is assumed to be a sum of empirically determined smooth functions (denoted by S), each applied to a linear combination (alpha . X) of the predictor variables (X): phi(X) = S_1(alpha_1 . X) + S_2(alpha_2 . X) + ... + S_M(alpha_M . X).
  • Working on this assumption, the algorithm proceeds in the following 3 steps: 

  • In the first step, we initialise the “residual” (r) to the given response variable (y), and the term counter M to 0.
  • In the second step, consider the figure of merit, I(alpha) = 1 - sum_i (r_i - S(alpha . x_i))^2 / sum_i r_i^2. The numerator of the second term measures how close the prediction S(alpha . x) is to the actual value, denoted by the residual r. The smaller the difference, the higher the value of I(alpha). Using a numerical optimisation technique, the direction alpha and its smooth S are found such that I(alpha) is maximised.
  • In the last step, the value of the figure of merit is checked first. If it is less than a given threshold, the last term is dropped from the model and the algorithm terminates. Otherwise, the residuals are updated by subtracting the prediction made by the latest term, and the algorithm returns to the second step. Conceptually, the “residual” is the part of the original response y that is not yet predicted by the terms already found by the algorithm.
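
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under several assumptions that are not from the paper: the smooth is a plain constant-bandwidth running mean, the direction search simply tries the coordinate axes plus random unit vectors instead of using a proper numerical optimiser, the response is centred before fitting, and all function and parameter names (running_mean_smooth, figure_of_merit, fit_ppr, n_candidates, max_terms) are hypothetical.

```python
import numpy as np


def running_mean_smooth(z, r, k):
    """Smooth r against z with a running mean over k neighbours on each side."""
    order = np.argsort(z)
    r_sorted = r[order]
    n = len(r)
    fitted = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        fitted[order[i]] = r_sorted[lo:hi].mean()
    return fitted


def figure_of_merit(alpha, X, r, k):
    """Return I(alpha) and the smooth evaluated at every observation."""
    s = running_mean_smooth(X @ alpha, r, k)
    return 1.0 - np.sum((r - s) ** 2) / np.sum(r ** 2), s


def fit_ppr(X, y, k=10, threshold=0.15, n_candidates=100, max_terms=5, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    r = y - y.mean()  # step 1: residual starts as the (centred) response, M = 0
    terms = []
    while len(terms) < max_terms:
        # Step 2: crude search for the direction that maximises I(alpha)
        candidates = np.vstack([np.eye(p), rng.normal(size=(n_candidates, p))])
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
        best_merit, best_alpha, best_s = -np.inf, None, None
        for alpha in candidates:
            merit, s = figure_of_merit(alpha, X, r, k)
            if merit > best_merit:
                best_merit, best_alpha, best_s = merit, alpha, s
        # Step 3: stop if the new term explains too little, else update residuals
        if best_merit < threshold:
            break
        terms.append((best_alpha, best_merit))
        r = r - best_s
    return terms, r


# Toy data in the spirit of Example 1 of the paper: Y = X1*X2 + noise
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=200)
terms, residual = fit_ppr(X, y)
```

Each entry of terms holds a unit direction and the figure of merit it achieved. Because a running mean that includes the point itself always explains a little of the residual, the threshold in this sketch is set above that spurious level, and max_terms caps the number of terms as a safeguard.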

Univariate Smoothing

An important part of the algorithm is determining the function S, commonly known as the smooth. In the most basic terms, a smoothing technique finds a function that fits a set of data points (z_i, y_i), where z_i = alpha . x_i is the projection of the i-th observation. It does so by ordering the points by z and estimating the function at each point as the average of the responses of its neighbouring points.

S(z_i) = AVG_{i-k <= j <= i+k} (y_j), where k is the bandwidth
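
A small worked example may make the formula concrete. The sketch below implements the constant-bandwidth average exactly as written, with one assumption of its own: near the ends of the ordered data, where fewer than k neighbours exist on one side, the window is simply truncated. The function name running_mean is hypothetical.

```python
import numpy as np


def running_mean(z, y, k):
    """S(z_i): average of y_j over the k neighbours on each side in z-order."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(z)            # indices of the points sorted by z
    y_sorted = y[order]
    smooth = np.empty(len(y))
    for i in range(len(y)):
        lo = max(0, i - k)           # the window is truncated at the ends
        hi = min(len(y), i + k + 1)
        smooth[order[i]] = y_sorted[lo:hi].mean()
    return smooth


z = [3, 1, 2, 5, 4]
y = [30, 10, 20, 50, 40]
print(running_mean(z, y, k=1))  # [30. 15. 20. 45. 40.]
```

Interior points are replaced by the mean of three neighbouring responses, while the smallest and largest z values only have two points in their window, which is why they move towards the middle.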

Many traditional smoothing algorithms use a uniform bandwidth throughout. In this paper, however, the authors implement variable-bandwidth smoothing. The user provides an average bandwidth value, and the actual bandwidth used at each point is larger or smaller than this value, depending on an estimate of the local response variability. The response variability indicates how closely the responses cluster: where the responses vary widely, a larger bandwidth is used so that more points are averaged, and where they form a tight cluster, a smaller bandwidth is used.


The paper demonstrates the algorithm on a couple of example datasets.

Example 1:  Y = X1X2 + e

In this example, a dataset generated from the above model is used, and PPR recovers the equivalent form:
Y = 0.25(X1+X2)^2 - 0.25(X1-X2)^2
The following figures give a graphical interpretation of the steps of the PPR algorithm.

Figure 1a: The initial plot of Y vs. X2 shows that the smooth (denoted by *) hardly fits the data points (denoted by +). Hence Y cannot be explained by a single predictor such as X2 alone.

Figure 1b: This plot shows Y vs. the first projection alpha1 . X, together with the smooth S1. The smooth clearly fits the data points well and is parabolic in shape. The first term of the equation is thus obtained as 0.25(X1+X2)^2

Figure 1c: This plot shows the residual [Y - S1(alpha1 . X)] vs. the second projection alpha2 . X, together with the smooth S2. Again the fit is good, and the smooth is an inverted parabola. Thus the second term of the resulting equation is obtained as -0.25(X1-X2)^2

Figure 1d: This plot shows the residual from the previous round vs. the third projection alpha3 . X, together with the smooth S3. Since the fit is poor, this term is discarded. Mathematically too, in this iteration the figure of merit I(alpha) is less than the pre-decided threshold, and the algorithm terminates.
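
The decomposition reported for Example 1 can be checked by expanding the two squares: the squared terms cancel and the cross terms add. The second display additionally assumes that the directions are normalised to unit length, which is an assumption of this note; under it, the two smooths are simply an upright and an inverted parabola in the projected variable z.

```latex
\[
\begin{aligned}
\tfrac{1}{4}(X_1 + X_2)^2 - \tfrac{1}{4}(X_1 - X_2)^2
  &= \tfrac{1}{4}\left(X_1^2 + 2X_1X_2 + X_2^2\right)
   - \tfrac{1}{4}\left(X_1^2 - 2X_1X_2 + X_2^2\right) \\
  &= X_1 X_2 .
\end{aligned}
\]
\[
\alpha_1 = \tfrac{1}{\sqrt{2}}(1, 1),\quad S_1(z) = \tfrac{1}{2}z^2,
\qquad
\alpha_2 = \tfrac{1}{\sqrt{2}}(1, -1),\quad S_2(z) = -\tfrac{1}{2}z^2 .
\]
```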

Example 2:  Air pollution Data
This example models the relation between the amount of suspended particulate matter (Y) and the predictor variables mean wind speed (X1), average temperature (X2), insolation (X3), and wind direction at 4:00am (X4) and 4:00pm (X5).
Plots were obtained for the 3 iterations of the PPR algorithm. All three plots show the smooth to be a reasonable fit, and the regression surface is the sum of all the smooth function terms.

Variations/Options applicable to PPR
  • Backfitting: When a new linear combination is found, the smooths along the previously determined linear combinations are readjusted. This is a slightly more refined addition to the PPR algorithm.
  • Projection Selection: Restrict the search for solution directions to the individual predictors. Thus, instead of each term involving all the predictor variables, each term is a function of the single variable that contributes most in that iteration. For instance, in the second example above, smoothing would be performed only on X1, X4 and X5 respectively in the three iterations. One can also use a combination of projection selection and projection pursuit.
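
The backfitting idea can be sketched as follows. This is an illustration rather than the paper's procedure: the directions are assumed to be already found and are held fixed, the smooth is a constant-bandwidth running mean, the response is centred, and the names backfit, n_sweeps and running_mean_smooth are hypothetical. Each smooth is re-estimated in turn against the partial residual, that is, the response minus the contributions of all the other terms.

```python
import numpy as np


def running_mean_smooth(z, r, k):
    """Smooth r against z with a running mean over k neighbours on each side."""
    order = np.argsort(z)
    r_sorted = r[order]
    n = len(r)
    fitted = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        fitted[order[i]] = r_sorted[lo:hi].mean()
    return fitted


def backfit(X, y, alphas, k=10, n_sweeps=5):
    """Re-estimate every smooth in turn while the directions stay fixed."""
    y_centred = y - y.mean()
    smooths = [np.zeros(len(y)) for _ in alphas]
    for _ in range(n_sweeps):
        for m, alpha in enumerate(alphas):
            # Partial residual: what the other terms leave unexplained
            others = sum(s for j, s in enumerate(smooths) if j != m)
            smooths[m] = running_mean_smooth(X @ alpha, y_centred - others, k)
    return smooths


# Noise-free version of Example 1 with the two directions assumed known
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] * X[:, 1]
alphas = [np.array([1.0, 1.0]) / np.sqrt(2), np.array([1.0, -1.0]) / np.sqrt(2)]
smooths = backfit(X, y, alphas)
residual = (y - y.mean()) - sum(smooths)
```

The first sweep is equivalent to fitting the terms one after another; later sweeps let an earlier smooth shed whatever a later term explains better.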

Following are some of the advantages of projection pursuit over other techniques:
  • The sparsity limitation of other nonparametric methods such as kernel and nearest-neighbour techniques is not encountered in PPR, since estimation (smoothing) is univariate
  • Although PPR uses successive refinement, it does not partition the data
  • Interactions among different predictors are directly considered
  • In terms of ascending generality:
             Linear regression < Projection Selection < Projection Pursuit Regression
  • As seen in the examples, PPR can represent each iteration graphically, thus facilitating interpretation.
  • Such output can be used to adjust the main parameters of the procedure: the average smoother bandwidth and the termination threshold.
  • For sample size n, dimensionality p, and number of iterations M, the computation required grows as M*p*n*log(n)
Projection Pursuit Regression is more general in its assumptions and hence tends to give better approximations of the regression surface. It is a reasonable approach for real-world datasets where linear regression may fail because of its restrictive assumptions.

Friedman, J. H. and Stuetzle, W. (1981), “Projection Pursuit Regression”, Journal of the American Statistical Association, 76(376), 817-823.

Mansi Shah
Research Fellow
CereLabs Pvt. Ltd.

