Skip to main content

Understanding Projection Pursuit Regression

The following article gives an overview of the paper "Projection Pursuit Regression” published by Friedman J. H and Stuetzle W. You will need basic background of Machine Learning and Regression before understanding this article. The algorithms and images are taken from the paper. (

What is Regression?
Regression is a machine learning technology used to predict a response variable given multiple predictor variables or features. The main distinction is that the response to be predicted is any real value and not just any class or cluster name. Hence though similar to Classification in terms of making a prediction, it is largely different given what it’s predicting. 

A simple to understand real world problem of regression would be predicting the sale price of a particular house based on it’s square footage, given that we have data of similar houses sold in that area in the past. The regression solution will be found by coming up with some equation of the form:
Sale_price = square_footage * price_per_sq_ft + some_base_price. 

It is the task of the algorithm to find the values of the “price_per_sq_ft” and “some_base_price” that best fit the given data. Such a form of regression is termed as Linear Regression, simply because the response variable is “assumed” to be a linear combination of the multiple input predictor variables.

Projection Pursuit Regression

Although linear regression is simple to understand and implement, it has it’s own set of limitations. The limitation lies in the assumption made by linear regression. It assumes the function of the regression surface to be a direct linear combination of the predictor variables. While it greatly reduces the task to just finding the correct coefficients (or weights or parameters), it bears the cost of inaccurate predictions. 

Projection Pursuit Regression (PPR) is a non parametric regression algorithm used exactly to overcome this limitation. PPR starts with a more general assumption for the regression surface to be found. It then refines the assumption in successive iterations and simultaneously also finds the parameters that fit best. PPR is more suitable for getting accurate predictions in real world prediction problems where the nature of regression surface, though unknown, is at least definitely not linear. 


  • The regression surface is generally assumed as a linear combination (sum) of empirically determined functions (denoted by S) of the predictor variables (X), as shown below.
  • Working on this assumption, the algorithm works in the following 3 steps: 

  • In the first step, we initialise “residual” to given response variable (y), and the term counter M to 0.
  • In the second step, consider the equation for figure of merit, I(alpha). The numerator in second term is a measure of how close the prediction (S(alpha . x)) is to the actual real value denoted by the residual (r). The smaller the difference, higher will be the value of I(alpha). Using certain numerical optimisation technique, the value of alpha and S is found such that it maximises the value of I(alpha)
  • In the last step, first the value of figure of merit is checked. If less than a certain given threshold, the last term is removed from the model and the algorithm terminated. Else, the residuals are updated by subtracting the prediction made by the latest term. Conceptually,  “residual” thus means the residue from the original response y, which is still not predicted by the terms already found by the algorithm. 

Univariate Smoothing

An important part of the algorithm is determining the function S which is commonly known as the smooth function. Smoothing technique in most basic terms finds a function which fits a set of data points. It does so by finding more potential points which may fit the function. These new points are found by taking an average of already existing neighbour points.

S(zi) =  AVGi-k <= j <= i+k (yj), k = bandwidth

All traditional smoothing algorithms consider a uniform bandwidth throughout. However, in this paper, they have implemented a variable bandwidth smoothing. It uses an average bandwidth value provided by the user. Based on response variability estimates, the  actual bandwidth used is larger or smaller than the given value. The response variance provides an estimate as to how “close” the response variables are. If there is a tight cluster of response variables, the bandwidth used for smoothing there would be larger and vice versa. 


The paper exhibits the working of the algorithm with a couple of example datasets.

Example 1:  Y = X1X2 + e

In this example, dataset with the above model is used and PPR as the result gives the equivalent equation as follows:
Y = 0.25(X1+X2)2 - 0.25(X1-X2)2 
The following figures give a graphical interpretation of the steps of the PPR algorithm.

Figure 1a: Initial plot of Y v/s X2 shows that the smooth (denoted by *) hardly fits the data points (denoted by +). Hence Y is certainly not just directly proportional to X1 or X2 alone. 

Figure 1b: This plot shows Y v/s S1(alpha1. X). It is clearly seen that this smooth nicely fits the data points and is parabolic in nature. The first term of the equation is thus obtained as 0.25(X1+X2)2

Figure 1c:  This plot shows [Y - S1(alpha1. X)] (i.e. residual) v/s S2(alpha2. X). Again it’s a good fit and is an inverted parabola. Thus the second term of the resultant equation is obtained as -0.25(X1-X2)2

Figure 1d: This plot shows residual from previous round v/s S3(alpha3. X). Since it’s not a good fit, this term is discarded. Mathematically too, in this iteration the figure of merit I(alpha) is less than the pre-decided threshold and the algorithm is terminated.

Example 2:  Air pollution Data
The relation between the amount of suspended particulate matter (Y) and predictor variables mean wind speed (X1), average temperature (X2), insolation (X3), and wind direction at 4:00am (X4) and 4:00pm (X5)
Following were the plots obtained for the 3 iterations of the PPR algorithm. All three plots show the smooth to be a reasonable fit, and the regression surface is the combination of all the smooth function terms.

Variations/Options applicable to PPR
  • Backfitting: Readjustment of the smooths along previously determined linear combinations when a new linear combination is found. This is a slightly more refined addition to the PPR algorithm. 
  • Projection Selection: Restrict the search for solution directions to the set of predictors. Thus instead of having all the predictor variables in each term, restrict it to the set of variables having highest contribution in that direction. For instance, in the second example above, perform smoothing only for X1 , X4 and X5 respectively in the three iterations. One can also use a combination of Projection selection and Projection pursuit.

Following are some of the advantages of projection pursuit against other techniques:
  • Sparsity limitation of other non parametric methods like kernel and nearest neighbours is not encountered in PPR since estimation (smoothing) is univariate
  • PPR although uses successive refinement, it doesn’t partition the data
  • Interactions among different predictors is directly considered
  • In terms of ascending generality:
             Linear regression < Projection Selection < Projection Pursuit Regression
  • As seen in examples, PPR can represent each iteration graphically, thus facilitating interpretation.
  • Such output can be used to adjust main parameters of the procedure: average smoother bandwidth and termination threshold.
  • For sample size n, dimensionality p, and number of iterations M, computation of the model grows as M*p*n*log(n)
Projection Pursuit Regression is certainly more general in it’s assumptions and hence tends to give better approximations of the regression surface.  It can surely be used as an approach for the real world datasets where there is a possibility that linear regression will fail because of the restricting assumptions.

Friedman J. H and Stuetzle W. (1981), “Projection Pursuit Regression”, Journal of the American Statistical Association

Mansi Shah
Research Fellow
CereLabs Pvt. Ltd.


  1. Amazing content, a lot of amazing websites which is actually new for me

    Chat Ai Detector


Post a Comment

Popular posts from this blog

GPU - The brain of Artificial Intelligence

Machine Learning algorithms require tens and thousands of CPU based servers to train a model, which turns out to be an expensive activity. Machine Learning researchers and engineers are often faced with the problem of running their algorithms fast. Although initially invented for processing graphics in computer games, GPUs today are used in machine learning to perform feature detection from vast amount of unlabeled data. Compared to CPUs, GPUs take far less time to train models that perform classification and prediction. Characteristics of GPUs that make them ideal for machine learning Handle large datasets Needs far less data centre infrastructure Can be specialized for specific machine learning needs Perform vector computations faster than any known processor Designed to perform data parallel computation NVIDIA CUDA GPUs today are used to build deep learning image processing tools for  Adobe Creative Cloud. According to NVIDIA blog future Adobe applicati

Building Commonsense in AI

It is often debated that what makes humans the ultimate intelligent species is the innate quality of doing commonsense reasoning. Humans use common sense knowledge about the world around to take appropriate decisions, and this turns out to be the necessary ingredient for their survival. AI researches have long thought about building commonsense knowledge in AI. They argue that if AI possess necessary commonsense knowledge then it will be a truly intelligent machine. We will discuss two major commonsense projects that exploit this idea: Cyc tries to build a comprehensive ontology and knowledge base of everyday commonsense knowledge. This knowledge can be used by AI applications to do human-like reasoning. Started in 1984, Cyc has come a long way. Today, OpenCyc 4.0 includes the entire Cyc ontology, containing 239,000 concepts and 2,093,000 facts and can be browsed on the OpenCyc website - . OpenCyc is available for download from Source