Skip to main content

Understanding Projection Pursuit Regression

The following article gives an overview of the paper "Projection Pursuit Regression” published by Friedman J. H and Stuetzle W. You will need basic background of Machine Learning and Regression before understanding this article. The algorithms and images are taken from the paper. (

What is Regression?
Regression is a machine learning technology used to predict a response variable given multiple predictor variables or features. The main distinction is that the response to be predicted is any real value and not just any class or cluster name. Hence though similar to Classification in terms of making a prediction, it is largely different given what it’s predicting. 

A simple to understand real world problem of regression would be predicting the sale price of a particular house based on it’s square footage, given that we have data of similar houses sold in that area in the past. The regression solution will be found by coming up with some equation of the form:
Sale_price = square_footage * price_per_sq_ft + some_base_price. 

It is the task of the algorithm to find the values of the “price_per_sq_ft” and “some_base_price” that best fit the given data. Such a form of regression is termed as Linear Regression, simply because the response variable is “assumed” to be a linear combination of the multiple input predictor variables.

Projection Pursuit Regression

Although linear regression is simple to understand and implement, it has it’s own set of limitations. The limitation lies in the assumption made by linear regression. It assumes the function of the regression surface to be a direct linear combination of the predictor variables. While it greatly reduces the task to just finding the correct coefficients (or weights or parameters), it bears the cost of inaccurate predictions. 

Projection Pursuit Regression (PPR) is a non parametric regression algorithm used exactly to overcome this limitation. PPR starts with a more general assumption for the regression surface to be found. It then refines the assumption in successive iterations and simultaneously also finds the parameters that fit best. PPR is more suitable for getting accurate predictions in real world prediction problems where the nature of regression surface, though unknown, is at least definitely not linear. 


  • The regression surface is generally assumed as a linear combination (sum) of empirically determined functions (denoted by S) of the predictor variables (X), as shown below.
  • Working on this assumption, the algorithm works in the following 3 steps: 

  • In the first step, we initialise “residual” to given response variable (y), and the term counter M to 0.
  • In the second step, consider the equation for figure of merit, I(alpha). The numerator in second term is a measure of how close the prediction (S(alpha . x)) is to the actual real value denoted by the residual (r). The smaller the difference, higher will be the value of I(alpha). Using certain numerical optimisation technique, the value of alpha and S is found such that it maximises the value of I(alpha)
  • In the last step, first the value of figure of merit is checked. If less than a certain given threshold, the last term is removed from the model and the algorithm terminated. Else, the residuals are updated by subtracting the prediction made by the latest term. Conceptually,  “residual” thus means the residue from the original response y, which is still not predicted by the terms already found by the algorithm. 

Univariate Smoothing

An important part of the algorithm is determining the function S which is commonly known as the smooth function. Smoothing technique in most basic terms finds a function which fits a set of data points. It does so by finding more potential points which may fit the function. These new points are found by taking an average of already existing neighbour points.

S(zi) =  AVGi-k <= j <= i+k (yj), k = bandwidth

All traditional smoothing algorithms consider a uniform bandwidth throughout. However, in this paper, they have implemented a variable bandwidth smoothing. It uses an average bandwidth value provided by the user. Based on response variability estimates, the  actual bandwidth used is larger or smaller than the given value. The response variance provides an estimate as to how “close” the response variables are. If there is a tight cluster of response variables, the bandwidth used for smoothing there would be larger and vice versa. 


The paper exhibits the working of the algorithm with a couple of example datasets.

Example 1:  Y = X1X2 + e

In this example, dataset with the above model is used and PPR as the result gives the equivalent equation as follows:
Y = 0.25(X1+X2)2 - 0.25(X1-X2)2 
The following figures give a graphical interpretation of the steps of the PPR algorithm.

Figure 1a: Initial plot of Y v/s X2 shows that the smooth (denoted by *) hardly fits the data points (denoted by +). Hence Y is certainly not just directly proportional to X1 or X2 alone. 

Figure 1b: This plot shows Y v/s S1(alpha1. X). It is clearly seen that this smooth nicely fits the data points and is parabolic in nature. The first term of the equation is thus obtained as 0.25(X1+X2)2

Figure 1c:  This plot shows [Y - S1(alpha1. X)] (i.e. residual) v/s S2(alpha2. X). Again it’s a good fit and is an inverted parabola. Thus the second term of the resultant equation is obtained as -0.25(X1-X2)2

Figure 1d: This plot shows residual from previous round v/s S3(alpha3. X). Since it’s not a good fit, this term is discarded. Mathematically too, in this iteration the figure of merit I(alpha) is less than the pre-decided threshold and the algorithm is terminated.

Example 2:  Air pollution Data
The relation between the amount of suspended particulate matter (Y) and predictor variables mean wind speed (X1), average temperature (X2), insolation (X3), and wind direction at 4:00am (X4) and 4:00pm (X5)
Following were the plots obtained for the 3 iterations of the PPR algorithm. All three plots show the smooth to be a reasonable fit, and the regression surface is the combination of all the smooth function terms.

Variations/Options applicable to PPR
  • Backfitting: Readjustment of the smooths along previously determined linear combinations when a new linear combination is found. This is a slightly more refined addition to the PPR algorithm. 
  • Projection Selection: Restrict the search for solution directions to the set of predictors. Thus instead of having all the predictor variables in each term, restrict it to the set of variables having highest contribution in that direction. For instance, in the second example above, perform smoothing only for X1 , X4 and X5 respectively in the three iterations. One can also use a combination of Projection selection and Projection pursuit.

Following are some of the advantages of projection pursuit against other techniques:
  • Sparsity limitation of other non parametric methods like kernel and nearest neighbours is not encountered in PPR since estimation (smoothing) is univariate
  • PPR although uses successive refinement, it doesn’t partition the data
  • Interactions among different predictors is directly considered
  • In terms of ascending generality:
             Linear regression < Projection Selection < Projection Pursuit Regression
  • As seen in examples, PPR can represent each iteration graphically, thus facilitating interpretation.
  • Such output can be used to adjust main parameters of the procedure: average smoother bandwidth and termination threshold.
  • For sample size n, dimensionality p, and number of iterations M, computation of the model grows as M*p*n*log(n)
Projection Pursuit Regression is certainly more general in it’s assumptions and hence tends to give better approximations of the regression surface.  It can surely be used as an approach for the real world datasets where there is a possibility that linear regression will fail because of the restricting assumptions.

Friedman J. H and Stuetzle W. (1981), “Projection Pursuit Regression”, Journal of the American Statistical Association

Mansi Shah
Research Fellow
CereLabs Pvt. Ltd.


  1. The development of artificial intelligence (AI) has propelled more programming architects, information scientists, and different experts to investigate the plausibility of a vocation in machine learning. Notwithstanding, a few newcomers will in general spotlight a lot on hypothesis and insufficient on commonsense application. machine learning projects for final year In case you will succeed, you have to begin building machine learning projects in the near future.

    Projects assist you with improving your applied ML skills rapidly while allowing you to investigate an intriguing point. Furthermore, you can include projects into your portfolio, making it simpler to get a vocation, discover cool profession openings, and Final Year Project Centers in Chennai even arrange a more significant compensation.

    Data analytics is the study of dissecting crude data so as to make decisions about that data. Data analytics advances and procedures are generally utilized in business ventures to empower associations to settle on progressively Python Training in Chennai educated business choices. In the present worldwide commercial center, it isn't sufficient to assemble data and do the math; you should realize how to apply that data to genuine situations such that will affect conduct. In the program you will initially gain proficiency with the specialized skills, including R and Python dialects most usually utilized in data analytics programming and usage; Python Training in Chennai at that point center around the commonsense application, in view of genuine business issues in a scope of industry segments, for example, wellbeing, promoting and account.

    The Nodejs Training Angular Training covers a wide range of topics including Components, Angular Directives, Angular Services, Pipes, security fundamentals, Routing, and Angular programmability. The new Angular TRaining will lay the foundation you need to specialise in Single Page Application developer. Angular Training


Post a Comment

Popular posts from this blog

GPU - The brain of Artificial Intelligence

Machine Learning algorithms require tens and thousands of CPU based servers to train a model, which turns out to be an expensive activity. Machine Learning researchers and engineers are often faced with the problem of running their algorithms fast. Although initially invented for processing graphics in computer games, GPUs today are used in machine learning to perform feature detection from vast amount of unlabeled data. Compared to CPUs, GPUs take far less time to train models that perform classification and prediction. Characteristics of GPUs that make them ideal for machine learning Handle large datasets Needs far less data centre infrastructure Can be specialized for specific machine learning needs Perform vector computations faster than any known processor Designed to perform data parallel computation NVIDIA CUDA GPUs today are used to build deep learning image processing tools for  Adobe Creative Cloud. According to NVIDIA blog future Adobe applicati

Understanding Generative Adverserial Networks - Part 1

This is a two part series on understanding Generative Adversarial Networks (GANs). This part deals with the conceptual understanding of GANs. In the second part we will try to understand the mathematics behind GANs. Generative networks have been in use for quite a while now. And so have discriminative networks. But only in 2014 did someone get the brilliant idea of using them together. These are the generative adversarial networks. This kind of deep learning model was invented by Ian Goodfellow . When we work with data already labelled, it’s called supervised learning. It’s much easier compared to unsupervised learning, which has no predefined labels, making the task more vague.  "Generative Adversarial Networks is the most interesting idea in the last ten years in Machine Learning." - Yann LeCun In this post, we’ll discuss what GANs are and how they work, at a higher , more abstract level. Since 2014, many variations of the traditional GAN have co