Skip to main content

Understanding Projection Pursuit Regression

The following article gives an overview of the paper "Projection Pursuit Regression” published by Friedman J. H and Stuetzle W. You will need basic background of Machine Learning and Regression before understanding this article. The algorithms and images are taken from the paper. (http://www.stat.washington.edu/courses/stat527/s13/readings/FriedmanStuetzle_JASA_1981.pdf

What is Regression?
Regression is a machine learning technology used to predict a response variable given multiple predictor variables or features. The main distinction is that the response to be predicted is any real value and not just any class or cluster name. Hence though similar to Classification in terms of making a prediction, it is largely different given what it’s predicting. 

A simple to understand real world problem of regression would be predicting the sale price of a particular house based on it’s square footage, given that we have data of similar houses sold in that area in the past. The regression solution will be found by coming up with some equation of the form:
Sale_price = square_footage * price_per_sq_ft + some_base_price. 

It is the task of the algorithm to find the values of the “price_per_sq_ft” and “some_base_price” that best fit the given data. Such a form of regression is termed as Linear Regression, simply because the response variable is “assumed” to be a linear combination of the multiple input predictor variables.

Projection Pursuit Regression

Although linear regression is simple to understand and implement, it has it’s own set of limitations. The limitation lies in the assumption made by linear regression. It assumes the function of the regression surface to be a direct linear combination of the predictor variables. While it greatly reduces the task to just finding the correct coefficients (or weights or parameters), it bears the cost of inaccurate predictions. 

Projection Pursuit Regression (PPR) is a non parametric regression algorithm used exactly to overcome this limitation. PPR starts with a more general assumption for the regression surface to be found. It then refines the assumption in successive iterations and simultaneously also finds the parameters that fit best. PPR is more suitable for getting accurate predictions in real world prediction problems where the nature of regression surface, though unknown, is at least definitely not linear. 



Algorithm 

  • The regression surface is generally assumed as a linear combination (sum) of empirically determined functions (denoted by S) of the predictor variables (X), as shown below.
  • Working on this assumption, the algorithm works in the following 3 steps: 



  • In the first step, we initialise “residual” to given response variable (y), and the term counter M to 0.
  • In the second step, consider the equation for figure of merit, I(alpha). The numerator in second term is a measure of how close the prediction (S(alpha . x)) is to the actual real value denoted by the residual (r). The smaller the difference, higher will be the value of I(alpha). Using certain numerical optimisation technique, the value of alpha and S is found such that it maximises the value of I(alpha)
  • In the last step, first the value of figure of merit is checked. If less than a certain given threshold, the last term is removed from the model and the algorithm terminated. Else, the residuals are updated by subtracting the prediction made by the latest term. Conceptually,  “residual” thus means the residue from the original response y, which is still not predicted by the terms already found by the algorithm. 


Univariate Smoothing

An important part of the algorithm is determining the function S which is commonly known as the smooth function. Smoothing technique in most basic terms finds a function which fits a set of data points. It does so by finding more potential points which may fit the function. These new points are found by taking an average of already existing neighbour points.

S(zi) =  AVGi-k <= j <= i+k (yj), k = bandwidth

All traditional smoothing algorithms consider a uniform bandwidth throughout. However, in this paper, they have implemented a variable bandwidth smoothing. It uses an average bandwidth value provided by the user. Based on response variability estimates, the  actual bandwidth used is larger or smaller than the given value. The response variance provides an estimate as to how “close” the response variables are. If there is a tight cluster of response variables, the bandwidth used for smoothing there would be larger and vice versa. 

Examples

The paper exhibits the working of the algorithm with a couple of example datasets.

Example 1:  Y = X1X2 + e

In this example, dataset with the above model is used and PPR as the result gives the equivalent equation as follows:
Y = 0.25(X1+X2)2 - 0.25(X1-X2)2 
The following figures give a graphical interpretation of the steps of the PPR algorithm.




Figure 1a: Initial plot of Y v/s X2 shows that the smooth (denoted by *) hardly fits the data points (denoted by +). Hence Y is certainly not just directly proportional to X1 or X2 alone. 

Figure 1b: This plot shows Y v/s S1(alpha1. X). It is clearly seen that this smooth nicely fits the data points and is parabolic in nature. The first term of the equation is thus obtained as 0.25(X1+X2)2





Figure 1c:  This plot shows [Y - S1(alpha1. X)] (i.e. residual) v/s S2(alpha2. X). Again it’s a good fit and is an inverted parabola. Thus the second term of the resultant equation is obtained as -0.25(X1-X2)2


Figure 1d: This plot shows residual from previous round v/s S3(alpha3. X). Since it’s not a good fit, this term is discarded. Mathematically too, in this iteration the figure of merit I(alpha) is less than the pre-decided threshold and the algorithm is terminated.


Example 2:  Air pollution Data
The relation between the amount of suspended particulate matter (Y) and predictor variables mean wind speed (X1), average temperature (X2), insolation (X3), and wind direction at 4:00am (X4) and 4:00pm (X5)
Following were the plots obtained for the 3 iterations of the PPR algorithm. All three plots show the smooth to be a reasonable fit, and the regression surface is the combination of all the smooth function terms.




Variations/Options applicable to PPR
  • Backfitting: Readjustment of the smooths along previously determined linear combinations when a new linear combination is found. This is a slightly more refined addition to the PPR algorithm. 
  • Projection Selection: Restrict the search for solution directions to the set of predictors. Thus instead of having all the predictor variables in each term, restrict it to the set of variables having highest contribution in that direction. For instance, in the second example above, perform smoothing only for X1 , X4 and X5 respectively in the three iterations. One can also use a combination of Projection selection and Projection pursuit.

Conclusion
Following are some of the advantages of projection pursuit against other techniques:
  • Sparsity limitation of other non parametric methods like kernel and nearest neighbours is not encountered in PPR since estimation (smoothing) is univariate
  • PPR although uses successive refinement, it doesn’t partition the data
  • Interactions among different predictors is directly considered
  • In terms of ascending generality:
             Linear regression < Projection Selection < Projection Pursuit Regression
  • As seen in examples, PPR can represent each iteration graphically, thus facilitating interpretation.
  • Such output can be used to adjust main parameters of the procedure: average smoother bandwidth and termination threshold.
  • For sample size n, dimensionality p, and number of iterations M, computation of the model grows as M*p*n*log(n)
Projection Pursuit Regression is certainly more general in it’s assumptions and hence tends to give better approximations of the regression surface.  It can surely be used as an approach for the real world datasets where there is a possibility that linear regression will fail because of the restricting assumptions.

References:
Friedman J. H and Stuetzle W. (1981), “Projection Pursuit Regression”, Journal of the American Statistical Association


Mansi Shah
Research Fellow
CereLabs Pvt. Ltd.

Comments

Popular posts from this blog

Understanding Generative Adversarial Networks - Part II

In "Understanding Generative Adversarial Networks - Part I" you gained a conceptual understanding of how GAN works. In this post let us get a mathematical understanding of GANs.
The loss functions can be designed most easily using the idea of zero-sum games. 
The sum of the costs of all players is 0. This is the Minimax algorithm for GANs
Let’s break it down.
Some terminology: V(D, G) : The value function for a minimax game E(X) : Expectation of a random variable X, also equal to its average value D(x) : The discriminator output for an input x from real data, represents probability G(z): The generator's output when its given z from the noise distribution D(G(z)): Combining the above, this represents the output of the discriminator when 
given a generated image G(z) as input
Now, as explained above, the discriminator is the maximizer and hence it tries to 
maximize

Understanding Generative Adverserial Networks - Part 1

This is a two part series on understanding Generative Adversarial Networks (GANs). This part deals with the conceptual understanding of GANs. In the second part we will try to understand the mathematics behind GANs.

Generative networks have been in use for quite a while now. And so have discriminative networks. But only in 2014 did someone get the brilliant idea of using them together. These are the generative adversarial networks. This kind of deep learning model was invented by Ian Goodfellow. When we work with data already labelled, it’s called supervised learning. It’s much easier compared to unsupervised learning, which has no predefined labels, making the task more vague. 

"Generative Adversarial Networks is the most interesting idea in the last ten years in Machine Learning." - Yann LeCun

In this post, we’ll discuss what GANs are and how they work, at a higher , more abstract level. Since 2014, many variations of the traditional GAN have come out, but the underlying conc…