Regression analysis is one of the most widely used techniques for analyzing data. In this blog, I will explain how regression analysis works using some practical examples and a real-life business case.

### The least squares

How does a regression analysis work? To understand this, you need to understand the concept of *least squares*. *Least squares* is a technique that minimizes the squared distances between a curve and its data points, as you can see in the example below.

Jake recorded his pay on a piece of paper when he was 20 years old – something he repeated every 5 years. This is what Jake's pay graph looks like 20 years later:

In this simple scatterplot, you can see that Jake earned $2,500 when he was 20 years old, and now, at the age of 40, he earns $4,700.

When 40-year-old Jake wanted to predict how much he would earn by the time he would be 45 years old, the easiest way would have been to draw a line that crossed the first and last point in his graph, like this:

This line seems to fit the data, and would enable Jake to make a rough estimate of how much he will earn when he is 45 years old.

The least squares technique calculates the squared distance between the line and each of the data points. For example: Jake’s estimation line and the data points at age 25, 30 and 35 differ slightly, with a difference of $230, $120 and $380 respectively (the blue arrows). To find the *least* squares, you first need to calculate the *sum of squares* of this line:

230^{2} + 120^{2} + 380^{2} = 211,700
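In code, the same sum of squares works out as follows (a minimal sketch; the three residuals come straight from the example above):

```python
# Residuals between Jake's first line and the data points at ages 25, 30 and 35
residuals = [230, 120, 380]

# Square each residual and add them up
sum_of_squares = sum(r ** 2 for r in residuals)
print(sum_of_squares)  # 211700
```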

The next objective is to find the *least* squares. By fitting the line closer to the five data points, the *sum of squares* will be lower and the regression line will have a better fit. In fact, the best fit would be a *sum of squares* of 192,000. By using software, we can make this estimation and produce the line that fits the data best. The line looks like this:

In this model, the sum of squared distances between the individual data points and the line is at its minimum. In other words: this line has the *least squares*.
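A least-squares line like this can be fitted in a few lines of Python. Note that the article only gives Jake's pay at ages 20 and 40, so the three intermediate pay values below are hypothetical, chosen purely for illustration:

```python
import numpy as np

ages = np.array([20, 25, 30, 35, 40])
# Only the $2,500 and $4,700 endpoints appear in the text;
# the values in between are made up for this sketch.
pay = np.array([2500, 3100, 3800, 4300, 4700])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(ages, pay, 1)
print(f"Pay = {intercept:.0f} + {slope:.0f} * Age")  # Pay = 320 + 112 * Age (for these made-up points)
```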

### Linear regression analysis

In the previous example we used the least squares technique to fit a straight line. This is the most commonly used technique in linear regression analysis.

A regression measures the relation between variables. We used a linear curve (a straight line) in Jake’s example, hence a *linear* regression.

Using this regression line, we can estimate how much we expect Jake to earn at a given age. Jake’s regression line has the following formula:

*Pay* = 320 + 112 * Age

In other words, when Jake is 20 years old, the regression formula would estimate that he will earn:

320 + 112 * 20 = $2,560

That is pretty close to his actual earnings of $2,500! At age 45, Jake can expect to earn roughly $5,360.
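The regression formula doubles as a tiny prediction function:

```python
def predict_pay(age):
    """Jake's regression line from the article: Pay = 320 + 112 * Age."""
    return 320 + 112 * age

print(predict_pay(20))  # 2560 (close to the actual $2,500)
print(predict_pay(45))  # 5360
```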

*Side note*: regression curves are not always linear. You can also apply exponential lines, logarithmic lines or other types of lines to fit your data. You can even do this quite easily in Excel! Check the following video for a short explainer.

### Stepwise regression analysis

In our previous regression analysis, we only used the ‘age’ variable to explain an increase in pay. Stepwise regression is a technique to build a regression model with multiple variables by adding them one by one.

When a new variable is added, you would expect the explanatory power of the model to increase. If this does not happen, the variable does not add more explanatory power and it can therefore be omitted.

There are different techniques to apply stepwise regression, but we will focus on the simplest form: simple stepwise regression.
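The idea behind forward stepwise selection can be sketched in Python. This is only an illustration on synthetic data, not the procedure a package like SPSS or R runs internally (those use significance tests rather than the crude R² threshold below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two predictors that matter, one pure-noise column
n = 500
X = rng.normal(size=(n, 3))
y = -0.5 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(scale=0.5, size=n)

def r_squared(columns):
    """R^2 of an OLS fit (with intercept) on the given predictor columns."""
    A = np.column_stack([np.ones(n), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Forward stepwise: repeatedly add the variable that improves R^2 the most,
# and stop as soon as the best remaining variable barely helps.
selected, remaining, current_r2 = [], [0, 1, 2], 0.0
while remaining:
    best_r2, best_j = max((r_squared(selected + [j]), j) for j in remaining)
    if best_r2 - current_r2 < 0.02:  # crude stand-in for a significance test
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_r2 = best_r2

print(selected)  # the noise column (index 2) should not make it in
```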

A few years ago I conducted research for a major law firm in the Netherlands to find out what drove internal innovation efforts. I obtained data on people’s innovative behavior, gender, age and engagement, as well as the scores they gave themselves for their career self-management. Career self-management measures how actively employees manage their own careers: typical behaviors include promoting one’s visibility within the company and networking with people outside of it. These behaviors are very beneficial for advancing your career.

In the next model, I will add these variables one by one (stepwise).

| Model | Variable | B | Significance |
|-------|----------|------|--------------|
| 1 | Gender | -0.72 | 0.00 |
| 2 | Gender | -0.48 | 0.03 |
|   | Career Self-Management | 0.67 | 0.00 |
| 3 | Gender | -0.49 | 0.03 |
|   | Career Self-Management | 0.66 | 0.00 |
|   | Engagement | 0.08 | 0.43 |

The left column shows the three different models. The *Gender*, *Career Self-Management* and *Engagement* variables are added to the model step by step. The *B* column shows the unstandardized beta coefficient (the larger its absolute value, the stronger the effect) and the *Significance* column shows the significance level (a value smaller than 0.05 is generally considered significant).
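To make the *B* and *Significance* columns concrete, here is a minimal sketch that computes both by hand on synthetic data (the real law-firm data is not public, so the numbers will not match the table above):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: binary gender, a CSM score, and an unrelated engagement score
n = 300
gender = rng.integers(0, 2, size=n).astype(float)
csm = rng.normal(size=n)
engagement = rng.normal(size=n)
innovation = -0.5 * gender + 0.7 * csm + rng.normal(scale=0.8, size=n)

X = np.column_stack([np.ones(n), gender, csm, engagement])
names = ["Intercept", "Gender", "CSM", "Engagement"]

# Unstandardized coefficients (the B column) via ordinary least squares
coef, *_ = np.linalg.lstsq(X, innovation, rcond=None)

# Standard errors from the residual variance and (X'X)^-1
resid = innovation - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# Two-sided p-values (the Significance column), via a normal approximation
pvals = [2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2)))) for t in coef / se]

for name, b, p in zip(names, coef, pvals):
    print(f"{name:<11} B = {b:6.2f}  Significance = {p:.2f}")
```

Here the "gender" and "CSM" columns come out significant while "engagement" does not, mirroring the pattern in the table.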

As you can see in model 1, gender is a highly significant predictor of innovation – the significance level is 0.00, which means that gender is a valid predictor of innovative behavior.

Career Self-Management (CSM) is added to model 2. CSM is an even stronger predictor of innovative behavior than gender. Note that when CSM is added, the effect of gender is slightly reduced because CSM explains some of the variance in innovative behavior that gender explained when CSM was not added to the model.

However, when Engagement is added in model 3, it does not have much explanatory value and it is also not significant. This means that higher engagement levels are not associated with more innovative behavior in these employees.

By doing this analysis, the firm learned that in order to become more innovative, it has to hire people who actively manage their own careers. These people are willing to promote the projects they are working on and they are active networkers, which is very helpful in establishing new and innovative ideas. The firm also learned that spending money on improving engagement is *not* an effective measure to become more innovative.

Of course there are a couple more criteria to consider when evaluating a regression model with multiple variables. Suffice it to say that by looking at this table we can see that engagement does not currently help us to explain innovative behavior.

**Fun fact no. 1:** I expected that age would influence innovative behavior and I therefore added both age and gender to model 1. However, age was automatically removed from the model, because it was not significant in the slightest!

**Fun fact no. 2:** Gender explained a lot of the variance in innovative behavior, with men reporting more innovative behavior than women. A similar effect was found by Millward and Freeman (2002). In their study, women reported risk of criticism, risk of not receiving credit for a specific idea and risk of failure as barriers for innovation – these were not reported by men.

In addition, this specific firm showed characteristics of an old boys’ network: most of the law firm’s partners were male. These partners had much more authority and were therefore free to pursue innovative endeavors, while the younger population (the majority of which was female) was less able to do so.

Stepwise regression is already very hard to do in Excel. A tool such as R or SPSS is much more practical for this technique.

### Logistic regression analysis

Another form of regression is logistic regression. We will go into detail about it in a later post. To be continued…

I hope you liked this brief overview of regression analysis. Of course there is far more to it than what I covered in this article, but I am convinced that understanding the basics of a technique will help you understand the power and potential of data-driven people analytics.

Now that you know the basics of regression analysis, you should check out our article on HR metrics: it might give you some new ideas about how to relate and analyze different metrics.

##### Reference

Millward, L. J., & Freeman, H. (2002). Role expectations as constraints to innovation: The case of female managers. *Creativity Research Journal, 14*(1), 93-109.