Do you know all the terms that your data scientist uses? A lot of people were interested in learning about the technical terms involved in statistics and machine learning in part 1. What do words like multivariate analysis, random forests and algorithm boosting actually mean? In this post we will tell – and show – you all about it!
1. Multivariate analysis
A multivariate analysis is the counterpart of a univariate analysis. A univariate analysis has only one so-called Y variable, also known as the dependent or outcome variable. For example, when you want to predict how age and engagement levels influence someone’s performance ratings, there is only one dependent variable: the performance rating. However, when you want to predict someone’s performance ratings and pay, there are two dependent variables, which makes it a multivariate analysis.
2. Dependent vs. independent
When you want to predict how engagement leads to performance, you expect performance to depend on someone’s engagement level. Engagement is therefore the independent variable and performance is the dependent variable. You expect the dependent variable to depend on the scores of the independent variable so that when you manipulate the independent variable (engagement increases/decreases) you also anticipate that the dependent variable will change (performance increases/decreases).
3. Boosting
When you create an algorithm, you want it to be as predictive and accurate as possible. Boosting is an iterative statistical technique that creates multiple extra training data sets and builds a model on each of them. These data sets are created deliberately (i.e. non-randomly): the weights of the data points that the previous model misclassified are increased, so the next model fits these misclassifications better. This process repeats itself numerous times. Together these models decide on the most likely outcome through a weighted vote, in which more accurate models have more voting power than less accurate models.
Boosting combines multiple models and is therefore often referred to as a meta-algorithm. The best-known boosting classifier is AdaBoost (which is used in this blog by Lyndon). The combined model is complex and therefore difficult to interpret; however, the meta-algorithm performs very well.
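To make the reweighting idea concrete, here is a minimal sketch of AdaBoost in plain Python. It is an illustration under our own assumptions, not the code used in the blog mentioned above: the toy data, the function names and the choice of a one-threshold "decision stump" as the base model are all hypothetical. Instead of physically creating new data sets, the sketch keeps one weight per data point, which has the same effect.

```python
import math

def stump_predict(threshold, polarity, x):
    """A decision stump: predicts `polarity` at or above the threshold, the opposite below."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; each round focuses more on earlier mistakes."""
    weights = [1.0 / len(xs)] * len(xs)
    models = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # voting power: higher for more accurate stumps
        models.append((alpha, threshold, polarity))
        # Misclassified points get a larger weight, correctly classified points a smaller one.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models

def predict(models, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in models)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the +1 and -1 labels.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

models = adaboost(xs, ys, rounds=3)
print([predict(models, x) for x in xs])
```

On this toy data a single stump always gets at least three points wrong, while the weighted vote of three stumps classifies every training point correctly.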
4. Bagging
Bagging is another meta-algorithm and stands for bootstrap aggregating. Bagging is a technique in which several training sets are independently sampled, with replacement, from the original data set. A model is built on each of these extra data sets, just as with boosting. The prediction is eventually made by an unweighted majority vote of the different models.
Bagging helps to reduce the effect of outliers on the algorithm, and thus the algorithm’s variance as well. This technique is mostly used for decision tree algorithms, because a single outlier can produce an entirely different decision tree. The impact of outliers is therefore much greater for decision trees than for most other algorithms.
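The two steps of bagging, bootstrap sampling and unweighted voting, can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the engagement scores, the labels, and the very simple "closest average" learner that stands in for a real decision tree.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_centroid_model(sample):
    """Toy learner (hypothetical stand-in for a decision tree): store the mean score per label."""
    scores_by_label = {}
    for score, label in sample:
        scores_by_label.setdefault(label, []).append(score)
    return {label: sum(s) / len(s) for label, s in scores_by_label.items()}

def model_predict(model, score):
    """Predict the label whose mean score is closest to the given score."""
    return min(model, key=lambda label: abs(model[label] - score))

def bagging_predict(models, score):
    """Unweighted majority vote over all models."""
    votes = Counter(model_predict(m, score) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: (engagement score, high performer?)
rows = [(1, "no"), (2, "no"), (3, "no"), (4, "no"),
        (6, "yes"), (7, "yes"), (8, "yes"), (9, "yes")]

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [bootstrap_sample(rows, rng) for _ in range(25)]
models = [train_centroid_model(s) for s in samples]

print(bagging_predict(models, 2))
print(bagging_predict(models, 8))
```

Using an odd number of models (25 here) avoids tied votes when there are two possible outcomes.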
5. C4.5
C4.5 is a well-known and very accurate decision tree algorithm from the data mining field. With every new branch, C4.5 scores each attribute on information gain or, by default, on gain ratio (information gain corrected for the number and size of the branches a split creates), and then selects the best-scoring attribute to split its branch on.
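The two split criteria can be computed by hand. The sketch below is not the C4.5 implementation itself, only the textbook formulas for entropy, information gain and gain ratio, applied to a small golf data set that we made up for this example (the rows and attribute names are hypothetical).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """Entropy of the outcome minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def gain_ratio(rows, attribute, target="play"):
    """Information gain divided by the entropy of the split itself (the split information)."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute has a single value, so it cannot split anything
    return information_gain(rows, attribute, target) / split_info

# Hypothetical data set: does the neighbor play golf?
rows = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True,  "play": "yes"},
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True,  "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True,  "play": "no"},
    {"outlook": "rainy", "windy": True,  "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "no"},
]

for attribute in ("outlook", "windy"):
    print(attribute, round(gain_ratio(rows, attribute), 3))

best = max(("outlook", "windy"), key=lambda a: gain_ratio(rows, a))
print("split on:", best)
```

In this made-up data the outlook says something about the outcome while wind says nothing, so the outlook is the attribute a tree would split on first.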
The tree below shows two weather variables and how they influence the chance that your neighbor will play golf on a random day (the outcome is visualized as yes or no). It shows that C4.5 produces output that is very easy to understand and visualize. The tree shows that when the weather outlook is sunny, your neighbor is much more likely to play golf than when the outlook is rainy. For a sunny outlook, the model correctly predicts that your neighbor will play golf (a yes outcome) five out of six times (the 5.0/1.0 note in the decision outcome).
6. Pruning
You don’t fully understand this decision tree yet? That’s very possible! This section should make it clearer. Pruning is a technique that is used to reduce the complexity of a decision tree. A decision tree is built by taking the most explanatory attribute to split its branches on, and this process continues until the tree is completed. However, such a tree can be big and complex. Pruning is the process of applying a statistical test to all branches of the tree. When the statistical confidence in a branch is too low, that branch is removed (hence: pruning). A simpler decision tree is less prone to overfitting. Overfitting is what happens when the tree becomes so detailed that it (almost) perfectly fits the specific data set it was built on. When this is the case, the algorithm’s accuracy drops when it is applied to new data.
This is the same decision tree as the one we showed above, except that this one is unpruned. As you can see, this tree splits the sunny outlook into three different temperature categories that become very specific. They fit the data very well but run the risk of overfitting. This shows when you compare the accuracy on the data the trees were built on: the pruned model has a 92% accuracy while the unpruned model fits the data perfectly.
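The sketch below illustrates the idea of pruning with a deliberately simpler technique than the statistical test described above: reduced-error pruning, which replaces a branch by a single outcome whenever that does not cause more mistakes on a separate validation set. The tree, the validation rows and all names are hypothetical and only loosely modeled on the golf example.

```python
def classify(tree, row):
    """Follow the branches until a leaf is reached."""
    while "branches" in tree:
        value = row[tree["attribute"]]
        # Unknown attribute values fall back to the majority outcome of the node.
        tree = tree["branches"].get(value, {"label": tree["majority"]})
    return tree["label"]

def count_errors(tree, rows):
    """Number of rows whose predicted outcome differs from the real one."""
    return sum(classify(tree, row) != row["play"] for row in rows)

def prune(tree, rows):
    """Bottom-up reduced-error pruning against the validation rows that reach each node."""
    if "branches" not in tree:
        return tree
    branches = {
        value: prune(subtree, [r for r in rows if r[tree["attribute"]] == value])
        for value, subtree in tree["branches"].items()
    }
    kept = {"attribute": tree["attribute"], "majority": tree["majority"], "branches": branches}
    leaf = {"label": tree["majority"]}
    # Keep the simpler leaf unless the full branch makes fewer mistakes.
    return leaf if count_errors(leaf, rows) <= count_errors(kept, rows) else kept

# Hypothetical unpruned tree: the sunny branch is split further on temperature.
unpruned = {
    "attribute": "outlook", "majority": "yes",
    "branches": {
        "sunny": {
            "attribute": "temperature", "majority": "yes",
            "branches": {"hot": {"label": "yes"}, "mild": {"label": "yes"}, "cool": {"label": "no"}},
        },
        "rainy": {"label": "no"},
    },
}

# Hypothetical validation data that was not used to build the tree.
validation = [
    {"outlook": "sunny", "temperature": "hot",  "play": "yes"},
    {"outlook": "sunny", "temperature": "mild", "play": "yes"},
    {"outlook": "sunny", "temperature": "cool", "play": "yes"},
    {"outlook": "sunny", "temperature": "cool", "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "play": "no"},
    {"outlook": "rainy", "temperature": "cool", "play": "no"},
    {"outlook": "rainy", "temperature": "hot",  "play": "yes"},
]

pruned = prune(unpruned, validation)
print(count_errors(unpruned, validation), count_errors(pruned, validation))
```

In this example the temperature split under the sunny outlook is removed, because on the validation rows it only adds mistakes.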
7. Random Forest
Unlike the boosting and bagging techniques, which vary the training data, the random forest technique randomizes the algorithm itself. Normally, a decision tree algorithm selects the best attribute to split its branches on. In a random forest, this selection is randomized: at each split, the best attribute is chosen from a random subset of the attributes. This produces many different decision trees (hence: a forest). Together, these randomized trees produce a better result than a single tree.
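The randomized split selection can be sketched as follows. This is a simplified illustration, not a full random forest: it only shows how one split is chosen from a random subset of attributes, using information gain as the score. The data set and the function names are assumptions made for this example.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the outcome minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def choose_split(rows, attributes, target, rng, subset_size):
    """Pick the best attribute, but only among a random subset of the attributes."""
    candidates = rng.sample(attributes, subset_size)
    return max(candidates, key=lambda a: information_gain(rows, a, target))

# Hypothetical data set: does the neighbor play golf?
rows = [
    {"outlook": "sunny", "windy": False, "humidity": "high",   "play": "yes"},
    {"outlook": "sunny", "windy": True,  "humidity": "normal", "play": "yes"},
    {"outlook": "sunny", "windy": False, "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "windy": True,  "humidity": "high",   "play": "no"},
    {"outlook": "rainy", "windy": False, "humidity": "high",   "play": "no"},
    {"outlook": "rainy", "windy": True,  "humidity": "normal", "play": "no"},
]
attributes = ["outlook", "windy", "humidity"]

rng = random.Random(7)  # fixed seed so the run is repeatable
# Each tree in the forest would make its own random draws like these.
print([choose_split(rows, attributes, "play", rng, subset_size=2) for _ in range(5)])
```

A regular decision tree would pick the outlook every time; the random draws are what make the trees in the forest differ from each other.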
8. Linear regression
A linear regression analysis is a statistical method to estimate the relationship between a dependent variable and one or more independent variables. The regression analysis uses the least squares method to estimate the best-fitting line through the data. This line can be used to predict outcomes for new values of the independent variables. To read some examples and a business case in which linear regression is used, check this blog.
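For a single independent variable, the least squares line can be calculated directly. The sketch below uses the standard textbook formulas for the slope and the intercept; the engagement and performance numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares estimate of slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: engagement score (independent) and performance rating (dependent).
engagement = [1, 2, 3, 4, 5]
performance = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(engagement, performance)
print(round(slope, 2), round(intercept, 2))

# Use the fitted line to predict the rating for a new engagement score.
print(round(intercept + slope * 6, 2))
```

The slope tells you how much the performance rating is expected to change when engagement goes up by one point.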
9. Data Cleaning
Data cleaning is a well-known subject in HR analytics. What does it mean? Data cleaning is the process of going through data, fixing inconsistencies and gathering missing data to prepare it for analysis. HR data is oftentimes regarded as ‘dirty’. Dirty data can take various forms: some parts of the data can be missing, the same jobs can have different labels so you cannot easily identify them, there might be multiple non-corresponding records for the same person in multiple systems, and so on. Dirty data is a recurring phenomenon in multinational companies especially. These companies often use different systems in different countries to record the same data. As soon as there is the slightest difference in data collection procedures, the data will be inconsistent.
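A small sketch of what such cleaning can look like in practice: harmonizing job labels and merging the records that different systems hold for the same person. The field names, the alias list and the records are all hypothetical, and real HR data will need many more rules than this.

```python
def normalize_title(title, aliases):
    """Lowercase, drop periods, collapse spaces, then map known aliases to one label."""
    key = " ".join(title.lower().replace(".", "").split())
    return aliases.get(key, key)

def merge_records(records, aliases):
    """Combine records per employee, keeping the first non-missing value of every field."""
    merged = {}
    for record in records:
        person = merged.setdefault(record["employee_id"], {})
        for field, value in record.items():
            if value in (None, ""):
                continue  # missing in this system, maybe another system has it
            if field == "job_title":
                value = normalize_title(value, aliases)
            person.setdefault(field, value)
    return merged

# Hypothetical alias list: different labels for the same job.
aliases = {"sr engineer": "senior engineer"}

# Hypothetical records from two systems; employee 17 appears in both.
records = [
    {"employee_id": 17, "job_title": "Sr. Engineer", "country": "NL", "salary": None},
    {"employee_id": 17, "job_title": "senior engineer", "country": "", "salary": 72000},
    {"employee_id": 42, "job_title": "HR  Manager", "country": "US", "salary": 65000},
]

clean = merge_records(records, aliases)
for employee_id, person in clean.items():
    print(employee_id, person)
```

Keeping the first non-missing value is only one possible rule. When two systems disagree on a value, you will usually want to flag the conflict and check it by hand.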
This blog is part 2 of the series 9 HR analytics terms you should know. You can read part 1 here. Part 1 includes terms like data mining, machine learning and supervised learning. If you want us to explain more terms in a subsequent blog, please let us know in the comments!