Employee turnover is a major challenge for companies today, especially when the labor market is competitive and certain skills are in high demand. When an employee leaves, not only is the productivity of that person lost, but the productivity of many others is impacted.
Finding replacements can take months of time and effort on the part of hiring managers and recruiting staff, who are then forced to take time away from the work they could be doing. When a replacement is finally found, it takes weeks to months for that employee to be completely onboarded and working at full capacity. By some estimates, it costs about 20% of an employee’s salary to replace a position, and costs can be much higher — over 200% — for highly skilled and educated positions like many in demand today.
Retention of valued employees makes good business sense. Traditional approaches such as employee engagement surveys and proper training for managers are important components of a good workforce planning strategy. Now that ‘big data’ has arrived, the insights it provides are becoming an increasingly valuable part of that strategy.
This blog post presents a relatively simple machine learning approach, using R, for harnessing workforce data to understand a company’s employee turnover and to predict future turnover before it happens, so that action can be taken while there is still time. The analysis has two goals:
- Create a model that accurately predicts which employees leave
- Identify key factors related to employee churn
The dataset contains fictitious data on terminations from the Employee Attrition Kaggle competition. For each of 10 years it shows which employees are active or terminated. Many details and ways to explore the data can be found in Lyndon Sundmark’s fine tutorial.
In this analysis, R libraries are intentionally introduced and loaded at the point they are needed, to make it easier for readers to understand which libraries are required for specific portions of the analysis.
1. Let’s first look at the data
# load data
emp <- read.csv("MFG10YearTerminationData_Kaggle.csv", header = TRUE)
# correct misspelling in original Kaggle dataset
emp$termreason_desc <- as.factor(gsub("Resignaton", "Resignation", emp$termreason_desc))
# basic EDA
dim(emp) # number of rows & columns in data
summary(emp) # summary stats
The dim function (dimension, above) shows that the dataset has 49,653 rows and 18 columns. The summary statistics reveal that there are about 7,000 employee IDs with records across years from 2006–15. The variables include hire and termination dates; birthdate, age, and gender; length of service; city, store, and department names; job titles; and status, status year, and termination type and reason. This list of variables is more limited than typically available to companies, but gives us enough to build and test some models.
First, let’s calculate how many employees leave each year:
# explore status/terminations by year
library(tidyr) # data tidying (e.g., spread)
library(data.table) # data table manipulations (e.g., shift)
library(dplyr) # data manipulation w dataframes (e.g., filter)
status_count <- with(emp, table(STATUS_YEAR, STATUS))
status_count <- spread(data.frame(status_count), STATUS, Freq)
status_count$previous_active <- shift(status_count$ACTIVE, 1L, type = "lag")
status_count$percent_terminated <- 100*status_count$TERMINATED / status_count$previous_active
status_count
We can see that from 2006 to 2015 this company had between 4445 and 5215 active employees, and between 105 and 253 terminations. The termination rate jumped from about 2% in 2014 to almost 5% in 2015.
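The rate above divides each year's terminations by the previous year's active head count. A minimal base-R sketch of that calculation, using made-up counts (the `counts` data frame and its numbers are illustrative assumptions, not the Kaggle figures):

```r
# Made-up yearly head counts (illustrative only, not the Kaggle figures)
counts <- data.frame(
  STATUS_YEAR = 2006:2009,
  ACTIVE      = c(1000, 1040, 1100, 1080),
  TERMINATED  = c(30, 25, 52, 44)
)

# Lag ACTIVE by one year; the first year has no predecessor, so it gets NA
counts$previous_active <- c(NA, head(counts$ACTIVE, -1))

# Terminations as a percentage of the previous year's active head count
counts$percent_terminated <- 100 * counts$TERMINATED / counts$previous_active

print(counts)
```

The first year's rate is NA because there is no earlier year to compare against, which is also why the first row of `status_count` has no termination rate.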
Let’s see the breakdown of employee terminations each year, by termination reason:
# create a dataframe of the subset of terminated employees
terms <- as.data.frame(emp %>% filter(STATUS=="TERMINATED"))
# plot terminations by reason
library(ggplot2)
ggplot() +
  geom_bar(aes(y = ..count.., x = STATUS_YEAR, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
A couple of findings jump out. First, we see that a spike in layoffs occurs in 2014–15, compared with no layoffs before 2014. The data also show that many of the terminations are retirements, especially in 2006–10. Resignations increase after 2010, which may point to some downward shift in the desirability of working for the company.
2. Modeling — Terminations
OK, let’s start modeling to see how well we can predict terminations. We begin by selecting the variables to include in the model (as ‘term_vars’).
Here we use all years before 2015 (2006–14) as the training set, with the last year (2015) as the test set. We will start with a basic CART (Classification and Regression Tree) decision tree:
# select variables to be included in model predicting terminations
term_vars <- c("age", "length_of_service", "city_name", "department_name", "job_title",
               "store_name", "gender_full", "BUSINESS_UNIT", "STATUS")
# import libraries
library(rattle) # graphical interface for data science in R
library(magrittr) # for %>% and %<>% operators
library(rpart) # decision tree model (provides rpart())
library(rpart.plot) # decision tree plot
# Partition the data into training and test sets
emp_term_train <- subset(emp, STATUS_YEAR < 2015)
emp_term_test <- subset(emp, STATUS_YEAR == 2015)
set.seed(99) # set a pre-defined value for the random seed so that results are repeatable
# Decision tree model
rpart_model <- rpart(STATUS ~ .,
                     data = emp_term_train[term_vars],
                     method = 'class',
                     parms = list(split = 'information'),
                     control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
# Plot the decision tree
rpart.plot(rpart_model, roundint = FALSE, type = 3)
Age is the most important variable, largely because many terminations are retirements of employees 65 years or older. Of those employees under 65, male employees 60 years or older left the company, many as layoffs in 2014–15.
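One way to check how well a tree like this generalizes is to score it on the held-out year and tabulate predictions against actual status. The sketch below is not the author's evaluation. It runs on simulated data (the `toy` data frame and the assumed 90% / 5% leave probabilities are hypothetical), so that it is self-contained:

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(42)
n <- 2000

# Simulated data (an assumption for illustration): employees aged 65+ mostly leave
toy <- data.frame(
  STATUS_YEAR       = sample(2012:2015, n, replace = TRUE),
  age               = sample(20:70, n, replace = TRUE),
  length_of_service = sample(0:25, n, replace = TRUE)
)
p_leave <- ifelse(toy$age >= 65, 0.9, 0.05)
toy$STATUS <- factor(ifelse(runif(n) < p_leave, "TERMINATED", "ACTIVE"),
                     levels = c("ACTIVE", "TERMINATED"))

# Time-based split: earlier years for training, the last year for testing
toy_train <- subset(toy, STATUS_YEAR < 2015)
toy_test  <- subset(toy, STATUS_YEAR == 2015)

# Classification tree using the information (entropy) splitting criterion
toy_model <- rpart(STATUS ~ age + length_of_service,
                   data = toy_train,
                   method = "class",
                   parms = list(split = "information"))

# Predict class labels for the held-out year
toy_pred <- predict(toy_model, newdata = toy_test, type = "class")

# Confusion matrix and overall accuracy
confusion <- table(actual = toy_test$STATUS, predicted = toy_pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

With the real data, the same pattern would presumably apply to `rpart_model` and `emp_term_test`. Accuracy alone can mislead when one class dominates, so the confusion matrix is worth reading row by row.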
Let’s plot the age distribution of terminated versus active employees:
# plot terminated & active by age
library(caret) # data viz, functions to streamline process for
               # predictive models, & machine learning packages including
               # gbm (generalized boost regression models)
featurePlot(x = emp[,6], # column 6 holds age
            y = emp$STATUS,
            plot = "density",
            auto.key = list(columns = 2),
            labels = c("Age (years)", ""))
This plot shows that the majority of terminations are older employees at or near retirement age. But it also shows a peak in resignations among the youngest employees, mainly those in their 20s.
Most companies are interested in identifying the employees, especially top performers, who are at risk of leaving voluntarily. So, let’s focus the analysis on that employee segment.
3. Modeling — Resignations
To predict future resignations (voluntary terminations), we need to create a ‘resigned’ variable:
# create separate variable for voluntary_terminations
emp$resigned <- ifelse(emp$termreason_desc == "Resignation", "Yes", "No")
emp$resigned <- as.factor(emp$resigned) # convert to factor (from character)
summary(emp$resigned)
We can see that there are only 385 resignations compared to 49,268 non-resignations.
This is a highly imbalanced dataset, so machine learning models will have difficulty identifying the rare class. For example, a random forest model (not shown, for brevity) run on this data had a recall of 0: it failed completely at its goal, correctly identifying none of the employees who resigned in 2015. There are a variety of options for adjusting for this imbalance, such as up-sampling the minority class, down-sampling the majority class, or using an algorithm to create synthetic data based on feature space similarities among minority samples. Here, we use the ROSE (Random Over Sampling Examples) package to create a more balanced dataset.
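To make the recall figure concrete, here is a small base-R sketch with made-up labels (the `recall` helper and the `actual` vector are hypothetical, written for illustration). It shows how a model that always predicts the majority class can look accurate while finding no resignations:

```r
# Recall = true positives / (true positives + false negatives)
recall <- function(actual, predicted, positive = "Yes") {
  true_pos  <- sum(actual == positive & predicted == positive)
  false_neg <- sum(actual == positive & predicted != positive)
  true_pos / (true_pos + false_neg)
}

# Made-up labels: 2 resignations among 10 employees
actual <- c("Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No")

# A model that always predicts the majority class
always_no <- rep("No", length(actual))

accuracy_always_no <- mean(always_no == actual)
recall_always_no   <- recall(actual, always_no)

print(accuracy_always_no)
print(recall_always_no)
```

Here the always-"No" model is right for 8 of 10 employees (accuracy 0.8) yet has a recall of 0. The imbalance in the real data is far more extreme, which makes accuracy an even weaker guide there.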
# Subset the data again into train & test sets. Here we use all years before 2015 (2006-14)
# as the training set, with the last year (2015) as the test set
emp_train <- subset(emp, STATUS_YEAR < 2015)
emp_test <- subset(emp, STATUS_YEAR == 2015)
library(ROSE) # "Random Over Sampling Examples"; generates synthetic balanced samples
emp_train_rose <- ROSE(resigned ~ ., data = emp_train, seed=125)$data
# Tables to show balanced dataset sample sizes
table(emp_train_rose$resigned)
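For comparison, the simpler up-sampling option mentioned earlier can be sketched in base R. This is not what ROSE does (ROSE generates synthetic rows rather than exact duplicates). The `toy` data frame below is made up for illustration:

```r
set.seed(125)

# Made-up imbalanced data: 95 "No" rows and 5 "Yes" rows
toy <- data.frame(
  age      = c(rep(40, 95), rep(25, 5)),
  resigned = factor(c(rep("No", 95), rep("Yes", 5)))
)

minority <- toy[toy$resigned == "Yes", ]
majority <- toy[toy$resigned == "No", ]

# Draw minority rows with replacement until both classes are the same size
extra_idx <- sample(nrow(minority), nrow(majority), replace = TRUE)
toy_up    <- rbind(majority, minority[extra_idx, ])

# Class counts after up-sampling
table(toy_up$resigned)
```

As with ROSE above, resampling of this kind belongs on the training set only, so that the test year still reflects the real class balance.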