As the dust around the GDPR is starting to settle, I am now sufficiently comfortable to write a piece about privacy again. Before you stop reading, worry not; I won’t take a legal perspective in this article. I will adopt a common-sense approach instead.
Working with HR Analytics and, more specifically, compensation and benefit analytics, involves the use of employee and salary data. But is the raw use of compensation & benefit data considered ‘safe’ in terms of employee privacy?
Often, to conduct an analytics project, an existing source file is used. It can be an existing dataset or a standard extraction from the salary system. Excluding unnecessary data from this dataset, such as the social security numbers or the bank account numbers of the employees, is easily done. However, things become more complicated when dates of birth come into play.
This data is often very relevant when making all kinds of age analyses and is therefore almost never excluded. But this type of data also carries a privacy risk, especially when combined with other data.
Put simply, there are around 36,500 unique combinations possible with a date of birth (365 days x 100 years). If you add another set of standard data, such as sex (male / female), there will be 73,000 unique combinations. The more unique combinations there are, the faster you can ‘find’ someone and isolate them in the data, especially if the population is small.
With an ‘anonymous’ sample of 8,000 people, there is already a real privacy danger
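To put a rough number on this, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that people are spread evenly and independently over all combinations. Real birth dates are not distributed that way, so treat the outcome as an approximation and not as a measurement. The function name is made up for this example.

```python
def expected_unique_share(population: int, combinations: int) -> float:
    """Expected share of people who are alone in their combination.

    Assumption (a simplification): every combination is equally likely
    and people are independent of each other.
    """
    if population < 1 or combinations < 1:
        raise ValueError("population and combinations must be positive")
    # A person is unique when none of the other (population - 1) people
    # land on the same combination.
    return (1 - 1 / combinations) ** (population - 1)


dob_combinations = 365 * 100                  # 36,500 dates of birth
dob_sex_combinations = dob_combinations * 2   # 73,000 once sex is added

share = expected_unique_share(8_000, dob_sex_combinations)
print(f"{share:.0%} of 8,000 people are expected to be unique")
```

Under these simplifying assumptions, roughly nine out of ten people in a group of 8,000 would be the only one with their combination of date of birth and sex.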
A good example of such a case can be found in this study. Here, data is combined and released on different populations to see how unique they become. The researchers used three data types:
- Birth date
- Gender data
- ZIP code
With this information, 87.1% of people were unique (and thus identifiable). When ZIP code was replaced by ‘city’, 58.4% of people were unique.
This shows that it’s quite easy to find unique people at the ZIP code level. Fun fact: there are about 8,000 people per ZIP code in the US. Just consider what is happening within a medium-sized company, where people usually live within a 30-mile range.
This means that if you use these variables in a company of a few thousand people, it is relatively easy to recognize an individual in the data.
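If you want to check this on your own extraction, a minimal sketch could look like the following. The column names (`birth_date`, `sex`, `zip`) and the four employees are invented for illustration; replace them with whatever your dataset uses.

```python
from collections import Counter


def unique_share(records, keys):
    """Share of records whose combination of `keys` occurs exactly once."""
    if not records:
        return 0.0
    combos = [tuple(record[key] for key in keys) for record in records]
    counts = Counter(combos)
    return sum(1 for combo in combos if counts[combo] == 1) / len(combos)


# Made-up employees, for illustration only.
employees = [
    {"birth_date": "1985-03-14", "sex": "F", "zip": "10001"},
    {"birth_date": "1985-03-14", "sex": "F", "zip": "10001"},
    {"birth_date": "1985-03-20", "sex": "F", "zip": "10001"},
    {"birth_date": "1990-07-02", "sex": "M", "zip": "10002"},
]

print(unique_share(employees, ["birth_date", "sex", "zip"]))  # 0.5
print(unique_share(employees, ["sex", "zip"]))                # 0.25
```

In this toy example, half of the records can be singled out with date of birth, sex and ZIP code, against a quarter when the date of birth is left out.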
This is not desirable from a privacy perspective. So, are dates of birth really necessary for all analyses?
You can already do a lot of HR Analytics and Reward Analytics without the entire date of birth, but with only the year and month of birth, or even just the year. The year and month of birth are sufficient, for example, to calculate the outflow of retirees per month. In fact, just the year of birth variable is enough to determine the age of a person (give or take a few months).
The aforementioned study shows that using only the year and month of birth already lowers uniqueness to 37.1% at ZIP code level.
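Coarsening a date of birth before it enters the analysis file takes only a few lines. The sketch below uses Python's standard `datetime` module; the helper names are hypothetical and the date is made up.

```python
from datetime import date


def to_birth_month(dob: date) -> str:
    """Keep only year and month, e.g. '1985-03'."""
    return f"{dob.year:04d}-{dob.month:02d}"


def to_birth_year(dob: date) -> int:
    """Keep only the year of birth."""
    return dob.year


def approximate_age(birth_year: int, on: date) -> int:
    """Age reached during the year of `on`.

    Before the person's birthday in that year, this is one higher
    than their exact age.
    """
    return on.year - birth_year


dob = date(1985, 3, 14)  # made-up date of birth
print(to_birth_month(dob))                                    # 1985-03
print(approximate_age(to_birth_year(dob), date(2024, 1, 1)))  # 39
```

The idea is to apply this step once, at extraction time, so that the full date of birth never ends up in the dataset that analysts work with.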
If you choose to only use year of birth, this number gets reduced to 0.04%!
Note that within a company, these numbers will be somewhat different, as most people will be between 20 and 70 (working age) and employee ZIP codes are clustered around the company. But even then, you will still reduce the privacy risk considerably by using just year and month of birth.
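To get a feel for the effect inside a company, you could run a small simulation. Everything in the sketch below is an assumption made for illustration: a synthetic workforce of 3,000 people, birth dates spread evenly over 50 years, and 20 fictitious ZIP codes. It does not reproduce the figures from the study.

```python
import random
from collections import Counter
from datetime import date, timedelta


def share_of_singletons(keys):
    """Share of entries in `keys` whose value occurs exactly once."""
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


rng = random.Random(42)  # fixed seed so the run is repeatable
first_day = date(1955, 1, 1)
span_days = (date(2005, 1, 1) - first_day).days
zip_codes = [f"Z{i:02d}" for i in range(20)]  # fictitious ZIP codes

# Synthetic workforce: (date of birth, sex, ZIP code) per person.
workforce = [
    (
        first_day + timedelta(days=rng.randrange(span_days)),
        rng.choice("MF"),
        rng.choice(zip_codes),
    )
    for _ in range(3000)
]

results = {
    "full_date": share_of_singletons(
        [(dob, sex, zc) for dob, sex, zc in workforce]),
    "year_month": share_of_singletons(
        [(dob.year, dob.month, sex, zc) for dob, sex, zc in workforce]),
    "year": share_of_singletons(
        [(dob.year, sex, zc) for dob, sex, zc in workforce]),
}

for level, share in results.items():
    print(f"{level:<10} {share:.1%} unique")
```

Whatever the exact percentages, the ordering is guaranteed: anyone who is unique on a coarser key is also unique on a finer one, so coarsening the date of birth can only lower the share of identifiable people.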
To conclude, when you use dates of birth in your data, you put people’s privacy in danger. However, through a clever use of data and by looking at what you really need, you can easily use data that is less privacy-sensitive. This can be done while retaining highly granular data to do fantastic analyses!