In HR, one of the challenges we face is to gain an understanding of what goes on in the hearts and minds of our people. There is typically one overlooked source of data that may provide the most direct reflection of that: text. Through text, we can gain insights as to what people are discussing, whether they are communicating effectively, and even – to a certain extent – how they are feeling through sentiment analysis.
Because of this, text mining and natural language processing can help tremendously in putting employees first and supporting them through analytics. It can cement the marriage between supporting the human and cultural side of organizations and optimizing business using data and analytics. Text data can provide insights for continuous listening, employee experience, measuring engagement and more. Analyzing text, however, is not quite as straightforward as crunching numbers in a spreadsheet.
But it can be done! In this article, I will discuss a few techniques as I analyze the articles that have been published on AIHR in 2019. In total, that gives me 50 articles to work with. What was talked about and what can we learn from that? Let’s see what we uncover!
AIHR top words of 2019 | word count
If you want to quickly get an idea of what a given text is about, the most basic analysis would be a word count. In this case, rather than counting the total number of words (as Word could do for you), the occurrences of every individual word are tallied. The result can be presented as a snazzy-looking word cloud, with word size reflecting how often a word occurs.
Using WordCloud for Python, I visualized the most frequently used words in articles of 2019 (after filtering out overly common words such as articles – the, a, an – and grouping conjugated verbs and plural/singular forms under their most frequently occurring form).
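A minimal sketch of such a word count, using only the standard library (the stop-word list and sample text below are illustrative, not the ones used for the actual analysis):

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def top_words(text, n=10):
    """Tokenize, drop stop words, and tally the remaining words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

sample = "HR data helps HR teams use data to support employees."
print(top_words(sample, 3))  # the two repeated words rank first
```

The resulting frequency mapping can then be handed to WordCloud (for instance via its `generate_from_frequencies` method) to produce the visual, with word size scaled to the tallied counts.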
Apart from looking nice, this word cloud already tells us a few things:
- Without a doubt, the top terms are ‘data’ and ‘HR’
- ‘people’, ‘employee’, ‘organization’ and ‘company’ share a similar number of occurrences, judging by their size. Could this reflect a common view? Could this simply be the nature of HR in addressing both people/employee and organization/company needs?
- ‘Need’ and ‘use’ are also featured prominently, but does this really tell us anything in particular?
We could go on like this for a while yet, but the results raise many questions. The biggest shortcoming is that we are only looking at word frequency. Words that are common may not always be important. Take ‘use’ and ‘need’ as examples: do they occur often due to a particular focus on what is needed or should be used, or do they typically occur anyway when sharing knowledge through writing?
Distinguishing AIHR content | keyword identification
To identify keywords or key content, a commonly used technique is Term Frequency Inverse Document Frequency (or TFIDF for short). This could allow us to dig a bit deeper than we could with the previous method.
So, how does TFIDF work? The idea is to not simply count the occurrences of a word in a text (Term Frequency), but multiply it by a lower weight the more common a word is across documents (Inverse Document Frequency). In the end, the highest-scoring terms are deemed the most important.
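The mechanics can be sketched in a few lines of standard-library Python. The documents below are toy examples, and the IDF weighting shown is one common variant; in practice a library implementation such as scikit-learn's `TfidfVectorizer` would handle tokenization and smoothing:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term per document: term frequency x inverse document frequency."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # TF x IDF per document; terms found in many documents get a lower weight
    return [
        {t: tf[t] * math.log(n / df[t]) for t in tf}
        for tf in (Counter(toks) for toks in tokenized)
    ]

docs = ["hr analytics uses data", "hr policy manual", "data quality report"]
scores = tfidf(docs)
# 'hr' appears in two of the three documents, so in the first document
# it scores lower than 'analytics', which is unique to that document
```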
To determine whether or not a word is common, you’ll need a collection of documents to derive that from. TFIDF is typically applied to such a collection to indicate, for each individual document, which content sets it apart from the rest of the collection. If we were to do this for the blogs in our data set, we would get an idea of what each particular blog was about. That is not what I am aiming for: I am curious to see what AIHR blogs tackle in general and how that differs from standard HR-oriented articles.
So I decided to regard AIHR blogs as one source of content and add 90 academic publications on management and HRM to the set for purposes of calculating TFIDF. (A nice open-source solution to convert PDF and many other document formats to plain text and metadata for further analysis is Apache Tika.) Using TFIDF this way, AIHR blogs give us the following word cloud:
Interesting changes have occurred from the previous word cloud to this one! (And no, it’s not the shape I decided to use.) ‘Data’ has made way for ‘HR Analytics’ as a top-scoring term. ‘Need’ and ‘use’ have fallen by the wayside. We see new terms related to quantification, digital, and tech: HRIS, KPIs, metrics, SWP, HR system, dashboards, algorithms, etc. Other new terms that I find intriguing are ‘HRBPs’ and ‘HR professionals’.
What do these words tell us? After looking up where these terms tend to occur in articles on AIHR, I interpret these results as follows: AIHR brings a decidedly ‘digital and tech’ view to HR and translates that to practical knowledge (e.g. discussing metrics, KPIs, and the odd HRIS) for people working in HR roles. This seems nicely in line with the AIHR mission: ‘skills for your HR future’.
Shortcomings of TFIDF are also illustrated: how did words such as ‘really’, ‘big’ and ‘lot’ end up being marked as key terms? Blog articles tend to use a more colloquial tone than academic publications. In addition, HR still scores high. Why is that? Because most academic publications in my data set tend to use the term ‘HRM’ rather than ‘HR’. So when using TFIDF, take care when selecting the documents to compare against each other.
Meaning of words across articles | word embeddings
Can we distill messages or themes from the blogs on AIHR? We would need to have some idea of which words belong together and form those themes. Methods to cluster documents by content, such as Latent Dirichlet Allocation (LDA) could help with that.
Another way is to use word embeddings to represent how similar words are in terms of their meaning. Potentially, this also allows for subsequently building more advanced applications from suggesting synonyms to full-fledged chatbots.
For word embedding, Google’s Word2vec is a popular option. You may be familiar with the classic example of asking a trained word2vec model to give the term most similar in meaning to ‘king’ and ‘woman’, yet most dissimilar to ‘man’, with the model answering ‘queen’.
Word2vec uses an intuitive way to estimate which words are semantically similar or related: if a word is encountered in the vicinity of specific words, it is likely to mean something similar to other terms encountered with those specific words. The ‘meaning’ of a word is then represented as a numerical vector so that distances between words can be calculated (i.e. words with similar or related meanings will end up with similar vector representations while dissimilar words end up with dissimilar vectors).
Training such a model on only 50 articles of modest length will not provide the best possible results. The chance of reliably finding synonyms and the like is slim. But I went ahead anyway to demonstrate word relations. (I used the gensim word2vec implementation in Python for this.)
Let’s see some results, shall we? Those HR Professionals and HRBPs that we saw earlier piqued my interest. Top terms returned by the model for ‘hr’ + ‘professionals’ + ‘hrbps’ are related to measurement: ‘derive’, ‘predictive’ and ‘quantify’. The list continues with words like ‘tools’ and ‘insights’ and seems primarily focused on supporting decisions with data.
Does that mean that this is all that HR Professionals or HRBPs do? Of course not! It is what is reflected by the content on AIHR and makes sense in that context. After all, AIHR deals with the theme of ‘digital’ through educating HR professionals. Is this supported by the results for the word ‘digital’? I’ll leave that for you to decide:
Several techniques can help to make sense of text data and provide insights regarding what occupies minds, what meaning specific terms have within an organization and even how people may feel about their work. Such themes are at the core of HR.
The increasing use of natural language processing techniques will allow for more effective HR and bring positive changes. Changes may be small at first: think of large multiple-choice surveys getting replaced by formats using a few open questions, reducing survey fatigue while providing richer insights. What would then be next?
I think we will increasingly use technology to support and understand rather than supplant the human and cultural side of organizations. People are central, also when it comes to interpreting analysis results. In my opinion, text mining and natural language processing are prime examples of that. Our digital future revolves around understanding and supporting humankind, making things easier without losing sight of our values.
But that’s just me venting my opinion. Don’t take my word for it. Instead, let’s ask the model one last question. What does the future of digital hold? ‘Future’ + ‘digital’ matches:
Does that match your view? Whatever you read in the words listed above, I hope that you have new ideas for how text data may fit into your work now or in the future.
Additional links for reading:
- TFIDF in a nutshell – https://monkeylearn.com/blog/what-is-tf-idf/
- LDA introduction – https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
- LDA introduction – https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
- Word2vec introduction – https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
- Word2vec paper – https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf