* Originally published on Reddit Blog

The Benefits of Machine Learning to Study a Small Dataset of Social Conversations

I remember the days when I would rather dial up a person than type out an SMS message; round-robining through the letters of the T9 keypad, crammed onto eight keys from 2 to 9, was the only option.

Today we have services such as iMessage, WhatsApp, WeChat, and many more that allow millions of people to send and receive billions of messages each day. Unlike email, these systems are instantaneous and engaging; there is a sense of urgency and excitement surrounding the fast-paced nature of messaging back and forth. But words don’t always convey intentions, feelings, body language, or social and cultural signals. It becomes easier to take liberties in a textual comment than it would be to say the same thing if the other person were standing in front of us.

These types of conversations can devolve into spam or other categories of content that break a platform’s content policies. This can degrade the quality of discussion and make users hesitant to engage with the platform.

Machine learning to the rescue

Wouldn’t it be nice if there were algorithms that automatically identified common topics in social interactions? For instance, imagine a computer that could identify topics around sport, music, spam, and harassment without being explicitly told what any of those are. For content moderation, this type of clustering would help reduce the manual review time needed to action unwanted content.

As someone who studied Artificial Intelligence and Data Mining at the graduate level, I can say the landscape of AI and ML has changed significantly. Back then, understanding the mathematical principles behind machine learning (ML) algorithms was the only way forward. Today’s tools have successfully abstracted away that complexity and brought ML to the masses.

I echo the sentiment of my mentor and our former VP of Engineering, Nick Caldwell, who recently said:

“Practically anybody can learn to be an ML engineer and attempt to replicate the first five years of my career in five days with modern toolkits”

The Experiment

Allow me to walk you through an experiment I conducted. I decided to analyze social conversations and understand what lies beneath the surface of “just words”. I extracted 2,708 chat messages from 13 subreddit communities; these were messages that had been reported by users or moderators for further review.

Here’s a quick overview of the two most common ML paradigms to help understand how machines learn from raw data.

Supervised learning

In supervised learning problems, we use datasets that contain training examples with associated correct labels. For example, if we had thousands of emails labeled as spam or not spam, we could train a model to classify previously unseen emails as spam or not.
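
Here is a minimal sketch of that spam example using scikit-learn; the handful of labeled emails is made up purely for illustration.

```python
# A minimal supervised-learning sketch: classify emails as spam or not spam.
# The tiny labeled dataset below is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize, click now",
    "Meeting rescheduled to 3pm tomorrow",
    "Claim your reward, limited offer",
    "Can you review the attached report?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # bag-of-words features

model = MultinomialNB()
model.fit(X, labels)                   # learn from the labeled examples

# Classify a previously unseen email
print(model.predict(vectorizer.transform(["Free reward waiting, click here"])))
```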

Supervised learning is already used in many of our daily activities, e.g.,

  • Depositing a check into your bank account by taking a picture with your smartphone app.
  • Speech recognition in virtual assistant systems.

The big disadvantage of supervised learning is the time-consuming and expensive preprocessing task of labeling the data.

Unsupervised Learning

In unsupervised learning, we deal with data that has no associated labels. How could this be useful? Simple! Let the computer find commonalities in the dataset and base decisions on the presence or absence of those commonalities in new pieces of data.

Unsupervised learning is particularly useful for categorizing (or clustering) unlabeled data.
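
Below is a minimal sketch of what clustering unlabeled data can look like with scikit-learn; the messages and the choice of two clusters are assumptions made for illustration.

```python
# A minimal unsupervised-learning sketch: group unlabeled messages into clusters
# without ever telling the algorithm what the groups mean.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

messages = [
    "great goal in last night's match",
    "the new album drops on friday",
    "what a save by the keeper",
    "that concert was amazing",
]

X = TfidfVectorizer().fit_transform(messages)
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(clusters)  # e.g., [0 1 0 1] -- groupings found without any labels
```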

Not enough data? No problem!

How many times have you heard “we don’t have enough data, we need more” when trying to make significant determinations? I wanted to challenge this common belief. It turns out that ML can actually be very valuable even if the data at our disposal is not exhaustive.

Since the chat messages were not labeled, I wanted to rely on unsupervised learning methods. These algorithms provide the intuition one may need to explain the resulting categorization. I’m talking about a collection of methods referred to as topic modeling.

Topic modeling, as the name suggests, is a procedure for discovering topics in a collection of texts. Each topic represents a pattern of repeating word co-occurrences across the text corpus. For instance, in a good topic model, the words “president”, “minister”, and “government” should all contribute to a topic about politics because they often appear together in the same texts.

Data Preprocessing

When working with raw natural-language text, it’s important to preprocess the data first. This is a very common step in Data Mining. Cleaning the data can significantly improve the performance of ML models: it reduces noise, which is especially important when dealing with scarce datasets.

Here is a list of preprocessing steps one might find useful (a rough code sketch follows the list):

  • Tokenization – Splitting the chat messages into sentences and then the sentences into words.
  • Normalizing case – converting all text to a single case.
  • Anonymizing the data by removing personally identifiable information (PII), e.g., emails, URLs, usernames.
  • Removing stopwords, such as the, is, at, and on.
  • Word stemming or lemmatization — reducing words to their root or lemma forms.
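
As promised, here is a rough sketch of these steps, assuming NLTK is available; the PII regexes are simplified illustrations, not production-grade anonymization.

```python
# A rough preprocessing sketch for chat messages (illustrative, not exhaustive).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stopword lists

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(message):
    text = message.lower()                          # normalize case
    text = re.sub(r"\S+@\S+", " ", text)            # crude email anonymization
    text = re.sub(r"https?://\S+", " ", text)       # crude URL anonymization
    tokens = nltk.word_tokenize(text)               # tokenize into words
    # drop stopwords and non-alphabetic tokens, then stem
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("Check out https://example.com, the BEST deals are waiting!"))
```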

Here are the two models I used for comparison analysis:

Latent Dirichlet Allocation and Non-negative Matrix Factorization

Latent Dirichlet Allocation (LDA) is a probabilistic approach that automatically infers two distributions – a distribution of words that describes each topic, and a distribution of topics that describes each text. For applications in which people directly interact with the discovered topics, LDA warrants strong consideration, since the topics it generates tend to be more interpretable to humans than those of other topic modeling methods.

Non-negative Matrix Factorization (NMF) is another popular topic model that often works very well in practice. NMF factors the large document-word matrix (a matrix whose rows correspond to documents and whose columns correspond to words in our dictionary) into a product of two smaller matrices: one captures the weight of each topic in each document, while the other represents the discovered topics (clusters) as weights over words.
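
To make the factorization concrete, here is a small sketch with scikit-learn on a few toy documents I made up; W holds the topic weights per document and H holds the word weights per topic.

```python
# A toy sketch of NMF factoring the document-word matrix into two smaller matrices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the president met the prime minister",
    "the government passed a new law",
    "the striker scored a late goal",
    "the team won the final match",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)    # document-word matrix, shape (4, n_words)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)              # topic weights per document, shape (4, 2)
H = nmf.components_                   # word weights per topic, shape (2, n_words)
print(W.shape, H.shape)
```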

Number of topics

Both LDA and NMF take the number of topics as an input parameter for training. You can use your intuition, or gradually increase the number and test the model’s performance at each step.

Another way to assess the quality of the learned topics is the coherence score, which measures the degree of semantic similarity between the most commonly occurring words in each topic. Using it, I determined that the optimal number of topics for my experiment was 9.
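
One way to run such a sweep is sketched below, assuming gensim is available; `tokenized_messages` stands in for the preprocessed chats, and the placeholder token lists are invented.

```python
# A hedged sketch of choosing the number of topics via the coherence score.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

tokenized_messages = [["president", "minister", "government"],
                      ["goal", "match", "keeper"],
                      ["album", "concert", "band"]]  # placeholder token lists

dictionary = Dictionary(tokenized_messages)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_messages]

for k in range(2, 12):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    coherence = CoherenceModel(model=lda, texts=tokenized_messages,
                               dictionary=dictionary, coherence="c_v").get_coherence()
    print(k, round(coherence, 3))  # pick the k where coherence peaks
```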

Training the model is very straightforward, and there is a lot of help available online. scikit-learn has a nice example of how to perform topic extraction using the LDA and NMF models.
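
A condensed sketch along those lines is shown below; `messages` stands in for the preprocessed chat messages, and the number of topics is kept at 2 only to suit the toy data.

```python
# A condensed sketch of LDA topic extraction with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

messages = [
    "the president met the minister about the new law",
    "the government announced the budget today",
    "great goal and a brilliant save in the match",
    "the team played well and won the final",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(messages)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")  # most common words per topic
```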

Results

In my experiments with ~2K chat messages extracted from different subreddits, LDA was more promising than NMF.

NMF lumped more than half of the messages into one specific topic. LDA, on the other hand, was able to distribute messages across all 9 topics. Hence, I decided to further analyze the LDA model only.

The next objective at this stage was to interpret the results: take a closer look at the most common words in each topic and manually label them. Note that I’m now labeling at most 9 topics, whereas in supervised classification one would need to label ~2K pieces of text.
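
The labeling step itself amounts to something like the sketch below; the topic names and the placeholder document-topic probabilities are illustrative, not the actual labels from my experiment.

```python
# An illustrative sketch of mapping discovered topics to human labels.
import numpy as np

topic_labels = {0: "sports", 1: "music", 2: "spam"}  # ... up to the 9 topics

# Stand-in for the document-topic matrix (e.g., lda.transform(X)):
# one row per message, one column per topic, values are topic probabilities.
doc_topic = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.1, 0.1, 0.8]])

dominant = doc_topic.argmax(axis=1)          # dominant topic per message
print([topic_labels[t] for t in dominant])   # ['sports', 'music', 'spam']
```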

This is important for several reasons. First and foremost, it reduces the time required for manual review of the data. It also proactively identifies areas that are more likely to require human review, so certain reports can be prioritized for more urgent attention. In the same vein, spam can be proactively detected and removed before it degrades the user experience. And more broadly, it opens the door to learning more about community norms and trends that could inform product and feature development.

Statistical Variance

Machine learning engineers know that the main problem with small datasets is high variance. Although getting more data helps to reduce the variance, additional data is not always easy to get.

The results of this experiment were sensitive to small fluctuations in the training set and parameter space. Consequently, I had to spend some time carefully choosing model parameters.

Final Comments

Unsupervised learning models can be valuable even if you have a relatively small dataset. Topic modeling allowed me to take a batch of messages and extract shared commonalities from their content. I was able to detect several specific topic categories without spending time annotating the content with labels. This demonstrates that certain ML paradigms can offer value in reducing manual review of content, using insights to prioritize moderation activity, and proactively identifying content that hinders the user experience.

Update to original blog:

People asked me for recommendations on training or bootcamps. Here you go.

Recommendation

If you’re wondering how to bring yourself up to speed with 2019 technology, I highly recommend Francesco Mosconi’s fully immersive 5-day bootcamp, Zero to Deep Learning. It makes it easy to learn and use tools like Python, Jupyter Notebooks, Keras, and TensorFlow.