Catching cyberbullies in the act
Detecting online harassment with neural networks
Digital harassment is a problem in many corners of the internet, like forums, comment sections and game chat. In this article you can play with techniques to automatically detect users who misbehave, preferably as early in the conversation as possible. You will see that neural networks do a better job than simple word lists, but also that they are black boxes; one of our goals is to show how these networks come to their decisions. Also, we apologize in advance for all of the swear words :).
According to a 2016 report, 47% of internet users have experienced online harassment or abuse [1], and 27% of all American internet users self-censor what they say online because they are afraid of being harassed. On a similar note, a survey by the Wikimedia Foundation (the organization behind Wikipedia) showed that 38% of editors had encountered harassment, and over half of them said this lowered their motivation to contribute in the future [2]; a 2018 study found that 81% of American respondents wanted companies to address this problem [3]. If we want safe and productive online platforms where users do not chase each other away, something needs to be done.
One solution to this problem might be to use human moderators who read everything and take action if somebody crosses a boundary, but this is not always feasible (nor safe for the mental health of the moderators); popular online games can have the equivalent of the population of a large city playing at any one time, with hundreds of thousands of conversations taking place simultaneously. And much like a city, these players can be young, old, and diverse. At the same time, certain online games are notorious for their toxic communities. According to a 2020 survey by League of Legends player Celianna, 98% of League of Legends players have been 'flamed' (personally attacked in an online argument) during a match, and 79% have been harassed afterwards [4]. A conversation that is sadly not atypical for the game:
Z | fukin bot n this team.... so cluelesss gdam
V | u cunt
Z | wow ....u jus let them kill me
V | ARE YOU RETARDED
V | U ULTED INTO 4 PEOPLE
Z | this game is like playign with noobs lol....complete clueless lewl
L | ur shyt noob
For this article, we therefore use a dataset of conversations from this game and show different techniques for separating 'toxic' players from 'normal' players automatically. To keep things simple, we take 10 conversations in which one person misbehaves, and try to build a system that can pinpoint this one player, preferably early in the conversation.
Can't we just use a list of bad words?
A first approach for an automated detector might be to use a simple list of swear words and insults like 'fuck', 'suck', 'noob' and 'fag', and to label a player as toxic if they use a word from the list more often than a particular threshold. Below, you can slide through ten example conversations simultaneously. Normal players are represented by green faces, toxic players by red faces. When our simple system marks a player as toxic, that player is given a toxic symbol. These are all the possible options:
| Normal players | Toxic players
System says nothing (yet) | Normal situation | Missed toxic player |
System says: toxic | False alarm | Detected toxic player |
You can choose between detectors with thresholds of 1, 2, 3 and 5 bad words, to see what each one flags and at which point in the conversation.
As you can see, the detector with the low threshold detects all toxic players early in the game, but raises lots of false alarms (false positives). The detector with the high threshold, on the other hand, does not have this problem, but misses a lot of toxic players (false negatives). This tension between false positives and false negatives is a problem any approach will have; our goal is to find an approach where this balance is as good as possible.
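To make the idea concrete, here is a minimal sketch of such a word-list detector in Python. The short bad-word list, the (player, message) format and the threshold value are illustrative assumptions, not the exact list or data used in the visualization above.

```python
# A minimal word-list detector: flag a player once their cumulative count
# of listed words reaches the threshold. The word list below is a small
# illustrative sample, not the full list used in the article.
from collections import defaultdict

BAD_WORDS = {"fuck", "fukin", "noob", "cunt", "retarded", "shyt", "stupid"}

def flag_players(messages, threshold=2):
    """messages: list of (player, text) tuples, in conversation order.
    Yields (player, message_index) the moment a player crosses the threshold."""
    counts = defaultdict(int)
    flagged = set()
    for i, (player, text) in enumerate(messages):
        tokens = text.lower().split()
        counts[player] += sum(token.strip(".,!?") in BAD_WORDS for token in tokens)
        if counts[player] >= threshold and player not in flagged:
            flagged.add(player)
            yield player, i

conversation = [
    ("Z", "fukin bot n this team.... so cluelesss gdam"),
    ("V", "u cunt"),
    ("V", "ARE YOU RETARDED"),
    ("L", "ur shyt noob"),
]
print(list(flag_players(conversation, threshold=2)))  # [('V', 2), ('L', 3)]
```

Lowering the threshold makes the detector fire earlier but also more often on innocent players, which is exactly the trade-off visible in the slider above.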
Teaching language to machines
A better solution might be to use machine learning: we give thousands of examples of conversations with toxic players to a training algorithm and ask it to figure out how to recognize harassment by itself. Of course, such an algorithm will learn that swear words and insults are good predictors for toxicity, but it can also pick up more subtle word combinations and other phenomena. For example, if you look at how often the green and red faces open their mouths in the visualization above, you'll see that the average toxic player is speaking a lot more than the other players.
The average toxic player is speaking a lot more than the other players
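As a small illustration of that last observation, the sketch below simply counts messages per player; the (player, message) pairs are the same kind of hypothetical data used in the word-list sketch above.

```python
# Count how many messages each player sends; in the labelled conversations,
# the toxic player's count tends to be noticeably higher than the others'.
from collections import Counter

conversation = [           # hypothetical (player, message) pairs
    ("Z", "fukin bot n this team.... so cluelesss gdam"),
    ("V", "u cunt"),
    ("V", "ARE YOU RETARDED"),
    ("V", "U ULTED INTO 4 PEOPLE"),
    ("L", "ur shyt noob"),
]
print(Counter(player for player, _ in conversation))  # Counter({'V': 3, 'Z': 1, 'L': 1})
```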
The most successful algorithms for tasks like this these days are so-called neural networks. While even experts have trouble fully understanding why exactly they are so successful, there are some techniques to look under the hood and see what the network has learned. For example, texts that are processed by a neural network first pass through a so-called embedding layer, which tries to learn what each word means. One way to think about meaning is in terms of which words are similar to each other, and one way to represent similarity is to physically put words closer to some words and further away from others. A dictionary does this, but just on the basis of similar spellings. An embedding layer does this based on meaning, and we can visualize this in a 3D space:
As you explore the 3D space, you will find many interesting clusters of words that are indeed related in meaning. For example, there is a cluster of words related to time, a number cluster, a cluster of adjectives used to rate something, but also (and more useful for the current task) a cluster of insults and a cluster of variants of the word 'fuck'. And the system figured all of this out just by analyzing a lot of gaming conversations!
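This kind of similarity can be read directly off an embedding layer with a nearest-neighbour lookup. The sketch below uses a random stand-in matrix and a tiny made-up vocabulary; only the cosine-similarity lookup itself is the point, and in practice the matrix would be the weights of the trained embedding layer.

```python
# Nearest-neighbour lookup in an embedding space using cosine similarity.
# `embeddings` and `vocab` are hypothetical stand-ins for the trained
# embedding layer's weights and its word index.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["noob", "idiot", "stupid", "gj", "ty", "mia", "top", "mid"]
embeddings = rng.normal(size=(len(vocab), 50))   # in practice: trained weights

def most_similar(word, k=3):
    """Return the k words whose vectors are closest to `word` (cosine similarity)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    best = np.argsort(-sims)[1:k + 1]            # skip the word itself
    return [(vocab[i], float(sims[i])) for i in best]

print(most_similar("noob"))  # with trained weights, expect other insults nearby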
However, just knowing the rough meaning of relevant words is not enough... we need to know when to act. This is what the individual neurons in the rest of the neural network do; you can think of a neuron as an individual worker that, during the training phase, tries to find a simple, meaningful task for itself. A neural network is basically a collection of these workers, all doing simple but different tasks. What really sets a neural network apart from a simple word list-based detector is that these workers are organized in layers. Neurons in the lower layers typically pick up a small, concrete task, like recognizing particular words, while neurons in the higher layers do something increasingly abstract, like monitoring the temperature of the conversation as a whole. In the interactive visualization below, you can see which neurons respond positively (green words) or negatively (red words) to which parts of the conversations [5].
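The article does not spell out the exact architecture behind these visualizations, so the sketch below is only one plausible setup: an embedding layer followed by two recurrent layers (matching the 'first layer' and 'second layer' discussed next) and a single confidence output. The layer sizes and the use of Keras are assumptions.

```python
# A plausible per-player toxicity model: embed the words of a player's
# messages, pass them through two recurrent layers, and output a single
# confidence score between 0 and 1. Sizes are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),       # a sequence of token ids
    tf.keras.layers.Embedding(VOCAB_SIZE, 50),          # the embedding layer explored above
    tf.keras.layers.LSTM(32, return_sequences=True),    # "first layer" neurons
    tf.keras.layers.LSTM(32, return_sequences=True),    # "second layer" neurons
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # confidence that the player is toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```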
In the first layer, we see that example neuron 1 has developed an interest in several abbreviations like 'gj' (good job), 'gw' (good work), 'ty' (thank you) and, to a lesser extent, 'kk' (okay, or an indication of laughter in the Korean community) and 'brb' (be right back). Example neuron 12 focuses on a number of unfriendly words, activating on 'stupid', 'dumb', 'faggot' and 'piece of shit', and also somewhat on 'dirty cunt'. Note that its colors are swapped compared to neuron 1 (red for good predictors of toxicity and green for good predictors of collaborative players, instead of the other way around); a neuron in a later layer corrects for this. Neuron 16 activates on 'mia' (missing in action), which is typically used to notify your teammates of possible danger... and thus a sign that this person is collaborative and probably not toxic.
The neurons in the second layer monitor the conversation on a higher level
The neurons in the second layer monitor the conversation on a higher level. In contrast to the abrupt changes in the first layer, the colors in the second layer fade more smoothly. Neuron 17 is a good example: the conversation is green in the beginning, slowly goes from yellow to orange, and then later back to green again. Several neurons, like neuron 6 in the second layer, find the repetitive use of 'go' suspicious: they activate more with each repetition.
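This kind of looking under the hood can be done by reading out the per-word activations of the intermediate layers. A minimal sketch, assuming the hypothetical Keras model defined above and a made-up sequence of token ids:

```python
# Read out the activations of the recurrent layers for one player's messages,
# so each word can be coloured by how strongly a given neuron responds to it.
# Assumes the `model` defined earlier; `token_ids` is a hypothetical example.
import numpy as np
import tensorflow as tf

# One sequence of token ids for a single player (hypothetical values).
token_ids = np.array([[12, 873, 4, 991, 57, 3]], dtype="int32")

# Build a probe model that returns the outputs of both recurrent layers.
lstm_layers = [layer for layer in model.layers if isinstance(layer, tf.keras.layers.LSTM)]
probe = tf.keras.Model(inputs=model.inputs,
                       outputs=[layer.output for layer in lstm_layers])

first_layer_acts, second_layer_acts = probe(token_ids)
# Shape: (1, number_of_tokens, 32) — one activation per word per neuron.
neuron = 12
print(first_layer_acts[0, :, neuron])  # how 'neuron 12' responds to each word
```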
But does it work?
The big question is whether a harassment detector using a neural network instead of a word list actually performs better. Below you can compare the word list-based approach from before (with a threshold of 2 bad words) against the same neural network with three different confidence thresholds [5]. The threshold is now not a number of bad words, but the network's confidence: a number between 0 and 1 indicating how sure the network is that a particular player is toxic.
As with the word list-based approach, we see that a higher threshold means fewer false alarms (false positives) but also fewer true positives. However, two of the neural network-based detectors find far more toxic players during the conversation while raising fewer false alarms at the same time... progress!
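The trade-off for both kinds of detector can be quantified in the same way. A minimal sketch, assuming we have a confidence score per player and the ground-truth labels; the scores and labels below are made-up examples.

```python
# Compare detectors by counting detections, false alarms and missed toxic
# players at different confidence thresholds.
def evaluate(scores, labels, threshold):
    """scores: confidence per player; labels: True if the player is toxic."""
    flagged = [s >= threshold for s in scores]
    detected = sum(f and l for f, l in zip(flagged, labels))
    false_alarms = sum(f and not l for f, l in zip(flagged, labels))
    missed = sum(l and not f for f, l in zip(flagged, labels))
    return {"detected": detected, "false alarms": false_alarms, "missed": missed}

scores = [0.91, 0.35, 0.12, 0.78, 0.05, 0.64]      # hypothetical network confidences
labels = [True, False, False, True, False, False]  # ground truth: who was toxic

for threshold in (0.3, 0.5, 0.8):
    print(threshold, evaluate(scores, labels, threshold))
```

Sweeping the threshold like this is how the three neural network detectors above were derived from a single trained network.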
The bigger picture
Besides the technical challenge of detecting bad actors early, automated conversation monitoring raises a number of ethical questions: do we only want to use this technique to study the toxicity within a community, or do we really want to monitor actual conversations in real time? And if so, what does it mean for a community to have an automatic watchdog always looking over your shoulder? And even if it is acceptable to have a watchdog for toxicity, something broadly desired by people who spend time online, what if the techniques described here are used to detect other social phenomena?
And say we have a system that can perfectly detect bad behavior in online conversations, what should be done when it detects somebody? Should they be notified, warned, muted, banned, reported to an authority? And at what point should action be taken... how toxic is too toxic? The former director of Riot Games' Player Behavior Unit attributes most toxicity to 'the average person just having a bad day'... Is labeling a whole person as toxic or non-toxic not too simplistic?
Whatever the best answer to these questions might be, just doing nothing is not it.
Whatever the best answer to these questions might be, just doing nothing is not it; the Anti-Defamation League and several scholars who study hate speech argue that toxic behavior and harassment online lead to more hate crimes offline [7]. Automatic detection seems like a good first step for the online communities that are fighting this problem. Below you can play with both detection techniques introduced in this article and set the thresholds yourself. What threshold do you think would make the most sense in which use case? For example, how would you tune a system whose output will later be judged by humans, versus a system that can automatically ban users? Are you willing to accept false positives if that means catching all toxic players?