Incident 13: High-Toxicity Assessed on Text Involving Women and Minority Groups

Description: Google's Perspective API, which assigns a toxicity score to online text, seems to award higher toxicity scores to phrases that mention identities other than white, male, Christian, or heterosexual.
Alleged: Google developed and deployed an AI system, which harmed Women and Minority Groups.

Suggested citation format

Olsson, Catherine. (2017-02-27) Incident Number 13. in McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID
Report Count
Incident Date
Sean McGregor



CSET Taxonomy Classifications

Taxonomy Details

Full Description

Google's Perspective API, which assigns a toxicity score to online text, has been shown to award higher toxicity scores to phrases that mention identities other than white, male, Christian, or heterosexual. The scores lie on a spectrum from very healthy (low %) to very toxic (high %). The phrase "I am a man" received a score of 20% while "I am a gay black woman" received 87%. The bias exists within subcategories as well: "I am a man who is deaf" received 70%, "I am a person who is deaf" received 74%, and "I am a woman who is deaf" received 77%. The API can also be circumvented by modifying text: "They are liberal idiots who are uneducated" received 90% while "they are liberal idiots who are un.educated" received 15%.

Short Description

Google's Perspective API, which assigns a toxicity score to online text, seems to award higher toxicity scores to phrases that mention identities other than white, male, Christian, or heterosexual.



Harm Distribution Basis

Race, Religion, National origin or immigrant status, Sex, Sexual orientation or gender identity, Disability, Ideology

Harm Type

Psychological harm, Harm to social or political systems

AI System Description

Google Perspective is an API designed using machine learning techniques to assign "toxicity" scores to online text, with the original intent of assisting in identifying hate speech and "trolling" in internet comments. Perspective is trained to recognize a variety of attributes (e.g. whether a comment is toxic, threatening, insulting, off-topic, etc.) using millions of examples gathered from several online platforms and reviewed by human annotators.

System Developer


Sector of Deployment

Information and communication

Relevant AI functions

Perception, Cognition, Action

AI Techniques

open-source, machine learning

AI Applications

Natural language processing, content ranking



Named Entities

Google, Google Cloud, Perspective API

Technology Purveyor


Beginning Date


Ending Date


Near Miss

Harm caused



Lives Lost


Data Inputs

Online comments

Incident Reports

Yesterday, Google and its sister Alphabet company Jigsaw announced Perspective, a tool that uses machine learning to police the internet against hate speech. The company heralded the tech as a nascent but powerful weapon in combatting online vitriol, and opened the software so websites could use it on their own commenting systems.

However, computer scientists and others on the internet have found the system unable to identify a wide swath of hateful comments, while categorizing innocuous word combinations like “hate is bad” and “garbage truck” as overwhelmingly toxic. The Jigsaw team sees this problem, but stresses that the software is still in an “alpha stage,” referring to experimental software that isn’t yet ready for mass deployment.

In tandem with the announcement that its project would be open to developers through an application programming interface (API), Jigsaw posted a simple text box that would call the API and return what the system thought of words and phrases. Sentences and phrases are given a toxicity ranking based on what respondents to Survata surveys deemed similar examples as “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.”
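For readers who want to reproduce checks like the ones reported here, the following is a minimal sketch of the request a client might build and how a score could be read from the reply. The endpoint URL and JSON field names follow the publicly documented Comment Analyzer interface as best understood at the time of writing and should be verified against the current documentation. The sample response is made up for illustration, and no network call is made.

```python
import json

# Assumed endpoint; a real call also needs an API key appended as ?key=...
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text):
    """Build the JSON body asking for a TOXICITY score for one comment."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def toxicity_percent(response):
    """Read the summary TOXICITY score (0.0 to 1.0) and convert it to a percentage."""
    value = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return round(value * 100)


# Hypothetical response shaped like the documented format, not real API output.
sample_response = {
    "attributeScores": {"TOXICITY": {"summaryScore": {"value": 0.78}}}
}

payload = json.dumps(build_request("garbage truck"))
percent = toxicity_percent(sample_response)
```

The service returns a probability-like value between 0 and 1, which the demo page presented as a percentage.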

David Auerbach, a writer for MIT Tech Review and former Google engineer, ran a list of hateful and non-hateful phrases through the system:

“I fucking love you man. Happy birthday.” = 93% toxic

“Donald Trump is a meretricious buffoon.” = 85% toxic.

“few muslims are a terrorist threat” = 79% toxic

“garbage truck” = 78% toxic

“You’re no racist” = 77% toxic

“whites and blacks are not inferior to one another” = 73% toxic

“I’d hate to be black in Donald Trump’s America.” = 73% toxic

“Jews are human” = 72% toxic

“I think you’re being racist” = 70% toxic

“Hitler was an anti-semite” = 70% toxic

“this comment is highly toxic” = 68% toxic

“You are not being racist” = 65% toxic

“Jews are not human” = 61% toxic

“Hitler was not an anti-semite” = 53% toxic

“drop dead” = 40% toxic

“gas the joos race war now” = 40% toxic

“genderqueer” = 34% toxic

“race war now” = 24% toxic

“some races are inferior to others” = 18% toxic

“You are part of the problem” = 16% toxic

Like all machine-learning algorithms, the more data the Perspective API has, the better it will work. The Alphabet subsidiary worked with partners like Wikipedia and The New York Times to gather hundreds of thousands of comments, and then crowdsourced 10 answers for each comment on whether they were toxic or not. The effort was intended to kickstart the deep neural network that makes the backbone of the Perspective API.
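One way ten crowdsourced answers per comment could be turned into a training label is to take the fraction of raters who marked the comment toxic. This is a sketch under that assumption. The article does not say how Jigsaw actually aggregated the answers, and the function name is hypothetical.

```python
def toxicity_label(votes):
    """Return the fraction of raters who marked a comment as toxic.

    `votes` is a list of booleans, one per crowd rater (assumed format).
    """
    if not votes:
        raise ValueError("at least one rating is required")
    return sum(votes) / len(votes)


# Illustrative ratings: 7 of 10 raters call the comment toxic.
example_votes = [True] * 7 + [False] * 3
label = toxicity_label(example_votes)
```

A model trained on such fractions learns to predict how many raters would object, which is why its output reads as "likely to be perceived as toxic" rather than as a yes-or-no verdict.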

“It’s very limited to the types of abuse and toxicity in that initial training data set. But that’s just the beginning,” CJ Adams, Jigsaw product manager, told Quartz. “The hope is over time, as this is used, we’ll continue to see more and more examples of abuse, and those will be voted on by different people and improve its ability to detect more types of abuse.”

Previous research published by Jigsaw and Wikimedia details an earlier attempt at finding toxicity in comments. Jigsaw crowdsourced the rating of Wikipedia comments, asking Crowdflower users to gauge whether a comment was an attack or harassment of a person, third party, or whether the commenter was quoting someone else. They then captured 1-5 character snippets, called character-level ngrams, of the attacking comments and trained a machine-learning algorithm to learn that those ngrams were correlated with toxic activity.
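To make the idea of character-level ngrams concrete, here is a minimal sketch that counts every 1-5 character snippet in a piece of text. It illustrates the feature type only. The function name, the lowercasing step, and the use of raw counts are assumptions, not details from the Jigsaw and Wikimedia research.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=5):
    """Count every substring of length min_n..max_n in the text."""
    text = text.lower()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


original = char_ngrams("idiot")
misspelled = char_ngrams("idiiot")

# Snippets that survive the misspelling, e.g. "idi" and "ot".
shared = set(original) & set(misspelled)
```

Because a misspelled word still shares many short snippets with the original, a classifier built on these features can pick up some typos and inflections, but it sees text as a bag of fragments rather than as sentences.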

Yoav Goldberg, a senior lecturer at Bar Ilan University and former post-doc research scientist at Google not associated with the research, says that previous system lacked the ability to represent subtle differences in the text.

“This is enough to capture information about single words, while allowing also to capture word variations, typos, inflections and so on,” Goldberg told Quartz. “This is essentially finding ‘good words’ and ‘bad words,’ but it is clear that it cannot deal with any nuanced (or even just compositional) word usage.”

For example, “racism is bad” triggers the old system into giving an overwhelmingly negative score because the words “racism” and “bad” are seen as negative, Goldberg says.

The Perspective API is not necessarily a huge improvement on previous efforts quite yet, and is a step back in some ways. Demonstrated to Wired’s Andy Greenberg in September of 2016, the phrase “You’re such a bitch” rates as 96% toxic. In the new system’s public API, it’s 97%. Good!

But when testing his example of a more colloquial (yet still aggravatingly misogynistic) phrase “What’s up bitches? :)” Greenberg’s test of the old system ranks 39% toxicity, while the new public version released yesterday ranks the phrase as 95% toxic.

Lucas Dixon, chief research scientist at Jigsaw, says there are two reasons for this. First, the system shown to Greenberg was a research model specifically trained to detect personal attacks, meaning it would be much more sensitive to words like “you” or “you’re.” Second, and potentially more importantly, the system was using the ngram technique detailed before.

“Character-level models are much better able to understand misspellings and different fragments of words, but overall it’s going to do much worse,” Dixon told Quartz.

That’s because, while that technique can be efficiently pointed at a very specific problem, like figuring out that smiley faces correlate with someone being nice, the deep neural network being trained through the API now has a much greater capacity to understand the nuances of the entire language.

By using Jigsaw’s “Writing Experiment,” it’s easy to see that certain words are now correlated with negative comments while others are not. The single word “suck” has 93% toxicity. On its own, “suck” doesn’t mean anything negative, but the system still associates it with every negative comment it’s seen containing the word. “Nothing sucks” has a toxicity of 94%. So does “dave sucks.”

Alphabet’s hate-fighting AI doesn’t understand hate yet

In the examples below on hot-button topics of climate change, Brexit and the recent US election -- which were taken directly from the Perspective API website -- the UW team simply misspelled or added extraneous punctuation or spaces to the offending words, which yielded much lower toxicity scores. For example, simply changing "idiot" to "idiiot" reduced the toxicity rate of an otherwise identical comment from 84% to 20%. Credit: University of Washington

University of Washington researchers have shown that Google's new machine learning-based system to identify toxic comments in online discussion forums can be bypassed by simply misspelling or adding unnecessary punctuation to abusive words, such as "idiot" or "moron."

Perspective is a project by Google's technology incubator Jigsaw, which uses artificial intelligence to combat internet trolls and promote more civil online discussion by automatically detecting online insults, harassment and abusive speech. The company launched a demonstration website on Feb. 23 that allows anyone to type in a phrase and see its "toxicity score"—a measure of how rude, disrespectful or unreasonable a particular comment is.

In a paper posted Feb. 27 on the e-print repository arXiv, the UW electrical engineers and security experts demonstrated that the early stage technology system can be deceived by using common adversarial tactics. They showed one can subtly modify a phrase that receives a high toxicity score so that it contains the same abusive language but receives a low toxicity score.

Given that news platforms such as The New York Times and other media companies are exploring how the system could help curb harassment and abuse in online comment areas or social media, the UW researchers evaluated Perspective in adversarial settings. They showed that the system is vulnerable to both missing incendiary language and falsely blocking non-abusive phrases.

In the examples in Graphic 2, the researchers also showed that the system does not assign a low toxicity score to a negated version of an abusive phrase. Credit: University of Washington

"Machine learning systems are generally designed to yield the best performance in benign settings. But in real-world applications, these systems are susceptible to intelligent subversion or attacks," said senior author Radha Poovendran, chair of the UW electrical engineering department and director of the Network Security Lab. "We wanted to demonstrate the importance of designing these machine learning tools in adversarial environments. Designing a system with a benign operating environment in mind and deploying it in adversarial environments can have devastating consequences."

To solicit feedback and invite other researchers to explore the strengths and weaknesses of using machine learning as a tool to improve online discussions, Perspective developers made their experiments, models and data publicly available along with the tool itself.

In the examples in Graphic 1 on hot-button topics of climate change, Brexit and the recent U.S. election—which were taken directly from the Perspective API website—the UW team simply misspelled or added extraneous punctuation or spaces to the offending words, which yielded much lower toxicity scores. For example, simply changing "idiot" to "idiiot" reduced the toxicity rate of an otherwise identical phrase from 84 percent to 20 percent.

In the examples in Graphic 2, the researchers also showed that the system does not assign a low toxicity score to a negated version of an abusive phrase.

The UW electrical engineering research team includes (left to right) Radha Poovendran, Hossein Hosseini, Baosen Zhang and Sreeram Kannan (not pictured). Credit: University of Washington

The researchers also observed that the duplicitous changes often transfer among different phrases—once an intentionally misspelled word was given a low toxicity score in one phrase, it was also given a low score in another phrase. That means an adversary could create a "dictionary" of changes for every word and significantly simplify the attack process.

"There are two metrics for evaluating the performance of a filtering system like a spam blocker or toxic speech detector; one is the missed detection rate, and the other is the false alarm rate," said lead author and UW electrical engineering doctoral student Hossein Hosseini. "Of course scoring the semantic toxicity of a phrase is challenging, but deploying defensive mechanisms both in algorithmic and system levels can help the usability of the system in real-world settings."

The research team suggests several techniques to improve the robustness of toxic speech detectors, including applying a spellchecking filter prior to the detection system, training the machine learning algorithm with adversarial examples and blocking suspicious users for a period of time.
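The sketch below ties together two ideas from this report: a spellchecking filter placed in front of a detector, and the missed detection and false alarm rates used to judge it. It is an illustration under stated assumptions, not the researchers' implementation. The watchlist, the similarity cutoff, the toy scorer standing in for a toxicity model, and the sample comments are all made up for this example.

```python
import difflib

# Hypothetical watchlist; a real spellchecker would use a full dictionary.
WATCHLIST = ["idiot", "moron", "uneducated"]


def normalize_comment(text, vocabulary=WATCHLIST, cutoff=0.85):
    """Replace each token with its closest watchlist word, if one is close enough."""
    fixed = []
    for token in text.lower().split():
        match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else token)
    return " ".join(fixed)


def toy_score(text):
    """Stand-in for a toxicity model (NOT Perspective): flags exact watchlist words."""
    return 0.9 if any(word in WATCHLIST for word in text.lower().split()) else 0.1


def error_rates(scores, is_toxic, threshold=0.5):
    """Return (missed detection rate, false alarm rate) at the given threshold."""
    missed = sum(1 for s, t in zip(scores, is_toxic) if t and s < threshold)
    false_alarms = sum(1 for s, t in zip(scores, is_toxic) if not t and s >= threshold)
    n_toxic = sum(1 for t in is_toxic if t)
    n_clean = len(is_toxic) - n_toxic
    return (missed / n_toxic if n_toxic else 0.0,
            false_alarms / n_clean if n_clean else 0.0)


# Illustrative comments and labels, made up for this sketch.
comments = ["you are an idiot", "you are an idiiot", "what a nice day", "garbage truck"]
labels = [True, True, False, False]

raw = error_rates([toy_score(c) for c in comments], labels)
filtered = error_rates([toy_score(normalize_comment(c)) for c in comments], labels)
```

With the toy scorer, the misspelled comment slips through unfiltered and is caught once the filter runs first. The fuzzy match is deliberately crude: it also rewrites close variants such as plurals, so a production filter would need a real dictionary and care to avoid raising the false alarm rate.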

"Our Network Security Lab research is typically focused on the foundations and science of cybersecurity," said Poovendran, the lead princ

Security researchers show Google's anti-internet troll AI platform is easily deceived

The Google AI tool used to flag “offensive comments” has a seemingly built-in bias against conservative and libertarian viewpoints.

Perspective API, a “machine learning model” developed by Google which scores “the perceived impact a comment might have on a conversation” in the comment section of a news article, ranks comments based on their “toxicity.”

But when testing out its algorithm, Perspective generally scores conservative and libertarian comments as more “toxic” than establishment talking points.

For example:

As we reported throughout the election, Google preferred Hillary Clinton to Donald Trump, a preference that apparently hasn’t changed:

And the abortion debate between conservatives and liberals:

Perspective has been used by establishment news outlets including the Guardian, the New York Times and the Economist.

“News organizations want to encourage engagement and discussion around their content, but find that sorting through millions of comments to find those that are trolling or abusive takes a lot of money, labour and time,” said Jared Cohen, president of Jigsaw, the Google affiliate behind Perspective. “As a result, many sites have shut down comments altogether, but they tell us that isn’t the solution they want.”

Of course that’s not what they want; the public is increasingly more interested in reading the comment section than the article itself, and if they shut down the comments, readers will flee to another news site covering the same story but with a comment section.

So what better way to control the narrative than by promoting pro-establishment comments while burying conservative and libertarian counterpoints?

Twitter was already doing just that by pinning criticism of President Trump as the top responses to his tweets.

And even before using Perspective, the New York Times would promote “Editor’s Pick” comments which, not surprisingly, agreed with the Times’ narrative.

But for controversial articles, the Times tends to just shut down comments altogether, which of course makes the article look less credible – and this likely explains the establishment media’s interest in Perspective API.

Additionally, Google has hired contractors to bury or outright ban from its search engines, according to investigative journalist Mike Cernovich who was given leaked documents.

“There are a number of controversial, often debunked claims that the site regularly promotes,” the document claims.

Google Robo-Tool Flags Conservative Comments as “Toxic”

Don’t you just hate how vile some people are on the Internet? How easy it’s become to say horrible and hurtful things about other groups and individuals? How this tool that was supposed to spread knowledge, amity, and good cheer is being used to promulgate hate? No need to worry anymore: Google’s on it.

Earlier this year, Silicon Valley’s overlords introduced Perspective API, the latter being nerd-speak for Application Program Interface, or a set of tools for building software. The idea behind it is simple: because it’s impossible for an online publisher to manually monitor all the comments left on its website, Perspective will use advanced machine learning to help moderators track down comments that are likely to be “toxic.” Here’s how the company describes it: “The API uses machine learning models to score the perceived impact a comment might have on a conversation.”

That’s a strange sentiment. How do you measure the perceived impact of a conversation? And how can you tell if a conversation is good or bad? The answers, in Perspective’s case, are simple: machine learning works by giving computers access to vast databases, and letting them figure out the likely patterns. If a machine read all the cookbooks published in the English language in the last 100 years, say, it would be able to tell us interesting things about how we cook, like the peculiar fact that when we serve rice we’re very likely to serve beans as well. What can machines tell us about the way we converse and about what we may find offensive? That, of course, depends on what databases you let the machines learn. In Google’s case, the machines learned the comments sections of The New York Times, the Economist, and the Guardian.

What did the machines learn? Only one way to find out. I asked Perspective to rate the following sentiment: “Jews control the banks and the media.” This old chestnut, Perspective reported, had a 10 percent chance of being perceived as toxic.

Maybe Perspective was just relaxed about sweeping generalizations that have been used to stain entire ethnic and religious groups, I thought. Maybe the nuance of harmful stereotypes was lost on Google’s algorithms. I tried again, this time with another group of people, typing “Many terrorists are radical Islamists.” The comment, Perspective informed me, was 92 percent likely to be seen as toxic.

What about straightforward statements of facts? I reached for the news, which, sadly, has been very grim lately, and wrote: “Three Israelis were murdered last night by a knife-wielding Palestinian terrorist who yelled ‘Allah hu Akbar.’” This, too, was 92 percent likely to be seen as toxic.

You, too, can go online and have your fun, but the results shouldn’t surprise you. The machines learn from what they read, and when what they read are the Guardian and the Times, they’re going to inherit the inherent biases of these publications as well. Like most people who read the Paper of Record, the machine, too, has come to believe that statements about Jews being slaughtered are controversial, that addressing radical Islamism is verboten, and that casual anti-Semitism is utterly forgivable. The very term itself, toxicity, should’ve been enough of a giveaway: the only groups that talk about toxicity—see under: toxic masculinity—are those on the regressive left who creepily apply the metaphors of physical harm to censor speech not celebrate or promote it. No words are toxic, but the idea that we now have an algorithm replicating, amplifying, and automatizing the bigotry of the anti-Jewish left may very well be.

Liel Leibovitz is a senior writer for Tablet Magazine and a host of the Unorthodox podcast.

Google's New Hate Speech Algorithm Has a Problem With Jews

Last month, I wrote a blog post warning about how, if you follow popular trends in NLP, you can easily accidentally make a classifier that is pretty racist. To demonstrate this, I included the very simple code, as a “cautionary tutorial”.

The post got a fair amount of reaction. Much of it positive and taking it seriously, so thanks for that. But eventually I heard from some detractors. Of course there were the fully expected “I’m not racist but what if racism is correct” retorts that I knew I’d have to face. But there were also people who couldn’t believe that anyone does NLP this way. They said I was talking about a non-problem that doesn’t show up in serious machine learning, or projecting my own bad NLP ideas, or something.

Well. Here’s Perspective API, made by an offshoot of Google. They believe they are going to use it to fight “toxicity” online. And by “toxicity” they mean “saying anything with negative sentiment”. And by “negative sentiment” they mean “whatever word2vec thinks is bad”. It works exactly like the hypothetical system that I cautioned against.

On this blog, we’ve just looked at what word2vec (or GloVe) thinks is bad. It includes black people, Mexicans, Islam, and given names that don’t usually belong to white Americans. You can actually type my examples into Perspective API and it will actually respond that the ones that are less white-sounding are more “likely to be perceived as toxic”.

“Hello, my name is Emily” is supposedly 4% likely to be “toxic”. Similar results for “Susan”, “Paul”, etc.

“Hello, my name is Shaniqua” (“Jamel”, “DeShawn”, etc.): 21% likely to be toxic.

“Let’s go get Italian food”: 9%.

“Let’s go get Mexican food”: 29%.

Here are two more examples I didn’t mention before:

“Christianity is a major world religion”: 37%. Okay, maybe things can get heated when religion comes up at all, but compare:

“Islam is a major world religion”: 66% toxic.

I’ve heard about Perspective API from many directions, but my proximate source is this Twitter thread by Dan Luu, who has his own examples:

It’s 🤣 to poke around and see what biases the system picked up from the training data. 😰 to think about actual applications, though. — Dan Luu (@danluu) August 12, 2017

I have previously written positive things about researchers at Google who are looking at approaches to de-biasing AI, such as their blog post on Equality of Opportunity in Machine Learning.

But Google is a big place. It contains multitudes. And it seems it contains a subdivision that will do the wrong thing, which other Googlers know is the wrong thing, because it’s easy.

Google, you made a very bad investment. (That sentence is 61% toxic, by the way.)

As I update this post in April 2018, I’ve had some communication with the Perspective API team and learned some more details about it.

Some details of this post were incorrect, based on things I assumed when looking at Perspective API from outside. For example, Perspective API does not literally build on word2vec. But the end result is the same: it learns the same biases that word2vec learns anyway.

In September 2017, Violet Blue wrote an exposé of Perspective API for Engadget. Despite the details that I had wrong, the Engadget article confirms that the system really is that bad, and provides even more examples.

Perspective API has changed their online demo to lower toxicity scores across the board, without fundamentally changing the model. Text with a score under a certain threshold is now labeled as “not toxic”. I believe this remedy could be described technically as “weak sauce”.

The Perspective API team claims that their system has no inherent bias against non-white names, and that the higher toxicity scores that appear for names such as “DeShawn” are an artifact of how they handle out-of-vocabulary words. All the names that are typical for white Americans are in-vocabulary. Make of that what you will.

The Perspective API team continues to promote their product, such as via hackathons and TED talks. Users of the API are not warned of its biases, except for a generic warning that could apply to any AI system, saying that users should manually review its results. It is still sometimes held up as a positive example of fighting toxicity with NLP, misleading lay audiences into thinking that present NLP has a solution to toxicity.

You weren’t supposed to actually implement it, Google

As politics in the US and Europe have become increasingly divisive, there's been a push by op-ed writers and politicians alike for more "civility" in our debates, including online. Amidst this push comes a new tool by Google's Jigsaw that uses machine learning to rank what it calls the "toxicity" of a given sentence or phrase. But as Dave Gershgorn reported for Quartz, the tool has been criticized by researchers for being unable to identify certain hateful phrases, while categorizing innocuous word combinations as toxic.

The project, Perspective, is an API that was trained by asking people to rate online comments on a scale from "very toxic" to "very healthy," with "toxic" being defined as a "rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion." It's part of a growing effort to sanitize conversations online, which is reflective of a certain culture within Silicon Valley and the United States as a whole: The culture of civility.

The tool seems to rank profanity as highly toxic, while deeply harmful statements are often deemed safe

If we were merely kind to one another in our interactions, the argument goes, we would be less divided. Yet, this argument fails to recognize how politeness and charm have throughout history been used to dress up hateful speech, including online.

Perspective was trained on text from actual online comments. As such, its interpretation of certain terms is limited—because "fuck you" is more common in comments sections than "fuck yeah," the tool perceives the word "fuck" as inherently toxic. Another example: Type "women are not as smart as men" into the meter's text box, and the sentence is "4% likely to be perceived as 'toxic'." A number of other highly problematic phrases—from "men are biologically superior to women" to "genocide is good"—rank low on toxicity. Meanwhile, "fuck off" comes in at 100 percent.

This is an algorithmic problem. Algorithms learn from the data they are fed, building a model of the world based on that data. Artificial intelligence reflects the values of its creators, and thus can be discriminatory or biased, just like the human beings who program and train it.

So what does the Perspective tool's data model say about its creators? Based on the examples I tested, the tool seems to rank profanity as highly toxic, while deeply harmful statements—when they're politely stated, that is—are often deemed safe. The sentence "This is awesome" comes in at 3 percent toxic, but add "fucking" (as in the Macklemore lyric "This is fucking awesome") and the sentence escalates to 98 percent toxic.

In an email, a Jigsaw spokesperson called Perspective a "work in progress," and noted that false positives are to be expected as its machine learning improves.

This problem isn't unique to Google; as Silicon Valley companies increasingly seek to moderate speech on their online platforms, their definition of "harmful" or "toxic" speech matters.

Civility über alles

The argument for civility is thus: If we were only civil to each other, the world would be a better place. If only we addressed each other politely, we would be able to solve our disagreements. This has led to the expectation that any speech—as long as it's dressed up in the guise of politeness—should be accepted and debated, no matter how bigoted or harmful the idea behind the words.

Here's what this looks like in practice: A Google employee issues a memo filled with sexist ideas, but because he uses polite language, women are expected to debate the ideas contained within. On Twitter, Jewish activists bombarded with anti-Semitic messages are suspended for responding with language like "fuck off." On Facebook, a Black mother posting copies of the threats she received from racists gets suspended due to the language in the re-posted threats.

In this rubric, counter speech—long upheld as an important concept for responding to hate without censorship—is punished for merely containing profanities.

Read More: Inside Wikipedia's Attempt to Use Artificial Intelligence to Combat Harassment

It is the culture amongst the moderators of centralized community platforms, from mighty Facebook to much-smaller Hacker News, where "please be civil" is a regular refrain. Vikas Gorur, a programmer and Hacker News user, told me that on the platform "the slightest personal attack ('you're stupid') is a sin, while a 100+ subthread about 'was slavery really that bad?' or 'does sexual harassment exist?' are perfectly fine."

Free speech, said Gorur, "is the cardinal virtue, no matter how callous that speech is."

From Washington to the Valley

This attitude is not only a phenomenon within Silicon Valley, but in American society at large. Over the past eight months since the United States elected a reality television star to its highest office, the President's opponents have regularly been chastised for their incivility, even as their rights

Google's Anti-Bullying AI Mistakes Civility for Decency

A recent, sprawling Wired feature outlined the results of its analysis on toxicity in online commenters across the United States. Unsurprisingly, it was like catnip for everyone who's ever heard the phrase "don't read the comments." According to "The Great Tech Panic: Trolls Across America," Vermont has the most toxic online commenters, whereas Sharpsburg, Georgia, "is the least-toxic city in the US."

There's just one problem.

The underlying API used to determine "toxicity" scores phrases like "I am a gay black woman" as 87 percent toxicity, and phrases like "I am a man" as the least toxic. The API, called Perspective, is made by Google's Alphabet within its Jigsaw incubator.

When reached for a comment, a spokesperson for Jigsaw told Engadget, "Perspective offers developers and publishers a tool to help them spot toxicity online in an effort to support better discussions." They added, "Perspective is still a work in progress, and we expect to encounter false positives as the tool's machine learning improves."

Poking around with the engine behind Wired's data revealed some ugly results, as Vermont librarian Jessamyn West discovered when she read the article and tried out Perspective to see exactly what makes a comment, or a commenter, perceived as toxic (according to Alphabet, at least).

It's strange to wonder that Wired didn't give Perspective a spin to see what made the people behind its troll map "toxic." Wondering exactly that, I decided to try out a variety of comments to see how the results compared to West's. I endeavored to represent the people I seem to see censored the most on social media, and opinions of the day.

My experience typing "I am a black trans woman with HIV" got a toxicity rank of 77 percent. "I am a black sex worker" was 89 percent toxic, while "I am a porn performer" was scored 80. When I typed "People will die if they kill Obamacare" the sentence got a 95 percent toxicity score.

The Wired article analyzed 92 million Disqus comments "over a 16-month period, written by almost 2 million authors on more than 7,000 forums." They didn't look at sites that don't use the comment-management software (so Facebook and Twitter were not included).

The piece explained:

To broadly determine what is and isn't toxic, Disqus uses the Perspective API—software from Alphabet's Jigsaw division that plugs into its system. The Perspective team had real people train the API to rate comments. The model defines a toxic comment as "a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion."

Discrimination by algorithm

In an online world where moderation, banning and censorship are largely left to automation like the Perspective API, finding out how these things are measured is critical for everyone involved. "Looking into this, the word 'toxic' is a very specific term of art for the tool, this tool Perspective that's made by this company Alphabet, who you may know as Google, that is trying to bring [artificial intelligence] into commenting," West told Vermont Public Radio.

Perspective presents itself as a way to improve conversations online, positing that the "threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions." It's one of the many "make the world safer" Jigsaw projects.

Jigsaw worked with The New York Times and Wikipedia to develop Perspective. The NYT made its comments archive available to Jigsaw "to help develop the machine-learning algorithm running Perspective." Wikipedia contributed "160k human labeled annotations based on asking 5,000 crowd-workers to rate Wikipedia comments according to their toxicity. ... Each comment was rated by 10 crowd-workers."

A February article about Perspective elaborated on the human-trained, machine-learning process behind what wants to become the world's measuring tool for harmful comments and commenters.

"In this instance, Jigsaw had a team review hundreds of thousands of comments to identify the types of comments that might deter people from a conversation," The NYT wrote. "Based on that data, Perspective provided a score from zero to 100 on how similar the new comments are to the ones identified as toxic."

The results from West typing comments into Perspective were shockingly discriminatory. Identifying as black and/or gay was deemed toxic. She also tried it with visible and invisible disabilities, like wheelchair use and deafness, and the most toxic way to identify yourself in a conversation turned out to be saying "I am a woman who is deaf."

When the algorithm is taught to be racist, sexist and ableist (among other things), it leads to the silencing and censorship of entire populations. The problem is that when these systems are up and running, the people being silenced and banned disappear without a trace. Discrimination by algorithm happens in a vacuum.

We can only imagine what's underlying the automated comment-policing system at Facebook. In August, Mary Canty Merrill, a psychologist who advises corporations on how to avoid racial bias, wrote a short Facebook post about defining racism.

Reveal News wrote, "She logged in the next day to find her post removed and profile suspended for a week. A number of her older posts, which also used the 'Dear white people' formulation, had been similarly erased."

Pasting her "Dear white people" into Perspective's API got a score of 61 percent toxicity.

Unless Google anti-diversity creeper James Damore was the project lead for Perspective, it's hard to imagine that the company would greenlight a product that treats identifying as a black gay woman as toxic. (Wikipedia, on the other hand, I could imagine.)

It's possible that the tool is seeking out comments with terms like black, gay and woman as high potential for being abusive or negative, but that would make Perspective an expensive, overkill wrapper for the equivalent of using Command-F to demonize words that some people might find upsetting.
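To make the "Command-F" comparison concrete, here is a toy sketch of a pure term-matching flagger. It is not Perspective's method; the watch list and the function names are hypothetical and chosen only to show why context-free matching flags plain self-identification.

```python
import re

# Hypothetical watch list; not Perspective's actual vocabulary or method.
WATCHED_TERMS = {"black", "gay", "woman"}


def matched_terms(comment: str) -> list[str]:
    """Return watched terms found in the comment, in order of appearance."""
    words = re.findall(r"[a-z]+", comment.lower())
    return [w for w in words if w in WATCHED_TERMS]


def naive_flag(comment: str) -> bool:
    """Flag a comment if it contains any watched term, ignoring context."""
    return bool(matched_terms(comment))


for comment in ["I am a gay black woman", "I am a man"]:
    print(comment, "->", matched_terms(comment), naive_flag(comment))
```

A matcher like this cannot tell an attack from a self-description, because it never looks at anything except the presence of a word.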

Perspective's reach is significant, too. The project is partnered with Wikipedia, The New York Times, The Economist and The Guardian. Abandon all hope, ye gay black women who enter the comments there.

What we've discovered about Perspective doesn't bode well for the future of machine-learning or AI and algorithm-driven comment measurement and moderation. Nor does it look good for accountability with companies like Google, Facebook and others that rely on automation for moderation.

I think we're all tired of Facebook telling us "it was a bug" and companies saying "it's not our fault" and pointing at systems like Perspective, despite the fact that they're complicit by using it. They should be testing these systems against problems like not being able to identify as a gay black woman in a comment thread without risking your ability to comment.

Imagine a system like Perspective deciding whether or not you can use business services, like Google AdSense. Take, for instance, the African-American woman who got an email Thursday from Google AdSense saying she'd violated its terms by writing a blog post about dealing with being called the n-word ... on her own website.

Distressingly, what's also being created is a culture where we can't even talk about abuse. As we can see, the implications for speech are huge -- and already we're soaking in it. More so when you consider that "competition" for something like Perspective is clearly already at work for social-media networks like Facebook, whose own policies around race and neo-Nazi belief systems are deeply skewed against societies that strive for equality, anti-discrimination and human rights.

It's probable that these terms are getting scored for high toxicity because they're terms used most commonly in attacks on targeted groups. But the instances mentioned in this article are clear failures. They show that the efforts of Silicon Valley's ostensible best and brightest have steered AI meant to "improve the conversation" the way of racist soap dispensers and facial recognition software that can't see black people.

Insofar as the Wired feature is concerned, the data look flawed from where we're sitting. Vermont's top ranking may just mean that there are more gay black women and sex workers there who are OK with talking about it than there are among Sharpsburg, Georgia, commenters. Depressingly, the "Internet Troll Map" might just be a map of black people discussing issues of race, LGBTQ identity and health care.

Which, we hope, is the opposite of what everyone intended.

Google’s comment-ranking system will be a hit with the alt-right


The ability to quantify incivility online, in news and in congressional debates, is of great interest to political scientists. Computational tools for detecting online incivility for English are now fairly accessible and potentially could be applied more broadly. We test the Jigsaw Perspective API for its ability to detect the degree of incivility on a corpus that we developed, consisting of manual annotations of civility in American news. We demonstrate that toxicity models, as exemplified by Perspective, are inadequate for the analysis of incivility in news. We carry out error analysis that points to the need to develop methods to remove spurious correlations between words often mentioned in the news, especially identity descriptors and incivility. Without such improvements, applying Perspective or similar models on news is likely to lead to wrong conclusions, that are not aligned with the human perception of incivility.

1 Introduction

Surveys of public opinion report that most Americans think that the tone and nature of political debate in this country have become more negative and less respectful and that the heated rhetoric by politicians raises the risk for violence (Center, 2019). These observations motivate the need to study (in)civility in political discourse in all spheres of interaction, including online (Ziegele et al., 2018; Jaidka et al., 2019), in congressional debates (Uslaner, 2000) and as presented in news (Meltzer, 2015; Rowe, 2015). Accurate automated means for coding incivility could facilitate this research, and political scientists have already turned to using off-the-shelf computational tools for studying civility (Frimer and Skitka, 2018; Jaidka et al., 2019; Theocharis et al., 2020).

Computational tools, however, have been developed for different purposes, focusing on detecting language in online forums that violates community norms. The goal of these applications is to support human moderators by promptly focusing their attention on likely problematic posts. When studying civility in political discourse, it is primarily of interest to characterize the overall civility of interactions in a given source (i.e., news programs) or domain (i.e., congressional debates), as an average over a period of interest. Applying off-the-shelf tools for toxicity detection is appealingly convenient, but such use has not been validated for any domain, while uses in support of moderation efforts have been validated only for online comments.

We examine the feasibility of quantifying incivility in the news via the Jigsaw Perspective API, which has been trained on over a million online comments rated for toxicity and deployed in several scenarios to support moderator effort online.

We collect human judgments of the (in)civility in one month's worth of three American news programs. We show that while people perceive significant differences between the three programs, Perspective cannot reliably distinguish between the levels of incivility as manifested in these news sources. We then turn to diagnose the reasons for Perspective’s failure. Incivility is more subtle and nuanced than toxicity, which includes identity slurs, profanity, and threats of violence along with other unacceptable incivility. In the range of civil to borderline civil human judgments, Perspective gives noisy predictions that are not indicative of the differences in civility perceived by people. This finding alone suggests that averaging Perspective scores to characterize a source is unlikely to yield meaningful results. To pinpoint some of the sources of the noise in predictions, we characterize individual words as likely triggers of errors in Perspective or sub-error triggers that lead to over-prediction of toxicity.

We discover notable anomalies, where words quite typical in neutral news reporting are confounded with incivility in the news domain. We also discover that the mention of many identities, such as Black, gay, Muslim, feminist, etc., triggers high incivility predictions. This occurs despite the fact that Perspective has been modified specifically to minimize such associations (Dixon et al., 2018a). Our findings echo results from gender debiasing of word representations, where bias is removed as measured by a fixed definition but remains present when probed differently (Gonen and Goldberg, 2019). This common error—treating the mention of identity as evidence for incivility—is problematic when the goal is to analyze American political discourse, which is very much marked by us-vs-them identity framing of discussions.

These findings will serve as a basis for future work in debiasing systems for incivility prediction, while the dataset of incivility in American news will support computational work on this new task.

Our work has implications for researchers of language technology and political science alike. For those developing automated methods for quantifying incivility, we pinpoint two aspects that require improvement in future work: detecting triggers of civility overprediction and devising methods to mitigate the errors in prediction. We propose an approach for a data-driven detection of error triggers; devising mitigation approaches remains an open problem. For those seeking to contrast civility in different sources, we provide compelling evidence that state-of-the-art automated tools are not appropriate for this task. The data and (in)civility ratings would be of use to both groups as test data for future models for civility prediction.

<see full report here:>

9 Conclusion

The work we presented was motivated by the desire to apply off-the-shelf methods for toxicity prediction to analyse civility in American news. These methods were developed to detect rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion in an online forum. To validate the use of Perspective to quantify incivility in the news, we create a new corpus of perceived incivility in the news. On this corpus, we compare human ratings and Perspective predictions. We find that Perspective is not appropriate for such an application, providing misleading conclusions for sources that are mostly civil but for which people perceive a significant overall difference, for example, because one uses sarcasm to express incivility. Perspective is able to detect less subtle differences in levels of incivility, but in a large-scale analysis that relies on Perspective exclusively, it will be impossible to know which differences would reflect human perception and which would not.

We find that Perspective’s inability to differentiate levels of incivility is partly due to the spurious correlations it has formed between certain non-offensive words and incivility. Many of these words are identity-related. Our work will facilitate future research efforts on debiasing of automated predictions. These methods start off with a list of words that the system has to unlearn as associated with a given outcome. In prior work, the lists of words to debias came from informal experimentation with predictions from Perspective. Our work provides a mechanism to create a data-driven list that requires some but little human intervention. It can discover broader classes of bias than people performing ad-hoc experiments can come up with.

A considerable portion of content marked as uncivil by people is not detected as unusual by Perspective. Sarcasm and high-brow register in the delivery of the uncivil language are at play here and will require the development of new systems.

Computational social scientists are well-advised to not use Perspective for studies of incivility in political discourse because it has clear deficiencies for such application.

From Toxicity in Online Comments to Incivility in American News: Proceed with Caution

According to a 2019 Pew Center survey, the majority of respondents believe the tone and nature of political debate in the U.S. have become more negative and less respectful. This observation has motivated scientists to study the civility or lack thereof in political discourse, particularly on broadcast television. Given their ability to parse language at scale, one might assume that AI and machine learning systems could aid in these efforts. But researchers at the University of Pennsylvania find that at least one tool, Jigsaw’s Perspective API, clearly isn’t up to the task.

Incivility is more subtle and nuanced than toxicity, which includes, for example, identity slurs, profanity, and threats of violence. While incivility detection is a well-established task in AI, it’s not well-standardized, with the degree and type of incivility varying across datasets.

The researchers studied Perspective — an AI-powered API for content moderation developed by Jigsaw, the organization working under Google parent company Alphabet to tackle cyberbullying and disinformation — in part because of its widespread use. Media organizations including the New York Times, Vox Media, OpenWeb, and Disqus have adopted it, and it’s now processing 500 million requests daily.

To benchmark Perspective’s ability to spot incivility, the researchers built a corpus containing 51 transcripts from PBS NewsHour, MSNBC’s The Rachel Maddow Show, and Hannity from Fox News. Annotators read through each transcript and identified segments that appeared to be especially uncivil or civil, rating them on a ten-point scale for measures like “polite/rude,” “friendly/hostile,” “cooperative/quarrelsome,” and “calm/agitated.” Scores and selections across annotators were composited to net a civility score for each snippet between 1 and 10, where 1 is the most civil and 10 is the least civil possible.
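As an illustration of how ratings like these could be composited into a single per-snippet score, here is a minimal sketch. The article does not say how the researchers combined scores, so the simple averaging used here is an assumption, the function name `composite_civility` is hypothetical, and the ratings are made up.

```python
from statistics import mean

# Made-up ratings for one snippet: one dict per annotator, 1 = civil, 10 = uncivil.
ratings = [
    {"polite/rude": 2, "friendly/hostile": 3, "cooperative/quarrelsome": 2, "calm/agitated": 1},
    {"polite/rude": 4, "friendly/hostile": 3, "cooperative/quarrelsome": 5, "calm/agitated": 4},
]


def composite_civility(snippet_ratings):
    """Average each annotator's dimensions, then average across annotators."""
    per_annotator = [mean(r.values()) for r in snippet_ratings]
    return mean(per_annotator)


print(composite_civility(ratings))
```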

After running the annotated transcript snippets through the Perspective API, the researchers found that the API wasn’t sensitive enough to detect differences in levels of incivility for ratings lower than six. Perspective scores increased for higher levels of incivility, but annotator and Perspective incivility scores only agreed 51% of the time.
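To show what an agreement figure of this kind measures, here is a minimal sketch that compares human ratings with model scores after turning both into a civil/uncivil label. The article does not say how the 51% was computed, so the binarization and both cut-offs are hypothetical, as are the function name `agreement_rate` and the made-up scores.

```python
def agreement_rate(human_scores, model_scores, human_cut=6, model_cut=0.5):
    """Share of snippets where both sides agree on 'uncivil' vs 'civil'.

    human_scores: 1-10 civility ratings (higher = less civil).
    model_scores: 0-1 toxicity scores.
    Both cut-offs are hypothetical choices for this sketch.
    """
    pairs = list(zip(human_scores, model_scores))
    agree = sum((h >= human_cut) == (m >= model_cut) for h, m in pairs)
    return agree / len(pairs)


# Made-up scores for four snippets.
human = [2, 7, 8, 3]
model = [0.6, 0.7, 0.2, 0.1]
print(agreement_rate(human, model))
```

With two labels, agreement near 50% is roughly what unrelated labels would produce if the classes were evenly split, which is why a figure like 51% reads as a failure rather than a partial success.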

“Overall, for broadcast news, Perspective cannot reproduce the incivility perception of people,” the researchers write. “In addition to the inability to detect sarcasm and snark, there seems to be a problem with over-prediction of the incivility in PBS and FOX [programming].”

In a subsequent test, the researchers sampled thousands of words from each transcript, gathering a total of 2,671, which they fed to Perspective to predict incivility. The results show a problematic trend: Perspective tends to label certain identities as toxic, including “gay,” “African-American,” “Muslim” and “Islam,” “Jew,” “women,” and “feminism” and “feminist.” Moreover, the API erroneously flags words relating to violence, death, and sex (e.g., “die,” “kill,” “shooting,” “prostitution,” “pornography,” “sexual”) even in the absence of incivility, as well as words that in one context could be toxic but in another could refer to a name (e.g., “Dick”).
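The word-level probe described above can be sketched as follows: score each word on its own and keep the ones the model rates highly. This is an illustration under stated assumptions, not the researchers' exact procedure. `find_trigger_words` and `toy_scorer` are hypothetical names, the threshold is arbitrary, and the scores inside `toy_scorer` are invented stand-ins for a real model.

```python
from typing import Callable, Dict, Iterable


def find_trigger_words(
    words: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Score each word on its own and keep those at or above `threshold`."""
    scores = {w: score(w) for w in words}
    return {w: s for w, s in scores.items() if s >= threshold}


def toy_scorer(text: str) -> float:
    """Stand-in for a real toxicity model; the values are invented."""
    invented = {"gay": 0.8, "kill": 0.9, "weather": 0.05}
    return invented.get(text.lower(), 0.1)


print(find_trigger_words(["gay", "weather", "kill", "senate"], toy_scorer))
```

Passing the scorer in as a function keeps the probe independent of any particular model, so the same loop could be pointed at a different classifier.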

Other auditors have claimed that Perspective doesn’t moderate hate and toxic speech equally across groups of people. A study published by researchers at the University of Oxford, the Alan Turing Institute, Utrecht University, and the University of Sheffield found that the Perspective API particularly struggles with denouncements of hate that quote others’ hate speech or make direct references to it. An earlier University of Washington study published in 2019 found that Perspective was more likely to label “Black-aligned English” offensive versus “white-aligned English.”

For its part, Jigsaw recently told VentureBeat that it has made and continues to make progress toward mitigating the biases in its models.

The researchers say that their work highlights the shortcomings of AI when applied to the task of civility detection. While they believe that prejudices against groups like Muslims and African Americans can be lessened through “data-driven” techniques, they expect that correctly classifying edge cases like sarcasm will require the development of new systems.

“The work we presented was motivated by the desire to apply off-the-shelf methods for toxicity prediction to analyse civility in American news. These methods were developed to detect rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion in an online forum,” the coauthors wrote. “We find that Perspective’s inability to differentiate levels of incivility is partly due to the spurious correlations it has formed between certain non-offensive words and incivility. Many of these words are identity-related. Our work will facilitate future research efforts on debiasing of automated predictions.”

AI displays bias and inflexibility in civility detection, study finds

Similar Incidents

By textual similarity

Biased Sentiment Analysis

· 7 reports

Gender Biases in Google Translate

· 10 reports


· 26 reports