Incident 21: Tougher Turing Test Exposes Chatbots’ Stupidity

Description: The 2016 Winograd Schema Challenge showed that even the most successful AI systems entered into the Challenge were correct only 3 percentage points more often than random chance.
Alleged: Researchers developed and deployed an AI system, which harmed Researchers.

Suggested citation format

Yampolskiy, Roman. (2016-07-14) Incident Number 21. in McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID
Report Count
Incident Date
Sean McGregor



CSET Taxonomy Classifications

Taxonomy Details

Full Description

The 2016 Winograd Schema Challenge highlighted shortcomings in artificially intelligent systems' ability to understand context. The Challenge presents ambiguous sentences and asks AI systems to resolve them. In the Winograd Schema Challenge, the two winning entries were correct 48% of the time, while random chance was correct 45% of the time. Quan Liu of the University of Science and Technology of China (partnering with University of Toronto and National Research Council of Canada) and Nicos Isaak of the Open University of Cyprus presented the most successful systems. It is notable that Google and Facebook did not participate.

Short Description

The 2016 Winograd Schema Challenge showed that even the most successful AI systems entered into the Challenge were correct only 3 percentage points more often than random chance.



AI System Description

Artificially intelligent systems meant to understand ambiguous English sentences.

Sector of Deployment

Professional, scientific and technical activities

Relevant AI functions

Perception, Cognition, Action


New York, NY

Named Entities

Winograd Schema Challenge, University of Science and Technology of China, Quan Liu, University of Toronto, National Research Council of Canada, Nicos Isaak, Open University of Cyprus

Technology Purveyor

Quan Liu, Nicos Isaak

Beginning Date


Ending Date


Near Miss




Lives Lost


Incident Reports

User: Siri, call me an ambulance.

Siri: Okay, from now on I’ll call you “an ambulance.”

Apple fixed this error shortly after its virtual assistant was first released in 2011. But a new contest shows that computers still lack the common sense required to avoid such embarrassing mix-ups.

The results of the contest were presented at an academic conference in New York this week, and they provide some measure of how much work needs to be done to make computers truly intelligent.

Illustration by Max Bode

The Winograd Schema Challenge asks computers to make sense of sentences that are ambiguous but usually simple for humans to parse. Disambiguating Winograd Schema sentences requires some common-sense understanding. In the sentence “The city councilmen refused the demonstrators a permit because they feared violence,” it is logically unclear who the word “they” refers to, although humans understand because of the broader context.

The programs entered into the challenge were only a little better than random at choosing the correct meaning of sentences. The best two entrants were correct 48 percent of the time, compared to 45 percent if the answers are chosen at random. To be eligible to claim the grand prize of $25,000, entrants would need to achieve at least 90 percent accuracy. The joint best entries came from Quan Liu, a researcher at the University of Science and Technology of China, and Nicos Isaak, a researcher from the Open University of Cyprus.
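To make the comparison with chance concrete, here is a minimal Python sketch of how such a contest could be scored. It is an illustration, not the organizers' scoring code: the `SchemaItem` class, both functions, and the second test item are hypothetical. It also assumes, which the article does not state, that some questions offer more than two candidate referents, which is one way a random-guess baseline can fall below 50 percent.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SchemaItem:
    """One pronoun-resolution question (hypothetical structure)."""
    sentence: str
    pronoun: str
    candidates: tuple  # possible referents of the pronoun
    answer: int        # index of the correct referent in `candidates`


def accuracy(items, predictions):
    """Fraction of items whose predicted index matches the answer."""
    correct = sum(1 for item, p in zip(items, predictions) if p == item.answer)
    return Fraction(correct, len(items))


def chance_baseline(items):
    """Expected accuracy of picking a candidate uniformly at random."""
    return sum(Fraction(1, len(item.candidates)) for item in items) / len(items)


items = [
    SchemaItem(
        sentence=("The city councilmen refused the demonstrators a permit "
                  "because they feared violence."),
        pronoun="they",
        candidates=("the city councilmen", "the demonstrators"),
        answer=0,
    ),
    # Hypothetical placeholder item with three candidates (an assumption).
    SchemaItem(
        sentence="(placeholder sentence)",
        pronoun="it",
        candidates=("A", "B", "C"),
        answer=2,
    ),
]

print(chance_baseline(items))    # expected accuracy of random guessing
print(accuracy(items, [0, 1]))   # accuracy of one example set of guesses
```

With only two-candidate questions the baseline would be exactly one half; each question with more candidates pulls it lower.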

“It’s unsurprising that machines were barely better than chance,” says Gary Marcus, a research psychologist at New York University and an advisor to the contest. That’s because giving computers common-sense knowledge is notoriously difficult. Hand-coding knowledge is impossibly time-consuming, and it isn’t simple for computers to learn about the real world by performing statistical analysis of text. Most of the entrants in the Winograd Schema Challenge try to use some combination of hand-coded grammar understanding and a knowledge base of facts.

Marcus, who is also the cofounder of a new AI startup, Geometric Intelligence, says it’s notable that Google and Facebook did not take part in the event, even though researchers at these companies have suggested they are making major progress in natural language understanding. “It could’ve been that those guys waltzed into this room and got a hundred percent and said ‘hah!’” he says. “But that would’ve astounded me.”

The contest does not only serve as a measure of progress in AI. It also shows how hard it will be to build more intuitive and graceful chatbots, and to train computers to extract more information from written text.

Researchers at Google, Facebook, Amazon, and Microsoft are turning their attention to language. They are using the latest machine learning techniques, especially “deep learning” neural networks, to develop smarter, more intuitive chatbots and personal assistants (see “Teaching Machines to Understand Us”). As a matter of fact, with chatbots and voice assistants becoming more common, and with dramatic progress in areas like image and speech recognition, you might think that machines were getting pretty good at understanding language.

One of the two first-place entries did, in fact, use a cutting-edge machine learning approach. Liu’s group, which included researchers from York University in Toronto and the National Research Council of Canada, used deep learning to train a computer to recognize the relationship between different events, such as “playing basketball” and “winning” or “getting injured,” from thousands of texts.

“I was delighted to see deep learning used,” says Leora Morgenstern, a senior scientist at Leidos Corporation, a technology consulting firm, and one of the organizers of the challenge.

Liu’s team claims that after fixing a problem with the way its system parsed the contest’s questions, it is almost 60 percent accurate. Morgenstern cautions, however, that even if these claims were confirmed, the accuracy would still be far worse than a human's.

Winograd Schema sentences were first highlighted as a way to gauge machine comprehension by Hector Levesque, an artificial-intelligence researcher at the University of Toronto. They are named after Terry Winograd, a pioneer in the field and a professor at Stanford University who built one of the first conversational computer programs.

The challenge was proposed in 2014 as an improvement on the Turing Test. Alan Turing, a forefather of computing and artificial intelligence who in the 1950s pondered whether machines might one day think as humans do, suggested a simple way of testing a machine’s intelligence. His idea was for a machine to try, in a text conversation, to fool a person into thinking that they were conversing with another human.

The problem with the Turing Test is that it’s often easy for a program to fool a person using simple tricks and evasions. But a program cannot parse Winograd Schema or other ambiguous sentences without some form of general


Similar Incidents

By textual similarity



· 26 reports

Gender Biases in Google Translate

· 10 reports