Incident 399: Meta AI's Scientific Paper Generator Reportedly Produced Inaccurate and Harmful Content

Description: Meta AI trained and hosted a scientific paper generator that sometimes produced bad science and blocked queries on topics and groups likely to produce offensive or harmful content.
Alleged: Meta AI, Meta, and Facebook developed and deployed an AI system, which harmed Meta AI, Meta, Facebook, and Minority Groups.

Suggested citation format

Hundt, Andrew. (2022-11-15) Incident Number 399. In McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID: 399
Report Count: 4
Incident Date: 2022-11-15
Editors: Sean McGregor

Incidents Reports

On Tuesday, Meta AI unveiled a demo of Galactica, a large language model designed to "store, combine and reason about scientific knowledge." While intended to accelerate writing scientific literature, adversarial users running tests found it could also generate realistic nonsense. After several days of ethical criticism, Meta took the demo offline, reports MIT Technology Review.

Large language models (LLMs), such as OpenAI's GPT-3, learn to write text by studying millions of examples and understanding the statistical relationships between words. As a result, they can author convincing-sounding documents, but those works can also be riddled with falsehoods and potentially harmful stereotypes. Some critics call LLMs "stochastic parrots" for their ability to convincingly spit out text without understanding its meaning.
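To make the idea of "statistical relationships between words" concrete, here is a deliberately tiny, purely illustrative Python sketch of a bigram model. It bears no resemblance to the transformer architecture behind GPT-3 or Galactica, but it shows how fluent-looking text can be produced from word-co-occurrence statistics alone, with no notion of whether the result is true.

# A toy "stochastic parrot": a bigram model that only learns which word tends
# to follow which, then samples continuations from those counts.
# Illustrative only; real LLMs are vastly larger, but the spirit of the
# objective (predict the next token from statistics of the training text) is similar.
import random
from collections import defaultdict

corpus = (
    "the speed of light in vacuum is a universal constant . "
    "the speed of sound in air depends on temperature . "
    "bears in space is not a real field of study ."
).split()

# Count which word follows which in the training text.
bigrams = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev].append(nxt)

def generate(start="the", length=12, seed=0):
    # Sample a continuation word by word from the observed bigram statistics.
    random.seed(seed)
    words = [start]
    for _ in range(length):
        followers = bigrams.get(words[-1])
        if not followers:
            break
        words.append(random.choice(followers))
    return " ".join(words)

print(generate())  # plausible-sounding strings, with no model of whether they are true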

Enter Galactica, an LLM aimed at writing scientific literature. Its authors trained Galactica on "a large and curated corpus of humanity’s scientific knowledge," including over 48 million papers, textbooks and lecture notes, scientific websites, and encyclopedias. According to Galactica's paper, Meta AI researchers believed this purported high-quality data would lead to high-quality output.

A screenshot of Meta AI's Galactica website before the demo ended.

Starting on Tuesday, visitors to the Galactica website could type in prompts to generate documents such as literature reviews, wiki articles, lecture notes, and answers to questions, according to examples provided by the website. The site presented the model as "a new interface to access and manipulate what we know about the universe."

While some people found the demo promising and useful, others soon discovered that anyone could type in racist or potentially offensive prompts, generating authoritative-sounding content on those topics just as easily. For example, someone used it to author a wiki entry about a fictional research paper titled "The benefits of eating crushed glass."

Even when Galactica's output wasn't offensive to social norms, the model could contradict well-understood scientific facts, spitting out inaccuracies such as incorrect dates or animal names that require deep knowledge of the subject to catch.

I asked #Galactica about some things I know about and I'm troubled. In all cases, it was wrong or biased but sounded right and authoritative. I think it's dangerous. Here are a few of my experiments and my analysis of my concerns. (1/9)

— Michael Black (@Michael_J_Black) November 17, 2022

As a result, Meta pulled the Galactica demo Thursday. Afterward, Meta's Chief AI Scientist Yann LeCun tweeted, "Galactica demo is off line for now. It's no longer possible to have some fun by casually misusing it. Happy?"

The episode recalls a common ethical dilemma with AI: When it comes to potentially harmful generative models, is it up to the general public to use them responsibly, or for the publishers of the models to prevent misuse?

Where the industry practice falls between those two extremes will likely vary between cultures and as deep learning models mature. Ultimately, government regulation may end up playing a large role in shaping the answer.

New Meta AI demo writes racist and inaccurate scientific literature, gets pulled

On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.

Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.

However, Meta and other companies working on large language models, including Google, have failed to take it seriously.

Galactica is a large language model for science, trained on 48 million examples of scientific articles, websites, textbooks, lecture notes, and encyclopedias. Meta promoted its model as a shortcut for researchers and students. In the company’s words, Galactica “can summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.”

But the shiny veneer wore through fast. Like all language models, Galactica is a mindless bot that cannot tell fact from fiction. Within hours, scientists were sharing its biased and incorrect results on social media.

“I am both astounded and unsurprised by this new effort,” says Chirag Shah at the University of Washington, who studies search technologies. “When it comes to demoing these things, they look so fantastic, magical, and intelligent. But people still don’t seem to grasp that in principle such things can’t work the way we hype them up to.”

Asked for a statement on why it had removed the demo, Meta pointed MIT Technology Review to a tweet that says: “Thank you everyone for trying the Galactica model demo. We appreciate the feedback we have received so far from the community, and have paused the demo for now. Our models are available for researchers who want to learn more about the work and reproduce results in the paper.”

A fundamental problem with Galactica is that it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. People found that it made up fake papers (sometimes attributing them to real authors), and generated wiki articles about the history of bears in space as readily as ones about protein complexes and the speed of light. It’s easy to spot fiction when it involves space bears, but harder with a subject users may not know much about.

Many scientists pushed back hard. Michael Black, director at the Max Planck Institute for Intelligent Systems in Germany, who works on deep learning, tweeted: “In all cases, it was wrong or biased but sounded right and authoritative. I think it’s dangerous.”

Even more positive opinions came with clear caveats: “Excited to see where this is headed!” tweeted Miles Cranmer, an astrophysicist at Princeton. “You should never keep the output verbatim or trust it. Basically, treat it like an advanced Google search of (sketchy) secondary sources!”

Galactica also has problematic gaps in what it can handle. When asked to generate text on certain topics, such as “racism” and “AIDS,” the model responded with: “Sorry, your query didn’t pass our content filters. Try again and keep in mind this is a scientific language model.”
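Meta has not published how the demo's content filter worked. Purely as a hypothetical illustration (the function, the blocklist contents, and the behavior below are assumptions, not Meta's implementation), a blunt keyword blocklist produces exactly this kind of refusal, rejecting legitimate scientific queries about marginalized groups alongside abusive prompts.

# Hypothetical sketch of a naive keyword-based content filter.
# NOT Meta's actual implementation, which was never published; it only
# illustrates why blunt blocklists refuse legitimate scientific queries.
BLOCKLIST = {"racism", "aids"}  # assumed entries, based on topics users reported as blocked

REFUSAL = ("Sorry, your query didn't pass our content filters. "
           "Try again and keep in mind this is a scientific language model.")

def filter_query(query: str) -> str:
    # Reject the query if any word matches the blocklist, regardless of intent.
    if set(query.lower().split()) & BLOCKLIST:
        return REFUSAL
    return "OK: query passed to the model"

print(filter_query("Epidemiology of AIDS in sub-Saharan Africa"))  # refused
print(filter_query("Structure of protein complexes"))              # allowed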

The Meta team behind Galactica argues that language models are better than search engines. “We believe this will be the next interface for how humans access scientific knowledge,” the researchers write.

This is because language models can “potentially store, combine, and reason about” information. But that “potentially” is crucial. It’s a coded admission that language models cannot yet do all these things. And they may never be able to.

“Language models are not really knowledgeable beyond their ability to capture patterns of strings of words and spit them out in a probabilistic manner,” says Shah. “It gives a false sense of intelligence.”

Gary Marcus, a cognitive scientist at New York University and a vocal critic of deep learning, gave his view in a Substack post titled “A Few Words About Bullshit,” saying that the ability of large language models to mimic human-written text is nothing more than “a superlative feat of statistics.”

And yet Meta is not the only company championing the idea that language models could replace search engines. For the last couple of years, Google has been promoting language models, such as LaMDA, as a way to look up information.

It’s a tantalizing idea. But suggesting that the human-like text such models generate will always contain trustworthy information, as Meta appeared to do in its promotion of Galactica, is reckless and irresponsible. It was an unforced error.

And it wasn’t just the fault of Meta’s marketing team. Yann LeCun, a Turing Award winner and Meta’s chief scientist, defended Galactica to the end. On the day the model was released, LeCun tweeted: “Type a text and Galactica will generate a paper with relevant references, formulas, and everything.” Three days later, he tweeted: “Galactica demo is off line for now. It’s no longer possible to have some fun by casually misusing it. Happy?”

It's not quite Meta's Tay moment. Recall that in 2016, Microsoft launched a chatbot called Tay on Twitter—then shut it down 16 hours later when Twitter users turned it into a racist, homophobic sexbot. But Meta’s handling of Galactica smacks of the same naivete.

“Big tech companies keep doing this—and mark my words, they will not stop—because they can,” says Shah. “And they feel like they must—otherwise someone else might. They think that this is the future of information access, even if nobody asked for that future.”

Correction: A previous version of this story stated that Google has been promoting the language model PaLM as a way to look up information for a couple of years. The language model we meant to refer to is LaMDA.

Why Meta’s latest large language model survived only three days online

🧵 Some thoughts about the recent release of Galactica by @MetaAI (everything here is my personal opinion) 👀

Let's start with the positive / What went well

[1] The model was released and Open Source*

Contrary to the trend of very interesting research being closed or accessible only through paid APIs, open-sourcing the models and building on top of existing OS tools means evaluation can be done reliably, transparently, and in the open

[2] There was a demo with the release**

Demos allow for a much wider audience to understand how models work. By having a demo with the release, a much more diverse audience can explore the model, identify points of failure, new biases, and more.

[3] Technically impressive!

Big kudos where it's deserved. The model is technically impressive, with strong performance on different benchmarks, 50% citation accuracy, generation of LaTeX and SMILES formulas, and more.
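For context, SMILES is a plain-text notation for molecular structures. As a rough illustration (using the third-party RDKit library, which is unrelated to Galactica, with an example molecule chosen here), a generated SMILES string can at least be checked for syntactic validity, although passing that check says nothing about whether the annotation is factually correct.

# Requires RDKit (pip install rdkit); unrelated to Galactica, shown only to
# illustrate what a SMILES string is and how little "looks valid" guarantees.
from rdkit import Chem

candidate = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example SMILES string

mol = Chem.MolFromSmiles(candidate)  # returns None if the string is not parsable SMILES
if mol is None:
    print("Not a syntactically valid SMILES string")
else:
    print(f"Parsed a molecule with {mol.GetNumAtoms()} heavy atoms")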

/ What went wrong

[1] Hype in announcements, mixing end-product with research

The announcement and page talk about "solving" information overload in science and that this can be used to write scientific code.

This communication style is very misleading and will cause misuse

[2] Safety Filter in demo erasing communities

Although I imagine this was well-intentioned, the (non-transparent) safety filter removed content about queer theory and AIDS

https://twitter.com/willie_agnew/status/1592829238889283585

OpenAI has been doing the same with Dalle 2 and received backlash as well

The safety filter:
- Censors content about minorities, further marginalizing people
- Contradicts the idea of storing and reasoning about scientific knowledge

See more at

https://twitter.com/mmitchell_ai/status/1593351384573038592?s=20&t=8W0DbEqaln7hDKPY_xGhYQ

[3] Use cases were unclear, undocumented, or misleading

The limitations stated on the site and in the paper are quite sparse and somewhat unclear. The paper says, "we would not advise using it for tasks that require this type of knowledge as this is not the intended use-case."

There is also a somewhat hidden model card in github.com/paperswithcode…

But I find again that the documentation around limitations, biases, and use cases is too limited, given how powerful the model is

[4] Demo

Although having a demo was nice, it could have done a better job in:
- Adding clearer disclaimers
- Changing the UI to make it look less like real papers
- Having a mechanism to identify such generated content
- Adding a way to flag toxic and erroneous content

[5] Related to the previous point, there was little opportunity for the community to discuss and report issues other than on Twitter.

At @huggingface we learned that creating a space for public, open and transparent discussions on models is essential

https://twitter.com/mmitchell_ai/status/1583516905276837888

As such, users have mechanisms to report outputs generated by the demo, explore the code used to create it, and discuss the work with the community openly and transparently.

So TL;DR, what could be done better:
- More explicit use cases and limitations
- Better documentation of the model
- Consider OpenRAIL licenses, which dive into use cases much more than classical software licenses

Thread by @osanseviero on Thread Reader App
