Incident 956: Alleged Inclusion of 12,000 Live API Keys in LLM Training Data Reportedly Poses Security Risks

Description: A dataset used to train large language models allegedly contained 12,000 live API keys and authentication credentials. Some of these were reportedly still active and allowed unauthorized access. Truffle Security found these secrets in a December 2024 Common Crawl archive, which spans 250 billion web pages. The affected credentials could have been exploited for unauthorized data access, service disruptions, financial fraud, and a variety of other malicious uses.
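As a rough illustration of how hard-coded credentials of this kind can be flagged in crawled text, here is a minimal Python sketch. It is not the method Truffle Security used, which this entry does not describe. The two patterns are assumptions based on commonly documented key formats, and the names `PATTERNS`, `find_candidate_secrets`, and `redact` are hypothetical. A pattern match only marks a candidate. Deciding whether a key is live would need a separate check against the issuing service, which is not shown.

```python
import re

# Illustrative patterns only (assumed formats); real scanners cover many
# more credential types and verify each candidate before reporting it.
PATTERNS = {
    # AWS access key IDs: "AKIA" or "ASIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # Mailchimp API keys: 32 hex characters, then a data-center suffix such as "-us12".
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(text):
    """Return (kind, matched_string) pairs for strings that look like credentials."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


def redact(text):
    """Replace every candidate secret with a placeholder naming its kind."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text
```

A filter along these lines could be run over pages before they enter a training corpus, so that candidate secrets are redacted rather than kept.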

Entities

Alleged: Microsoft, OpenAI and Common Crawl developed an AI system deployed by Microsoft, OpenAI, Common Crawl and Microsoft Azure OpenAI Service, which harmed Microsoft, AWS, Slack, Mailchimp, Google, Intel, Huawei, PayPal, IBM and Tencent.

Alleged implicated AI systems: Common Crawl dataset (December 2024 archive), Microsoft Copilot, Google Gemini, Anthropic Claude, ChatGPT, xAI Grok, DeepSeek and LLMs trained on compromised data

Incident Stats

Incident ID: 956

Report Count

Incident Date: 2025-02-28

Editors: Daniel Atherton

Applied Taxonomies: MIT

MIT Taxonomy Classifications

Machine-Classified

Taxonomy Details

Risk Subdomain: 2.1. Compromise of privacy by obtaining, leaking or correctly inferring sensitive information

Risk Domain: Privacy & Security

Entity: Human

Timing: Pre-deployment

Intent: Unintentional

Incident Reports

12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training

thehackernews.com · 2025

A dataset used to train large language models (LLMs) has been found to contain nearly 12,000 live secrets, which allow for successful authentication.

The findings once again highlight how hard-coded credentials pose a severe security risk t…

Variants

A "variant" is an AI incident similar to a known case—it has the same causes, harms, and AI system. Instead of listing it separately, we group it under the first reported incident. Unlike other incidents, variants do not need to have been reported outside the AIID. Learn more from the research paper.

Similar Incidents

By textual similarity

Biased Sentiment Analysis

Wikipedia Vandalism Prevention Bot Loop

Fake LinkedIn Profiles Created Using GAN Photos
