Common Crawl

Incidents involved as both Developer and Deployer

Incident 956 (1 Report)
Alleged Inclusion of 12,000 Live API Keys in LLM Training Data Reportedly Poses Security Risks

2025-02-28

A dataset used to train large language models allegedly contained 12,000 live API keys and authentication credentials. Some of these were reportedly still active and allowed unauthorized access. Truffle Security found these secrets in a December 2024 Common Crawl archive, which spans 250 billion web pages. The affected credentials could have been exploited for unauthorized data access, service disruptions, financial fraud, and a variety of other malicious uses.

Incident 1044 (2 Reports)
Reported Emergence of 'Vegetative Electron Microscopy' in Scientific Papers Traced to Purported AI Training Data Contamination

2025-04-15

Researchers reportedly traced the appearance of the nonsensical phrase "vegetative electron microscopy" in scientific papers to contamination in AI training data. Testing indicated that large language models such as GPT-3, GPT-4, and Claude 3.5 may reproduce the term. The error allegedly originated from a digitization mistake that merged unrelated words during scanning, and a later translation error between Farsi and English.

Incidents involved as both Developer and Deployer

Incident 956 (1 Report)
Alleged Inclusion of 12,000 Live API Keys in LLM Training Data Reportedly Poses Security Risks

Incidents implicated systems

Incident 1044 (2 Reports)
Reported Emergence of 'Vegetative Electron Microscopy' in Scientific Papers Traced to Purported AI Training Data Contamination

Related Entities

Related entities are other entities involved in the same incident. For example, if this entity is the developer of an incident but another entity is the deployer, the two are marked as related entities.

Microsoft

Incidents involved as both Developer and Deployer

Incidents Harmed By

OpenAI

Incidents involved as both Developer and Deployer

Microsoft Azure OpenAI Service

Incidents involved as Deployer

AWS

Incidents Harmed By

Slack

Incidents Harmed By

Mailchimp

Incidents Harmed By

Google

Incidents Harmed By

Intel

Incidents Harmed By

Huawei

Incidents Harmed By

PayPal

Incidents Harmed By

IBM

Incidents Harmed By

Tencent

Incidents Harmed By

Common Crawl dataset (December 2024 archive)

Incidents implicated systems

Microsoft Copilot

Incidents implicated systems

Google Gemini

Incidents implicated systems

Anthropic Claude

Incidents implicated systems

ChatGPT

Incidents implicated systems

xAI Grok

Incidents implicated systems

DeepSeek

Incidents implicated systems

LLMs trained on compromised data

Incidents implicated systems

Anthropic

Incidents involved as both Developer and Deployer

Researchers

Incidents Harmed By

Incidents involved as Deployer

Scientific authors

Incidents Harmed By

Incidents involved as Deployer

Scientific publishers

Incidents Harmed By

Peer reviewers

Incidents Harmed By

Scholars

Incidents Harmed By

Readers of scientific publications

Incidents Harmed By

Scientific record

Incidents Harmed By

Academic integrity

Incidents Harmed By

GPT-3

Incidents implicated systems

GPT-4

Incidents implicated systems

Claude 3.5

Incidents implicated systems
