Chinese-speaking users of ChatGPT

Incidents Harmed By

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

2024-05-14

The training data behind GPT-4o's Chinese tokens was reportedly polluted by spam and pornographic phrases because of inadequate data cleaning. Tianle Cai, a Ph.D. student at Princeton University, identified that most of the longest Chinese tokens were irrelevant and inappropriate phrases, primarily originating from spam and pornography websites. The polluted tokens could lead to hallucinations, poor performance, and potential misuse, undermining the chatbot's reliability and safety measures.
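
The kind of inspection described above, ranking a vocabulary's Chinese tokens by length, can be sketched roughly as follows. This is a minimal illustration, not the researcher's actual method. The function names, the input format (an iterable of raw token byte strings), and the toy vocabulary are all assumptions made for the example. The check covers only the basic CJK Unified Ideographs block.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch falls in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab, top_n=10):
    """Return the top_n longest tokens made up entirely of CJK ideographs.

    vocab is an iterable of raw token byte strings (a hypothetical input
    format chosen for this sketch).
    """
    candidates = set()
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # partial multi-byte sequence, not a readable string
        stripped = text.strip()  # ignore leading/trailing whitespace
        if stripped and all(is_cjk(c) for c in stripped):
            candidates.add(stripped)
    # Longest first; ties broken by code point order for a stable result
    return sorted(candidates, key=lambda t: (-len(t), t))[:top_n]


# Toy vocabulary invented for illustration only (not real GPT-4o tokens)
toy_vocab = [
    b"hello",
    "中国".encode("utf-8"),
    "你好世界".encode("utf-8"),
    b"\xe4\xb8",              # truncated UTF-8 sequence, skipped
    " 的".encode("utf-8"),
]
print(longest_chinese_tokens(toy_vocab, top_n=2))
```

On a real vocabulary, the resulting list would then be reviewed by a reader of Chinese to judge whether the longest entries are ordinary words or spam-like phrases.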

Related Entities

OpenAI

Incidents involved as both Developer and Deployer

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

Incidents Harmed By

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

GPT-4o

Incidents involved as Deployer

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

Researchers

Incidents Harmed By

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

OpenAI users

Incidents Harmed By

Incident 729
1 Report
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering