Entities

Chinese-speaking users of ChatGPT

Incidents Harmed By

Incident 729 (1 Report)
GPT-4o's Chinese Tokens Reportedly Compromised by Spam and Pornography Due to Inadequate Filtering

2024-05-14

The data used to build the Chinese tokens in OpenAI's GPT-4o was reportedly polluted with spam and pornographic phrases because of inadequate data cleaning. Tianle Cai, a Ph.D. student at Princeton University, found that most of the longest Chinese tokens were irrelevant or inappropriate phrases that originated primarily from spam and pornography websites. The polluted tokens could cause hallucinations and poor performance and could open the door to misuse, undermining the chatbot's reliability and safety measures.
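The kind of inspection described above, ranking a vocabulary's Chinese tokens by length, can be sketched roughly as follows. This is a minimal illustration and not the method used in the report: the function names `is_cjk` and `longest_chinese_tokens` and the toy vocabulary are hypothetical, and the check covers only the basic CJK Unified Ideographs block.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch is in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab, top_n=10):
    """Return the top_n longest tokens made up entirely of CJK ideographs."""
    chinese = [t for t in vocab if t and all(is_cjk(c) for c in t)]
    # Longest first; ties are broken by code point order so output is stable.
    return sorted(chinese, key=lambda t: (-len(t), t))[:top_n]


# Hypothetical toy vocabulary for illustration only (not real GPT-4o tokens).
toy_vocab = ["hello", "的", "中国", "人工智能", "token", "机器学习模型", "数据"]
print(longest_chinese_tokens(toy_vocab, top_n=3))
# → ['机器学习模型', '人工智能', '中国']
```

With a real tokenizer, the vocabulary would first have to be decoded into strings; that step is omitted here. A reviewer would then read the longest entries by hand to judge whether they look like ordinary language or like spam phrases.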


Related Entities