
Warning: GPT-4o Chinese Translation Issues

by Fede Nolasco | Jun 6, 2024

 AI Translation Issues | Chinese Translation | Data Pollution | gpt-4o | MIT Report

In the video ‘Warning GPT-4o: DON’T translate to Chinese (MIT),’ the channel code_your_own_AI discusses problems with GPT-4o’s Chinese translations, citing a report attributed to MIT. The video warns against using GPT-4o to translate business correspondence into Chinese because the Chinese token-training data is heavily polluted with spam and explicit content. According to the video, GPT-4o’s tokenizer has a vocabulary of around 200,000 tokens, roughly 75% of them for English and 25% for other languages. The Chinese portion of that vocabulary is the problem: the findings indicate it was likely sourced from spam and explicit websites, so a high percentage of the Chinese tokens contain inappropriate language, which makes translations unreliable and potentially offensive. The video advises checking every translation against an independent source before sending business communications to Chinese partners. Similar issues are noted for Korean tokens, while Hindi and Bengali are reported to have basic implementations without significant pollution. The video concludes by urging caution and patience while OpenAI works to resolve these issues.
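One way to look at this kind of vocabulary pollution is to list the longest tokens that consist entirely of Chinese characters, since long multi-character tokens tend to reflect phrases that were frequent in the tokenizer's training data. The following is a minimal sketch, not the method used in the video or the report: the helper names `is_cjk` and `longest_chinese_tokens` are hypothetical, the sample entries are made up for illustration, and the check covers only the main CJK Unified Ideographs block.

```python
from typing import Iterable, List


def is_cjk(ch: str) -> bool:
    """Return True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab: Iterable[bytes], top_n: int = 10) -> List[str]:
    """Return the top_n longest tokens made up entirely of CJK ideographs.

    Tokens that are not valid UTF-8 on their own (partial characters)
    are skipped.
    """
    candidates = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # byte-level fragment, not a whole character
        stripped = text.strip()
        if stripped and all(is_cjk(ch) for ch in stripped):
            candidates.append(stripped)
    # Longest first; ties broken by code point order for deterministic output.
    candidates.sort(key=lambda t: (-len(t), t))
    return candidates[:top_n]


# Made-up sample entries for illustration, not taken from any real vocabulary.
sample_vocab = [
    b" hello",
    "你好".encode("utf-8"),
    "中华人民共和国".encode("utf-8"),
    b"\xe4\xbd",  # incomplete UTF-8 sequence, skipped by the helper
]
print(longest_chinese_tokens(sample_vocab, top_n=2))
```

To run this against a real vocabulary, the byte strings could be collected from the `tiktoken` package, assuming it is installed: `tiktoken.get_encoding("o200k_base")` returns the encoding, and `decode_single_token_bytes(i)` returns the bytes for token id `i`. Some ids in the range may be unassigned, so the loop should tolerate lookup errors. Any flagged tokens still need review by someone who reads Chinese.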

 code_your_own_AI


 June 1, 2024

 GPT-4o’s Chinese token-training data is polluted by spam and porn websites
