CommonLID: language identification on noisy web data
Language Identification (LID) is an upstream dependency for virtually every multilingual NLP pipeline: machine translation routing, content moderation, corpus construction, and search indexing all assume that LID is solved. Published accuracy figures for leading LID systems (fastText LangID, GlotLID, OpenLID) frequently exceed 95% on clean benchmark data, creating false confidence in the field.
Real-world web data from Southeast Asia systematically violates the assumptions of existing LID benchmarks: text is code-switched (mixing Bahasa Indonesia with English and Javanese in a single sentence), written in romanized orthographies that diverge from standardized script forms, and contains phonetic spellings and social media abbreviations. Published LID systems trained on curated parallel corpora degrade significantly on this distribution -- a degradation that is not measurable on existing benchmarks because they do not sample from realistic web data.
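The three noise phenomena above can be made concrete with a few illustrative records. The sentences, language codes, and tag names below are hypothetical stand-ins, not samples from the actual benchmark:

```python
# Hypothetical, hand-written examples of the noise phenomena described
# above; tags and sentences are illustrative only.
SAMPLES = [
    # Code-switching: Bahasa Indonesia mixed with English in one sentence
    {"text": "Aku udah submit assignment-nya tadi malam, wish me luck ya",
     "gold": "ind", "phenomenon": "code-switched"},
    # Romanized orthography: informal Latin-script Javanese rather than
    # a standardized script form
    {"text": "aku ra ngerti kowe ngomong opo",
     "gold": "jav", "phenomenon": "romanized"},
    # Orthographic variation: phonetic spellings and social-media
    # abbreviations (vowels dropped, words shortened)
    {"text": "gpp kok, sy jg blm mkn dr td pagi",
     "gold": "ind", "phenomenon": "orthographic-variation"},
]

# Every record carries a phenomenon tag so accuracy can later be
# reported per stratum instead of as one aggregate number.
for s in SAMPLES:
    assert {"text", "gold", "phenomenon"} <= set(s)
```

Tagging each example with the phenomenon it exhibits is what lets a benchmark report where a system fails, not just how often.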
Re-benchmarked fastText LangID, GlotLID, and OpenLID on CommonCrawl-sourced text reflecting realistic SEA web data distributions -- code-switched content, romanized scripts, and orthographic variation. Identified critical accuracy degradation patterns, with distinct system-specific failure modes across language families, and proposed a more rigorous evaluation protocol for future LID systems claiming low-resource SEA coverage.
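The core of the evaluation protocol can be sketched as per-phenomenon (stratified) accuracy rather than a single aggregate score. This is a minimal sketch, not the actual benchmark harness: the `predict` callable stands in for any LID system (a fastText, GlotLID, or OpenLID wrapper), and the toy samples and the deliberately brittle identifier below are invented for illustration:

```python
from collections import defaultdict

def stratified_accuracy(samples, predict):
    """Accuracy broken down by noise phenomenon.

    `samples` is an iterable of (text, gold_lang, phenomenon) triples;
    `predict` is any LID callable mapping text -> language code.
    Reporting one score per stratum exposes degradation on noisy strata
    that a single benchmark-wide accuracy number hides.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, gold, phenomenon in samples:
        total[phenomenon] += 1
        if predict(text) == gold:
            correct[phenomenon] += 1
    return {p: correct[p] / total[p] for p in total}

# Toy demonstration: a deliberately brittle "identifier" that only
# handles clean text, standing in for a real model.
toy = [
    ("clean Indonesian sentence", "ind", "standard"),
    ("code-switched ind/eng sentence", "ind", "code-switched"),
    ("romanized jav text", "jav", "romanized"),
]
naive = lambda text: "ind" if "Indonesian" in text else "eng"
print(stratified_accuracy(toy, naive))
# The clean stratum scores 1.0 while both noisy strata score 0.0 --
# yet a naive aggregate would report a single blended number.
```

The point of the protocol is exactly this decomposition: a system can post a high headline accuracy while failing almost completely on the strata that dominate real SEA web traffic.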
Challenges published accuracy claims on SEA languages and motivates re-evaluation of data pipelines that rely on LID as a preprocessing step -- including the CommonCrawl-based training corpora used by major LLMs.
Directly affects NLP pipeline reliability for downstream tasks including content moderation, machine translation routing, and LLM training corpus construction in SEA. The rigorous evaluation protocol serves as a methodological reference for future low-resource LID benchmarking work across any underrepresented language family.