Vision-Language Models · Benchmark · Dataset · Crowdsourcing · Cultural AI · ACL 2025

SEA-VL: multicultural vision-language benchmark for Southeast Asia

ACL 2025 (Main Conference, Long Paper) · Core Contributor -- Data Pipeline Lead

State-of-the-art Vision-Language Models (GPT-4V, Gemini 1.5, Claude 3) are trained predominantly on English-centric, Western-origin image-text data. Southeast Asia -- 11 countries, 700M+ people, hundreds of languages, and visually distinct cultural practices -- is systematically underrepresented in both training corpora and evaluation benchmarks. There is no rigorous, community-built benchmark that can measure VLM capability on culturally situated reasoning tasks specific to the region.

Creating a culturally valid benchmark requires more than web scraping: image-question pairs must encode genuine cultural knowledge (traditional food, festivals, architecture, script signage, social practices) that is invisible to models trained on global web data. Standard crowdsourcing platforms such as MTurk lack the regional annotator coverage and cultural competence required. Automated generation with GPT-4 or Gemini introduces circularity -- benchmarking models against data partially generated by similar models. Language diversity (11 SEA languages, including low-resource scripts) further complicates quality control.

Co-led a distributed crowdsourcing operation with 100+ community annotators spanning Indonesia, Vietnam, the Philippines, Thailand, Myanmar, Malaysia, Cambodia, Laos, Singapore, Timor-Leste, and Brunei. Annotators submitted culturally grounded image-question-answer triplets drawn from lived experience, not web retrieval.

Designed and implemented the data quality pipeline: annotation guideline authoring, a multi-stage review protocol combining automated filtering with human-in-the-loop validation, large-scale image deduplication across the full candidate corpus (comparing multiple similarity methods against a human-validated reference set to select the most reliable approach; see the sketch below), language validation, and cultural accuracy review. Contributed to 10,000+ final image-question pairs across 11 languages and diverse cultural domains.
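
The dedup method selection step can be illustrated with a small evaluation harness. This is a minimal sketch, not the paper's exact pipeline: the labeled-pair file `dedup_reference_pairs.csv`, the helper names, and the threshold grid are hypothetical, and it compares only a pHash baseline against CLIP-ViT cosine similarity -- SigLIP or Nomic Embed Vision encoders would slot in the same way.

```python
# Sketch: choosing a dedup method by scoring candidates against a
# reviewer-labeled reference set of (image_a, image_b, is_duplicate) pairs.
# File name, threshold grid, and helpers are illustrative assumptions.
import csv

import imagehash
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer


def load_pairs(path):
    """Yield (path_a, path_b, label) rows from a reviewer-labeled CSV."""
    with open(path, newline="") as f:
        for a, b, label in csv.reader(f):
            yield a, b, int(label)


def phash_score(a, b):
    """Similarity from perceptual-hash Hamming distance (64-bit pHash)."""
    dist = imagehash.phash(Image.open(a)) - imagehash.phash(Image.open(b))
    return 1.0 - dist / 64.0


clip = SentenceTransformer("clip-ViT-B-32")  # swap in SigLIP etc. here


def clip_score(a, b):
    """Cosine similarity between CLIP image embeddings."""
    va, vb = clip.encode([Image.open(a), Image.open(b)])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))


def best_f1(scores, labels):
    """Sweep similarity thresholds; return (best_f1, best_threshold)."""
    best = (0.0, 0.0)
    for t in np.linspace(0.5, 1.0, 51):
        pred = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum(l and not p for p, l in zip(pred, labels))
        if tp:
            best = max(best, (2 * tp / (2 * tp + fp + fn), t))
    return best


pairs = list(load_pairs("dedup_reference_pairs.csv"))  # hypothetical file
labels = [label for _, _, label in pairs]
for name, score_fn in [("pHash", phash_score), ("CLIP-ViT", clip_score)]:
    scores = [score_fn(a, b) for a, b, _ in pairs]
    f1, t = best_f1(scores, labels)
    print(f"{name}: best F1 = {f1:.3f} at threshold {t:.2f}")
```

The winning method and its tuned threshold are then applied across the full candidate corpus; the human-validated reference set keeps the choice grounded in reviewer judgment rather than in any single similarity metric.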

The benchmark exposes systematic blind spots in GPT-4V, Gemini, and Claude on culturally situated visual reasoning -- quantifying the gap that motivated the research program. Results directly inform model localization priorities for AI labs targeting SEA markets.

Establishes the first rigorous, community-built VLM benchmark for Southeast Asia, now used by AI labs to identify and prioritize model localization gaps for the 700M+ person SEA market. Accepted at ACL 2025 (Main Conference) -- a top-tier NLP venue -- validating community-science data collection as publishable at the field's highest standards.

GPT-4V · Gemini 1.5 · Claude 3 · LLaVA · InternVL · pHash · CLIP-ViT · SigLIP · Nomic Embed Vision · 100+ annotators, 11 countries · 10,000+ image-question pairs