GG-EZ: Regional Adaptation Framework for Vision-Language Models in SEA

Situation

General-purpose vision-language and image generation models (LLaVA, InternVL, Phi-Vision, SDXL) exhibit systematic failure modes on culturally situated visual tasks specific to Southeast Asia: misidentifying local food, misreading regional scripts, and failing to recognize traditional practices. The commercial default response is full retraining on regional data, which is computationally prohibitive and requires proprietary data access that most regional stakeholders do not have.

The Challenge

The core scientific question is whether cultural correctness can be recovered through targeted adaptation without full retraining, and if so, what the minimal intervention is. This requires a testable hypothesis about where cultural knowledge is encoded in model weights and how it can be efficiently modified. Evaluation is doubly hard: cultural correctness is subjective, and it demands human raters with genuine regional expertise that automated metrics cannot provide.

The Approach

Proposed an anthropogenic regional adaptation framework built on the hypothesis that cultural context is learnable from structured regional knowledge sources rather than raw scale. Three mechanisms were investigated: (1) retrieval-augmented prompting with curated regional knowledge bases, (2) region-specific fine-tuning on targeted cultural datasets, and (3) evaluation scaffolds for measuring cultural correctness systematically.
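To make mechanism (1) concrete, here is a minimal sketch of retrieval-augmented prompting. It is not the project's retriever: the token-overlap scoring, the function names `retrieve` and `build_prompt`, and the three knowledge-base entries are all assumptions chosen for illustration. A real system would more likely use embedding similarity over a much larger curated knowledge base.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, knowledge_base: List[str], top_k: int = 1) -> List[str]:
    """Return up to top_k entries ranked by token overlap with the query.

    Token overlap is a stand-in for a real retriever (an assumption).
    Entries sharing no tokens with the query are dropped.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: len(query_tokens & tokenize(entry)),
        reverse=True,
    )
    return [entry for entry in ranked[:top_k] if query_tokens & tokenize(entry)]


def build_prompt(question: str, knowledge_base: List[str], top_k: int = 1) -> str:
    """Prepend retrieved regional context to the question, if any was found."""
    context = retrieve(question, knowledge_base, top_k)
    if not context:
        return question
    bullets = "\n".join(f"- {entry}" for entry in context)
    return f"Regional context:\n{bullets}\n\nQuestion: {question}"


# Hypothetical knowledge-base entries, for illustration only.
EXAMPLE_KB = [
    "Nasi lemak is rice cooked in coconut milk, often served with sambal.",
    "Batik is a wax-resist dyeing technique.",
    "Pho is a Vietnamese noodle soup.",
]
```

Calling `build_prompt("What is nasi lemak served with?", EXAMPLE_KB)` would prepend the nasi lemak entry to the question before it is passed to the model, leaving the model weights untouched.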

Led the image generation arm of the project. Fine-tuned the image generation model on curated Southeast Asian visual data, then designed a linear model merging strategy that blends the original general-purpose weights with the regionally adapted weights at different mixing ratios to control the tradeoff between cultural specificity and global image quality. Human evaluation by regional annotators showed an improvement in cultural correctness (1.569 vs. a 1.491 baseline), while automated benchmarking showed 98%+ retention of global generation quality, supporting the core hypothesis that regional specialization and global capability are not zero-sum.
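Linear model merging of this kind can be sketched as a per-parameter interpolation between two checkpoints. The function name `linear_merge` and the mixing-ratio symbol `alpha` are assumptions for illustration, not the project's code. The sketch works on any values that support scalar multiplication and addition, so plain floats are used here; in practice the same expression would be applied to the floating-point tensors of two PyTorch `state_dict()` mappings with identical keys.

```python
from typing import Any, Dict


def linear_merge(
    base: Dict[str, Any], adapted: Dict[str, Any], alpha: float
) -> Dict[str, Any]:
    """Interpolate two weight mappings parameter by parameter.

    merged = (1 - alpha) * base + alpha * adapted

    alpha = 0.0 returns the general-purpose weights unchanged;
    alpha = 1.0 returns the regionally adapted weights unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    return {
        name: (1.0 - alpha) * base[name] + alpha * adapted[name]
        for name in base
    }


def sweep(base: Dict[str, Any], adapted: Dict[str, Any], alphas) -> Dict[float, Dict[str, Any]]:
    """Build one merged checkpoint per mixing ratio, for later evaluation."""
    return {alpha: linear_merge(base, adapted, alpha) for alpha in alphas}
```

Sweeping `alpha` over a small grid yields one candidate model per ratio, each of which can then be scored for cultural correctness and global quality. Which ratios were actually evaluated is not stated here, so any grid is a placeholder.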

Evaluation paired a human protocol, in which regional annotators scored cultural correctness, with automated DPG-Bench scoring of global generation quality. Both metrics were tracked per adaptation variant to build a Pareto frontier of regional gain vs. global degradation.
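Selecting the Pareto frontier from per-variant scores can be sketched as follows. The `Variant` record and its field names are hypothetical; the only assumption about the metrics is that higher is better for both the cultural correctness score and the global quality retention.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    name: str
    cultural_score: float    # human-rated cultural correctness, higher is better
    global_retention: float  # percent of baseline global quality, higher is better


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    at_least_as_good = (
        a.cultural_score >= b.cultural_score
        and a.global_retention >= b.global_retention
    )
    strictly_better = (
        a.cultural_score > b.cultural_score
        or a.global_retention > b.global_retention
    )
    return at_least_as_good and strictly_better


def pareto_frontier(variants: List[Variant]) -> List[Variant]:
    """Return the variants not dominated by any other, sorted by cultural score."""
    frontier = [
        v for v in variants
        if not any(dominates(other, v) for other in variants)
    ]
    return sorted(frontier, key=lambda v: v.cultural_score)
```

With the two figures reported here, the baseline (1.491, taking its retention as 100% by definition, which is an assumption) and the adapted model (1.569, 98.0%) would both sit on the frontier, since neither dominates the other. Intermediate mixing ratios would fill in the curve between them.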

Impact

Provides a replicable, computationally accessible playbook for adapting general-purpose VLMs to underrepresented geographies. The 98%+ global quality retention result is critical: it counters the common objection that regional adaptation necessarily degrades global model utility, which strengthens the business case for SEA localization at commercial AI labs.

Cultural correctness (adapted): 1.569
Cultural correctness (baseline): 1.491
Global benchmark retention: 98.0%

Tech Stack

SDXL, LLaVA, InternVL, Phi-Vision, PyTorch, diffusers, DPG-Bench, human evaluation, RAG scaffolds
