VLMs · Cultural AI · Regional Adaptation · Diffusion Models · Knowledge Distillation · ECCV 2026

GG-EZ: regional adaptation framework for vision-language models in SEA

ECCV 2026 (Under Review) + arXiv preprint 2026 · Lead -- Diffusion Model Arm

General-purpose vision-language and image generation models (LLaVA, InternVL, Phi-Vision, SDXL) exhibit systematic failure modes on culturally situated visual tasks specific to Southeast Asia (SEA): misidentifying local food, misreading regional scripts, and failing to recognize traditional practices. The commercial default response is full retraining on regional data, which is computationally prohibitive and requires access to proprietary data that most regional stakeholders do not have.

The core scientific question: can cultural correctness be recovered through targeted adaptation without full retraining -- and if so, what is the minimal intervention? This requires a testable hypothesis about where cultural knowledge is encoded in model weights and how it can be efficiently modified. Evaluation is doubly hard: cultural correctness is subjective and requires human raters with genuine regional expertise, not automated metrics.

Proposed an anthropogenic regional adaptation framework: the hypothesis that cultural context is learnable from structured regional knowledge sources rather than raw scale. Three adaptation mechanisms investigated: (1) retrieval-augmented prompting with curated regional knowledge bases, (2) region-specific fine-tuning with targeted cultural datasets, (3) evaluation scaffolds for measuring cultural correctness systematically.
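As a rough illustration of mechanism (1), the sketch below prepends retrieved regional context to a question before it is sent to a model. It is a toy under stated assumptions, not the project's pipeline: the knowledge base entries, the word-overlap scoring, and the names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are all hypothetical.

```python
import re

# Hypothetical mini knowledge base; entries are illustrative only and are
# not taken from the project's curated regional sources.
KNOWLEDGE_BASE = [
    "Nasi lemak is a Malaysian dish of rice cooked in coconut milk.",
    "Pho is a Vietnamese noodle soup served with herbs.",
    "Batik is a wax-resist dyeing technique associated with Indonesia.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, knowledge_base, top_k=1):
    """Rank entries by word overlap with the query (a stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(entry)), entry) for entry in knowledge_base]
    scored = [pair for pair in scored if pair[0] > 0]  # drop entries with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def build_prompt(question, knowledge_base, top_k=1):
    """Prepend retrieved context to the question; fall back to the bare question."""
    context = retrieve(question, knowledge_base, top_k)
    if not context:
        return question
    context_block = "\n".join(f"- {entry}" for entry in context)
    return f"Regional context:\n{context_block}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("What is in this plate of nasi lemak?", KNOWLEDGE_BASE))
```

Word overlap also counts common words such as "is" and "of", so a realistic system would likely swap in an embedding-based retriever. The prompt assembly step would stay the same.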

Led the image generation arm of the project. Fine-tuned the image generation model on curated Southeast Asian visual data, then designed a linear model merging strategy, blending the original general-purpose model weights with the regionally adapted version at different mixing ratios, to control the tradeoff between cultural specificity and global image quality. Human evaluation by regional annotators showed improved cultural correctness (1.569 vs. 1.491 for the baseline), while automated benchmarking showed 98%+ retention of global generation quality, supporting the core hypothesis that regional specialization and global capability are not zero-sum.
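The linear merge described above amounts to a per-parameter interpolation, merged = (1 - alpha) * base + alpha * adapted. The sketch below shows that arithmetic on plain Python lists so it runs without any framework. The function name `merge_linear`, the toy weights, and the choice to merge every parameter with a single `alpha` are assumptions; the source does not state which layers were merged or which ratios were used.

```python
def merge_linear(base, adapted, alpha):
    """Interpolate two weight dictionaries: (1 - alpha) * base + alpha * adapted.

    base, adapted: dicts mapping parameter names to flat lists of floats
    alpha: mixing ratio in [0, 1]; 0 returns the base weights, 1 the adapted ones
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("both checkpoints must contain the same parameter names")

    merged = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged


if __name__ == "__main__":
    # Toy weights for illustration only; real checkpoints hold tensors.
    base = {"layer.weight": [0.0, 2.0]}
    adapted = {"layer.weight": [1.0, 4.0]}
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, merge_linear(base, adapted, alpha))
```

With PyTorch checkpoints the same loop would run over the two `state_dict()` mappings, with tensors in place of lists. Sweeping `alpha` then yields one candidate model per mixing ratio to evaluate.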

Evaluation used two tracks: a human protocol in which regional annotators scored cultural correctness, and automated DPG-Bench scoring for global generation quality. Both metrics were tracked per adaptation variant to build a Pareto frontier of regional gain vs. global degradation.
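One way to extract such a frontier from per-variant scores is sketched below. It assumes each variant carries a cultural correctness score and a global retention figure, with higher being better for both. The function name `pareto_frontier` and the variant names and numbers in the demo are placeholders, not results from the project.

```python
def pareto_frontier(variants):
    """Return the variants that no other variant dominates.

    variants: list of (name, cultural_score, global_retention) tuples,
              where higher is better on both axes.
    A variant is dominated if another one is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, cultural, retention in variants:
        dominated = any(
            other_c >= cultural
            and other_r >= retention
            and (other_c > cultural or other_r > retention)
            for _, other_c, other_r in variants
        )
        if not dominated:
            frontier.append((name, cultural, retention))
    # Sort by cultural score so the tradeoff curve reads left to right.
    return sorted(frontier, key=lambda variant: variant[1])


if __name__ == "__main__":
    # Placeholder scores for illustration only; not measured values.
    variants = [
        ("variant_a", 1.0, 100.0),
        ("variant_b", 2.0, 99.0),
        ("variant_c", 1.5, 98.0),  # dominated by variant_b on both axes
        ("variant_d", 3.0, 95.0),
    ]
    for row in pareto_frontier(variants):
        print(row)
```

Variants left off the frontier can be discarded, since another variant matches or beats them on both metrics.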

Provides a replicable, computationally accessible playbook for adapting general-purpose VLMs to underrepresented geographies. The 98%+ global quality retention result is critical: it directly counters the common objection that regional adaptation necessarily degrades global model utility, which strengthens the business case for SEA localization at commercial AI labs.

1.569 cultural correctness (adapted) · 1.491 cultural correctness (baseline) · 98.0% global benchmark retention
SDXL · LLaVA · InternVL · Phi-Vision · PyTorch · diffusers · DPG-Bench · human evaluation · RAG scaffolds