Do Multilingual VLMs Abstain Under Cross-Modal Conflict in Low-Resource Languages?

Situation

Vision-Language Models (VLMs) must arbitrate between visual evidence and a textual claim, a safety-critical decision when the two contradict. A model may safely abstain under cross-modal conflict in English, but hundreds of millions of people interact with multilingual VLMs in low-resource languages such as Hindi and Telugu, where both training data and red-teaming are scarce. The harm in high-stakes settings is concrete. An X-ray that depicts a fracture, paired with a report asserting no abnormality, can lead a model to silently adopt the false caption over its own correct perception. Whether cross-modal arbitration survives the shift to a low-resource language was largely unmeasured.

The Challenge

Raw text-bias conflates two distinct failures: a model can follow a false caption, or it may never have perceived the correct answer in the first place. A single aggregate accuracy number hides exactly the cells that matter most, such as the model that trusts its eyes in English yet capitulates to false text in Telugu. Worse, abstention itself can collapse into a failure mode: a model that defaults to "unable to answer" looks cautious while sacrificing all visual utility. Existing mitigations rely on expensive fine-tuning or contrastive decoding, and almost all prior conflict evaluation is English-only.

The Approach

Built and open-sourced four multilingual cross-modal conflict datasets (English, Hindi, Telugu) spanning natural scenes, physics renders, remote sensing, and rendered 3D objects, with high-quality translations generated via NLLB-200-600M and Claude and verified by humans. Designed a paired perception-control benchmark in which every item is scored both with and without the counterfactual caption. The pairing yields an "override gap" and a normalized "override share" that isolate genuine caption-driven override from baseline perceptual inability, giving a perception-corrected metric that re-ranks model safety.
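
The paired scoring can be illustrated with a small sketch. The text above does not give the exact formulas, so the definitions here are one plausible reading and should be treated as assumptions: the override gap is taken as accuracy without the caption minus accuracy with it, and the override share as the fraction of items the model perceives correctly on its own that flip to the caption's answer under conflict. The `PairedItem` schema and its field names are hypothetical, and the data is a toy example, not benchmark output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedItem:
    """One benchmark item scored twice. Field names are hypothetical."""
    correct_without_caption: bool  # perception control: image only
    outcome_with_caption: str      # "image", "caption", or "abstain"


def override_metrics(items):
    """Compute raw text-bias, override gap, and override share.

    Assumed definitions (not taken from the source):
      raw_text_bias  = share of all items where the caption's answer wins
      override_gap   = accuracy without caption - accuracy with caption
      override_share = among items perceived correctly without the caption,
                       the share where the caption's answer wins
    """
    n = len(items)
    if n == 0:
        raise ValueError("need at least one item")
    perceived = [it for it in items if it.correct_without_caption]
    acc_control = len(perceived) / n
    acc_conflict = sum(it.outcome_with_caption == "image" for it in items) / n
    raw_text_bias = sum(it.outcome_with_caption == "caption" for it in items) / n
    overridden = sum(it.outcome_with_caption == "caption" for it in perceived)
    share = overridden / len(perceived) if perceived else float("nan")
    return {
        "raw_text_bias": raw_text_bias,
        "override_gap": acc_control - acc_conflict,
        "override_share": share,
    }


# Toy data: an abstain-heavy model versus a model that perceives everything.
quiet = [PairedItem(True, "caption")] * 2 + [PairedItem(False, "abstain")] * 8
vocal = [PairedItem(True, "caption")] * 4 + [PairedItem(True, "image")] * 6

quiet_metrics = override_metrics(quiet)
vocal_metrics = override_metrics(vocal)
```

On this toy data the abstain-heavy model looks safer on raw text-bias (0.2 versus 0.4) but has the higher override share (1.0 versus 0.4). That is the kind of ranking flip a perception-corrected metric is meant to surface.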

Evaluated nine open multilingual VLMs across three languages and four datasets (10,260 conflict evaluations), identifying two distinct, language-dependent failure modes, text capture (the caption wins) and abstention collapse (the model refuses everything). The override-share metric exposed "quietly unsafe" abstainers. Qwen2.5-VL-7B posts the lowest raw text-bias yet by far the highest override share (0.81), flipping the naive safety ranking.

Cached last-prompt-token residual-stream activations per layer and fit L2-regularized logistic probes (5-fold CV) to decode image-faithful vs. text-following outcomes. The conflict signal stays strongly linearly decodable even where behaviour collapses (peak accuracy 0.97 in English and still 0.92 in Telugu). This establishes a dissociation: behaviour collapses along the resource axis while the representation does not.
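
A per-layer probe of this kind might look roughly like the sketch below, which uses scikit-learn. It is not the project's code: the function name `probe_accuracy_by_layer`, the array layout, and the regularization strength `C=1.0` are assumptions, and the activations are synthetic stand-ins rather than cached VLM states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def probe_accuracy_by_layer(acts, labels, C=1.0, folds=5):
    """Mean cross-validated probe accuracy for each layer.

    acts:   array of shape (n_layers, n_items, d_model), one activation
            vector per item per layer (assumed layout)
    labels: array of shape (n_items,), 1 = image-faithful, 0 = text-following
    """
    scores = []
    for layer_acts in acts:
        # LogisticRegression applies an L2 penalty by default; C is the
        # inverse regularization strength.
        probe = LogisticRegression(C=C, max_iter=1000)
        fold_scores = cross_val_score(
            probe, layer_acts, labels, cv=folds, scoring="accuracy"
        )
        scores.append(fold_scores.mean())
    return np.array(scores)


# Synthetic stand-in: the label signal grows with depth along one direction.
rng = np.random.default_rng(0)
n_layers, n_items, d_model = 4, 200, 16
labels = rng.integers(0, 2, size=n_items)
direction = rng.normal(size=d_model)
acts = rng.normal(size=(n_layers, n_items, d_model))
for layer in range(n_layers):
    acts[layer] += layer * np.outer(2 * labels - 1, direction)

scores = probe_accuracy_by_layer(acts, labels)
best_layer = int(scores.argmax())
```

With real data, `acts` would hold the cached last-prompt-token activations, and the peak of `scores` across layers would correspond to the peak accuracy reported per language.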

Exploited that dissociation with cross-lingual contrastive activation steering by fitting an abstain-vs-assert direction in English, injecting it mid-depth through forward hooks, sweeping its strength under an off-target perception guard, and transferring it to Hindi and Telugu. For abstention-amenable models a single English-fit vector drives steered text-override to zero in every target language with a transfer gap of 0.00. The mitigation runs entirely at inference time, with no labels in the target language. Code (MIT) and all four datasets are released on GitHub and the Hugging Face Hub.
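
The steering mechanics can be sketched with PyTorch forward hooks on a toy model. This is a minimal illustration under stated assumptions, not the released pipeline. The direction is fit here as a normalized difference of mean activations, which is a common choice but not confirmed by the text above. The helper names (`layer_output`, `fit_direction`, `add_steering`) are hypothetical, the stack of linear layers stands in for a real VLM, and the strength sweep and perception guard are omitted.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
# Toy stand-in for a transformer: four layers acting on a hidden state.
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(4)])
mid_layer = model[len(model) // 2]  # "mid-depth" injection point


def layer_output(model, layer, x):
    """Run x through the model and return the output of `layer`."""
    captured = {}

    def hook(module, inputs, output):
        captured["out"] = output.detach()  # returns None: output unchanged

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(x)
    finally:
        handle.remove()
    return captured["out"]


def fit_direction(model, layer, abstain_x, assert_x):
    """Unit-norm difference of mean activations (assumed fitting method)."""
    diff = (layer_output(model, layer, abstain_x).mean(dim=0)
            - layer_output(model, layer, assert_x).mean(dim=0))
    return diff / diff.norm()


def add_steering(layer, direction, strength):
    """Add strength * direction to the layer output on every forward pass."""
    def hook(module, inputs, output):
        # A returned value replaces the layer's output. Real decoder layers
        # often return tuples, which would need unpacking here.
        return output + strength * direction

    return layer.register_forward_hook(hook)


# Fit on "source-language" contrast sets, then apply to unseen inputs.
abstain_x = torch.randn(32, d_model) + 1.0
assert_x = torch.randn(32, d_model) - 1.0
direction = fit_direction(model, mid_layer, abstain_x, assert_x)

target_x = torch.randn(5, d_model)
with torch.no_grad():
    baseline = model(target_x)
handle = add_steering(mid_layer, direction, strength=4.0)
with torch.no_grad():
    steered = model(target_x)
handle.remove()  # removing the hook restores the unsteered model
with torch.no_grad():
    restored = model(target_x)
```

Because the hook is attached and removed at inference time, no weights change. In the cross-lingual setting described above, the direction would be fit on English prompts and the same hook applied unchanged to Hindi and Telugu inputs.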

Impact

Delivers the first multilingual, perception-controlled audit of cross-modal conflict in VLMs, showing that genuine visual grounding evaporates precisely in the low-resource languages that English-only evaluation never sees. The override-gap design gives deployers and regulators a concrete auditing instrument that names the most caption-suggestible model-by-language cells, and the cross-lingual steering result demonstrates a cheap, inference-time fix. An honesty direction estimated in a high-resource language transfers down the resource axis. Open-sourced benchmark, steering pipeline, and four datasets enable reproducible follow-up work on multilingual VLM safety.

9 models evaluated; 3 languages; 4 datasets; 10,260 conflict evaluations; 0.92 probe accuracy (Telugu); 0.81 max override share; 0.0 transfer gap (abstention)

Tech Stack

InternVL3-2B/8B, Qwen2.5-VL-3B/7B, Qwen3-VL-8B, LLaVA-OneVision-7B, GLM-4.1V-9B-Thinking, Granite-Vision-3.3-2B, SEA-LION-v4-8B-VL, Contrastive activation steering, Linear probes, Logistic regression, Forward hooks, NLLB-200-600M, Claude, Gemini, COCO-Counterfactuals, Pendulum, Remote-Sensing-VQA, 3D-Objects, English, Hindi, Telugu
