Qwen VL Fine-tuning for AI City Challenge 2026 Track 2
The AI City Challenge Track 2 requires fine-tuning vision-language models for video captioning under tight compute constraints (a single V100 GPU with 14GB of memory). General-purpose VLMs such as Qwen2.5-VL achieve strong baseline performance but need task-specific adaptation to reach domain-specific captioning quality. The constrained environment demands careful memory management and optimization to make meaningful fine-tuning possible within the hardware limits.
Fine-tuning a 3-billion-parameter vision-language model on a single 14GB GPU pushes against the physical limits of what the hardware can hold. Standard fine-tuning loads the entire model and its training state into GPU memory simultaneously, an approach that fails outright at this scale. Multi-GPU parallelism, the conventional fallback, was unavailable due to cluster networking constraints. An additional challenge: the video captioning task required processing multiple visual frames per training example, producing sequences that exceeded memory capacity even with memory-efficient training enabled. Conflicts between the cluster's system-wide libraries and the project's required versions added further setup overhead before any training could begin.
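For scale, a rough accounting (a back-of-envelope sketch; actual usage varies with precision settings, optimizer choice, and activation sizes) shows why standard full fine-tuning cannot fit:

```python
# Approximate memory footprint of full fine-tuning for a 3B-parameter
# model with mixed precision and Adam. All figures are rough estimates.
params = 3e9

weights_fp16 = params * 2   # fp16 model weights
grads_fp16   = params * 2   # fp16 gradients
master_fp32  = params * 4   # fp32 master copy of weights
adam_m_fp32  = params * 4   # Adam first-moment estimates
adam_v_fp32  = params * 4   # Adam second-moment estimates

total_gb = (weights_fp16 + grads_fp16 + master_fp32
            + adam_m_fp32 + adam_v_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~48 GB, vs. 14 GB available
```

Even before counting activations, the training state alone is several times the available memory, which is what makes parameter-efficient methods necessary rather than optional here.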
Applied Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA), a technique that trains only a small fraction of the model's parameters rather than all 3 billion, dramatically reducing training memory requirements. Extended the model's context window to handle longer visual sequences and enabled gradient checkpointing, which trades extra computation for reduced memory by recomputing intermediate activations during the backward pass instead of storing them. This combination made fine-tuning feasible on a single constrained GPU. Multi-GPU scaling is configured and ready for deployment once cluster infrastructure allows.
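A minimal sketch of this setup, assuming a recent `transformers` release with Qwen2.5-VL support plus the `peft` library; the rank, alpha, and target modules shown are illustrative defaults, not the project's tuned values:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model in half precision. fp16 rather than bf16 because
# the V100's Volta architecture has no bfloat16 support.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.float16,
)

# LoRA: inject small low-rank update matrices into the attention
# projections and train only those. Hyperparameters are illustrative.
lora_config = LoraConfig(
    r=16,                 # low-rank update dimension
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of 3B

# Trade compute for memory: recompute activations in the backward pass.
model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # needed when combining PEFT with checkpointing
```

With only the LoRA matrices receiving gradients and optimizer state, the frozen base weights are the dominant memory cost, which is what brings the run under the 14GB ceiling.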
Redesigned the training data format to reduce memory consumption per training step. The original dataset packed multiple video frames into each example, producing sequences that saturated GPU memory before training could begin. Reformatting to a single frame per example preserved the task semantics while keeping every training step within the memory budget, a data-engineering decision that unblocked the entire fine-tuning run.
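The reformatting itself is a small transformation. In this hypothetical sketch, the record fields (`frames`, `caption`) and the sample values stand in for the actual dataset schema:

```python
def explode_to_single_frame(record):
    """Turn one multi-frame example into N single-frame examples that
    share the same caption, so each training step stays memory-safe."""
    return [
        {"image": frame, "caption": record["caption"]}
        for frame in record["frames"]
    ]

# Hypothetical multi-frame record, for illustration only.
multi_frame = {
    "frames": ["f0.jpg", "f1.jpg", "f2.jpg"],
    "caption": "A white sedan merges into the right lane.",
}
single_frame_examples = explode_to_single_frame(multi_frame)
print(len(single_frame_examples))  # 3 independent, memory-safe examples
```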
Resolved conflicts between the cluster's system-wide deep learning installation and the project's required library versions, a common constraint in shared GPU environments. Careful dependency isolation and GPU runtime configuration ensured reproducible, conflict-free training execution.
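One concrete guard worth sketching is a preflight check run inside the isolated environment before training launches; the pinned version numbers below are placeholders, not the project's actual requirements:

```python
import importlib.metadata as md
import torch

# Placeholder version pins; the real project would pin its own versions.
EXPECTED = {"torch": "2.1", "transformers": "4.49", "peft": "0.10"}

for pkg, prefix in EXPECTED.items():
    installed = md.version(pkg)
    assert installed.startswith(prefix), (
        f"{pkg}=={installed} does not match pinned {prefix}.x; "
        "is the wrong (system-wide) environment active?"
    )

# Confirm the GPU runtime is visible before committing to a long run.
assert torch.cuda.is_available(), "CUDA runtime not visible to PyTorch"
print("Environment OK:", torch.cuda.get_device_name(0))
```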
Built an automated end-to-end training pipeline covering dataset preprocessing, configuration generation, and training execution as a single reproducible workflow, reducing manual setup for each experimental run and making the pipeline transferable to other GPU-constrained fine-tuning tasks. Achieved stable, convergent training across the full dataset within the 14GB memory budget.
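A skeleton of the pipeline's structure; the stage functions, paths, and `train.py` entry point are illustrative placeholders rather than the project's actual module layout:

```python
import json
import subprocess
from pathlib import Path

def preprocess(raw_dir: Path, out_dir: Path) -> Path:
    """Reformat raw annotations into single-frame training examples."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # ... frame extraction and single-frame reformatting would go here ...
    return out_dir / "train.jsonl"

def write_config(data_file: Path, out_path: Path) -> Path:
    """Generate the training configuration for this run."""
    config = {
        "data": str(data_file),
        "lora_rank": 16,               # illustrative values
        "gradient_checkpointing": True,
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(config, indent=2))
    return out_path

def main():
    data = preprocess(Path("data/raw"), Path("data/processed"))
    config = write_config(data, Path("runs/config.json"))
    # Launch training as a subprocess so the whole experiment is
    # reproducible from one entry point.
    subprocess.run(["python", "train.py", "--config", str(config)], check=True)

if __name__ == "__main__":
    main()
```

Chaining the stages behind a single entry point means a fresh run needs no manual setup, which is what makes each experiment cheap to repeat.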
Establishes a replicable methodology for fine-tuning large vision-language models under GPU memory constraints common in academic and resource-limited research environments. The design decisions — parameter-efficient adaptation, memory-safe data formatting, environment isolation, and automated pipeline orchestration — form a practical guide applicable to any VLM fine-tuning task on shared or consumer-grade hardware. Outputs include ready-to-use adapted model weights and a reproducible training pipeline for downstream captioning applications.