Open Source Contribution · Kubernetes · Go · Inference Scheduling · Gateway API · Envoy · P/D Disaggregation · KV Cache

Contributing to llm-d/inference-scheduler -- Kubernetes LLM Inference Scheduling

Open Source -- llm-d/inference-scheduler (Planned 2026) · Contributor

Production LLM inference at scale requires intelligent request routing across multiple model serving backends. The llm-d/inference-scheduler project provides the Endpoint Picker Plugin (EPP) -- a Kubernetes-native scheduling component that routes inference requests to vLLM backends based on KV cache state, prefill locality, and load. As LLM inference grows in commercial importance, contributing to this project creates leverage across the open-source AI infrastructure ecosystem.
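As a rough illustration of that routing decision, the sketch below folds KV cache overlap, prefill locality, and backend load into a single ranking and picks the best endpoint. The struct fields, weights, and pod names are hypothetical placeholders, not the EPP's actual data model or scoring formula.

```go
// Hypothetical sketch of cache- and load-aware endpoint ranking.
// Field names and weights are illustrative, not the EPP's real model.
package main

import "fmt"

// Endpoint holds per-backend signals a scheduler might consider.
type Endpoint struct {
	Name         string
	KVCacheHit   float64 // fraction of the prompt already cached (0..1)
	PrefillLocal bool    // prefix was prefilled on this backend
	QueueDepth   int     // outstanding requests on the backend
}

// score folds the signals into one value; the weights are placeholders.
func score(e Endpoint) float64 {
	s := 2.0 * e.KVCacheHit
	if e.PrefillLocal {
		s += 1.0
	}
	return s - 0.1*float64(e.QueueDepth)
}

func main() {
	pods := []Endpoint{
		{Name: "vllm-0", KVCacheHit: 0.8, PrefillLocal: true, QueueDepth: 6},
		{Name: "vllm-1", KVCacheHit: 0.1, PrefillLocal: false, QueueDepth: 1},
	}
	best := pods[0]
	for _, e := range pods[1:] {
		if score(e) > score(best) {
			best = e
		}
	}
	fmt.Println("route to:", best.Name) // vllm-0: cache reuse outweighs its longer queue
}
```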

Contributing to infrastructure-level Go projects requires understanding multiple interacting subsystems: Gateway API CRDs and inference extensions, Envoy's ext-proc callback mechanism for request interception, vLLM's serving architecture with P/D (prefill/decode) disaggregation, KV cache-aware scheduling policies, and Kubernetes deployment patterns (Kustomize, Helm, RBAC). The codebase uses Go 1.24+ with distributed tracing, Prometheus metrics, and a modular plugin architecture (filters and scorers) that requires careful API design to maintain extensibility.
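To make the filter/scorer split concrete, here is a minimal sketch of what such a plugin pipeline can look like in Go. The interface names, signatures, and types are assumptions for illustration only and do not reflect the project's actual plugin API; the point is that keeping filters and scorers behind small interfaces lets new scheduling policies be added without touching the request-processing path.

```go
// Illustrative filter/scorer plugin pipeline. Interface names and
// signatures are assumptions, not the inference-scheduler's real API.
package scheduling

import "context"

// Request describes an inference request as seen by the scheduler.
type Request struct {
	Model  string
	Prompt string
}

// Pod is a candidate vLLM backend with scraped metrics.
type Pod struct {
	Name    string
	Metrics map[string]float64
}

// Filter drops candidates that cannot serve the request
// (wrong model, saturated queue, missing adapter, ...).
type Filter interface {
	Name() string
	Filter(ctx context.Context, req *Request, pods []Pod) []Pod
}

// Scorer ranks the surviving candidates; higher is better.
type Scorer interface {
	Name() string
	Score(ctx context.Context, req *Request, pod Pod) float64
}

// Pick runs all filters, sums scorer outputs per pod, and returns the
// best candidate; ok is false if nothing survives filtering.
func Pick(ctx context.Context, req *Request, pods []Pod, filters []Filter, scorers []Scorer) (best Pod, ok bool) {
	for _, f := range filters {
		pods = f.Filter(ctx, req, pods)
	}
	if len(pods) == 0 {
		return Pod{}, false
	}
	bestScore := 0.0
	for i, p := range pods {
		total := 0.0
		for _, s := range scorers {
			total += s.Score(ctx, req, p)
		}
		if i == 0 || total > bestScore {
			best, bestScore = p, total
		}
	}
	return best, true
}
```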

Planning to contribute to the inference scheduling framework by: (1) Understanding how the scheduling layer intercepts and routes inference requests to the appropriate GPU backend, (2) Exploring disaggregation modes — strategies for splitting the prefill and decode phases of LLM inference across nodes — to identify scheduling optimization opportunities, (3) Contributing to scheduler plugin development for improved load balancing and KV cache reuse, (4) Identifying latency optimization opportunities in the request processing pipeline.
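For item (3), one concrete direction is prefix-cache-aware scoring: estimating how many leading prompt blocks an endpoint already holds by hashing fixed-size token blocks cumulatively, loosely mirroring vLLM-style prefix caching. In the sketch below the block size, hash choice, and per-endpoint cache index are all assumptions made for illustration, not details taken from the project.

```go
// Hypothetical prefix-cache-overlap estimate for KV cache reuse.
// Block size, hashing, and the cache index are illustrative assumptions.
package main

import (
	"fmt"
	"hash/fnv"
)

const blockSize = 16 // tokens per cache block (fixed-size, vLLM-style)

// blockHashes hashes each complete block cumulatively, so a block's
// identity depends on everything before it, as in prefix caching.
func blockHashes(tokens []int) []uint64 {
	var hashes []uint64
	h := fnv.New64a()
	for i, tok := range tokens {
		fmt.Fprintf(h, "%d,", tok)
		if (i+1)%blockSize == 0 {
			hashes = append(hashes, h.Sum64())
		}
	}
	return hashes
}

// cacheOverlap counts leading prompt blocks already present on an endpoint.
func cacheOverlap(prompt []int, cached map[uint64]bool) int {
	n := 0
	for _, bh := range blockHashes(prompt) {
		if !cached[bh] {
			break
		}
		n++
	}
	return n
}

func main() {
	prompt := make([]int, 64) // 4 blocks of a dummy tokenised prompt
	// Pretend this endpoint has the first two blocks cached.
	cached := map[uint64]bool{}
	for _, bh := range blockHashes(prompt[:32]) {
		cached[bh] = true
	}
	fmt.Println("reusable blocks:", cacheOverlap(prompt, cached)) // 2
}
```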

Deepen expertise in: Kubernetes operator patterns, Envoy proxy extensibility, distributed systems for ML inference, Go concurrency patterns, and production ML infrastructure design. The project provides exposure to real-world inference serving challenges at the intersection of cloud infrastructure and applied ML.

Setting up a local development environment matching the project's cluster configuration. Reviewing the codebase architecture and existing test patterns (unit, integration, and end-to-end). Familiarising myself with the disaggregation deployment modes and the local simulation environment for iterative development before contributing upstream.
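As part of getting familiar with the test patterns, the snippet below shows the kind of table-driven unit test idiomatic in Go projects; scoreEndpoint and its weights are hypothetical stand-ins for a scorer helper, not functions from the codebase.

```go
// Table-driven unit test sketch. scoreEndpoint is a hypothetical
// stand-in for a scorer helper, not code from the project.
package scheduling

import (
	"math"
	"testing"
)

// scoreEndpoint is the unit under test in this sketch.
func scoreEndpoint(kvHit float64, queueDepth int) float64 {
	return 2.0*kvHit - 0.1*float64(queueDepth)
}

func TestScoreEndpoint(t *testing.T) {
	cases := []struct {
		name       string
		kvHit      float64
		queueDepth int
		want       float64
	}{
		{"warm cache, idle backend", 1.0, 0, 2.0},
		{"cold cache, busy backend", 0.0, 10, -1.0},
		{"partial cache, moderate load", 0.5, 5, 0.5},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got := scoreEndpoint(tc.kvHit, tc.queueDepth)
			if math.Abs(got-tc.want) > 1e-9 {
				t.Errorf("scoreEndpoint(%v, %d) = %v, want %v",
					tc.kvHit, tc.queueDepth, got, tc.want)
			}
		})
	}
}
```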

Contributing to llm-d/inference-scheduler provides the opportunity to work on foundational AI infrastructure used by teams deploying LLM inference at scale. The project's architecture (Gateway API extension, Envoy integration, KV cache awareness) represents the current state-of-the-art in production LLM serving. Contributions benefit the broader open-source AI ecosystem and strengthen expertise in distributed ML systems.

Go 1.24+ · Gateway API · Envoy ext-proc · Kustomize · Helm · vLLM · P/D Disaggregation · KV Cache · Prometheus · Grafana · Distributed tracing · Unit tests · Integration tests · E2E (Kind clusters) · TinyLlama-1.1B · Qwen3-VL-2B