
vLLM devs compile forgotten AI image encoder for 3% speedup: Because fixing three bugs and waiting 13 seconds is the real innovation in trillion-dollar AI
vLLM developers have extended torch.compile coverage in the Qwen3-VL model from the decoder, where it was already applied, to the Vision Encoder, yielding a 3.4% end-to-end throughput increase on NVIDIA H200 GPUs. Compiling the encoder fuses operators and captures CUDA graphs, which cuts Python overhead between kernel launches; the effort also surfaced three previously unknown bugs, which the team fixed along the way. Measured in isolation, the compiled encoder runs 4.4% faster, with the largest gains coming from the VisionBlocks component, and the authors expect bigger wins for larger models, video workloads, and bandwidth-constrained hardware. The optimization can be enabled with a single flag, and the result underscores how much performance still hides in under-tested code paths of widely deployed models.