
A few months ago, the engineering team at Mistral AI set out to investigate a suspected memory leak in vLLM, the open-source inference engine they use for model serving. The team initially expected the source of the leak to be easy to pin down, probably somewhere in the higher layers of the codebase. As the investigation progressed, however, it became clear that the issue ran much deeper.
The memory leak first surfaced during pre-production testing of disaggregated serving with the Mistral Medium 3.1 model and graph compilation enabled. Under production-like load, memory usage grew steadily at roughly 400 MB per minute. There were no crashes or obvious errors, but at that rate the process would eventually run out of memory after extended operation.
To track down the leak, the team took a systematic approach, starting with high-level Python tools and working down to kernel-level tracing. Reproducing the problem proved difficult: across different models and settings, the leak only manifested in a specific Prefill/Decode (P/D) disaggregated setup with NIXL.
P/D disaggregated serving splits the processing of a query into two phases executed by different instances. A "prefill request" is first sent to a prefill vLLM instance, which computes the KV cache for the request. Once that completes, the router forwards the KV cache metadata along with a "decode request" to a decode vLLM instance, which fetches the KV cache and generates tokens. In this setup, the leak was observed primarily on the decode side, suggesting that the KV cache transfer through NIXL was the likely root cause.
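For intuition, here is a minimal sketch of that two-phase flow. The function names (`send_prefill_request`, `send_decode_request`) and the `KVCacheHandle` type are hypothetical placeholders for illustration, not vLLM, NIXL, or router APIs.

```python
from dataclasses import dataclass


@dataclass
class KVCacheHandle:
    """Hypothetical metadata describing where the prefill instance stored the KV cache."""
    request_id: str
    remote_engine: str
    block_ids: list[int]


def send_prefill_request(prompt: str) -> KVCacheHandle:
    # Hypothetical: the prefill vLLM instance computes the KV cache and returns
    # metadata that lets a decode instance fetch it over NIXL.
    return KVCacheHandle(request_id="req-0", remote_engine="prefill-0", block_ids=[0, 1, 2])


def send_decode_request(prompt: str, kv: KVCacheHandle) -> str:
    # Hypothetical: the decode vLLM instance pulls the KV cache referenced by `kv`
    # (this transfer is where the leak was suspected) and generates tokens.
    return f"<generated tokens for {kv.request_id}>"


def route(prompt: str) -> str:
    kv = send_prefill_request(prompt)       # phase 1: prefill instance
    return send_decode_request(prompt, kv)  # phase 2: decode instance
```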
The investigation continued with Python memory profilers such as Memray and Guppy3, but neither produced definitive evidence of a leak. Attempts to attach GDB caused crashes, and the complexity and load of the vLLM setup made tools like Valgrind too slow to be practical.
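As an illustration of the kind of high-level probing used at this stage, a Guppy3 heap snapshot can be taken in-process to see which Python objects accumulate. This is a generic usage sketch, not the team's actual instrumentation.

```python
from guppy import hpy  # pip install guppy3

heap_inspector = hpy()
heap_inspector.setrelheap()   # ignore everything allocated before this point

# ... exercise the server / run requests for a while ...

snapshot = heap_inspector.heap()
print(snapshot)               # Python objects allocated since setrelheap(), grouped by type
```

If the leak lived in Python objects, repeated snapshots would show some type growing without bound; here they did not, which is consistent with the leak sitting below the Python layer.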
Recognizing they would need more robust tooling, the team first reached out to the vLLM team to verify that the issue was reproducible elsewhere. Other users had indeed encountered similar behavior, which warranted a deeper investigation.
To get better visibility into allocations, the team turned to Heaptrack, a memory profiler that records memory allocation and free events. They ran vLLM with Heaptrack's library preloaded and then processed the recorded data with heaptrack_interpret to visualize it.
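The preload approach amounts to launching the server with Heaptrack's shared library injected via LD_PRELOAD. A rough sketch follows; the library path and the vLLM launch command are placeholders that depend on your installation.

```python
import os
import subprocess

# Placeholder path to Heaptrack's preload library; it varies by distribution/install.
HEAPTRACK_PRELOAD = "/usr/lib/heaptrack/libheaptrack_preload.so"

env = dict(os.environ)
env["LD_PRELOAD"] = HEAPTRACK_PRELOAD

# Placeholder launch command for the vLLM server; with the library preloaded,
# Heaptrack records the process's allocations and frees into a data file that
# can later be analyzed with heaptrack_interpret.
subprocess.run(
    ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", "<model>"],
    env=env,
    check=True,
)
```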
Heaptrack's visualizations were informative, but they quickly showed that the heap itself was not leaking: heap memory tracked by Heaptrack stayed stable. What did differ between two snapshots taken at different times was the peak resident set size (RSS), which kept growing. The leak was therefore happening outside standard heap memory management, pushing the investigators toward allocations made directly at the system level.
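One way to see this kind of divergence independently of the profiler is to watch the process's resident set size directly from /proc. A simple monitor along these lines (the PID and interval are illustrative) makes growth outside the tracked heap obvious.

```python
import sys
import time


def rss_kib(pid: int) -> int:
    """Return the process's resident set size in KiB, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("VmRSS not found")


if __name__ == "__main__":
    pid = int(sys.argv[1])
    previous = rss_kib(pid)
    while True:
        time.sleep(60)
        current = rss_kib(pid)
        print(f"RSS: {current / 1024:.1f} MiB (delta {(current - previous) / 1024:+.1f} MiB/min)")
        previous = current
```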
The investigation then turned to the Linux process memory layout to see where that resident memory was going. By comparing pmap output for the process over time, the team found that only a few specific anonymous memory mappings were growing, a pattern consistent with mremap system call behavior rather than ordinary heap growth.
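A rough Python equivalent of diffing pmap output is to parse /proc/<pid>/maps and track the size of each anonymous mapping over time; mappings that appear, disappear, or change size between samples stand out quickly. This is a generic sketch, not the team's tooling.

```python
import sys
import time


def anonymous_mappings(pid: int) -> dict[str, int]:
    """Map 'start-end' address ranges of unnamed anonymous mappings to their size in bytes."""
    sizes: dict[str, int] = {}
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split()
            # Lines without a pathname field (5 fields) are anonymous mappings.
            if len(fields) == 5:
                addr_range = fields[0]
                start, end = (int(x, 16) for x in addr_range.split("-"))
                sizes[addr_range] = end - start
    return sizes


if __name__ == "__main__":
    pid = int(sys.argv[1])
    before = anonymous_mappings(pid)
    time.sleep(60)
    after = anonymous_mappings(pid)
    print(f"anonymous total: {sum(before.values()) / 2**20:.1f} MiB -> "
          f"{sum(after.values()) / 2**20:.1f} MiB")
    # With mremap the address range itself can change, so growth often shows up
    # as old mappings vanishing and larger ones appearing in their place.
    for rng in sorted(set(after) - set(before)):
        print(f"new mapping {rng}: {after[rng] / 2**20:.1f} MiB")
```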
With the leak narrowed down to raw mmap or mremap calls, bpftrace became the tool of choice for tracing these system calls in real time. By watching the calls made by the live decode instance, the team was able to confirm where the problematic allocations originated.
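For illustration, a bpftrace program that sums the bytes requested by mmap and mremap per process name and prints the totals periodically might look like the following (run as root; exact tracepoint argument names can vary slightly across kernel and bpftrace versions). Wrapping it in Python simply keeps all the examples in one language.

```python
import subprocess

# bpftrace program: aggregate bytes requested via mmap/mremap per process name
# and dump the counters every 10 seconds.
BPFTRACE_PROGRAM = r"""
tracepoint:syscalls:sys_enter_mmap   { @mmap_bytes[comm]   = sum(args->len); }
tracepoint:syscalls:sys_enter_mremap { @mremap_bytes[comm] = sum(args->new_len); }
interval:s:10 {
    print(@mmap_bytes);
    print(@mremap_bytes);
    clear(@mmap_bytes);
    clear(@mremap_bytes);
}
"""

# Requires root privileges and bpftrace installed on the host.
subprocess.run(["bpftrace", "-e", BPFTRACE_PROGRAM], check=True)
```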
By combining automated GDB sessions with the bpftrace output, the team traced the leaking allocations back to UCX, the high-performance communication layer used by NIXL. UCX hooks memory-mapping operations, and in this setup the hooked allocations were not being effectively cleaned up, so memory kept accumulating.
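One way to do that kind of correlation is to attach GDB in batch mode while the suspicious calls are firing and look for UCX frames (conventionally prefixed ucp_, uct_, or ucs_) in the backtraces. A rough sketch, with the PID passed in as a placeholder:

```python
import subprocess
import sys

pid = sys.argv[1]  # PID of the decode vLLM worker under observation

# Attach non-interactively, dump backtraces for every thread, then detach.
result = subprocess.run(
    ["gdb", "--batch", "-p", pid, "-ex", "thread apply all bt"],
    capture_output=True,
    text=True,
)

# UCX symbols conventionally start with ucp_, uct_, or ucs_; printing only the
# frames that mention them shows which threads are executing UCX code.
for line in result.stdout.splitlines():
    if any(prefix in line for prefix in ("ucp_", "uct_", "ucs_")):
        print(line)
```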
Disabling this hook eliminated the memory leak with no adverse effect on performance. In addition, a fix was proposed to manage memory limits and keep operations reliable going forward.
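As a concrete illustration of how such a hook can be turned off, UCX exposes environment variables that control its memory-event interception. The snippet below uses `UCX_MEM_EVENTS=no` as an example; treat the exact variable, and whether it matches the fix the team applied, as an assumption to verify against your UCX version and the upstream discussion.

```python
import os
import subprocess

env = dict(os.environ)
# Assumption: UCX_MEM_EVENTS controls whether UCX installs hooks that intercept
# memory-mapping calls; setting it to "no" asks UCX not to install them.
env["UCX_MEM_EVENTS"] = "no"

# Placeholder launch command for the decode vLLM instance.
subprocess.run(
    ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", "<model>"],
    env=env,
    check=True,
)
```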
This investigation underscores the complexity and potential pitfalls of modern software stacks built on layers of dependencies, especially as those layers evolve in pursuit of performance. Transparency and collaboration between teams are essential for resolving critical issues quickly, and the Mistral AI team thanks the vLLM, NIXL, and UCX teams for their help in diagnosing and addressing the underlying problem.
Finally, the investigation is a reminder of the value of preparedness and of being willing to dig deep when debugging: the root cause often lies in the intertwined dependencies of a complex software architecture rather than in the code closest at hand.
As Mistral AI continues to build the future of AI infrastructure, they are looking for talented engineers and researchers to join the team; work at this level of the stack offers no shortage of challenging and exciting problems.