vLLM, llama.cpp, Ollama Benchmarked on a Single RTX 3090 With 24GB VRAM
A home-lab test compared three popular LLM inference frameworks (vLLM, llama.cpp, and Ollama) across five models ranging from 1B to 116.8B parameters on a single RTX 3090 GPU paired with 128GB of system RAM. For models that fit within the 24GB of VRAM, vLLM's continuous batching delivered 3.9x–5.4x higher throughput at concurrency 8 than at concurrency 1, well ahead of llama.cpp's 1.2x–1.9x scaling. When models exceeded VRAM capacity and spilled into system RAM, llama.cpp and Ollama both kept generating at single-digit tokens per second, while vLLM crashed with out-of-memory errors at around 22GB of VRAM in use. Steady-state decode speeds were nearly identical across frameworks once warmed up, but time-to-first-token varied sharply: on the largest model, Ollama took 274 seconds versus llama.cpp's 7.3 seconds, largely due to Ollama's automatic GPU-layer splitting on a partially RAM-resident model. The energy cost gap was equally stark, with Ollama consuming roughly seven times more electricity per million tokens than llama.cpp on the same 120B model.
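The two derived metrics in the summary, the concurrency scaling factor and the energy per million tokens, reduce to simple arithmetic. The sketch below shows one plausible way to compute them; the function names and the example inputs are hypothetical and are not taken from the benchmark, and the original article may have measured energy differently (for instance, at the wall rather than from GPU power draw alone).

```python
def scaling_factor(tps_at_c1: float, tps_at_c8: float) -> float:
    """Aggregate throughput at concurrency 8 divided by throughput at concurrency 1."""
    if tps_at_c1 <= 0:
        raise ValueError("throughput at concurrency 1 must be positive")
    return tps_at_c8 / tps_at_c1


def kwh_per_million_tokens(avg_watts: float, tokens_per_second: float) -> float:
    """Energy to generate one million tokens at a constant average power draw."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    seconds = 1_000_000 / tokens_per_second   # time to emit 1M tokens
    joules = avg_watts * seconds              # watts * seconds = joules
    return joules / 3_600_000                 # 1 kWh = 3.6 million joules


if __name__ == "__main__":
    # Hypothetical inputs for illustration only, not measured values.
    print(scaling_factor(tps_at_c1=50.0, tps_at_c8=250.0))
    print(kwh_per_million_tokens(avg_watts=350.0, tokens_per_second=100.0))
```

At a fixed power draw, energy per token is inversely proportional to throughput, so a framework that generates tokens several times more slowly on the same hardware will use roughly that many times more electricity per million tokens.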
This is an AI-generated summary. ShortSingh links to the original source for the complete article.
