vLLM, llama.cpp, Ollama Benchmarked on a Single RTX 3090 With 24GB VRAM

A home-lab test compared three popular LLM inference frameworks (vLLM, llama.cpp, and Ollama) across five models ranging from 1B to 116.8B parameters on a single RTX 3090 GPU paired with 128GB of system RAM. For models that fit within the 24GB VRAM limit, vLLM's continuous batching raised throughput 3.9x–5.4x as concurrency went from 1 to 8, well ahead of llama.cpp's 1.2x–1.9x scaling. When models exceeded VRAM capacity and spilled into system RAM, both llama.cpp and Ollama kept generating at single-digit tokens per second, while vLLM crashed with out-of-memory errors at around 22GB in use. Steady-state decode speeds were nearly identical across frameworks once warmed up, but time-to-first-token varied sharply: Ollama took 274 seconds versus llama.cpp's 7.3 seconds on the largest model, largely because of Ollama's automatic GPU-layer splitting on a partially RAM-resident model. The energy cost gap was equally stark, with Ollama consuming roughly seven times more electricity per million tokens than llama.cpp on the same 120B model.
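The two derived metrics in the summary, concurrency scaling and energy per million tokens, are simple ratios. The sketch below shows one plausible way to compute them. The function names and the sample inputs are illustrative assumptions, not measurements or code from the benchmark.

```python
def scaling_factor(tps_single: float, tps_concurrent: float) -> float:
    """Throughput gain when moving from concurrency 1 to a higher concurrency."""
    return tps_concurrent / tps_single


def watt_hours_per_million_tokens(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy (Wh) needed to generate one million tokens at a steady rate."""
    seconds_needed = 1_000_000 / tokens_per_second
    return avg_power_watts * seconds_needed / 3600  # 3600 joules per watt-hour


# Placeholder inputs chosen for easy arithmetic, NOT figures from the benchmark.
gain = scaling_factor(tps_single=50.0, tps_concurrent=200.0)
energy = watt_hours_per_million_tokens(avg_power_watts=360.0, tokens_per_second=100.0)
print(f"scaling: {gain:.1f}x, energy: {energy:.0f} Wh per million tokens")
```

Because energy per token is power divided by throughput, a framework that generates far more slowly at a similar power draw costs proportionally more electricity per token.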

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

How Trino, Spark, and DuckDB each query the same Apache Iceberg table

Apache Iceberg allows multiple query engines to read the same table stored in object storage without duplicating data, with each engine differing only in how it accesses the shared metadata. Trino connects via a catalog and offers clean, straightforward SQL for interactive queries, making it well-suited for shared lakehouse environments. Spark requires additional session configuration with Iceberg extensions but is the preferred choice when queries are part of larger data pipelines involving transforms or batch writes. DuckDB provides the fastest path for local, read-only inspection by scanning Iceberg metadata files directly, though it can also attach a REST catalog for broader catalog-backed workflows. Understanding how all three engines interact with the same underlying table is essential for teams building and operating real lakehouse architectures.
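The extra session configuration Spark needs can be sketched as a small helper that assembles the relevant settings. The property keys and class names are the standard Iceberg ones for Spark. The catalog name `lake`, the REST catalog type, and the URI are assumptions made for illustration, since the summary does not say which catalog the article uses.

```python
def iceberg_spark_conf(catalog: str, rest_uri: str) -> dict[str, str]:
    """Build Spark settings that register an Iceberg catalog (REST type assumed)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the named catalog as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
    }


# Hypothetical catalog name and endpoint.
conf = iceberg_spark_conf("lake", "http://localhost:8181")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, provided the Iceberg Spark runtime JAR is on the classpath.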

Read more at DEV Community

UUID v7 outperforms v4 as a database primary key due to time-ordered sorting

A developer exploring UUID internals found that version 7 UUIDs, standardised in RFC 9562 in 2024, offer a meaningful advantage over the widely used v4 format for database primary keys. Both types are 128-bit identifiers that let distributed systems generate unique IDs locally, without central coordination and with a negligible risk of collision. While v4 UUIDs fill 122 bits with cryptographic randomness, v7 places a 48-bit Unix millisecond timestamp in the most significant bits, so IDs sort chronologically even when compared as plain strings. Because new keys land near the end of the index, this time-ordered structure avoids the scattered B-tree inserts caused by v4 keys, which trigger page splits and cache inefficiency at scale. No browser built-in currently generates v7, but the logic can be hand-rolled by writing the timestamp byte by byte, which sidesteps JavaScript's 32-bit bitwise operator limitation.
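The v7 layout is small enough to assemble by hand. The following is a minimal Python sketch of the RFC 9562 bit layout (48-bit timestamp, 4-bit version, 2-bit variant, random bits elsewhere). It is not the article's JavaScript implementation. The function name `uuid7` and its optional `ts_ms` parameter are choices made here for illustration.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(ts_ms: Optional[int] = None) -> uuid.UUID:
    """Assemble a version 7 UUID: 48-bit ms timestamp followed by random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000  # current Unix time in milliseconds
    # Bytes 0-5 hold the big-endian timestamp; bytes 6-15 start out random.
    raw = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70  # high nibble of byte 6 = version 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # top two bits of byte 8 = variant 10
    return uuid.UUID(bytes=bytes(raw))


earlier = uuid7(1_700_000_000_000)
later = uuid7(1_700_000_000_001)
print(earlier, later, str(earlier) < str(later))
```

IDs created in different milliseconds compare in timestamp order as strings. Within the same millisecond, the order in this sketch depends on the random bits, because it does not implement the optional monotonic counter the RFC describes.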

Read more at DEV Community

Developers Build AI Cyber-Attack Detective Tool at Hackathon After All-Night Coding Sprint

A three-member team — Mehraan, Aqib, and Ubaid — built KoshurLock Holmes, an AI-powered cybersecurity investigation tool, during a WeMakeDevs hackathon. The tool addresses a core problem in post-breach forensics: evidence from VPN logs, badge readers, email gateways, and other sources is scattered across systems, forcing analysts to manually connect the dots over days or weeks. KoshurLock Holmes parses uploaded evidence files, extracts entities and relationships using Cognee, and constructs a unified knowledge graph so that the same individual appearing across multiple logs resolves to a single node. Users can query the system in plain English, receive cited multi-hop reasoning, and even instruct it to discard false planted clues, causing flawed conclusions to collapse. The project nearly stalled at 4 AM on the final day when the team hit Groq's free-tier token limit with the submission deadline hours away, but they pushed through to complete it by morning.
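The idea of resolving one person across several logs to a single graph node can be illustrated with a toy, rule-based sketch. This is not Cognee's API or the team's code. The normalisation rule, the function names, and the sample events are all invented for illustration.

```python
from collections import defaultdict


def canonical_key(identifier: str) -> str:
    """Normalise an identifier: lowercase, drop any email domain, unify separators."""
    local = identifier.strip().lower().split("@")[0]
    return local.replace(".", "_").replace("-", "_")


def build_graph(events):
    """Group (source, identifier, action) events under one node per canonical identity."""
    graph = defaultdict(list)
    for source, identifier, action in events:
        graph[canonical_key(identifier)].append((source, action))
    return dict(graph)


# Made-up evidence records from three hypothetical sources.
events = [
    ("vpn_log", "J.Doe@example.com", "login from new IP"),
    ("badge_reader", "j_doe", "entered server room"),
    ("email_gateway", "j-doe@example.com", "sent large attachment"),
]
graph = build_graph(events)
for node, linked in graph.items():
    print(node, linked)
```

All three records collapse onto one node here because their identifiers normalise to the same key. A tool like the one described would rely on extracted entities and relationships rather than a fixed string rule.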

Read more at DEV Community

How to Build a Bot-Resistant AI Browser Agent Using Playwright, Gemini, and Bright Data

A technical guide published on DEV Community walks developers through building an AI-powered browser automation agent using Node.js, Playwright, Google Gemini, and Bright Data. The tutorial addresses a core challenge in modern web automation: most AI browser agents fail not because of poor reasoning but because websites detect and block non-human traffic through fingerprinting, IP reputation checks, and behavioral analysis. Rather than building custom anti-detection infrastructure, the guide delegates browser identity and anti-bot handling to Bright Data while the AI layer focuses on reasoning and task execution. The result is a functional agent called SIKKI Agent, capable of browsing real websites, extracting product data, analyzing content, and generating reports. The entire stack consists of Node.js, Playwright, Bright Data, and Gemini, with no separate AI agent framework required.

Read more at DEV Community
