Go developer builds AVX2 scanner to split 10GB JSON files at 4.1 GB/s

A developer working with large JSON files, including ML datasets and analytics dumps of up to tens of gigabytes, found that standard Go parsing approaches consumed three to four times the file size in RAM. Instead of fully parsing the JSON, the solution treats the data as a raw byte stream and uses a lightweight state machine to locate element boundaries by tracking nesting depth and string delimiters. This boundary-scanning approach skips token classification, memory allocation, and type conversion entirely, returning slices of the original buffer with zero extra copies. The hot loop was further optimized with AVX2 SIMD assembly, which handles 32 bytes per vector instruction rather than one byte at a time. Benchmarks showed the scanner reaching approximately 4.1 GB/s with no meaningful memory overhead, compared to around 107 MB/s for the standard encoding/json package and 400–700 MB/s for optimized parsers such as sonic or simdjson-go.
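The boundary-scanning idea can be illustrated with a minimal scalar sketch in Go. This is not the developer's code and has no AVX2 path: it walks one byte at a time, and the function name SplitTopLevel is hypothetical. It assumes the input is a single top-level JSON array, and it only tracks nesting depth, string state, and backslash escapes, so it does not validate the JSON or check that bracket types match.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// SplitTopLevel (hypothetical name) returns the elements of a top-level JSON
// array as sub-slices of buf. Nothing is decoded or copied; each result
// aliases the original buffer.
func SplitTopLevel(buf []byte) ([][]byte, error) {
	var (
		elems    [][]byte
		depth    int  // current nesting depth; 1 means "inside the outer array"
		inString bool // true while between unescaped double quotes
		escaped  bool // true right after a backslash inside a string
		start    = -1 // start offset of the current element, -1 if none
	)
	// flush records buf[start:end] as one element, trimming trailing whitespace.
	flush := func(end int) {
		if start >= 0 {
			elems = append(elems, bytes.TrimRight(buf[start:end], " \t\r\n"))
			start = -1
		}
	}
	for i, c := range buf {
		if inString {
			switch {
			case escaped:
				escaped = false
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue // brackets and commas inside strings are ignored
		}
		switch c {
		case '"':
			inString = true
			if depth == 1 && start < 0 {
				start = i
			}
		case '[', '{':
			if depth == 1 && start < 0 {
				start = i
			}
			depth++
		case ']', '}':
			depth--
			if depth < 0 {
				return nil, errors.New("unbalanced closing bracket")
			}
			if depth == 0 {
				flush(i) // end of the outer array closes the last element
			}
		case ',':
			if depth == 1 {
				flush(i) // a comma at depth 1 separates elements
			}
		case ' ', '\t', '\r', '\n':
			// whitespace between tokens never starts an element
		default:
			if depth == 1 && start < 0 {
				start = i // first byte of a number, true, false, or null
			}
		}
	}
	if depth != 0 || inString {
		return nil, errors.New("unterminated string or container")
	}
	return elems, nil
}

func main() {
	elems, err := SplitTopLevel([]byte(`[1, "a,b", {"k":"]"}]`))
	if err != nil {
		panic(err)
	}
	for _, e := range elems {
		fmt.Printf("%s\n", e)
	}
}
```

Because each element is a sub-slice of the input, a caller can hand individual elements to a full parser only when needed. A SIMD version would keep the same state but classify many bytes per step instead of one.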

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

How One Dev Fixed Months of Wrong-Language Emails Using Stripe's Currency Field

A SaaS product unknowingly sent Japanese-language emails to English-speaking overseas customers for months because all four Stripe-triggered email types were hardcoded to Japanese. The development team evaluated three ways to fix language detection inside Stripe webhooks: storing a language column in the database, querying the Stripe API, or inferring language from the currency field already present in the webhook payload. They chose currency-based inference because it requires no database migration, no extra API calls, and automatically applies the correct language to both new and existing users. A simple helper function maps USD to English and defaults all other currencies to Japanese, with room to expand for EUR or GBP markets later. The team also encountered a subtle PHP mb_language configuration trap during implementation that nearly undermined the fix.
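The currency-to-language mapping described in this summary could look roughly like the sketch below. The original project appears to be written in PHP, so this Go version is an illustration only, not the team's code. The function name languageForCurrency and the "en" and "ja" return values are assumptions. Stripe reports currencies as lowercase three-letter ISO codes, so the comparison is made case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// languageForCurrency (hypothetical helper) infers an email language from the
// currency code found in a webhook payload. USD maps to English; every other
// currency falls back to Japanese, mirroring the rule described above.
func languageForCurrency(currency string) string {
	switch strings.ToLower(strings.TrimSpace(currency)) {
	case "usd":
		return "en"
	// Further markets such as "eur" or "gbp" could be added as extra cases.
	default:
		return "ja"
	}
}

func main() {
	for _, c := range []string{"usd", "USD", "jpy", "eur"} {
		fmt.Printf("%s -> %s\n", c, languageForCurrency(c))
	}
}
```

Keeping the rule in one small function means a later move to a per-user language setting would only need to change this one place.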

Read more at DEV Community

PDF Toolkit API lets developers merge, split, and watermark PDFs via HTTP calls

A lightweight HTTP API called PDF Toolkit has been introduced to handle common PDF operations without requiring native libraries like Ghostscript or pdftk. The API exposes six endpoints covering merging, splitting, rotating, watermarking, metadata extraction, and image-to-PDF conversion. Developers can integrate it using simple cURL commands or Node.js code by sending POST requests with file attachments. The service is available on RapidAPI with a free tier offering 100 requests per day and no credit card required to start. It is aimed at developers working in serverless or restricted hosting environments who want a single integration instead of managing multiple PDF libraries.

Read more at DEV Community

Mistral and open-source MinerU race to make PDFs readable for AI

French AI company Mistral launched an updated hosted document-reading service on June 25, 2026, claiming state-of-the-art accuracy in converting complex PDFs into clean, structured text. Around the same time, the open-source project MinerU gained significant traction on GitHub by offering a self-hosted, free alternative that processes PDFs and office files into AI-ready formats. Both tools tackle document intelligence, the process of extracting properly ordered, structured text from scanned contracts, multi-column papers, and table-heavy invoices that standard text extraction cannot handle. The quality of this conversion matters because AI systems built on top of poorly parsed documents will produce unreliable outputs, with errors occurring invisibly before any language model is even involved. The two tools represent a broader industry tension between convenient, paid cloud services and free, privacy-preserving tools that organisations run on their own infrastructure.

Read more at DEV Community

The Hidden Risk in Codebases: Behavior With No Documented Proof

As software systems age, a dangerous gap grows between how code actually behaves and what the repository can formally prove about that behavior. Critical logic — such as fraud rules, retry handling, or edge-case workarounds — often exists only in a developer's memory, an old Slack thread, or a long-forgotten pull request comment. Tests help, but they only validate what someone remembered to assert, leaving many real user-facing behaviors entirely unprotected. The rise of AI coding tools has sharpened this risk, as agents can silently simplify or remove undocumented logic while tests continue to pass. The author argues that missing behavioral evidence should be treated as a warning signal, and that code reviews must ask not just whether code looks correct, but what behavior it claims to preserve and where the proof lives.

Read more at DEV Community