Go developer builds AVX2 scanner to split 10 GB JSON files at 4.1 GB/s
A developer working with large JSON files (ML datasets and analytics dumps of up to tens of gigabytes) found that standard Go parsing approaches consumed three to four times the file size in RAM. Instead of fully parsing the JSON, the solution treats the data as a raw byte stream and uses a lightweight state machine to locate element boundaries by tracking nesting depth and string delimiters. Because it only needs to know where each element starts and ends, the scanner skips token classification, memory allocation, and type conversion entirely and returns slices of the original buffer without copying any data. The hot loop was further optimized with AVX2 SIMD assembly, which examines 32 bytes per vector instruction rather than one byte at a time. Benchmarks showed the scanner reaching approximately 4.1 GB/s with no meaningful memory overhead, compared with around 107 MB/s for the standard encoding/json package and 400–700 MB/s for optimized parsers such as sonic or simdjson-go.
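The boundary-scanning idea can be illustrated with a minimal scalar sketch in Go. This is not the developer's code: the function name splitElements is hypothetical, the input is assumed to be a single well-formed top-level JSON array already held in memory, and the loop walks one byte at a time instead of using AVX2. It shows only the depth-and-string state machine and the zero-copy subslicing described above.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitElements (hypothetical name) returns each top-level element of a
// JSON array as a subslice of buf. Nothing is parsed or copied: the loop
// tracks only nesting depth and whether it is inside a string.
// It assumes buf holds one well-formed JSON array.
func splitElements(buf []byte) [][]byte {
	var out [][]byte
	depth := 0        // current nesting depth; 1 means directly inside the outer array
	inString := false // true while between an opening and closing quote
	escaped := false  // true when the previous byte in a string was a backslash
	start := -1       // start offset of the current element, -1 if none

	// emit records buf[start:end] with trailing whitespace trimmed.
	// bytes.TrimRight returns a subslice, so this stays zero-copy.
	emit := func(end int) {
		if start >= 0 {
			out = append(out, bytes.TrimRight(buf[start:end], " \t\r\n"))
			start = -1
		}
	}

	for i, c := range buf {
		if inString {
			switch {
			case escaped:
				escaped = false // this byte is escaped, whatever it is
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue // brackets and commas inside strings are ignored
		}
		switch c {
		case '"':
			inString = true
			if depth == 1 && start < 0 {
				start = i
			}
		case '[', '{':
			if depth == 1 && start < 0 {
				start = i
			}
			depth++
		case ']', '}':
			depth--
			if depth == 0 {
				emit(i) // outer array closed: flush the last element
			}
		case ',':
			if depth == 1 {
				emit(i) // separator between top-level elements
			}
		case ' ', '\t', '\r', '\n':
			// whitespace between tokens never starts an element
		default:
			if depth == 1 && start < 0 {
				start = i // first byte of a number, true, false, or null
			}
		}
	}
	return out
}

func main() {
	data := []byte(`[{"a":"]"}, 2, "x\"y"]`)
	for _, el := range splitElements(data) {
		fmt.Printf("%s\n", el)
	}
}
```

Each returned slice aliases the input buffer, so it can be handed to a regular decoder one element at a time while memory use stays close to the size of the file itself. A SIMD version would replace the byte-by-byte loop with vector comparisons that find quotes, backslashes, and brackets 32 bytes at a time.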
This is an AI-generated summary. ShortSingh links to the original source for the complete article.