Solo developer finds 52% duplicates and 32% missing docs in self-built knowledge graph


A systems administrator in the Dominican Republic building ANIMUS, a Rust-based knowledge graph built from 800+ Dominican banking regulation PDFs, discovered two separate silent failures during routine audits of his pipeline. First, over half of the graph's nodes were duplicates: ingestion had been re-run without a uniqueness check, which skewed retrieval results without triggering any errors. Second, and more seriously, 262 of 817 documents were never integrated into the graph because scanned or digitally signed PDFs yielded too little extractable text to pass a 100-character threshold, yet were still marked as successfully processed. Adding an OCR fallback using pytesseract recovered 259 of the 262 missing documents, though noise in the OCR output remains a partial challenge. A further complication emerged when some documents that passed the text check turned out to contain corrupted text baked in by low-quality OCR during their original digitization.
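The two failure modes described above can be illustrated with a minimal Python sketch. It is not the ANIMUS code (which is written in Rust and is not shown in the summary). The function names, the content-hash approach to uniqueness, and the pluggable `ocr_fallback` callable are all assumptions made for illustration; only the 100-character threshold comes from the summary.

```python
import hashlib
from typing import Callable, Optional, Set

MIN_TEXT_CHARS = 100  # threshold mentioned in the summary


def content_key(text: str) -> str:
    """Hash whitespace-normalized, lowercased text so re-ingested copies share a key."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest(
    doc_id: str,
    extracted_text: str,
    seen_keys: Set[str],
    ocr_fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Return 'ingested', 'duplicate', or 'failed' (hypothetical status names).

    The point is that a document with too little text is reported as failed
    rather than being silently marked as processed.
    """
    text = extracted_text
    # Too little extractable text: try OCR if a fallback was supplied.
    if len(text.strip()) < MIN_TEXT_CHARS and ocr_fallback is not None:
        text = ocr_fallback(doc_id)
    # Still too short after the fallback: surface the failure explicitly.
    if len(text.strip()) < MIN_TEXT_CHARS:
        return "failed"
    # Uniqueness check: skip content that is already in the graph.
    key = content_key(text)
    if key in seen_keys:
        return "duplicate"
    seen_keys.add(key)
    return "ingested"
```

In a real pipeline the `ocr_fallback` argument might wrap an OCR library such as pytesseract; it is kept as a plain callable here so the sketch runs without extra dependencies. Note that a length check alone would not catch the third problem in the summary, where already-corrupted OCR text passes the threshold.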

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Solo developer builds open-source sports analytics SaaS to replace coaches' clipboards

Jonas, a developer with a CS and Entrepreneurship background, is publicly documenting the solo development of SportsFlow, a sports analytics platform aimed at amateur coaches. The app allows coaches to live-track shots, assists, and saves during handball games on a phone or tablet, generating season-level analytics such as heatmaps and shooting percentages. Jonas identified the problem firsthand from his time on the bench, noting that in-game data is routinely lost once the final whistle blows. The platform is being built offline-first to handle poor connectivity in sports halls, with plans to expand beyond handball to volleyball, basketball, and ice hockey. Jonas intends to publish biweekly build-in-public updates covering both technical decisions and honest business trade-offs.

Read more at DEV Community

NPM Safety Guard offers 23-layer supply chain protection for JS developers

NPM Safety Guard is a free, open-source developer tool built by SendWaveHub that provides 23 layers of security scanning for npm projects. It detects threats that standard npm audit misses, including known malicious packages, typosquatting, dependency confusion, exposed secrets, and AI credential theft hidden in node_modules. The tool integrates with both VS Code and JetBrains IDEs and is available on their respective marketplaces. It leverages multiple intelligence sources such as OSSF Scorecard, Socket.dev, and ReversingLabs to assess supply chain risk in real time. Released under the MIT license, the project is also hosted on GitHub where developers can review its source code and contribute.

Read more at DEV Community

AI Security Gate: A Proposed Architecture to Safeguard AI-Generated Code

A software engineer has proposed an architectural concept called the AI Security Gate, designed to enforce deterministic security controls on artifacts produced by AI agents in modern development workflows. As AI systems increasingly generate code, infrastructure configs, and CI/CD scripts autonomously, the author argues that human-dependent security checkpoints no longer scale reliably. Unlike AI code reviewers that reason probabilistically, the proposed gate applies fixed, rule-based checks — such as detecting exposed secrets or policy violations — consistently and without exception. The gate is envisioned as a distinct architectural layer, separate from quality review, positioned before any AI-generated artifact is accepted into a repository or deployment pipeline. The concept draws on existing tools like secret scanners and IaC validators, framing them collectively as implementations of a single, mandatory security role.
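As a rough illustration of what a deterministic, rule-based check looks like, the following Python sketch applies a fixed set of regular expressions to an artifact and returns the same verdict for the same input every time. It is not the author's design: the rule names, the patterns, and the `gate` function are hypothetical, and real secret scanners use far larger rule sets.

```python
import re
from typing import List, Tuple

# Hypothetical rule set: (rule name, compiled pattern). Illustrative only.
RULES = [
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("private_key_block",
     re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----")),
    ("hardcoded_password",
     re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]{4,}['\"]")),
]


def gate(artifact: str) -> Tuple[bool, List[Tuple[str, int]]]:
    """Return (accepted, findings), where findings are (rule name, line number).

    The artifact is accepted only if no rule matches on any line.
    """
    findings = []
    for line_no, line in enumerate(artifact.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append((name, line_no))
    return (not findings, findings)
```

A gate like this would sit in front of the repository or deployment pipeline and reject any artifact for which `gate(...)` returns `False`, regardless of whether a human or an AI agent produced it.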

Read more at DEV Community

ClickHouse 2026 Guide: When to Denormalize vs. Join in Analytical Workloads

A new technical guide from DEV Community outlines a decision framework for choosing between denormalization and normalization in ClickHouse analytical data modeling. Denormalization has long been the default for latency-sensitive analytics because pre-joining data at ingestion delivers faster reads for known access patterns. However, advances in columnar database join algorithms — including parallel hash joins, bloom filters, and dictionary-based direct joins — have made runtime joins viable for many modern workloads. While denormalization still offers superior raw read performance, normalization provides operational advantages such as simpler pipelines, flexible schemas, and easier data governance. The guide recommends engineers evaluate their specific workload characteristics — including concurrency, freshness requirements, and pipeline complexity — rather than defaulting to either approach.

Read more at DEV Community