Solo developer finds 52% duplicates and 32% missing docs in self-built knowledge graph
A systems administrator in the Dominican Republic building ANIMUS, a Rust-based knowledge graph built from 800+ Dominican banking regulation PDFs, discovered two separate silent failures during routine audits of his pipeline. First, over half of the graph's nodes (52%) were duplicates: ingestion had been re-run without a uniqueness check, which skewed retrieval results without raising any errors. The second and more serious flaw was that 262 of 817 documents (32%) never made it into the graph. Scanned or digitally signed PDFs yielded too little extractable text to pass a 100-character threshold, yet were still marked as successfully processed. Adding an OCR fallback with pytesseract recovered 259 of the 262 missing documents, although the OCR output is noisy and that problem is only partly solved. A further complication: some documents that passed the text check turned out to contain corrupted text, baked in by low-quality OCR when they were originally digitized.
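The two failure modes can be illustrated with a minimal Python sketch. It is not the ANIMUS code, which the summary does not show: the names `Graph`, `Status`, and `ingest` are hypothetical, hashing whitespace-normalized text is only one possible uniqueness check, and the OCR step is passed in as a callable so that the sketch runs without any OCR library installed. Only the 100-character threshold comes from the summary.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

MIN_CHARS = 100  # the text-length threshold mentioned in the summary


class Status(Enum):
    INGESTED = "ingested"
    INGESTED_VIA_OCR = "ingested_via_ocr"
    DUPLICATE = "duplicate"
    NEEDS_REVIEW = "needs_review"  # never reported as a success


@dataclass
class Graph:
    # Hypothetical stand-in for the graph store: content hash -> node text.
    nodes: dict = field(default_factory=dict)

    def add(self, text: str) -> bool:
        """Insert a node unless one with the same normalized text exists."""
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self.nodes:
            return False  # re-running ingestion does not create a second node
        self.nodes[key] = text
        return True


def ingest(
    graph: Graph,
    embedded_text: str,
    ocr: Optional[Callable[[], str]] = None,
) -> Status:
    """Ingest one document and report what actually happened to it."""
    text = embedded_text.strip()
    status = Status.INGESTED
    if len(text) < MIN_CHARS:
        # Too little embedded text: try the OCR fallback if one was supplied.
        text = ocr().strip() if ocr is not None else ""
        status = Status.INGESTED_VIA_OCR
        if len(text) < MIN_CHARS:
            # Still too little text: flag it instead of marking it processed.
            return Status.NEEDS_REVIEW
    return status if graph.add(text) else Status.DUPLICATE
```

With pytesseract, the `ocr` callable could wrap `pytesseract.image_to_string` applied to a rendered page image; rendering PDF pages to images needs a separate tool and is left out here. A length check like this one cannot detect the third problem in the summary, text that is long enough but already corrupted by earlier low-quality OCR.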
This is an AI-generated summary. ShortSingh links to the original source for the complete article.