Developer builds internal doc Q&A bot using RAG, shares lessons on embedding pitfalls

A software developer spent a weekend building a question-and-answer bot to help their team quickly search through 200-plus pages of internal documentation spread across Confluence, Google Docs, and PDFs. The project used a Retrieval-Augmented Generation (RAG) approach, combining OpenAI embeddings, a Pinecone vector database, and GPT-4 to generate answers from retrieved document chunks. Early attempts with fixed-character chunking and naive retrieval produced poor results, with relevant content often split across chunks or buried beyond the top results. The developer ultimately settled on a hybrid pipeline that chunks documents by paragraph, combines dense embeddings with keyword search, retrieves ten candidate chunks, and reranks them using a lightweight cross-encoder before passing the best three to GPT-4. The experience highlighted that chunking strategy and reranking logic are critical, often overlooked factors in building reliable document search systems.
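The pipeline described above (paragraph chunking, hybrid scoring, ten candidates, rerank to three) can be sketched roughly as follows. This is a minimal illustration, not the developer's code: the bag-of-words cosine is a stand-in for real dense embeddings, the `rerank` callable is a placeholder for a cross-encoder, and every function name and the `alpha` weighting are assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_by_paragraph(text):
    """Split on blank lines so each chunk is one whole paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Stand-in for dense similarity: cosine over bag-of-words counts.
    # A real system would compare embedding vectors here instead.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_score(query, chunk):
    # Fraction of distinct query terms that appear in the chunk.
    q = set(tokens(query))
    return len(q & set(tokens(chunk))) / len(q) if q else 0.0


def retrieve(query, chunks, rerank, k_candidates=10, k_final=3, alpha=0.5):
    """Hybrid retrieval: blend both scores, keep k_candidates, rerank to k_final."""
    qv = Counter(tokens(query))
    scored = []
    for c in chunks:
        dense = cosine(qv, Counter(tokens(c)))
        sparse = keyword_score(query, c)
        # alpha is a hypothetical blending weight between the two signals.
        scored.append((alpha * dense + (1 - alpha) * sparse, c))
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    candidates = [c for _, c in ranked[:k_candidates]]
    # rerank(query, chunk) -> float is a placeholder for a cross-encoder model.
    return sorted(candidates, key=lambda c: rerank(query, c), reverse=True)[:k_final]
```

The chunks returned by `retrieve` would then be placed in the prompt sent to the language model. Passing `keyword_score` as `rerank` is enough to exercise the flow without loading a model.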

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

How to Build a Stable Parser for Claude Code's Undocumented JSONL Logs

Claude Code stores every conversation session as JSONL files on disk under ~/.claude/projects/, but the format is an undocumented internal detail with no version field or stability guarantee. The schema can change with nearly every daily CLI update, making it unreliable for developers who build tools on top of these logs. A developer building a read-only replay and search tool identified key patterns for handling this instability, including preserving unknown data types rather than discarding them, and normalizing input at the parsing boundary. The approach uses an explicit whitelist of known message types, archiving unrecognized entries as raw JSON for future re-parsing once the schema is better understood. A versioning mechanism called SUMMARY_VERSION automates re-indexing of stale sessions whenever the parser is updated, avoiding manual data migrations.
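The whitelist-and-archive pattern and the SUMMARY_VERSION check might look something like the sketch below. The log schema is undocumented, so the `type` and `text` field names, the whitelist contents, and the version number are all hypothetical placeholders rather than the real format or the developer's actual code.

```python
import json

SUMMARY_VERSION = 2  # assumed value; bump whenever the parsing logic changes
KNOWN_TYPES = {"user", "assistant"}  # hypothetical whitelist of entry types


def normalize(entry):
    # Normalize at the parsing boundary so downstream code sees one shape.
    # The "text" field is an assumption made for this example.
    return {"type": entry["type"], "text": str(entry.get("text", ""))}


def parse_session(lines):
    """Parse JSONL lines, keeping anything unrecognized as raw text."""
    parsed, archived = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            archived.append(raw)  # malformed line: keep it, do not drop it
            continue
        entry_type = entry.get("type") if isinstance(entry, dict) else None
        if isinstance(entry_type, str) and entry_type in KNOWN_TYPES:
            parsed.append(normalize(entry))
        else:
            archived.append(raw)  # unknown type: archive for later re-parsing
    return {
        "summary_version": SUMMARY_VERSION,
        "entries": parsed,
        "archived_raw": archived,
    }


def is_stale(index_record):
    # Records written by an older parser version should be re-indexed.
    return index_record.get("summary_version", 0) < SUMMARY_VERSION
```

Because unknown entries are stored verbatim, adding a type to the whitelist and bumping `SUMMARY_VERSION` is enough to have stale sessions re-parsed on the next indexing pass.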

Read more at DEV Community

DIY Creator Tears Down Fogger Vape Pod to Salvage Li-ion Cells and Charging Board

A DIY electronics creator has released a teardown video demonstrating how to disassemble a Fogger vape pod dock and recover its reusable internal components. The salvaged parts include a charging board and rechargeable lithium-ion cells, which the creator repurposed for DIY electronics projects. Using the recovered components, they built a simple 3.7V lithium-ion charger to extend the life of salvaged cells rather than discarding them as e-waste. The project is aimed at hobbyists interested in upcycling, electronics salvage, and DIY power builds. The creator is also accepting donations via Ko-fi to fund future teardown and salvage content.

Read more at DEV Community

Why a 2-Hour Codebase Audit Should Always Precede a Software Rewrite Quote

A software consultant argues that quoting a full system rewrite without first auditing the codebase is either guesswork or financially motivated, and insists on a two-hour code review before providing any estimate. The consultant notes that most troubled systems do not require a complete rewrite, but rather targeted fixes to a small number of genuinely broken components, which is typically faster and cheaper. The audit framework is designed for small-to-mid SaaS products under 100,000 lines of code, with larger distributed systems requiring days of structured assessment rather than hours. Before beginning, the consultant requests repository access, infrastructure details, a database schema dump, a three-month incident log, and a clear description of what the client considers broken. The incident log is highlighted as the most undervalued input, since recurring user-facing errors reveal real risk patterns that static code analysis alone cannot surface.

Read more at DEV Community

How Fake Payslips Fool HR Teams and What Metadata Reveals About Them

Payslip fraud is a widespread form of credential misrepresentation in hiring, with industry surveys suggesting that between 10% and 20% of job applicants falsify documents. Payslips are a prime target because they influence salary negotiations, employment verification, and visa decisions simultaneously. Candidates typically either alter genuine payslips using PDF editors or generate entirely fake ones through online tools, and both methods leave detectable traces in the file's internal structure. When a PDF is edited and re-saved, metadata fields such as the producer tag are updated and the cross-reference table is rewritten or extended, creating a forensic mismatch: for example, a payslip originally created by ADP but last saved via Smallpdf. These structural inconsistencies are invisible to human reviewers but can be flagged by automated metadata analysis tools.
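A very rough sketch of this kind of metadata check, using only raw byte inspection, is shown below. It is an illustration under stated assumptions, not a forensic tool: the `expected_producers` allowlist is a hypothetical input supplied by the reviewer, and the regex only sees literal-string values in uncompressed parts of the file, so it misses hex strings, compressed object streams, and XMP metadata.

```python
import re


def pdf_info_field(data: bytes, name: str):
    """Return the last literal-string value of /name found in the raw bytes."""
    # Naive scan: the last occurrence wins because an incremental save
    # appends newer objects after the original ones.
    pattern = rb"/" + re.escape(name.encode()) + rb"\s*\(([^)]*)\)"
    matches = re.findall(pattern, data)
    return matches[-1].decode("latin-1") if matches else None


def metadata_flags(data: bytes, expected_producers):
    """Collect signals worth a manual look; none of them proves tampering."""
    flags = []
    producer = pdf_info_field(data, "Producer")
    # expected_producers is a hypothetical allowlist of producer strings
    # that the claimed payroll system is known to emit.
    if producer is not None and producer not in expected_producers:
        flags.append(f"unexpected producer: {producer!r}")
    # More than one %%EOF marker suggests the file was saved incrementally,
    # though this also occurs in some genuine files.
    if data.count(b"%%EOF") > 1:
        flags.append("multiple %%EOF markers")
    return flags
```

Any flag is only a prompt for closer review. Reading the fields reliably would call for a proper PDF parsing library rather than a regex over the bytes.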

Read more at DEV Community