Fantastic Robo

High-throughput, multi-format ingestion pipeline with adaptive extraction, OCR, semantic chunking, and a production-grade LLM load balancer for resilient RAG service levels.

By Aditya Singh Khichi, Full Stack Engineer, New Delhi, India.

Tech stack: Docker, Sentry, Vector Embeddings, Mistral OCR, CI/CD, DigitalOcean.

Input formats supported end-to-end: 6+

Problem

RAG pipelines that work on PDFs fall apart the moment users upload DOCX, PPTX, XLSX, scanned images, or email exports. Each format needs a different extractor, a different chunking strategy, and a different OCR fallback. Most teams ship a PDF-only MVP and accumulate technical debt every time a new format is requested.

Approach

Adaptive extraction: detect format → pick extractor → measure extraction quality → fall back to Mistral OCR if the text layer is empty or scrambled. Semantic chunking respects document structure (slides for PPTX, sheets for XLSX, threads for emails) instead of brute-force splitting on token count. Embeddings are computed in batches, and an HNSW vector index gives sub-100 ms similarity search over the full corpus. The LLM load balancer routes requests across providers with telemetry-driven failover, so a single upstream outage doesn't degrade the whole service.
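The detect → extract → measure → fall back loop can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `detect_format`, `text_quality`, and `extract_adaptive`, the extension-based format detection, the readable-character heuristic, and the thresholds are all assumptions. The OCR step is passed in as a plain callable, so no particular OCR client API is implied.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict

# Punctuation treated as "readable" by the heuristic below (an assumption).
_COMMON_PUNCT = set(".,;:!?'\"()-")


@dataclass
class ExtractionResult:
    text: str
    source: str  # "native" if the format extractor was good enough, else "ocr"


def detect_format(filename: str) -> str:
    """Hypothetical detection: lower-cased file extension without the dot."""
    return Path(filename).suffix.lower().lstrip(".")


def text_quality(text: str) -> float:
    """Share of characters that are alphanumeric, whitespace, or common punctuation."""
    if not text.strip():
        return 0.0
    readable = sum(
        ch.isalnum() or ch.isspace() or ch in _COMMON_PUNCT for ch in text
    )
    return readable / len(text)


def extract_adaptive(
    data: bytes,
    fmt: str,
    extractors: Dict[str, Callable[[bytes], str]],
    ocr: Callable[[bytes], str],
    min_quality: float = 0.8,  # assumed threshold
    min_chars: int = 20,       # assumed threshold
) -> ExtractionResult:
    """Try the format-specific extractor; use OCR if the text is empty or scrambled."""
    extractor = extractors.get(fmt)
    text = extractor(data) if extractor else ""
    if len(text.strip()) >= min_chars and text_quality(text) >= min_quality:
        return ExtractionResult(text, "native")
    return ExtractionResult(ocr(data), "ocr")
```

In this sketch a caller registers one extractor per format in `extractors` and supplies an OCR callable. A document whose text layer is empty, too short, or mostly unreadable symbols is routed to OCR; anything else keeps the cheaper native extraction.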

Outcome

Production-grade ingestion across 6+ formats with sub-second retrieval. Hybrid retrieval (dense + BM25) with dynamic Top-K outperformed both dense-only and BM25-only retrieval on legal-style documents, where exact phrase recall matters as much as semantic relevance.
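One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF), followed by a score-relative cutoff to make K dynamic. The sketch below illustrates that idea only. The source does not say which fusion method or cutoff rule the project uses, so RRF, the constant `k = 60`, the `rel_threshold` rule, and the function names are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> Dict[str, float]:
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def dynamic_top_k(
    scores: Dict[str, float],
    max_k: int = 10,             # assumed hard cap on results
    rel_threshold: float = 0.5,  # assumed: keep docs scoring >= 50% of the best
) -> List[str]:
    """Return up to max_k documents whose fused score is close to the best score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    best_score = ranked[0][1]
    return [
        doc_id
        for doc_id, score in ranked[:max_k]
        if score >= rel_threshold * best_score
    ]


# Example: IDs ranked by a dense retriever and by BM25 (illustrative data only).
dense_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
selected = dynamic_top_k(fused)
```

With these inputs `selected` is `["b", "a"]`: both documents appear in both rankings, so their fused scores are roughly double those of `c` and `d`, which fall below the relative cutoff. Documents that match on exact phrasing (BM25) and on meaning (dense) rise to the top, and the number of results returned shrinks when only a few candidates are strong.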

Live link: https://github.com/Raghav-45/fantastic-robo