Architecture notice PDF page and LLM index reference method example

Cutting Output Tokens by 90% and Latency by 87% with Index References in LLM-Based PDF Chunking

TL;DR: Just ask the LLM “from where to where,” and let the server retrieve the text directly. Measured on 3 pages: 90% fewer output tokens, 87% lower latency, 61% cost savings. Background: From Docling to PyMuPDF + VLM While building an AI system for architecture regulation review, we needed to split architecture notice/guideline PDFs into semantically meaningful chunks. These chunks serve as retrieval units in a RAG pipeline. We initially used IBM’s Docling, which uses OCR models to analyze document structure before chunking. However, we ran into two problems: ...

March 9, 2026 · 6 min · Kim Bo-geun