Before the AI Can Answer, Something Has to be Read 📖❇️

When people talk about AI answering questions from documents, they jump straight to the exciting parts — the language model, the vector database, the clever retrieval logic. Nobody talks about the unglamorous step that happens first: turning a PDF into something an AI can actually work with.

I spent time this week doing exactly that. And it taught me more about how AI systems fail than I expected.

The Problem With PDFs

A PDF is not a document. It is a set of instructions for a printer: a list of commands of the form "place this glyph at these coordinates, in this font, at this size". Unless the file has been explicitly tagged, there is no concept of "paragraph", "table", or "heading" baked in. When you extract text from a PDF naively, you get the glyphs in whatever order they happen to appear in the file's internal stream. For a two-column document, that can mean the left column's first line, then the right column's first line, alternating back and forth until the result is unreadable.
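
To make the ordering problem concrete, here is a minimal sketch in Python. The `Glyph` class, the coordinates, and the fixed `gutter_x` split are all made up for illustration; a real layout model (Docling's included) does something far more robust than cutting the page at one x position.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x: float  # left edge, in points (hypothetical values)
    y: float  # distance from the top of the page, in points


# A made-up two-column page, stored in the order the file happens to emit it.
STREAM = [
    Glyph("Left line 1.", 72, 100),
    Glyph("Right line 1.", 320, 100),
    Glyph("Left line 2.", 72, 115),
    Glyph("Right line 2.", 320, 115),
]


def naive_order(glyphs):
    """Return text in raw stream order, which interleaves the two columns."""
    return [g.text for g in glyphs]


def column_order(glyphs, gutter_x=300):
    """Read the left column top to bottom, then the right column."""
    left = sorted((g for g in glyphs if g.x < gutter_x), key=lambda g: g.y)
    right = sorted((g for g in glyphs if g.x >= gutter_x), key=lambda g: g.y)
    return [g.text for g in left + right]


print(naive_order(STREAM))
print(column_order(STREAM))
```

The first call alternates between columns, the second reads each column in full. Everything hard about real documents (uneven columns, headings that span both, tables) is exactly what the fixed gutter ignores.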

This is the problem Docling solves. It is an open-source document processing tool from IBM Research that uses machine learning to look at a PDF the way a human does — understanding which regions are headings, which are tables, which are figures, and what order they should be read in. Then it converts that understanding into formats like Markdown or JSON that a language model can actually use.

Docling is a core part of Ramalama, a Fedora project that lets you run AI models and build RAG (Retrieval-Augmented Generation) pipelines locally on your own machine. Understanding Docling means understanding the first — and arguably most critical — step of the entire pipeline.

What I Did

I installed Docling (version 2.82.0) inside a Python virtual environment on a CPU-only Linux machine, took a real-world PDF — the PyTorch Conference 2026 Sponsorship Prospectus (a 4.7 MB brochure with multi-column layout, a 7-column sponsorship comparison table, and 25 embedded images) — and ran it through nine different conversion experiments:

#	Command	What I was testing
1	`docling pytorch-conference.pdf`	Default conversion (Markdown)
2	`--to html`	HTML with embedded images
3	`--to html --image-export-mode referenced`	HTML with separated image files
4	`--to json`	Full document model as JSON
5	`--to text`	Plain text output
6	`--to yaml`	Full document model as YAML
7	`--pipeline legacy`	Older ML pipeline
8	`--table-mode fast`	Lighter table recognition model
9	`--force-ocr --profiling`	Force OCR on native PDF + timing

Here is what I found.

First: What Happens When Docling Processes a PDF

Before comparing outputs, I needed to understand what Docling actually does internally. It is not one step — it is a pipeline of eight distinct stages:

1. CLI parsing: reads your flags and sets up the run
2. PDF backend: extracts raw text positions and renders each page as a bitmap image
3. Layout analysis: a deep learning model classifies every region on every page as one of section_header, text, table, figure, caption, footnote
4. Table structure recognition: a second model (TableFormer) analyses table regions to identify rows, columns, merged cells, and headers
5. Reading order resolution: sorts all detected regions into the correct logical sequence, especially important for multi-column layouts
6. OCR: runs on regions that have no native text (figures, scanned areas)
7. DoclingDocument assembly: combines all of the above into a single internal data structure
8. Export: serialises the DoclingDocument into whichever format you requested

The key insight here: stages 1 through 7 are identical regardless of your output format. The DoclingDocument assembled at stage 7 is the same whether you asked for Markdown, HTML, JSON, or text. Only stage 8 changes. This means you can compare output formats directly — any difference between them is purely about how the same information is serialised, not about what information was extracted.
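
The "one document, many serialisers" idea can be sketched in a few lines of Python. This is not Docling's API: `DOC`, `EXPORTERS`, and `export` are hypothetical names, the document is reduced to a list of (label, text) pairs, and "Body paragraph." is placeholder text.

```python
from html import escape

# Stand-in for the assembled document: the same data feeds every exporter.
DOC = [
    ("section_header", "About PyTorch Conference"),
    ("text", "Body paragraph."),
]


def to_markdown(doc):
    """Headings become '## ...', everything else is a plain paragraph."""
    return "\n\n".join(
        f"## {text}" if label == "section_header" else text
        for label, text in doc
    )


def to_html(doc):
    """Headings become <h2>, everything else becomes <p>."""
    return "\n".join(
        f"<h2>{escape(text)}</h2>" if label == "section_header" else f"<p>{escape(text)}</p>"
        for label, text in doc
    )


def to_text(doc):
    """Drop the labels entirely; only the strings survive."""
    return "\n".join(text for _label, text in doc)


EXPORTERS = {"md": to_markdown, "html": to_html, "text": to_text}


def export(doc, fmt):
    """Serialise the same in-memory document into the requested format."""
    return EXPORTERS[fmt](doc)


for fmt in EXPORTERS:
    print(f"--- {fmt} ---")
    print(export(DOC, fmt))
```

Note that `to_text` is the only exporter that throws the label away, which is the same loss the plain-text experiment shows later on.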

The Output Formats: A Spectrum of Detail

Markdown — The Default (And Why)

docling pytorch-conference.pdf --output ./output/default/

Output size: 1.2 MB | Lines: 281

Markdown is the default because it sits at the right intersection for RAG:

LLMs understand Markdown natively — models trained on internet data have seen enormous amounts of it. Headings, tables, and emphasis are semantically meaningful.
Natural chunking points — the heading hierarchy (#, ##, ###) gives RAG chunkers meaningful split points that follow the document's logical structure.
Human-verifiable — I can open the file and immediately see whether the conversion looks correct.
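
As a rough illustration of the chunking point, here is a minimal heading-based splitter. It is a sketch, not what any particular RAG framework does: `chunk_by_heading` is a hypothetical helper, it ignores edge cases such as `#` characters inside fenced code, and "Some body text." in the sample is placeholder text.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into (heading, body) chunks at every heading line."""
    chunks = []
    heading, body = None, []

    def flush():
        # Keep a chunk if it has a heading or any non-blank body line.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


SAMPLE = """7-8 April 2026 | Paris, France

## 2026 SPONSORSHIP PROSPECTUS

## About PyTorch Conference

Some body text.
"""

for heading, body in chunk_by_heading(SAMPLE):
    print(repr(heading), "->", repr(body))
```

Text that appears before the first heading ends up in a chunk whose heading is `None`, so nothing is silently dropped.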

The output starts like this:

7-8 April 2026 | Paris, France

## 2026 SPONSORSHIP PROSPECTUS

![Image](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...)

## About PyTorch Conference

Those long data:image/png;base64,... strings are the 25 images embedded directly in the file. They account for almost all of the 1.2 MB. The actual text and table content is only about 26 KB.
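
If you want to see how much of a Markdown file is image payload, a small regex pass is enough. This is a sketch under the assumption that images appear exactly in the `![...](data:image/...)` form shown above; `strip_embedded_images` is a hypothetical helper and the sample string is synthetic.

```python
import re

# Matches a Markdown image whose target is an inline data URI.
DATA_URI_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")


def strip_embedded_images(markdown):
    """Replace embedded base64 images with a short comment.

    Returns a (stripped_text, number_of_images_replaced) tuple.
    """
    return DATA_URI_IMAGE.subn("<!-- image -->", markdown)


# Synthetic sample: one heading, one fake 1000-character image payload.
SAMPLE = "## Title\n\n![Image](data:image/png;base64," + "A" * 1000 + ")\n\nBody.\n"

stripped, count = strip_embedded_images(SAMPLE)
print(f"{count} image(s) removed, {len(SAMPLE)} -> {len(stripped)} characters")
```

Running the same function over the real output and comparing lengths is a quick way to check the "almost all of the 1.2 MB" claim for your own documents.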

HTML — Same Content, Browser-Ready

docling pytorch-conference.pdf --to html --output ./output/html/

Output size: 1.2 MB

The pipeline through stage 7 is identical. Stage 8 changes. Instead of Markdown syntax, the DoclingDocument is serialised to HTML elements:

Section headers → <h1>, <h2>, <h3>
Body text → <p>
Tables → <table> with <tr>, <th>, <td>
Images → <img src="data:image/png;base64,..."> (embedded)

The file is self-contained — one file that renders completely in any browser. But it is the same 1.2 MB because the same 25 base64-encoded images are still embedded.

HTML with Referenced Images — A Structural Insight

docling pytorch-conference.pdf --to html --image-export-mode referenced \
  --output ./output/html-referenced-image-export-mode/

Output: 22 KB HTML + ~900 KB PNG files (25 images)

This was one of the more interesting experiments. With --image-export-mode referenced, Docling extracts each image as a separate PNG file and references it from the HTML:

<img src="pytorch-conference_artifacts/image_000001_4285d2ca...png" />

The HTML file shrank from 1.2 MB to 22 KB — a 55x reduction. The images now live separately in an _artifacts/ folder.

The three image export modes:

Mode	HTML size	Images	Self-contained?
`embedded` (default)	1.2 MB	Inside the HTML as base64	Yes
`referenced`	22 KB	Separate PNG files	No
`placeholder`	Smallest	Not exported	N/A

For a RAG pipeline, this separation is actually what you want. You index the text structure from the 22 KB file. The PNG files sit separately, available for a vision model if needed. Loading a 22 KB file into a text pipeline is dramatically more efficient than loading 1.2 MB.

The trade-off: the HTML file is no longer portable on its own. Move it without the PNG folder and all images break. In a managed pipeline, that is acceptable.
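
One way to guard against that breakage is to check that every referenced image still exists next to the HTML file. The sketch below uses only the Python standard library; `missing_images` and `ImgSrcCollector` are hypothetical names, and it assumes image paths are relative to the HTML file's folder, as in the example above.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> that points at a file, not a data URI."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and not src.startswith("data:"):
                self.sources.append(src)


def missing_images(html_text, base_dir):
    """Return the referenced image paths that do not exist under base_dir."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    base = Path(base_dir)
    return [src for src in collector.sources if not (base / src).is_file()]
```

A pipeline could call `missing_images(html_path.read_text(), html_path.parent)` after moving files around and refuse to continue if the list is not empty.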

JSON — The Full Picture

docling pytorch-conference.pdf --to json --output ./output/json/

Output size: 6.5 MB

This was the most revealing output to explore. JSON does not produce a rendered document — it serialises the entire internal DoclingDocument model to disk. Everything the ML pipeline computed is there:

{
  "schema_name": "DoclingDocument",
  "texts": [
    {
      "label": "section_header",
      "text": "About PyTorch Conference",
      "prov": [{ "page_no": 1, "bbox": { "l": 72.0, "t": 340.0, "r": 540.0, "b": 355.0 } }]
    }
  ],
  "tables": [
    {
      "data": {
        "grid": [[{ "text": "DIAMOND 4 AVAILABLE", "is_header": true }]]
      }
    }
  ]
}

What JSON preserves that Markdown discards:

The exact bounding box of every element on every page
The semantic label of every element (section_header, text, caption, table, figure)
The page number of every element
The full table cell grid with row/column indices and header flags
The document hierarchy — which heading owns which paragraphs

JSON is 6.5 MB because it carries all of that metadata plus the 25 base64 images. It is the format for programmatic RAG pipelines — when a framework like LangChain or LlamaIndex needs to filter by element type, attribute source pages, or build intelligent chunking strategies from heading hierarchy.
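
Here is a minimal sketch of that kind of filtering, using only the fields visible in the excerpt above (`texts`, `label`, `text`, `prov`, `page_no`). The real file contains many more fields, and `headings_with_pages` is a hypothetical helper, not part of Docling, LangChain, or LlamaIndex.

```python
import json

# Trimmed copy of the JSON excerpt shown above.
RAW = """
{
  "schema_name": "DoclingDocument",
  "texts": [
    {
      "label": "section_header",
      "text": "About PyTorch Conference",
      "prov": [{ "page_no": 1, "bbox": { "l": 72.0, "t": 340.0, "r": 540.0, "b": 355.0 } }]
    }
  ]
}
"""


def headings_with_pages(doc):
    """Return (heading text, sorted page numbers) for every section header."""
    result = []
    for item in doc.get("texts", []):
        if item.get("label") == "section_header":
            pages = sorted({p["page_no"] for p in item.get("prov", [])})
            result.append((item["text"], pages))
    return result


doc = json.loads(RAW)
print(headings_with_pages(doc))
```

The same loop with a different label filter gives you captions or body text only, and the page numbers are what make source attribution possible in an answer.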

Text — The Floor

docling pytorch-conference.pdf --to text --output ./output/text/

Output size: 26 KB

Text output strips everything structural. The output opens like this:

7-8 April 2026 | Paris, France
2026 SPONSORSHIP PROSPECTUS
<!-- image -->
About PyTorch Conference

Headings are indistinguishable from body text. The 7-column sponsorship table is completely flattened, so the cell structure is gone. Images become <!-- image --> comments.

The file is 26 KB because all image data and markup have been discarded. But that small size is not efficiency — it is information loss. An LLM reading this output cannot reliably answer "How many conference passes does a Gold sponsor receive?" because the row-column relationship that carries that answer no longer exists.

Text output only makes sense for simple prose documents with no tables.

YAML — JSON's Twin

docling pytorch-conference.pdf --to yaml --output ./output/yaml/

Output size: 6.4 MB

YAML serialises the same DoclingDocument that JSON does: same content, same metadata, same images. Only the syntax differs:

- label: section_header
  text: About PyTorch Conference
  prov:
    - page_no: 1
      bbox:
        l: 72.0
        t: 340.0

The practical difference comes down to ecosystem fit. For Ramalama's Python-based pipeline, JSON is the better default: Python can parse it with the standard library (import json), its grammar leaves far less room for parsing surprises than YAML's, and it is more widely supported in APIs and web services. YAML would only be preferable if a specific downstream tool required it.

The File Size Story

Looking at all six outputs together, a pattern emerges:

Format	Size	What dominates
JSON	6.5 MB	Metadata + images
YAML	6.4 MB	Metadata + images
Markdown	1.2 MB	Images (25 × base64)
HTML (embedded)	1.2 MB	Images (25 × base64)
HTML (referenced)	22 KB HTML + 900 KB PNGs	Separated
Text	26 KB	Text content only

File size is not proportional to information quality. The largest files (JSON, YAML) carry the most structural information. But Markdown and HTML are large primarily because of embedded image data, not because they are structurally rich. And the smallest file (text) is small because structure was discarded — not because it is efficient.
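
The numbers in the table can be sanity-checked with a little arithmetic. This sketch assumes decimal units (1 MB = 1000 KB) and treats the 26 KB text output as a rough proxy for the non-image part of the Markdown file; both are simplifications.

```python
# Sizes reported in this post, in KB (decimal units assumed).
MARKDOWN_KB = 1200
TEXT_KB = 26
HTML_REFERENCED_KB = 22

# Fraction of the Markdown file that is image payload, approximately.
image_share = 1 - TEXT_KB / MARKDOWN_KB

# How many times smaller the referenced-mode HTML is than the embedded one.
html_reduction = MARKDOWN_KB / HTML_REFERENCED_KB

print(f"image share of Markdown: {image_share:.1%}")
print(f"HTML reduction factor:   {html_reduction:.1f}x")
```

Under those assumptions the image payload comes out at roughly 98% of the Markdown file, and the reduction factor rounds to the 55x quoted earlier.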

The Most Surprising Finding: Smaller Is Not Worse

My first pipeline experiment compared the standard pipeline (default) against the legacy pipeline:

docling pytorch-conference.pdf --output ./output/default/
docling pytorch-conference.pdf --pipeline legacy --output ./output/legacy/

The results stopped me:

Aspect	Standard Pipeline	Legacy Pipeline
Output file size	1.2 MB	28 KB
Images	25 embedded (base64)	25 placeholders
Table structure	Correct	Correct
Text content	Complete	Complete

The legacy pipeline produced a file 43 times smaller. My first instinct was that something must have been lost. I opened both files and compared them carefully, line by line.

The table content was identical. Every row, every column, every cell value — the same. The sponsorship tiers, the pricing, the benefit descriptions, all correct in both outputs.

The entire difference was in how images were handled. Where the standard pipeline embeds actual image data:

![Image](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...)

The legacy pipeline writes:

<!-- 🖼️❌ Image not available. Please use `PdfPipelineOptions(generate_picture_images=True)` -->

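The legacy pipeline detects that images exist at those positions but does not render or encode them, so all 25 images become 25 identical placeholder comments. Judging by the placeholder text, which points at PdfPipelineOptions(generate_picture_images=True), this reflects how picture generation is configured in that pipeline rather than a bug.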

For this document, the images are logos and decorative graphics. None of them contain answerable knowledge. A RAG system querying "How many attendee passes does a Gold sponsor receive?" does not need the PyTorch logo to answer correctly.
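
A quick way to confirm that the two outputs differ only in image handling is to count embedded images and placeholder comments in each file. This is a sketch: `image_summary` is a hypothetical helper, and it assumes images show up either as `![...](data:image/...)` or as an HTML comment that mentions "image", as in the two examples above.

```python
import re

EMBEDDED_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/")
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def image_summary(markdown):
    """Count embedded base64 images and image placeholder comments."""
    comments = HTML_COMMENT.findall(markdown)
    return {
        "embedded": len(EMBEDDED_IMAGE.findall(markdown)),
        "placeholders": sum("image" in c.lower() for c in comments),
    }


# Shortened versions of the two image forms shown above.
STANDARD = "![Image](data:image/png;base64,iVBORw0KGgo)\n\nSome text.\n"
LEGACY = (
    "<!-- 🖼️❌ Image not available. Please use "
    "`PdfPipelineOptions(generate_picture_images=True)` -->\n\nSome text.\n"
)

print(image_summary(STANDARD))
print(image_summary(LEGACY))
```

If the two real files report the same total (25 embedded in one, 25 placeholders in the other) and the remaining text diffs clean, the size gap is fully explained by images.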

The lesson: pipeline choice is not a global quality setting. It is a document-type decision. For documents with decorative images, the legacy pipeline gives you equivalent knowledge at a fraction of the output size (28 KB vs 1.2 MB), and it should also need less processing time and memory because no images are rendered or encoded. On constrained hardware like a personal laptop running Ramalama, that trade-off is worth considering.

The Scariest Finding: Silent Table Corruption

This one genuinely surprised me.

I compared the default table mode (accurate) against fast table mode:

docling pytorch-conference.pdf --output ./output/default/
docling pytorch-conference.pdf --table-mode fast --output ./output/table-fast/

The fast mode output was 1.2 MB — same size as default. No errors in the terminal. No warnings about table processing. The file looked fine at a glance. The line count was 283 instead of 281, but I almost missed it.

Then I diffed the two files.

Accurate mode (correct):

| Promotion of Activity in Sponsor Booth: A session... | Promotion of (2) in-booth activities/ time slots | ...
| Attendee Registration Contact List: Opt-in only | ✔ (List provided pre and post event) | ✔ (List provided post event) | ...
| Social Media Promotion: From PyTorch X handle... | 1 Custom Post, 1 Group Post, and 1 Re-Post | 1 Group Post and 1 Re-Post | ...

Fast mode (corrupted):

| Promotion of Activity in Sponsor Booth: A session, demo... will be | Promotion of (2) in-booth | ...
| communicated by Sponsor Services, and may not overlap conference sessions. | activities/ time slots | ...
| Attendee Registration Contact List: Opt-in only Social Media Promotion: From PyTorch X handle. All custom | ✔ (List provided pre and post event) 1 Custom Post, 1 Group Post, and | ...

Three specific failures, all in the same table:

Cell text overflow — a long cell description was split across two phantom rows that do not exist in the original. This is where the extra 2 lines came from. More lines, less information.
Row merging — "Attendee Registration Contact List" and "Social Media Promotion" were collapsed into a single row. Their data values were concatenated inside shared cells. The Social Media Promotion row effectively disappeared as a distinct entry.
Column content bleeding — in the last two columns, truncated cell content from one row bled into the adjacent row: the text "wifi App Only No physical device provided." appeared as a cell value in the wrong row.

No error was thrown. The file was produced successfully. It just contained wrong data.

This is the most dangerous kind of failure in an AI pipeline — silent corruption. If this output were fed into a RAG system, an LLM answering "What social media promotion is included with a Gold sponsorship?" would return incorrect information with full confidence. There is no signal that anything went wrong.

The lesson: --table-mode fast is appropriate only for simple tables with clear borders and single-line cells. For any document where tables are the primary knowledge source, --table-mode accurate is not optional.
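
Because nothing in the tool's output flags this, the check has to come from outside. One cheap option is to parse the Markdown tables from both runs and compare row counts and first-column labels. The sketch below is an assumption-heavy illustration: `table_rows` and `compare_tables` are hypothetical helpers, escaped pipes inside cells are not handled, and the two sample tables are shortened stand-ins with placeholder cell values, not the real output.

```python
def table_rows(markdown):
    """Extract pipe-table rows as lists of cell strings, skipping separators."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Separator rows look like |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def compare_tables(reference_md, candidate_md):
    """Report structural differences between two renderings of one table."""
    ref_labels = [row[0] for row in table_rows(reference_md)]
    cand_labels = [row[0] for row in table_rows(candidate_md)]
    return {
        "row_count_changed": len(ref_labels) != len(cand_labels),
        "missing_labels": [l for l in ref_labels if l not in cand_labels],
        "unexpected_labels": [l for l in cand_labels if l not in ref_labels],
    }


# Shortened stand-ins modelled on the row-merging failure described above.
ACCURATE = "| Benefit | Tier |\n|---|---|\n| Contact List | A |\n| Social Media | B |\n"
FAST = "| Benefit | Tier |\n|---|---|\n| Contact List Social Media | A B |\n"

print(compare_tables(ACCURATE, FAST))
```

This only works when you have a trusted reference run to compare against, so it fits best as a spot check on a sample of documents before switching a whole corpus to the faster mode.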

The Counterintuitive Finding: More OCR Is Not Better

The --force-ocr flag sounds thorough. Run OCR on the entire document, not just the image regions. Surely that produces more complete output?

docling pytorch-conference.pdf --force-ocr --profiling --output ./output/force-ocr/

For a scanned document with no native text layer, --force-ocr would be the right call. But pytorch-conference.pdf is a digitally typeset PDF — produced by a layout tool, not scanned from paper. It already has a perfect text layer: every character correct, every Unicode symbol intact, every piece of punctuation as intended.

When you force OCR over a document that already has perfect text, you replace known-good characters with pixel guesses. OCR working from bitmaps introduces errors that did not exist in the source:

l, I, and 1 are visually similar at small sizes — OCR confuses them
Unicode symbols like ✔ may become v or disappear entirely
Small text (footnotes, fine print) falls below reliable OCR resolution
Typographic ligatures (fi, fl) common in professional typesetting are often split or misread

On a native PDF with a clean text layer, --force-ocr is a net loss: it risks lower text quality and takes longer, with nothing gained in return.

The --profiling flag made the cost concrete. Instead of subjectively saying "it felt slower," profiling prints a stage-by-stage timing breakdown after the run. The OCR stage — which touches only a handful of image regions in a standard run — balloons dramatically when it has to process every region of every page.

The lesson: know your document type before choosing your settings. Default OCR is correct for digitally typeset PDFs. Reserve --force-ocr for confirmed scanned documents where the native text layer is absent or corrupt.
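
If you do end up with both a native-text run and a forced-OCR run of the same file, a character-level diff makes the substitutions visible. This sketch uses `difflib` from the standard library; `substitutions` is a hypothetical helper and the two sample strings are invented to mimic the ✔ to v confusion described above, not taken from real output.

```python
from difflib import SequenceMatcher


def substitutions(native, ocr):
    """Return (native_fragment, ocr_fragment) pairs where the texts disagree."""
    matcher = SequenceMatcher(None, native, ocr, autojunk=False)
    return [
        (native[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example: the check mark is misread as a lowercase v.
NATIVE = "✔ List"
OCR = "v List"

print(substitutions(NATIVE, OCR))
```

On full documents it is more practical to run this line by line than on the whole text at once, since `SequenceMatcher` slows down noticeably on long inputs.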

Why This All Matters for RAG

RAG works like this:

PDF → Docling → Chunks → Embeddings → Vector Store → LLM → Answer

Docling is step one. Every mistake it makes travels through every step that follows:

A corrupted table (as seen with --table-mode fast) produces bad chunks
Bad chunks produce bad embeddings
Bad embeddings surface the wrong context during retrieval
The wrong context produces a confidently wrong answer from the LLM

There is no error correction downstream for document processing failures. The LLM does not know that the table it is reading had two rows merged. It reads what it is given and answers accordingly.

This is what makes Docling — and this task — more important than it first appears. It is not about learning a CLI tool. It is about understanding where AI systems break before they start, and building the intuition to prevent it.

What I Would Choose for Ramalama

For a local Ramalama deployment on personal hardware processing typical PDFs:

For standard documents (reports, brochures, papers):

docling <file.pdf> --pipeline standard --table-mode accurate --output ./output/

Best quality. Worth the extra processing time when answer accuracy matters.

For high-volume processing on constrained hardware with decorative images:

docling <file.pdf> --pipeline legacy --output ./output/

Equivalent text and table quality at ~43x smaller output. Trade images for speed.

For programmatic pipeline integration:

docling <file.pdf> --to json --output ./output/

JSON preserves all structural metadata for intelligent chunking and source attribution.

For confirmed scanned PDFs only:

docling <file.pdf> --force-ocr --output ./output/

Never apply to digitally typeset documents.

The most important rule: pipeline settings are per-document decisions, not global ones. Know what kind of document you are processing before you choose how to process it.
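
The four recommendations above can be captured as a small lookup so that the choice is made explicitly per document. This is a sketch: the scenario names and the `docling_args` helper are hypothetical, and the flags are exactly the ones used in the commands above, nothing more.

```python
import shlex

# One entry per recommendation in this section.
RECIPES = {
    "standard": ["--pipeline", "standard", "--table-mode", "accurate"],
    "constrained": ["--pipeline", "legacy"],
    "programmatic": ["--to", "json"],
    "scanned": ["--force-ocr"],
}


def docling_args(pdf_path, scenario, output_dir="./output/"):
    """Build the argument list for one of the scenarios described above."""
    if scenario not in RECIPES:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return ["docling", pdf_path, *RECIPES[scenario], "--output", output_dir]


print(shlex.join(docling_args("report.pdf", "standard")))
```

The returned list can be passed straight to `subprocess.run`. Raising on an unknown scenario is deliberate: it forces the caller to decide what kind of document they have instead of falling back to a silent default.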

Closing Thought

I expected this task to be about learning commands. It turned out to be about learning how AI pipelines fail — and specifically, how they fail quietly, without telling you.

The --table-mode fast finding stays with me. A system that produces wrong output with no error signal is more dangerous than one that crashes. A crash tells you something went wrong. Silent corruption tells you nothing — it just slowly poisons the answers your AI gives.

Good document processing is not glamorous. But it is the foundation everything else stands on. If the documents feeding your RAG system are not being read correctly, no amount of downstream sophistication will save you.

Check out everything in detail, in my GitHub repository here. Don't forget to leave a star on my repo 🌟

This is another post in my series about the Outreachy Internship Contribution Phase. See you in the next one 🥂
