Discussion - Topic: Docling
Details
This week's topic: Docling
Discussion resources to help guide the conversation will be posted below a few days before the meetup.
Zoom link will be added about 5 min before the event starts.
As described in Thoughtworks Technology Radar Vol. 34:
Docling is an open-source Python and TypeScript library for converting unstructured documents into clean, machine-readable outputs. Using a computer vision–based approach to layout and semantic understanding, it processes complex inputs, including PDFs and scanned documents, into structured formats such as JSON and Markdown. That makes it a strong fit for retrieval-augmented generation (RAG) pipelines and for producing structured outputs from LLMs, in contrast to vision-first retrieval approaches such as ColPali.
Docling provides an open-source, self-hostable alternative to proprietary cloud-managed services such as Azure Document Intelligence, Amazon Textract and Google Document AI, while integrating well with frameworks such as LangGraph. In our experience, it performs well in production-scale extraction workloads across digital and scanned PDFs, including very large files containing text, tables and images. It delivers a strong quality-to-cost balance for downstream agentic RAG workflows. Based on these results, we’re moving Docling to Trial.
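To make the discussion concrete, here is a minimal sketch of the "document to Markdown to RAG chunks" flow described above. It assumes the `docling` Python package is installed and uses its `DocumentConverter` entry point. The `chunk_by_heading` helper and the file name `report.pdf` are hypothetical and only illustrate one simple way to prepare converted text for retrieval; they are not part of Docling and not necessarily how a production pipeline would chunk.

```python
import re
from typing import Dict, List


def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown using Docling."""
    # Imported lazily so the chunking helper below can be used without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading.

    Hypothetical helper for illustration only; not part of Docling.
    Text before the first heading becomes a chunk with an empty title.
    """
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Emit the current section only if it has a title or non-blank text.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            title = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

A possible usage, assuming a local file named `report.pdf`: call `chunk_by_heading(convert_to_markdown("report.pdf"))` and pass each chunk's `text` to the embedding step of your retrieval pipeline.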
Discussion Resources:
Will be added a few days before the event.
