Details

This week's topic: Docling

Discussion resources to help guide the conversation will be posted below a few days before the meetup.

The Zoom link will be added about 5 minutes before the event starts.

As described in the Thoughtworks Technology Radar, Vol. 34:

Docling is an open-source Python and TypeScript library for converting unstructured documents into clean, machine-readable outputs. Using a computer vision–based approach to layout and semantic understanding, it processes complex inputs, including PDFs and scanned documents, into structured formats such as JSON and Markdown. That makes it a strong fit for retrieval-augmented generation (RAG) pipelines and for producing structured outputs from LLMs, in contrast to vision-first retrieval approaches such as ColPali.
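
For anyone who wants to try this before the meetup, here is a minimal sketch of the convert-then-chunk step that a RAG pipeline might start with. It assumes the `docling` Python package is installed and uses its `DocumentConverter` quickstart pattern. The `split_by_heading` helper is hypothetical and not part of Docling; it is only one simple way to turn the Markdown output into retrievable chunks.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
_HEADING = re.compile(r"#{1,6}\s+\S")


def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown using Docling."""
    # Imported lazily so the chunking helper below can be used without docling.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def split_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Hypothetical helper (not a Docling API): one chunk per Markdown heading.

    Simplification: heading-like lines inside fenced code blocks are also
    treated as headings.
    """
    chunks: List[Dict[str, str]] = []
    title, body = "", []

    def flush() -> None:
        text = "\n".join(body).strip()
        if title or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        if _HEADING.match(line):
            flush()  # close the previous section before starting a new one
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()  # close the final section
    return chunks


# Example usage (the file name is a placeholder):
#   chunks = split_by_heading(convert_to_markdown("report.pdf"))
```

Splitting on headings is a deliberately naive choice; a real pipeline would more likely chunk on the structured document that Docling produces, so that tables and figures stay intact.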

Docling provides an open-source, self-hostable alternative to proprietary cloud-managed services such as Azure Document Intelligence, Amazon Textract and Google Document AI, while integrating well with frameworks such as LangGraph. In our experience, it performs well in production-scale extraction workloads across digital and scanned PDFs, including very large files containing text, tables and images. It delivers a strong quality-to-cost balance for downstream agentic RAG workflows. Based on these results, we’re moving Docling to Trial.

Discussion Resources:

Will be added a few days before the event.
