Apache Arrow: The In-Memory Layer Behind Your Iceberg, Spark, and Parquet
Details
10 years of Arrow. 30 minutes to understand why it's everywhere.
If you work with modern data infrastructure, Arrow is almost certainly running somewhere in your stack. Most engineers never notice it.
Arrow solved a real problem: moving data between systems required serializing and deserializing at every boundary, burning CPU cycles, memory copies, and latency. At scale, that cost compounds fast. Arrow's answer was a language-agnostic columnar memory format that any system could share without copying. What started as a memory-layout spec became the execution substrate of the modern data stack.
In this 30-minute session, Badal Singh, a contributor to Apache Iceberg Go and the builder of OLake's Arrow-based ingestion writer (550,000+ rows/second), will cover:
- From niche interoperability project to de-facto standard: Apache Arrow's 10-year journey
- What Arrow actually is beyond "columnar in-memory format" and why that definition undersells it
- How zero-copy data sharing eliminates serialization overhead and what that means for pipeline performance
- Where Arrow runs today: Spark, Pandas, ClickHouse, Polars, and inside open table format implementations like Apache Iceberg Go
- What's next: Arrow Flight, ADBC, nanoarrow, and the ecosystem reshaping how data systems talk to each other
