What we’re about

Apache Hudi ingests and manages storage of large analytical datasets over distributed file systems (including cloud stores). Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing. Apache Hudi is used extensively by companies such as Uber, Robinhood, Alibaba and more to build their data lakes and solve a wide variety of business use-cases. Some of the notable features provided by Hudi-based data lakes are:

  • Upsert support with fast, pluggable indexing (see the sketch after this list).
  • Atomically publish data with rollback support.
  • Snapshot isolation between writer & queries.
  • Savepoints for data recovery.
  • Manages file sizes and layout using statistics.
  • Async compaction of row & columnar data.
  • Timeline metadata to track lineage.
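
As an illustration of the upsert support above, here is a minimal sketch using Hudi's Spark datasource API from PySpark. The table name, base path, fields and sample data are illustrative assumptions, and the Hudi Spark bundle is assumed to be on the classpath (e.g. via --packages):

    # A minimal sketch of a Hudi upsert via the Spark datasource API.
    # Table name, base path and field names are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

    # Hypothetical incoming batch: one new trip and one update to an existing key.
    updates = spark.createDataFrame(
        [("trip-001", "driver-A", 9.99, "2020-01-01 10:00:00"),
         ("trip-002", "driver-B", 4.50, "2020-01-01 10:05:00")],
        ["trip_id", "driver", "fare", "ts"])

    hudi_options = {
        "hoodie.table.name": "trips",                          # assumed table name
        "hoodie.datasource.write.recordkey.field": "trip_id",  # key used by the index
        "hoodie.datasource.write.partitionpath.field": "driver",
        "hoodie.datasource.write.precombine.field": "ts",      # latest ts wins on upsert
        "hoodie.datasource.write.operation": "upsert",
    }

    # Keys that already exist in the table are updated; new keys are inserted.
    (updates.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("/tmp/hudi/trips"))      # assumed base path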

Some popular use-cases for Apache Hudi-based data lakes include: record-level data deletions (sketched below) to meet compliance and privacy requirements; reducing storage costs with features such as automated archival, cleaning and compaction; and delivering quick, actionable insights with minute-level freshness on the data lake.
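
For instance, a record-level delete of the kind used in compliance workflows can be expressed with the same datasource API by writing the keys to remove with the operation set to "delete". This continues the assumed table and paths from the sketch above:

    # A hedged sketch of a record-level delete, continuing the example above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-delete-sketch").getOrCreate()

    # Keys of the records to remove; the partition path and precombine fields
    # are included because the (assumed) table above is partitioned by driver.
    to_delete = spark.createDataFrame(
        [("trip-001", "driver-A", "2020-01-01 10:00:00")],
        ["trip_id", "driver", "ts"])

    (to_delete.write.format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.partitionpath.field", "driver")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "delete")  # delete, not upsert
        .mode("append")
        .save("/tmp/hudi/trips"))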

You can engage with the Apache Hudi community on the dev mailing list (dev@hudi.apache.org) or on Slack. Find more information here -> https://hudi.apache.org/community.html

This meetup is for anyone wanting to build the next generation of data lakes. Developers of Hudi can use this platform to learn about its internals and how to run Hudi at scale. Users can learn about use-cases, new features, the future roadmap and more.

Find past presentations, tutorials and videos here -> https://hudi.apache.org/docs/powered_by.html and blog posts about use-cases and internals here -> https://hudi.apache.org/blog.html