
Details

Are you interested in learning more about open-source data technologies? Do you want to network with local and international tech professionals in a fun, relaxed environment?

Then come join us on Thursday, September 28th for an evening full of inspiring conversations and exciting talks by speakers from Hookdeck, Instacart, and Dremio.

After the talks, you'll network with speakers and local tech professionals over some food.

Venue:

Stateroom in Workhaus

30 Wellington St West

5th Floor

Toronto, ON

M5L 1E2

It's in a Workhaus shared space. Attendees will be directed to the Stateroom when they get off the elevator on the 5th floor.

Major intersection: Bay and Wellington (between King and Front Street on Bay)


Program:

18:00 - 18:20 - Food* + welcome

18:20 - 18:40 - Scaling Analytics with ClickHouse: ingest at 100k/sec, query stateful data in under half a second (Maurice Kherlakian - CTO & Founding Engineer at Hookdeck)

18:40 - 19:00 - Apache Iceberg: enabling an open data architecture for large-scale analytics - (Dipankar Mazumdar - Data (Eng/Sci) Advocate at Dremio)

19:00 - 19:20 - Postgres as search and personalization engine (Ankit Mittal - Senior Software Engineer at Instacart)

19:20 - 20:00 - Q&A, networking

* Please note that this is an alcohol-free event. Light bites will be provided.

* By attending this event, you agree to abide by the Aiven community code of conduct.

* Recording equipment will be present.

Talk details:

Talk # 1 Scaling Analytics with ClickHouse: ingest at 100k/sec, query stateful data in under half a second

ClickHouse is one of the most powerful open-source analytics databases available. Its strength lies not only in the speed at which it lets you query data, but also in the speed at which it can ingest data. To achieve that speed, ClickHouse does not support updating records (it supports mutations, which are async operations on large data sets, but not updates as we know them from traditional databases). Instead, for every update, we insert a new record. This makes it challenging to query stateful data, as we have to take care of duplicates.

In this talk, we'll take a look at some modeling techniques and see how we can use one of ClickHouse's MergeTree engines, VersionedCollapsingMergeTree, to our advantage with some clever queries, getting aggregation queries over hundreds of millions of records to run in under half a second.
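To make the insert-only pattern concrete, here is a toy Python sketch (not ClickHouse itself, and not the speaker's code) of how VersionedCollapsingMergeTree-style collapsing works: an update is written as a cancel row (sign = -1) matching the old state plus a new state row (sign = +1) with a higher version, and a merge cancels matching pairs so only the current state survives.

```python
from collections import defaultdict

def collapse(rows):
    """Collapse (key, version, sign, value) rows the way a
    VersionedCollapsingMergeTree merge would: a state row (sign=+1)
    is cancelled by a cancel row (sign=-1) with the same key and
    version; the highest-version surviving row is the current state."""
    net = defaultdict(int)   # net sign per (key, version)
    values = {}              # value carried by the state row
    for key, version, sign, value in rows:
        net[(key, version)] += sign
        if sign == 1:
            values[(key, version)] = value
    state = {}
    for (key, version), n in net.items():
        if n == 1:  # uncancelled state row survives the merge
            prev = state.get(key)
            if prev is None or version > prev[0]:
                state[key] = (version, values[(key, version)])
    return {k: v for k, (_, v) in state.items()}

rows = [
    ("order-1", 1, +1, "pending"),  # initial insert
    ("order-1", 1, -1, "pending"),  # cancel the old state...
    ("order-1", 2, +1, "shipped"),  # ...and insert the new one
]
print(collapse(rows))  # {'order-1': 'shipped'}
```

In real ClickHouse the same effect is typically achieved in queries by aggregating with the sign column, since merges run asynchronously.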

Talk # 2 Apache Iceberg: enabling an open data architecture for large-scale analytics

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data.

A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format, released by Facebook, which addresses some of these problems, but falls short on data, user, and application scale.

Apache Iceberg is a foundational technology for implementing an open data lakehouse, an architecture that addresses the limitations of traditional data architecture patterns. These limitations include having to ETL the data into each tool, creating data drift and data silos; high costs that make it prohibitive to extend warehouse features to all of your data; and a lack of flexibility that forces you to adjust your workflow to the tool your data is locked in. Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. In this talk we will go through:

  • What is a Table format in Data lakes?
  • Hive table format & why do we need a new one?
  • Architecture of an Iceberg table
  • What happens under the covers when we CRUD?
  • Benefits of this architecture (cost savings, etc.) & how it enables an open lakehouse
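The layered-metadata idea behind the "Architecture of an Iceberg table" bullet can be sketched in a few lines. This is a simplified toy model, not the real Iceberg spec: table metadata points to snapshots, snapshots point to manifests, and manifests list immutable data files, so an append creates a new snapshot that reuses earlier manifests instead of rewriting them.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Manifest:
    data_files: tuple  # paths of immutable data files

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple   # all manifests visible at this snapshot

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)

    def append(self, new_files):
        """Commit an append: new snapshot = old manifests + one new manifest."""
        prev = self.snapshots[-1].manifests if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1,
                        prev + (Manifest(tuple(new_files)),))
        self.snapshots.append(snap)
        return snap

t = TableMetadata()
t.append(["f1.parquet"])
s2 = t.append(["f2.parquet"])
print(len(t.snapshots))                        # 2 (both still readable)
print([m.data_files for m in s2.manifests])    # old manifest reused
```

Because old snapshots remain in the metadata, readers get consistent point-in-time views ("time travel") without any coordination with writers.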

Talk # 3 Postgres as search and personalization engine

We at Instacart have been running a large-scale search and recommender system in Postgres. We send trained models' metadata and feature data to Postgres and do inference inside PG. Because it is a Postgres cluster, it handles a workload that combines multiple table joins, full-text search, and personalized ranking via embeddings. The cluster is self-hosted; replicas receive WAL shipped by pgBackRest. The applications can survive primary loss by serving stale data, and replica loss by banning individual nodes.
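As a rough illustration of the "personalized ranking via embeddings" part of the abstract, the sketch below scores candidate items by the dot product between a user embedding and each item embedding. The names and vectors are invented for illustration; the talk's actual system runs this kind of inference inside Postgres.

```python
def rank(user_vec, items):
    """Return item ids sorted by dot-product similarity to the user embedding."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sorted(items, key=lambda item: dot(user_vec, items[item]), reverse=True)

# Hypothetical embeddings (in practice these would live in Postgres tables)
user = [0.9, 0.1, 0.0]
items = {
    "bananas": [0.8, 0.2, 0.1],
    "soap":    [0.1, 0.0, 0.9],
    "apples":  [0.7, 0.3, 0.0],
}
print(rank(user, items))  # ['bananas', 'apples', 'soap']
```

Inside Postgres the same scoring could be expressed in SQL over joined feature tables, which is what keeps joins, full-text search, and ranking in one system.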
