Past Meetup

Complex data queries in distributed + Handling billions of rows in Postgres

Hosted by Big Data Spain

74 people went

Details

RUNNING COMPLEX DATA QUERIES IN A DISTRIBUTED SYSTEM

It is getting increasingly hard to store and retrieve data efficiently. The first distributed databases put the entire burden of sharding on the application code. Smarter solutions now handle most of the data distribution and resilience tasks inside the database.

This talk will give an overview of some challenges:
- how are queries other than primary-key lookups organized and executed in a distributed system so that they run as efficiently as possible?
- how do contemporary distributed databases achieve transactional semantics for non-trivial operations that affect several shards or servers?

Jan will discuss the solutions that some open source distributed databases have chosen to address these challenges.

Speaker: Jan Steemann, Senior Developer
Jan is a senior C/C++ developer on the ArangoDB core team, where he has worked since version 0.1. He works mostly on performance optimization, storage engines and the querying functionality.

ADVANCED TECHNIQUES TO HANDLE BILLIONS OF ROWS IN POSTGRES

Distributed systems and databases are a powerful option when dealing with high volumes of data, but they come at a cost in complexity and infrastructure. Setting up and maintaining such systems is not trivial, and the learning curve to understand the intricacies of most of them can be very steep. How do you proceed when the technical team in your startup is small and wants to keep costs as low as possible?

Postgres is a powerful relational database in constant evolution, supported by a vibrant community. Recent versions add new features and improve existing ones, making it possible to process large volumes of data in a non-distributed environment. We will go through some of the techniques we use at Geoblink to power our Location Analysis tool with Postgres, such as Foreign Data Wrappers, two-level partitions, the JSONB data type and others. These allow us to implement simple solutions for some of our Big Data needs on uncomplicated, single-node systems.

Speakers: Miguel Ángel Fajardo & Guillermo Sánchez-Valdepeñas

Miguel Ángel is the CTO of Geoblink, a Spanish startup that is changing how businesses harness the power of data through Location Analysis. He has extensive experience leading teams and pushing for innovation on both sides of the Atlantic, across industries such as telecom, media, e-commerce and videogames.

Guillermo is a Data Engineer at the Spanish startup Geoblink, transforming scattered information into powerful data to push the boundaries of Location Analysis. He loves Spark, Scala and distributed systems, and spends most of his time building pipelines to support the data transformation processes that provide insights to the users of the Geoblink app.