Building a best-in-class Data Lake on AWS and Azure


Ryan Murray, Principal Consulting Engineer at Dremio



Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. However, there are still a number of key challenges when it comes to building a cloud-based data lake. Most data in the cloud doesn't start in S3 or ADLS. Instead, it's stored in a variety of data sources, ranging from relational databases like Amazon RDS and Azure SQL DB to NoSQL databases like MongoDB and Elasticsearch. Logs often start their life in a data pipeline layer such as Kafka. The data also needs to be processed, explored, and analyzed using a variety of engines, including Spark, Impala, Athena, and Dremio. While an on-premises data lake is static, a cloud data lake enables these engines to run independently on a common storage layer, each with its own lifecycle and scale. Additionally, S3 and ADLS typically have higher latency than the Hadoop Distributed File System (HDFS), which introduces challenges for real-time workloads.

Ryan Murray explains how you can build data lakes in the cloud using S3 and ADLS as storage layers while leveraging multiple processing engines to address needs including batch processing, ad hoc data exploration, reporting, and ML and AI. In addition to exploring best practices, he provides several real-world examples from different industries.


Ryan Murray has been a principal consulting engineer in Dremio's professional services organization since July 2019. He previously worked in the financial services industry in roles ranging from bond trader to data engineering lead. Ryan holds a PhD in theoretical physics and is an active open source contributor who dislikes it when data isn't accessible within an organization. He is passionate about making customers successful and self-sufficient, and still dreams of one day winning the Stanley Cup.