March Meeting: Understanding File Formats within the Lakehouse

Hosted By
Chris H. and Keith T.

Details

The Albuquerque SQL Server User Group (ABQSQL) will be having its next meeting on Friday, March 14th from 11:30 AM to 1:00 PM online via Microsoft Teams.

This month John Miner, Senior Data Architect and Microsoft Data Platform MVP, will be teaching us about the different types of storage within Microsoft Fabric. Please join us for his presentation “Understanding File Formats within the Lakehouse”. Ask John your data questions, and get to know your peers!

Presentation description: "Microsoft Fabric has OneLake Storage at the center of all services. Storage is based upon existing Azure Data Lake Storage and can be accessed with tools that you are familiar with. Many different file formats have been used over time. Understanding the pros and cons of each file type is important.

Most of the talk will be centered around five years of S&P 500 stock data stored in comma separated values (CSV) files by year and stock symbol. However, there are other formats that you might encounter when working with customers. Web services typically use a JSON document as the input and/or output to REST API calls. Projects from the Apache Software Foundation came up with three different file formats: AVRO shines at data deserialization for RPC calls, ORC is suited for Hadoop processing, and PARQUET is optimized for Spark processing. There are edge cases in which a file is in a special format. One can always use the TEXT format to parse out the data.

None of the above formats supports the ACID properties of a database. That is why Databricks developed the DELTA file format, which was open sourced in 2019. This format is the foundation of most files in OneLake.

The Fabric Lakehouse is an implementation of Apache Spark. One can read and write all of these file formats using Spark dataframes. Additionally, one can create either managed (INTERNAL) or unmanaged (EXTERNAL) tables in the hive catalog. Only managed tables are accessible by the SQL endpoint at this time, and they should be used in most cases. One cool feature is the shorthand notation in Spark SQL to read a file format given a directory. This can be used as input to the CREATE TABLE AS SELECT (CTAS) statement to create managed tables.

At the end of this talk, the developer will have a full understanding of all the file formats that can be managed by Fabric."
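For a taste of what the session will cover, here is a minimal PySpark sketch of the CSV-to-Delta flow described in the abstract. It assumes a Fabric notebook (or any Spark environment with Delta support), and the paths, schema, and column names are hypothetical placeholders rather than the actual demo files:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, DoubleType, LongType)

spark = SparkSession.builder.appName("lakehouse-formats-demo").getOrCreate()

# Declaring the schema up front avoids a second pass over the CSV to infer types.
schema = StructType([
    StructField("symbol", StringType()),
    StructField("trade_date", DateType()),
    StructField("open", DoubleType()),
    StructField("high", DoubleType()),
    StructField("low", DoubleType()),
    StructField("close", DoubleType()),
    StructField("volume", LongType()),
])

# Read a whole directory of yearly CSV files into a single dataframe.
df = spark.read.csv("Files/raw/sp500/", header=True, schema=schema)

# Rewrite the same data in the DELTA format; its transaction log is what adds
# the ACID guarantees that CSV, JSON, AVRO, ORC, and PARQUET lack on their own.
df.write.format("delta").mode("overwrite").save("Tables/sp500_prices")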
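The shorthand notation and CTAS pattern mentioned in the abstract can be sketched in the same hypothetical session; the parquet directory and table name are again made up for illustration:

# Spark SQL shorthand: query files of a given format directly by directory,
# with no table definition required.
spark.sql("SELECT * FROM parquet.`Files/staging/sp500/` LIMIT 10").show()

# Feed the same shorthand into CREATE TABLE AS SELECT (CTAS) to register a
# managed (INTERNAL) table in the hive catalog, the kind of table the SQL
# endpoint can read.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sp500_prices
    AS SELECT * FROM parquet.`Files/staging/sp500/`
""")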

Attend this FREE data platform training event!

(The Teams link will be made available Friday morning roughly two hours before the presentation.)
