Microsoft Fabric places OneLake storage at the center of all of its services. OneLake is built on top of the existing Azure Data Lake Storage Gen2 service and can be accessed with tools that you are already familiar with. Many different file formats have been used in data lakes over time, and understanding the pros and cons of each file type is important.
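Because OneLake speaks the same ADLS Gen2 APIs, existing Spark code only needs a OneLake URI. Here is a minimal sketch, assuming a Fabric notebook where the `spark` session is predefined; the workspace and lakehouse names are placeholders.

```python
# A minimal sketch: "MyWorkspace" and "MyLakehouse" are placeholder names.
# OneLake exposes the same ABFS endpoint pattern as Azure Data Lake Storage
# Gen2, so familiar Spark code works unchanged against it.
path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/raw/sample.csv"
)
df = spark.read.csv(path, header=True, inferSchema=True)  # spark: the notebook session
df.show(5)
```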
I will be exploring several different datasets during the talk: zip files, stock data, earthquake data, NASA website data, and the Fisher Iris dataset.
Our exploration starts with the widely used CSV format; however, there are many other formats that you might encounter. Web services typically use a JSON document as the input and/or output of REST API calls. Apache Software Foundation projects produced three more file formats: AVRO shines at data serialization for RPC calls, ORC is suited to Hadoop processing, and PARQUET is optimized for Spark processing.
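To make the comparison concrete, here is a minimal PySpark sketch that reads a CSV file and writes it back out in the other formats. The paths are placeholders and assume a default lakehouse is attached to the notebook.

```python
# Placeholder paths; assumes a Fabric notebook with a default lakehouse attached.
df = spark.read.csv("Files/raw/stocks.csv", header=True, inferSchema=True)

df.write.mode("overwrite").json("Files/out/stocks_json")        # web-service friendly
df.write.mode("overwrite").orc("Files/out/stocks_orc")          # Hadoop/Hive lineage
df.write.mode("overwrite").parquet("Files/out/stocks_parquet")  # columnar, Spark's default

# AVRO ships as the separate spark-avro module; Fabric includes it, but other
# Spark environments may need the package added to the session.
df.write.mode("overwrite").format("avro").save("Files/out/stocks_avro")
```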
There are edge cases in which a file arrives in a special format; one can always fall back to the TEXT format and parse the data row by row. None of the formats above supports the ACID properties of a database.
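For those edge cases, a sketch like the following shows the row-by-row approach, using a made-up pipe-delimited log file as the input: spark.read.text loads each line into a single value column and leaves all parsing to you.

```python
from pyspark.sql.functions import col, split

# Placeholder path to a hypothetical pipe-delimited log file.
raw = spark.read.text("Files/raw/custom.log")  # one row per line, column "value"
parts = split(col("value"), r"\|")
parsed = raw.select(
    parts.getItem(0).alias("event_time"),
    parts.getItem(1).alias("severity"),
    parts.getItem(2).alias("message"),
)
parsed.show(5, truncate=False)
```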
That is why Databricks developed the DELTA file format, which was open sourced in 2019. DELTA is the default table format in Fabric and the foundation for most data stored in OneLake.
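The sketch below illustrates what ACID buys you, using placeholder data and table names: every write to a DELTA table is an atomic, logged transaction, and the transaction log also enables time travel.

```python
from delta.tables import DeltaTable

# Placeholder data standing in for the stock dataset.
df = spark.createDataFrame([("MSFT", 420.0), ("AAPL", 190.0)], ["symbol", "close"])
df.write.format("delta").mode("overwrite").save("Tables/stocks")

# An UPDATE runs as one ACID transaction; readers never see a half-applied change.
stocks = DeltaTable.forPath(spark, "Tables/stocks")
stocks.update(condition="symbol = 'MSFT'", set={"close": "close * 1.10"})

# The transaction log also supports time travel back to earlier versions.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/stocks")
```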
The Fabric Lakehouse is built on Apache Spark and can contain both managed and unmanaged tables. I suggest using managed tables, since they are supported by both the SQL Analytics endpoint and the semantic model.
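Here is a short sketch contrasting the two kinds of tables; all names are placeholders. Spark treats a table created with an explicit path option as unmanaged (external).

```python
df = spark.createDataFrame([("MSFT", 420.0)], ["symbol", "close"])

# Managed: Spark owns both metadata and data files; in a lakehouse the table
# lands under Tables/ and is visible to the SQL Analytics endpoint and the
# semantic model.
df.write.format("delta").mode("overwrite").saveAsTable("stocks_managed")

# Unmanaged (external): the metastore entry points at a path you control;
# dropping the table leaves the underlying files in place.
df.write.format("delta").mode("overwrite") \
    .option("path", "Files/external/stocks") \
    .saveAsTable("stocks_unmanaged")
```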
At the end of this talk, the developer will have a full understanding of all the file formats that can be managed by Fabric.