Tools with Apache Arrow and Dask: Uwe Korn and Florian Jetter
Uwe Korn (QuantCo): (Efficient) Data Exchange with "Foreign" Ecosystems
Florian Jetter (Blue Yonder – JDA Software): Kartothek – Table Management for Cloud Object Stores Powered by Apache Arrow and Dask

Detailed abstracts below!

----

Agenda:

17:45 – Doors open
18:00 – Welcome / Opening with Paul Hilmer (DENIC eG) / NumFOCUS introduction by Alexander Hendorf (Königsweg)
18:15 – Uwe Korn (QuantCo): (Efficient) Data Exchange with "Foreign" Ecosystems
19:15 – Break with refreshments
19:45 – Florian Jetter (Blue Yonder – JDA Software): Kartothek – Table Management for Cloud Object Stores Powered by Apache Arrow and Dask
20:30 – Lightning talks
21:15 – End

Lightning talks welcome! Please reach out if you want to give one [masked].

Space is limited – please release your spot if you cannot make it.

Thanks a lot to the speakers, KÖNIGSWEG for organizing, and DENIC eG for hosting this PyData Frankfurt.

This event will be in English.

Any questions or suggestions? Please feel free to ping us via Meetup or [masked], or join our Telegram group: https://t.me/joinchat/CeKOXBACWgvtkjpz8z7hQA

----

Uwe Korn: (Efficient) Data Exchange with "Foreign" Ecosystems

As Data Scientists and Engineers in Python, we focus on solving problems with large amounts of data while staying in Python. This is where we are most effective and feel comfortable. Libraries like pandas and NumPy provide efficient interfaces for dealing with this data while still delivering optimal performance. The main problem appears when we have to deal with systems outside of our comfortable ecosystem: we need to write cumbersome and mostly slow conversion code that ingests data from there into our pipeline before we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings a huge performance improvement.
As part of the talk, we will show a basic problem where we take data coming from a Java application into Python using these tools.

Uwe is a Data & Machine Learning Engineer and open source developer. While he started his career building machine learning models, he quickly moved into developing the underlying platform needed to build and run successful data products. His main focus is on building software artifacts and a culture that supports highly effective collaboration between data scientists and engineers. Through his work on efficient data interchange between systems, he became a core committer on the Apache Parquet and Apache Arrow projects.

----

Florian Jetter: Kartothek – Table Management for Cloud Object Stores Powered by Apache Arrow and Dask

Storing and processing data efficiently is an integral part of successful data-driven applications. An efficient and scalable way to store big data is to use the object stores of public cloud providers such as ABS, S3, or GCS. These stores come with downsides that make managing tabular data distributed over many objects a non-trivial task. Kartothek is a recently open-sourced Python library that we develop and use at Blue Yonder – JDA Software to manage tabular data in cloud object stores. It is built on Apache Arrow and Apache Parquet and is powered by Dask. Its specification is compatible with the de-facto standard storage layouts used by other big data processing tools such as Apache Spark and Hive, but offers a native, seamless integration into the Python ecosystem.

Florian Jetter started his career at Blue Yonder – JDA Software by building and running machine learning models. Eventually, he got frustrated by slow and inefficient data pipelines and built libraries and tools to support his fellow Data Scientists and Engineers.
His focus quickly shifted from Machine Learning to Data Engineering, and he leverages the knowledge of both worlds to build a reliable data platform.