Mark your calendar for the next session of the PyData Paris Meetup, on November 27th, 2019. This Meetup will be hosted at Dataiku, 203 Rue de Bercy, Paris.
The speakers for this session are Nelle Varoquaux and Tim Hunter.
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Nelle Varoquaux (CNRS)
The lifecycle of open-source software: Mining GitHub to understand community dynamics
8:00pm - 8:45pm: Tim Hunter (Databricks)
Koalas: Making an Easy Transition from Pandas to Apache Spark
8:45pm - 9:30pm: Standing buffet
*Nelle Varoquaux: The lifecycle of open-source software: Mining GitHub to understand community dynamics*
GitHub (https://www.github.com) is a code-sharing platform used by many open-source software developers to coordinate the creation and maintenance of software. Because open-source software projects are often maintained by communities of volunteers working to sustain shared (and often vital) software infrastructure, the ability of these communities to attract and maintain new members is vital; otherwise, these projects would languish as existing community members leave. Here, my collaborators and I analyze community members' interactions on GitHub to understand the social dynamics that make communities more welcoming—or more hostile—to newcomers.
In this talk, I will present the data collection and statistical modeling we use in this project, introducing a mixture of Python and R tools.
Joint work with
- Alexandra Paxton, from University of Connecticut
- R. Stuart Geiger, from University of California, Berkeley
- Chris Holdgraf, from University of California, Berkeley
*Tim Hunter: Koalas: Making an Easy Transition from Pandas to Apache Spark*
In this talk, I will present Koalas, an open-source project that aims to bridge the gap between big data and small data for data scientists, and to simplify Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists use to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
Today, when data scientists work with very large data sets, they have to either migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas
I am a research faculty member at GEM and BCM, in the TIMC laboratory in Grenoble. I am interested in machine learning and causal inference methods to better understand gene regulatory networks, with a particular focus on how the 3D structure of the genome affects and is affected by gene regulation. I am also involved in scientific computing activities. In particular, I am a contributor to scientific Python software, including scikit-learn (machine learning in Python) and matplotlib (a Python 2D plotting library).
Tim Hunter is a software engineer at Databricks and a co-creator of the Koalas project. He holds an engineering degree from Ecole Polytechnique and a Ph.D. in Computer Science from UC Berkeley. He contributes to the Apache Spark MLlib project, as well as the GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed machine learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.