- PyData Paris - November 27th Meetup
Mark your calendar for the next session of the PyData Paris Meetup, November 27th 2019. This Meetup will be hosted at Dataiku, 203 Rue de Bercy, Paris. The speakers for this session are Nelle Varoquaux and Tim Hunter.

Schedule
-------------
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Nelle Varoquaux (CNRS) - The lifecycle of open-source software: Mining GitHub to understand community dynamics
8:00pm - 8:45pm: Tim Hunter (Databricks) - Koalas: Making an Easy Transition from Pandas to Apache Spark
8:45pm - 9:30pm: Standing buffet

Abstracts
-------------
*Nelle Varoquaux: The lifecycle of open-source software: Mining GitHub to understand community dynamics*

GitHub (https://www.github.com) is a code-sharing platform used by many open-source software developers to coordinate the creation and maintenance of software. Because open-source software projects are often maintained by communities of volunteers working to sustain shared (and often vital) software infrastructure, the ability of these communities to attract and retain new members is essential; otherwise, these projects would languish as existing community members leave. Here, my collaborators and I analyze community members' interactions on GitHub to understand the social dynamics that make communities more welcoming—or more hostile—to newcomers. In this talk, I will present the data collection and statistical modeling we use in this project, introducing a mixture of Python and R tools.

Joint work with:
- Alexandra Paxton, University of Connecticut
- R. Stuart Geiger, University of California, Berkeley
- Chris Holdgraf, University of California, Berkeley

*Tim Hunter: Koalas: Making an Easy Transition from Pandas to Apache Spark*

In this talk, I will present Koalas, an open-source project that aims to bridge the gap between big data and small data for data scientists, and to simplify Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically a data scientist's first step to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle. When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark, or downsample their data so that they can use pandas.

This presentation will give a deep dive into the conversion between Spark and pandas dataframes. Through live demonstrations and code samples, you will understand:
– how to effectively leverage both pandas and Spark inside the same code base
– how to leverage powerful pandas concepts such as lightweight indexing with Spark
– technical considerations for unifying the different behaviors of Spark and pandas

Bios
------
Nelle Varoquaux
-----------------------
I am a research faculty member at GEM and BCM, in the TIMC laboratory in Grenoble. I am interested in machine learning and causal inference methods to better understand gene regulatory networks, with a particular focus on how the 3D structure of the genome affects and is affected by gene regulation. I am also involved in scientific computing activities. In particular, I am a contributor to scientific Python software packages including scikit-learn (machine learning in Python) and matplotlib (a Python 2D plotting library).

Tim Hunter
---------------
Tim Hunter is a software engineer at Databricks and the co-creator of the Koalas project. He holds an engineering degree from Ecole Polytechnique and a Ph.D. in Computer Science from UC Berkeley. He contributes to the Apache Spark MLlib project, as well as the GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed machine learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.
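As a minimal sketch of the pandas-to-Spark transition the talk describes: the idea behind Koalas is that familiar pandas code keeps working, with the Koalas equivalent shown here only in comments since it requires a Spark runtime (the data and column names are made up for illustration).

```python
import pandas as pd

# A typical pandas workflow: build a dataframe, group, aggregate.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon"],
    "temp": [12.0, 14.0, 11.0],
})
mean_by_city = df.groupby("city")["temp"].mean()

# With Koalas, the same operations can run on a Spark cluster by
# swapping the import (needs PySpark and a Spark session):
#   import databricks.koalas as ks
#   kdf = ks.DataFrame({...})           # distributed dataframe
#   kdf.groupby("city")["temp"].mean()  # executed by Spark
print(mean_by_city)
```

The point of the design is that the mental model (indexing, groupby, aggregation) carries over unchanged, so only the import line and the execution backend differ.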
- Digital Geoscience Hackathon
A hackathon in Paris focusing on open-source geoscience and data science. With funding from the EAGE, the Digital Geosciences Hackathon will bring together computer scientists, machine learning specialists and geoscience experts to explore new ideas for digitalization and automation in geophysical research, exploration and industrial production processes. You don’t need to code to come; all levels of experience are welcome.

1. Bring your own ideas and pet projects.
2. Form a team and develop an idea.
3. Present your solution with Voilà (why not?)

When: Friday 15th and Saturday 16th November 2019.
Where: Institut de Physique du Globe de Paris.
More info: https://digitalgeohack.github.io/

Tickets cost €12 (includes lunch and drinks both days) and can be bought on Eventbrite: https://www.eventbrite.co.uk/e/billets-digital-geosciences-hackathon-73786843435
- PyData Paris - October 3rd Meetup
Mark your calendar for the next session of the PyData Paris Meetup on October 3rd 2019. This Meetup will be hosted at Scaleway's headquarters, located at 11bis Rue Roquépine, Paris. The speakers for this session are Marianne Corvellec and Arnaud Wald.

Schedule
-------------
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Marianne Corvellec - Building custom analytics web apps for bioinformatics with Plotly’s Dash Bio
8:00pm - 8:30pm: Arnaud Wald - RAPIDS.ai: Leveraging GPUs for accelerated data science & data analytics
8:30pm - 9:30pm: Standing buffet sponsored by our host, Scaleway

Bios
------
Marianne Corvellec
Holding a PhD in statistical physics, Marianne works in industry as a data scientist and software developer. She is also a free software activist (with April.org) and an independent researcher (with IGDORE). Her research interests include data science, education, and assessment. Since 2013, she has been a regular speaker and contributor in the Python, Carpentries, and FLOSS communities.

Arnaud Wald
Arnaud Wald has been working as a machine learning engineer since he graduated from Centrale-Supélec in 2017. Now a member of Scaleway’s AI team, he keeps up to date with the latest advances in the field while working on practical solutions to make AI in the cloud more accessible.

Abstracts
-------------
Building custom analytics web apps for bioinformatics with Plotly’s Dash Bio

Dash Bio is a free and open-source library for creating customizable data dashboards in the life sciences (molecular biology, analytical chemistry, ...). These dashboards are interactive, reactive, web-based applications, which can be written in either Python or R. They let users perform bioinformatics-related tasks such as exploring, analyzing, and visualizing genomic data. First, we focus on the extensibility and Plotly compatibility of the library. Dash Bio is highly extensible, being one of several domain-specific component libraries in the Dash ecosystem.
Dash is a Python-based web framework for building analytics (data science) web applications. Running Flask in the backend and React.js in the frontend, it leverages the power of Plotly graphs. Dash has recently gained much interest and adoption in the scientific and business communities. We then discuss how Dash Bio fits in the bioinformatics space, from parsing utilities, at the lower level, to research use cases, at the higher level. Finally, we demo how to write a Dash app from scratch, in Python, using available Dash Bio components and the very elegant Plotly Express.

RAPIDS.ai - Leveraging GPUs for accelerated data science & data analytics

RAPIDS makes it possible to run end-to-end data science pipelines entirely on GPU architecture. It capitalizes on the parallelization capabilities of GPUs to accelerate data preprocessing pipelines, with a pandas-like dataframe syntax. GPU-optimized versions of scikit-learn algorithms are available, and RAPIDS also integrates with major deep learning frameworks. This talk will present RAPIDS, its capabilities, and how to integrate it into your pipelines.

Misc.
-------
We need all attendees to provide their full name as they register on meetup.com if it does not match their meetup username. Please bring a photo ID with you to access the event.
- PyData Paris - July 5th Meetup
Mark your calendar for the next session of the PyData Paris Meetup on July 5th 2019. This Meetup will be hosted at EDF's tower, located at 20 place de La Défense, 92050 Paris La Défense. The speakers for this session are Jean-Charles Vialatte and Sylvain Corlay.

For security reasons, we need attendees to register with their real first and last name and an email address. Remember to bring your ID (passport or identity card) in order to be admitted.

Schedule
-------------
6:00pm - 6:15pm: Community announcements
6:15pm - 7:00pm: Jean-Charles Vialatte - Warp 10: Combining the strengths of Python and Java to leverage time series and geo time series datasets
7:00pm - 7:45pm: Sylvain Corlay - Voilà: From Jupyter notebooks to standalone web applications and dashboards
8:00pm - 9:00pm: Standing buffet

Abstracts
-------------
- Warp 10: Combining the strengths of Python and Java to leverage time series and geo time series datasets

Warp 10 is a time series database, written in Java, with optional geo support. One of its key differentiating factors is WarpScript, which is not only a query language but also a full-fledged programming language tailored to ease time series processing. In this presentation, we will explain how Python and WarpScript can interoperate efficiently, using bridges built between the Python and Java ecosystems by libraries such as Py4J, Pyrolite and PySpark. We will see the benefits of doing so through examples and Jupyter notebooks.

- Et voilà! From Jupyter notebooks to standalone web applications and dashboards

The goal of Project Jupyter is to improve the workflows of researchers, educators, and other practitioners of scientific computing, from the exploratory phase of their work to the communication of the results. But interactive notebooks are not the best communication tool for all audiences.
While they have proven invaluable for providing a narrative alongside the source, they are not ideal for non-technical readers, who may be put off by the presence of code cells, or by the need to run the notebook to see the results. In this talk, we will present Voilà, a new dashboarding tool built upon Jupyter protocols and standard formats, meant to address these challenges and bridge that gap in the Jupyter ecosystem.

Bios
-------
- Jean-Charles Vialatte, Ph.D., works as a machine learning engineer at SenX. In December 2018, he successfully defended his thesis on "Convolution of Graph Signals" and "Deep Learning on Graph Domains" at IMT Atlantique. He obtained his engineering degree from the same institution in 2015.

- Sylvain Corlay is the founder and CEO of QuantStack. He holds a PhD in applied mathematics from University Paris VI. As an open-source developer, Sylvain is very involved with Project Jupyter and is a member of the project's steering committee. Together with the rest of the Jupyter Steering Council, he was honored with the 2017 ACM Software System Award for Jupyter. Beyond Jupyter, Sylvain contributes to a number of scientific computing open-source projects such as bqplot, xtensor, and Voilà. Sylvain founded QuantStack in September 2016. Prior to that, he was a quant researcher at Bloomberg and an adjunct faculty member at the Courant Institute and Columbia University. Besides QuantStack, Sylvain serves on the board of directors of the NumFOCUS foundation.
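As a usage sketch of the notebook-to-app workflow the Voilà talk describes (assuming Voilà is installed from PyPI; the notebook name below is a made-up example):

```shell
# Install Voilà (available on PyPI)
pip install voila

# Serve a notebook as a standalone web application:
# code cells are executed, but only their outputs and widgets are shown.
voila my_dashboard.ipynb

# Voilà can also serve a whole directory of notebooks:
voila --port=8867 notebooks/
```

The key design point is that no code needs to change: the same notebook remains usable in Jupyter for exploration and in Voilà as a dashboard.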
- PyData Paris - June 2019 Meetup
Mark your calendar for the next session of the PyData Paris Meetup on June 5th 2019! This Meetup is hosted by the Centre de Recherche Interdisciplinaire (CRI), 8 rue Charles V, 75004 Paris.

This is a special edition of our event series, organized on the occasion of a community workshop on Project Jupyter: a large number of developers of the Jupyter ecosystem will be in town. Hence we will have one main talk by our invited speaker, Emmanuelle Gouillart, followed by a series of lightning talks by participants in the Jupyter workshop.

Schedule
-------------
6:00pm - 6:15pm: Community announcements
6:15pm - 7:00pm: Emmanuelle Gouillart - Dash: a web framework for writing highly-tuned apps for data science in pure Python
7:00pm - 7:45pm: Jupyter community lightning talks (series of short presentations in rapid succession)
7:45pm - 8:30pm: Standing buffet

Abstract
------------
Dash (https://dash.plot.ly/) is an open-source Python web application framework developed by Plotly. Written on top of Flask, Plotly.js, and React.js, Dash is meant for building data visualization apps with highly custom user interfaces in pure Python. I will give a demo of how to write Dash apps in pure Python, starting from very simple apps and moving to more advanced ones, including reactive apps based on advanced data visualization. I will also discuss the performance and deployment of Dash apps.

Dash benefits from several component libraries, from the core components (e.g. sliders, radio buttons, file dialogs) to more custom and application-specific components, such as components for engineering or life sciences applications, or data tables. I will take the example of the dash-canvas library, which provides an interactive component for annotating images (e.g. with a freehand brush, lines, bounding boxes...). The library also provides utility functions for using user-provided annotations in several image processing tasks such as segmentation, transformation, measures, etc.
The latter functions are based on libraries such as scikit-image and OpenCV. A gallery of examples at https://dash-canvas.plotly.host/ showcases some typical uses of Dash for image processing. I will also mention how to write your own component libraries for custom Dash components.

Bio
----
Emmanuelle is a materials science researcher at Saint-Gobain, and a part-time developer at Plotly, where she works on image processing and documentation. She has been a core contributor to scikit-image for several years, and her interest in image processing was triggered by her use of 3D imaging of materials at high temperature. She recently created the dash-canvas library for integrating image annotation and processing into the Dash Python web framework. In software development, besides image processing, she is interested in documentation and in teaching scientific Python. She has been a co-organizer of the EuroSciPy conference for several years.
- PyData Paris - March 2019 Meetup
Mark your calendar for the next session of the PyData Paris Meetup on March 26th 2019. This Meetup will be hosted by the Conservatoire National des Arts et Métiers (Cnam), 292 rue Saint-Martin, 75003 Paris. The speakers for this session are Olivier Grisel, Sarah Diot-Girard, and Stephanie Bracaloni.

Schedule
-------------
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Olivier Grisel - Scikit-learn: what's new and what's under development
8:00pm - 8:45pm: Sarah Diot-Girard and Stephanie Bracaloni - From ML experiments to production: versioning and reproducibility with MLV-tools
8:45pm - 9:30pm: Standing buffet

Abstracts
-------------
* Olivier Grisel: Scikit-learn: what's new and what's under development *

Scikit-learn is one of the most popular machine learning libraries. This talk will present a selection of recently released features and introduce some new developments, including much more scalable models such as fast histogram-based gradient boosting decision trees, an efficient reimplementation of k-means, and much more.

* Sarah Diot-Girard and Stephanie Bracaloni: From ML experiments to production: versioning and reproducibility with MLV-tools *

You're a data scientist. You have a bunch of analyses you performed in Jupyter notebooks, but anything older than 2 months is totally useless, because nothing works right when you open the notebook again. Also, you cannot remember the dropout rate on the second-to-last layer of that convolutional neural network which gave really great results 2 weeks ago and that you now want to deploy into production. Does that ring a bell?

You're a software engineer in a data science team. You can’t imagine life without Git. Reviews of readable files, tests, code analysis, and CI are part of your daily routine. You used to think of Jupyter notebooks only as a demo tool. You need reproducibility for every step of your work, even if you lose a server.
And last but not least, you want to be able to deliver to production something usable by anyone. Is there a magical solution? No! But we can find compromises to satisfy these two worlds... We faced this kind of issue at PeopleDoc. Building on open-source solutions, we have developed a set of open-source tools and designed a process that works for us. We are thrilled to present our project, and we hope to spark a discussion with the community. See you on GitHub: https://github.com/peopledoc/ml-versioning-tools

Bios
------
Olivier Grisel is a core developer of scikit-learn, working at Inria and supported by the scikit-learn initiative at Fondation Inria: https://scikit-learn.fondation-inria.fr/

Sarah Diot-Girard has been working as a machine learning engineer since 2012, and she enjoys finding solutions to engineering problems using data science. She is particularly interested in practical issues, both ethical and technical, that come from applying ML in real life. In the past, she has given talks about data privacy and algorithmic fairness, and she also promotes a DataOps culture.

Stephanie Bracaloni has been working as a software engineer for more than 6 years. She now works on the industrialization of machine learning projects (from POC to production). She likes development, but she is not “just a coder”: she always keeps systems and projects in mind as a whole. Finding solutions to new problems or improving day-to-day processes is something she really enjoys.
- PyData Paris - January 2019 Meetup
Mark your calendar for the next session of the PyData Paris Meetup on January 21st 2019. This Meetup will be hosted by IPGP (Institut de Physique du Globe de Paris), rue Jussieu. The speakers for this session are Joris Van den Bossche and Viviane Pons.

Schedule
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Joris Van den Bossche - GeoPandas: easy, fast and scalable geospatial analysis in Python
8:00pm - 8:45pm: Viviane Pons - Teaching with Jupyter at Université Paris-Sud
8:45pm - 9:30pm: Standing buffet, catering offered by the InterRift ANR project at IPGP

Bios:
Joris Van den Bossche is an open-source Python enthusiast currently working at the Université Paris-Saclay Center for Data Science (at Inria), both working on data science projects and contributing to pandas and scikit-learn. Before that, Joris completed a PhD at Ghent University and VITO (Belgium) on air quality research. Joris regularly gives Python data analysis workshops. He is a core contributor to pandas and the maintainer of GeoPandas.

Viviane is a computer scientist and a faculty member at Université Paris-Sud (Orsay). Her research is at the boundary between theoretical computer science and mathematics, and involves a lot of exploratory computing. She teaches algorithmics and programming at both beginner and graduate levels. As an open-source developer, she contributes to the SageMath and OpenDreamKit projects. She also volunteers for the community by co-leading the Paris chapter of PyLadies with Anna-Livia Gomart.

Stay tuned for more details about the Meetup! https://twitter.com/pydataparis

--

Many thanks to the IPGP and the InterRift project for hosting the event and for the catering!
- PyData Paris - October 2018 Meetup
Mark your calendar for the next session of the PyData Paris Meetup on October 8th 2018. This Meetup will be hosted at CFM, rue de l'Université. The speakers for this session are Jessica Hamrick and Nicolas Thiéry, with an introduction by Laurent Laloux, Chief Product Officer at CFM.

Schedule
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Nicolas Thiéry - Modeling mathematics in Python & SageMath: some fun challenges
8:00pm - 8:45pm: Jessica Hamrick - Nbgrader: a tool for creating and grading assignments in the Jupyter notebook

Bios:
Jessica Hamrick is a Research Scientist at DeepMind in London, having recently completed her Ph.D. in Psychology at the University of California, Berkeley, working with Tom Griffiths. Previously, she received her M.Eng. in Computer Science from MIT, working with Josh Tenenbaum. Jessica's research focuses on model-based reasoning and planning, situated at the intersection of cognitive science, machine learning, and AI. In addition to research, Jessica is involved in several open-source projects, including Project Jupyter. She is a member of the Project Jupyter steering committee and the lead maintainer of nbgrader, a tool for grading Jupyter notebook assignments.

Nicolas M. Thiéry is a professor at the Laboratoire de Recherche en Informatique of Université Paris-Sud. His teaching ranges from introductory programming (with C++, in Jupyter) to computational methods in algebra (with SageMath, in Jupyter). His research lies at the borderline between mathematics and computer science, studying algebraic combinatorics with the help of computer exploration. He has been promoting software sharing for algebraic combinatorics since 2000 and contributing to SageMath since 2008. To help fund the computational math software and Jupyter ecosystems, he leads the OpenDreamKit European project ([masked]).
- PyData Paris - June 2018 Meetup
Mark your calendar for the next session of the PyData Paris Meetup on June 19th 2018. The speakers for this session are Tim Head and Tom Dupré la Tour.

Schedule
7:00pm - 7:15pm: Community announcements
7:15pm - 8:00pm: Binder - one-click sharing of your data science, by Tim Head

When other people want to run the code of the cool data project you did last week, you usually think "Great, someone cares!" and then "Oh no, now I need to play support desk till they get it running." The Binder project lets anyone run the contents of a git repository by clicking a link. For example, try out the latest JupyterLab demo by clicking this link. Binder lets you describe the dependencies of your repository in a way that allows a Docker container to be created from it automatically, removing the need for you to spend a lot of time helping others get your code to run. Some example uses:
- Reproduce and explore "Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations" by Ross et al. (https://mybinder.org/v2/gh/dtak/rrr/master?urlpath=lab).
- Learn about "Foundations of numerical computing" with Scott Sanderson (https://mybinder.org/v2/gh/ssanderson/foundations-of-numerical-computing/master?filepath=notebooks).
- Dive into Julia Evans’ "Pandas cookbook" (https://mybinder.org/v2/gh/jvns/pandas-cookbook/master).

I will tell you about the Binder project, how to use it to share your work, what the tools behind it are, and how you can join the team working on Binder.

8:00pm - 8:45pm: Nearest neighbors in scikit-learn estimators, API challenges, by Tom Dupré la Tour

Scikit-learn is a very popular machine learning library in Python. It is well known for its simple and elegant API, which has been reused in multiple other Python libraries. However, some parts of the library could still benefit from a better API. In particular, several scikit-learn estimators rely internally on nearest neighbors computations.
Yet they use different APIs, they can't use custom neighbors estimators, and during a grid search they recompute the nearest neighbors graph for each hyper-parameter. We will present ongoing work on improving their API, discussing implementation and deprecation challenges.

Bios:
Tim Head builds data-driven products for clients all around the world, from startups to UN organisations. His company www.wildtreetech.com specialises in digital products that leverage machine learning and in deploying custom JupyterHub setups. Tim contributes to the Binder project and helped create scikit-optimize. When he isn’t travelling, he trains for triathlons.

Tom Dupré la Tour is a third-year PhD student at Télécom ParisTech, interested in signal processing, machine learning and neural oscillations. He joined the core developer team of scikit-learn in 2015.
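As a sketch of how Binder discovers a repository's dependencies: adding a single `requirements.txt` file at the root of the repository is enough for Binder to build a Docker image (the package pins below are purely illustrative):

```text
# requirements.txt — read by Binder when building the image
numpy
pandas==0.23.0
matplotlib
```

The repository is then runnable by anyone at a link of the form `https://mybinder.org/v2/gh/<user>/<repo>/<branch>`, the same URL pattern used by the example links above.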
- Worldwide Pandas Documentation Sprint
On March 10th, the Pandas community is organizing a worldwide documentation sprint! https://python-sprints.github.io/pandas/

The Paris event will be held at the CFM (Capital Fund Management) headquarters. Please sign up here for the Paris sprint. Seats are limited! Only 14 attendees will be selected to participate in the event (knowing Python, Pandas and Git is required to be able to contribute).

• What we'll do
Contributors throughout the world are going to improve Pandas' documentation. Each contributed hour has the potential to transform countless collective hours of difficulty into as many hours of productive work. This is a great opportunity to learn from fellow programmers, to learn more about Pandas, and to have a significant impact on data science.

• What to bring
A laptop, ideally with Python (2.7, 3.5 or 3.6) installed, along with the Pandas library and Git.

• Miscellaneous
Capital Fund Management will provide breakfast (served upon arrival), coffee and tea, pizzas for lunch, and a wifi connection. Pair programming will be encouraged.
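Pandas docstrings follow the numpydoc format, so sprint contributions largely consist of filling in sections like the ones below; the function itself is a made-up example, not part of Pandas:

```python
def clip_values(values, lower, upper):
    """Limit each value to the interval [lower, upper].

    Parameters
    ----------
    values : list of float
        The values to clip.
    lower : float
        Minimum allowed value.
    upper : float
        Maximum allowed value.

    Returns
    -------
    list of float
        The clipped values, in the original order.

    Examples
    --------
    >>> clip_values([1.0, 5.0, 9.0], 2.0, 8.0)
    [2.0, 5.0, 8.0]
    """
    return [min(max(v, lower), upper) for v in values]
```

The Examples section doubles as a doctest, which is one reason each improved docstring pays off: it documents and tests the function at the same time.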