• Python in the web browser

    Online event

    The PyData Paris Meetup is back!

    In this first installment since the pandemic, we are holding a special session about Python in the web browser, with

    - a keynote by Roman Yurchak, the maintainer of Pyodide,
    - a presentation by Romain Casati and Nicolas Poulain on Basthon and Capytale,
    - a presentation by Jeremy Tuloup about JupyterLite.

    Mark your calendars for this online event, and stay tuned for more information!

    Schedule
    -------------

    7pm - 7:15pm: Community announcements

    7:15pm - 8pm: Roman Yurchak (Keynote)
    "Python and the scientific stack compiled to WebAssembly with Pyodide"

    8pm - 8:30pm: Romain Casati and Nicolas Poulain
    "A new code learning platform in french education"

    8:30pm - 9pm: Jeremy Tuloup
    "JupyterLite: a JupyterLab distribution that runs entirely in the browser"

    9pm - 9:30pm: Lightning talks

  • PyData Paris - November 27th Meetup

    Dataiku

    Mark your calendar for the next session of the PyData Paris Meetup, November 27th 2019. This Meetup will be hosted at Dataiku, 203 Rue de Bercy, Paris.

    The speakers for this session are Nelle Varoquaux and Tim Hunter.

    Schedule
    -------------

    7:00pm - 7:15pm: Community announcements

    7:15pm - 8:00pm: Nelle Varoquaux (CNRS)
    The lifecycle of open-source software: Mining GitHub to understand community dynamics

    8:00pm - 8:45pm: Tim Hunter (Databricks)
    Koalas: Making an Easy Transition from Pandas to Apache Spark

    8:45pm - 9:30pm: Standing buffet

    Abstracts
    -------------

    *Nelle Varoquaux: The lifecycle of open-source software: Mining GitHub to understand community dynamics*

    GitHub (https://www.github.com) is a code-sharing platform used by many open-source software developers to coordinate the creation and maintenance of software. Because open-source software projects are often maintained by communities of volunteers working to sustain shared (and often vital) software infrastructure, the ability of these communities to attract and retain new members is crucial; otherwise, these projects would languish as existing community members leave. Here, my collaborators and I analyze community members' interactions on GitHub to understand the social dynamics that make communities more welcoming—or more hostile—to newcomers.

    In this talk, I will present the data collection and statistical modeling we use in this project, introducing a mixture of Python and R tools.

    Joint work with
    - Alexandra Paxton, from University of Connecticut
    - R. Stuart Geiger, from University of California, Berkeley
    - Chris Holdgraf, from University of California, Berkeley

    *Tim Hunter: Koalas: Making an Easy Transition from Pandas to Apache Spark*

    In this talk, I will present Koalas, an open-source project that aims at bridging the gap between big data and small data for data scientists, and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.

    Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists reach for when exploring and manipulating a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.

    When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.

    Through live demonstrations and code samples, you will understand:
    - how to effectively leverage both pandas and Spark inside the same code base
    - how to leverage powerful pandas concepts such as lightweight indexing with Spark
    - technical considerations for unifying the different behaviors of Spark and pandas
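
    As a small taste of the idea, here is a sketch only, using the databricks.koalas package as it existed at the time of the talk (the project has since been merged into PySpark as pyspark.pandas):

        import pandas as pd
        import databricks.koalas as ks  # pip install koalas; requires a Spark installation

        # A pandas DataFrame that fits in memory
        pdf = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"], "sales": [10, 3, 7]})

        # The same pandas-style operations, but executed by Spark under the hood
        kdf = ks.from_pandas(pdf)
        print(kdf.groupby("city")["sales"].sum())

        # Drop down to a native Spark DataFrame when needed
        sdf = kdf.to_spark()
        sdf.show()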

    Bios
    ------

    Nelle Varoquaux
    -----------------------

    Marianne holds a PhD in statistical physics and works in industry as a research faculty member at GEM and BCM, in the TIMC laboratory in Grenoble. I am interested in machine learning and causal inference methods to better understand gene regulatory networks, with a particular focus on how the 3D structure of the genome affects and is affected by gene regulation. I am also involved in scientific computing activities. In particular, I am a contributor to scientific Python software such as scikit-learn (machine learning in Python) and Matplotlib (a Python 2D plotting library).

    Tim Hunter
    ---------------

    Tim Hunter is a software engineer at Databricks and the co-creator of the Koalas project. He holds an engineering degree from Ecole Polytechnique and a Ph.D. in Computer Science from UC Berkeley. He contributes to the Apache Spark MLlib project, as well as to the GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed machine learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.

  • Digital Geoscience Hackathon

    Institut de Physique du Globe de Paris

    A hackathon in Paris focusing on open-source geoscience and data science. With funding from the EAGE, the Digital Geosciences Hackathon will bring together computer scientists, machine learning specialists and geoscience experts to explore new ideas for digitalization and automation in geophysical research, exploration and industrial production processes. You don't need to be able to code to attend; all levels of experience are welcome.

    1. Bring your own ideas and pet projects.
    2. Form a team and develop an idea.
    3. Present your solution with Voilà (why not?)

    When: Friday 15th and Saturday 16th November 2019.
    Where: Institut de Physique du Globe de Paris.
    More info: https://digitalgeohack.github.io/

    Tickets cost €12 (including lunch and drinks on both days) and can be bought on Eventbrite:
    https://www.eventbrite.co.uk/e/billets-digital-geosciences-hackathon-73786843435

  • PyData Paris - October 3rd Meetup

    Scaleway

    Mark your calendar for the next session of the PyData Paris Meetup on October 3rd 2019. This Meetup will be hosted at Scaleway's headquarters, located at 11bis Rue Roquépine, Paris.

    The speakers for this session are Marianne Corvellec and Arnaud Wald.

    Schedule
    -------------

    7:00pm - 7:15pm: Community announcements
    7:15pm - 8:00pm: Marianne Corvellec:
    Building custom analytics web apps for bioinformatics with Plotly’s Dash Bio
    8:00pm - 8:30pm: Arnaud Wald
    RAPIDS.ai - Leveraging GPUs for accelerated data science & data analytics
    8:30pm - 9:30pm: Standing buffet sponsored by our host, Scaleway

    Bios
    ------

    Marianne Corvellec

    Marianne holds a PhD in statistical physics and works in industry as
    a data scientist and software developer. She is also a free software
    activist (with April.org) and an independent researcher (with
    IGDORE). Her research interests include data science, education, and
    assessment. Since 2013, she has been a regular speaker and
    contributor in the Python, Carpentries, and FLOSS communities.

    Arnaud Wald

    Arnaud Wald has been working as a Machine Learning Engineer since he graduated from Centrale-Supélec in 2017. Now a member of Scaleway’s AI team, he can keep up to date with the latest advances in the field while working on practical solutions to make AI in the cloud more accessible.

    Abstracts
    -------------

    Building custom analytics web apps for bioinformatics with Plotly’s Dash Bio

    Dash Bio is a free and open-source library for creating customizable
    data dashboards in the field of the life sciences (molecular biology,
    analytical chemistry, ...). These dashboards are interactive,
    reactive, web-based applications, which can be written in either
    Python or R. They let users perform bioinformatics-related tasks such
    as exploring, analyzing, and visualizing genomic data.
    First, we focus on the extensibility and Plotly compatibility of the
    library. Dash Bio is highly extensible, being one of several
    domain-specific component libraries in the Dash ecosystem. Dash is a
    Python-based web framework for building analytics (data science) web
    applications. Running Flask in the backend and React.js in the
    frontend, it leverages the power of Plotly graphs. Dash has recently
    gained much interest and adoption in the scientific and business
    communities.

    We then discuss how Dash Bio fits in the bioinformatics space, from
    parsing utilities, at the lower level, to research use cases, at the
    higher level. Finally, we demo how to write a Dash app from scratch,
    in Python, using available Dash Bio components and the very elegant
    Plotly Express.
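
    As a rough sketch of what such an app looks like (assuming the Dash 1.x package layout from the time of the talk; recent Dash versions expose html as dash.html), using one of the standard Dash Bio components:

        import dash
        import dash_html_components as html  # in recent Dash: from dash import html
        import dash_bio

        app = dash.Dash(__name__)

        # A single Dash Bio component: a default (human) chromosome ideogram
        app.layout = html.Div([
            html.H3("Dash Bio demo"),
            dash_bio.Ideogram(id="my-ideogram"),
        ])

        if __name__ == "__main__":
            app.run_server(debug=True)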

    RAPIDS.ai - Leveraging GPUs for accelerated data science & data analytics

    RAPIDS makes it possible to have end-to-end data science pipelines run entirely on GPUs. It capitalizes on the parallelization capabilities of GPUs to accelerate data preprocessing pipelines, with a pandas-like dataframe syntax. GPU-optimized versions of scikit-learn algorithms are available, and RAPIDS also integrates with major deep learning frameworks. This talk will present RAPIDS and its capabilities, and how to integrate it into your pipelines.
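
    A minimal sketch of the pandas-like syntax (it assumes a RAPIDS installation and an NVIDIA GPU):

        import cudf  # part of RAPIDS; requires an NVIDIA GPU

        # A GPU DataFrame with a pandas-like API
        gdf = cudf.DataFrame({"group": ["a", "b", "a", "b"], "x": [1.0, 2.0, 3.0, 4.0]})

        # Familiar operations, executed on the GPU
        print(gdf.groupby("group")["x"].mean())

        # Convert back to pandas for tools that expect a CPU DataFrame
        pdf = gdf.to_pandas()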

    Misc.
    -------

    All attendees must provide their full name when registering on meetup.com if it does not match their Meetup username. Please bring a photo ID with you to access the event.

  • PyData Paris - July 5th Meetup

    Tour EDF

    Mark your calendar for the next session of the PyData Paris Meetup on July 5th 2019. This Meetup will be hosted at EDF's tower, located at 20 Place de La Défense, 92050 Paris La Défense.

    The speakers for this session are Jean-Charles Vialatte and Sylvain Corlay.

    For security reasons, attendees must register with their real first and last name and an email address. Remember to bring your ID (passport / carte d'identité) in order to be allowed in.

    Schedule
    -------------

    6:00pm - 6:15pm: Community announcements
    6:15pm - 7:00pm: Jean-Charles Vialatte:
    Warp 10: Combining the strengths of Python and Java to leverage time series and geo time series datasets.
    7:00pm - 7:45pm: Sylvain Corlay:
    Voilà: From Jupyter notebooks to standalone web applications and dashboards.
    8:00pm - 9:00pm: Standing buffet

    Abstracts
    -------------

    - Warp 10: Combining the strengths of Python and Java to leverage time series and geo time series datasets:
    Warp 10 is a time series database with optional geo support, written in Java. One of its key differentiating factors is WarpScript, which is not only a query language but also a full-fledged programming language tailored to ease time series processing. In this presentation, we will explain how Python and WarpScript can interoperate efficiently, using bridges built between the Python and Java ecosystems by libraries such as Py4J, Pyrolite and PySpark. We will see the benefits of doing so through examples and Jupyter notebooks.
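
    To give a flavour of the bridging mechanism, here is a generic Py4J sketch (not the actual Warp 10 integration presented in the talk; it assumes a JVM exposing a py4j GatewayServer is already running):

        from py4j.java_gateway import JavaGateway

        # Connect to a JVM process that exposes a py4j GatewayServer
        gateway = JavaGateway()

        # Instantiate and call Java objects directly from Python
        numbers = gateway.jvm.java.util.ArrayList()
        numbers.add(1)
        numbers.add(2)
        print(numbers)  # [1, 2]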

    - Et voilà! From Jupyter notebooks to standalone web applications and dashboards:
    The goal of Project Jupyter is to improve the workflows of researchers, educators, and other practitioners of scientific computing, from the exploratory phase of their work to the communication of the results.
    But interactive notebooks are not the best communication tool for all audiences. While they have proven invaluable for providing a narrative alongside the source, they are not ideal for non-technical readers, who may be put off by the presence of code cells or by the need to run the notebook to see the results.
    In this talk, we will present Voilà, a new dashboarding tool built upon Jupyter protocols and standard formats, meant to address these challenges and bridge that gap in the Jupyter ecosystem.
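
    For reference, the basic usage is a single command (a sketch; the notebook name is a placeholder):

        pip install voila
        voila my_notebook.ipynb  # serves the notebook as a standalone web app, hiding the code cells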

    Bios
    -------

    - Jean-Charles Vialatte, Ph.D., works as a machine learning engineer at SenX. In December 2018, he successfully defended his thesis on "Convolution of Graph Signals" and "Deep Learning on Graph Domains" at IMT Atlantique. He obtained his engineering degree from the same institution in 2015.

    - Sylvain Corlay is the founder and CEO of QuantStack. He holds a PhD in applied mathematics from University Paris VI.
    As an open-source developer, Sylvain is very involved with Project Jupyter and is a member of the project's steering committee. Together with the rest of the Jupyter Steering Council, he was honored with the 2017 ACM Software System Award for Jupyter.
    Beyond Jupyter, Sylvain contributes to a number of scientific computing open-source projects such as bqplot, xtensor, and Voilà. Sylvain founded QuantStack in September 2016; prior to that, he was a quant researcher at Bloomberg and an adjunct faculty member at the Courant Institute and at Columbia University.
    Besides QuantStack, Sylvain serves as a member of the board of directors of the NumFOCUS foundation.

  • PyData Paris - June 2019 Meetup

    CRI

    Mark your calendar for the next session of the PyData Paris Meetup on June 5th 2019! This Meetup is hosted by the Centre de Recherche Interdisciplinaire (CRI), 8 rue Charles V, 75004 Paris.

    This is a special edition of our event series, organized on the occasion of a community workshop on Project Jupyter, when a large number of developers from the Jupyter ecosystem will be in town. Hence we will have one main talk by our invited speaker, Emmanuelle Gouillart, followed by a series of "lightning talks" by participants in the Jupyter workshop.

    Schedule
    -------------

    6:00pm - 6:15pm: Community announcements
    6:15pm - 7:00pm: Emmanuelle Gouillart
    Dash: a web framework for writing highly-tuned apps for data science in
    pure Python
    7:00pm - 7:45pm: Jupyter Community Lightning talks. (Series of short presentations in rapid succession).
    7:45pm - 8:30pm: Standing buffet

    Abstract
    ------------

    Dash (https://dash.plot.ly/) is an open-source Python web application
    framework developed by Plotly. Written on top of Flask, Plotly.js, and
    React.js, Dash is meant for building data visualization apps with highly
    custom user interfaces in pure Python. I will give a demo of how to write
    Dash apps with pure Python, starting from very simple apps to more
    advanced ones, including reactive apps based on advanced data
    visualization. I will also discuss the performance and deployment of Dash
    apps.
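
    As an illustration of the "pure Python" approach, a minimal reactive app might look like the following sketch (it assumes the Dash 1.x package layout; recent versions expose dcc and html directly from the dash package):

        import dash
        import dash_core_components as dcc
        import dash_html_components as html
        from dash.dependencies import Input, Output

        app = dash.Dash(__name__)

        app.layout = html.Div([
            dcc.Slider(id="n-points", min=10, max=1000, step=10, value=100),
            html.Div(id="summary"),
        ])

        # The callback re-runs whenever the slider value changes
        @app.callback(Output("summary", "children"), [Input("n-points", "value")])
        def update_summary(n):
            return "Number of points selected: {}".format(n)

        if __name__ == "__main__":
            app.run_server(debug=True)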

    Dash benefits from several component libraries, from the core
    components (e.g. sliders, radio buttons, file dialogs) to more custom
    and application-specific components, such as components for
    engineering or life sciences applications, or data tables. I will
    take the example of the dash-canvas library, which provides an
    interactive component for annotating images (e.g. with a freehand
    brush, lines, bounding boxes, ...). The library also provides utility
    functions for using user-provided annotations in several image
    processing tasks such as segmentation, transformation, and
    measurement. The latter functions are based on libraries such as
    scikit-image and OpenCV. A gallery of examples at
    https://dash-canvas.plotly.host/ showcases some typical uses of Dash
    for image processing. Finally, I will mention how to write your own
    component libraries for custom Dash components.

    Bio
    ----

    Emmanuelle is a materials science researcher at Saint-Gobain, and a
    part-time developer at Plotly, where she works on image processing and
    documentation. She has been a core contributor to scikit-image for
    several years, and her interest in image processing was triggered by her
    use of 3-D imaging of materials at high temperature. She recently created
    the dash-canvas library for integrating image annotating and processing
    into the Dash Python web framework. In software development, besides
    image processing she is interested in documentation and teaching
    scientific Python. She has been a co-organizer of the EuroSciPy
    conference for several years.

  • PyData Paris - March 2019 Meetup

    National Conservatory of Arts and Crafts

    Mark your calendar for the next session of the PyData Paris Meetup on March 26th 2019. This Meetup will be hosted by the Conservatoire National des Arts et Metiers (Cnam), 292 rue Saint-Martin, 75003 Paris.

    The speakers for this session are Olivier Grisel, Sarah Diot-Girard, and Stephanie Bracaloni.

    Schedule
    -------------

    7:00pm - 7:15pm: Community announcements
    7:15pm - 8:00pm: Olivier Grisel
    Scikit-learn: what's new and what's under development
    8:00pm - 8:45pm: Sarah Diot-Girard, and Stephanie Bracaloni
    From ML experiments to production: versioning and reproducibility with MLV-tools
    8:45pm - 9:30pm: Standing buffet

    Abstracts
    -------------

    * Olivier Grisel:
    * Scikit-learn: what's new and what's under development

    Scikit-learn is one of the most popular machine learning libraries. This talk will present a selection of recently released features and introduce some new developments, including more scalable models such as fast histogram-based gradient boosting decision trees and an efficient reimplementation of k-means.
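
    For instance, the new histogram-based gradient boosting estimators can be tried as follows (a sketch; in scikit-learn 0.21/0.22 they were still experimental and required the enabling import shown below, which later releases no longer need):

        from sklearn.experimental import enable_hist_gradient_boosting  # noqa: activates the estimator
        from sklearn.ensemble import HistGradientBoostingClassifier
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=10000, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        clf = HistGradientBoostingClassifier(max_iter=100)
        clf.fit(X_train, y_train)
        print(clf.score(X_test, y_test))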

    * Sarah Diot-Girard, and Stephanie Bracaloni:
    * From ML experiments to production: versioning and reproducibility with MLV-tools

    You're a data scientist. You have a bunch of analyses you performed in Jupyter notebooks, but anything older than two months is totally useless, because it never works right when you open the notebook again. Also, you cannot remember the dropout rate on the second-to-last layer of that convolutional neural network which gave really great results two weeks ago and that you now want to deploy to production. Does that ring a bell?

    You're a software engineer in a data science team. You can't imagine life without Git. Reviews of readable files, tests, code analysis, and CI are part of your daily routine. You used to think of Jupyter notebooks only as a demo tool. You need reproducibility for every step of your work, even if you lose a server. And last but not least, you want to be able to deliver to production something usable by anyone. Is there a magical solution?

    No! But we can find compromises to satisfy both worlds...

    We had these kinds of issues at PeopleDoc. Building on open-source solutions, we have developed a set of open-source tools and designed a process that works for us. We are thrilled to present our project and we hope to spark a discussion with the community.

    See you on GitHub: https://github.com/peopledoc/ml-versioning-tools

    Bios
    ------

    Olivier Grisel is a core developer of scikit-learn working at Inria and supported by the scikit-learn initiative at Fondation Inria https://scikit-learn.fondation-inria.fr/

    Sarah Diot-Girard has been working as a machine learning engineer since 2012, and she enjoys finding solutions to engineering problems using data science. She is particularly interested in practical issues, both ethical and technical, that come from applying ML in real life. In the past, she has given talks about data privacy and algorithmic fairness, and she also promotes a DataOps culture.

    Stephanie Bracaloni has been working as a software engineer for more than 6 years. She now works on the industrialization of machine learning projects (from POC to production). She likes development, but she is not "just a coder": she always keeps systems and projects in mind as a whole. Finding solutions to new problems and improving day-to-day processes is something she really enjoys.

  • PyData Paris - January 2019 Meetup

    Institut de Physique du Globe de Paris

    Mark your calendar for the next session of the PyData Paris Meetup on January 21st 2019. This Meetup will be hosted by IPGP (Institut de Physique du Globe de Paris), rue Jussieu.

    The speakers for this session are Joris Van den Bossche and Viviane Pons.

    Schedule

    7:00pm - 7:15pm: Community announcements
    7:15pm - 8:00pm: Joris Van den Bossche
    GeoPandas: easy, fast and scalable geospatial analysis in Python
    8:00pm - 8:45pm: Viviane Pons
    Teaching with Jupyter at Université Paris-Sud
    8:45pm - 9:30pm: Standing buffet, catering offered by the InterRift ANR project at IPGP.

    Bios:

    Joris Van den Bossche is an open-source Python enthusiast currently working at the Université Paris-Saclay Center for Data Science (at Inria), both on data science projects and on contributing to pandas and scikit-learn. Before that, Joris completed a PhD at Ghent University and VITO (Belgium) on air quality research. Joris regularly gives Python data analysis workshops. He is a core contributor to pandas and the maintainer of GeoPandas.

    Viviane is a computer scientist and a faculty member at Université Paris-Sud (Orsay). Her research is at the boundary between theoretical computer science and mathematics and involves a lot of exploratory computing. She teaches algorithmics and programming at both beginner and graduate levels. As an open-source developer, she contributes to the SageMath and OpenDreamKit projects. She also volunteers for the community by co-leading the Paris chapter of PyLadies with Anna-Livia Gomart.

    Stay tuned for more details about the Meetup!
    https://twitter.com/pydataparis

    --

    Many thanks to the IPGP and the InterRift Project for hosting the event and for the catering!

  • PyData Paris - October 2018 Meetup

    Capital Fund Management

    Mark your calendar for the next session of the PyData Paris Meetup on October 8th 2018. This Meetup will be hosted at CFM, rue de l'Université.

    The speakers for this session are Jessica Hamrick and Nicolas Thiéry, with an introduction by Laurent Laloux, Chief Product Officer at CFM.

    Schedule

    7:00pm - 7:15pm: Community announcements
    7:15pm - 8:00pm: Nicolas Thiéry
    Modeling mathematics in Python & SageMath: some fun challenges
    8:00pm - 8:45pm Jessica Hamrick
    Nbgrader: a tool for creating and grading assignments in the Jupyter notebook

    Bios:

    Jessica Hamrick is a Research Scientist at DeepMind in London, having recently completed her Ph.D. in Psychology at the University of California, Berkeley, working with Tom Griffiths. Previously, she received her M.Eng. in Computer Science from MIT, working with Josh Tenenbaum. Jessica's research focuses on model-based reasoning and planning, situated at the intersection of cognitive science, machine learning, and AI. In addition to research, Jessica is involved in several open-source projects, including Project Jupyter: she is a member of the Project Jupyter steering committee and the lead maintainer of nbgrader, a tool for grading Jupyter notebook assignments.

    Nicolas M. Thiéry is a Professor at the Laboratoire de Recherche en
    Informatique of Université Paris-Sud. His teaching ranges from
    introductory programming (with C++, in Jupyter) to computational
    methods in algebra (with SageMath, in Jupyter). His research is at
    the borderline between mathematics and computer science, studying
    algebraic combinatorics with the help of computer exploration. He has
    been promoting software sharing for algebraic combinatorics since
    2000 and contributing to SageMath since 2008. To help fund the
    computational math software and Jupyter ecosystems, he leads the
    OpenDreamKit European project [masked].

  • PyData Paris - June 2018 Meetup

    Telecom Paristech

    Mark your calendar for the next session of the PyData Paris Meetup on June 19th 2018.

    The speakers for this session are Tim Head and Tom Dupré la Tour.

    Schedule

    7:00pm - 7:15pm: Community announcements

    7:15pm - 8:00pm: Binder - one click sharing of your data science, by Tim Head

    When other people want to run the code of the cool data project you did last week, you usually think: “Great, someone cares!” and then “Oh no, now I need to play support desk until they get it running.”

    The Binder project lets anyone run the contents of a git repository by clicking a link. For example, try out the latest JupyterLab demo by clicking this link. Binder lets you describe the dependencies of your repository in a way that allows a Docker container to be built from it automatically, removing the need for you to spend a lot of time helping others get your code to run.
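
    For instance, a plain requirements.txt at the root of the repository is enough for Binder to build an image (a sketch; the package names and pinned versions below are only illustrative):

        # requirements.txt
        numpy==1.16.4
        pandas==0.24.2
        matplotlib==3.1.0

    The repository can then be launched through a URL of the form https://mybinder.org/v2/gh/<user>/<repo>/<branch>, as in the examples below.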

    Some example uses:

    - Reproduce and explore "Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations" by Ross et al. (https://mybinder.org/v2/gh/dtak/rrr/master?urlpath=lab).
    - Learn about "Foundations of numerical computing" (https://mybinder.org/v2/gh/ssanderson/foundations-of-numerical-computing/master?filepath=notebooks) with Scott Sanderson.
    - Dive into Julia Evans’ "Pandas cookbook" (https://mybinder.org/v2/gh/jvns/pandas-cookbook/master).

    I will tell you about the Binder project, how to use it to share work, what the tools behind it are, and how you can join the team working on Binder.

    8:00pm - 8:45pm: Nearest neighbors in scikit-learn estimators, API challenges, by Tom Dupré la Tour

    Scikit-learn is a very popular machine learning library in Python.
    It is well known for its simple and elegant API, which has been reused in multiple other Python libraries.
    However, some parts of the library could still benefit from a better API.
    In particular, several scikit-learn estimators rely internally on nearest neighbors computations.
    Yet they use different APIs, they can't use custom neighbors estimators, and during a grid search they recompute the nearest neighbors graph for each hyper-parameter setting.
    We will present ongoing work on improving their API, discussing implementation and deprecation challenges.
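
    To make the kind of reuse at stake concrete, here is a sketch with today's API (an illustration only, not necessarily the exact API changes discussed in the talk):

        from sklearn.cluster import DBSCAN
        from sklearn.datasets import make_blobs
        from sklearn.neighbors import kneighbors_graph

        X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

        # Compute a sparse k-nearest-neighbors distance graph once...
        graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

        # ...and reuse it in an estimator that accepts precomputed distances,
        # instead of letting the estimator recompute neighbors internally.
        labels = DBSCAN(eps=1.0, metric="precomputed").fit_predict(graph)
        print(labels[:10])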

    Bios:

    Tim Head builds data-driven products for clients all around the world, from startups to UN organisations. His company, www.wildtreetech.com, specialises in digital products that leverage machine learning and in deploying custom JupyterHub setups.
    Tim contributes to the Binder project and helped create scikit-optimize. When he isn't travelling, he trains for triathlons.

    Tom Dupré la Tour is a third-year PhD student at Télécom ParisTech, interested in signal processing, machine learning and neural oscillations.
    He joined the core developer team of scikit-learn in 2015.
