We are very happy to announce our next meetup with Tom White, author of Hadoop: The Definitive Guide (the de facto book on Hadoop, in case you missed it), and Andrew Lowe, Research Fellow at the Wigner Research Centre for Physics.
Tom White is in town for only a few days, which is why we are organizing this meetup so close to the last one.
Check out the abstracts below and see you there!
Petascale Genomics with Apache Hadoop
Tom White, Engineer at Cloudera and author of Hadoop: The Definitive Guide
The advent of next-generation DNA sequencing technologies is poised to
revolutionize the way life sciences research is practiced. These new
technologies are scaling significantly faster than Moore’s law, and
promise to catapult life sciences research and the biotech industry
into the realm of big data. However, bioinformatics and data
management in the life sciences have been slow to adopt the latest big
data technologies pioneered by the internet industry (e.g., Google and
Facebook), in part because these tools are only beginning to become
In this talk, I will review several ways in which distributed
computing tools (e.g., the Hadoop ecosystem) can be used to
significantly advance the state of the art in life sciences research.
It will cover the ADAM and GATK projects for doing genomics ETL on top
of Spark, and will touch on tools like Hadoop, Spark, Impala, and Kudu
among others. This talk is not only for people interested in biology
or genomics, since the techniques used for data storage and analytics
apply equally well to other domains too.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net and IBM's developerWorks, and has spoken at several conferences, including at ApacheCon 2008 on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.
Machine learning for particle physics using R
Andrew John Lowe, Scientific Research Fellow, Wigner Research Centre for Physics
Search strategies for new subatomic particles often depend on being able to efficiently discriminate between signal and background processes. Particle physics experiments are expensive, the competition between rival experiments is intense, and the stakes are high. This has led to increased interest in advanced statistical methods to extend the discovery reach of experiments. This talk will present a walk-through of the development of a prototype machine learning classifier for differentiating between decays of quarks and gluons at experiments like those at the Large Hadron Collider at CERN. The power to discriminate between these two types of particle would have a huge impact on many searches for new physics at CERN and beyond. I will discuss why I chose to perform this analysis in R, how switching to R has helped my work and enabled me to adopt a more efficient reproducible research workflow, and how I have overcome the problems that I encountered when working with large datasets in R.
Andrew Lowe is a particle physicist at the Wigner Research Centre for Physics, Hungarian Academy of Sciences, in Budapest. He spent several years based at the European Organization for Nuclear Research (CERN) in Geneva and was a member of the collaboration that discovered the Higgs boson. He played a major role in the development of the core software and algorithms for a real-time multi-stage cascade classifier that filters and reduces the collision event data rate from 60 TB/s to a manageable 300 MB/s that can be written to permanent storage for subsequent offline analysis. He now works on using machine learning techniques to develop classification algorithms for recognising particles based on their decay properties.