
Scaling Up Genomics with Spark. Understanding your customers using public data

Hosted By
John M. and Uli B.

Details

It gives us great pleasure to announce our Monday, May 9th Meetup at Bank of Ireland, Grand Canal Square, from 6pm. We have a Data Science vs Big Data Stack themed night, with Apache Spark taking a leading role alongside great data analytics, in a must-see event for big data practitioners and enthusiasts alike!

The agenda for the evening, with Sean Owen, Director of Data Science at Cloudera, and Michael Crawford, Founder of Applied AI, is as follows:

Scaling Up Genomics with Spark by Sean Owen, Director of Data Science, Cloudera

It's amazing that our genome so completely and uniquely encodes each of us with a simple four-letter code, like a file. More amazingly, we're so similar that we can build a reference map of human genomes and reason about commonalities. Genomics has taken off in the last two decades, driven largely by advances in computing; the work of mapping the genome is incredibly data- and compute-intensive. This talk will briefly introduce the problem of genomics and existing home-grown efforts to bring "big data" technology to bear on it. It will compare these with the separate rise of technologies like Apache Hadoop and Spark, and show how these ideas are helping genomics scale up even further.
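To give a flavour of that four-letter encoding, consider tallying how often each base occurs across a set of sequencing reads — an embarrassingly parallel job, and exactly the shape of work Spark distributes across a cluster. A minimal single-machine sketch (synthetic reads, standard library only; the Spark equivalent is noted in a comment):

```python
from collections import Counter

# Synthetic sequencing reads. In practice there are billions of short
# strings like these, which is why the work needs a cluster.
reads = ["ACGTACGT", "TTGACCA", "GATTACA", "CCCGGGA"]

# Tally how often each of the four bases (A, C, G, T) appears.
# On a cluster, Spark would express the same idea as roughly:
#   sc.parallelize(reads).flatMap(list).countByValue()
base_counts = Counter()
for read in reads:
    base_counts.update(read)

print(dict(base_counts))  # e.g. {'A': 8, 'C': 8, 'G': 7, 'T': 6}
```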

Understanding Your Customers Using Public Data by Michael Crawford, Founder of Applied AI

A data science use case! We were involved in a project to model the quality of a large life insurer's customer base and wanted to see if socio-economic factors were a useful predictor. Experian provides this type of information, but it was prohibitively expensive for our purposes (200k+ customers). We took a look at the Irish census data and reckoned we could have a crack at doing what Experian do ourselves. We also thought it would be fun and we'd probably learn new things along the way. What we did:

We started with the 2011 Irish census data at the highest level of detail that is publicly available and ran a selection of clustering algorithms on it. We used various techniques to visualise the clustered data, revealing a lot of structure. Finally, we displayed the data on a map of Ireland. The results were remarkable. However, due to the unsupervised nature of the clustering algorithms, we had no real idea of which features made up each cluster, so we wrote code to visualise their underlying structure and put a narrative on each cluster. We then fed the results of the exercise back into our models of customer loyalty. The presentation consists of slides, Python code and visualisations from each stage of the project, outlining what we did and also the decisions and compromises made along the way.

Technologies used:

Python:

• Anaconda & IPython Notebooks

• NumPy, SciPy, Matplotlib

• Pandas, Seaborn, Scikit-learn

• t-SNE

JavaScript:

• D3.js & DC.js

• Leaflet.js
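As a rough sketch of the workflow above — cluster the areas, then put a narrative on each cluster by inspecting its average feature values — here is a minimal, hypothetical example on synthetic data. The features and figures are invented for illustration; the actual project clustered the full 2011 census tables with Scikit-learn and used t-SNE for visualisation, but a bare-bones k-means shows the idea:

```python
import random

random.seed(0)

# Synthetic "census areas": each point is a made-up pair of
# (share renting, share with third-level education), drawn around
# two invented socio-economic profiles.
areas = (
    [(random.gauss(0.2, 0.05), random.gauss(0.6, 0.05)) for _ in range(50)]
    + [(random.gauss(0.7, 0.05), random.gauss(0.2, 0.05)) for _ in range(50)]
)

def kmeans(points, k, iters=20):
    """Bare-bones k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster emptied out.
        centroids = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(areas, k=2)

# "Putting a narrative on each cluster": summarise its mean feature values.
for c, cl in zip(centroids, clusters):
    print(f"{len(cl)} areas: mean renting={c[0]:.2f}, mean education={c[1]:.2f}")
```

The same step in the real pipeline would be a one-liner with Scikit-learn's `KMeans`, with the per-cluster means read off the fitted `cluster_centers_`.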

As you can see, the event is action-packed and promises to engage our membership for another great evening of big data with HUG Ireland. The hashtag, as always, is #HUGIreland. RSVP today and join us on May 9th at Bank of Ireland, Grand Canal Square, Dublin 2, from 6pm.

Data Engineering and Data Architecture Group (DEDAG)