SBG Python Meetup #2


Details
Quick and Dirty Haplotype Assembly Using Python and graph-tool (Milos Popovic)
Humans are diploid organisms, meaning that they have two complete sets of chromosomes in each cell, one set inherited from each parent. One such set can also be called a haplotype. While traditional NGS techniques and bioinformatics software allow us to sequence both sets of the human genome and analyze how that particular genome differs from a single haploid reference at each position, they do not let us determine whether two variants are part of the same haplotype.
This talk will outline a new sequencing technique called Targeted Locus Amplification, coupled with a new bioinformatics algorithm developed at SBG, which can be used to correctly assign variants to a haplotype and fully assemble both haplotypes for a given locus. We will show how Python's rich scientific computing toolbox allows us to quickly and efficiently prototype algorithms and verify their correctness.
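To give a feel for what "assigning variants to a haplotype" can look like in code, here is a toy sketch in plain Python. It is not the SBG algorithm from the talk and it does not use graph-tool; it only assumes that each sequenced fragment reports the allele (0 for reference, 1 for alternate) it saw at a few heterozygous positions. The function names `phase_variants` and `haplotypes`, the input format, and the example positions are all hypothetical.

```python
from collections import defaultdict, deque


def phase_variants(fragments):
    """Toy phasing: fragments is a list of {position: allele} dicts (allele 0 or 1).

    Returns (phase, blocks), where phase[pos] is the haplotype (0 or 1) assumed
    to carry the alternate allele at pos, and blocks lists the positions that
    could be linked together.
    """
    # Edge weight > 0: alleles usually agree (same haplotype); < 0: opposite.
    weights = defaultdict(int)
    neighbours = defaultdict(set)
    for frag in fragments:
        sites = sorted(frag)
        for i, a in enumerate(sites):
            for b in sites[i + 1:]:
                weights[(a, b)] += 1 if frag[a] == frag[b] else -1
                neighbours[a].add(b)
                neighbours[b].add(a)

    phase, blocks = {}, []
    for start in sorted(neighbours):
        if start in phase:
            continue
        # Breadth-first walk over the variant graph, propagating the phase.
        phase[start] = 0
        block, queue = [start], deque([start])
        while queue:
            u = queue.popleft()
            for v in sorted(neighbours[u]):
                if v in phase:
                    continue
                w = weights[(min(u, v), max(u, v))]
                if w == 0:
                    continue  # conflicting evidence cancels out; do not link
                phase[v] = phase[u] if w > 0 else 1 - phase[u]
                block.append(v)
                queue.append(v)
        blocks.append(sorted(block))
    return phase, blocks


def haplotypes(phase, block):
    """Build the two complementary allele strings for one phased block."""
    hap_a = {pos: 1 - phase[pos] for pos in block}
    hap_b = {pos: phase[pos] for pos in block}
    return hap_a, hap_b


# Made-up fragments covering five heterozygous positions.
fragments = [
    {100: 1, 250: 1},
    {100: 0, 250: 0, 400: 1},
    {250: 1, 400: 0},
    {700: 1, 900: 0},
]
phase, blocks = phase_variants(fragments)
hap_a, hap_b = haplotypes(phase, blocks[0])
```

Positions that never share a fragment end up in separate blocks, which is the basic reason unlinked short reads cannot phase distant variants on their own.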
Scientific Computing with Apache Spark and PySpark API (Milos Nesic)
Spark is a general-purpose computation engine for large-scale data processing. While the Spark core is written in Scala and runs on the JVM, a Python API also exists. It exposes functions that make aggregation, filtering and projection over large datasets relatively simple.
At the heart of the Spark engine is the concept of the Resilient Distributed Dataset (RDD), a distributed memory abstraction. Its key feature is enabling in-memory computations in a fault-tolerant manner. Spark also takes care of much of the technical overhead associated with distributed computing, such as scheduling and straggler mitigation. Combined with the expressiveness of the PySpark API, this makes it well suited to real-time analysis of genomic data, which will be the focus of this talk.
