Use Python and Spark to Dig into a 2.5GB Log Report / Find Your Twitter Superfan


Details
Presented by NYC Data Science Academy students who have just finished the 12-week full-time program. Apply for the January 2016 program to become a Data Scientist (http://nycdatascience.com/data-science-bootcamp/).
NYC Data Science Bootcamp Capstone Project Demo Day
Preparation:
Please bring your laptop; we will share some tricks and code to get you on the fast track!
Event schedule:
6:30-7:00 Which are the most popular R packages? Insights from a 2.5GB log report
7:00-7:15 Q&A
7:15-7:45 Find your Twitter superfan and what they like about you (a joint project with Fusion.net (http://fusion.net/))
7:45-8:00 Q&A
Speaker bio:
Xavier Capdepon (MSc Urban Engineering & Master in Corporate Finance)
Xavier has more than 12 years of experience in analytical modeling. In a career ranging from transportation research to insurance securitization trading and esoteric securities banking, Xavier has deep roots in Data Science. From C and VBA to SQL, R, and Python, Xavier believes in knowing the best-in-breed toolkits. An experienced instructor and consultant in the field of finance, Xavier continues to deepen his abiding interest in prediction by taking the ASA/CAS actuarial exams (first two completed) and experimenting with machine learning models using Hadoop and Spark. Previously at Guggenheim Partners, he is currently a Fellow at the NYC Data Science Academy.
Fangzhou Cheng (MS in Management Information Systems at NYU)
Fangzhou loves data and hard questions. Having been a journalist in the giant metropolises of Guangzhou, Beijing (CCTV), and New York (UN Headquarters), she was fascinated by the power of communication in transforming technology's future. During her master's studies in Management Information Systems at NYU, she further developed a love for using data to drive business strategy and performance. With skills in Python, R, SQL, and Hadoop, she deploys flexible code and devastatingly clear visualizations to tell data-driven stories.
Shu Yan obtained his Ph.D. in Physics at the University of South Carolina. As a physicist with proficient analytical skills and a strong programming background, he brings coding, data science, and critical problem-solving skills together to tackle real-world problems. His physical intuition and mathematical reasoning bring added insight when thinking about statistical models and machine learning.
As an organizer for President Obama's 2012 reelection campaign, Alex became fascinated with data science when Nate Silver's 538 accurately predicted the outcome in all 50 states. Alex combines a broad range of knowledge with his data expertise to seek novel solutions to real-world phenomena. He loves Python, R, and SQL, and he keeps telling himself he'll learn Scala some day.
Agenda/Content:
A demonstration of a parallel-computing script using Python and Spark to find the most popular R packages and to visualize the dependencies between them.
R's statistical capabilities are extended through user-created packages, which primarily provide specialized statistical techniques and graphical devices. R users regularly download additional packages, and these packages often depend on other packages that must also be downloaded along with the intended "root" packages. In this project, Xavier explored the CRAN website, which lists all available R packages, together with the R package download log files, and proposed a methodology to extract the most popular R packages based on package dependencies.
First, using Spark and Python, Xavier scraped the 7,050+ CRAN web pages describing all the R packages to collect the dependency data, and built a dependency matrix using the sparse-matrix concept available in Python's NumPy/SciPy stack. Then, after downloading the 150 log files (2.5GB+), he used parallel computing to process the 47 million line items in the logs against this dependency matrix, keeping only the records related to the "root" packages. Finally, the dependencies between all the packages can be visualized using R network graphics packages, and the most popular R packages are extracted from the "root" package data and visualized using Python.
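The dependency-matrix step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the package names and dependency pairs below are made up (the real project scraped 7,050+ CRAN pages), and it uses `scipy.sparse` as one concrete way to realize the sparse-matrix idea described in the talk.

```python
# Hypothetical sketch: build a sparse package-dependency matrix and
# identify "root" packages (those nothing else depends on).
from scipy.sparse import lil_matrix

def build_dependency_matrix(packages, depends_on):
    """Entry (i, j) == 1 means packages[i] depends on packages[j]."""
    index = {name: i for i, name in enumerate(packages)}
    mat = lil_matrix((len(packages), len(packages)), dtype=int)
    for pkg, dep in depends_on:
        mat[index[pkg], index[dep]] = 1
    return mat, index

def root_packages(packages, mat):
    """A "root" package is one no other package depends on, i.e. its
    column in the dependency matrix sums to zero."""
    csc = mat.tocsc()  # column-oriented format for fast column sums
    return [p for j, p in enumerate(packages) if csc[:, j].sum() == 0]

# Toy data: ggplot2 and dplyr are requested directly; the rest come along.
packages = ["ggplot2", "scales", "rlang", "dplyr"]
depends_on = [("ggplot2", "scales"), ("ggplot2", "rlang"), ("dplyr", "rlang")]
mat, index = build_dependency_matrix(packages, depends_on)
print(root_packages(packages, mat))  # → ['ggplot2', 'dplyr']
```

In the real pipeline, the same column-sum idea is applied while streaming the 47 million download log records, so only downloads of "root" packages are counted toward popularity.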
Through a collaboration with Fusion.net, our students delivered a superfan detection tool built as a Shiny app. The project charts the influencers and patterns of the Twitter community. The findings can help online media companies identify their most important influencers on Twitter and understand their interests, patterns, and behaviors. Furthermore, recommendation systems can be built on top of natural language processing of influencers' Twitter timelines to suggest content that will attract their attention.
Recommendation systems are widely used by news media companies. Like the "Recommended for You" section of NYTimes.com and the "Recommendations" plugin on HuffPost Social News, personalized placement of articles on apps and websites guides readers to the articles that interest them most. The core of a recommendation system is delivering the right content to the right people at the right time. While several recommendation algorithms (such as content-based methods and collaborative filtering) are already in the market, this project serves as a basis for finding influencers in a social network such as Twitter. As an extension, the project also uses AlchemyAPI as a natural language processing tool to extract entities and relevance scores from user timelines, so the system can infer a user's interests from everything he or she has posted in the past. Based on the same idea of applying natural language processing to user timelines, a content-based recommendation system for influencers can be built in the following steps:
● Text mining on all superfans' timeline content
● Clustering texts using vectors from AlchemyAPI
● Observing distributions of superfan scores and interests across clusters
● Content-based recommendation (matching vectorized user timeline text and article text using cosine similarity)
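The last step above can be sketched as follows. This is a simplified illustration, not the project's actual code: it substitutes plain bag-of-words counts for the AlchemyAPI entity vectors the project used, and the timeline and article texts are made up.

```python
# Hypothetical sketch: match a user's timeline to candidate articles
# by cosine similarity over bag-of-words vectors.
import math
from collections import Counter

def vectorize(text):
    """Term-count vector for a text (stand-in for AlchemyAPI vectors)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(timeline_text, articles):
    """Rank candidate (title, body) articles by similarity to the timeline."""
    user_vec = vectorize(timeline_text)
    scored = [(cosine_similarity(user_vec, vectorize(body)), title)
              for title, body in articles]
    return [title for _, title in sorted(scored, reverse=True)]

timeline = "spark python data science machine learning"
articles = [("Spark tutorial", "python spark cluster data"),
            ("Celebrity gossip", "red carpet fashion awards")]
print(recommend(timeline, articles)[0])  # the Spark article ranks first
```

In a production system the same ranking would run over richer vectors (entities with relevance scores rather than raw word counts), but the cosine-similarity matching step is identical.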
Reference links:
For Xavier's talk on Python and Spark, see all the files (R, Python, and Spark scripts, plus text files) on GitHub.
