
Details

Presented by NYC Data Science Academy (https://nycdatascience.com/) students who have just completed the 12-week full-time program.

During this event, you will see some of the best machine learning and big data projects created by NYC Data Science Academy 12-week Data Science Bootcamp students.

You will also have an opportunity to meet our bootcamp students, find out what it is like to study at NYC Data Science Academy, and get an overview of the program. Join us for data wrangling tips, fun facts, and in-depth discussions.

Event schedule:

6:30 pm - 7:00 pm Check in, mingle, enjoy food & drinks

7:00 pm - 8:00 pm Student presentation

8:00 pm - 8:30 pm Network and meet our students

-----------------------------

Project 1: From Boom to Bust - Predicting Housing Prices in Moscow

Ranked #15 out of 3,274 teams on Kaggle (https://www.kaggle.com/c/sberbank-russian-housing-market/leaderboard). Team Members - Brandy Freitas (https://www.linkedin.com/in/brandyalexandrafreitas/), Chase Edge (https://www.linkedin.com/in/chaseedge/) and Grant Webb (https://www.linkedin.com/in/grantdwebb/)

Given four years of housing price data in a foreign market, predicting the following year’s prices should be pretty straightforward, right? But what if, in that last year of data, the country’s stock market, the value of its currency, and the price of its number-one export all dropped by nearly 50%? And on top of all that, the country was hit with economic sanctions by the EU and the US. This was Moscow in 2014, and as you can see, it was anything but straightforward.

We overcame these challenges and, in two weeks of working together, achieved a top 1% ranking on Kaggle. Our success was a product of in-depth data cleaning, feature engineering, and our approach to modeling. With a focus on interpretability and simplicity, we began with linear regression and decision trees, which gave us a better understanding of the data. We then moved on to more complex models such as random forests and XGBoost, which ultimately produced our top submission.
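The simple-to-complex progression described above can be sketched in a few lines of scikit-learn. This is an illustrative toy, not the team's actual code: the data is synthetic, and `GradientBoostingRegressor` stands in for XGBoost so the sketch needs only scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for housing data: a few numeric features
# (think square meters, floor, distance to center) and a price target
# with a mild nonlinearity that trees can capture but a line cannot.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Start with interpretable models, then move to ensembles.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>8}: test MSE = {scores[name]:.3f}")
```

Comparing test error across the four models is what motivates the move from interpretable baselines to ensembles.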


Project 2: A Hybrid Recommender with Yelp Challenge Data

Chao Shi, Sam O'Mullane, Sean Kickham, Reza Rad and Andrew Rubino

People make decisions on where to eat based on friends’ recommendations. Since they know you, their suggestions matter more than those of strangers.

For our capstone project, we built a hybrid Yelp recommendation system that provides individualized recommendations based on your friends’ reviews on the social network. We built the machine learning models in Spark and set up a Flask-Kafka-RDS-Databricks pipeline that handles a continuous stream of user requests.
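The core idea of weighting friends' reviews above strangers' can be sketched in plain Python. This is a toy illustration under assumed data structures (`ratings` and `friends` dictionaries, a hypothetical `friend_weight` parameter), not the team's Spark implementation.

```python
def predict_rating(user, business, ratings, friends, friend_weight=3.0):
    """Predict user's rating for a business as a weighted average of other
    users' ratings, counting friends' reviews friend_weight times as much."""
    total, weight_sum = 0.0, 0.0
    for (reviewer, biz), stars in ratings.items():
        if biz != business or reviewer == user:
            continue
        w = friend_weight if reviewer in friends.get(user, set()) else 1.0
        total += w * stars
        weight_sum += w
    return total / weight_sum if weight_sum else None

# Toy data: dave is friends with alice, so her 5-star review dominates.
ratings = {("alice", "cafe"): 5, ("bob", "cafe"): 2, ("carol", "cafe"): 4}
friends = {"dave": {"alice"}}
print(predict_rating("dave", "cafe", ratings, friends))  # → 4.2
```

In the real system this kind of social weighting is combined with model-based scores, and the computation is distributed with Spark rather than run in a single loop.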

During the presentation, we will talk about the development framework and technical implementation of the pipeline.

Project 3: Wikipedia: Tuned Predictions on Big Data

Scott Dobbins (presenting) and Rachel Kogan

Given that both Wikipedia and the comments sections of most websites are freely open for anyone to edit at any time, how has Wikipedia managed to remain such a useful resource while most comments sections are riddled with vandalism, ads, and other counterproductive user behavior?

We believe the answer is two-fold: 1) Wikipedia has an army of bots that quickly identify and revert vandalism, so that the worst edits are usually never seen by people and the site generally maintains itself in a well-kept state, and 2) Wikipedia has a strong community of administrators and other contributors who routinely clean the site’s flagged contents.

Vandalism is relatively easy to flag, though a few clever edits manage to stay on the site for a long time. What about site content problems that are more subjective, like bias? Wikipedia users do routinely flag pages with point-of-view (POV) issues by hand, but with millions of pages and no machine-based approaches, the site can only confidently maintain neutrality on the more well-trafficked pages.

Here we propose a way to address some of the more intractable content issues for Wikipedia and other sites using Natural Language Processing (NLP) and machine learning. The sheer quantity of data managed by Wikipedia and similar sites requires distributed computing, so we show how Apache Spark can scale common algorithms to massive data sets.
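At its core, flagging problematic edits is a text classification problem. A minimal local sketch with scikit-learn, on made-up example edits, shows the shape of such a pipeline; the project's actual approach scales this kind of featurize-and-classify workflow with Apache Spark over real revision histories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy edit summaries labeled 1 (vandalism/POV) or 0 (constructive);
# real work would train on Wikipedia's revision history at far larger scale.
edits = [
    "this page is garbage lol",
    "added citation to 2015 census report",
    "BEST BAND EVER buy tickets now",
    "corrected the population figure per the source",
    "u all suck delete this",
    "clarified the definition in the lead section",
]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a linear classifier: a common baseline for
# text flagging that has a direct distributed analogue in Spark MLlib.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(edits, labels)
print(clf.predict(["added a reference to the original paper"]))
```

The same vectorize-then-classify structure carries over to Spark's MLlib, which is what makes the approach viable at Wikipedia scale.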
