PyData Triangle July 2021 Meetup
Details
PyData Triangle welcomes you to another exciting event.
This will be an online event. You must RSVP to this meetup event in order to see the Zoom URL. If prompted, the password is 578141.
Speakers:
- Leo Gokce
- Ryan Woodall
- YOU: Lightning Talks (Sign up for a 5-minute lightning talk slot at the meeting by posting in the chat, or sign up in advance by posting a comment on this announcement.)
Schedule:
6:00-6:15 announcements
6:15-6:30 Leo Gokce - project intro/overview
6:30-7:30 Ryan Woodall - data engineering
7:30-8:15 Leo Gokce - project results
8:15-8:30 Lightning talks
The PyData code of conduct ( http://pydata.org/code-of-conduct.html ) is enforced at this Meetup. Attendees violating these rules may be asked to leave the meetup at the sole discretion of the meetup organizer.
NOTE: This meeting will be recorded.
Please propose a presentation or speaker for a future PyData Triangle meetup. Contact any of the organizers (Aarthi Janakiramen, Dhruv Sakalley, Gene Ferruzza, or Mark Hutchinson) through Meetup messages.
Follow us on Twitter at: https://twitter.com/pydatatriangle
Presenter: Leo Gokce
Title: Selecting optimal subsets of Amazon Reviews
Presentation Overview:
- Project goal is to distill Amazon reviews into a representative subset.
- The overall method works, 3-grams didn’t help much, and you need to read at least 3 reviews to get a sense of all reviews. No single review ever covers all aspects of the review corpus.
- We also found that eliminating stop words has little justification.
Bio:
Leo began his college education at Marmara University in Istanbul, Turkey, pursuing a bachelor’s degree in Business Administration. He spent time at the Warsaw School of Economics in Poland and finished at UNC-W with a concentration in Supply Chain Management.
Leo recently completed his Master of Science degree in Computer Science and Information Systems at UNC-W, working with Dr. Douglas Kline for 2 years as his assistant, focusing mainly on Introduction to Database Management.
Leo is currently a full stack software developer at CentralSquare Technologies.
Presenter: Ryan Woodall
Title: Large Scale Data Pipeline for Scraping Amazon Reviews
Presentation Overview:
- Drop a .csv of product urls in a folder in cloud storage
- Come back later to find reviews in a relational database
- Use a variety of mechanisms, including a third-party scraper called cloudscraper.io, Azure Functions on HTTP endpoints, REST APIs, Azure Data Factory, SQL Azure, etc.
- Evaluating and reducing costs
- Azure Data Factory is still young: it has limitations, breaking changes still occur, and it seems inefficient and costly. It is something of a low-code platform. Ryan ended up writing some old-school SQL to get around Azure Data Factory limitations.
Bio:
Ryan received his Bachelor of Arts degree in Music from UNC-W in 2003 and has worked as a professional jazz bassist since 1998. In 2005, he began work at Measurement Inc.
Ryan returned to UNC-W in 2017 to pursue a Master of Science degree in Computer Science and Information Systems. While in school, he took an interest in data science and data engineering, working with Python, SQL, R, and Julia. As a capstone project, he designed and implemented a fully automated data pipeline, running solely in the cloud, for collecting and preparing Amazon reviews.
Before graduating, Ryan was inducted into the Honor Society of Phi Kappa Phi and into Upsilon Pi Epsilon, the honor society for the Computing and Information Disciplines. Since receiving his degree in May of 2021, Ryan has worked as a mentor for aspiring high school students through Polygence and is currently exploring job opportunities in data engineering and related fields.