A Scalable Scraping Architecture [Eddie Bell, Lyst.com]

Hosted By
Tomaz K.

Details

Web scraping is an integral part of data acquisition at Lyst (http://www.lyst.com/). Almost all fashion products sold on our site come from scraping. We run hundreds of spiders in parallel via a distributed scheduling platform and scrape millions of pages each day. One of our main problems is that scraped data is not reliable. In this talk, I will explain how we built a robust and scalable scraping architecture with the help of machine learning and crowdsourcing.

Dr Edward John Lancaster Bell III (https://twitter.com/ejlbell) (Eddie) is an ex-finance PhD who saw the light and joined a start-up. He is the lead data scientist (aka "The Fashematician") at Lyst and he solves fashion data problems using NLP, ML and image processing. He likes describing himself in the third person and long walks on the beach.

Čez glavo v #vblatu
Fakulteta za Računalništvo in Informatiko [NEW VENUE]
Večna pot 113, 1000 Ljubljana · Ljubljana