Web Scraping with Trafilatura and SpaCy
Hosted By
David G.

Details
You've invested a lot of time and effort in learning how to extract just the good bits of text from a target site using BeautifulSoup. What happens when you want to add data from another target? Will any of your hard work transfer over? Do you really want to find out how many different names someone can give a class in a div tag? Will all your work go down the drain when your target changes its tech? Is there a better way? Yes! We'll play with the Trafilatura library, which was designed to make just-the-text-you-want extraction fast, simple, and universal. We'll use it to get data from some local Long Island sites, and then use the incredible SpaCy library to find Named Entities in the text.
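
For a rough idea of the workflow, here is a minimal sketch and not the code from the talk. It assumes Trafilatura and SpaCy are installed along with the small English model `en_core_web_sm`; the URL is a placeholder, and the helper names `extract_main_text`, `named_entities`, and `count_labels` are made up for illustration.

```python
from collections import Counter


def extract_main_text(url):
    """Download a page and return its main text, or None if nothing was extracted."""
    # Imported here so the other helpers can be used without Trafilatura installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    # extract() drops navigation, footers, and similar boilerplate.
    return trafilatura.extract(downloaded)


def named_entities(text, model="en_core_web_sm"):
    """Return (entity text, entity label) pairs found by a SpaCy pipeline."""
    import spacy

    nlp = spacy.load(model)  # assumes the model has already been downloaded
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def count_labels(entities):
    """Count how often each entity label appears in a list of (text, label) pairs."""
    return Counter(label for _, label in entities)


if __name__ == "__main__":
    url = "https://example.com/"  # placeholder, swap in a real target site
    text = extract_main_text(url)
    if text:
        entities = named_entities(text)
        print(count_labels(entities).most_common(5))
    else:
        print("No text could be extracted from", url)
```

Note that no CSS classes or div names appear anywhere in the sketch, so the same functions can be pointed at a different site without rewriting selectors.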

PyData Long Island
North Bellmore Public Library
1551 Newbridge Road · North Bellmore, NY