
Web Scraping with Trafilatura and SpaCy

Hosted By
David G.

Details

You've invested a lot of time and effort in learning how to extract just the good bits of text from a target site using BeautifulSoup. What happens when you want to add data from another target? Will any of your hard work transfer over? Do you really want to find out how many different names someone can give a class in a div tag? Will all your work go down the drain when your target changes its tech? Is there a better way? Yes! We'll play with the Trafilatura library, which was designed to make extracting just the text you want fast, simple, and universal. We'll use it to get data from some local Long Island sites, and then use the incredible spaCy library to find Named Entities in the text.
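
For a taste of what that workflow can look like, here is a minimal sketch, not the code from the talk. It assumes `trafilatura` and `spacy` are installed and that the small English model `en_core_web_sm` has been downloaded. The helper names (`extract_main_text`, `find_entities`, `group_by_label`) are made up for illustration.

```python
from collections import defaultdict


def extract_main_text(url):
    """Download a page and return its main text, or None if nothing usable was found."""
    # Imported here so the pure helper below works without the library installed.
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)  # None if no main text is detected


def find_entities(text, model="en_core_web_sm"):
    """Return (entity text, entity label) pairs found by a spaCy pipeline."""
    # Assumes the model was installed with:
    #   python -m spacy download en_core_web_sm
    import spacy  # third-party: pip install spacy

    nlp = spacy.load(model)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(entities):
    """Group (text, label) pairs into {label: sorted list of unique texts}."""
    grouped = defaultdict(set)
    for text, label in entities:
        grouped[label].add(text)
    return {label: sorted(texts) for label, texts in grouped.items()}
```

To try it, pass a page URL to `extract_main_text`, check that the result is not `None`, then feed the text to `find_entities` and the resulting pairs to `group_by_label`. Nothing here depends on the target site's class names or div structure, which is the appeal over hand-written BeautifulSoup selectors.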

PyData Long Island
North Bellmore Public Library
1551 Newbridge Road · North Bellmore, NY