Scalable Data Harvesting for AI
Details
Every AI pipeline starts with one question: where does the data come from?
In this hands-on workshop, we'll answer that question using Scrapy, one of Python's most powerful web scraping frameworks. You'll go from a blank project to a working spider that harvests structured data from the web, ready to feed into your next AI or data science project.
We'll cover:
• Scrapy vs. lighter tools (Requests & Beautiful Soup), and when each makes sense
• Extracting data from HTML using CSS selectors
• Following links and handling pagination at scale
• Writing clean, structured output to a file
• The ethical and legal side of web scraping
Bonus (Gold Star): a chapter on scraping JavaScript-rendered pages for those who want to go further.
🐍 Python basics assumed — no prior scraping experience needed
Agenda
- 18:00 Doors Open
- 18:30 Start of the Workshop
- 20:15 Workshop Closing & Announcements
- 20:30 Networking
- 21:00 Event Closing
GitHub Repo
Scalable Data Harvesting for AI
Stream
YouTube Stream
📧 Contact
Are you interested in speaking at one of our events? Have a good idea for a Meetup? Get in touch with us at [amsterdam@pyladies.com](mailto:amsterdam@pyladies.com).
💬 Find us on the PyLadies Global workspace:
- Go to https://slackin.pyladies.com and enter your email address
- Accept the email invitation
- Go to the workspace https://pyladies.slack.com
- Join channel #city-amsterdam
