Search engines and tracking ML workflows
Details
We are very pleased to announce that our next event will be taking place on Monday the 20th of February.
The event will be held at Tramshed Tech and is kindly sponsored by Antiverse, who are providing the venue together with food and refreshments (free beer, soft drinks and pizza!).
The venue will open at 18:00 for a workshop/introduction to tree-based machine learning algorithms (decision trees, random forests, etc.). The main event with two great speakers will start at 19:00. After the event, we will be heading to a local pub that is close to Cardiff Central Station.
We are fortunate to have one external speaker for this event, and one of the team will be giving a talk/workshop:
--------------------------------------------------------------------------------------
Talk 1: Daoud Clarke - The challenges of building a non-profit search engine
Building a non-profit search engine is hard because there's less money. This means we can't afford lots of servers and storage space. But it's possible if you design the search engine to be cost-effective. That's what we've done with Mwmbl, an open-source, non-profit search engine built from scratch, which also comes with its own Firefox extension.
We use a specially designed index based around pages of memory. To make retrieval efficient, we store search results in compressed pages of 4096 bytes which can be read into memory very efficiently. The whole search engine is effectively one large hash map built around these compressed pages. This reduces the cost of running the search engine since we need much less storage space than a typical search engine index. We use other tricks too, like discarding unnecessary data, and only storing the most important search results. Ultimately this means we can store billions of search results on a single machine.
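As a rough illustration of the idea described above, here is a minimal sketch of a hash-map index built on fixed-size compressed pages. It is not Mwmbl's actual code: the class name `TinyPageIndex`, the JSON layout, the 2-byte length prefix, and the choice of `zlib` and SHA-256 are all assumptions made to keep the example within the Python standard library.

```python
import hashlib
import json
import zlib

PAGE_SIZE = 4096  # bytes per page, matching the figure quoted in the abstract
HEADER = 2        # assumed: 2-byte length prefix in front of the compressed data


class TinyPageIndex:
    """Toy index: every term hashes to one fixed-size compressed page."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        # A real index would likely memory-map a file; a bytearray keeps this simple.
        self.data = bytearray(num_pages * PAGE_SIZE)

    def _page_number(self, term):
        # Hash the term to choose a page (the hash function is an assumption).
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_pages

    def _read_page(self, page_number):
        start = page_number * PAGE_SIZE
        length = int.from_bytes(self.data[start:start + HEADER], "big")
        if length == 0:
            return []  # page has never been written
        payload = bytes(self.data[start + HEADER:start + HEADER + length])
        return json.loads(zlib.decompress(payload))

    def _write_page(self, page_number, results):
        # Keep the highest-scoring results and drop the weakest until the page fits.
        results = sorted(results, key=lambda r: r["score"], reverse=True)
        while True:
            payload = zlib.compress(json.dumps(results).encode("utf-8"))
            if len(payload) <= PAGE_SIZE - HEADER:
                break
            results.pop()
        page = len(payload).to_bytes(HEADER, "big") + payload
        start = page_number * PAGE_SIZE
        self.data[start:start + PAGE_SIZE] = page.ljust(PAGE_SIZE, b"\0")

    def add(self, term, url, score):
        page_number = self._page_number(term)
        results = self._read_page(page_number)
        results.append({"term": term, "url": url, "score": score})
        self._write_page(page_number, results)

    def lookup(self, term):
        # Different terms can share a page, so filter on the stored term.
        results = self._read_page(self._page_number(term))
        return [r["url"] for r in results if r["term"] == term]


index = TinyPageIndex(num_pages=16)
index.add("python", "https://a.example", 0.5)
index.add("python", "https://b.example", 0.9)
print(index.lookup("python"))
```

Because every page has the same size, a lookup is a single hash, one fixed-offset read and one decompression, and the total storage is fixed in advance no matter how many results are offered to the index.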
Our crawler is also unique in that it is distributed via a Firefox extension which volunteers run on their own machines. The crawler retrieves a set of URLs to crawl from a central server, retrieves and processes them, then batches up the results and sends them back to the central server.
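The crawl cycle described above (get a batch of URLs, retrieve and process them, send the results back) can be sketched roughly as follows. The real crawler runs as a Firefox extension, so this Python version is only an illustration: `crawl_once`, `process_page`, and the three callables passed in are hypothetical names, and extracting just the page title is an assumed stand-in for whatever processing the real crawler does.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def process_page(url, html):
    # Assumed processing step: keep only the URL and the page title.
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}


def crawl_once(get_batch, fetch, send_results):
    """Run one cycle: get URLs, fetch and process each, send one batch back."""
    results = []
    for url in get_batch():
        try:
            html = fetch(url)
        except OSError:
            # Network failures such as urllib.error.URLError subclass OSError;
            # skip pages that cannot be retrieved.
            continue
        results.append(process_page(url, html))
    send_results(results)
    return results
```

Passing the server calls in as functions keeps the sketch free of any real endpoint; in practice `get_batch` and `send_results` would wrap HTTP requests to the central server, and `fetch` would download each page.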
In this talk, I will discuss the technical design of the index and crawler, and what it's like starting an open source project and community.
Talk 2: Tim Vivian-Griffiths - Members' choice!
There is a choice of two topics for the workshop session, and members can vote on this over the next week! The topics are:
- Tracking your machine learning workflows with MLflow
- Learning about feature engineering, with a particular focus on the Feature-engine library
Please vote here
The last day for voting is Sat 11th Feb.
Representatives from IntaPeople, a Cardiff-based recruitment firm, will also be present. If you are interested in learning more about the local job market for data science, this is a great opportunity.
The event will be covered by the PyData Code of Conduct (https://numfocus.org/code-of-conduct). We would like to stress that one of the main goals of NumFOCUS and PyData is to increase diversity in the data science field. Our aim is to have an entertaining and informative evening, which will be a welcoming and safe space for everyone, from all backgrounds and technical abilities.
