Pizza, beer, and mingling.
Real-Time and Offline, Exposure and Focus at Flickr
Peter Welch – Flickr Backend, Yahoo!
"Combining Batch and Realtime Processing with Summingbird, Storm, and Hadoop"
Bill Darrow - Revenue Engineering, Twitter
We've covered a number of topics related to the performance, reliability, and scalability of web sites and web applications, but this has always been based on real-time behaviors. We think of these in terms of failover handling and response time. But many applications these days rely on systems that do not respond in real time. These dependencies include things like recommendation systems, analytics systems, machine learning systems, and computations carried out by multi-host frameworks such as Hadoop and GraphLab.
For our July event, I'd like to hear about how people are integrating these into their real-time systems. By definition, these non-real-time systems don't produce results in real time, so you can't directly depend on them for page (or web service result) generation; the response time would be too long. So how do you integrate these systems into your application?
In the simplest cases, such as using an analytics system to maintain a leaderboard for a game platform, you might run the query periodically, cache the result, and always show the latest cached result. But even something as simple as that may cause problems for real-time serving because of the demands the analytics system places on your data store while it is doing its work.
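The periodic-query-and-cache pattern above can be sketched as follows. This is a minimal illustration, not a reference implementation: `compute_leaderboard` is a hypothetical stand-in for the expensive analytics query, and in a real system the refresh would be driven by a cron job or scheduler rather than called by hand.

```python
import threading

def compute_leaderboard():
    # Hypothetical stand-in for the expensive analytics query.
    # In practice this is the call that loads your data store,
    # so it runs on the offline path, never per-request.
    return [("alice", 120), ("bob", 95), ("carol", 80)]

class CachedLeaderboard:
    """Always serve the latest cached result; refresh it periodically."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cached = compute_leaderboard()  # prime the cache at startup

    def get(self):
        # Real-time serving path: returns immediately, never waits
        # on the analytics system.
        with self._lock:
            return self._cached

    def refresh(self):
        # Offline path: invoked by a periodic job. Compute outside the
        # lock so serving is never blocked by a slow query.
        result = compute_leaderboard()
        with self._lock:
            self._cached = result

lb = CachedLeaderboard()
top = lb.get()[0]  # the latest cached leader
```

The key property is the separation of paths: `get` touches only the in-process cache, while `refresh` absorbs all the analytics cost, so the real-time side degrades only to stale data, never to slow responses.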
In other cases, such as training a spam classifier for email, how often do you do it? What do you do in between trainings? How and when do you augment the training set? For a recommendation system, this gets complex because you need a distinct training set for each user, and you have to constantly retrain based on new ratings.
I'd like to hear talks on these topics. Does your application rely on a non-real-time system for its operation? How do you connect it to your real-time serving infrastructure? Does this system compete with your real-time serving for access to the data store? How do you manage that? How do you handle upgrades if you change this system? For example, suppose you decide to change the features used for machine learning: when do you retrain everything? If you're clustering news articles, how do you keep up with the constant feed of new information that you need to check against all your existing data? Can you find ways to affect or improve your results with recent changes without having to completely retrain or reclassify all the time?
I'm looking for [masked] minute talks. If you can give a talk, please contact me, Chris Westin, through meetup.
In addition to the evening's theme talks, we can fit in 2-3 five-minute lightning talks at the beginning of the evening; any topic that would interest the #lspe audience is welcome. If you're interested in giving a lightning talk, contact me, Chris Westin, through meetup.