Testing and Debugging Distributed Systems


Details
Pizza, beer, and mingling.
"Testing and Debugging Distributed Data Storage Systems"
Mukund Madhugiri (http://www.linkedin.com/pub/mukund-madhugiri/0/a02/18) - Sr. Director of Engineering, Cloud Services and Hadoop Quality and Release at Yahoo!
"Inside the Snow-globe: Testplus and TARDIS at Chegg"
Rodney Gomes (http://www.linkedin.com/pub/rodney-gomes/1/735/260/) - Senior Software Engineer in Test, Chegg Inc.
"Data Driven Testing for Distributed Systems: Case study with Apache Helix"
Kishore Gopalakrishna (http://www.linkedin.com/in/kgopalak) - Data Infra Engineer, LinkedIn
Distributed systems are, well, distributed. This makes them a lot harder to deal with. We've had evenings where we've talked about the challenges of building horizontally scalable systems and how to run them, but we haven't really talked about the in-between state, where these systems are still being developed and aren't yet ready for prime time.
When testing monolithic applications, we tend to think of things like automated tests that produce logs we can compare against known-good results. When we don't get the expected result, we can run the test case with a debugger attached and see what happens. For performance, we can run a test suite and time it. If we're not satisfied, there are various tools we can use to profile our application and see where it is spending its time.
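For a single-process application, the "compare the logs to known-good results, then time it" approach might look something like the sketch below. It is only an illustration: `run_case`, `compare_to_golden`, `timed`, and the log lines are hypothetical names and data made up for this example, not taken from any of the talks.

```python
import difflib
import time


def run_case(inputs):
    """Hypothetical system under test: returns the log lines it produced."""
    return [f"processed {item}" for item in inputs]


def compare_to_golden(actual, golden):
    """Return a unified diff of golden vs. actual logs (empty list if identical)."""
    return list(
        difflib.unified_diff(
            golden, actual, fromfile="golden", tofile="actual", lineterm=""
        )
    )


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Assumed known-good output for this made-up test case.
golden_log = ["processed a", "processed b"]

actual_log, elapsed = timed(run_case, ["a", "b"])
diff = compare_to_golden(actual_log, golden_log)

if diff:
    print("\n".join(diff))  # a mismatch is where we would attach a debugger
else:
    print(f"logs match golden output ({elapsed:.6f}s)")
```

This works because one process on one host produces its log lines in a repeatable order, which is exactly the property a distributed system tends to lose.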
Distributed systems aren't amenable to such straightforward techniques. The distribution of the data introduces timing variations as data is shuffled around the network. The need to communicate between nodes introduces the possibility that messages are routed incorrectly, or are otherwise lost. The use of multiple distinct hosts means we can't easily get a picture of where our application is spending its time.
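To make the timing and message-loss problem concrete, here is a toy simulation of an unreliable network. Everything in it is an assumption for illustration only: the `deliver` function, the loss rate, and the delay range are invented, and a seeded random number generator stands in for a real network so that a given run can be replayed.

```python
import random


def deliver(messages, seed, loss_rate=0.2, max_delay=5.0):
    """Simulate sending messages over an unreliable network.

    Each message is dropped with probability loss_rate. Surviving messages
    get a random delay and are returned in arrival order, which may differ
    from the order in which they were sent.
    """
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    arrivals = []
    for send_time, msg in enumerate(messages):
        if rng.random() < loss_rate:
            continue  # message lost in transit
        arrivals.append((send_time + rng.uniform(0, max_delay), msg))
    arrivals.sort()  # order by simulated arrival time
    return [msg for _, msg in arrivals]


sent = [f"m{i}" for i in range(10)]

run_a = deliver(sent, seed=1)
run_b = deliver(sent, seed=1)  # same seed: identical losses and ordering
run_c = deliver(sent, seed=2)  # different seed: possibly a different outcome

print("sent:   ", sent)
print("seed 1: ", run_a)
print("seed 2: ", run_c)
```

A log comparison against a single known-good result does not work directly here, because the received sequence depends on which messages were dropped and how long each was delayed. Fixing the seed is one way to get a repeatable run back in a simulation; a real network offers no such knob.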
If you have a distributed system, how have you dealt with this? How do you test your system? How do you debug problems when they arise? Have you found tools that help you with this, or do you have to resort to ad hoc approaches and write your own tools?
For our April event, I'd like to see talks about Testing and Debugging Distributed Systems. How are you doing it? What difficult problems have you run into? What have you learnt about it?
I'm looking for two to four talks of 20-25 minutes each. If you can give a talk, please contact me, Chris Westin, through Meetup.
In addition to the evening's theme talks, we can fit in two or three five-minute lightning talks at the beginning of the evening; any topic that would be interesting to the #lspe audience is welcome. If you're interested in giving a lightning talk, contact me, Chris Westin, through Meetup.
