Fault Tolerant Systems: A Chat
Keeping systems working when parts of them break is an important part of software engineering.
Fault tolerance is a pretty wide area of engineering in general, so to keep things manageable we'll focus on cloud native and adjacent topics, such as any of the following:
(Apologies to all, Meetup doesn't seem to do nested lists...)
- High Availability
- Durability
- Decorrelation
- Single points of failure
- Fault isolation
- Zones, Regions, Global, Partitions (and per-cloud-provider nuances)
- (Indent) Intersections with Individual Services
- (Indent) Inter-Service Dependencies
- Control planes vs data planes
- (Indent) "Static stability" (and its limitations)
- Disaster Recovery
- (Indent) Backups
- (Indent) Restores
- Sociotechnical resilience (e.g. incident response, runbooks, documentation, lottery factor, communication tools, coordination)
- Retries, Idempotency, and Re-entrancy
- Testing
- (Indent) Chaos Engineering
- (Indent) Fault Injection
As always, if you need any accommodations (e.g. ASL interpretation), just reach out and I will do my best to arrange them.
