Stop Chasing 9s! Design for Failure, Rather Than Uptime


Details
Administrators-turned-architects often spend too much time chasing 9s of availability. A systems engineer or architect can instead accept that failure and downtime are facts of life and design data pipelines to survive them. In an ideal world, every data source would be equipped with a Splunk Universal Forwarder (UF), and those forwarders would wait patiently until downstream ingest points became available. Unfortunately, we don't live in an ideal world, and downstream issues can lead to lost data and missed critical events. In this session I propose switching non-UF streams from the traditional push model to a pull model and integrating buffering into the data stream (using native AWS tools in this example, though other means could be employed), which minimizes the potential for data loss when an ingest node fails.
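To make the pull-with-buffer idea concrete, here is a minimal sketch, not the session's actual implementation. The `Buffer` class is an in-memory stand-in for whatever durable queue or stream sits between the sources and the ingest tier, and `IngestNode`, `pull_once`, and `demo` are hypothetical names invented for illustration. The point it tries to show is that sources only ever write to the buffer, and the consumer removes events only after the ingest node has accepted them.

```python
from collections import deque


class Buffer:
    """In-memory stand-in for a durable queue (hypothetical, for illustration)."""

    def __init__(self):
        self._pending = deque()

    def put(self, event):
        # Sources write here and never talk to the ingest node directly.
        self._pending.append(event)

    def peek_batch(self, max_events):
        # Return up to max_events without removing them from the buffer.
        return list(self._pending)[:max_events]

    def ack(self, count):
        # Remove events only after the consumer confirms they were ingested.
        for _ in range(count):
            self._pending.popleft()

    def __len__(self):
        return len(self._pending)


class IngestNode:
    """Hypothetical ingest point that can be marked up or down."""

    def __init__(self):
        self.available = True
        self.indexed = []

    def ingest(self, batch):
        if not self.available:
            raise ConnectionError("ingest node unavailable")
        self.indexed.extend(batch)


def pull_once(buffer, node, max_events=10):
    """Pull one batch and acknowledge it only on success.

    Returns the number of events ingested.
    """
    batch = buffer.peek_batch(max_events)
    if not batch:
        return 0
    try:
        node.ingest(batch)
    except ConnectionError:
        # Leave the batch in the buffer; a later pull will retry it.
        return 0
    buffer.ack(len(batch))
    return len(batch)


def demo():
    buffer, node = Buffer(), IngestNode()
    node.available = False            # simulate an ingest node outage
    for i in range(3):
        buffer.put(f"event-{i}")      # sources keep writing during the outage
    during_outage = pull_once(buffer, node)
    node.available = True             # the node recovers
    after_recovery = pull_once(buffer, node)
    return during_outage, after_recovery, node.indexed, len(buffer)
```

Calling `demo()` walks through an outage and a recovery: nothing is ingested while the node is down, and the buffered events are delivered on the first pull after it returns. A real deployment would replace `Buffer` with a durable, managed service and would also need to handle duplicates, since acknowledging after ingest gives at-least-once delivery.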
