Message Processing & Resiliency
Wallaroo's integrated state management includes optional support for recovering state after a crash and assuring that each message is processed once and only once.
For example, let's take a counter managed by Wallaroo that receives 'increment' messages. Each message we process contains a number to increase the count.
increment 1 increment 1 increment 3
When applied, our final value is '5'. When we say that we have "exactly-once message processing," we mean that during a failure:
- Increment messages may be sent more than once to assure that we don't lose any, and
- Increment messages that have been previously applied to our counter will not be reapplied.
So, what's the result? You get '5' when everything works perfectly. You get still get '5' even if you experience a crash and recovery. We are working on in-depth documentation on all the nuances of how we do resilience, but - in the meantime - here is a high-level overview of the basics.
Wallaroo uses a local, on-disk event log to save changes to application state. As your application updates state, Wallaroo will write those changes out to the event log. If you are worried about the loss of machines rather than individual processes, the event log can be written to a network file system.
Upstream message cache
Each Wallaroo worker keeps a queue of messages that it has sent to downstream workers. Messages are evicted from the queue as the downstream worker acknowledges that they have been processed, any state changes have been written to disk, and any changes further downstream have also been completed.
Wallaroo supports exactly-once message processing by doing at-least-once message delivery combined with deduplication. Together this means that we can replay work and guarantee that we won't process a message more than once.
When a process is recovering from a crash and reconnecting to a Wallaroo cluster, it goes through our recovery protocol. At a high-level, the recovery protocol is:
- Recover state from the event-log up to the last acknowledged message
- Create a deduplication list of acknowledged messages
- Reconnect to upstream workers
- Request a replay of messages from the upstream cache
- Drop any duplicate messages
- Get notification that recovery replay is complete
- Drop the deduplication list
- Resume normal processing
We currently support saving state changes to a file-based event log and we have work planned to replicate state across a Wallaroo cluster. State replication would permit us to allow for recovery from machine failure without the need to use a network file system.
Exactly-once message processing
Exactly-once message processing works well within Wallaroo; however, additional support is needed when interfacing with external systems. To complete this job, Wallaroo needs:
- To support message replay from external sources
- Acknowledgment and deduplication protocol with external sink to prevent duplicate output messages
We are currently planning on completing this work by early Q1 2018.
Wallaroo makes the infrastructure virtually disappear so you get rapid deployment, very low operating cost, and elastic capacity with zero downtime for your applications in big data, stream processing, machine learning, and microservices.