I’m working on a Paxos implementation and am currently using the Async_rpc library for my communication layer. The issue I’m having is that when the system is under load, the incoming client requests delay the data getting synced to disk, which means I can’t respond quickly to any of the client requests; once the client requests stop, the system performs excellently again.
What would be optimal would be a strategy along the lines of: do all outgoing requests before incoming ones. However, I can’t see how this would be possible to implement.
Other approaches I’ve considered are having a dedicated thread for each and hoping that the scheduler will evenly spread time between them, or explicitly limiting the number of requests inside the system to some value (by chucking them into a throttle).
Are there any general-purpose approaches to dealing with this inside Async?
In transaction-processing there’s a term “run-to-completion” which refers to processing all immediately-schedulable work that results from an input action, before any other input is processed. And for example, queueing records to the persistent log, and then starting the write-operation, would count as such work, as would any other immediately-runnable work. In a LWT-like system with a run-queue, one would think that the simplest way of achieving this would be to place such work at the front of the queue, not the back, and that that should suffice?
This doesn’t address the issue of actually -achieving- the goal of getting those disk blocks written, but that would be a matter of equitable distribution of scheduling across active I/O events, which is a different matter, and something that would be addressed in the event loop or libevent or whatever.
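To make that shape concrete, here is a rough hand-rolled sketch (not anything Async provides; all names are illustrative) of a two-level run-queue where the immediately-runnable follow-on work from the current input is drained before any new input is admitted:

```ocaml
(* Hand-rolled sketch of a two-level run-queue: follow-on work spawned by the
   current input (e.g. queueing log records and starting the disk write) goes
   to the front; fresh inputs go to the back. *)
type run_queue = {
  front : (unit -> unit) Queue.t;  (* immediately-runnable follow-on work *)
  back  : (unit -> unit) Queue.t;  (* new inputs *)
}

let create () = { front = Queue.create (); back = Queue.create () }

let push_followup t job = Queue.push job t.front
let push_input t job = Queue.push job t.back

(* The dispatcher only looks at the back queue once the front is empty,
   which gives run-to-completion for the current input. *)
let rec run t =
  match Queue.pop t.front with
  | job -> job (); run t
  | exception Queue.Empty ->
    (match Queue.pop t.back with
     | job -> job (); run t
     | exception Queue.Empty -> ())
```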
Concretely, is there a way for you to enqueue immediately-runnable work at the front of whatever Async uses for a scheduler?
It’s part of Async. It returns a deferred that becomes determined when the job queue becomes empty. I use it to prevent network updates from livelocking my app, I think similar to the problem you are facing.
One thing you can do is have the Async_rpc handler do the bare minimum and stick the request into an explicit queue (like a Pipe), and have whatever services the queue bind on a Scheduler yield before it handles each item in the queue.
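A minimal sketch of that shape, assuming the real processing lives in a hypothetical handle_request and the response can be produced from the queue side:

```ocaml
open Core
open Async

(* The RPC implementation does the bare minimum: push the request into a
   Pipe and return. *)
let requests_r, requests_w = Pipe.create ()

let rpc_impl _connection_state request = Pipe.write requests_w request

(* A worker drains the Pipe, yielding to the scheduler before each item so
   that disk writes and other pending jobs get a turn. *)
let rec serve handle_request =
  Pipe.read requests_r >>= function
  | `Eof -> return ()
  | `Ok request ->
    Scheduler.yield () >>= fun () ->
    handle_request request >>= fun () ->
    serve handle_request
```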
Yeah, I like that style of solution better. It depends on where the starvation is coming from: if it’s coming from just the traffic of handling the network connections themselves, you might want a separate process that handles those requests and shields the core process from clients. But if, as seems more likely, it’s the substantive processing of the requests that’s the issue, then mbac’s idea seems right on: as quickly as you can, register the state of the unanswered requests in an ordinary synchronous data structure, and then make prioritization decisions there.
In general, I find it’s better to pull information out of the Async world as quickly as you can, and structure as much of your program as possible as a synchronous, transaction-processing machine. That lets you make simple and explicit decisions about things like prioritization and rate limiting in the ordinary code of your program, without having to dive into the plumbing of something like Async.
I would be curious what sorts of starvation you’re seeing (or perhaps what sorts you’re expecting) in a little more detail? And in the case where the action that transpires after an input request is received … is not confined to a single callback, how are the multiple callbacks interconnected? That is, what is the structure of the handler for an inbound message?
I feel like there are a few things you might want to keep in mind when designing the system you desire. These are core considerations from transaction-processing, and if, for instance, you consult Gray & Reuter (which I encourage you to do, on general principles), you will find them expressed there in … quite insistent terms.
1. Admission control: this is a key idea in all transaction-processing monitors; indeed, it and #2 are the reason that the first TP monitor, CICS, was implemented by IBM and their power-company client. It is critical to be able to constrain how many in-flight requests are in progress in a TP monitor process. Requests consume resources, which leads directly into …
2. Capacity/resource limitation: the other key attribute of TP monitors since the first is control of the maximum amount of resources requested from the operating system. Enqueueing a request to an in-memory queue consumes resources, and there need to be strict limits on that.
3. Run-to-completion: there is a key idea (again, from the early history of TP, including in TPF, the TWA Airline Control Program, which you can find documented in CACM ’84, Spector & Gifford) that when a transaction starts running, it “runs to completion” until it hits a suspension point due to unavailable data or waiting on some latency-bound operation (e.g. a disk flush). As an example, the original TPF TP monitor did not have a system interrupt: a transaction ran until it finished. This (perhaps paradoxical) idea is sound: eventually, the tran will need to finish running, and so you might as well let it do so now, and consume all the resources -now-, so that it can finish and then release them for other trans. Suspending it while it continues to consume resources only hurts other trans that could have used those resources.
Finally, there is an important rule to keep in mind: “two-phase locking”. This rule isn’t just about locks, but about all forms of resources. A transaction should never give up resources that it has acquired until it either reaches a suspension point or is completed; in either case, it should then give up as many resources as is reasonably possible.
Concretely, I’ll suggest two immediate ramifications:
Offloading units of work to a queue is a good idea when those UOWs are low-frequency and the offloading can free up resources. Doing so for the typical UOWs in a system, and doing so to an in-memory queue, doesn’t free up resources.
Whatever inbound workload-management mechanism you use must be instructible as to how much uncompleted work (typically == “in-flight trans”) can be tolerated at any time, and it must strictly enforce those limits (a rough sketch follows below).
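As a hedged illustration of that admission-control point, here is one way it might look with Async’s Throttle; the cap and handler names are illustrative, and note that a plain Throttle bounds concurrency but not the length of its pending queue, so a stricter scheme would also reject or shed work once too much has piled up:

```ocaml
open Core
open Async

(* Illustrative cap on in-flight requests; pick a number that reflects the
   resources a single request consumes. *)
let max_in_flight = 64

let throttle =
  Throttle.create ~continue_on_error:true ~max_concurrent_jobs:max_in_flight

(* [handle_request] is a hypothetical function doing the real work; the
   throttle ensures at most [max_in_flight] of them run at once. *)
let admit handle_request request =
  Throttle.enqueue throttle (fun () -> handle_request request)
```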
As an example of why run-to-completion is so critically important, imagine a process with a single kernel thread that is presented 100 UOWs, each of which is ten steps of computation (suppose each step takes 1ms) – perhaps chained together with promises.[1] If the scheduling mechanism employs run-to-completion, the average latency is roughly 50 * 10ms == 500ms. If the scheduler employs some round-robin “fair” scheduling strategy that is oblivious to the chained nature of the computations, then it is easy to end up with the average latency being 100 * 10ms == 1000ms (or thereabouts). The term for the related behaviour when memory is overcommitted in virtual-memory operating systems is “thrashing”, and this is equally an example of thrashing, though of different resources.
[1] Consider, for instance, accessing a remote server with a local in-memory cache that is almost always a hit. So you need to provide a promise-based implementation, but with extremely high probability the promise is immediately fulfilled, with no suspension/latency.
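In OCaml/Async terms, that footnote’s example might look something like this hedged sketch (the cache and fetch_remote are illustrative):

```ocaml
open Core
open Async

(* Illustrative local cache in front of a remote lookup. On the (common) hit
   path the returned deferred is already determined, so the caller's chained
   work is immediately runnable. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create (module String)

let lookup ~fetch_remote key =
  match Hashtbl.find cache key with
  | Some value -> return value
  | None ->
    fetch_remote key >>| fun value ->
    Hashtbl.set cache ~key ~data:value;
    value
```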
Last thought: naive implementations of promises may violate run-to-completion, or not provide adequate expressive capability to allow the programmer to express that their computation is runnable-to-completion. This is something to carefully investigate before using them.
Manually scheduling that kind of thing sounds like a very good idea. I’m just chucking together a prototype of that right now; do you have any intuition regarding the overhead of such an approach versus getting Async to schedule this in a friendly manner?
Thanks for the detailed post. I have been kind of flailing around trying to work out how to write this kind of thing, and was going to ask if you had any good book references while reading your first one :).
The run-to-completion strategy is rather interesting. I’ve implemented various ‘fast-path’ style optimisations which essentially tried to avoid switching which job is run, but having an actual strategy to employ gives that a better basis, as well as a more structured way of thinking about the job scheduling.
Have you thought about using I/O automata as a structuring model? I’m somewhat biased, but I believe that the really good descriptions of Paxos and other consensus protocols in the literature are all based around state machines. And Mark Hayden’s Ensemble implementation is entirely built using I/O automata as a model. With such a model, there are explicitly only a few event types/I/O operations: {in,out}bound-message, disk-write-{start,completion}, timer. So you can end up with something that doesn’t need the complexity of monadic I/O, and hence you get to control things like I/O prioritization (always prioritize send over receive, disk-write over network I/O) for not much effort (b/c no abstractions between you and the event loop).
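As a hedged sketch of what that shape might look like in OCaml (the type and constructor names are illustrative, not taken from any particular Paxos formulation):

```ocaml
(* A small closed set of events and a pure step function from
   (state, event) to (state, actions). The surrounding event loop owns all
   prioritization decisions about which pending event to feed in next. *)
type message = string                       (* placeholder payload *)
type node_id = int
type log_entry = string

type event =
  | Inbound_message of node_id * message
  | Disk_write_completed of int             (* highest index flushed *)
  | Timer_fired

type action =
  | Send of node_id * message               (* prioritize over receives *)
  | Start_disk_write of log_entry list      (* prioritize over network I/O *)
  | Set_timer of float                      (* seconds *)

type state = { flushed_to : int; pending : log_entry list }

let step (state : state) (event : event) : state * action list =
  match event with
  | Inbound_message (from, m) ->
    { state with pending = m :: state.pending }, [ Send (from, "ack") ]
  | Disk_write_completed idx ->
    { state with flushed_to = idx }, []
  | Timer_fired ->
    state, [ Start_disk_write (List.rev state.pending); Set_timer 0.1 ]
```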
Also, I don’t know which presentation of Paxos you’ve settled-on, but for myself, I find Mencius by Keith Marzullo et al to be the … well, the most compelling variation.
I’m trying to build something to demonstrate how close Paxos is to Raft, so in order for it to be vaguely recognisable to Raft people it needs to use the RPC-style formulation… (the end result should be a nice comparison between Paxos and Raft).
That all being said, that representation sounds quite attractive; I’ll have a read of it tomorrow.
Paxos … Raft? But … Raft is an implementation of Paxos. By which I mean that, many things are left unspecified in Leslie’s original paper, because he felt they were either obvious, or extraneous to the thrust of the paper and protocol. Raft is one version of Paxos, where some of those things are specified in a particular way. I can understand Raft aficionados arguing that their particular version of Paxos is better than somebody else’s, but … to argue that it’s better than “Paxos”? I don’t even know what that would -mean-.
There are quite substantial differences between them with regard to leader election.
Most Paxos implementations can elect any node by replicating the entries which have not yet been committed (alternatively, any proposed values in synod) to that candidate (essentially taking the union of values which are proposed but uncommitted).
Most Raft implementations, by contrast, are more restricted in which nodes can become the leader: specifically, the candidate must have a log which is at least as up to date as those of a quorum of followers.
Otherwise, yes, they are identical.
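Very roughly, and only as a hedged sketch of the distinction described above (field and function names are illustrative, and many details are elided):

```ocaml
(* Paxos-flavoured recovery: any node may lead; before leading it adopts the
   union of proposed-but-uncommitted values reported by a quorum. *)
type entry = { index : int; term : int; value : string }

let union_of_uncommitted (quorum_reports : entry list list) : entry list =
  List.concat quorum_reports |> List.sort_uniq compare

(* Raft-flavoured restriction: a voter only grants its vote if the candidate's
   log is at least as up to date as its own, comparing (last term, last index). *)
let last_term_index (log : entry list) =
  match List.rev log with [] -> (0, 0) | e :: _ -> (e.term, e.index)

let grants_vote ~candidate_log ~voter_log =
  last_term_index candidate_log >= last_term_index voter_log
```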
Also I should have phrased that as ‘How close Raft is to Paxos’
*grin* Fair enough. I would have said something like:
Leslie doesn’t specify leader-election, except to suggest some ways it might happen, and to strongly note that regardless of how it’s done, and whether it’s buggy or not, the core replication protocol remains safe; that correct leader-election is only a performance optimization.
But yeah, I get you: my own preference when implementing Paxos is to build on top of a virtual synchrony system (Ensemble) and just use primary-partition groups. Ensemble picks who’s the leader, etc, etc, and I just don’t think about it.
BTW, have you looked at Keith Marzullo’s Mencius? It … dispenses with leader election by using a rotating leader. And if a node is down, other nodes can use normal Paxos to void/cancel that leader’s slots. To my mind, it might be the cleanest version of Paxos around. When I was thinking thru how to write a BFT consensus algorithm, I had worked-thru how to “Byzantinize” Mencius, b/c it seemed to offer the greatest possibility of both simplicity and resistance to bad actors.
Combine it with “epochs determined by a decree changing the set of members”, and I think you get a really clean design for a consensus service.