I feel like there are a few things you might want to keep in mind when designing the system you describe. These considerations are core considerations from transaction processing, and if, for instance, you consult Gray & Reuter (which I encourage you to do, on general principles), you will find them expressed there in … quite insistent terms.
- Admission control: this is a key idea in all transaction-processing monitors; indeed, it and the next point are the reason the first TP monitor, CICS, was implemented by IBM and their power-company client. It is critical to be able to constrain how many requests are in flight in a TP monitor process at any time. Requests consume resources, which leads directly into …
- Capacity/resource limitation: the other key attribute of TP monitors, since the very first, has been control over the maximum amount of resources requested from the operating system. Enqueueing a request to an in-memory queue consumes resources, and there need to be strict limits on that. (A small sketch covering this point and the previous one appears after this list.)
- Run-to-completion: finally, there is a key idea (again, from the early history of TP, including TPF, the TWA Airline Control Program, which you can find documented in CACM 1984 by Spector & Gifford) that once a transaction starts running, it “runs to completion” until it hits a suspension point due to unavailable data or some latency-bound operation (e.g. a disk flush). As an example, the original TPF TP monitor did not have a system interrupt: a transaction ran until it finished. This (perhaps paradoxical) idea is sound: eventually the tran will need to finish running, so you might as well let it do so now, and consume all the resources -now-, so that it can finish and then release them for other trans. Suspending it while it continues to hold resources only hurts other trans that could have used those resources.
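To make the first two points concrete, here is a minimal sketch (TypeScript, since promises come up later) of an admission gate that caps both the number of in-flight requests and the in-memory queue behind them. AdmissionGate, maxInFlight, maxQueued and run are all made-up names for illustration; this is a sketch of the idea, not a production TP monitor.

```typescript
// A minimal sketch, not a production TP monitor.
class AdmissionGate {
  private available: number;                        // free execution slots
  private readonly waiting: Array<() => void> = []; // bounded in-memory queue

  constructor(maxInFlight: number, private readonly maxQueued: number) {
    this.available = maxInFlight;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.available === 0) {
      if (this.waiting.length >= this.maxQueued) {
        // Strict enforcement: shed load rather than grow an unbounded queue.
        throw new Error("admission rejected: at capacity");
      }
      // Park the caller; a finishing task will hand us its slot directly.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.available--; // take a free slot immediately
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next();           // transfer our slot to the next queued request
      } else {
        this.available++; // or return it to the pool
      }
    }
  }
}
```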
Finally, there is an important rule to keep in mind: “two-phase locking”. This rule isn’t just about locks, but about all forms of resources. A transaction should never give up resources it has acquired until it either reaches a suspension point or completes. At either of those points, it should give up as many resources as is reasonably possible.
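In code, the bookkeeping for that rule might look something like the following sketch (Releasable and ResourceScope are illustrative names, not any particular library): resources are held for the life of the running step and released in bulk only at a suspension point or at commit/abort.

```typescript
// Sketch only; Releasable and ResourceScope are illustrative names.
interface Releasable {
  release(): void;
}

class ResourceScope {
  private readonly held: Releasable[] = [];

  // Acquired resources are held for the duration, per the rule above.
  track<R extends Releasable>(resource: R): R {
    this.held.push(resource);
    return resource;
  }

  // Called only at a suspension point or at commit/abort: release everything
  // that can reasonably be released, in reverse acquisition order.
  releaseAll(): void {
    while (this.held.length > 0) {
      this.held.pop()!.release();
    }
  }
}
```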
Concretely, I’ll suggest two immediate ramifications:
- Offloading units of work to a queue is a good idea when those UOWs are low-frequency and the offloading actually frees up resources. Doing so for the typical UOWs in a system, and doing so to an in-memory queue, doesn’t free up resources.
- Whatever inbound workload-management mechanism you use must be configurable as to how much uncompleted work (typically == “in-flight trans”) can be tolerated at any time, and it must strictly enforce those limits.
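Continuing the earlier sketch, wiring that in front of an inbound handler might look like this; handleRequest is a placeholder for your actual unit of work, and the numbers are purely illustrative, not recommendations.

```typescript
// Illustrative limits only; set them from measured resource budgets.
declare function handleRequest(req: unknown): Promise<unknown>; // placeholder

const gate = new AdmissionGate(/* maxInFlight */ 64, /* maxQueued */ 256);

async function onInboundRequest(req: unknown): Promise<unknown> {
  try {
    // Admission is decided before any real work or memory is committed.
    return await gate.run(() => handleRequest(req));
  } catch {
    // Over-limit work is rejected immediately and cheaply, preserving the
    // resources that in-flight transactions need to run to completion.
    return { status: 503, reason: "overloaded" };
  }
}
```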
As an example of why run-to-completion is so critically important, imagine a process with a single kernel thread that is presented with 100 UOWs, each of which is ten steps of computation (suppose each step takes 1ms), perhaps chained together with promises.[1] If the scheduling mechanism employs run-to-completion, the average latency is roughly 50 * 10ms == 500ms. If the scheduler employs some round-robin “fair” scheduling strategy that is oblivious to the chained nature of the computations, then it is easy to end up with the average latency being 100 * 10ms == 1000ms (or thereabouts). The term for the related behaviour when memory is overcommitted in virtual-memory operating systems is “thrashing”, and this is equally an example of thrashing, though of different resources.
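A quick back-of-the-envelope check of those figures, just spelling out the arithmetic from the paragraph above:

```typescript
const uows = 100;                    // units of work
const stepsPerUow = 10;              // steps per UOW
const stepMs = 1;                    // ms per step
const uowMs = stepsPerUow * stepMs;  // 10 ms of work per UOW

// Run-to-completion: UOW i finishes at i * 10 ms, so the mean is
// ((1 + 100) / 2) * 10 ms = 505 ms, i.e. roughly 50 * 10 ms.
const runToCompletionAvg = ((uows + 1) / 2) * uowMs;

// Step-level round robin: each UOW gets its last step only after all the
// others have had nine, so the mean is ~950 ms, i.e. roughly 100 * 10 ms.
const roundRobinAvg =
  (stepsPerUow - 1) * uows * stepMs + ((uows + 1) / 2) * stepMs;

console.log(runToCompletionAvg, roundRobinAvg); // 505, 950.5
```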
[1] Consider, for instance, accessing a remote server with a local in-memory cache that is almost always a hit. You need to provide a promise-based implementation, but with extremely high probability the promise is fulfilled immediately, with no suspension/latency.
Last thought: naive implementations of promises may violate run-to-completion, or may not provide enough expressive capability for the programmer to state that their computation is runnable to completion. This is something to investigate carefully before using them.
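As one concrete illustration of the footnote and this last thought, here is a sketch (with a hypothetical fetchFromRemote and a plain Map standing in for the cache) of keeping the almost-always-hit path synchronous, so the transaction keeps running to completion, and only paying for a real suspension point on a miss. Naively always returning a promise would force a scheduling hop even when no suspension is actually needed.

```typescript
type MaybeAsync<T> = T | Promise<T>;

const cache = new Map<string, string>();
declare function fetchFromRemote(key: string): Promise<string>; // hypothetical

function lookup(key: string): MaybeAsync<string> {
  const hit = cache.get(key);
  if (hit !== undefined) {
    return hit; // hot path: no promise, no suspension, run to completion
  }
  // Cache miss: a genuine latency-bound operation, so a promise is warranted.
  return fetchFromRemote(key).then((value) => {
    cache.set(key, value);
    return value;
  });
}

// Caller continues synchronously on a hit and suspends only on a real miss.
function processKey(key: string, onDone: (value: string) => void): void {
  const result = lookup(key);
  if (result instanceof Promise) {
    result.then(onDone); // real suspension point
  } else {
    onDone(result);      // stay on the run-to-completion fast path
  }
}
```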