Thanks for the detailed comments @art-w! I've incorporated most of your suggestions and am now looking into implementing a lock-free bag as an alternative batching option. In other news, here's an update along with some open questions.
## Batcher design
The implicit batching support that I'm using relies on promises to mimic the batching logic. This is quite different from the design described in the paper, at least in terms of fairness to data structure operations. With the original batcher design, when a worker (domain) encounters a data structure operation, it commits to working only on batch-related operations. This also means it can no longer take part in the main program execution that issues data structure operations, so the batch size is limited to the number of domains running the program. In the promise-based batcher, on the other hand, the batch size is theoretically limited only by the number of asynchronous tasks that are spawned. In my opinion the promise-based one seems to be the better choice, but I'm wondering whether there is any value in also providing the original batcher. I've thought about the pros and cons of each design and here's what I have come up with.
Original batcher:

Pros:

- DS-related tasks are biased to run to completion first.

Cons:

- While throughput of the DS is high, it comes at the expense of the progress made by other workers executing non-DS-related work.
- No customizability of the batch size, so less flexibility for performance tuning of the program. E.g. if operations are cheap, committing one domain per operation could slow down the progress of the entire program.
Promise-based batcher:

Pros:

- All tasks are run equally according to the work-stealing FIFO policy. The program makes progress more fairly and generally runs to completion faster.
- Batch sizes become customizable by the user's program. Performance gains can be made by adjusting the batch sizes according to the cost of operations. (However, this could also be a downside: users now have to control the number of DS tasks that are running.)

Cons:

- Because a larger batch has to be collected, the time between when an operation is launched and when it is completed is longer.
A gross simplification of the design trade-off: how quickly you want your data structure operations to be fulfilled versus how quickly you want the whole program to complete. The original design provides the former and the promise-based design the latter.
These are just my hypotheses; I would still need to implement the batcher true to its specification to validate them. Any thoughts about other things to consider regarding the design limitations?
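To make the comparison a bit more concrete, here is a rough interface sketch of the promise-based design. The module and function names below are illustrative placeholders for discussion, not the actual code; only `Domainslib.Task.pool`, `promise` and `await` are real Domainslib API.

```ocaml
(* Sketch only: names and signatures are placeholders, not the real API. *)
module type Batched_skiplist = sig
  type t

  val create : ?batch_limit:int -> Domainslib.Task.pool -> t

  (* Registers the insert in the current batch and returns a promise;
     the promise is fulfilled once the batch containing it is executed. *)
  val insert : t -> int -> unit Domainslib.Task.promise

  (* Run by a single worker on behalf of all pending operations. *)
  val bulk_insert : t -> int array -> unit
end

module Example (S : Batched_skiplist) = struct
  (* With this shape, the batch size is bounded by how many asynchronous
     insertions are in flight, not by the number of domains. *)
  let insert_many pool t xs =
    let promises = List.map (S.insert t) xs in
    List.iter (Domainslib.Task.await pool) promises
end
```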
## Performance tuning with variable batch sizes on batched skip-list
I tried to reproduce the experimental results of the paper by implementing a skip-list with “set-like” properties. I can’t quite get the throughput that the paper reported, but the trends seem reasonably similar.
Initialized: 1 million elements
Inserts: 100,000 elements

| num_domains | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Seq_ins (ops/ms) | 299 | 284 | 299 | 301 | 298 | 284 | 297 |
| Batched_ins (ops/ms) | 346 | 432 | 465 | 563 | 585 | 644 | 627 |

Initialized: 10 million elements
Inserts: 100,000 elements

| num_domains | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Seq_ins (ops/ms) | 137 | 142 | 153 | 153 | 149 | 153 | 149 |
| Batched_ins (ops/ms) | 156 | 240 | 260 | 409 | 432 | 423 | 522 |
Running on 8 cores, the implicit batched insertions perform about 2x better with 1 million pre-initialized elements and about 3x better with 10 million pre-initialized elements.
As mentioned in the pros of the promise-based design, we can play around with the batch sizes and see how that affects performance. In the results above, I used the default `chunk_size` generated by `parallel_for`. Below are some performance metrics when we vary the chunk size.
Initialized: 10 million elements
Inserts: 100,000 elements
ImpBatch_ins throughput (ops/ms) for varying batch limits:

| num_domains | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| batch limit = 1 | 148 | 147 | 145 | 130 | 134 | 138 | 138 | 133 |
| batch limit = 64 | 149 | 234 | 236 | 382 | 412 | 456 | 487 | 476 |
| batch limit = 127 | 246 | 332 | 525 | 571 | 644 | 663 | 646 | 700 |
| batch limit = 512 | 144 | 215 | 281 | 350 | 410 | 428 | 483 | 469 |
| batch limit = 4096 | 141 | 225 | 250 | 267 | 285 | 430 | 457 | 475 |
| batch limit = 100,000 | 136 | 103 | 357 | 183 | 494 | 462 | 506 | 571 |
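For context, the batch limit in these runs is controlled through the `~chunk_size` argument of `Task.parallel_for`. A rough sketch of the benchmark driver is below; `insert` stands in for the skip-list's promise-returning insert and is not a real Domainslib function.

```ocaml
(* Rough sketch of the benchmark loop. The ~chunk_size argument to
   Task.parallel_for is the knob being varied as "batch limit" above. *)
let bench ~num_domains ~chunk_size ~n_inserts ~insert skiplist =
  (* setup_pool counts the domains spawned in addition to the main one *)
  let pool = Domainslib.Task.setup_pool ~num_domains:(num_domains - 1) () in
  Domainslib.Task.run pool (fun () ->
      Domainslib.Task.parallel_for pool ~chunk_size
        ~start:0 ~finish:(n_inserts - 1)
        ~body:(fun i ->
            (* [insert] returns a promise; awaiting it suspends this
               iteration so the worker can pick up other pending inserts,
               which is what lets batches form. *)
            Domainslib.Task.await pool (insert skiplist i)));
  Domainslib.Task.teardown_pool pool
```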
It seems like a batch limit of 127 operations is the sweet spot for this skip-list workload. However, this might not be the case for a different workload or a different data structure: the batched counter's best performance comes from a batch limit of 4096 operations, for example.
I suppose for now we could leave it up to the default `chunk_size` to pick an appropriate batch size, but I think there should be a better way to reason about how to select batch sizes, or some facility that automatically picks a good one, something like a `recommended_batch_size` function. I don't have a good idea of how to come up with it at the moment; any suggestions?
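To make the question concrete, I imagine the shape would be something like the following. This is entirely hypothetical, including the choice of inputs.

```ocaml
(* Hypothetical: it is not obvious what the inputs should even be,
   let alone how the body should compute a good value. *)
module type Batch_size_hint = sig
  val recommended_batch_size :
    num_domains:int -> approx_op_cost_ns:int -> int
end
```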
Another question is how to limit the batch size. Currently the batched DS carries around a large array of configurable size to store the batch. This has a performance cost, and there is also a problem when the number of operations exceeds the size of the array: there isn't any way of handling it, and the program crashes with an assert failure. Ideally, the batcher would suspend the current task when the batch is full, so that the user can set the batch size once and not have to worry about accidentally creating too many tasks that make data structure calls. I'm thinking an implicit yield could be helpful in this regard, but I'm also open to other ideas.
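To illustrate the behaviour I have in mind, the registration path could look roughly like the sketch below. Here `try_launch` is a placeholder for whatever kicks off bulk execution, and `yield` is the hypothetical suspension hook; as far as I know, Domainslib does not expose such a hook today, which is exactly the facility I'm asking about.

```ocaml
(* Sketch only: [try_launch] and the op representation are placeholders. *)
type 'op t = {
  batch     : 'op option array;   (* fixed-size batch buffer *)
  batch_pos : int Atomic.t;       (* next free slot *)
}

let rec register t op ~try_launch ~yield =
  let slot = Atomic.fetch_and_add t.batch_pos 1 in
  if slot < Array.length t.batch then begin
    t.batch.(slot) <- Some op;
    try_launch t                  (* start bulk execution if nobody has *)
  end else begin
    (* Today this case is an assert failure; instead, suspend and retry
       once the running batch has drained and reset [batch_pos]. *)
    yield ();
    register t op ~try_launch ~yield
  end
```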