Hi. I read a file line by line and do some processing for each line. If I were to use Async to read line by line, instead of reading the file ordinarily via `In_channel.input_line`, would there be any difference in performance? Since there is a scheduler running, I wonder if it anticipates that there will be another file read and lets the system wait for a line read while doing some processing in parallel? That is, the application reads the file and processes lines in parallel. I think that since the file reading is buffered, this effect probably isn't large, since the system reads ahead several lines of the file into memory anyway.
It depends on whether reading the file is the only thing your application is trying to do at that time.
If reading the file is all you do at that point, then in principle running an additional scheduler (e.g. Async, Lwt, Eio, Miou) will cause a slowdown, but since IO is very slow compared to CPU, the slowdown is negligible. There is no benefit in using Async or anything alike in this case, but no tangible disadvantage either.
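For reference, the plain blocking version looks roughly like this; a minimal sketch where `process` stands in for whatever per-line work your application does:

```ocaml
(* Plain blocking line-by-line reading with the standard library.
   [process] is a stand-in for the per-line work; the channel's own
   buffering already reads ahead from the file in larger blocks. *)
let process_lines path process =
  In_channel.with_open_text path (fun ic ->
    let rec loop n =
      match In_channel.input_line ic with
      | None -> n (* EOF: return the number of lines handled *)
      | Some line -> process line; loop (n + 1)
    in
    loop 0)
```

While `process` runs here, no reading happens, and while `In_channel.input_line` blocks, no processing happens; that sequential hand-off is what the scheduler question is really about.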
If reading the file is not all you do at that point, and the way you wrote your application allows other parts of the code to progress while the read operation is blocking on IO, then it's likely you will get something done in "parallel", timeline-wise.
Whether everything combined improves the overall performance or not depends on the actual workload pattern of your pipeline, i.e. you have `Reading -> Processing`:

- If the `Processing` stage processes at least as fast as `Reading` can read, then the pipeline is working as fast as it can, i.e. the pipeline is IO bound
  - You generally cannot mitigate this without changing the underlying storage layer or minimising the IO required (e.g. swapping to a more compact format, or loading compressed data into memory and then decompressing it)
- If the `Processing` stage processes slower than the `Reading` stage can read, then the pipeline is not optimal, i.e. the pipeline is CPU bound
  - To mitigate this you can scale your code vertically (make it more performant directly), scale it horizontally (multithreading), or both
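To make the horizontal-scaling option concrete, here is a minimal sketch using OCaml 5 domains; `parallel_map` and the chunking strategy are illustrative, not a standard API, and it assumes the per-line function is pure:

```ocaml
(* Hypothetical sketch of scaling a CPU-bound Processing stage
   horizontally: split the items into contiguous chunks and let each
   domain fill its own disjoint slice of the result array. *)
let parallel_map ~domains f xs =
  let arr = Array.of_list xs in
  let n = Array.length arr in
  let out = Array.make n None in
  let chunk = if domains = 0 then n else (n + domains - 1) / domains in
  let worker d =
    (* Each domain writes only its own index range, so there is no
       data race on individual array cells. *)
    let lo = d * chunk and hi = min n ((d + 1) * chunk) in
    for i = lo to hi - 1 do out.(i) <- Some (f arr.(i)) done
  in
  let ds = List.init domains (fun d -> Domain.spawn (fun () -> worker d)) in
  List.iter Domain.join ds;
  Array.to_list (Array.map Option.get out)
```

This only pays off when `f` is genuinely CPU heavy; for trivial per-line work the domain spawn and join overhead dominates.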
If your application is consistently IO bound or CPU bound, then in principle buffer size doesn't impact your performance consideration that much. But if the workload pattern is not constant, i.e. some lines might be more CPU heavy than other lines, stalling the `Processing` stage, then a buffer can help avoid stalling the entire pipeline.
- If the `Processing` stage cannot catch up with `Reading` sporadically, then the buffer can allow `Reading` to continue if it has enough free space
- If the `Reading` stage cannot catch up with `Processing` sporadically, then the buffer can allow `Processing` to continue if it has enough backlog built up
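Sketching that buffered hand-off explicitly (assuming OCaml 5; `bounded_queue`, `push`, `pop`, and `close` are illustrative names, not a library API): the `Reading` stage pushes into a fixed-capacity queue and blocks only when it is full, while the `Processing` stage pops and blocks only when it is empty.

```ocaml
(* A bounded buffer between the Reading and Processing stages.
   Capacity bounds memory use; the two condition variables let each
   side block only when the buffer is full (producer) or empty
   (consumer). *)
type 'a bounded_queue = {
  q : 'a Queue.t;
  mutable closed : bool;
  capacity : int;
  mutex : Mutex.t;
  not_empty : Condition.t;
  not_full : Condition.t;
}

let make capacity = {
  q = Queue.create (); closed = false; capacity;
  mutex = Mutex.create ();
  not_empty = Condition.create ();
  not_full = Condition.create ();
}

let push b x =
  Mutex.lock b.mutex;
  while Queue.length b.q >= b.capacity do
    Condition.wait b.not_full b.mutex
  done;
  Queue.push x b.q;
  Condition.signal b.not_empty;
  Mutex.unlock b.mutex

let close b =
  Mutex.lock b.mutex;
  b.closed <- true;
  Condition.broadcast b.not_empty;
  Mutex.unlock b.mutex

let pop b =
  Mutex.lock b.mutex;
  let rec wait () =
    if not (Queue.is_empty b.q) then Some (Queue.pop b.q)
    else if b.closed then None
    else (Condition.wait b.not_empty b.mutex; wait ())
  in
  let r = wait () in
  Condition.signal b.not_full;
  Mutex.unlock b.mutex;
  r

(* Reading runs in its own domain; Processing drains the buffer here. *)
let run_pipeline lines process =
  let buf = make 8 in
  let reader = Domain.spawn (fun () ->
    List.iter (fun l -> push buf l) lines;
    close buf)
  in
  let rec consume acc =
    match pop buf with
    | None -> List.rev acc
    | Some l -> consume (process l :: acc)
  in
  let out = consume [] in
  Domain.join reader;
  out
```

The capacity (8 here, arbitrary) is exactly the "free space" and "backlog" from the two bullets above: a sporadically slow `Processing` stage only stalls `Reading` once the queue fills, and a sporadically slow `Reading` stage only stalls `Processing` once it drains.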
> That is, the application reads the file and processes lines in parallel. I think that since the file reading is buffered, this effect probably isn't large, since the system reads ahead several lines of the file into memory anyway.
I'm not quite sure what you mean by "effect" here.