So it seems there is no loophole. Function boundaries can only be crossed in a box.
Actually, there is a small trick you can use here. Do not use a generic reference, but use a custom type instead:
type t = { mutable f : float }

let calc2 rounds =
  let double = 2 * rounds in
  let upto = double + 1 in
  let sum = { f = 0. } in
  let rec loop curr =
    if curr >= upto
    then ()
    else begin
      sum.f <- sum.f +. 1. /. float_of_int curr;
      loop (curr + 4)
    end
  in
  loop (1 - double);
  sum.f
That way, the compiler will directly inline the floating-point value inside the structure, so the code will not incur any allocation (except the very first one) nor any write barrier.
Wow, this is quite non-obvious to me. I would have thought the compiler knows the same thing in either case, since the previous sum was also known to be a float ref. Apparently an 'a ref with 'a = float is not optimized the same way as a monomorphic custom type.
Would a Floatarray.t also work? I’m guessing a float array wouldn’t? Or does that depend on no-flat-float-array?
I’m wondering if this trick can be a way to allow breaking up numerical code into small functions, circumventing the float unboxing limitations. One would convert float arguments and return values into mutations, basically writing fortran routines.
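Something like this hypothetical sketch (the cell type and the helper names here are made up, just to illustrate the shape of the code):

```ocaml
(* A one-field float record: the field is stored unboxed. *)
type cell = { mutable f : float }

(* "Routines" read and write cells instead of passing floats around. *)
let square (x : cell) (out : cell) = out.f <- x.f *. x.f
let add (a : cell) (b : cell) (out : cell) = out.f <- a.f +. b.f

(* Fortran-style caller: all float traffic goes through mutations. *)
let sum_of_squares x y =
  let tx = { f = x } and ty = { f = y } and out = { f = 0. } in
  square tx tx;
  square ty ty;
  add tx ty out;
  out.f
```

Whether the compiler actually avoids all boxing here is exactly my question; this only shows what such code would look like.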
Yes, floatarray (and float array in flat mode) would also work just fine. But note that they imply some bounds checking. So, even if these checks are perfectly predicted by the processor, they still put some undue pressure on the decoding unit. You would have to use unsafe_get and unsafe_set to achieve the same performance as with a custom record.
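For instance, a minimal sketch of the same loop accumulating into a one-element floatarray (assuming the Float.Array module, available since OCaml 4.07):

```ocaml
(* Accumulate into a one-element floatarray; unsafe_get/unsafe_set
   skip the bounds checks mentioned above. *)
let calc3 rounds =
  let double = 2 * rounds in
  let upto = double + 1 in
  let sum = Float.Array.make 1 0. in
  let rec loop curr =
    if curr >= upto then ()
    else begin
      Float.Array.unsafe_set sum 0
        (Float.Array.unsafe_get sum 0 +. 1. /. float_of_int curr);
      loop (curr + 4)
    end
  in
  loop (1 - double);
  Float.Array.get sum 0
```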
Hello,
I cannot reproduce what the benchmark claims. I just checked gcc versus OCaml 4.14 and 5.4.
I used the C program from the GitHub repository (pi2.c), which outputs a different result, the ML program, and another C program more similar to the ML code (pi.c), and I get:
With gcc and ocaml 4.14.0
raffalli@oulala2:~/Programming/tmp$ gcc -O3 pi.c -o pi_c
raffalli@oulala2:~/Programming/tmp$ time ./pi_c
3.1415926525897171
real 0m0,861s
user 0m0,861s
sys 0m0,000s
raffalli@oulala2:~/Programming/tmp$ ocamlopt -O3 pi.ml -o pi_ml
raffalli@oulala2:~/Programming/tmp$ time ./pi_ml
3.1415926525897171
real 0m0,903s
user 0m0,902s
sys 0m0,000s
raffalli@oulala2:~/Programming/tmp$ gcc -O3 pi2.c -o pi2_c
raffalli@oulala2:~/Programming/tmp$ time ./pi2_c
3.1415926545880506
real 0m0,864s
user 0m0,864s
sys 0m0,000s
ocaml 5.4.0 is a bit faster.
raffalli@oulala2:~/Programming/tmp$ time ./pi_ml5
3.1415926525897171
real 0m0,863s
user 0m0,863s
sys 0m0,000s
Remark: I think references in OCaml are optimized into registers in that case (not the specialized structure proposed above).
Remark bis: the benchmark uses -O2 for OCaml, but it hardly makes a difference.
And if I try the same gcc options as the benchmark, I get the same results:
raffalli@oulala2:~/Programming/tmp$ gcc pi.c -o pi_opt -O3 -s -static -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
raffalli@oulala2:~/Programming/tmp$ time ./pi_opt
3.1415926525897171
real 0m0,862s
user 0m0,862s
sys 0m0,000s
I am wondering whether it is a problem with OCaml running inside Docker?
You need to at least pass a -O option to gcc, otherwise performance is expected to be poor.
Argh, a space was missing. I fixed my post, thanks.
If you want correct rounding for pi (a program that does not output this should be rejected), you must sum the small values first, with code like this (even the parentheses are important):
let calc n =
  let sum = ref 0. in
  for i = n downto 0 do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  !sum

(* rounds as in the earlier examples *)
let pi = 4. *. calc (rounds/2)
Gives 3.14159265258979347735
and not 3.141592652589717094 as the original does
(correct: 3.1415926535897932384626433…)
We are much closer to the correctly rounded IEEE 64-bit value.
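A quick sketch to see the summation-order effect (n is an arbitrary cutoff I picked):

```ocaml
let n = 1_000_000

(* small terms first (i = n downto 0), as recommended above *)
let backward =
  let sum = ref 0. in
  for i = n downto 0 do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  4. *. !sum

(* large terms first: each small term is rounded against a big partial sum *)
let forward =
  let sum = ref 0. in
  for i = 0 to n do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  4. *. !sum
```

Both stay within the series' truncation error of pi, but the last bits may differ; summing the small values first keeps the rounding error lower.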
I have looked at the famous benchmark game, and found that OCaml is 2.74 times slower than C if we reject SSE/MMX code and multithreaded code. But 2 of the entries use pragma optimisations. (I computed the exponential of the mean of the logarithms of the time ratios.)
The worst case seems to be k-nucleotide, where indexing an array can be very fast in C, and less so in OCaml (bounds checking…).
This comparison is limited… it would be best to convert the C code into OCaml so as to be sure to compare the same logic. (The C benchmarks have multiple submissions and OCaml usually only one… so less effort went into optimising it.)
I note a ratio of 3.9 between the C code I rejected and the C code I took into consideration against OCaml. In other words, the difference between two languages matters less than some very specific optimisation techniques (MMX/threads).
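"Exponential of the mean of the logarithms" is just the geometric mean; as a sketch, the computation I mean is:

```ocaml
(* Geometric mean of time ratios: exp of the mean of the logs. *)
let geometric_mean ratios =
  let n = float_of_int (List.length ratios) in
  exp (List.fold_left (fun acc r -> acc +. log r) 0. ratios /. n)
```

For example, geometric_mean [2.; 8.] comes out as 4 (up to rounding), so one 2x slowdown and one 8x slowdown average to 4x, not 5x.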
God I’m tempted to Solve All The Problems after reading this post, but I guess we should take it slowly.
Hmmm, it will degrade performance I think, but not that much on Linux?
This is also curious; you should probably tell the author.
We are ahead of Haskell and Python.
That’s very good news to me.
Python is not a good reference… it is one of the slowest.
There is an interesting benchmark here: Guillaume Marceau: The speed, size and dependability of programming languages. We can see the code-conciseness vs. performance balance. But I guess it used the game benchmark when it was at https://shootout.alioth.debian.org (so it can have the same biases as the previous link).
One interesting thing: I took one of the OCaml contributions, fannkuchredux (not the fastest, since the fastest uses process spawning), then converted it to C (with the help of Perplexity… it is quite convenient, but I had to compare both versions and fix two curious bugs in the translation).
Then:
$ time ./fannkuch_redux_c.exe 11
556355
Pfannkuchen(11) = 51
real 0m3.561s
user 0m0.000s
sys 0m0.078s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./fannkuch_redux_ml.exe 11
556355
Pfannkuchen(11) = 51
real 0m4.207s
user 0m0.031s
sys 0m0.000s
I find this comparison very good for OCaml. The C code is quite usual (idiomatic); I could have written the same manually. But note that it passes a function around and uses a recursive function. On the other hand, the code uses many array accesses, where C can be quite a bit faster (no bounds checking), so I am very surprised.
EDIT: With n_body… a heavy floating-point computation:
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./n_body_ml.exe 100000000
-0.169075164
-0.169035465
real 0m10.632s
user 0m0.015s
sys 0m0.000s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./n_body_c.exe 100000000
-0.169075164
-0.169035465
real 0m7.640s
user 0m0.000s
sys 0m0.047s
I thought that float boxing would give C quite an advantage, but the difference is not that large.
However, with spectral_norm — many float arrays (created simply with Array.make) and a float ref:
$ time ./spectral_norm_ml.exe 5500
1.274224153
real 0m7.223s
user 0m0.031s
sys 0m0.015s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./spectral_norm_c.exe 5500
1.274224153
real 0m1.894s
user 0m0.031s
sys 0m0.031s
With fasta (converted by Claude… Perplexity was too buggy). DNA handling (floats and char arrays… very close results):
$ time ./fasta_ml.exe 25000000 |tail
gatctatgattcttaccagttgcaatattcaatttagcatgtgttcctaattattgtgtt
attatggtctatctcatcatgtaaatgaagatcatgacgtcaacacagattctagtcagg
atcatcagttcctcggggaaatcgcacctaggaacagccttatgcaaccgctaaacaaag
caatgaggatgtaccgacaaaagctcgatttaaaagcctcgaaacgagatgtacgaatcg
tttactgccttttatgaggagtcgagtactgttggttcatatttgctacatgattgtatg
taataacgatcccgccctttatcggttcgatcctttatggcgataagttatgaatcgtca
gtatctttagatcaaaaactcaactagtacccagttccccggaggaacggtcatgattaa
tgcgttttacggtctcccgtccctcttcttgtcagaggaatcagtttcatccgatcccac
tcgatgattggtatagctatttgccgaaaagccacaacgtattcggtactatcttgtttg
attcccctgtatcttaattc
real 0m5.602s
user 0m0.234s
sys 0m1.608s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./fasta2_c.exe 25000000 |tail
gatctatgattcttaccagttgcaatattcaatttagcatgtgttcctaattattgtgtt
attatggtctatctcatcatgtaaatgaagatcatgacgtcaacacagattctagtcagg
atcatcagttcctcggggaaatcgcacctaggaacagccttatgcaaccgctaaacaaag
caatgaggatgtaccgacaaaagctcgatttaaaagcctcgaaacgagatgtacgaatcg
tttactgccttttatgaggagtcgagtactgttggttcatatttgctacatgattgtatg
taataacgatcccgccctttatcggttcgatcctttatggcgataagttatgaatcgtca
gtatctttagatcaaaaactcaactagtacccagttccccggaggaacggtcatgattaa
tgcgttttacggtctcccgtccctcttcttgtcagaggaatcagtttcatccgatcccac
tcgatgattggtatagctatttgccgaaaagccacaacgtattcggtactatcttgtttg
attcccctgtatcttaattc
real 0m5.083s
user 0m0.390s
sys 0m1.623s
With binary-trees. Note that Claude optimised the code with an ad hoc pool allocator:
$ time ./binarytrees_c.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m2.461s
user 0m0.000s
sys 0m0.047s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./binarytrees_ml.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m9.143s
user 0m0.016s
sys 0m0.046s
If we replace the ad hoc pool allocator with plain malloc/free…
$ time ./binarytrees2_c.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m35.190s
user 0m0.000s
sys 0m0.061s
A typical (idiomatic) C program can be 4 times slower than its OCaml equivalent!!!
Not yet compared: mandelbrot (the OCaml submission I have seen uses a fork, but there is a “normal” contribution) and reverse-complement, both on my todo list; pidigits (it would be a comparison of zarith vs. GMP more than of the compilers); regex-redux (it would be a comparison of the two regex libraries rather than of the compiler itself).
So I find OCaml very efficient… except in one case, where it could be interesting to analyse the OCaml program I linked to discover why. And the C malloc/free functions (and surely new/delete in C++) are very inefficient. We can avoid them when it is critical, but who does? (However, an ad hoc pool storage is more efficient than the OCaml allocator.)
Computing the geometric mean of the ratios… OCaml is 1.9 times slower than C (with the ad hoc allocator), and 1.1 times slower than C (with plain malloc/free).
(I used gcc with the -O3 option, and just plain ocamlopt.)
See the code here.