@parallel vs. native loops in julia


Question:
I ran some examples and compared the results. For a large number of iterations the parallel version wins, but for a smaller number of iterations it performs worse than the sequential loop.

I know there is some overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations faster in parallel than sequentially?
x = 0
@time for i = 1:200000000
    x = Int(rand(Bool)) + x
end

7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end

0.432549 seconds (3.91 k allocations: 241.138 KiB)

I got a good result for the parallel version here, but not in the following example:
x2 = 0
@time for i = 1:100000
    x2 = Int(rand(Bool)) + x2
end

0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i = 1:100000
    Int(rand(Bool))
end

0.084736 seconds (3.87 k allocations: 239.122 KiB)


Answer:
Q: Is there any way to run a loop with a small number of iterations faster in parallel than sequentially?

A: Yes.

1) Acquire more resources ( processors to compute, memory to store ), if that makes sense for the workload.
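
In Julia, acquiring more processors means starting worker processes before the reduction runs. A minimal sketch, assuming a machine with at least four free cores (note that in Julia ≥ 1.0 the `@parallel (+)` reduction is spelled `@distributed (+)` and lives in the `Distributed` stdlib):

```julia
using Distributed

addprocs(4)   # spawn 4 local worker processes to compute on

# Same reduction as in the question, now spread over the new workers:
x = @distributed (+) for i in 1:200_000_000
    Int(rand(Bool))
end
```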

2) Arrange the workflow smarter: benefit from register-based code, harness the full cache line on each first fetch, and deploy data re-use wherever possible. Hard work? Yes, it is hard work, but why repetitively pay 150+ [ns] for a main-memory fetch instead of paying it once and then re-using well-aligned neighbouring cells at roughly 30 [ns] latency ( if NUMA permits )? A smarter workflow also often means code re-design aimed at increasing the "density"-of-computations in the resulting assembly code and at avoiding patterns that defeat the ( optimising ) superscalar processor's hardware tricks, which bring no positive benefit in highly tuned HPC payloads.
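
For the sequential baseline in the question, point 2 starts with Julia's best-known rule: avoid untyped globals. A minimal sketch of the same reduction, rewritten so the accumulator can live in a register (the function name is illustrative):

```julia
# Summing inside a function (instead of mutating a global) lets the
# compiler infer the type of `x` and keep it in a register -- the
# global-scope loop above boxes `x` on every iteration, which is where
# its 2.980 GiB of allocations come from.
function count_heads(n)
    x = 0
    for i in 1:n
        x += Int(rand(Bool))
    end
    return x
end

count_heads(1)              # warm up: pay the compilation cost once
@time count_heads(100_000)  # microseconds, essentially allocation-free
```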

3) Avoid headbangs into any blocking resources and bottlenecks ( central singularities such as a host's single hardware source-of-randomness, IO-devices et al ).
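
Point 3 applies directly to `rand()` here: each Julia worker process owns its own generator, so a distributed reduction never funnels through one shared source of randomness. A sketch that makes this explicit by seeding every worker's local RNG (Julia ≥ 1.0 spelling, where `@parallel` became `@distributed`; the seed values are arbitrary):

```julia
using Distributed, Random

addprocs(2)
@everywhere using Random   # make Random available on the workers too

# Seed each worker's own generator -- no process ever touches another's
# randomness source, so nothing blocks on a shared device:
for (i, w) in enumerate(workers())
    remotecall_fetch(Random.seed!, w, 1000 + i)
end

x = @distributed (+) for i in 1:100_000
    Int(rand(Bool))
end
```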

4) Get familiar with your optimising compiler's internal options and "shortcuts" — sometimes anti-patterns get generated, at the cost of extended run-times.
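
To see whether the compiler has generated such an anti-pattern, Julia can show the emitted code directly via the `InteractiveUtils` stdlib (the function here is a throwaway example):

```julia
using InteractiveUtils   # stdlib: provides @code_llvm / @code_native

tally(n) = sum(Int(rand(Bool)) for i in 1:n)

# Inspect the code the compiler actually emits; boxing, runtime dispatch
# and other anti-patterns show up here long before they show up in @time:
@code_llvm tally(10)
```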

5) Get the maximum from tweaking your underlying operating system. Without this, your optimised code still waits ( and a lot ) in the O/S-scheduler's queue.
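
On Linux, for example, one such O/S-level tweak is to pin the Julia process to dedicated cores so the scheduler cannot migrate the workers mid-run (a sketch; `taskset` is part of util-linux, and `script.jl` stands for whatever file holds the reduction):

```shell
# Pin the master and its 4 workers to cores 0-3, so the O/S scheduler
# does not bounce them across cores (or sockets) during the run:
taskset -c 0-3 julia -p 4 script.jl
```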
