I ran some examples and got these results: for a large number of iterations the parallel loop gives a good result, but for a small number of iterations it gives a worse one.
I know there is a little overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations faster in parallel than sequentially?
```julia
x = 0
@time for i = 1:200000000
    x = Int(rand(Bool)) + x
end
# 7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
```
```julia
x = @time @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end
# 0.432549 seconds (3.91 k allocations: 241.138 KiB)
```
I got a good result for the parallel version here, but not in the following example.
```julia
x2 = 0
@time for i = 1:100000
    x2 = Int(rand(Bool)) + x2
end
# 0.006025 seconds (98.97 k allocations: 1.510 MiB)
```
```julia
x2 = @time @parallel (+) for i = 1:100000
    Int(rand(Bool))
end
# 0.084736 seconds (3.87 k allocations: 239.122 KiB)
```
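One thing worth noting about the sequential timings above: they report roughly one allocation per iteration because the loop mutates a global variable at top level. Wrapping the same loop in a function lets the compiler infer the accumulator's type, which usually removes those allocations entirely. A minimal sketch (the function name `sumbools` is just an illustration):

```julia
# Sketch: the same sequential reduction wrapped in a function, so the
# accumulator `x` is a local with an inferred type and the loop no
# longer allocates on every iteration.
function sumbools(n)
    x = 0
    for i = 1:n
        x += Int(rand(Bool))
    end
    return x
end

sumbools(1)                  # warm up / compile once before timing
@time sumbools(200_000_000)  # time only the optimised loop body
```

This does not make the small-iteration parallel case faster, but it makes the sequential baseline fair before comparing it against `@parallel`.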
Q: is there any way to run a loop with a small number of iterations faster in parallel than sequentially?
1) Acquire more resources (processors to compute, memory to store), if all this makes sense for the problem
2) Arrange the workflow smarter – benefit from register-based code, harness the cache-line sizes upon each first fetch, and deploy re-use where possible (hard work? yes, it is hard work, but why repetitively pay 150+ [ns] instead of paying this once and re-using well-aligned neighbouring cells within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means code re-design with respect to increasing the "density" of computations in the resulting assembly code and tweaking the code so as to better bypass the (optimising) superscalar processor hardware design tricks, which are of no positive benefit in highly-tuned HPC computing payloads.
3) Avoid running into any blocking resources & bottlenecks (central singularities like a host's unique hardware source of randomness, IO devices, et al.)
4) Get familiar with your optimising compiler's internal options and "shortcuts"; sometimes anti-patterns get generated at a cost of extended run-times
5) Get the maximum from tweaking your underlying operating system. Without this, your optimised code still waits (and a lot) in the O/S scheduler's queue
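The cache-line argument in point 2 can be illustrated with a traversal-order sketch (the array and function names are just an illustration). Julia stores arrays column-major, so an inner loop over the first index walks contiguous memory and re-uses each fetched cache line, while the transposed order pays the memory-latency cost far more often:

```julia
# Column-major traversal re-uses each fetched cache line;
# the transposed order strides through memory instead.
function sum_colmajor(A)
    s = 0.0
    for j in 1:size(A, 2), i in 1:size(A, 1)  # inner loop walks down a column (contiguous)
        s += A[i, j]
    end
    return s
end

function sum_rowmajor(A)
    s = 0.0
    for i in 1:size(A, 1), j in 1:size(A, 2)  # inner loop strides across a row
        s += A[i, j]
    end
    return s
end
```

Both functions compute the same sum; only the memory-access pattern (and therefore the run-time on large arrays) differs.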
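One concrete instance of point 3 in the loops above is `rand()` itself: every iteration draws from one shared default RNG. A sketch of giving each worker its own RNG stream instead (the function name, chunk split, and seeds are hypothetical; `MersenneTwister` and `rand(rng, Bool)` are standard, with `Random` being a stdlib module in Julia ≥ 1.0):

```julia
using Random  # stdlib in Julia >= 1.0

# Each worker sums its own chunk using a private, seeded RNG,
# so no worker contends for a shared source of randomness.
function chunk_sum(n, seed)
    rng = MersenneTwister(seed)  # private RNG for this chunk
    s = 0
    for i = 1:n
        s += Int(rand(rng, Bool))
    end
    return s
end

# e.g. split 100_000 iterations over 4 chunks with distinct seeds,
# then reduce the partial sums (shown sequentially here, for clarity):
total = sum(chunk_sum(25_000, seed) for seed in 1:4)
```

In a real `@parallel`/distributed run each worker would execute its own `chunk_sum` call, and only the four partial sums would travel back for the `(+)` reduction.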