fivemack: (Default)
[personal profile] fivemack
I have just emerged from a five-hour optimisation trance, fuelled by sushi and plum wine.

Here is the code; it looks for numbers which can be written as the sum of three sixth powers in two different ways. It's multi-threaded (by constants in the code assuming you have four cores spare) and uses a cache-friendly blocked linked-list structure; it slices up the problem into chunks by sum-modulo-P and then into buckets by sum-modulo-Q, then sorts the bucket contents. It uses moderately prodigious amounts of memory - about 20N^2 bytes. It takes 1m15s wall-time on my machine (2.4GHz quad-core) to run 'sumsix_t4 2003', 10m35s wall-time for 'sumsix_t4 4007'. I'm surprised that the output from 'time' indicates that it spends non-trivial time in the OS: how? The only bits that look like OS calls are print statements and memory allocation, and it only does O(N) of those.

real 10m35.586s
user 31m20.462s
sys 5m20.408s

I'm sure it could be significantly faster - it's trivially parallel and I'm only getting x3 speedup on four CPUs - but I am not quite sure how. I've looked at profiles, I suspect I may have been foiled by out-of-order execution, where quick-op ( result of slow-op ) has to wait for slow-op to complete and so appears as a hotspot when the real hotspot is elsewhere. Hard-wiring the values of 'N' and 'bucket' so that the compiler can replace the modulo operations with multiplies by magic constants doesn't make a difference.

If there are people out there who like this sort of challenge, I'd like some input.

Date: 2008-03-01 10:17 am (UTC)
ext_8103: (Default)
From: [identity profile] ewx.livejournal.com
Try taking all the memory allocation outside the loop; it may well involve taking locks even if it's not actually asking for new memory most of the time, and with the possible exception of dat2 appears to depend entirely on global constants, although this rather obscured by the way you pass bucket and N around.

Date: 2008-03-01 04:51 pm (UTC)
ext_8103: (Default)
From: [identity profile] ewx.livejournal.com

...in particular the buckets array is big enough that you might well be doing a syscall for every allocate and deallocate, if the memory manager is aggressive about returning memory to the OS; and moreover one that messes with page tables, which is potentially quite expensive and will surely have to serialize across all your threads.

You may be able verify or eliminate this with strace/ktrace/truss, rather than speculate, of course l-)

March 2024

S M T W T F S
     12
3456789
10111213141516
17181920212223
24 252627282930
31      

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Dec. 29th, 2025 10:25 am
Powered by Dreamwidth Studios