Mar. 1st, 2008

fivemack: (Default)
I have just emerged from a five-hour optimisation trance, fuelled by sushi and plum wine.

Here is the code; it looks for numbers which can be written as the sum of three sixth powers in two different ways. It's multi-threaded (by constants in the code assuming you have four cores spare) and uses a cache-friendly blocked linked-list structure; it slices up the problem into chunks by sum-modulo-P and then into buckets by sum-modulo-Q, then sorts the bucket contents. It uses moderately prodigious amounts of memory - about 20N^2 bytes. It takes 1m15s wall-time on my machine (2.4GHz quad-core) to run 'sumsix_t4 2003', 10m35s wall-time for 'sumsix_t4 4007'. I'm surprised that the output from 'time' indicates that it spends non-trivial time in the OS: how? The only bits that look like OS calls are print statements and memory allocation, and it only does O(N) of those.

real 10m35.586s
user 31m20.462s
sys 5m20.408s

I'm sure it could be significantly faster - it's trivially parallel and I'm only getting x3 speedup on four CPUs - but I am not quite sure how. I've looked at profiles, I suspect I may have been foiled by out-of-order execution, where quick-op ( result of slow-op ) has to wait for slow-op to complete and so appears as a hotspot when the real hotspot is elsewhere. Hard-wiring the values of 'N' and 'bucket' so that the compiler can replace the modulo operations with multiplies by magic constants doesn't make a difference.

If there are people out there who like this sort of challenge, I'd like some input.

March 2024

S M T W T F S
     12
3456789
10111213141516
17181920212223
24 252627282930
31      

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Aug. 12th, 2025 09:42 pm
Powered by Dreamwidth Studios