fivemack | Mar. 1st, 2008

I have just emerged from a five-hour optimisation trance, fuelled by sushi and plum wine.

Here is the code; it looks for numbers which can be written as the sum of three sixth powers in two different ways. It's multi-threaded (by constants in the code assuming you have four cores spare) and uses a cache-friendly blocked linked-list structure; it slices up the problem into chunks by sum-modulo-P and then into buckets by sum-modulo-Q, then sorts the bucket contents. It uses moderately prodigious amounts of memory - about 20N^2 bytes. It takes 1m15s wall-time on my machine (2.4GHz quad-core) to run 'sumsix_t4 2003', 10m35s wall-time for 'sumsix_t4 4007'. I'm surprised that the output from 'time' indicates that it spends non-trivial time in the OS: how? The only bits that look like OS calls are print statements and memory allocation, and it only does O(N) of those.

real 10m35.586s
user 31m20.462s
sys 5m20.408s

I'm sure it could be significantly faster - it's trivially parallel and I'm only getting x3 speedup on four CPUs - but I am not quite sure how. I've looked at profiles, I suspect I may have been foiled by out-of-order execution, where quick-op ( result of slow-op ) has to wait for slow-op to complete and so appears as a hotspot when the real hotspot is elsewhere. Hard-wiring the values of 'N' and 'bucket' so that the compiler can replace the modulo operations with multiplies by magic constants doesn't make a difference.

If there are people out there who like this sort of challenge, I'd like some input.

S	M	T	W	T	F	S
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Tom Womack

Mar. 1st, 2008

Mar. 1st, 2008

Many eyes make fast code

Profile

March 2024

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags