I have been known to factorise large numbers from time to time. I have a fairly ludicrous computer with 48 processors; processors 6n .. 6n+5 share a memory bank.
When I run 'mpirun -n 24 [job]', the speed of the job changes substantially, and often for the worse, every couple of hours. I suspect the scheduler is shuffling the jobs around the processors; even when I use taskset to restrict the 24 jobs to 24 processors, they get shuffled within that set. Since memory is allocated from the bank associated with the processor that the job doing the malloc happens to be running on at that moment, and is never migrated afterwards, I end up with jobs whose memory accesses all go to a bank other than the one next to the core they are running on, which is slow.
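One quick way to see when this has happened (a sketch, assuming the processes are still named msieve and that the numactl tools, which provide numastat, are installed) is to compare the processor each job last ran on with the bank its pages live on:

ps -o pid,psr,comm -C msieve
numastat -p msieve

The psr column is the processor the process last ran on, and numastat -p gives a per-bank breakdown of its pages; a job whose psr is outside the 6n .. 6n+5 range of its dominant bank is doing remote accesses.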
My current best bet is:
taskset -c 0-2,6-8,12-14,18-20,24-26,30-32,36-38,42-44 mpirun -n 24 msieve ...
Allow the job to start (in particular, to allocate the enormous arrays it needs), then run
for u in $(pidof msieve | tr ' ' '\n' | sort -n); do grep -H "heap" /proc/$u/numa_maps; done
to determine which bank each job's memory has been allocated on, and then manually write a set of taskset commands to move each job onto a core associated with that bank. The one time I've tried it, precisely three jobs ended up allocated to each memory bank, though the per-job ordering of banks was 7 2 3 1 5 4 1 0 6 3 7 4 6 0 2 6 5 1 0 2 5 3 4 7.
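That last step can be scripted; here is a rough sketch (not what I actually ran) which assumes the heap shows up as a single line in numa_maps with page counts in N<bank>= fields, and that cores 6n .. 6n+5 belong to bank n:

for pid in $(pidof msieve | tr ' ' '\n' | sort -n); do
  # pick the bank holding the largest share of the heap pages
  node=$(grep "heap" /proc/$pid/numa_maps | tr ' ' '\n' | grep "^N" | sort -t= -k2 -rn | head -1 | cut -d= -f1 | tr -d N)
  echo "pid $pid: heap on bank $node, pinning to cores $((6*node))-$((6*node+5))"
  taskset -pc $((6*node))-$((6*node+5)) $pid
done

This pins each job to the whole six-core bank that owns its heap rather than to a single core, so the scheduler can still move it around within the bank but not off it.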
This seems to work reasonably well, but I feel there must be a less crazy way to do it! Any advice?
PS: it turns out that the right answer is to combine mpirun's binding options with numactl:
taskset -c 0-47:6,1-47:6,2-47:6 mpirun -n 24 --bind-to-core --report-bindings numactl -l ~/msieve-mpi/msieve/trunk/msieve -v -nc2 3,8
where the taskset clause restricts the jobs to a subset of processors (0-47:6 is taskset's stride syntax for every sixth CPU starting from 0, so the three ranges pick cores 6n, 6n+1 and 6n+2 of each bank)
the mpirun options bind each job to a single processor within that set, and print the bindings so you can check them
the numactl -l option forces each job to allocate its memory on the bank local to the processor it is bound to
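An equivalent way to get the same placement without doing the taskset arithmetic by hand is a small per-rank wrapper; this is only a sketch, assuming Open MPI (which exports OMPI_COMM_WORLD_RANK to each process), and the name bind_rank.sh is made up:

#!/bin/sh
# hypothetical bind_rank.sh: pin each MPI rank to one core and to that core's bank
rank=$OMPI_COMM_WORLD_RANK        # 0..23
node=$((rank / 3))                # three ranks per bank, banks 0..7
core=$((6 * node + rank % 3))     # cores 6n, 6n+1, 6n+2 of bank n
exec numactl --physcpubind=$core --membind=$node "$@"

mpirun -n 24 ./bind_rank.sh ~/msieve-mpi/msieve/trunk/msieve -v -nc2 3,8

--membind is a stricter relative of -l: allocations fail rather than spill to another bank, so placement mistakes show up immediately.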