fivemack: (Default)
[personal profile] fivemack
I have a directory with 244 files with names like m12331246123468911531238951802368109467.mlog, which I want to rename to names like C038.123312.mlog

time for u in m*mlog; do B=$(echo $u | cut -dm -f2 | cut -d. -f1); echo $u C${#B}.$(echo $B | cut -c1-6).mlog; done

takes 17 seconds

time for u in m*mlog; do B=$(echo $u | cut -dm -f2 | cut -d. -f1); echo $u C${#B}.${B:0:6}.mlog; done

takes eight seconds

time for u in m*mlog; do B=${u:1}; B=${B%.mlog}; echo $u C${#B}.${B:0:6}.mlog; done

takes 0.2 seconds.

Of course, when I replace 'echo' with 'mv' it still takes fourteen seconds, but I am not that shocked that mv over NFS might be slow.
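For the record, here is the whole rename as one sketch, using only parameter expansions. One caveat: `${#B}` on a 38-digit string expands to `38`, not `038`, so a `printf '%03d'` is added here (my assumption about the intended `C038` form) to get the zero-padded name:

```shell
#!/usr/bin/env bash
# Sketch: rename m<digits>.mlog to C<len>.<first6>.mlog with no subshells.
for u in m*.mlog; do
  B=${u#m}                     # drop the leading "m"
  B=${B%.mlog}                 # drop the ".mlog" suffix
  printf -v new 'C%03d.%s.mlog' "${#B}" "${B:0:6}"  # zero-pad the length
  mv -- "$u" "$new"
done
```

`printf -v` writes into a variable without forking, so the loop still spawns no subshells at all.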

Which suggests that doing $() to start a new shell is taking something like a hundredth of a second on a one-year-old PC. I didn't know that. On the other hand, if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons.
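The per-substitution cost is easy to see directly with a rough micro-benchmark (timings are obviously machine- and load-dependent):

```shell
# Each $(...) forks a subshell, so the first loop pays for ~1000 fork()s;
# the second does equivalent work with expansion alone, no forks.
time for i in {1..1000}; do x=$(echo "$i"); done   # one fork per iteration
time for i in {1..1000}; do x=$i; done             # no forks at all
```

Dividing the first loop's wall-clock time by 1000 gives roughly the per-fork cost being discussed.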

PS: if I stop running a CPU-intensive program on each of my eight cores, starting new processes gets about fifteen times faster. I can understand if it got twice as fast, but I really don't understand fifteen.

Date: 2010-07-20 08:39 pm (UTC)
From: [identity profile] vicarage.livejournal.com
Shell spawning can kill you because of all the local and system .bashrc files it has to read and fight over. Years ago on a Cray T3E we had to write special functions for basename and dirname because the shell spawning of `basename` upset it so much. Never mind optimising the Fortran; work on the shell script.

Date: 2010-07-21 08:50 am (UTC)
From: [identity profile] tau-iota-mu-c.livejournal.com
bash does not read .bashrc etc when just fork()ing and running something else via exec(). Nor would it reread all that even if it was fork()ing and then doing more bash shelly stuff, because the fork preserves the memory in the subprocess, so it's already seen .bashrc.
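This is easy to check: a command substitution is a fork of the current shell, so it sees even unexported shell state without touching any rc file. A minimal sketch:

```shell
# The subshell created by $() is a fork: it inherits the parent's memory,
# so this deliberately unexported variable is visible inside it with no
# rc-file reread.
unexported=hello                      # note: not exported
child_view=$(echo "$unexported")
echo "$child_view"                    # prints "hello"
```

If the subshell were a fresh shell started from scratch, an unexported variable could not survive the trip.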

tcsh would be well and truly stupid enough to reparse all of .cshrc etc, but I don't feel like testing it (why oh why aren't astronomers brave enough to move on from 30-year-old evil history?). It definitely does parse all of that crap when you have a #!/bin/csh script - fortunately bash doesn't do that unless you also supply -i.

No, the speedup from removing those subshells is probably because most OSes have traditionally been very slow at fork() (and that goes for non-shell programs too). Solaris is called Slowaris for a reason :)

Linux has always had lower overheads at fork. The other OSes still did copy-on-write and everything, but just did it... badly.

The slowness of fork in this case when the CPUs are busy is surprising - possibly just a scheduler issue - the forking process is held too long on the wait queue and is starved of the resources needed to fork?

Date: 2010-07-21 10:28 am (UTC)
From: [identity profile] pjc50.livejournal.com
I've verified that you're right, it doesn't read anything nonessential other than a stat of my home directory. It's probably a scheduler issue.

Date: 2010-07-20 08:48 pm (UTC)
From: [identity profile] dd-b.livejournal.com
The 17 second version has, I think, 5 shell invocations (might be 7, if I'm wrong about the first command in a $() not getting a separate shell).

The 8 second version has (similarly) 3 shell invocations.

The .2 second version has no shell invocations.

So that all looks about right for shell invocations being the issue, yes.

However, the first version (with somewhat different filenames obviously) executed on a more-than-5-year-old Linux box in 10 seconds -- for 1000 names, about 4 times as many as you used. The .2 second version took .07 seconds, again on 1000 files. I guess a factor of nearly 10 between two random old PCs is not out of bounds; the more important thing is the ratio between the tests being fairly consistent. (This was a decent server when new, which might about balance its being older.)

This may point at the NFS disk being the issue since my test was on local disk. I'm in the midst of completely hacking apart my little bit of NFS use so I guess I can't test that right now.

I dunno that the bashisms are less clear than using cut; in any case man bash or man cut will elucidate. It does mean the scripts become non-portable to systems without bash; I confess I've given up caring about those, myself.

Well, I understand 8 anyway. Your new bash has to take its place in the round-robin with the 8 cpu-intensive programs, right?

Date: 2010-07-20 09:06 pm (UTC)
From: [identity profile] fivemack.livejournal.com
The machine has eight CPUs, so I would have assumed my new bash has to take its place in one of the eight per-CPU round-robins, which would make it run half as fast as it would with no heavy-CPU jobs running.

Date: 2010-07-20 09:23 pm (UTC)
From: [identity profile] dd-b.livejournal.com
Hmmm; the Linux 2.6 scheduler does seem to have per-CPU run queues. That surprises me, since generally single-queue multi-server is much fairer and more efficient at allocating fungible resources. However, possibly the benefits of CPU affinity (mostly hot cache contents) trump that. And it's preemptive. So your task should in fact hit a cpu immediately (bumping the long-running task, whose priority will have been gradually raised (higher number, lower precedence)).

So yeah, there's something to explain there.

You're not short of memory for what's running, are you?

Date: 2010-07-21 10:25 am (UTC)
From: [identity profile] pjc50.livejournal.com
While it is preemptive, I don't think it will evict a running task from a cpu in favour of a new one. Moreover the shell script is going to repeatedly lose its scheduler slot due to waiting for a response from the NFS server.

Date: 2010-07-20 09:38 pm (UTC)
ckd: (cpu)
From: [personal profile] ckd
I'd probably do that with perl -ne and a quick bit of string or regex manipulation followed by a rename() call; above a certain level of complexity, I stop trying to do everything in sh/bash.

(If it's something I'll use six months from now, I'll pay the setup time price and do it in Python so I can read it six months from now. For a one-off one-liner, Perl is fine.)

Date: 2010-07-21 10:25 am (UTC)
From: [identity profile] pjc50.livejournal.com
Agreed. Perl is also much less likely than bash to misbehave if you give it a file with a space or newline in the name.

Date: 2010-07-20 09:39 pm (UTC)
pm215: (Default)
From: [personal profile] pm215
On the other hand, if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons.
I think that if your code is preceded by a comment which says 'for each filename of format m12331246123468911531238951802368109467.mlog, print "C038.123312" where the number before the dot is the length of the original digit string and the number after is its first six digits', then it's reasonably clear; if it doesn't then it's pretty unclear whichever variant you use. (In particular with a comment it's immediately clear to the reader whether they actually need to check the implementation against the intention in order to achieve whatever goal they had in mind when they started reading the file, and there's a record of what your intention actually was!)

Date: 2010-07-21 07:17 am (UTC)
emperor: (Default)
From: [personal profile] emperor
+1

Date: 2010-07-21 07:06 pm (UTC)
ext_8103: (Default)
From: [identity profile] ewx.livejournal.com

What I said on IRC: fork+exec is incredibly expensive compared to a bit of string handling.

if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons

That assumes they’re fluent with cut. Personally I never bothered to learn it because all the alternatives were quicker and easier.
