fivemack: (Default)
Tom Womack ([personal profile] fivemack) wrote 2010-07-20 09:11 pm

sh: neither as fast as sloths nor as elegant as hagfish

I have a directory with 244 files with names like m12331246123468911531238951802368109467.mlog, which I want to rename to names like C038.123312.mlog

time for u in m*mlog; do B=$(echo $u | cut -dm -f2 | cut -d. -f1); echo $u C${#B}.$(echo $B | cut -c1-6).mlog; done

takes 17 seconds

time for u in m*mlog; do B=$(echo $u | cut -dm -f2 | cut -d. -f1); echo $u C${#B}.${B:0:6}.mlog; done

takes eight seconds

time for u in m*mlog; do B=${u:1}; B=${B%.mlog}; echo $u C${#B}.${B:0:6}.mlog; done

takes 0.2 seconds.

Of course, when I replace 'echo' with 'mv' it still takes fourteen seconds, but I am not that shocked that mv over NFS might be slow.
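For the record, the working rename looks something like the following (a sketch: the loop above would actually print C38 for a 38-digit name, so a `printf '%03d'` is assumed here to get the zero-padded C038 of the example target; the /tmp demo directory and sample file are purely for illustration):

```shell
# Demo setup: a hypothetical scratch directory with one sample file.
mkdir -p /tmp/mlog-demo && cd /tmp/mlog-demo
touch m12331246123468911531238951802368109467.mlog

for u in m*mlog; do
  [ -e "$u" ] || continue   # skip if the glob matched nothing
  B=${u:1}                  # strip the leading "m"
  B=${B%.mlog}              # strip the ".mlog" suffix
  # %03d zero-pads the digit-string length (38 -> 038);
  # printf -v writes into a variable without forking anything.
  printf -v new 'C%03d.%s.mlog' "${#B}" "${B:0:6}"
  mv -- "$u" "$new"         # the only forked process per file is mv itself
done
```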

Which suggests that doing $() to start a new shell is taking something like a hundredth of a second on a one-year-old PC. I didn't know that. On the other hand, if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons.

PS: if I stop running a CPU-intensive program on each of my eight cores, starting new processes gets about fifteen times faster. I can understand if it got twice as fast, but I really don't understand fifteen.

[identity profile] vicarage.livejournal.com 2010-07-20 08:39 pm (UTC)(link)
Shell spawning can kill you because of all the local and system .bashrc files it has to read and fight over. Years ago on a Cray T3E we had to write special functions for basename and dirname because the shell spawning of `basename` upset it so much. Never mind optimising the Fortran; work on the shell script.
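Those basename/dirname replacements can be plain parameter expansions, along these lines (a sketch; edge cases like trailing slashes or paths with no "/" behave slightly differently from the real utilities):

```shell
path=/some/dir/file.txt
base=${path##*/}   # like basename: strip everything up to and including the last "/"
dir=${path%/*}     # like dirname:  strip the last "/" and everything after it
echo "$base $dir"  # file.txt /some/dir
```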

[identity profile] dd-b.livejournal.com 2010-07-20 08:48 pm (UTC)(link)
The 17 second version has, I think, 5 shell invocations (might be 7, if I'm wrong about the first command in a $() not getting a separate shell).

The 8 second version has (similarly) 3 shell invocations.

The .2 second version has no shell invocations.

So that all looks about right for shell invocations being the issue, yes.

However, the first version (with somewhat different filenames obviously) executed on a more-than-5-year-old Linux box in 10 seconds -- for 1000 names, about 4 times as many as you used. The .2 second version took .07 seconds, again on 1000 files. I guess a factor of nearly 10 between two random old PCs is not out of bounds; the more important thing is the ratio between the tests being fairly consistent. (This was a decent server when new, which might about balance its being older.)

This may point at the NFS disk being the issue since my test was on local disk. I'm in the midst of completely hacking apart my little bit of NFS use so I guess I can't test that right now.

I dunno that the bashisms are less clear than using cut; in any case man bash or man cut will elucidate. It does mean the scripts become non-portable to systems without bash; I confess I've given up caring about those, myself.

Well, I understand 8 anyway. Your new bash has to take its place in the round-robin with the 8 cpu-intensive programs, right?

ckd: (cpu)

[personal profile] ckd 2010-07-20 09:38 pm (UTC)(link)
I'd probably do that with perl -ne and a quick bit of string or regex manipulation followed by a rename() call; above a certain level of complexity, I stop trying to do everything in sh/bash.

(If it's something I'll use six months from now, I'll pay the setup time price and do it in Python so I can read it six months from now. For a one-off one-liner, Perl is fine.)
pm215: (Default)

[personal profile] pm215 2010-07-20 09:39 pm (UTC)(link)
On the other hand, if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons.
I think that if your code is preceded by a comment which says 'for each filename of format m12331246123468911531238951802368109467.mlog, print "C038.123312" where the number before the dot is the length of the original digit string and the number after is its first six digits', then it's reasonably clear; if it isn't, then it's pretty unclear whichever variant you use. (In particular, with a comment it's immediately clear to the reader whether they actually need to check the implementation against the intention in order to achieve whatever goal they had in mind when they started reading the file, and there's a record of what your intention actually was!)
ext_8103: (Default)

[identity profile] ewx.livejournal.com 2010-07-21 07:06 pm (UTC)(link)

What I said on IRC: fork+exec is incredibly expensive compared to a bit of string handling.
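The gap is easy to see with a toy loop (a rough sketch in bash; absolute timings are machine-dependent, but the forking version is typically orders of magnitude slower):

```shell
u=m12331246123468911531238951802368109467.mlog
n=1000
# Each iteration here forks a subshell for $(), plus a cut process:
time for ((i=0; i<n; i++)); do B=$(echo "$u" | cut -c2-7); done
# Pure parameter expansion: no forks at all.
time for ((i=0; i<n; i++)); do B=${u:1:6}; done
echo "$B"   # 123312 either way
```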

if I start writing code this dense in unclear bashisms, my colleagues at work will disembowel me with spoons

That assumes they’re fluent with cut. Personally I never bothered to learn it because all the alternatives were quicker and easier.