[personal profile] fivemack
I have on my desk the replacement for the computer I bought at the start of the year which exploded three hours after purchase.

I've had it for not quite a week. It persistently crashes, apparently sometimes corrupting parts of the hard disc, when I give it certain fiddly disc-intensive calculations to do. Sometimes it crashes under other circumstances. Is Ubuntu 6.10 known to be this irredeemably unreliable on contemporary Intel hardware, or have I just received my second lemon in an order of two?

It's running memtest86 overnight, but no faults have shown up yet; I suspect this is a disc-controller rather than a memory issue, and no faults will show up. The last memory issue I had -- a module which had an unshakeable faith that every 1024th bit in the memory it presented had to be reported as zero whatever had been written to it -- showed up immediately in memtest86.

Is there a disc equivalent of memtest86? I'm prepared to take backups of everything - there's not much novel, I've only had the machine a week - and sacrifice the disc contents should the test need to be destructive, though I'd prefer something that sat in userspace and confined its merciless thrashing to the bits of the disc on which my data isn't.
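For illustration only, a userspace write-and-verify pass over free space might look something like the Python sketch below. It is not an existing tool; the names write_verify and expected_block are made up for the example. It assumes that filling a scratch file in a directory on the suspect filesystem is an acceptable stand-in for a memtest86-style pattern test. One caveat: unless the scratch file is larger than RAM, the read-back may be served from the page cache rather than from the disc.

```python
import hashlib
import os
import tempfile


def expected_block(seed, index, block_size):
    # Deterministic pseudo-random data, so the pattern can be regenerated
    # at verify time instead of being held in memory.
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(block_size)


def write_verify(directory, blocks=64, block_size=1 << 20, seed=0):
    """Fill a scratch file in `directory`, read it back, return bad block indices."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="disctest-")
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(blocks):
                f.write(expected_block(seed, i, block_size))
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device
        bad = []
        with open(path, "rb") as f:
            for i in range(blocks):
                if f.read(block_size) != expected_block(seed, i, block_size):
                    bad.append(i)
        return bad
    finally:
        os.remove(path)  # only the scratch file is ever touched
```

Calling write_verify on a directory of the suspect partition with a large blocks value returns an empty list if every block read back intact. Existing data is left alone because only the temporary file is written.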

Date: 2007-03-02 11:57 pm (UTC)
From: [identity profile] brrm.livejournal.com
I've used HDTune at work before, without destroying anything. YMMV, of course.

Date: 2007-03-03 12:04 am (UTC)
From: [identity profile] fivemack.livejournal.com
HDTune appears to be Windows-only, which is little use on my Linux box.

Date: 2007-03-03 12:31 am (UTC)
From: (Anonymous)
You might be able to use BartPE or similar to run that.
On Debian, I've used hdparm, hddtemp, and smartmontools to poke at disks and see if they're complaining about problems.
There appears to be a tool with the somewhat obvious name of 'testdisk', which may also do what you want.

Baldy

Date: 2007-03-03 12:33 am (UTC)
From: (Anonymous)
Doh, forgot the daddy of them all - badblocks.

Baldy

Date: 2007-03-03 08:26 am (UTC)
From: [identity profile] arnhem.livejournal.com
badblocks is probably all you need to show up disk problems. The default options (read-only) won't show up all failure modes, but there's a vaguely useful man page ...
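As a rough picture of what a read-only scan amounts to, here is a minimal Python sketch. It is not how badblocks itself is implemented, and read_scan is a name invented for the example. It reads a file or device block by block and records which blocks raise an I/O error. Like any read-only pass, it cannot catch data that reads without error but is wrong, which is one of the failure modes only a write test would show.

```python
import os


def read_scan(path, block_size=1 << 20):
    """Read `path` block by block; return indices of blocks that raise I/O errors."""
    bad = []
    with open(path, "rb", buffering=0) as f:
        # Seeking to the end gives the size of a regular file, and on Linux
        # of a block device as well.
        size = f.seek(0, os.SEEK_END)
        for index, offset in enumerate(range(0, size, block_size)):
            try:
                f.seek(offset)
                f.read(block_size)
            except OSError:
                bad.append(index)  # unreadable block; carry on with the next
    return bad
```

Pointing this at a whole device needs root, and recently read blocks may come back from the page cache rather than the platters.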

Date: 2007-03-03 12:47 pm (UTC)
From: [identity profile] womble2.livejournal.com
Modern disks do bad block management all by themselves. The SMART statistics will tell you what they've been doing though.

Date: 2007-03-03 12:50 pm (UTC)
From: [identity profile] womble2.livejournal.com
Is the power supply adequate? Are all the fans working? Are you using hardware RAID? Is the OS a fresh installation done by this machine or is it from another machine that could have corrupted it?

Date: 2007-03-05 12:03 am (UTC)
From: [identity profile] fivemack.livejournal.com
I don't know anything about the PSU; I just bought this box as a box from World of Computers. As fans go there are no hideous rattling noises of tormented bearings to be heard, but I haven't tried opening the box and poking things with my thumb; to some extent I bought a box rather than bits so that I didn't have to do that.

I'm using software RAID, a RAID1 made from partitions on two ATA discs and itself formatted as ext3; the OS is a fresh install of Ubuntu 6.10_x86-64.

I've managed to get some information about the failure mode ... under moderate disc load (four parallel cp operations of a few gigabytes from one part of the RAID1 partition to another) I got a syslog message: segfault at

ext3_ordered_writepage+243
called from find_busiest_group+404
called from do_writepages+41
called from del_timer_sync+12
called from find_busiest_group+404

Repeating the actions that caused that failure did not immediately produce a second one. Repeating them again while running nine copies of cat /dev/zero > /dev/null & in the background caused the cursor corruption and a hang, but no syslog entry. Repeating once more caused a hang without cursor corruption. Repeating from text consoles seems to work (four copies of the torture test in parallel run to completion), but that is obviously not an acceptable workaround.
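A repeatable version of that kind of torture test could be sketched in Python roughly as follows. This is an illustration, not the commands actually used above: parallel_copy_test and its helpers are invented names, and the copies are checked against a SHA-256 of the source so that silent corruption shows up as well as outright hangs. It will not by itself explain a kernel crash.

```python
import hashlib
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def sha256_of(path, chunk=1 << 20):
    # Hash the file in chunks so multi-gigabyte files do not need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(chunk), b""):
            h.update(piece)
    return h.hexdigest()


def copy_and_check(source, dest):
    shutil.copyfile(source, dest)
    return dest, sha256_of(dest)


def parallel_copy_test(source, dest_dir, copies=4):
    """Copy `source` into `dest_dir` `copies` times at once; return mismatching paths."""
    want = sha256_of(source)
    targets = [os.path.join(dest_dir, f"copy-{i}") for i in range(copies)]
    with ThreadPoolExecutor(max_workers=copies) as pool:
        results = list(pool.map(copy_and_check, [source] * copies, targets))
    return [path for path, digest in results if digest != want]
```

An empty result means every copy matched the source. As with any such check, a source file smaller than RAM may be verified out of the page cache rather than off the disc.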

I think this, particularly the way it seems to show up more in X, begins to look more like a kernel-interacting-with-chipset bug than like a hardware problem; that various people in the thread under http://www.aceshardware.com/forums/read_post.jsp?id=120076902&forumid=1 report instability may also be such a sign.

So, the Edgy kernel is dodgy. Where do I go from here?

Date: 2007-03-05 02:16 am (UTC)
From: [identity profile] womble2.livejournal.com
Sorry, no idea.
