On 4/29/2011 1:48 AM, Holger Parplies wrote:
>
> well, age does matter at *some* point, as does heat. Unless you proactively
> replace the disks before that point is reached, they will likely all be "old"
> when the first one fails. Sure, if the first disk fails after a few months,
> the others will likely be ok (though I've had a set of 15 identical disks
> where about 10 failed within the first 2 years).
I think of it much like light bulbs. All you know is that they don't
last forever. Manufacturing batches are probably the most critical
difference, and that's not something you can control. Anyway, the old
rule about data is that if something is important you should have at
least 3 copies, and you shouldn't let the person who destroyed the
first 2 touch the last one.
>>> [...] I think it brought up the *wrong* (i.e. faulty) disk of the mirror and
>>> failed on an fsck. [...]
>>
>> Grub doesn't know about raid and just happens to work with raid1 because it
>> treats the disk as a single drive.
>
> What's more, grub doesn't know about fsck.
>
> grub found and booted a kernel. The kernel then decided that its root FS on
> /dev/md0 consisted of the wrong mirror (or maybe its LVM PV on /dev/md1;
> probably both). grub and the BIOS have no part in that decision.
Sort of... Grub itself is loaded by the BIOS, which may (or may not)
automatically fail over to the alternate disk. Grub then loads the
kernel and initrd from the disk it was configured to use (which might
not be in the same position now). These can potentially be out of date
if one copy had been kicked out of the raid and you didn't notice, but
that probably wasn't the problem here. The kernel takes over at that
point, re-detects the drives, assembles the raids, and only then looks
at the file systems.
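You can confirm what the kernel actually assembled after boot with a
couple of read-only commands (device names are examples; adjust to your
system):

```shell
# Show which members made it into each array.
# "[UU]" means both mirrors are active; "[_U]" or "[U_]" means degraded.
cat /proc/mdstat

# More detail for one array, including which member is which:
mdadm --detail /dev/md0
```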
> I can see that the remaining drive may fail to boot (which it didn't), but I
> *can't* see why an array should be started in degraded mode on the *defective*
> mirror when both are present.
That's going to depend on what broke in the first place. If the system
went down cleanly and both drives work at startup, they should have been
assembled together. If it crashed, the raid assembly looks in one place
for the uuid and event counts, while the file system cleanliness check
happens later and looks in a different place, so the raid assembly
choice can't have anything to do with the correctness of the file
system on it. And just to make things more complicated, I've seen
cases where bad RAM caused very intermittent problems, including
differences between the mirror instances that lingered and re-appeared
randomly after the RAM was fixed.
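The per-member metadata the assembly code compares can be inspected
directly; for example (assuming /dev/sda1 and /dev/sdb1 are the two
mirror halves):

```shell
# Dump the md superblock of each member and compare the array UUID and
# event counter; the member with the higher count is more up to date.
mdadm --examine /dev/sda1 /dev/sdb1 | grep -E 'UUID|Events|Update Time'
```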
>>> I *have* seen RAID members dropped from an array without understandable
>>> reasons, but, mostly, re-adding them simply worked [...]
>>
>> I've seen that too. I think retries are much more aggressive on single
>> disks or the last one left in a raid than on the mirror.
>
> Yes, but a retry needs a read error first. Are retries on single disks always
> logged or only on failure?
I've seen this with single partitions out of several on the same disk,
so I don't think it is actually seen as a hardware-level error. Maybe
it is just a timeout while the disk does a soft recovery.
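One way to tell a soft recovery from real media trouble is to look at
the drive's SMART counters (assumes smartmontools is installed;
/dev/sda is an example device):

```shell
# Reallocated or pending sectors suggest real media problems;
# a clean report points more toward a timeout / soft recovery.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

# The drive's own error log, if it has recorded anything:
smartctl -l error /dev/sda
```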
> Or perhaps I should ask this: are retries uncommon enough to warrant failing
> array members, yet common enough that a disk that has produced one can still
> be trustworthy? How do you handle disks where you see that happen? Replace or
> retry?
Not sure there's a generic answer. I've replaced drives and not had it
happen again in some cases. In at least one case, it did keep happening
on the swap partition and eventually I stopped adding it back. Much,
much later the server failed in a way that looked like a failure of the
on-board SCSI controller.
>>> [...] there are no guarantees your specific software/kernel/driver/hardware
>>> combination will not trigger some unknown (or unfixed ;-) bug.
>>
>> I had a machine with a couple of 4-year uptime runs (a red hat 7.3) where
>> several of the scsi drives failed and were hot-swapped and re-synced with no
>> surprises. So unless something has broken in the software recently, I mostly
>> trust it.
>
> You mean, your RH 7.3 machine had all software/kernel/driver/hardware
> combinations that there are?
No, I mean that the bugs in the software raid1 layer have long been
ironed out, and I expect it to protect against other problems to a
greater extent than it contributes to them. The physical hard drive
itself remains the most likely failure point anyway. And you can
assume that most of the related software/drivers generally worked, or
you wouldn't have data on the drive to lose.
> Like I said, I've seen (and heard of) strange occurrences, yet, like you, I
> mostly trust the software, simply out of lack of choice. I *can't* verify its
> correct operation;
Yes you can - you can ask md to verify that the mirrors are identical
(and repair them if they aren't), and the underlying filesystem layout
is close enough to the raw partition contents that you can mount either
member partition individually instead of the mirror.
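On reasonably recent kernels that verification is driven through sysfs
(md0 is an example device name; this needs root):

```shell
# Ask md to read both mirrors and compare them (read-only check):
echo check > /sys/block/md0/md/sync_action

# Watch progress, then see how many sectors differed:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# To rewrite mismatched blocks so the mirrors agree again:
echo repair > /sys/block/md0/md/sync_action
```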
> Yet there remain these few strange occurrences, which may or may not be
> RAID-related. On average, every few thousand years, a CPU will randomly
> compute an incorrect result for some operation for whatever reason.
But you have that for every operation, and especially for things in the
kernel, no independent way to check them. That's why we keep multiple
independent copies and histories of files. Not to mention the (probably
as likely) chance that the building might burn down.
> It might as well be RAID weirdness in one case. Or the RAID weirdness may be
> the result of an obscure bug. Complex software *does* contain bugs, you know.
Yes, but RAID1 isn't all that complicated - basically "do the same thing
twice" on writes.
> Yes, I *did* mention that, I believe, but if your 2 TB resync doesn't complete
> before reboot/power failure, then you exactly *don't* have a rebuild initiated
> by an 'md --add'; after reboot, you have an auto-assembly (I also mentioned
> that). And, also agreed, I've also never ***seen*** it get this wrong when
> auto-assembling at reboot (well, except for once, but let's even ignore that).
That one is pretty straightforward, since a member would get marked as
up to date only after the sync onto it finishes. The software would
have to be fairly stupid to get that wrong.
> My point is that auto-assembly normally takes two (or more) mirrors that
> are either synchronized (normal shutdown) or at least nearly so (crash). What
> we are talking about here is adding a member that might be days, months, or
> even years out of date, with an arbitrary number of alternate members having
> been active in between. I don't know if the RAID implementation was designed
> with this usage pattern in mind. Is there a wrap-around for event counters?
> On what basis are they incremented? How does the software detect which member
> is more up-to-date after a crash?
I don't know how the 'more-up-to-date' counter is handled, but I don't
worry about it any more than any other kernel internal. I'd expect it
to be an integer of some reasonable size. It has always erred on the
side of caution as far as I can tell, normally not starting an auto-sync
if there is a mismatch. It doesn't matter in a hot-swap case unless you
accidentally reboot between swapping and adding the member with mdadm.
And if you are concerned about that possibility, you can just not set
the partition type as auto-detect.
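With old-style 0.90 superblocks, boot-time auto-detection is controlled
by the partition type: 0xfd means "Linux raid autodetect", plain 0x83
does not. A sketch with sfdisk (sda is an example disk; note that older
sfdisk versions spelled this option --change-id):

```shell
# Show the current partition types:
sfdisk -l /dev/sda

# Change partition 1 to plain Linux (0x83) so the kernel will NOT
# auto-assemble it at boot; set it back to 0xfd to re-enable.
sfdisk --part-type /dev/sda 1 83
```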
> I'm not saying it doesn't work. I'm asking how it works so I can draw my own
> conclusions. That is what "Open Source" means, right?
Well, open source means there is at least one way to find out. But I
usually don't bother unless something goes wrong.
--
Les Mikesell
lesmikesell AT gmail DOT com
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/