On 4/29/2011 1:48 AM, Holger Parplies wrote:
>
> well, age does matter at *some* point, as does heat. Unless you proactively
> replace the disks before that point is reached, they will likely all be "old"
> when the first one fails. Sure, if the first disk fails after a few months,
> the others will likely be ok (though I've had a set of 15 identical disks
> where about 10 failed within the first 2 years).
I think of it much like light bulbs. All you know is that they don't
last forever. Manufacturing batches are probably the most critical
difference, and that's not something you can control. Anyway, the old
rule about data is that if something is important you should have at
least 3 copies, and you shouldn't let the person who destroyed the
first 2 touch the last one.
>>> [...] I think it brought up the *wrong* (i.e. faulty) disk of the mirror and
>>> failed on an fsck. [...]
>>
>> Grub doesn't know about raid and just happens to work with raid1 because it
>> treats the disk as a single drive.
>
> What's more, grub doesn't know about fsck.
>
> grub found and booted a kernel. The kernel then decided that its root FS on
> /dev/md0 consisted of the wrong mirror (or maybe its LVM PV on /dev/md1;
> probably both). grub and the BIOS have no part in that decision.
Sort of... Grub itself is loaded by the BIOS, which may (or may not)
automatically fail over to the alternate disk. Grub then loads the
kernel and initrd from the disk it was configured to use (which might
not be in the same position now). These can potentially be out of date
if one copy had been kicked out of the raid and you didn't notice, but
that probably wasn't the problem here. The kernel takes over at that
point, re-detects the drives, assembles the raids, and only then looks
at the file systems.
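You can confirm what the kernel actually assembled after boot with a
couple of read-only commands (device names are examples; adjust to your
system):

```shell
# Show which members made it into each array.
# "[UU]" means both mirrors are active; "[_U]" or "[U_]" means degraded.
cat /proc/mdstat

# More detail for one array, including which member is which:
mdadm --detail /dev/md0
```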
> I can see that the remaining drive may fail to boot (which it didn't), but I
> *can't* see why an array should be started in degraded mode on the *defective*
> mirror when both are present.
That's going to depend on what broke in the first place. If the system
went down cleanly and both drives work at startup, they should have been
assembled together. If it crashed, the raid assembly looks in one place
for the uuid and event counts, while the file system cleanliness check
happens later and looks in a different place, so the raid assembly
choice can't have anything to do with the correctness of the file
system on it. And just to make things more complicated, I've seen
cases where bad RAM caused very intermittent problems, including
differences between the mirror instances that lingered and re-appeared
randomly after the RAM was fixed.
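The per-member metadata the assembly code compares can be inspected
directly; for example (assuming /dev/sda1 and /dev/sdb1 are the two
mirror halves):

```shell
# Dump the md superblock of each member and compare the array UUID and
# event counter; the member with the higher count is more up to date.
mdadm --examine /dev/sda1 /dev/sdb1 | grep -E 'UUID|Events|Update Time'
```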
>>> I *have* seen RAID members dropped from an array without understandable
>>> reasons, but, mostly, re-adding them simply worked [...]
>>
>> I've seen that too. I think retries are much more aggressive on single
>> disks or the last one left in a raid than on the mirror.
>
> Yes, but a retry needs a read error first. Are retries on single disks always
> logged or only on failure?
I've seen this with single partitions out of several on the same disk,
so I don't think it is actually seen as a hardware-level error. Maybe
it is just a timeout while the disk does a soft recovery.
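One way to tell a soft recovery from real media trouble is to look at
the drive's SMART counters (assumes smartmontools is installed;
/dev/sda is an example device):

```shell
# Reallocated or pending sectors suggest real media problems;
# a clean report points more toward a timeout / soft recovery.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

# The drive's own error log, if it has recorded anything:
smartctl -l error /dev/sda
```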
> Or perhaps I should ask this: are retries uncommon enough to warrant failing
> array members, yet common enough that a disk that has produced one can still
> be trustworthy? How do you handle disks where you see that happen? Replace or
> retry?
Not sure there's a generic answer. I've replaced drives and not had it
happen again in some cases. In at least one case, it did keep happening
on the swap partition and eventually I stopped adding it back. Much,
much later the server failed in a way that looked like a failure of the
on-board SCSI controller.
>>> [...] there are no guarantees your specific software/kernel/driver/hardware
>>> combination will not trigger some unknown (or unfixed ;-) bug.
>>
>> I had a machine with a couple of 4-year uptime runs (a red hat 7.3) where
>> several of the scsi drives failed and were hot-swapped and re-synced with no
>> surprises. So unless something has broken in the software recently, I mostly
>> trust it.
>
> You mean, your RH 7.3 machine had all software/kernel/driver/hardware
> combinations that there are?
No, I mean that the bugs in the software raid1 layer have long been
ironed out, and I expect it to protect against other problems to a
greater extent than it contributes to them. The physical hard drive
itself remains the most likely failure point anyway. And you can
assume that most of the related software/drivers generally worked, or
you wouldn't have data on the drive to lose.
> Like I said, I've seen (and heard of) strange occurrences, yet, like you, I
> mostly trust the software, simply out of lack of choice. I *can't* verify its
> correct operation;
Yes you can - you can ask md to verify that the mirrors are identical
(and repair them if they aren't), and the underlying filesystem layout
is close enough to the raw partition contents that you can mount either
member partition individually instead of the mirror.
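On reasonably recent kernels that verification is driven through sysfs
(md0 is an example device name; this needs root):

```shell
# Ask md to read both mirrors and compare them (read-only check):
echo check > /sys/block/md0/md/sync_action

# Watch progress, then see how many sectors differed:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# To rewrite mismatched blocks so the mirrors agree again:
echo repair > /sys/block/md0/md/sync_action
```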
> Yet there remain these few strange occurrences, which may or may not be
> RAID-related. On average, every few thousand years, a CPU will randomly
> compute an incorrect result for some operation for whatever reason.
But you have that for every operation, and especially for things in the
kernel, no independent way to check them. That's why we keep multiple
independent copies and histories of files. Not to mention the (probably
as likely) chance that the building might burn down.
> It might as well be RAID weirdness in one case. Or the RAID weirdness may be
> the result of an obscure bug. Complex software *does* contain bugs, you know.
Yes, but RAID1 isn't all that complicated - basically "do the same thing
twice" on writes.
> Yes, I *did* mention that, I believe, but if your 2 TB resync doesn't complete
> before reboot/power failure, then you exactly *don't* have a rebuild initiated
> by an 'md --add'; after reboot, you have an auto-assembly (I also mentioned
> that). And, also agreed, I've also never ***seen*** it get this wrong when
> auto-assembling at reboot (well, except for once, but let's even ignore that).
That one is pretty straightforward, since a member would get marked as
up to date only after the sync onto it finishes. The software would
have to be fairly stupid to get that wrong.
> My point is that auto-assembly normally takes two (or more) mirrors that
> are either synchronized (normal shutdown) or at least nearly so (crash). What
> we are talking about here is adding a member that might be days, months, or
> even years out of date, with an arbitrary number of alternate members having
> been active in between. I don't know if the RAID implementation was designed
> with this usage pattern in mind. Is there a wrap-around for event counters?
> On what basis are they incremented? How does the software detect which member
> is more up-to-date after a crash?
I don't know how the 'more-up-to-date' counter is handled, but I don't
worry about it any more than any other kernel internal. I'd expect it
to be an integer of some reasonable size. It has always erred on the
side of caution as far as I can tell, normally not starting an auto-sync
if there is a mismatch. It doesn't matter in a hot-swap case unless you
accidentally reboot between swapping and adding the member with mdadm.
And if you are concerned about that possibility, you can just not set
the partition type as auto-detect.
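With old-style 0.90 superblocks, boot-time auto-detection is controlled
by the partition type: 0xfd means "Linux raid autodetect", plain 0x83
does not. A sketch with sfdisk (sda is an example disk; note that older
sfdisk versions spelled this option --change-id):

```shell
# Show the current partition types:
sfdisk -l /dev/sda

# Change partition 1 to plain Linux (0x83) so the kernel will NOT
# auto-assemble it at boot; set it back to 0xfd to re-enable.
sfdisk --part-type /dev/sda 1 83
```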
> I'm not saying it doesn't work. I'm asking how it works so I can draw my own
> conclusions. That is what "Open Source" means, right?
Well, open source means there is at least one way to find out. But I
usually don't bother unless something goes wrong.
--
Les Mikesell
lesmikesell AT gmail DOT com
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/