
[Networker] Problem with mmrecov after /nsr array failure

2007-10-17 21:48:18
Subject: [Networker] Problem with mmrecov after /nsr array failure
From: Stan Horwitz <stan AT TEMPLE DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 17 Oct 2007 21:36:10 -0400
This past Saturday, my NetWorker 7.4 server's /nsr storage array failed. The server runs Solaris 9 and the array is an old Sun A1000, one of two that were connected to the server. I placed a service call to have the problem fixed. On Saturday morning, I found myself in a computer room while a hardware fixer upper guy fixed the array ... or so we thought. To make a long story short, fsck ran for 24 hours and by Sunday night, I was up and running again with the fixed array, but NetWorker crashed within seconds of restarting, so I decided to hold off until Monday morning to address the problem so I could get some sleep.

So on Monday, rebooting the server didn't produce any SCSI errors at all and it came back up fine, so I did a mmrecov from Friday's bootstrap tape. I restarted NetWorker and all was fine. Later that evening, the same array died on me again. Sigh! On Tuesday, we did more array repairs, but nothing we tried worked. The broken A1000 disk array is one of two we had sitting on my backup server. The second one is /nsr2 (which contains some CFI data for a few large clients), but it was only 20% full and the /nsr array only contained 21GB worth of data. Since the /nsr2 array had something like 150GB free on it, my boss and I decided to create a directory called nsr on the /nsr2 array, and we disconnected the faulty /nsr array from the SCSI chain and powered it off. So /nsr now sits on the /nsr2 array and all the /nsr2 array's CFI data is still visible to NetWorker as /nsr2. I hope this makes sense.

This all works and I got no SCSI errors at all when I rebooted the server twice. Since this scheme wiped out the entire contents of /nsr, I used jbconfig to configure a tape library resource so I could read the bootstrap tape. Then I used mmrecov to recover the same bootstrap saveset from the same tape I used on Monday. This worked, except for one problem. When I did the mmrecov, instead of recovering to /nsr/res.R it recovered the data to /nsr/res, and when I restarted NetWorker, the tape library that's connected to our server appeared twice in the NetWorker management console window. Each instance of the tape library had two device resources for every physical device on the library (14 physical devices), except for the five devices that we use for NDMP, which only had one device resource each. This server also has a Linux storage node connected to a totally different library, and that library's resource information is fine. I spent two hours tonight trying to fix this issue, including doing another mmrecov, which also dumped its data into /nsr/res instead of /nsr/res.R.
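
For anyone trying to reproduce this, a quick way to confirm where mmrecov put the recovered resource database is to check which of the two directories exists. The following is only a rough sketch in Python, not anything NetWorker provides: the function name describe_res_dirs is made up for illustration, and the default root of /nsr is taken from the layout described above.

```python
import os


def describe_res_dirs(nsr_root="/nsr"):
    """Report which resource directories exist under nsr_root.

    Hypothetical helper for illustration only; the directory names
    "res" and "res.R" are the ones mentioned in this message.
    """
    report = {}
    for name in ("res", "res.R"):
        # True only if the path exists and is a directory.
        report[name] = os.path.isdir(os.path.join(nsr_root, name))
    # Note whether the root itself is a symbolic link (for example,
    # if /nsr were pointed at a directory on another array).
    report["nsr_is_symlink"] = os.path.islink(nsr_root)
    return report


if __name__ == "__main__":
    for key, value in describe_res_dirs().items():
        print(f"{key}: {value}")
```

In the situation described here, such a check would presumably show res present and res.R absent after the recovery.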

I tried deleting the second tape library resource, but this did not help. As a result, tape mount requests are not being satisfied for the main tape library, but they are for the tape library on my storage node. I don't know if it's relevant, but the tape library is a Sony PetaSite with 14 S-AIT1 drives and it's Fibre Channel connected to my NetWorker server. We do not do drive or tape library sharing. The inquire command also shows exactly the same thing it showed before we disconnected the broken A1000 array (except, of course, for the missing array).

If anyone has any idea how to correct this problem, please let me know; otherwise, I intend to open up a support case with EMC in the morning (since I am too exhausted to do it now).

--
Stan Horwitz
stan AT temple DOT edu

