Damaged Extents after upgrade to 8.1.19.000

uffeg · Sep 14, 2023

We upgraded from 8.1.18 to 8.1.19.000 to be able to run storage rule replication without them hanging. And not being able to cancel the processes, having to restart Spectrum to release the hanging jobs/sessions.
First storage rule repl job, we found 1 container with damaged extents. The job failed of course.
Another server has several damaged containers.

At the same time we have 2 server where it runs fine.
All server runnig prot/repl runs without issues.

Then we did setup a complete new replication target server, and started a new storage rule repl. Bam, damaged extents.
And this server has several customers, so the other two is still running protect/replicate to another target server.
It has 3 storage pools one per customer. No damaged extents in the pools running old prot/repl.
4 containers damaged in the pool that we run storage rule repl on.

We now have 3 running cases for 2 weeks. And it seems tough to solve.

Is this something anyone else has found ?
Or are we the only one running 8.1.19.000 and storage rule repl ?

/Ulf @Atea Sweden

Trident · Sep 15, 2023

Hi,

Can you share some details from the actlog (and other error messages) about these issues?

uffeg · Sep 15, 2023

Trident said:
Hi,

Can you share some details from the actlog (and other error messages) about these issues?

No messages during repl job.
It just simply fails replication everything.

ANR1652E Replication failed. Total number unresolved extents is 123,641.
Files replicated: 403,749 of 1,025,307. Files updated: 279,770 of 287,955. Files deleted: 189,277 of 189,277. Amount replicated: 1,556 GB of 1,970 GB. Amount transferred: 430 GB. Elapsed time: 0 Days, 0 Hours, 58 Minutes. (SESSION: 182859, PROCESS: 3001, JOB: 77)

I have another server that just simply stops replication in the middle of everything.
Also here it found damaged extents in the first repl job after upgrade to 8.1.19.000
That one I have to cancel the job, and of course that is not working either.
Job stays "terminating" even if that should be solved in 8.1.19.000
Having to reboot both source and target server.

Trident · Sep 15, 2023

uffeg said:
No messages during repl job.
It just simply fails replication everything.

ANR1652E Replication failed. Total number unresolved extents is 123,641.
Files replicated: 403,749 of 1,025,307. Files updated: 279,770 of 287,955. Files deleted: 189,277 of 189,277. Amount replicated: 1,556 GB of 1,970 GB. Amount transferred: 430 GB. Elapsed time: 0 Days, 0 Hours, 58 Minutes. (SESSION: 182859, PROCESS: 3001, JOB: 77)

I have another server that just simply stops replication in the middle of everything.
Also here it found damaged extents in the first repl job after upgrade to 8.1.19.000
That one I have to cancel the job, and of course that is not working either.
Job stays "terminating" even if that should be solved in 8.1.19.000
Having to reboot both source and target server.

Are there no entries with errors when looking in actlog searching for job id and/or proc num?
I guess you have found this: IT42584 (that should have been fixed) in your version.

uffeg · Sep 16, 2023

Trident said:
Are there no entries with errors when looking in actlog searching for job id and/or proc num?
I guess you have found this: IT42584 (that should have been fixed) in your version.

We have seen I/O error opening file. But at the same time I can copy that file with damaged extent to another folder in OS.

We started running storage rule repl in 4 servers. Got same issue in all 4 of them with damaged extents.

In 2 servers we have solved it by running a new FULL backup on the servers having damaged extents.
After that the repl jobs work.

But now we have it in a very big server where it affect Imagebackups only, can't do backup since the control info lays on the damaged extents. We are running a FULL VM backup an all VM's as I write this.
But I am afraid we will not be able to restore beyond today.

Anyone here who has done a repl node with repair on a DC vm node containing hundreds of VM's ?

uffeg · Sep 16, 2023

WE have been running FULL VM backups and now we tested incremental and no issue, and no damaged extent anymore. On monday we will try to restore beyond todays full backups and se what happens.
Will be fun to see what happens to the vm's we have 10 years retention sets on if they are of no use anymore or not.

I have a gut feeling 8.1.19.000 will be drawn back, or we will quite fast see 8.1.19.100.

lp1167 · Sep 19, 2023

can you run a audit?

uffeg · Sep 20, 2023

lp1167 said:
can you run a audit?

Yes, but we get this:
ANR4891I:Audit Container has encountered an I/O error for container /PHYFILE06_NFS/TSMDIR/TSMfile00/06/0f/00000000000f06e1.dcf in container storage pool TSMDIR while attempting to read a data extent.
At the same time we can copy that file. So it exists on disk. WE now have the same thing in 5 different server, and as soon as we start running STGRUL replicattion instead of Protect/Replicate node we get that same error in the source server. Never any issues in the target servers.
8.1.20.000 just arrived but doeasn't look like that would help.

Trident · Sep 20, 2023

Hi

Looks like you are using nfs. Are your setting correct (both server an client side)

IBM Spectrum Protect server support for NFS

IBM Spectrum Protect™ server support for Network File Systems (NFS) is described.

www.ibm.com

lp1167 · Sep 20, 2023

Verify the integrity of the disk where the container resides by running disk checks and looking for any hardware issues or file system corruption on the NFS device.

1.Check File Permissions:
Ensure that the appropriate file permissions and ownership are set for the container and the directory it resides in.

2.Check Disk Space:
Ensure that the disk where the container is located has enough free space to accommodate new data extents.

Check for OS or Hardware Issues:
Investigate if there are any known operating system or hardware issues that could be causing I/O errors. On the NFS directory - move a file to it.

scr1pt · Sep 25, 2023

We are also running 8.1.19 and stgrule for replication now, we were running previously 8.1.14.200 and did an upgrade to 8.1.19. After which most of the issues we encountered has gone away.

We have also had some damaged extents here and there, but that we have fixed with repair stgpool.

What OS are you running?

Have you also done a level 5 audit for the stgpools? See help 3.17.41.1

uffeg · Oct 2, 2023

lp1167 said:
Verify the integrity of the disk where the container resides by running disk checks and looking for any hardware issues or file system corruption on the NFS device.

1.Check File Permissions:
Ensure that the appropriate file permissions and ownership are set for the container and the directory it resides in.

2.Check Disk Space:
Ensure that the disk where the container is located has enough free space to accommodate new data extents.

Check for OS or Hardware Issues:
Investigate if there are any known operating system or hardware issues that could be causing I/O errors. On the NFS directory - move a file to it.

File permission OK. Space is more than enough. We can copy any container to another folder anywhere. We can also put other files in that filestructure without issues.

uffeg · Oct 2, 2023

scr1pt said:
We are also running 8.1.19 and stgrule for replication now, we were running previously 8.1.14.200 and did an upgrade to 8.1.19. After which most of the issues we encountered has gone away.

We have also had some damaged extents here and there, but that we have fixed with repair stgpool.

What OS are you running?

Have you also done a level 5 audit for the stgpools? See help 3.17.41.1

The strange thing is we have only seen the damaged extents in the server we run stgrule repl from.
Non of the others. Linux OS, 8.7.
We have not done a level 5 audit. We have 3 cases with IBM which are moving on really slow.
Not sure what you mean with:

See help 3.17.41.1

scr1pt · Oct 2, 2023

uffeg said:
The strange thing is we have only seen the damaged extents in the server we run stgrule repl from.
Non of the others. Linux OS, 8.7.
We have not done a level 5 audit. We have 3 cases with IBM which are moving on really slow.
Not sure what you mean with:

It's the help code to the "def stgrule" for level 5 audit.

uffeg · Oct 3, 2023

scr1pt said:
It's the help code to the "def stgrule" for level 5 audit.

Thanks I saw that later.....

I am wondering a bit. What does the audit do if it finds inconsistance right now then?

I might try it on a "not soooo much" important storagepool and customer.

scr1pt · Oct 3, 2023

uffeg said:
Thanks I saw that later.....

I am wondering a bit. What does the audit do if it finds inconsistance right now then?

I might try it on a "not soooo much" important storagepool and customer.

It should mark it as damaged, so you should be able to see it via "q damaged <stgpool> t=conta".

Then I would try to repair it:
repair stg <stgpool>

Scan damaged again:
audit cont <container from q damaged output> action=scandamaged (or scanall).

You could at this point also do "q damaged <stgpool> t=node" and "q damaged <stgpool> t=inv" to get an idea of which nodes/files are affected. If they are active files you could do a new selective/full on those objects to try and repair the extents that way as well.

If nothing works at this point then I would consider audit cont with action=removedamaged (if repair does not work).

Another way to find which containers are troublesome (if q damaged does not show anything) is whenever the stgrule fails/stops check the dsmffdc.log for which container it failed on. Then run a audit on that container with action=scanall.

This all assumes the issue you have is corruption within the containers, if the issue you are experiencing is due to latency or some configuration issue related to using NFS as @Trident pointed out then this might not be of any help.

I guess you are not hitting any ulimit value such as the open files limit? (check via "ulimit -a" as the instance user).
And nothing in /var/log/messages of interest?

FYI: Level 5 audit I have done two times, both times it took well over a week to run (depends on the number of containers you have).

uffeg · Oct 3, 2023

scr1pt said:
It should mark it as damaged, so you should be able to see it via "q damaged <stgpool> t=conta".

Then I would try to repair it:
repair stg <stgpool>

Scan damaged again:
audit cont <container from q damaged output> action=scandamaged (or scanall).

You could at this point also do "q damaged <stgpool> t=node" and "q damaged <stgpool> t=inv" to get an idea of which nodes/files are affected. If they are active files you could do a new selective/full on those objects to try and repair the extents that way as well.

If nothing works at this point then I would consider audit cont with action=removedamaged (if repair does not work).

Another way to find which containers are troublesome (if q damaged does not show anything) is whenever the stgrule fails/stops check the dsmffdc.log for which container it failed on. Then run a audit on that container with action=scanall.

This all assumes the issue you have is corruption within the containers, if the issue you are experiencing is due to latency or some configuration issue related to using NFS as @Trident pointed out then this might not be of any help.

I guess you are not hitting any ulimit value such as the open files limit? (check via "ulimit -a" as the instance user).
And nothing in /var/log/messages of interest?

FYI: Level 5 audit I have done two times, both times it took well over a week to run (depends on the number of containers you have).

Thanks for the tips. In a couple of servers we have managed to do as you said, run selective and full backups and eventually end up with just a few inactive files, then done a removedamage.

But we also have a server where we have repaired the damaged extents and the repl rule job won't finish anyway, it still has 2800 extents it can't replicate. and it says nada anywhere.

The NFS link we got was good, we had a config mistake in that specific server.
And that is something we are working hard with now.

uffeg · Oct 4, 2023

So did a test. Changed from Protect/Replicate to storage rule replication on one server still on 8.1.18.
Ran through and no damaged extents. WHY does 8.1.19 find damaged extents that no other release has found ?!?!? I still am 100% sure this is release problem. As soon as you enable stg rule repl, on 8.1.19 yore are screwed.

Trident · Oct 6, 2023

Hi,

Got the fun myself. Running a audit level 5 to identify bad extents.

uffeg · Oct 9, 2023

From IBM this morning:

With 8.1.19 we are expecting a new fix which addresses known issues with storage rules. We hope to have that available next week.
If you wish to use storage rules I would recommend you install this when we have it available from development.

So just a matter of waiting a week or 2 then upgrade to 8.1.19.100 or maybe this is a interimsfix like .007.

But I guess that won't fix the extents already marked as damaged, so an audit will still be needed.

I will write here when the fix is available and ready for download.

Damaged Extents after upgrade to 8.1.19.000

uffeg

Trident

TSM/Storge dude

uffeg

Trident

TSM/Storge dude

uffeg

uffeg

lp1167

uffeg

Trident

TSM/Storge dude

IBM Spectrum Protect server support for NFS

lp1167

scr1pt

uffeg

uffeg

scr1pt

uffeg

scr1pt

uffeg

uffeg

Trident

TSM/Storge dude

uffeg

Data Privacy Impact Assessment

Sponsor ADSM.ORG

Navigation Menu

NordVPN 3 Months FREE

Forum statistics