[quote]I have 600+ IBM z/OS batch jobs. The restart instructions for about 10% of them is simply, 'restart in abending step up to 5 times regardless of the abending step'. 'abending' usually means anything that is not a zero return code, not necessarily a true abend.[/quote]You need to take this to the business after some investigation by the application team. You can certainly investigate whether the job should really abend on a non-zero return code, or whether some other approach would be acceptable. For example, an e-mail detailing the non-zero return code and its reason could be sent to a group of business users.
CA11 is an excellent tool for restarts, but the way you've described its use at your shop sounds very awkward. There should be a reason behind the five-restart rule; perhaps those instructions were put in by some rookie.
[quote]I got another annoying job this week that needs this ‘generic’ restart because it crashes with contention (-911) once or twice a week. Operations calls the programmer (me); all I do is tell them to restart in the step it went down in. Operator documents it all in a incident ticket and then I have to respond and close the ticket. Unnecessary human intervention if we could figure out a way to exploit some system software to get around it.[/quote]SQLCODE -911 is a difficult animal to deal with, and you really have to roll up your sleeves to check and tune the application. SQLCODE -911 means that the current "UNIT OF WORK HAS BEEN ROLLED BACK DUE TO DEADLOCK OR TIMEOUT". The message also includes text of the form [ic]REASON reason-code, TYPE OF RESOURCE resource-type, AND RESOURCE NAME resource-name[/ic] in the SYSOUT of the failed job. Start your investigation with the resource names involved. Usually, proper scheduling solves this problem.
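If you can get the SYSOUT of the failed runs off the host as plain text, a small script can show which resources keep turning up in the -911 messages. Below is a minimal sketch in Python, not a finished tool. The function names and the sample message values are made up for illustration, and it assumes the message follows the template quoted above and that a resource name is not split across output lines.

```python
import re
from collections import Counter

# Pattern built from the message template quoted above.
# Assumption: the three fields contain no embedded blanks.
PATTERN = re.compile(
    r"REASON\s+(?P<reason>\S+?),\s*"
    r"TYPE\s+OF\s+RESOURCE\s+(?P<rtype>\S+?),\s*"
    r"AND\s+RESOURCE\s+NAME\s+(?P<name>\S+)"
)


def parse_contention(sysout_text):
    """Return (reason, resource_type, resource_name) for each message found."""
    # Collapse line breaks so a message wrapped between fields still matches.
    flat = " ".join(sysout_text.split())
    return [(m["reason"], m["rtype"], m["name"]) for m in PATTERN.finditer(flat)]


def tally_resources(sysout_texts):
    """Count how often each (resource_type, resource_name) pair appears."""
    counts = Counter()
    for text in sysout_texts:
        for _reason, rtype, name in parse_contention(text):
            counts[(rtype, name)] += 1
    return counts


# Hypothetical sample text; the values are illustrative only.
SAMPLE = (
    "SQLCODE = -911, ERROR: THE CURRENT UNIT OF WORK HAS BEEN ROLLED BACK\n"
    "DUE TO DEADLOCK OR TIMEOUT. REASON 00C90088, TYPE OF RESOURCE 00000302,\n"
    "AND RESOURCE NAME DBX.TSX.X'000002'"
)

if __name__ == "__main__":
    for key, count in tally_resources([SAMPLE, SAMPLE]).most_common():
        print(key, count)
```

Feeding it the SYSOUT from several failed runs gives a quick list of the resources most often involved, which is where I would start looking at the schedule.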