Automatically restarting a batch job w/o human intervention
ironponygrl

Post by ironponygrl »

I have 600+ IBM z/OS batch jobs. The restart instructions for about 10% of them are simply 'restart in the abending step up to 5 times, regardless of which step abended'. Here 'abending' usually means any non-zero return code, not necessarily a true abend.

I got another annoying job this week that needs this ‘generic’ restart because it fails with contention (SQLCODE -911) once or twice a week. Operations calls the programmer (me), and all I do is tell them to restart in the step that failed. The operator documents it all in an incident ticket, and then I have to respond and close the ticket. That is unnecessary human intervention if we could find a way to have system software handle it.

I'd rather not have to modify all the JCL or set up anything that is 'per job' if I did not have to.

The jobs are scheduled through CA7 and restarted via CA11. Is there a way to tell one of these two products that I want to define some generic 'property' to do my bidding that can be assigned to any number of jobs without being a custom version for each job? I'd like to get the humans out of the middle of it unless the job has already been restarted 4 times.
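
To make the rule I'm after concrete, here is a minimal Python sketch of the policy only. It is not CA7 or CA11 syntax, and every name in it (`run_with_restarts`, `run_step`, `escalate`, `max_restarts`) is made up for illustration. It assumes a step is "abending" whenever its return code is non-zero, and that a human is only involved once the restart budget is used up.

```python
from typing import Callable, List, Tuple


def run_with_restarts(
    steps: List[str],
    run_step: Callable[[str], int],
    max_restarts: int = 5,
    escalate: Callable[[str], None] = print,
) -> Tuple[bool, int]:
    """Run steps in order, restarting in the failing step up to max_restarts times.

    run_step is a hypothetical callback that executes one step and returns its
    return code. Returns (completed, restarts_used).
    """
    restarts = 0
    index = 0
    while index < len(steps):
        rc = run_step(steps[index])
        if rc == 0:
            index += 1  # step succeeded, move on to the next one
            continue
        if restarts >= max_restarts:
            # Restart budget exhausted: this is where a human gets called.
            escalate(f"{steps[index]} failed with RC={rc} after {restarts} restarts")
            return False, restarts
        restarts += 1  # rerun the same step; earlier steps are not repeated
    return True, restarts
```

The restart count is per job run, not per step, which matches 'regardless of the abending step'. Setting `max_restarts=4` would give the "call a human once it has already been restarted 4 times" behaviour.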

I had another population of jobs whose instructions were just 'force complete', so I set up a generic ARFSET for that action, but I'd like to use workload automation to handle the auto restarts too. I don't know whether ARFSET is the right feature for what I want to do.

Any suggestions from anyone?
nicc
Global Moderator
Posts: 691
Joined: Wed Apr 23, 2014 8:45 pm

Re: Automatically restarting a batch job w/o human intervent

Post by nicc »

Any suggestions from anyone?
Yes: do not post in multiple forums at the same time. We would rather not read the same topic in each of the places where we hang out.
(And I am not reposting my response from the first forum where I read this.)
Regards
Nic
Anuj Dhawan
Founder
Posts: 2802
Joined: Sun Apr 21, 2013 7:40 pm
Location: Mumbai, India

Re: Automatically restarting a batch job w/o human intervent

Post by Anuj Dhawan »

I have 600+ IBM z/OS batch jobs. The restart instructions for about 10% of them are simply, 'restart in abending step up to 5 times regardless of the abending step'. 'abending' usually means anything that is not a zero return code, not necessarily a true abend.
This is something you need to take to the business after some investigation by the application team. You can certainly investigate whether the job should really abend on a non-zero return code, or whether some other approach would be acceptable. For example, a group of business users could be sent an e-mail detailing the non-zero return code and its reason.

CA11 is an excellent tool for restart, but the way you've described its use at your shop sounds very awkward. There should be a reason behind the 5-restart rule; perhaps those instructions were put in by some rookie.
I got another annoying job this week that needs this ‘generic’ restart because it crashes with contention (-911) once or twice a week. Operations calls the programmer (me); all I do is tell them to restart in the step it went down in. Operator documents it all in an incident ticket and then I have to respond and close the ticket. Unnecessary human intervention if we could figure out a way to exploit some system software to get around it.
SQLCODE -911 is a difficult animal to deal with, and you really have to roll up your sleeves to check and tune the application. SQLCODE -911 means "THE CURRENT UNIT OF WORK HAS BEEN ROLLED BACK DUE TO DEADLOCK OR TIMEOUT". The message also carries text of the form REASON reason-code, TYPE OF RESOURCE resource-type, AND RESOURCE NAME resource-name, which you will find in the SYSOUT of the failed job. Start the investigation with the resource names involved. Usually, proper scheduling solves this problem.
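
If you have many failed runs to go through, a small script can pull the three fields out of saved SYSOUT text so you can see which resources keep turning up. This is only a rough Python sketch: `parse_contention` and `Contention` are names made up here, it assumes the SYSOUT has been saved as plain text, and it assumes the REASON / TYPE OF RESOURCE / RESOURCE NAME wording appears as quoted above (line wrapping is handled by collapsing whitespace, but a value split mid-word across lines would not be matched).

```python
import re
from typing import NamedTuple, Optional


class Contention(NamedTuple):
    reason_code: str
    resource_type: str
    resource_name: str


# Matches the "REASON ..., TYPE OF RESOURCE ..., AND RESOURCE NAME ..." text.
_PATTERN = re.compile(
    r"REASON\s+(\S+?),\s*TYPE OF RESOURCE\s+(\S+?),\s*AND RESOURCE NAME\s+(\S+)"
)


def parse_contention(sysout_text: str) -> Optional[Contention]:
    """Return the first reason/type/name triple found, or None if absent."""
    # Collapse line breaks and runs of blanks so a wrapped message still matches.
    flat = " ".join(sysout_text.split())
    match = _PATTERN.search(flat)
    if match is None:
        return None
    reason, res_type, res_name = match.groups()
    return Contention(reason, res_type, res_name.rstrip("."))
```

Counting the `resource_name` values across a few weeks of failures should show which objects the contending jobs have in common, and that is where the scheduling or tuning discussion can start.
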
Thanks,
Anuj

Disclaimer: My comments on this website are my own and do not represent the opinions or suggestions of any other person or business entity, in any way.