Most people will want the newer 'Chaining Individual Frames' technique, which lets jobs depend on one another at the frame level using the new 'dependon' submit command.
This allows you to create a chain of dependencies, such that some jobs
render frames in parallel, while other jobs wait for individual frames
to finish.
You can do this by making a script that invokes several job submissions;
each time a job is submitted, the jobid(s) are saved, and used in the
'dependon' command in the NEXT submit script. This creates a frame
dependency chain between one job and others. Many jobs can be chained
in this manner.
Jobs can be submitted such that one job waits for the other to
dump. To do this, use the submit script 'waitfor' command to wait
for other jobids.
You can do this by making a script that invokes several job submissions;
each time a job is submitted, the jobid is saved, and used in the 'waitfor'
command in the NEXT submit script. This chains one job to the other. Many
can be chained together in this manner.
It is advised that you configure the first job (the one being waited for)
to contain either an 'autodump done' or an 'autodump donefail' command,
so the job dumps automatically. This is needed to trigger the next job to
start; the next job won't start unless the first job dumps completely.
A simple example shows how the csh eval command can be used to gather up
the jobid of the first job, so that it can be passed to the second job's
waitfor command.
I. Chaining Individual Frames
Jobs can be submitted such that one job's frames wait for the frames of
other jobs, using the submit script command DependOn.
Example
Here's a typical fg/bg/comp example: a submit script that starts three jobs.
Two renders (fg/bg) run in parallel, and a third (comp) waits, starting comps
as frames in the fg/bg jobs complete. Note how the csh eval command
is used to gather up jobids for the comp job's dependon command.
#!/bin/csh -f
### SUBMIT SCRIPT -- Create frame dependencies between jobs
# Job #1: FOREGROUND ELEMENT
eval `rush -submit` << EOF
title MYSHOW/FG
ram 250
frames 1-10
command $cwd/render-fg
cpus +any=10@100
logdir $cwd/logs-fg
EOF
if ( $status ) exit 1
set fgjobid = $RUSH_JOBID
echo " FG: setenv RUSH_JOBID $RUSH_JOBID"
# Job #2 -- BACKGROUND ELEMENT
# This job can run in parallel with the foreground,
# so no dependency is defined.
#
eval `rush -submit` << EOF
title MYSHOW/BG
ram 250
frames 1-10
command $cwd/render-bg
cpus +any=10@100
logdir $cwd/logs-bg
EOF
if ( $status ) exit 1
set bgjobid = $RUSH_JOBID
echo " BG: setenv RUSH_JOBID $RUSH_JOBID"
# Job #3 -- COMP
# This job waits for individual frames in FG and BG jobs
# to complete successfully before comping frames.
#
eval `rush -submit` << EOF
title MYSHOW/COMP
ram 250
frames 1-10
command $cwd/render-comp
cpus +any=10@100
logdir $cwd/logs-comp
dependon $fgjobid $bgjobid
EOF
if ( $status ) exit 1
echo "COMP: setenv RUSH_JOBID $RUSH_JOBID"
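To make the frame-level dependency concrete, here is a minimal Python sketch of the rule the comp job follows in the example above: a comp frame may start only once the same frame number is done in every job it depends on. The function name and the dictionary of finished frames are hypothetical illustrations and not part of rush; the one-to-one frame matching is an assumption based on the example.

```python
def ready_frames(wanted, done_by_job):
    """Return the frames in `wanted` that are done in every depended-on job.

    wanted      -- iterable of frame numbers in the waiting job
    done_by_job -- dict mapping a (hypothetical) job name to the set of
                   frame numbers that job has finished
    """
    ready = []
    for frame in wanted:
        if all(frame in done for done in done_by_job.values()):
            ready.append(frame)
    return ready


# Hypothetical snapshot: FG has finished frames 1-4, BG frames 1, 2 and 4.
done = {"fg": {1, 2, 3, 4}, "bg": {1, 2, 4}}
print(ready_frames(range(1, 11), done))
```

Frame 3 is held back here because the BG job has not finished it yet, even though the FG job has.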
II. Chaining Job Completion
If you are looking to chain jobs at the frame level, you probably
want to see the Chaining Individual Frames section above.
However, if you want one job to wait for other jobs to COMPLETELY
finish before moving on to the next, read on.
#!/bin/csh -f
### SUBMIT SCRIPT -- Chaining Multiple Jobs
# Job #1
eval `rush -submit` << EOF
title MYSHOW/MYRENDER
ram 250
frames 1-10
command $cwd/render-script
cpus +any=10@100
cpus vaio=8@100
autodump donefail
logdir $cwd/logs-1
EOF
if ( $status ) exit 1
# (eval eats the setenv command, so we duplicate it here)
echo "setenv RUSH_JOBID $RUSH_JOBID"
# Job #2 -- this job will wait for the above job to finish
rush -submit << EOF
title MYSHOW/MYCOMP
ram 250
frames 1-10
command $cwd/comp-script
cpus +any=10@100
cpus vaio=8@100
logdir $cwd/logs-2
waitfor $RUSH_JOBID
EOF
The benefit is mainly from avoiding the per-frame overhead involved in loading the entire scene (texture maps, animation files, etc) each frame.
One technique is to tell the render queue to render on 'tens' (ie. 1-500,10) and have the render script fire off ten frames at a time, using $RUSH_FRAME as the start frame, and ($RUSH_FRAME + 9) as the end frame.
This involves two things: setting a step rate for the frame range in the submit script, and passing this step rate to the render script so it knows how many frames to batch.
Batching Multiple Frames: Submit Script |
#!/bin/csh -f
source $RUSH_DIR/etc/.submit

# NUMBER OF FRAMES TO BATCH
#    Change this value as needed
@ batch = 10

rush -submit << EOF
title   BATCH_RENDER
ram     10
frames  1-100,$batch
command $cwd/render-batch $batch
logdir  $cwd/logs
cpus    va@100 how@100
EOF
exit $status
|
Batching Multiple Frames: Render Script |
#!/bin/csh -f
source $RUSH_DIR/etc/.render

# START/END FRAME FOR BATCHING
@ sfrm = $RUSH_FRAME
@ efrm = ( $sfrm + $argv[1] - 1 )

echo "--- Working on frames $sfrm-$efrm - `date`"
myrender $sfrm,$efrm,1
if ( $status ) exit 1
exit 0
|
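The batching arithmetic can be checked with a short Python sketch. It mirrors the render script's calculation (end frame = start frame + batch - 1) for a stepped frame range. The function name is hypothetical, and clamping the final batch to the last frame is an extra safeguard that the scripts above do not perform.

```python
def batch_ranges(first, last, batch):
    """Yield (start, end) pairs for frames first..last stepped by `batch`.

    Mirrors the render script's arithmetic: end = start + batch - 1.
    The final batch is clamped to `last`, which is an added safeguard
    and not something the csh scripts do.
    """
    for start in range(first, last + 1, batch):
        yield start, min(start + batch - 1, last)


# A range that does not divide evenly, to show the clamped last batch.
for sfrm, efrm in batch_ranges(1, 25, 10):
    print(f"frames {sfrm}-{efrm}")
```

Without the clamp, a range that is not a multiple of the batch size would ask the renderer for frames past the end of the shot.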
Whenever your render script returns an exit code of 2 (REQUEUE), the frame is requeued, the 'Try' count is incremented (shown in 'rush -lf'), and the frame is executed again.
Rush passes the try count to the render script as the environment variable $RUSH_TRY, which the script can use to act conditionally.
Simple Retry Counting: Render Script |
#!/bin/csh -f
source $RUSH_DIR/etc/.render
echo "--- Working on frame $RUSH_FRAME - `date`"
render /job/MYJOB/MYSHOT/ribs/fg.$RUSH_PADFRAME.rib
if ( $status == 0 ) exit 0      # it worked

# FAILED? RETRY 3 TIMES
if ( $RUSH_TRY < 3 ) exit 2     # retry up to 3 times
exit 1                          # otherwise fail
|
In some cases, using just the rush try counter can be problematic if, say, there are killer jobs on your network that regularly bump frames, causing try counts to clock up unexpectedly.
In such cases, you might want to make your own try counter that counts completed attempts on the rendered frame, ignoring frames bumped by other 'killer' jobs. To do this, you could make your render script use this approach:
Retry Counts Around Killer Jobs: Render Script |
#!/bin/csh -f
### RENDER SCRIPT

# GET THE CURRENT TRY COUNT
#    Keep our own 'try file' that is basically the log filename
#    with a .try extension on the end. Note that resetting the
#    rush try count will reset our own try counter as well.
#
if ( $RUSH_TRY == 0 || ! -e $RUSH_LOGFILE.try ) echo 0 > $RUSH_LOGFILE.try
@ mytry = `cat $RUSH_LOGFILE.try`
if ( $mytry >= 3 ) then
    echo --- FAIL: TRY COUNT EXCEEDED ; exit 1
endif

# Render command here
render /myshow/myshot/foo.$RUSH_PADFRAME.rib
set err = $status
echo --- RENDER EXIT CODE: $err

# Update try count AFTER render completes
#    This way we count complete trys, not bumps.
#
@ mytry++ ; echo $mytry > $RUSH_LOGFILE.try
if ( $err ) then
    echo --- FAIL ; exit 1
endif
echo --- OK
exit 0
|
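The key idea in the script above is that the private counter is written only after the render finishes, so a frame that gets bumped mid-render never increments it. The following Python sketch reproduces just that bookkeeping. The function names are hypothetical, and a temporary file stands in for $RUSH_LOGFILE.try.

```python
import os
import tempfile


def read_try_count(try_file, rush_try):
    """Return our own try count, resetting it when rush's count is 0."""
    if rush_try == 0 or not os.path.exists(try_file):
        with open(try_file, "w") as fd:
            fd.write("0\n")
    with open(try_file) as fd:
        return int(fd.read().strip())


def record_completed_try(try_file, count):
    """Store count + 1; call this only after the render has finished."""
    with open(try_file, "w") as fd:
        fd.write(f"{count + 1}\n")
    return count + 1


# Demonstration with a temporary file standing in for $RUSH_LOGFILE.try
with tempfile.TemporaryDirectory() as tmp:
    try_file = os.path.join(tmp, "0001.try")
    count = read_try_count(try_file, rush_try=0)    # first run resets to 0
    count = record_completed_try(try_file, count)   # one render completed
    # Bumps raise rush's own try count, but our file still records only
    # the single completed attempt.
    print(read_try_count(try_file, rush_try=5))
```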
For instance, consider a situation where a 3rd party program outputs error messages like 'cannot open file' or 'write error', but always returns an exit code of 0. A savvy render script programmer can use 'egrep' to detect the error message and report it back to rush.
Detecting Render Problems with Grep |
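#!/bin/csh -f
### RENDER SCRIPT
my_render $RUSH_FRAME

# 'my_render' always returns an exit code of 0,
# so to detect errors we have to grep for them.
egrep -s 'cannot open file|write error' $RUSH_LOGFILE
if ( $status == 0 ) then
    echo "-- FAILED --"
    exit 1
endif
echo "-- OK --"
exit 0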
|
Here's the same example using perl.
Detecting Render Problems with Perl |
#!/usr/bin/perl
### RENDER SCRIPT

# 'my_render' always returns an exit code of 0,
# so to detect errors we have to grep for them.
system("my_render $ENV{RUSH_FRAME}");

# Check for error messages from the log file
if ( open(FD, "<$ENV{RUSH_LOGFILE}") ) {
    while ( <FD> ) {
        if ( /cannot open file/ || /write error/ ) {
            print STDERR "-- FAILED --\n";
            exit(1);
        }
    }
    close(FD);
} else {
    print STDERR "$ENV{RUSH_LOGFILE}: $!\n";
}
print STDERR "-- OK --\n";
exit(0);
|
The following advanced example detects particular errors and, if any are
found, adds TD-friendly notes that appear in the job's framelist to
highlight what kind of error each failed frame encountered, ie:
STAT FRAME TRY HOSTNAME        PID    START           ELAPSED   NOTES
Fail 0030  2   vaio            20338  02/27,14:41:22  00:01:03  Missing file
Fail 0031  2   vaio            20339  02/27,14:41:22  00:01:03  Missing file
Fail 0032  2   vaio            20340  02/27,14:41:22  00:01:03  Missing file
Run  0033  9   vaio            20365  02/27,14:55:25  00:00:45  License error
Done 0034  9   vaio            20367  02/27,14:41:25  00:01:04  -
Done 0035  8   rotwang.erco.c  12663  02/27,14:41:32  00:01:06  -
Fail 0036  8   rotwang.erco.c  12664  02/27,14:55:35  00:00:55  Missing file
Fail 0037  8   ontario         20434  02/27,14:55:35  00:00:55  Missing file
Fail 0038  8   ontario         20441  02/27,14:55:35  00:00:55  Missing file
A simpler example of this technique can be found in the rush tutorial.
Grep: An Advanced Example |
#!/bin/csh -f
###############################
#  R E N D E R   S C R I P T  #
###############################
echo "--- Working on frame $RUSH_FRAME - `date`"

### MAYA RENDER
Render30 -s $RUSH_FRAME -e $RUSH_FRAME -b 1 -proj $1 -rd /jobs/MYSHOW/MYSHOT/images $2
set err = $status

### GREP FOR ERROR MESSAGES
set msg = ""
if ( `grep -s "Texture file"          $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Texture File"
if ( `grep -s "Failed to open IFF"    $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "IFF Error"
if ( `grep -s "find destination plug" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Plug Error"
if ( `grep -s "ESEC_J"                $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "License Error"
if ( `grep -s "doesn"                 $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Missing File"
if ( `grep -s "TrenderTesselation"    $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Tesselation Error"
if ( `grep -s "Memory exception"      $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Memory Error"
if ( `grep -s "post-process stage"    $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Post Process"

### FOUND ONE OF THE ABOVE?
if ( "$msg" != "" ) then
    # MAKE NOTE IN FRAMELIST FOR TD/RENDER WATCHER
    rush -notes ${RUSH_FRAME}:"$msg"
    switch ( "$msg" )
        ### NON-FATAL
        case "License Error":
        case "IFF Error":
        case "Plug Error":
        case "Tesselation Error":
        case "Memory Error":
        case "Post Process":
            echo -- REQUEUE
            exit 2
        ### FATAL
        case "Texture File":
        case "Missing File":
        default:
            echo -- FAIL
            exit 1
    endsw
endif

# NON-SPECIFIC ERROR?
if ( $err != 0 ) then
    rush -notes ${RUSH_FRAME}:"See Logs"
    echo -- FAIL
    exit 1
endif

# NO ERRORS
echo -- OK
exit 0
|
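The decision logic in the script above amounts to a lookup table: each log message maps to a framelist note and to either a requeue (exit 2) or a failure (exit 1). A minimal Python sketch of that table-driven approach follows. It covers only a subset of the messages, the function and constant names are hypothetical, and the sample log lines are invented for illustration.

```python
REQUEUE, FAIL, OK = 2, 1, 0

# (text to look for in the log, note for the framelist, exit code)
# Order matters: as in the csh script, a later match overrides an earlier one.
RULES = [
    ("Texture file",       "Texture File",  FAIL),
    ("Failed to open IFF", "IFF Error",     REQUEUE),
    ("ESEC_J",             "License Error", REQUEUE),
    ("doesn",              "Missing File",  FAIL),
    ("Memory exception",   "Memory Error",  REQUEUE),
]


def classify(log_text, render_status):
    """Return (note, exit_code) for a frame, given its log and exit status."""
    note, code = None, None
    for needle, rule_note, rule_code in RULES:
        if needle in log_text:
            note, code = rule_note, rule_code
    if note is not None:
        return note, code
    if render_status != 0:
        return "See Logs", FAIL      # non-specific error
    return None, OK


# Invented log lines, for illustration only.
print(classify("Error: ESEC_J license checkout failed", 1))
print(classify("rendered ok", 0))
```

Keeping the messages in one table makes it easy to see at a glance which errors are treated as transient and which are fatal.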
You can embed 'rush -notes' commands into your render script to alter the 'notes' field for the rendering frame, eg:
    rush -notes ${RUSH_FRAME}:'Missing file'
Frame notes are cleared each time a frame begins rendering, so there's no need to specify a rush command to clear the frame notes in your render script. In fact, doing so is discouraged, for the reason given in the following warning.
Warning: Each execution of 'rush -notes' invokes a TCP connection
to the job server daemon. Invoking 'rush' commands on a per frame
basis is unwise (except under error condition circumstances), as it
imposes a large TCP load on the job server daemon if many connections
occur all at once, slowing the daemon's response critically.
This happens especially if your render times are short and you are rendering on many cpus. Therefore you are encouraged to embed 'rush' commands in render scripts only under error conditions (ie. infrequently), so as to lessen the possibility of multiple concurrent TCP connections. |
Here's an example showing a render script that makes use of the NOTES field to report helpful errors to the user:
% cat render_me
#!/bin/csh -f
echo "--- Working on frame $RUSH_FRAME - `date`"

### YOUR RENDER COMMAND(S) HERE
particle $DATA/files/stars-$RUSH_PADFRAME.par
set err = $status

### CHECK FOR MISSING FILES
###     egrep exits with 0 when the message IS found in the log
egrep -i -s no.such.file.or.directory $RUSH_LOGFILE
if ( $status == 0 ) then
    rush -notes ${RUSH_FRAME}:'Missing file'
    exit 1      # FAIL
endif

### CHECK FOR CORE DUMPS
egrep -i -s core.dumped $RUSH_LOGFILE
if ( $status == 0 ) then
    rush -notes ${RUSH_FRAME}:'Core dumped'
    exit 1      # FAIL
endif

### CHECK FOR LICENSE ERRORS
egrep -i -s no.available.licenses $RUSH_LOGFILE
if ( $status == 0 ) then
    rush -notes ${RUSH_FRAME}:'License error'
    sleep 10
    exit 2      # RETRY
endif

### NON-SPECIFIC ERRORS
if ( $err ) then
    rush -notes ${RUSH_FRAME}:'?'
    exit 1
endif
exit 0

% rush -lf
STAT FRAME TRY HOSTNAME        PID    START           ELAPSED   NOTES
Fail 0030  2   vaio            20338  02/27,14:41:22  00:01:03  Missing file
Fail 0031  2   vaio            20339  02/27,14:41:22  00:01:03  Missing file
Fail 0032  2   vaio            20340  02/27,14:41:22  00:01:03  Missing file
Que  0033  9   vaio            20365  02/27,14:55:25  00:00:45  License error
Done 0034  9   vaio            20367  02/27,14:41:25  00:01:04  -
Done 0035  8   vaio            20369  02/27,14:41:25  00:01:04  -
Done 0036  8   tahoe           20389  02/27,14:41:29  00:01:03  -
Done 0037  8   tahoe           20394  02/27,14:41:29  00:01:03  -
Done 0038  8   tahoe           20396  02/27,14:41:29  00:01:03  -
Done 0039  8   superior        20413  02/27,14:41:32  00:01:03  -
Done 0040  8   superior        20423  02/27,14:41:32  00:01:03  -
Fail 0041  8   erie            20425  02/27,14:41:32  00:00:08  Core dumped
Done 0042  8   rotwang.erco.c  12662  02/27,14:41:32  00:01:06  -
Done 0043  8   rotwang.erco.c  12663  02/27,14:41:32  00:01:06  -
Fail 0044  8   rotwang.erco.c  12664  02/27,14:55:35  00:00:55  Missing file
Fail 0045  8   ontario         20434  02/27,14:55:35  00:00:55  Missing file
Fail 0046  8   ontario         20441  02/27,14:55:35  00:00:55  Missing file
|
When one of the above failed frames is requeued, the NOTES field is cleared as soon as the frame starts rendering again, preventing stale error messages from remaining when the frame re-renders.
To disable this 'auto-clearing' behavior, use the submit command 'FrameFlags keepnotes'.
To pause the job for a short period, use the new 'rush -licpause' option in your render script; it will pause the job for 60 seconds (unless changed with the submit command LicPauseSecs):
#!/bin/csh -f
###############################
#  R E N D E R   S C R I P T  #
###############################
source $RUSH_DIR/etc/.render
echo "--- Working on frame $RUSH_FRAME - `date`"

# INVOKE RENDERER
hscript $hipfile < foo.hscript

# CHECK FOR RENDER LICENSE ERRORS
egrep -s 'Error acquiring license' ${RUSH_LOGFILE}
if ( $status == 0 ) then
    # PAUSE JOB FOR SHORT TIME, REQUEUE FRAME
    rush -licpause
    rush -notes ${RUSH_FRAME}:'License Error'
    exit 2
endif
exit 0
|
However, there are a couple of ways to do it using existing techniques.
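1) Use the 'ram' submit command to secure all of a machine's ram, which effectively secures all of its processors. See below.
2) Use 'rush -reserve' to reserve some of the processors on the machines you need, so you can thread your renders on these machines to use several processors.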
3) Use frame blocks to represent the individual processors. See below.
1) Securing Ram To Secure Processors
If you have a farm of dual proc machines that all have a gig of ram configured in rush (eg. 'rush -lah' shows 1024 in the Ram column), and you submit a job with the 'ram' value set to '1024', then you will effectively secure both processors from rush, because when rush starts your render, it will subtract the ram you requested from the configured ram value in the rush hosts list, leaving zero available for any other job to use.
Also, you will only be able to start rendering on machines that have 1024 available, which means both processors must be unused by rush, otherwise rush will think less than 1024 is available, preventing your job from running on the machine.
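The ram accounting described above can be modelled in a few lines of Python. This is only a sketch of the behavior as the text describes it (requested ram is subtracted from the host's configured ram), not rush's actual scheduler code; the function names and the figures are illustrative.

```python
def ram_available(configured_ram, requested_by_running_jobs):
    """Ram left on a host after subtracting what running jobs requested."""
    return configured_ram - sum(requested_by_running_jobs)


def can_start(configured_ram, requested_by_running_jobs, job_ram):
    """A job can start only if the host still has the ram it asks for."""
    return ram_available(configured_ram, requested_by_running_jobs) >= job_ram


# Dual-proc host configured with 1024 of ram (values are illustrative).
print(can_start(1024, [], 1024))      # idle host, job asks for all the ram
print(can_start(1024, [250], 1024))   # another job already holds 250
print(can_start(1024, [1024], 250))   # our job holds everything
```

The second and third cases show both halves of the technique: the job cannot land on a partly used host, and once it is running nothing else can land beside it.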
2) Reserving Processors
This technique is pretty intuitive; simply use 'rush -reserve' to reserve processors on the machines you want to use, and then submit your job to use those machines.
Set up your render script to first check how many cpus are reserved by your reserve job on the local machine before starting the renderer. If no cpus got reserved (they're busy doing someone else's job), then just render with one thread. But if your reserve job has reserved the other cpu, then tell the renderer to use two threads.
3) Frame Block Approach
Use a single job with the frame range scaled up, so that each pair of queue frames represents one real frame. For instance, to render frames 1 thru 5 using two procs per frame, submit a job where the frame range is multiplied by 10 and stepped by 5 (ie. 10 15 20 .. 50 55). So you end up with:
         _
    10    |__ represents frame 1
    15   _|
         _
    20    |__     "        "   2
    25   _|
         _
    30    |__     "        "   3
    35   _|
     :
     :
So in this case the 'even' frames (10 20 30..) start a 'render listener' (or a 'sleep' if the renderer doesn't use listeners), and the 'odd' frames (15 25 35..) start the actual render, looking to the corresponding even frames for the extra threading processor to use.
For example, when frame '10' starts, the render script simply sleeps, or as is the case with some renderers, starts the 'listener' render process.
When frame '15' starts, it looks to frame 10 for the cpu to use as the threading processor, and starts the render for the real frame #1, telling it to thread with frame 10's processor.
When 15 finishes rendering, it marks frame 10 as 'done' so its cpu frees up, and the 'render listener' is killed.
If frame 15 starts but sees that frame 10 is not running (ie. it is still 'Que'ed), then it simply exits with a requeue (exit 2) and tries again. Normally, though, frame 10 will always be running before 15 is started.
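The frame numbering used by the frame block approach is easy to get wrong, so here is a small Python sketch of the mapping described above: queue frames ending in 0 are listeners, queue frames ending in 5 are renders, and both map back to the same real frame. The helper names are hypothetical; a render script would do the same arithmetic on $RUSH_FRAME.

```python
def real_frame(queue_frame):
    """Map a scaled-up queue frame (10, 15, 20, ...) to the real frame."""
    return queue_frame // 10


def is_listener(queue_frame):
    """Frames ending in 0 hold the extra cpu; frames ending in 5 render."""
    return queue_frame % 10 == 0


def listener_for(queue_frame):
    """Return the listener frame that a render frame pairs with."""
    return queue_frame - 5


# Frames 1 thru 5, scaled by 10 and stepped by 5: 10 15 20 .. 50 55
for qf in range(10, 56, 5):
    if is_listener(qf):
        role = "listener"
    else:
        role = f"render, pairs with {listener_for(qf)}"
    print(f"{qf:3d} -> real frame {real_frame(qf)} ({role})")
```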