To use the render queue you need only two scripts: a Submit Script and a Render Script. Rush can create template scripts for you using the 'rush -tss' (Template Submit Script) and 'rush -trs' (Template Render Script) commands; the examples later in this section show how to create them.
To render each frame, the render queue runs the Render Script, usually a simple script containing whatever UNIX commands are needed to invoke the renderer or compositor.
The 'current frame number' is passed to the render script via the $RUSH_FRAME environment variable. The Render Script can invoke any command line based programs: renderers, compositors, custom C programs, perl scripts, etc.
The Render Script is just a top level wrapper script that sets up the proper rendering environment, runs the renderer, and returns one of the 'rush exit codes' to indicate success or failure: 0=Done ok, 1=Failed, 2=Retry. The Render Script is executed via the same pathname on all the machines configured in your Submit Script, so it must be accessible via NFS (or, in the case of Win/NT, the "Network Neighborhood").
When the render script runs, various environment variables are passed from the render queue, which may be useful to intermediate or advanced shell programmers.
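The wrapper role described above can be sketched in Python, since the Render Script may be any command line program. This is a minimal sketch, not Rush's own template: the names to_rush_exit, render_frame and retry_statuses are hypothetical, and only $RUSH_FRAME and the three exit codes come from the text above.

```python
import os
import subprocess

# Rush exit codes as described above: 0=Done, 1=Fail, 2=Retry
RUSH_DONE, RUSH_FAIL, RUSH_RETRY = 0, 1, 2
LABELS = {RUSH_DONE: "DONE", RUSH_FAIL: "FAIL", RUSH_RETRY: "RETRY"}


def to_rush_exit(status, retry_statuses=()):
    """Translate a command's exit status into a rush exit code.

    retry_statuses is a hypothetical collection of statuses that should
    be retried instead of failed; by default nothing is retried.
    """
    if status == 0:
        return RUSH_DONE
    if status in retry_statuses:
        return RUSH_RETRY
    return RUSH_FAIL


def render_frame(argv, retry_statuses=()):
    """Run one render command and return the rush exit code for it."""
    # The current frame number arrives in the environment.
    frame = os.environ.get("RUSH_FRAME", "?")
    print("--- Working on frame", frame, flush=True)
    status = subprocess.call(argv)
    code = to_rush_exit(status, retry_statuses)
    print("---", LABELS[code], flush=True)
    return code
```

A real render script built this way would end by passing the result of render_frame() to sys.exit(), so that the render queue sees the translated exit code.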
The Submit Script is executed to start the job running. As the job runs, frames are started on the various networked machines as needed, eating through the frame list until there are no more frames to render. After each frame renders, the system records how long the frame took to run in the Frame List Report and logs the error output for each frame in the Frame Logs.
The render queue uses Priority Values to allow important jobs to take precedence over lower priority jobs. When priorities of different jobs are equal, a 'round robin' approach is used to allow jobs to vie for cpus.
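The interaction of priority and round robin can be illustrated with a toy model. This is only a sketch of the behavior described above, not the daemon's actual scheduler: allocate_cpus and its request tuples are hypothetical, and priority flags are ignored here.

```python
from itertools import groupby


def allocate_cpus(total_cpus, requests):
    """Toy model: share one host's cpus among competing jobs.

    requests is a list of (job, priority, cpus_wanted) tuples with
    unique job names. Returns a dict mapping job -> cpus granted.
    """
    granted = {job: 0 for job, _, _ in requests}
    free = total_cpus
    # Highest priority first; sorted() is stable, so jobs of equal
    # priority keep their original order.
    ordered = sorted(requests, key=lambda r: -r[1])
    for _, tier in groupby(ordered, key=lambda r: r[1]):
        tier = list(tier)
        # Round robin inside one priority tier: one cpu per job per pass.
        progress = True
        while free > 0 and progress:
            progress = False
            for job, _, wanted in tier:
                if free > 0 and granted[job] < wanted:
                    granted[job] += 1
                    free -= 1
                    progress = True
    return granted
```

For example, with 3 cpus, a job wanting 2 cpus at priority 200 and a job wanting 3 cpus at priority 100 receive 2 and 1 cpus respectively; two jobs at equal priority take turns until the cpus run out.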
Priority flags change how a job competes for cpus. The 'k' (Kill) flag allows a job to fight off other, lower priority jobs instead of passively waiting for idle cpus to become available. The 'a' (Almighty) flag gives a job a priority such that no other jobs can kill it. Combined, these flags enable a job to kill off other jobs without allowing other jobs to kill it. Sysadmins will want to monitor audit logs for the use of such flags to prevent misuse.
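The rules for the two flags can be written down as a small predicate. This is a sketch of one reading of the paragraph above, not Rush's implementation: parse_priority and can_kill are hypothetical names, and the assumption that flags are written as letters directly after the number (as in '200k') and may be combined (as in '300ka') is mine.

```python
import re
from typing import NamedTuple


class Priority(NamedTuple):
    value: int       # numeric priority
    kill: bool       # 'k' flag: may kill lower priority jobs
    almighty: bool   # 'a' flag: cannot be killed by other jobs


def parse_priority(spec):
    """Parse a priority such as '100', '200k' or '300ka' (assumed syntax)."""
    m = re.fullmatch(r"(\d+)([ka]*)", spec)
    if m is None:
        raise ValueError("unrecognized priority: %r" % spec)
    flags = m.group(2)
    return Priority(int(m.group(1)), "k" in flags, "a" in flags)


def can_kill(attacker, victim):
    """True if 'attacker' may take a cpu away from 'victim'."""
    return (attacker.kill
            and attacker.value > victim.value
            and not victim.almighty)
```

Under this reading, a job at 200k can kill a job at 100, but not a job at 100a, and a job at 200 without the 'k' flag never kills anything.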
rush(1) is used to control all aspects of the render queue. It is basically a 'client', and the daemon is a 'server'. rush(1) uses mostly TCP connections to communicate with the daemon.
The rushd(8) daemon is usually started by a machine's boot script, and accepts both TCP and UDP protocols, mostly using UDP to intercommunicate with the other rushd(8) daemons running on other hosts. Absolutely *no* broadcasting or multicasting is used; all UDP traffic is unicast (point to point).
There is one rushd(8) daemon that runs per host. Even multi-processor hosts use only one instance of the daemon to manage all processors.
The render queue system has two configuration files, located in /usr/local/rush/etc/*:
The 'hosts' file contains a list of all hosts participating in the render queue system, along with the number of configured cpus on each host, and other host specific information.
Both files are reloaded automatically by the daemon within 30 seconds of a change to their date stamps.
These files should be rdist(1)ed from a central location whenever modified. Neither file should be in an NFS mounted directory, nor should the daemon executables.
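Reloading a file when its date stamp changes is commonly done by polling the modification time. The following is a generic sketch of that technique, not the daemon's actual code: ReloadOnChange and read_lines are hypothetical names, and the loader makes no assumption about the real file formats.

```python
import os


def read_lines(path):
    """Hypothetical loader: return the file's non-empty lines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


class ReloadOnChange:
    """Reload a file whenever its modification time changes."""

    def __init__(self, path, loader=read_lines):
        self.path = path
        self.loader = loader
        self.mtime = None   # date stamp seen at the last load
        self.value = None   # result of the last load

    def poll(self):
        """Check the date stamp; reload and return True if it changed."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False
        self.mtime = mtime
        self.value = self.loader(self.path)
        return True
```

Calling poll() on a timer, for example every 30 seconds, gives the "picked up within 30 seconds" behavior described above without any explicit reload command.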
    [erco@howland]% rush -tss             # Template Submit Script - Let's look at it.
    #!/bin/csh -f                         # It's just a csh script.
    #
    #  S U B M I T
    #
    source /usr/tmp/rush/etc/.submit
    rush -submit << EOF
    title      SHOW/SHOT                  # Title of job
    ram        250                        # Amount of RAM job expects to use (MB)
    frames     1-100                      # Frame range(s)
    logdir     $cwd/logs                  # Directory for frame logs
    command    $cwd/render-script         # Path to render script
    donemail   erco                       # Who to send mail to when job is done
    autodump   done                       # Autodump job when done
    cpus       howland=1@100              # cpu(s) to run on
    # Optional
    #notes     This is a test             # Optional free form notes for job
    #state     Pause                      # Optional starting state for job
    EOF
    exit $status
Template Submit Script
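The 'cpus' line in the submit script packs three values into one word: host, number of cpus, and priority. A minimal Python sketch of pulling such a spec apart follows; parse_cpuspec and CpuSpec are hypothetical names, and only the 'host=cpus@priority' form that appears in this document is handled.

```python
import re
from typing import NamedTuple


class CpuSpec(NamedTuple):
    host: str
    cpus: int
    priority: str   # kept as text so a flag suffix such as 'k' survives


def parse_cpuspec(spec):
    """Split a spec like 'howland=3@100' into host, cpus and priority."""
    m = re.fullmatch(r"([^=@\s]+)=(\d+)@(\S+)", spec)
    if m is None:
        raise ValueError("unrecognized cpu spec: %r" % spec)
    return CpuSpec(m.group(1), int(m.group(2)), m.group(3))
```

For instance, 'howland=3@100' reads as "up to 3 cpus on howland at 100 priority".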
In the 'rush -tasklist howland' example shown under Advanced: Submit/Monitoring, 5 task reservations are on howland: two belong to Liza's job, three to Erco's. Erco's job runs at priority 100, and Liza's job runs at the higher priority 200k (the 'k' is the Kill flag). Howland has only 3 cpus, so Liza's higher priority job runs both of its tasks, and the one leftover cpu is taken by Erco's job.

    [erco@howland]% rush -trs             # Template Render Script - Let's look at it.
    #!/bin/csh -f                         # It's just a csh script, too.
    ###############################
    #   R E N D E R  S C R I P T  #
    ###############################

    # Source your render environment as needed
    source $RUSH_DIR/etc/.render          # System environment settings. (You can add other sources)

    echo "--- Working on frame $RUSH_FRAME - `date`"

    ### YOUR RENDER COMMAND(S) HERE
    time sleep 10                         # Your render command
    set err = $status                     # Keep exit code from your 'render command'

    # Rush exit codes: 0=DONE 1=FAIL 2=RETRY
    if ( $err ) then                      # Translate render command exit code to rush exit codes (0|1|2)
        echo --- FAIL; exit 1
    else
        echo --- DONE; exit 0
    endif
    #NOTREACHED#
Template Render Script
Creating Render and Submit Scripts

    [erco@howland]% rush -tss > submit_me       # Create submit script
    [erco@howland]% rush -trs > render_me       # Create render script
    [erco@howland]% chmod +x submit_me render_me
    [erco@howland]% ls -la submit_me render_me
    -rwxrwxr-x 1 erco stree 443 Jan 24 00:59 render_me
    -rwxrwxr-x 1 erco stree 359 Jan 24 00:59 submit_me
    [erco@howland]% vi submit_me render_me      # Customize the scripts (see below)
    [..]
    [erco@howland]% cat submit_me
    #!/bin/csh -f
    #
    #  S U B M I T
    #
    source /usr/tmp/rush/etc/.submit
    rush -submit << EOF
    title      VEGA                             # Set our title
    ram        100                              # MB of RAM we expect to use
    frames     1-10                             # Frame range to use (1 thru 10)
    logdir     $cwd/logs
    command    $cwd/render_me                   # Our render script
    donemail   erco
    autodump   done
    cpus       howland=3@100                    # Use up to 3 cpus on howland at 100 priority
    EOF
    exit $status
    [erco@howland]% cat render_me
    #!/bin/csh -f
    ###############################
    #   R E N D E R  S C R I P T  #
    ###############################

    # Source your render environment as needed
    source $RUSH_DIR/etc/.render

    echo "--- Working on frame $RUSH_FRAME - `date`"

    ### YOUR RENDER COMMAND(S) HERE
    render < /job/VEGA/ribs/${RUSH_PADFRAME}.rib    # Command to be rendered
    set err = $status

    # Rush exit codes: 0=DONE 1=FAIL 2=RETRY
    if ( $err ) then
        echo --- FAIL; exit 1
    else
        echo --- DONE; exit 0
    endif
    #NOTREACHED#
    [erco@howland]% ./submit_me                 # Submit job by running submit script
    setenv RUSH_JOBID how.848                   # Our jobid (grab with mouse and paste below)
    [erco@howland]% setenv RUSH_JOBID how-848
    [erco@howland]% rush -lj                    # List Jobs
    STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
    ------ ----------- ------------ -------- ----- ---- -----------
    Run    how.848     VEGA         erco     %30   3    00:00:07
Submitting Job

    [erco@howland]% rush -lf                    # List Frames to see how they're doing
    STAT FRAME TRY HOSTNAME PID   START          ELAPSED   _
    Run  0001  1   howland  12499 01/24,01:00:25 00:00:09   | 3 frames running on
    Run  0002  1   howland  12500 01/24,01:00:25 00:00:09   | howland for last 9 secs.
    Run  0003  1   howland  12501 01/24,01:00:25 00:00:09  _|
    Que  0004  0   -        0     00/00,00:00:00 00:00:00
    Que  0005  0   -        0     00/00,00:00:00 00:00:00
    Que  0006  0   -        0     00/00,00:00:00 00:00:00
    Que  0007  0   -        0     00/00,00:00:00 00:00:00
    Que  0008  0   -        0     00/00,00:00:00 00:00:00
    Que  0009  0   -        0     00/00,00:00:00 00:00:00
    Que  0010  0   -        0     00/00,00:00:00 00:00:00
    [erco@howland]% rush -lfi                   # List Frame Info for brief report
    State Total Perc
    ----- ----- ----
    Que   7     %69                             # %69 to go
    Run   3     %30                             # %30 busy running
    Done  0     %0                              # no frames done yet
    Fail  0     %0                              # no frames failed either
    Hold  0     %0
    [erco@howland]% rush -lc                    # List cpus to see all cpus we submitted
    CPUSPEC[HOST] STATE FRM  PID   JOBTID ELAPSED  NOTES
    how=3@100     Run   0001 12499 2      00:00:31
    how=3@100     Run   0002 12500 3      00:00:31
    how=3@100     Run   0003 12501 4      00:00:31
    [erco@howland]% rush -lfi
    State Total Perc
    ----- ----- ----
    Que   4     %40
    Run   3     %30
    Done  3     %30                             # Some frames are done; up to %30..
    Fail  0     %0
    Hold  0     %0
Monitoring Frames

    [erco@howland]% You have new mail.          # Rush sends mail when job is done
    [erco@howland]% Mail                        # Let's read the mail..
    "/usr/mail/erco": 1 message 1 new
    >N 2 erco@erco.com Mon Jan 24 01:02 [how-848] VEGA (%0 QUE, %100 DONE, %0 FAIL)
    &
    From erco@erco.com Mon Jan 24 01:02:30 2000
    Date: Mon, 24 Jan 2000 01:02:30 -0800
    From: erco@erco.com (Greg Ercolano)
    To: erco@erco.com
    Subject: [how-848] VEGA (QUE=%0, DONE=%100, FAIL=%0)    # Subject shows jobid, title, stats

    Jobid: how-848
    Title: VEGA
    Priority: 1
    LogDir: /usr/var/tmp/rush/logs
    Ram: 100
    Command: /usr/var/tmp/rush/render_me
    ChkCommand:
    EndCommand:
    AutoDump: done
    User: erco (1000/1007)
    DoneMail: erco
    StartDate: Mon Jan 24 01:00:24 2000
    EndDate: -
    Elapsed: 00:02:05
    Frames: 10
    Cpus: how=3@100
    Notes[0]: -
    Criteria:

    STAT FRAME TRY HOSTNAME PID   START          ELAPSED    # Frame list dump
    Done 0001  1   howland  12499 01/24,01:00:25 00:00:32
    Done 0002  1   howland  12500 01/24,01:00:25 00:00:32
    Done 0003  1   howland  12501 01/24,01:00:25 00:00:32
    Done 0004  1   howland  12517 01/24,01:00:56 00:00:31
    Done 0005  1   howland  12519 01/24,01:00:56 00:00:32
    Done 0006  1   howland  12522 01/24,01:00:57 00:00:32
    Done 0007  1   howland  12535 01/24,01:01:28 00:00:32
    Done 0008  1   howland  12537 01/24,01:01:28 00:00:31
    Done 0009  1   howland  12539 01/24,01:01:28 00:00:31
    Done 0010  1   howland  12548 01/24,01:02:00 00:00:30
    & q
    Held 1 message in /usr/mail/erco
Job Completion: Done Mail

    [erco@howland]% ls logs                     # Frame logs directory
    0001 0003 0005 0007 0009 framelist
    0002 0004 0006 0008 0010
    [erco@howland]% more logs/0001              # Frame 0001's log
    ------------------------------------------    _
    -- Host: howland                               |
    -- Pid: 12499                                  |
    -- Jobid: how.848                              |
    -- Frame: 1                                    |
    -- Owner: erco (1000/1007)                     | Rush header
    -- Tmpdir: /usr/var/tmp/RUSH_TMP.12499         |
    -- Logfile: /usr/var/tmp/rush/logs/0001        |
    -- Command: /usr/var/tmp/rush/render_me        |
    -- Started: Mon Jan 24 01:00:26 2000          _|
    ------------------------------------------    _
    --- Working on frame 1 - Mon Jan 24 01:00:26   |
    Writing /job/VEGA/tif/0001.tif                 | Output from render script
    --- DONE                                      _|
    [erco@howland]% more logs/framelist         # Frame List when job completed
    STAT FRAME TRY HOSTNAME PID   START          ELAPSED  NOTES
    Done 0001  1   howland  12499 01/24,01:00:25 00:00:32
    Done 0002  1   howland  12500 01/24,01:00:25 00:00:32
    Done 0003  1   howland  12501 01/24,01:00:25 00:00:32
    Done 0004  1   howland  12517 01/24,01:00:56 00:00:31
    Done 0005  1   howland  12519 01/24,01:00:56 00:00:32
    Done 0006  1   howland  12522 01/24,01:00:57 00:00:32
    Done 0007  1   howland  12535 01/24,01:01:28 00:00:32
    Done 0008  1   howland  12537 01/24,01:01:28 00:00:31
    Done 0009  1   howland  12539 01/24,01:01:28 00:00:31
    Done 0010  1   howland  12548 01/24,01:02:00 00:00:30
Frame Logs

    [erco@howland]% rush -lf
    STAT FRAME TRY HOSTNAME PID   START          ELAPSED   _
    Fail 0100  1   howland  23386 01/25,12:19:09 00:00:22   |
    Fail 0101  1   howland  23387 01/25,12:19:09 00:00:22   | Failed frames
    Fail 0102  1   howland  23388 01/25,12:19:09 00:00:22  _|
    Done 0103  1   howland  23410 01/25,12:19:30 00:00:22
    Done 0104  1   howland  23412 01/25,12:19:30 00:00:23
    [..]
    [erco@howland]% rush -que 100-102           # Requeue them manually to start again
    0100: Que
    0101: Que
    0102: Que
    [erco@howland]% rush -lf
    STAT FRAME TRY HOSTNAME PID   START          ELAPSED   _
    Que  0100  1   howland  23386 01/25,12:19:09 00:00:22   |
    Que  0101  1   howland  23387 01/25,12:19:09 00:00:22   | They'll restart shortly
    Que  0102  1   howland  23388 01/25,12:19:09 00:00:22  _|
    Done 0103  1   howland  23410 01/25,12:19:30 00:00:22
    Done 0104  1   howland  23412 01/25,12:19:30 00:00:23
    [..]
Requeuing Frames

    [erco@howland]% rush -pause                 # Pause Job.
    Job how.850 is now 'Pause'                  # No new frames will be started, running frames allowed to finish.
    [erco@howland]% rush -lj
    STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
    ------ ----------- ------------ -------- ----- ---- -----------
    Pause  how.850     WERNER/C33   erco     %55   0    Job paused.
    [erco@howland]% rush -cont                  # Continue Job
    Job how.850 is now 'Run'
    [erco@howland]% rush -lj
    STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
    ------ ----------- ------------ -------- ----- ---- -----------
    Run    how.850     WERNER/C33   erco     %55   0    00:29:20
Pausing A Job

    [erco@howland]% eval `./submit_me`          # Submit job, sets RUSH_JOBID automatically
    [erco@howland]% rush -ljf                   # List Job Full - All info about job
    Jobid: how-848
    Title: VEGA
    Priority: 1
    LogDir: /usr/var/tmp/rush/logs
    Ram: 100
    Command: /usr/var/tmp/rush/render_me
    ChkCommand:
    EndCommand:
    AutoDump: done
    User: erco (1000/1007)
    DoneMail: erco
    StartDate: Mon Jan 24 01:00:24 2000
    EndDate: -
    Elapsed: 00:00:16
    Frames: 10
    Cpus: how=3@100
    Notes[0]: -
    Criteria:
    [erco@howland]% rush -lff | more            # List Frame Full - all information about frames
    Frame: 1
    State: Run
    NewState: ???
    Hostname: howland
    Priority: 100
    Pid: 12499
    Tid: 1
    TaskSeqID: 2
    Tries: 1
    Notes:
    StartDate: Mon Jan 24 01:00:25 2000
    EndDate: Sun Jan 01 00:00:00 0000
    Elapsed: 00:00:22

    Frame: 2
    State: Run
    NewState: ???
    Hostname: howland
    Priority: 100
    Pid: 12500
    Tid: 2
    TaskSeqID: 3
    Tries: 1
    Notes:
    StartDate: Mon Jan 24 01:00:25 2000
    EndDate: Sun Jan 01 00:00:00 0000
    Elapsed: 00:00:22

    Frame: 3
    State: Run
    NewState: ???
    Hostname: howland
    Priority: 100
    Pid: 12501
    [..]
    [erco@howland]% rush -tasklist howland      # Task List - View all jobs competing
                                                # for howland's cpus.
    TID JOBSID TASKSID PID   JOBID/NAME          FRM  PRI  STATE UTRY NOTES              _
    3   4      4       0     how-848,VEGA        0000 100  Idle  1    Last: 0004 DONE     |
    4   4      5       0     how-848,VEGA        0000 100  Idle  1    Last: 0005 DONE     | erco's 3 task reservations; one 'Run'ing
    5   4      6       15650 how-848,VEGA        0006 100  Run   1    Elapsed=00:00:55   _|
    6   6      7       15761 how-851,TESLA/MATTE 0502 200k Run   2    Elapsed=00:00:25    | liza's 2 task reservations; running at
    7   6      8       15763 how-851,TESLA/MATTE 0503 200k Run   2    Elapsed=00:00:25   _| higher 'killer' priority - both are busy.
Advanced: Submit/Monitoring
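The frame list reports shown in the examples above are regular enough to post-process from a script, for example to count frames per state or to build the frame range for a 'rush -que' requeue. The following Python sketch assumes the 'rush -lf' column layout shown in this document (state in the first column, frame number in the second). The helper names are hypothetical, and joining several ranges with spaces is an assumption, since only a single range ('100-102') appears above.

```python
from collections import Counter


def parse_frame_list(text):
    """Return [(state, frame), ...] for each frame row of a frame list."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Frame rows have a state word followed by a numeric frame;
        # the header row and '[..]' lines do not and are skipped.
        if len(fields) >= 2 and fields[1].isdigit():
            rows.append((fields[0], int(fields[1])))
    return rows


def count_states(rows):
    """Tally how many frames are in each state (Que, Run, Done, ...)."""
    return Counter(state for state, _ in rows)


def to_ranges(frames):
    """Collapse frame numbers into ranges: [100, 101, 102] -> '100-102'."""
    spans = []
    for f in sorted(set(frames)):
        if spans and f == spans[-1][1] + 1:
            spans[-1][1] = f          # extend the current run
        else:
            spans.append([f, f])      # start a new run
    return " ".join(str(a) if a == b else "%d-%d" % (a, b) for a, b in spans)
```

Given the failed-frame listing from the requeue example, selecting the rows whose state is 'Fail' and passing their frame numbers to to_ranges() yields '100-102', the same range that was typed by hand.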
--- WORK IN PROGRESS ---