Rush Render Queue - Examples
V 103.07b 05/11/16
(C) Copyright 2008, 2016 Seriss Corporation. All rights reserved.
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.

Strikeout text indicates features not yet implemented


Rush Examples



   Introduction To Rush  

General Overview

    The Rush render queue allows users to manage jobs. A 'job' is usually just a range of frames that need to be rendered. To start a job you need to run a Submit Script, which contains: the instructions defining the job, the frame range to be rendered, which machines are to be used for rendering, the 'priority' the job should run at, the pathname of the Render Script for rendering frames, etc.

    There are several premade GUI submit interfaces that can be customized, such as the Maya submit script. Or you can write your own Submit Script in the language of your choice: Python, Perl, Csh, Bash.

    A single script can serve both as the script the user invokes to submit the job and as the render script that runs on each machine: setting the proper environment variables for the renders, checking the renders for error codes and messages, handling retry logic, etc.

    To render each frame, Rush runs the script on each machine to invoke whatever UNIX commands are necessary to run the renderer or compositor.

    A frame number is passed via an environment variable, $RUSH_FRAME, to tell the script which frame to work on. The script can invoke any 'command line based' programs: renderers, compositors, custom C programs, other perl/python/csh scripts, etc.

    The purpose of the script during rendering is to ensure the correct environment variables are configured to run the render, determine what the render command line should look like, check the renderer for errors, and return one of the 'rush exit codes' to indicate success or failure: 0=Done ok, 1=Failed, 2=Retry. The script will be executed on all the machines the job requests to use, and is invoked with the same pathname on each machine. So the script must be accessible via a file server (such as an NFS path on Unix, or a UNC path on Windows).
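
As an illustration of the exit code convention described above, here is a minimal Python sketch of the render half of a script. The function names and the `retryable_statuses` parameter are hypothetical; which renderer exit statuses deserve a retry depends entirely on the renderer, so treat that part as an assumption rather than Rush behavior.

```python
import os
import subprocess

# Rush exit codes as listed in the text: 0=Done ok, 1=Failed, 2=Retry.
RUSH_DONE, RUSH_FAIL, RUSH_RETRY = 0, 1, 2


def rush_exit_code(render_status, retryable_statuses=()):
    """Translate a renderer's exit status into a Rush exit code.

    retryable_statuses is a hypothetical collection of statuses worth
    retrying; anything else that is non-zero counts as a failure.
    """
    if render_status == 0:
        return RUSH_DONE
    if render_status in retryable_statuses:
        return RUSH_RETRY
    return RUSH_FAIL


def render_frame(argv, retryable_statuses=()):
    """Run one render command and return the Rush exit code for it."""
    frame = os.environ.get("RUSH_FRAME", "?")  # passed in by Rush per frame
    print(f"--- Working on frame {frame}")
    status = subprocess.call(argv)  # argv is the render command line
    return rush_exit_code(status, retryable_statuses)
```

A real script would end by passing the result to `sys.exit()`, so that Rush sees 0, 1 or 2 as the script's own exit code.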

    When the render script runs, various environment variables are passed from Rush which may be useful to intermediate or advanced script programmers.

    The script is executed to start the job running. As the job runs, frames are started on the various networked machines as needed, eating through the frame list until there are no more frames to render. After each frame renders, the system records how long the frame took to run; this is normally viewed within "irush", but can also be viewed from the command line. The render output for each frame is saved in the Frame Logs, which can also be viewed from within irush by double clicking on the frame.

    The render queue uses Priority Values to allow important jobs to take precedence over lower priority jobs. When priorities of different jobs are equal, a 'round robin' approach is used to allow jobs to vie for cpus.

    Priority flags allow a job to fight other lower priority jobs, instead of passively waiting for idle cpus to become available ('k', the Kill flag). A job can also have a priority such that no other jobs can kill it ('a', the Almighty flag). Combined, these flags enable a job to kill off other jobs without allowing other jobs to kill it. Sysadmins can monitor audit logs for the use of such flags to prevent misuse, and there are rush configuration flag settings that let the admin limit which users can use these flags.

    Rush has several GUI programs: irush lets users monitor and control jobs, rushtop lets users monitor processor and ram use in realtime, onrush gives users a simple 'light switch' for disabling their workstations from rendering, and rushadmin lets administrators administer various aspects of rush.

   Technical Overview  

'rush' and 'rushd'

The render queue consists of two executables:

  • rush(1) is the command line oriented user front end tool.
  • rushd(8) is the network daemon that runs on each host, one daemon per host.

rush(1) is used to control all aspects of the render queue. It is basically a 'client', and the daemon is a 'server'. rush(1) uses mostly TCP connections to communicate with the daemon.

The rushd(8) daemon is usually started by a machine's boot script, and accepts both TCP and UDP protocols, mostly using UDP to intercommunicate with the other rushd(8) daemons running on other hosts. Absolutely *no* broadcasting or multicasting is used; all UDP traffic is unicast (point to point).

There is one rushd(8) daemon that runs per host. Even multi-processor hosts use only one instance of the daemon to manage all processors.

The render queue system has two configuration files, located in /usr/local/rush/etc/*:

The rush.conf file contains general configurable settings for the system, used both by the daemon and front end tools.

The 'hosts' file contains a list of all hosts participating in the render queue system, along with the number of configured cpus on each host, and other host specific information.

The daemon reloads both files automatically, within 30 seconds, whenever their date stamps change.
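
Reloading a file when its date stamp changes is a common polling pattern. The Python sketch below shows the general idea only; it is not how rushd is implemented, and the `ReloadingFile` class is a hypothetical name used for illustration.

```python
import os


class ReloadingFile:
    """Re-read a file whenever its modification time changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None  # date stamp seen at the last reload
        self.text = None   # file contents from the last reload

    def poll(self):
        """Reload if the date stamp changed; return True when reloaded."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False
        with open(self.path) as f:
            self.text = f.read()
        self.mtime = mtime
        return True
```

Calling `poll()` on a timer (for example every 30 seconds, the figure quoted above) gives the "picked up shortly after the file changes" behavior without any explicit reload command.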

These files should be rdist(1)ed or rsync(1)ed from a central location whenever modified. Neither file should be in an NFS mounted directory, nor should the daemon executables.

Job Serving

Rush is a distributed system, which means there is no dependence on any single machine when it comes to serving jobs; this design distributes the load, and ensures no one box is responsible for all jobs.

When a user submits a job, the submit script determines which machine will act as the server for that job.

By default, the machine submitting the job becomes the job server, which is usually the user's workstation. This is normally a good choice, as this distributes the load of job serving when many people are submitting jobs (each user's workstation manages only their own jobs).

However, if a user thinks their workstation is unstable, and might be offline for long periods, or rebooted often, then the user can specify some other machine to act as the job server for their job(s).

Or, the submit scripts can be configured to choose a job server from a list of machines at random, or in some kind of rotation so that the process of job serving is automatically distributed to a known list of servers. It is NOT recommended that you choose a single machine to act as the job server for all jobs on large networks. It is important that the jobs be distributed, to manage load.
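
The two selection strategies mentioned above (random choice and rotation) might look like this in a Python submit script. The function names and the idea of a stored submit counter are assumptions for illustration; how the chosen host is then handed to rush is not covered by this sketch.

```python
import random


def pick_random_server(servers, rng=random):
    """Choose a job server at random from a known list of hosts."""
    return rng.choice(servers)


def pick_rotating_server(servers, submit_count):
    """Choose a job server in rotation.

    submit_count is a running count of submitted jobs; a real script
    would have to persist it somewhere between submissions.
    """
    return servers[submit_count % len(servers)]
```

Either approach spreads job serving across the list, which is the point the paragraph above makes about avoiding a single job server on large networks.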


Examples

   Template Submit Script  

[erco@howland]% rush -tss            # Template Submit Script - Let's look at it.
#!/bin/csh -f                        # It's just a csh script.

#
#  S U B M I T
#

source /usr/tmp/rush/etc/.submit

rush -submit << EOF
title           SHOW/SHOT             # Title of job
ram             250                   # Amount of RAM job expects to use (MB)
frames          1-100                 # Frame range(s)
logdir          $cwd/logs             # Directory for frame logs
command         $cwd/render-script    # Path to render script
donemail        erco                  # Who to send mail to when job is done
autodump        done                  # Autodump job when done
cpus            howland=1@100         # cpu(s) to run on

# Optional
#notes          This is a test        # Optional free form notes for job
#state          Pause                 # Optional starting state for job
EOF
exit $status
    

There are other examples of submit scripts written in perl, and even C/C++.
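
For comparison, a rough Python equivalent of the csh template above could build the same text and pipe it to 'rush -submit' on standard input, as the here-document does. This is a sketch, not one of the shipped examples: the function names are hypothetical, only keywords shown in the template are used, and it skips the 'source' step that the csh version uses to set up the environment.

```python
import subprocess


def build_submit_text(title, frames, command, cpus, ram, logdir,
                      donemail=None, autodump=None):
    """Return the job description text in the template's 'key value' layout."""
    fields = [("title", title), ("ram", ram), ("frames", frames),
              ("logdir", logdir), ("command", command)]
    if donemail:
        fields.append(("donemail", donemail))
    if autodump:
        fields.append(("autodump", autodump))
    fields.append(("cpus", cpus))
    # One 'keyword value' pair per line, keyword padded to 16 columns.
    return "".join(f"{key:<16}{value}\n" for key, value in fields)


def submit(text):
    """Feed the job description to 'rush -submit' on stdin."""
    return subprocess.run(["rush", "-submit"], input=text, text=True,
                          capture_output=True)
```

For example, `build_submit_text("VEGA", "1-10", "/path/to/render_me", "howland=3@100", 100, "/path/to/logs")` produces the same kind of block that sits between `<< EOF` and `EOF` in the template; the paths here are placeholders.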

   Template Render Script  

[erco@howland]% rush -trs                # Template Render Script - Let's look at it.
#!/bin/csh -f                            # It's just a csh script, too.

###############################
#  R E N D E R   S C R I P T  #
###############################

# Source your render environment as needed
source $RUSH_DIR/etc/.render              # System environment settings. (You can add other sources)

echo "--- Working on frame $RUSH_FRAME - `date`" 

### YOUR RENDER COMMAND(S) HERE
time sleep 10                             # Your render command
set err = $status                         # Keep exit code from your 'render command'

# Rush exit codes: 0=DONE  1=FAIL  2=RETRY 
if ( $err ) then                          # Translate render command exit code to rush exit codes (0|1|2)
    echo --- FAIL; exit 1 
else 
    echo --- DONE; exit 0 
endif 
#NOTREACHED#
    

   Creating Render and Submit Scripts  

[erco@howland]% rush -tss > submit_me    # Create submit script
[erco@howland]% rush -trs > render_me    # Create render script
[erco@howland]% chmod +x submit_me render_me
[erco@howland]% ls -la submit_me render_me
-rwxrwxr-x    1 erco     stree        443 Jan 24 00:59 render_me
-rwxrwxr-x    1 erco     stree        359 Jan 24 00:59 submit_me

[erco@howland]% vi submit_me render_me   # Customize the scripts (see below)

[..]

[erco@howland]% cat submit_me
#!/bin/csh -f

#
#  S U B M I T
#

source /usr/tmp/rush/etc/.submit

rush -submit << EOF
title           VEGA              # Set our title
ram             100               # MB of RAM we expect to use
frames          1-10              # Frame range to use (1 thru 10)
logdir          $cwd/logs
command         $cwd/render_me    # Our render script
donemail        erco
autodump        done
cpus            howland=3@100     # Use up to 3 cpus on howland at 100 priority
EOF
exit $status

[erco@howland]% cat render_me
#!/bin/csh -f

###############################
#  R E N D E R   S C R I P T  #
###############################

# Source your render environment as needed
source $RUSH_DIR/etc/.render

echo "--- Working on frame $RUSH_FRAME - `date`" 

### YOUR RENDER COMMAND(S) HERE
render < /job/VEGA/ribs/${RUSH_PADFRAME}.rib      # Command to be rendered
set err = $status

# Rush exit codes: 0=DONE  1=FAIL  2=RETRY 
if ( $err ) then 
    echo --- FAIL; exit 1 
else 
    echo --- DONE; exit 0 
endif 
#NOTREACHED#
    

   Submitting A Job  

[erco@howland]% ./submit_me     # Submit job by running submit script
setenv RUSH_JOBID how.848       # Our jobid (grab with mouse and paste below)

[erco@howland]% setenv RUSH_JOBID how.848

[erco@howland]% rush -lj        # List Jobs
STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
------ ----------- ------------ -------- ----- ---- -----------
Run    how.848     VEGA         erco     %30   3    00:00:07
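
A script that submits jobs can pick the jobid out of the 'setenv RUSH_JOBID ...' line instead of pasting it by hand. The Python sketch below assumes the submit output contains a line in exactly the form shown in the transcript above; `parse_jobid` is a hypothetical helper name.

```python
import re

# Matches the 'setenv RUSH_JOBID <jobid>' line printed on submit.
_SETENV = re.compile(r"^\s*setenv\s+RUSH_JOBID\s+(\S+)", re.MULTILINE)


def parse_jobid(submit_output):
    """Return the jobid from submit output, or None if no such line exists."""
    match = _SETENV.search(submit_output)
    return match.group(1) if match else None
```

Storing the result in `os.environ["RUSH_JOBID"]` would then play the same role as the `setenv` command in the transcript for any rush commands the script starts afterwards.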
    

   Monitoring Frames  

[erco@howland]% rush -lf        # List Frames to see how they're doing

STAT FRAME TRY HOSTNAME       PID   START          ELAPSED   _ 
Run  0001  1   howland        12499 01/24,01:00:25 00:00:09   | 3 frames running on
Run  0002  1   howland        12500 01/24,01:00:25 00:00:09   | howland for last 9 secs.
Run  0003  1   howland        12501 01/24,01:00:25 00:00:09  _|
Que  0004  0   -              0     00/00,00:00:00 00:00:00 
Que  0005  0   -              0     00/00,00:00:00 00:00:00 
Que  0006  0   -              0     00/00,00:00:00 00:00:00 
Que  0007  0   -              0     00/00,00:00:00 00:00:00 
Que  0008  0   -              0     00/00,00:00:00 00:00:00 
Que  0009  0   -              0     00/00,00:00:00 00:00:00 
Que  0010  0   -              0     00/00,00:00:00 00:00:00 

[erco@howland]% rush -lfi       # List Frame Info for brief report
State Total Perc
----- ----- ----
Que   7     %70                 # %70 to go
Run   3     %30                 # %30 busy running
Done  0     %0                  # no frames done yet
Fail  0     %0                  # no frames failed either
Hold  0     %0

[erco@howland]% rush -lc        # List cpus to see all cpus we submitted

CPUSPEC[HOST]        STATE       FRM  PID     JOBTID  ELAPSED  NOTES
how=3@100            Run         0001 12499   2       00:00:31
how=3@100            Run         0002 12500   3       00:00:31
how=3@100            Run         0003 12501   4       00:00:31

[erco@howland]% rush -lfi
State Total Perc
----- ----- ----
Que   4     %40    
Run   3     %30
Done  3     %30                 # Some frames are done; up to %30..
Fail  0     %0
Hold  0     %0
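
The brief report is easy to consume from a script. The following Python sketch parses text in the 'State Total Perc' layout shown above into a dictionary; it assumes that layout is stable, and `parse_frame_info` is a hypothetical name.

```python
def parse_frame_info(report):
    """Parse a 'State Total Perc' report into {state: (total, percent)}."""
    info = {}
    for line in report.splitlines():
        parts = line.split()
        # Data rows look like: 'Que   4     %40'; headers are skipped.
        if (len(parts) >= 3 and parts[1].isdigit()
                and parts[2].startswith("%") and parts[2][1:].isdigit()):
            info[parts[0]] = (int(parts[1]), int(parts[2][1:]))
    return info
```

A monitoring script could, for instance, treat the job as finished once the 'Que' and 'Run' totals are both zero.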
    

   Done Mail  

[erco@howland]%
You have new mail.             # Rush sends mail when job is done

[erco@howland]% Mail           # Let's read the mail..
"/usr/mail/erco": 1 message 1 new
>N  2 erco@erco.com       Mon Jan 24 01:02  [how-848] VEGA (%0 QUE, %100 DONE, %0 FAIL)
& 
From erco@erco.com  Mon Jan 24 01:02:30 2000
Date: Mon, 24 Jan 2000 01:02:30 -0800
From: erco@erco.com (Greg Ercolano)
To: erco@erco.com
Subject: [how-848] VEGA (QUE=%0, DONE=%100, FAIL=%0)   # Subject shows jobid, title, stats

     Jobid: how-848
     Title: VEGA
  Priority: 1
    LogDir: /usr/var/tmp/rush/logs
       Ram: 100
   Command: /usr/var/tmp/rush/render_me    
ChkCommand: 
EndCommand: 
  AutoDump: done
      User: erco (1000/1007)
  DoneMail: erco
 StartDate: Mon Jan 24 01:00:24 2000
   EndDate: -
   Elapsed: 00:02:05
    Frames: 10
      Cpus: how=3@100
  Notes[0]: -
  Criteria: 

STAT FRAME TRY HOSTNAME       PID   START          ELAPSED   # Frame list dump
Done 0001  1   howland        12499 01/24,01:00:25 00:00:32 
Done 0002  1   howland        12500 01/24,01:00:25 00:00:32 
Done 0003  1   howland        12501 01/24,01:00:25 00:00:32 
Done 0004  1   howland        12517 01/24,01:00:56 00:00:31 
Done 0005  1   howland        12519 01/24,01:00:56 00:00:32 
Done 0006  1   howland        12522 01/24,01:00:57 00:00:32 
Done 0007  1   howland        12535 01/24,01:01:28 00:00:32 
Done 0008  1   howland        12537 01/24,01:01:28 00:00:31 
Done 0009  1   howland        12539 01/24,01:01:28 00:00:31 
Done 0010  1   howland        12548 01/24,01:02:00 00:00:30 

& q
Held 1 message in /usr/mail/erco
    

   Frame Logs  


[erco@howland]% ls logs                         # Frame logs directory
0001       0003       0005       0007       0009       framelist
0002       0004       0006       0008       0010

[erco@howland]% more logs/0001                  # Frame 0001's log
------------------------------------------       _
--    Host: howland                               |
--     Pid: 12499                                 |
--   Jobid: how.848                               |
--   Frame: 1                                     |
--   Owner: erco (1000/1007)                      |  Rush header
--  Tmpdir: /usr/var/tmp/RUSH_TMP.12499           |
-- Logfile: /usr/var/tmp/rush/logs/0001           |
-- Command: /usr/var/tmp/rush/render_me           |
-- Started: Mon Jan 24 01:00:26 2000             _|
------------------------------------------       _ 
--- Working on frame 1 - Mon Jan 24 01:00:26      |
Writing /job/VEGA/tif/0001.tif                    |  Output from render script
--- DONE                                         _|

[erco@howland]% more logs/framelist             # Frame List when job completed

STAT FRAME TRY HOSTNAME       PID   START          ELAPSED  NOTES
Done 0001  1   howland        12499 01/24,01:00:25 00:00:32 
Done 0002  1   howland        12500 01/24,01:00:25 00:00:32 
Done 0003  1   howland        12501 01/24,01:00:25 00:00:32 
Done 0004  1   howland        12517 01/24,01:00:56 00:00:31 
Done 0005  1   howland        12519 01/24,01:00:56 00:00:32 
Done 0006  1   howland        12522 01/24,01:00:57 00:00:32 
Done 0007  1   howland        12535 01/24,01:01:28 00:00:32 
Done 0008  1   howland        12537 01/24,01:01:28 00:00:31 
Done 0009  1   howland        12539 01/24,01:01:28 00:00:31 
Done 0010  1   howland        12548 01/24,01:02:00 00:00:30 
    

   Requeuing Frames  


[erco@howland]% rush -lf
STAT FRAME TRY HOSTNAME       PID   START          ELAPSED  _
Fail 0100  1   howland        23386 01/25,12:19:09 00:00:22  |
Fail 0101  1   howland        23387 01/25,12:19:09 00:00:22  | Failed frames
Fail 0102  1   howland        23388 01/25,12:19:09 00:00:22 _|
Done 0103  1   howland        23410 01/25,12:19:30 00:00:22 
Done 0104  1   howland        23412 01/25,12:19:30 00:00:23 
[..]

[erco@howland]% rush -que 100-102   # Requeue them manually to start again
0100: Que
0101: Que
0102: Que

[erco@howland]% rush -lf
STAT FRAME TRY HOSTNAME       PID   START          ELAPSED   _
Que  0100  1   howland        23386 01/25,12:19:09 00:00:22   |
Que  0101  1   howland        23387 01/25,12:19:09 00:00:22   | They'll restart shortly
Que  0102  1   howland        23388 01/25,12:19:09 00:00:22  _|
Done 0103  1   howland        23410 01/25,12:19:30 00:00:22 
Done 0104  1   howland        23412 01/25,12:19:30 00:00:23 
[..]
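
Finding the failed frames and turning them into ranges can be automated. This Python sketch parses frame list text in the layout shown above and collapses consecutive failed frames into 'first-last' strings such as the 100-102 used in the transcript. The helper names are hypothetical, and whether rush accepts a lone frame number in the same position is an assumption.

```python
def failed_frames(frame_list):
    """Return the frame numbers whose STAT column reads 'Fail'."""
    frames = []
    for line in frame_list.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Fail" and parts[1].isdigit():
            frames.append(int(parts[1]))
    return frames


def frame_ranges(frames):
    """Collapse frame numbers into 'first-last' range strings."""
    ranges = []
    for frame in sorted(set(frames)):
        if ranges and frame == ranges[-1][1] + 1:
            ranges[-1][1] = frame          # extend the current run
        else:
            ranges.append([frame, frame])  # start a new run
    return [f"{a}-{b}" if a != b else str(a) for a, b in ranges]
```

Each resulting range string can then be passed to the requeue command in the same way the range was typed by hand above.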
    

   Pausing A Job  


[erco@howland]% rush -pause        # Pause Job.
Job how.850 is now 'Pause'         # No new frames will be started, running frames allowed to finish.

[erco@howland]% rush -lj
STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
------ ----------- ------------ -------- ----- ---- -----------
Pause  how.850     WERNER/C33   erco     %55   0    Job paused.

[erco@howland]% rush -cont         # Continue Job
Job how.850 is now 'Run'

[erco@howland]% rush -lj
STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
------ ----------- ------------ -------- ----- ---- -----------
Run    how.850     WERNER/C33   erco     %55   0    00:29:20
    

   Advanced: Submit/Monitoring  


[erco@howland]% eval `./submit_me`   # Submit job, sets RUSH_JOBID automatically
[erco@howland]% rush -ljf            # List Job Full - All info about job
     Jobid: how-848
     Title: VEGA
  Priority: 1
    LogDir: /usr/var/tmp/rush/logs
       Ram: 100
   Command: /usr/var/tmp/rush/render_me    
ChkCommand: 
EndCommand: 
  AutoDump: done
      User: erco (1000/1007)
  DoneMail: erco
 StartDate: Mon Jan 24 01:00:24 2000
   EndDate: -
   Elapsed: 00:00:16
    Frames: 10
      Cpus: how=3@100
  Notes[0]: -
  Criteria: 

[erco@howland]% rush -lff | more      # List Frame Full - all information about frames
    Frame: 1
    State: Run
 NewState: ???
 Hostname: howland
 Priority: 100
      Pid: 12499
      Tid: 1
TaskSeqID: 2
    Tries: 1
    Notes: 
StartDate: Mon Jan 24 01:00:25 2000
  EndDate: Sun Jan 01 00:00:00 0000
  Elapsed: 00:00:22

    Frame: 2
    State: Run
 NewState: ???
 Hostname: howland
 Priority: 100
      Pid: 12500
      Tid: 2
TaskSeqID: 3
    Tries: 1
    Notes: 
StartDate: Mon Jan 24 01:00:25 2000
  EndDate: Sun Jan 01 00:00:00 0000
  Elapsed: 00:00:22

    Frame: 3
    State: Run
 NewState: ???
 Hostname: howland
 Priority: 100
      Pid: 12501
[..]

[erco@howland]% rush -tasklist howland     # Task List - View all jobs competing for howland's cpus.
TID JOBSID  TASKSID PID   JOBID/NAME           FRM  PRI   STATE   UTRY NOTES               _
3   4       4       0     how-848,VEGA         0000 100   Idle    1    Last: 0004 DONE      |
4   4       5       0     how-848,VEGA         0000 100   Idle    1    Last: 0005 DONE      | erco's 3 task reservations; one 'Run'ing
5   4       6       15650 how-848,VEGA         0006 100   Run     1    Elapsed=00:00:55    _|
6   6       7       15761 how-851,TESLA/MATTE  0502 200k  Run     2    Elapsed=00:00:25     | liza's 2 task reservations; running at
7   6       8       15763 how-851,TESLA/MATTE  0503 200k  Run     2    Elapsed=00:00:25    _| higher 'killer' priority - both are busy.
    

In the above example, 5 task reservations are on howland: three belong to Erco's job and two to Liza's. Erco's job has 100 priority, and Liza's job has 200k (kill) priority, higher than Erco's. Howland only has 3 cpus, so Liza's higher priority job runs both of its tasks, and the one leftover cpu is taken by Erco's job.
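
The outcome in the task list can be reproduced with a toy model: rank the reservations by numeric priority and hand out cpus from the top. The Python sketch below is a deliberate simplification and not Rush's scheduler; it ignores what the 'k' and 'a' flags actually do, as well as round robin between equal priorities, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_priority(text):
    """Split a priority such as '200k' into (200, 'k')."""
    match = re.fullmatch(r"(\d+)([a-z]*)", text)
    if not match:
        raise ValueError(f"bad priority: {text!r}")
    return int(match.group(1)), match.group(2)


def allocate(reservations, cpus):
    """Give cpus to the highest priority reservations first.

    reservations is a list of (owner, priority_text) pairs; the result
    counts how many cpus each owner ends up with.
    """
    ranked = sorted(reservations,
                    key=lambda r: parse_priority(r[1])[0], reverse=True)
    return Counter(owner for owner, _ in ranked[:cpus])
```

With three reservations at 100 for erco, two at 200k for liza, and 3 cpus, this model gives liza two cpus and erco one, matching the task list.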