Rush Render Queue - TD FAQ
V 103.07b 05/11/16
(C) Copyright 2008, 2016 Seriss Corporation. All rights reserved.
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.

Strikeout text indicates features not yet implemented


TD Frequently Asked Questions



   How can I use padded frame numbers (0000) in my render script?  
Use $RUSH_PADFRAME; Rush creates it for you automatically with 4 digit padding.

To do your own custom frame number padding, use this unix technique:

    set padframe = `perl -e 'printf("%04d",$ENV{RUSH_FRAME});'`
To use different padding widths, just change the '4' (in '%04d') to a different number.
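
For Bourne shell render scripts, the same padding can be done with printf(1) instead of Perl. This is a minimal sketch, assuming $RUSH_FRAME holds the unpadded frame number (as the Perl one-liner above implies); the 'pad_frame' function name is only illustrative.

```shell
#!/bin/sh
# Pad a frame number with leading zeros.
#   $1 = unpadded frame number (e.g. the value of $RUSH_FRAME)
#   $2 = padding width (optional, defaults to 4)
# Pass an unpadded number: printf treats a leading zero as octal.
pad_frame() {
    printf "%0${2:-4}d\n" "$1"
}

# Fall back to frame 12 when run outside of Rush (illustrative value only).
frame=${RUSH_FRAME:-12}
pad_frame "$frame"      # 4 digit padding
pad_frame "$frame" 6    # 6 digit padding
```

Run outside of Rush, this prints 0012 and 000012.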

   My renders are coming up 'FAIL'. How do I figure out what's wrong?  
The most common problem is a render script that does not return exit codes properly. Make sure your render script returns an appropriate exit code: 0=OK, 1=FAIL, 2=RETRY.

Also, check the frame logs being generated by your render script. Frame logs contain the error messages for each rendered frame which should help you determine the problem. Make sure your submit script has LogDir pointing to a valid directory, which is where your frame logs will be found.
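
As a rough sketch of the exit code handling described above, a Bourne shell render script might translate the renderer's result into one of the three codes like this. The 'rush_exit_code' name, the 'my_render_command' placeholder, and the choice to retry on a license failure are all assumptions for illustration, not something Rush requires.

```shell
#!/bin/sh
# Translate a renderer's result into a render script exit code:
#   0=OK, 1=FAIL, 2=RETRY
#   $1 = exit status of the renderer
#   $2 = file holding the renderer's captured output
rush_exit_code() {
    if grep 'could not get a license' "$2" > /dev/null 2>&1; then
        echo 2          # assumed policy: license failures are worth a retry
    elif [ "$1" -eq 0 ]; then
        echo 0          # renderer reported success
    else
        echo 1          # anything else is a failure
    fi
}

# Usage sketch ('my_render_command' is a placeholder):
#   my_render_command > "$out" 2>&1
#   status=$?
#   cat "$out"                                  # keep the output in the frame log
#   exit "$(rush_exit_code "$status" "$out")"
```

The point is that the script's own exit code, not just the renderer's, decides whether the frame is marked done, failed, or retried.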

   How do I have rush automatically retry frames? How do I set the number of retries?  
See Retrying Frames.

   My job isn't starting renders on my cpus. What's going on?  
Use 'rush -lc' and check the Notes column for messages.

If you know the remote cpus aren't just busy with other jobs, then list your cpus and check the 'NOTES' column to see if the system is giving you reasons why your cpus are being rejected. 

For example, the job might be in Pause, there might be no more frames to render, or none of the available machines might have as much ram as your job needs. Here are some typical situations:

    Messages Showing Why Cpus Won't Render
    [erco@howland]% rush -lc
    CPUSPEC[HOST]        STATE       FRM  PID     JOBTID  ELAPSED  NOTES
    placid=3@100k        Idle        -    -       1       00:04:37 Job state is 'Pause'
    tahoe=1@1            Idle        -    -       2       00:02:08 No more frames
    superior=1@1         Idle        -    -       3       00:02:08 Not enough ram
    waccubuc=1@1         Idle        -    -       4       00:02:08 This is a 'neverhost'
    ontario=1@1          Idle        -    -       5       00:02:08 Failed 'criteria' check
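
If a job has many cpus, the NOTES column can be pulled out with a small filter. This is a sketch only, assuming the column layout matches the listing above (whitespace separated, with the note text starting at the seventh field); 'idle_reasons' is a hypothetical helper name.

```shell
#!/bin/sh
# Read 'rush -lc' style output on stdin; print "cpuspec: note" for each
# Idle cpu that has something in the NOTES column (fields 7 and up).
idle_reasons() {
    awk '$1 !~ /^CPUSPEC/ && $2 == "Idle" && NF >= 7 {
        note = $7
        for (i = 8; i <= NF; i++) note = note " " $i
        print $1 ": " note
    }'
}

# Demo using two rows copied from the listing above.
idle_reasons <<'EOF'
CPUSPEC[HOST]        STATE       FRM  PID     JOBTID  ELAPSED  NOTES
tahoe=1@1            Idle        -    -       2       00:02:08 No more frames
superior=1@1         Idle        -    -       3       00:02:08 Not enough ram
EOF
```

In real use the input would come from a pipe, e.g. 'rush -lc | idle_reasons'.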

   How do I set up my submit script to only render on certain platforms or operating systems?  
Use the Criteria submit script command.

This command allows you to build a list of platforms, operating systems, or other general criteria to limit which machines will run your renders.

You can see the different criteria names in the output of 'rush -lah'. It is up to your sysadmin to maintain the criteria names.

   How can I render several frames in one process using rush?  
With clever scripting. See Batching Multiple Frames for how to render several frames at a time.

Sometimes it pays to render several frames at a time rather than one at a time, to decrease the amount of time the renderer spends loading files.

If you have existing script filters which monitor the progress of renders to determine which frames are rendering, you can probably easily modify these scripts to work with rush to reflect changes in the frame list, using either frame notes (rush -notes) or frame state change operations (rush -que/rush -done). 
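
The arithmetic a batching render script needs is small. The sketch below assumes the job is set up so that Rush hands the script only the first frame of each batch (for instance, by submitting with a frame step equal to the batch size); Batching Multiple Frames remains the authoritative description, and 'batch_end' is a hypothetical helper name.

```shell
#!/bin/sh
# Compute the last frame of a batch, clamped to the end of the job.
#   $1 = first frame of the batch (e.g. $RUSH_FRAME)
#   $2 = batch size
#   $3 = last frame of the whole job
batch_end() {
    end=$(( $1 + $2 - 1 ))
    if [ "$end" -gt "$3" ]; then
        end=$3
    fi
    echo "$end"
}

# Example: a job covering frames 1-25 in batches of 10.
for start in 1 11 21; do
    echo "batch $start-$(batch_end "$start" 10 25)"
done
```

This prints the ranges 1-10, 11-20 and 21-25; the last batch is shortened so the renderer is never asked for frames past the end of the job.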

   My job has its 'k' flag set; why isn't it bumping off other jobs' frames?  
For a job to bump another off a cpu, these things must be true:
  • A job can only bump other jobs of lower priority (i.e., not the same priority).
  • A job can't be bumped if its almighty flag ('a') is set.
  • A job can't be bumped unless its entry in the -tasklist is in either the Avail or Run state.
When a frame is bumped, the bumped frame will show a message in its frame list indicating the job that bumped it, e.g.:

    Render 'Bump' Messages
    % rush -lf erie-790
    STAT FRAME TRY HOSTNAME PID   ELAPSED  NOTES
    Run  0100  0   tahoe    10290 00:00:26 
    Run  0101  0   tahoe    10291 00:00:26 
    Que  0102  1   tahoe    10292 00:00:09 Bumped by ralph's superior-791,KILLER @300ka
    Que  0103  0   -        0     00:00:00 
    [..]

   Is there an easier way to set the RUSH_JOBID environment variable?  
You can use eval `submit` to automatically set it, or a simple alias to set it manually. However, cutting and pasting the 'setenv' command is not so hard.

Some people like to use this alias to make it easy to set new jobid variables:

      # Put this in your .cshrc
      alias jid 'setenv RUSH_JOBID "\!*"'
Then you can use it on the command line to set one or more jobids:
      erco@tahoe % jid tahoe.932 tahoe.933
If you want to have the RUSH_JOBID variable set automatically in your shell whenever you invoke your submit script, then use 'eval':
      erco@tahoe % eval `my_submit_script`
...the shell automatically parses the 'setenv RUSH_JOBID' command rush prints on stdout when a job is successfully submitted. Error messages are not affected by 'eval', so you don't have to worry about losing error messages when using this technique.
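
The 'eval' technique relies on csh syntax, since 'setenv' is not a Bourne shell command. For sh or bash users, one possible approach is to pick the jobid out of the submit output instead. This sketch assumes the line printed on stdout has the form 'setenv RUSH_JOBID <jobids>', as described above; 'jobid_from_submit' is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the jobid(s) from a line of the form: setenv RUSH_JOBID <jobids>
# Reads the submit output on stdin; double quotes around the ids are dropped.
jobid_from_submit() {
    sed -n 's/^setenv RUSH_JOBID *//p' | tr -d '"'
}

# Usage sketch ('my_submit_script' is the same placeholder used above):
#   RUSH_JOBID=$(my_submit_script | jobid_from_submit)
#   export RUSH_JOBID
```

Lines that don't start with 'setenv RUSH_JOBID' are ignored, so other output from the submit script does not end up in the variable.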

   What does 'rush' stand for?  
Rush is not an acronym; it was named after film 'rushes', and the fact that effects people are usually in a rush to get things done.

   Can my render script detect being 'bumped' by higher priority jobs?  
    Not without clever scripting.

    Usually the desire to do this stems from wanting to clean up left over temporary files generated by renders. In most cases, you can avoid left over files by putting temporary files in $RUSH_TMPDIR, which rush cleans automatically, even after bumps.

    Bumps and dumps use SIGKILL to kill the render script and its children. This signal is NOT trappable. There's a reason:

      Under many circumstances SIGTERM, the 'trappable' kill, is not effective, especially during heavy rendering. This causes bumped frames not to bump, screwing up unattended use and leaving processors unproductive.

      Since bumps can happen just as readily as dumps, both use SIGKILL, which is untrappable and always effective (except in pathological cases where the process is hung).

      So do not expect to be able to trap interrupts to detect bumps/dumps.

    If you need a way to determine if you are re-rendering a frame that was previously killed mid-execution (i.e., bumped by a higher priority job), you can put some logic into your render script:

    CSH: How To Detect if Frame Previously Bumped
        #!/bin/csh -f
        ..
        if ( -e /somewhere/$RUSH_FRAME.busy ) then
            echo We are picking up a frame that was killed.
            echo Do pickup stuff here..
        endif
    
        # Create a 'busy' file for this frame
        #    If we are bumped, busy file is left behind 
        #    so that the above logic can detect it.
        #
        touch /somewhere/$RUSH_FRAME.busy
        echo Do rendering here..
        rm -f /somewhere/$RUSH_FRAME.busy
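
The same 'busy file' technique can be written for Bourne shell. This is a sketch under assumptions: the BUSY_DIR variable and its /tmp default are hypothetical (pick a directory every render node can reach), and 'render_frame' is an illustrative wrapper, not a Rush command.

```shell
#!/bin/sh
# Directory for the 'busy' marker files (hypothetical default).
BUSY_DIR=${BUSY_DIR:-/tmp}

# Run a render command for one frame, leaving a marker behind if killed.
#   $1 = frame number, remaining args = render command to run
render_frame() {
    frame=$1
    shift
    busy="$BUSY_DIR/$frame.busy"
    if [ -e "$busy" ]; then
        echo "Frame $frame was killed during an earlier try; doing pickup"
        # ..pickup/cleanup work would go here..
    fi
    touch "$busy"       # stays behind if we are killed with SIGKILL
    "$@"
    status=$?
    rm -f "$busy"       # reached only if the render command returned
    return "$status"
}

# Usage sketch ('my_render_command' is a placeholder):
#   render_frame "$RUSH_FRAME" my_render_command
```

As in the csh version, the marker is removed whenever the render command returns, whether it succeeded or failed; only a kill leaves it behind.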
            

   Can I chain separate jobs together, so that one waits for the other to get done?  
Yes, see the submit script command WaitFor to have a job wait for others to dump before starting.

Also, see DependOn to have a job wait for frames in another job to get done, i.e., rather than wait for the entire job to complete.

   Is it possible to use negative frame numbers in rush?  
Yes, as of Rush 102.42 you can now do your evil negative frame numbers.

Older versions of Rush do not support negative numbers.

   Is there a way to see just the cpus busy running my job?  
Yes. In Unix:

   rush -lc | grep Busy
   rush -lf | grep Run
   

...and on WinNT, if you don't have grep(1):

   rush -lc | findstr Busy
   rush -lf | findstr Run
   

   Is there a way to see what jobs a machine is busy rendering?  
In Unix:

   rush -tasklist host | grep Busy
   

..and on WinNT, if you don't have grep(1):

   rush -tasklist host | findstr Busy
   

   Is there a way to requeue a busy frame for a host that is down?  
If a machine goes down while rendering a frame, the frame stays in the Busy state until the machine is rebooted. Once rush realizes the remote machine rebooted, it requeues the frame.

But if the machine never reboots, the frame will stay in the Busy state indefinitely, unless you take the following action.

Assuming you're *sure* the machine is down, and not just 'slow', use the following command:

    % rush -down hosta hostb
    

...where 'hosta' is the name of the machine that is down, and 'hostb' is the name of the machine that's the server for the job(s) with the hung frame(s).

Beware; if the remote machine is not really down, and is still running the frame, doing the above will start the frame running on another machine, and the two frames will overwrite each other.

   How do I list all the machines in a hostgroup?  
As of Rush version 102.31, you can list the hostgroups with the 'rush -lhg' command.

With older versions of Rush, you can grep the output of 'rush -lah', or parse the contents of the $RUSH_DIR/etc/hosts file.

For instance, to print all the hosts in the "+foo" hostgroup:

    rush -lah | grep +foo
    
...or to precisely parse them from the hosts file with awk (which you should be able to cut and paste into a Unix tcsh shell):

    Parsing Host Groups from rush/etc/hosts
    	awk 'BEGIN { s="+foo"; }                                    \
    	     { if (match($0,"^#")) next; n = split($5,arr,",");     \
    	       for (i=1; i<=n; i++) {                               \
    		   if(arr[i]==s) { print $1; break; }               \
    	       }                                                    \
    	     }' < /usr/local/rush/etc/hosts
    	    

   Why is rush creating files owned by 'rush' or 'ntrush'?  

    Under Windows, rush assumes all jobs submitted on windows machines to be owned by the user the Rushd service is configured to run as. Normally this is a special user created for rendering called either 'rush', 'ntrush', 'render', or whatever your sysadmin configured the username to be.

   Why are there weird file permissions on my rendered images?  

    Under unix, the file permissions are determined by your umask.

    If the umask is incorrect, make sure your render script is sourcing the $RUSH_DIR/etc/.render file, and make sure the sysadmin configured this file correctly to include the umask appropriate for render jobs. This way everyone will have the correct umask by default, since all render scripts should be sourcing that file.

    You can also explicitly specify a umask command in your render script. Just include the command somewhere before your first render commands.
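
A short Bourne shell illustration of an explicit umask follows. The value 002 is only an example (owner and group read/write, everyone else read-only); use whatever your facility requires.

```shell
#!/bin/sh
# Set the umask before the first render command so files created by the
# render pick up the intended permissions. 002 is only an example value.
umask 002

# Demonstrate the effect on a newly created file.
demo=$(mktemp -d)
touch "$demo/frame.0001.tif"
ls -l "$demo/frame.0001.tif" | cut -c1-10
rm -rf "$demo"
```

On a typical unix filesystem the 'ls' line shows -rw-rw-r-- for the new file.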

   Is there a way to start a job at a particular time?  
    Yes, you can now specify a time for the job to wait for, either in the submit script with the WaitFor command, or from the command line with 'rush -waitfor', e.g.:
        waitfor +8h     -- wait 8 hours from now
        waitfor 7:30pm  -- wait until 7:30pm
        
    See WaitFor for more info.

   Under Windows, why are UNC paths better than drive letters?  

    It is highly recommended you use UNC paths for all renders when windows platforms are involved. UNC paths are better than drive letters for many reasons:

    • UNC paths always work.
      UNC's work universally once the server is sharing them. For drive maps to work, they have to be configured at each client.

    • Drive letters don't always work.
      If a user isn't logged in, or logs out, drive letters 'disappear', and can unexpectedly cause background renders to suddenly fail.

    • Drive letters can disappear when someone logs out.
      UNC paths always work, being resolved 'on the fly', and are not affected by interactive user login/logouts.

    • UNC paths will work even on a freshly booted machine.
      Drive letters won't work unless someone is logged in, and has already configured them.

    • Drive letters are not portable to other platforms.
      Unix cannot be made to understand or redirect drive letter style pathnames, eg. z:/foo/bar. Only UNC paths are similar enough to unix to be portable (eg. //host/share/foo/bar)

    • Microsoft changed how drive mappings behaved between Win2K and XP.
      In Win2K (and older), drive maps were global to the machine.
      In WinXP (and newer), drive maps are now 'per user'; so a drive map setup for the logged in user won't affect other users or services (such as Rushd's).

    For these reasons, and others, UNCs should be used over drive letters. Microsoft's own documentation recommends against the use of drive letters, due to their limitations in multi-user environments, and in the context of services.

    This said, there are ways to force drive mappings in Rush, and several companies seem to work sufficiently with this configuration. But if you can avoid drive maps, it's best to.

   How do I make UNC paths work the same on Windows and Unix?  

    The trick is to tweak the unix side, using symbolic links or making mount directories that simulate the windows path layout.

    The best thing is to create the mounts to match the UNC paths. On my network, windows machines see the server's disk as //tahoe/jobs. On the unix machines, I can organize my mount points this way:

      mkdir /tahoe
      mkdir /tahoe/jobs
      mount -t nfs tahoe:/raid/jobs /tahoe/jobs

    This way 'ls -la //tahoe/jobs' will show the directory contents of the file server.

    Or, if your mounts are in a directory like /mnt/jobs and you don't want to change that, you can work around easily enough with symbolic link(s) to make UNC paths work under unix.

    For instance, to get //tahoe/jobs to resolve to /mnt/jobs, I can create the necessary symbolic link this way:

      mkdir /tahoe
      ln -s /mnt/jobs /tahoe/jobs

    ..this makes the unix path //tahoe/jobs resolve through the symlink to the actual mount directory /mnt/jobs.

    You can find out the UNC pathnames for your drive maps by typing at a DOS prompt:

      NET USE

    For example:

        C:\>net use
        New connections will be remembered.
        Status       Local     Remote                    Network
        -----------------------------------------------------------------------------
        Connected    Z:        \\tahoe\jobs              Microsoft Windows Network
                     -----     ---------------
                     Drive     UNC pathname
            
    Note that in many 3rd party browsers, you have to include the trailing slash after the volume name to browse the base directory. eg. when typing the first part of a UNC path into a browser, include the trailing slash:
            //tahoe/jobs        -- BAD: might not present a directory listing
            //tahoe/jobs/       -- GOOD: more likely will present a listing
        
    With no exceptions that I'm aware of, all third party applications (Maya, Renderman, Houdini, Rush, etc) will understand front slashes just as well as backslashes in pathnames. Front slashes are preferred, as they are portable across multiple platforms.
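
When a pathname arrives with backslashes (for instance, copied from NET USE output), it can be converted to the portable front slash form with tr(1). A minimal sketch; 'to_front_slashes' is a hypothetical helper name.

```shell
#!/bin/sh
# Convert backslashes in a pathname to front slashes.
#   $1 = pathname, e.g. \\tahoe\jobs as shown by NET USE
to_front_slashes() {
    printf '%s\n' "$1" | tr '\\' '/'
}

to_front_slashes '\\tahoe\jobs'
```

This prints //tahoe/jobs, the form that works on both Windows and unix once the mounts or symbolic links described above are in place.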

    Only Microsoft's own tools have problems with frontslashes, namely:

    • Explorer
    • Desktop folder browser
    • Many DOS commands such as COPY and DIR, which use front slashes as argument flags.

    In all other cases, frontslashes work fine, esp. in third party software, like CSH, Perl, Maya, Shake, Renderman, Rush, and all programs written in C or C++. Programmers would have to go out of their way to misinterpret front slashes.

    Here are some ideas on how to configure Mac OSX to create static mounts on boot, and how to set up a small Mac network.

   Under Windows, why does irush say 'you are not owner!' when I /am/ the owner?  

    If your Windows "User name" contains spaces, this caused a problem in Rush versions 102.41 and older, eg:

    	User name: Joe User            <-- BAD
    	Full name: Joe User            OK
    	

    In this case, the user's jobs would submit as simply "joe", and since "joe" != "joe user", Rush would complain:

    	you're not the owner! (use -fu to override)
    	

    ..and the user would have to enable the FU button in irush before they could do the operation again.

    This is fixed in Rush 102.41a and up.

    Workaround for 102.41 and older: Change your "User name" to not contain spaces (it's OK if the "Full name" has spaces). So for instance, this is OK:

    	User name: joe                 OK
    	Full name: Joe User            OK
    	

   Maya renders sometimes fail with 'Failed creating directory: /Alias/maya'?  

Description

    The typical message from Maya in the frame log:

        [..]
        Executing: maya -batch -render -verbose 1 -proj //server/your/project -s 1 -e 1 ..
        *** Fatal Error: Failed creating directory: /Alias/maya
        Please check for sufficient disk space and necessary write permissions.
        [..]
        

    This error is misleading, because it would appear it's trying to create files in the root directory (/Alias/maya) which makes no sense, and is not really what's happening.

Cause

    The cause of this error is either:

      1) The user's home directory doesn't exist or is unreadable on this machine.

      2) The user's home directory doesn't have a ~/maya/ (LINUX/IRIX), or a ~/Library/Preferences/Alias/maya/ (OSX) directory

    Maya's command line renderer won't run if the user's home directory doesn't exist and/or doesn't have the maya preferences directory. (Under Linux/Irix: the user's maya preferences directory is in ~/maya/, or on the Mac it's in ~/Library/Preferences/Alias/maya)

    To determine which machine is having the problem, look at the header at the top of the frame log from the failed render; it will tell you a) which machine the problem is on, and b) which user account the render is running as:

    
        --------------- Rush 102.42a --------------
        --      Host: jupiter     <-- Machine that has the problem
        --       Pid: 868
        --     Title: car_toon3
        --     Jobid: thor.73
        --     Frame: 0032
        --     Tries: 0
        --     Owner: fred (1025/1025)
        -- RunningAs: render      <-- User the render is running as
        --  Priority: 200
        --      Nice: 10
        --    Tmpdir: C:/TEMP/.RUSH_TMP.725
        --   LogFile: //server/jobs/..
        --   Command: perl //server/rushscripts/..
        --   Started: Mon Jan 23 11:16:16 2006..
        ------------------------------------------
        

    In this case, you would ssh(1) or rsh(1) over to host "jupiter", and become the user 'render' to check that user's home directory for problems.

Solution

    To fix #1, make sure the user's home directory exists on all the render nodes, and is read/writeable by the user.

    To fix #2, either:

      a) Log in as the user on the host that tried to render the frame, briefly open the Maya GUI, then exit. This will cause Maya to create the files it needs for rendering in the user's home directory.

      -- OR --

      b) Copy a working ~/maya (Linux/Irix) or ~/Library/Preferences (Mac) from another machine to the user's home directory on the render nodes, making sure the entire contents are read/writeable to the user.
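
The two causes above can be checked quickly from a shell on the failing host. This is a sketch only; 'check_maya_home' is a hypothetical helper, and the default preferences directory name follows the Linux/Irix layout given above.

```shell
#!/bin/sh
# Report the first problem found with a home directory that could stop
# Maya's command line renderer from starting.
#   $1 = the render user's home directory
#   $2 = prefs directory relative to home (optional, default "maya";
#        on the Mac it would be "Library/Preferences/Alias/maya")
# Prints nothing if all checks pass.
check_maya_home() {
    home=$1
    prefs=${2:-maya}
    if [ ! -d "$home" ]; then
        echo "$home: home directory does not exist"
    elif [ ! -r "$home" ] || [ ! -w "$home" ]; then
        echo "$home: home directory is not read/writeable"
    elif [ ! -d "$home/$prefs" ]; then
        echo "$home/$prefs: maya preferences directory is missing"
    fi
}

# Usage sketch: run as the 'RunningAs' user on the failing host:
#   check_maya_home "$HOME"
```

Run it as the user shown in the RunningAs line of the frame log, since the read/write tests depend on who is asking.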

   Maya mentalray renders sometimes fail with "Error: (Mayatomr) : could not get a license"?  

Description

    The typical message from Maya in the frame log:

        [..]
        Executing: Render -r mr -proj //server/jobs/somejob -s 1 -e 5 -b 1 [..]
        [..]
        Error: (Mayatomr) : could not get a license
        mental ray: wallclock  0:00:03.06 total
        mental ray: CPU user  0:00:00.73 total
        mental ray: allocated 1 MB, max resident 2 MB, RSS 0 MB
    
        --- LICENSE ERROR: Encountered: Error: (Mayatomr) : could not get a license
        --- FAILED
        [..]
        

    This means the Mayatomr plugin was unable to check out a license, probably because you are rendering with the Mayatomr plugin on more machines than you have licenses for, or there's some kind of problem with your Alias license file in the /var/flexlm or c:/flexlm directory.

Cause

    The cause of this error is usually one of:

      1) A bad Alias license file on one (or more) of the machines.
      2) A permission problem on the license file or flexlm directory (ie. not 'readable' by all)
      3) You are rendering on more machines than you have licensed for Mayatomr

    If the problem is specific to one machine, make sure that machine's Alias license file is correctly installed, has 'open' perms such that all users have read permission, and no other license files are usurping it.

Solution

    You should really talk to Alias Support about licensing problems, but there are a few things you can check yourself.

    First check that the license files have read permissions for everyone under unix. For example, this situation would be bad:

        tower:/var/root root# ls -la /var/flexlm
        total 24
        drwxrwxrwx    4 Tomas  Tomas   136 May 11 17:45 .
        drwxr-xr-x   25 root   wheel   850 May 12 12:05 ..
        -rwx------    1 Tomas  Tomas   174 Nov 30 22:08 aw.dat
        ^^^^^^^^^^
        BAD: Only user "Tomas" can use maya, no one else!
        

    You can run these two 'chmod' commands as root on each machine to fix the permissions so that *all users* have read access to the license file(s):

            chmod -R a+r /var/flexlm      # adds read permission for all files/dirs under /var/flexlm
            chmod a+x /var/flexlm         # adds 'x' permission to the /var/flexlm dir
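
To spot the problem across many machines before (or after) running the chmod commands, find(1) can list license files that lack read permission for 'other'. A minimal sketch; 'not_world_readable' is a hypothetical helper name, and it checks regular files only, not the directory itself.

```shell
#!/bin/sh
# List regular files under a directory that lack read permission for 'other'.
#   $1 = directory to scan, e.g. /var/flexlm
not_world_readable() {
    find "$1" -type f ! -perm -004 -print
}

# Usage sketch:
#   not_world_readable /var/flexlm
```

No output means every file found is readable by all users; in the 'bad' listing above, aw.dat would be reported.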
        

    Under windows, use the similar Microsoft CACLS command (or you can use the GUI) to open up the ACLs to ensure the dir and files have 'read' permission for 'everyone'. (You're on your own with the CACLS command.)

    Also, check that your Alias license files are consistent on all machines, and verify they are correct by visually inspecting their contents.

    Talk to Alias to find out how many licenses of Mayatomr you need, and show them the current license files you have.

    Until you can fix the problem, you can limit your jobs to only use machines that work by making a +hostgroup that contains only the hostnames that work, or if limited by number, only put that many hosts in the +hostgroup, then submit your mental ray jobs to only that +hostgroup.

   My renders fail on linux with 'cannot restore segment prot after reloc: Permission denied'?  

Description

    You may see this error in your render logs, e.g.:

        [..]
        Starting "/usr/autodesk/maya8.5/bin/maya"
        /usr/autodesk/maya8.5/bin/maya.bin: error while loading shared libraries:
        /usr/autodesk/maya8.5/lib/libHumanIKShared.so: cannot restore segment prot after reloc: Permission denied
        // Maya exited with status 127
        [..]
        

Cause

    A permission error, most likely due to 'SELinux' being enabled on your linux system.

Workaround

    Turn off SELinux, and retry the render.

    On Redhat/Fedora systems, you can usually easily turn off selinux by running the following command as root:

      setenforce 0

    For more info on SELinux, see 'man selinux' and 'man setenforce'.

    This is a common error on google; for more info, search on google for 'segment prot after reloc: Permission denied'.

   Our nodes intermittently lose mounts, causing frames to fail with log write errors.. is there a way to detect and auto-offline the machine?  
If a render node loses the mount to the file server, it can't write to a render log or run the render script.

In Rush 103.00 and up, the rush/etc/mountcheck script can be configured to run to check for such conditions; when enabled the script is executed just before Rush opens the log for each rendering frame.

The script can check if e.g. mounts are in place, and if not, can either attempt to fix them, or offline the local machine and requeue the renders elsewhere.

See docs for the mountcheck file and how to configure it with the mountcheck_cmd in the rush.conf file.
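
The kind of test such a script might perform can be sketched in a few lines of Bourne shell. This shows only the check itself; how the result is reported back to Rush, and how to offline the machine, is covered by the mountcheck docs and is not assumed here. 'dir_usable' is a hypothetical helper name.

```shell
#!/bin/sh
# Return 0 if a directory looks usable for writing logs, non-zero otherwise.
#   $1 = directory that should be on the file server
dir_usable() {
    [ -d "$1" ] && [ -w "$1" ] && ls "$1" > /dev/null 2>&1
}

# Usage sketch (the path is only an example):
#   if dir_usable /tahoe/jobs; then echo "mount ok"; else echo "mount missing"; fi
```

Note that a hung network mount can make even a simple 'ls' block, so a production check may also need a timeout.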

   After reboot, Rush starts renders running too quickly, eg. before the machine is ready for rendering.. is there a script to prevent this?  
On all versions of Rush on Linux + OSX, you can add code to the rush boot script to do any checks you want to see if the machine is ready before rush is started.

On Windows, unfortunately there's no way to intercept starting the Rushd service with a script.

New in 103.00, both Unix and Windows machines now have an optional 'rush/etc/bootcheck' script that can be used to run any 'check commands' you want to verify if the machine is running correctly. When configured, this script is run as soon as the daemon starts up, but before it does any initialization. See the bootcheck docs for more info.