Rush - rush/etc/mountcheck File

Rush Render Queue - The rush/etc/mountcheck file
V 103.07 05/28/15
(C) Copyright 2008, 2015 Seriss Corporation. All rights reserved.
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.

mountcheck Script

Use this script if you're having network instabilities where, for example, render nodes intermittently lose access to the file server and become unusable for rendering.

Normally one would check for such problems in the render script itself. But if the render script lives on a file server that isn't mounted, it can't be accessed, so it isn't available to do the checking.

This is where the mountcheck script is useful: since it's located locally on each machine (in the rush/etc directory), it's always available to test file system mounts and either try to fix them, or offline the local machine if need be and tell the job to requeue the frame elsewhere by returning exit(1).

Common goals this script might want to achieve:

Check if the file server's mounts are accessible.
If unavailable, try to fix the problem and return an exit code of '0' if successful; otherwise, disable the machine from rendering jobs with 'rush -an' or 'rush -offline' to prevent further rendering. You can optionally notify the user with 'rush -exitnotes', and return an exit code of '1' to retry the frame elsewhere.

Check for missing infrastructure that all renders depend on.
For example, licensing, plug-ins, fonts, or anything else you know all your renders depend on for correct operation that might go missing for one reason or another.
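As a rough illustration of both goals, here is a minimal Python sketch, assuming the mountcheck_cmd line in rush.conf is changed to run python instead of perl. The paths in REQUIRED_PATHS are hypothetical placeholders, and `missing_paths()` and `main()` are illustrative names, not part of Rush.

```python
import os

# Hypothetical paths: replace with your own file server mounts and the
# licensing, plug-in and font locations all of your renders depend on.
REQUIRED_PATHS = [
    "/net/server/jobs",       # file server mount (hypothetical)
    "/usr/local/licenses",    # license files (hypothetical)
    "/usr/local/plugins",     # render plug-ins (hypothetical)
]

def missing_paths(paths):
    """Return the subset of 'paths' that can't currently be accessed."""
    # os.path.exists() returns False for missing paths, and also when
    # the path can't be stat'ed (e.g. permission errors).
    return [path for path in paths if not os.path.exists(path)]

def main():
    """Return 0 if everything is reachable, else print one line and return 1."""
    missing = missing_paths(REQUIRED_PATHS)
    if missing:
        print("mountcheck: cannot access: " + ", ".join(missing))
        return 1
    return 0   # success: print nothing
```

An actual script would end with `sys.exit(main())` so Rush sees the exit code. Note that on a hung network mount, os.path.exists() may block rather than return False.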

This script is enabled by uncommenting these lines in the rush.conf file:

Mountcheck Configuration in rush.conf


    # MOUNT CHECK COMMAND
    #	Command run before each frame to verify mounts.
    #	Uncomment + modify example scripts as needed.
    #
    #os=windows mountcheck_cmd "perl c:/rush/etc/mountcheck"
    #os=unix    mountcheck_cmd "perl /usr/local/rush/etc/mountcheck"

NOTE: You can rewrite this script in python, or any other language you prefer, as long as the language is installed on your machines. Just change 'perl' in the above commands to 'python', 'sh', or whatever you prefer, and rewrite the script in that scripting language.

If you only want mountcheck to run on one particular machine, use a host= prefix to target that machine (or a +hostgroup of machines), e.g.


      host=tahoe      mountcheck_cmd "perl /usr/local/rush/etc/mountcheck"

When enabled, this script runs just before each render is started, and should at minimum return one of the following exit codes to indicate whether the machine is OK to render frames:

Mountcheck Exit Codes
`exit 0`	Machine is OK, the render is started.
`exit 1`	Machine is NOT OK, the render is NOT started: the frame is requeued, and its TRY count increments by 1.
`exit n`	Other exit codes are reserved for future use.
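A sketch of how a Python mountcheck might map its checks onto these exit codes, assuming each check is a hypothetical callable that returns True or False. Catching exceptions keeps the output to one line and guarantees that only 0 or 1 is returned.

```python
EXIT_OK = 0        # machine is OK: the render is started
EXIT_REQUEUE = 1   # machine is NOT OK: frame requeued, TRY count +1
# Other exit codes are reserved for future use, so never return one.

def run_checks(checks):
    """Run each check (a callable returning True/False); map to an exit code.

    An unexpected exception counts as a failure and is reported in one
    line instead of a multi-line traceback in rushd.log.
    """
    for check in checks:
        try:
            if not check():
                return EXIT_REQUEUE
        except Exception as err:
            name = getattr(check, "__name__", "check")
            print("mountcheck: %s failed: %s" % (name, err))
            return EXIT_REQUEUE
    return EXIT_OK
```

A Python script that dies with an uncaught exception also exits with status 1, but it prints a full traceback into rushd.log; catching the exception keeps the log terse.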


This script must exist LOCALLY on each machine so that it's available even when file servers are down or unavailable. Since the script lives in rush/etc, it can be pushed to all machines with 'rush -push mountcheck +any'.

When enabled as shown above, this script will be executed before the render script starts, and before the logfile for the render is created. Any output this script generates is appended to the machine's local rushd.log. So only print messages if there's an error, and keep messages terse. Include time stamps for clarity, as shown in the example script.
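For example, a small hypothetical Python helper that writes a single time-stamped line, meant to be called only when a check fails. The time-stamp format is an arbitrary choice, not the format rushd itself uses.

```python
import sys
import time

def log_error(message, stream=None):
    """Write one terse, time-stamped line; call this only when a check fails."""
    stream = sys.stdout if stream is None else stream
    # Arbitrary format, e.g. "05/28/15 14:03:22 mountcheck: <message>"
    stamp = time.strftime("%m/%d/%y %H:%M:%S")
    stream.write("%s mountcheck: %s\n" % (stamp, message))
    # Flush so the line is written out immediately.
    stream.flush()
```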

This script runs as the user the renders run as, and inherits all the RUSH environment variables the user's render script would have (e.g. RUSH_JOBID, RUSH_LOGFILE, etc). If you need the script to do anything as root, you'll need to use su(1), setuid scripts, or some other technique to escalate permissions.
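Because those variables are inherited, error messages can be tagged with the job and frame that triggered them. This sketch assumes RUSH_JOBID and RUSH_FRAME are set as described in this section, and falls back to '-' when the script is run by hand for testing; `job_context()` is a hypothetical helper.

```python
import os

def job_context(environ=None):
    """Return a short 'job=... frame=...' tag for error messages."""
    env = os.environ if environ is None else environ
    # Fall back to '-' so the script still works when run by hand,
    # outside of Rush, for testing.
    jobid = env.get("RUSH_JOBID", "-")
    frame = env.get("RUSH_FRAME", "-")
    return "job=%s frame=%s" % (jobid, frame)
```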

WARNING: Keep the contents of this script simple. See 'Caveats' below to prevent creating worse problems for your network.

Caveats

It's important this script live LOCALLY ON EACH MACHINE, so it can be executed even if the file server is inaccessible (a common problem you'll want to detect).

Any output from the script will be appended to the rushd.log file unless redirected elsewhere. Include date stamps in your messages so that one can determine when the messages were generated by reviewing the log file.

To avoid excessive log output, the script should print nothing on success and only print messages when errors occur, so the logs aren't flooded every time a frame renders.

Keep the execution flow of the script as simple as possible when no error conditions are present. DO NOT USE high latency TCP rush commands like 'rush -lf' or 'rush -lj'. This script may run hundreds of times per second on a large network, so you don't want to storm the render queue's job server.

This script's execution time is included in every user's frame elapsed times, so KEEP THE OVERHEAD OF THIS SCRIPT SMALL to avoid impacting render times, especially when no errors are present.
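One way to keep a failing mount from stalling every frame is to bound how long a check may take. The Python sketch below, with a hypothetical `path_responds()` helper and an assumed 2-second default, runs os.stat() in a daemon thread and gives up after the timeout. Whether a timeout suits your mounts is an assumption. On some systems, a thread stuck in the kernel on a hard network mount can still delay process exit.

```python
import os
import threading

def path_responds(path, timeout=2.0):
    """Return True if os.stat(path) succeeds within 'timeout' seconds."""
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: if stat() hangs on a dead mount, Python won't wait
    # for this thread when the script exits.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    # No result yet means the mount didn't answer in time: treat as failure.
    return result.get("ok", False)
```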

Because the script lives in $RUSH_DIR/etc/mountcheck, it can easily be distributed to all machines using 'rush -push', e.g.:
```
    rush -push mountcheck +any           # push the mountcheck script to all machines
```

Note that this script runs as the same user the renders run as, so if you need commands to run as root, you'll need to use sudo(1), setuid apps, or similar techniques.

All environment variables the user's render script would normally have will be available to this script as well (e.g. RUSH_FRAME, RUSH_JOBID, RUSH_LOGFILE, etc).

Keep in mind this script runs on all machines before each user's render, so something as simple as rebooting the file server could cause it to detect a 'failure condition' on all nodes within a very short time. So if your script sends emails when errors are detected, for example, add a throttle so that it doesn't send emails continuously and DoS your mail server.
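A minimal throttle sketch, assuming a local stamp file whose modification time records when the last notification went out. The stamp path is up to you (nothing in Rush defines one), `should_notify()` is hypothetical, and sending the email itself is left out.

```python
import os
import time

def should_notify(stamp_file, min_interval=3600.0, now=None):
    """Return True at most once per 'min_interval' seconds on this machine."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_file)
    except OSError:
        last = None              # no stamp file yet: never notified
    if last is not None and now - last < min_interval:
        return False             # notified recently: stay quiet
    # Create the stamp file if needed and set its mtime to 'now'.
    with open(stamp_file, "a"):
        pass
    os.utime(stamp_file, (now, now))
    return True
```

Two instances can race on the stamp file and both return True, so combine it with a lock if that matters. The throttle is also per machine, so a farm-wide outage can still produce one message per node per interval.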

More than one instance of this script might run at the same time on a single machine: if you've configured the rush/etc/hosts file with a CPUS value larger than 1, many instances of this script may be running AT THE SAME TIME. So if you intend to invoke operations that must not run concurrently (like re-mounting drives), protect them with locks so that only one instance runs them at a time.
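A minimal Unix-only locking sketch using fcntl.flock() with LOCK_NB, so a second instance skips the operation instead of waiting. `try_lock()` and the lock file path are hypothetical. On Windows, fcntl isn't available, and msvcrt.locking() would be one alternative.

```python
import fcntl

def try_lock(lock_path):
    """Try to take an exclusive lock on 'lock_path' without waiting (Unix only).

    Returns the open file object holding the lock, or None if another
    process (e.g. another mountcheck instance) already holds it.
    Closing the returned file releases the lock.
    """
    fh = open(lock_path, "w")
    try:
        # LOCK_NB: fail immediately instead of blocking if the lock is taken.
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (BlockingIOError, PermissionError):
        fh.close()
        return None
    return fh
```

One possible use: call try_lock() on a local file such as a hypothetical /var/tmp/mountcheck.lock. If it returns None, another instance is already handling the remount, so one option is to return 1 and requeue the frame. Otherwise, do the remount and then close the returned file to release the lock.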

This script can be written in any language; it doesn't have to be perl. If written with cross-platform execution in mind, one script can manage all your platforms.
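For instance, a single Python script can choose which mounts to check based on sys.platform. The mapping below uses hypothetical paths and a hypothetical `mounts_for()` helper; sys.platform values such as 'linux', 'darwin' and 'win32' are standard.

```python
import sys

# Hypothetical mount points per platform; replace with your own.
MOUNTS_BY_PLATFORM = {
    "win32":  ["//server/jobs", "z:/"],
    "darwin": ["/Volumes/jobs"],
    "linux":  ["/net/server/jobs"],
}

def mounts_for(platform=None):
    """Return the list of paths to check on this platform."""
    platform = sys.platform if platform is None else platform
    # Match on the prefix, since older Pythons reported 'linux2'.
    for prefix, paths in MOUNTS_BY_PLATFORM.items():
        if platform.startswith(prefix):
            return paths
    return []   # unknown platform: nothing to check
```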

Code this script carefully; it's easy to make a simple problem worse by adding commands that DoS servers, sending too many emails when something goes wrong, or otherwise trying 'too hard' to fix a problem.

When there are no intermittent problems on your network, it's best to disable this script completely to avoid unnecessary overhead for rendering.


See Also