Rush Render Queue - Admin FAQ
V 103.08-pre 07/17/15
(C) Copyright 2008, 2015 Seriss Corporation. All rights reserved.
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.

Strikeout text indicates features not yet implemented


Systems Administrator Frequently Asked Questions


   I'm getting 'connect(myhost): Invalid argument'
in my linux rushd.log?
  

   I'm getting 'Reverse lookup for 'myhost' returned 127.0.0.1 (loopback address)'?  

    Your host is coming back with 127.0.0.1 as its ip address, instead of the network interface's actual address. For rush to operate on a network correctly, it must use the network IP for your machine, not the loopback address.

    You'll want to fix your machine's /etc/hosts file or hostname lookup system so that the machine's hostname resolves to the network address instead.

      NOTE: In some very rare cases, you may actually want rush to use the loopback address.
      If so, configure the rush.conf 'allowloopback 1' command and rush will allow 127.x.x.x addresses to be used.

    Having the machine name resolve to the loopback address is a known issue with most, if not all of the 'Redhat Linux' based installers, which use this as a default so that their daemons can work OK without a network setup. But this is NOT something you want when a machine is on a network.

    To test for the problem, open a shell on that machine and ping its own hostname. If the address returned is 127.x.x.x, that is the problem:

      Reverse Lookup -- Problem
      
          you@tahoe % ping tahoe
          PING tahoe (127.0.0.1) from 127.0.0.1 : 56(84) bytes of data.  
          64 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=0.5 ms
          64 bytes from 127.0.0.1: icmp_seq=1 ttl=255 time=0.4 ms
          ^C
      
          you@tahoe % cat /etc/hosts
          127.0.0.1 tahoe localhost           <-- This is the problem
          

    ..the /etc/hosts entry is wrong. Correct it by making separate entries for 'localhost' and the machine's hostname and IP address, eg:

      Reverse Lookup -- Solution
      
          root@tahoe # cat /etc/hosts    _
          127.0.0.1      localhost        |__ CORRECT
          192.168.0.37   tahoe           _|
      
          root@tahoe # ping tahoe
          PING tahoe (192.168.0.37) from 192.168.0.37 : 56(84) bytes of data.  
          64 bytes from 192.168.0.37: icmp_seq=0 ttl=255 time=0.5 ms
          64 bytes from 192.168.0.37: icmp_seq=1 ttl=255 time=0.5 ms
          

    BTW, you shouldn't need domain names for the official names of your machines in /etc/hosts; short names should work correctly if you have the appropriate "search" lines in your /etc/resolv.conf file, e.g.:

      Specifying Your Domain in /etc/resolv.conf
      
          root@tahoe # cat /etc/resolv.conf
          nameserver 192.168.0.14
          search yoyodyne.x                                                    
          

    Fedora 8 Update
    In some more recent versions of linux that support IPV6, you may still get 127.0.0.1 when you 'ping' the host's own hostname, even after fixing the 127.0.0.1 line. A possible cause is the new "::1" (IPV6) line the default Linux config might hard code into your /etc/hosts file. This is essentially equivalent to the 127.0.0.1 line, even if you have IPV6 turned off. A typical situation would be:

      Problem: IPV6 ''::1'' Entries
          root@tahoe # cat /etc/hosts
          127.0.0.1      localhost.localdomain localhost tahoe    <-- BAD     
          ::1            localhost.localdomain localhost tahoe    <-- BAD
          192.168.0.37   tahoe
          

    Take the host's name out of that entry too, as that is essentially the same as the 127.0.0.1 line. So the corrected file will look like:

      Solution: IPV6 ''::1'' Entries
      
          root@tahoe # cat /etc/hosts
          127.0.0.1      localhost                                             
          ::1            localhost
          192.168.0.37   tahoe    
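
    The /etc/hosts check described above can be scripted. The following is a minimal sketch and not part of rush: the function name loopback_entries is made up for illustration, and it assumes the usual hosts layout of an address followed by one or more names. It prints every line that maps the given hostname to a 127.x.x.x or ::1 address, so empty output means the file looks clean.

```shell
# Print every hosts-file line that maps a hostname to a loopback address.
# (Hypothetical helper, not shipped with rush.)
#   $1 = hostname to look for
#   $2 = hosts file to scan (defaults to /etc/hosts)
loopback_entries() {
    awk -v host="$1" '
        /^[ \t]*#/ { next }                    # skip comment lines
        $1 ~ /^127\./ || $1 == "::1" {         # loopback addresses only
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # ignore trailing comments
                if ($i == host) { print; break }
            }
        }
    ' "${2:-/etc/hosts}"
}

# Example use:
#   loopback_entries tahoe
```

    If this prints anything for the machine's own hostname, remove the hostname from those lines as shown in the solutions above.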
          

   I'm getting 'rush: connect(host): Permission denied' in www-rush?  
    This is probably a linux machine with selinux configured.

    Unless you know how to configure selinux for your needs, simply disable selinux.

    In the case of 'www-rush' giving "Permission denied" errors, selinux is probably preventing cgi-bin scripts from opening new network connections.

    Disabling selinux varies from distro to distro, but it sometimes just involves editing the /etc/selinux/config file and rebooting.
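
    To see what the config file currently asks for, something like the sketch below may help. It assumes the common layout where /etc/selinux/config holds a line of the form SELINUX=enforcing, SELINUX=permissive or SELINUX=disabled; the function name is made up for illustration. On systems that have it, the 'getenforce' command reports the mode that is actually in effect.

```shell
# Print the SELINUX= mode (enforcing, permissive or disabled) from a config file.
# (Hypothetical helper; assumes the usual KEY=value layout of the file.)
#   $1 = config file to read (defaults to /etc/selinux/config)
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/selinux/config}"
}

# Example use:
#   selinux_config_mode
```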

   I'm getting 'rresvport(): Permission denied', what's that mean?  
    Usually one gets this error in the context of running 'rush' from the command line:

      rresvport(): Permission denied -- Problem
      
          % rush -ping
          tahoe: rush: rresvport(): Permission denied  
          

    Rush uses a reserved port to communicate with the daemon, and therefore needs to run SUID root.

    Make sure the SUID bit is on for the rush(1) binary, and the owner is root:

       rresvport(): Permission denied -- Solution 
      
          # chown 0:0  /usr/local/rush/bin/{rush,rushd}    
          # chmod 4755 /usr/local/rush/bin/{rush,rushd}
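
    A quick way to verify the result is sketched below. This is an illustration only (check_suid_root is a made-up name): it reports whether a file has the setuid bit and whether its numeric owner is 0.

```shell
# Report whether a binary is set up the way rush needs it: setuid, owned by uid 0.
# (Hypothetical helper, not shipped with rush.)
#   $1 = path to the binary
check_suid_root() {
    if [ ! -u "$1" ]; then
        echo "$1: not setuid"
    elif [ "$(ls -ln "$1" | awk '{print $3}')" != "0" ]; then
        echo "$1: not owned by root"
    else
        echo "$1: ok"
    fi
}

# Example use:
#   check_suid_root /usr/local/rush/bin/rush
#   check_suid_root /usr/local/rush/bin/rushd
```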
          

    This problem is similar to this problem.

   What does 'bind(): Address already in use' mean?  
    It usually means some other daemon is using rush's tcp or udp port. Usually it comes down to one of these things:

    1. This is an SGI, and the kernel's NFS is using the port
    2. This is Unix (SGI/Linux/OSX), and something else is using the port (nfslockd, ypbind, mountd, rpc.statd, etc).
    3. Two or more rushd's are running. (Not likely in 102.31+)
    4. You recently stopped and restarted the daemon. The problem goes away by itself. (Not likely in 102.31+)

    #1 often occurs if you've just installed rush on an SGI for the first time, and the machine has been up for a while. 'netstat -an' will show a whole slew of UDP listeners on ports between 512 and 1024 all in sequence, one of them being port number 696, the one rush has been assigned by IANA. Some rogue kernel utility is causing this, probably NFS. Usually fuser(1) or lsof(1) shows no process associated with the rogue UDP listeners because it's a kernel process. The easiest solution is to simply reboot; when rush starts on boot, it always secures the port it needs well before the kernel gets a chance to step on it.

    #2 Stop the rush daemon, and use 'netstat -an' to see if some other program is using rush's port (normally port #696; see your rush.conf file's serverport setting, in case your site uses a different port number). Look for open UDP or TCP connections on that port, in either the Local or Foreign address.

    If you see port #696 in the 'Foreign' address of the local machine, suspect hung clients on the remotes:

    • rsh over to the remote machine (ie. 'Foreign' host)
    • Kill any 'rush' client processes you see, eg. 'killall rush'
    • Back on the local machine, do a 'netstat -an' to verify the connections are gone or closing.
    • Restart the daemon once all 696 ports have closed

    If the local TCP or UDP port is in use, suspect that some system daemon is using the port when it shouldn't. Use fuser(1) (eg. "fuser -vn udp 696"), lsof(1) (eg. "lsof -i UDP:696" or "lsof -i TCP:696"), or a similar utility to figure out which process is using the port. Then either disable that process, or change the boot order so that rushd starts before it and secures the port first, and reboot. NOTE: You must be root to run 'lsof', otherwise you'll get an empty or incomplete report!

    If fuser(1) shows no process and it's an SGI, then see #1..

    #3 only happens in older versions of rush (pre-102.31) where more than one rush might be running. Newer versions of rush use a lock file that prevents this.

    Only one daemon should have a PPID of 1 (Parent Process ID). If there's more than one with a PPID of 1, kill the one(s) with the higher PID.

    #4 is common only in the older versions of rush (pre-102.31), and occurs when you stop/start the rushd daemon. This problem fixes itself within 2 minutes automatically. The OS often keeps recently closed TCP listeners unavailable to other processes for a 90 second period. Rush will keep retrying to bind to the port, and eventually succeeds within 2 minutes.
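
    The 'netstat -an' check from #2 can be narrowed down with a small filter. This is a sketch under assumptions: port_lines is a made-up name, and it assumes the local and foreign addresses are the 4th and 5th columns, written as either "addr:port" or "addr.port", which is what most netstat implementations print for TCP and UDP lines.

```shell
# Filter 'netstat -an' output (read from stdin) down to the lines whose local
# or foreign address ends in the given port.
# (Hypothetical helper; assumes addresses are in columns 4 and 5.)
#   $1 = port number (defaults to 696, rush's default port)
port_lines() {
    awk -v port="${1:-696}" '
        BEGIN { re = "[:.]" port "$" }         # matches ":696" or ".696"
        $4 ~ re || $5 ~ re
    '
}

# Example use:
#   netstat -an | port_lines 696
```

    If your rush.conf sets a different serverport, pass that number instead of 696.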

   On submit, I get 'rush: LogDir '//foo/bar': No such file or directory'
or 'ERROR: can't create log: //foo/bar': Access is denied.
  
    A Microsoft Windows specific problem; the typical complaint being 'but the directory does exist!'.

    This is encountered only on windows machines, and can be caused by:

    1. The directory is inaccessible to the user the Rushd service is running as
    2. The password expired for the user the Rushd service is running as
    3. A trailing slash is specified in the directory name (102.40e or older)

    Regarding #1, when you submit a job, it is the RUSHD service that checks whether it can write to the log directory. It does this test running as the user the RUSHD service is configured to run as (Control Panel | Administrative Tasks | Services | Rushd | Log On As). If the daemon can't access the directory, it can sometimes respond "directory does not exist" when it really means 'Access is denied'. Make sure the permissions on the directory are open (if the server is a Samba server, use 'chmod 777' on the directory at the server side).

    Also, verify the user the RUSHD service is running as is able to access the file server;

    • Login as the user you have the RUSHD service configured to run as
    • Browse to the log directory from a browser, like 'My Computer', 'Network Neighborhood', 'My Network Places', etc.

    ..while doing these steps, see if you are prompted for this dialog:

    If so, that's the problem. You shouldn't be seeing this dialog; it's an error message indicating the current user does not have access to the server. Even though you can click past it, RUSHD can't. Fix your server config so this dialog doesn't show up.

    When RUSHD tries to access the disk server, it has no way to supply a password to that dialog; the RUSHD service must be running as a user that has access to the disk server without getting that prompt. There are also other ways to test for this, using either runas or telnet to simulate the rushd service's access to the drive.

    Regarding #2, verify the Rushd user's password is set to never expire.

    Regarding #3, this is for older versions of rush only. In versions predating Rush 102.40f, if a trailing slash is specified on the directory name, it would complain it can't find the directory. Remove the trailing slash from your "Log Dir" prompt after you browse to the log directory. As of 102.40f, rush strips the trailing slash before doing the test. The need to do this is due to a bug in Microsoft's 'access(2)' and 'stat(2)' functions not being able to handle trailing slashes on directory names.

   On submit, I get 'rush: 'rush -submit': user@somehost: LogDir '//fileserver/share/SHOW/SHOT/myfiles/foo.ma.log': Logon failure: unknown user name or bad password.'  
    This is a Microsoft error regarding authentication for the account 'user' between the client 'somehost' and the 'fileserver'. This is usually caused by:

    1. The 'user' the Rushd service is configured to run as has a password that has either expired or has been changed

    2. There's a problem with authentication between 'fileserver' and 'somehost' for the 'user' account

    The error message is a problem encountered while accessing the fileserver; when a job is submitted, rush first checks to see that it can access the job's log directory (LogDir). In this case it failed due to authentication.

    The error message gives us three important bits of information to track the problem down: the "user@somehost" tells us the user and machine where the error occurred, and the pathname "//fileserver/share/SHOW/SHOT.." tells us the path that couldn't be accessed. So we know to go over to machine 'somehost', login as 'user', and then try to access the path "//fileserver/share/SHOW/SHOT..", either from DOS with a 'DIR' command, or through a file browser.

    If you can login and access the share, then check to make sure the password you supplied when logging in as 'user' is the same password the Rushd service is actually configured to use. (eg. Control Panel -> Admin Tools -> Services -> Rushd -> Log On As..).

    See also this related problem and solutions.

   What's the best way to verify all the daemons are running?  
    Use:

      rush -ping +any

    This 'pings' all the daemons in the $RUSH_DIR/etc/hosts file with a TCP message.

    If the daemon isn't running, tail(1) the daemon's log file in $RUSH_DIR/var/rushd.log.

   How do I stop/start the daemons? (Unix/NT)  

    Stop/Start Rushd
    Irix               /etc/init.d/rush stop
                       /etc/init.d/rush start
    Linux/RedHat 6.x   /etc/rc.d/init.d/rush stop
                       /etc/rc.d/init.d/rush start
    Windows            net stop rushd
                       net start rushd
    Mac OSX            /usr/local/rush/etc/S99rush stop
                       /usr/local/rush/etc/S99rush start

All daemons can be stopped as follows:

    rush -dexit +any -t 5

(Note: the '-t 5' sets the TCP timeout to 5 seconds per host, preventing the command from getting 'stuck' on dead or hung machines.)

   Is there an example boot script I can use to invoke rush?  

   Is there a way to enable/disable rush automatically
at certain times of day?
  
    Yes, you can use the operating system's task scheduler to invoke the rush -online and rush -offline commands at any times you want.

    Under Unix, use crontab(1), or under Windows use the task scheduler:
    Start | Accessories | System Tools | Scheduled Tasks.

    Usually it's best to set up one administration machine to control the online/offline schedule for all the others, making administration simple. If you have a mixed network of windows and unix machines, you can still use one machine to control all of them.

    Example: under Unix if you want all the machines in the '+work' group (workstations) to be online m-f 7pm-8am and all day sat/sun, this crontab on the chosen 'administration machine' should do it:

	1) Invoke 'crontab -e'
	2) Type in these three lines:

# min hour dom mon dow       command
  0   19   *   *   1,2,3,4,5 /usr/local/rush/bin/rush -online  -msg "Online m-f 7pm-8am, all sat+sun" +work 2>&1 | /usr/bin/logger
  0   8    *   *   1,2,3,4,5 /usr/local/rush/bin/rush -offline -msg "Offline during working hours"    +work 2>&1 | /usr/bin/logger

	3) Save and exit.
    

    Change the -msg text as you like; that message will show up in the irush 'All Cpus' report. Having the output piped through logger(1) lets you track problems in the /var/log/messages (redhat) or /var/log/system.log (OSX).

    For more info, see 'man -a crontab' and 'man logger'.
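
    One thing cron won't do is catch up after a reboot: a machine that boots at 9pm has missed the 7pm 'online' entry. The sketch below shows the schedule's arithmetic as a shell test that a boot script could use to pick the right state. It is an illustration only; in_online_window is a made-up name, and the window it encodes is simply the one from the crontab above.

```shell
# Succeed (exit 0) if the given day and hour fall inside the online window
# used in the crontab above: all day Sat/Sun, and 7pm-8am Mon-Fri.
# (Hypothetical helper, not shipped with rush.)
#   $1 = day of week as printed by 'date +%u' (1=Mon .. 7=Sun)
#   $2 = hour as printed by 'date +%H' (00 .. 23)
in_online_window() {
    dow=$1
    hour=${2#0}                     # strip a leading zero (08 -> 8)
    [ "$dow" -ge 6 ] && return 0    # Saturday or Sunday
    [ "$hour" -ge 19 ] || [ "$hour" -lt 8 ]
}

# Example use:
#   if in_online_window "$(date +%u)" "$(date +%H)"; then
#       /usr/local/rush/bin/rush -online  -msg "Online m-f 7pm-8am, all sat+sun" +work
#   else
#       /usr/local/rush/bin/rush -offline -msg "Offline during working hours" +work
#   fi
```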

    If you want, you can use rush -getoff instead of 'rush -offline' to /kill/ off renders at exactly 8am, so that long renders aren't still running. Or, just let the users manage offlining the machines in the morning using 'onrush' when they come in; then they can choose 'offline' or 'getoff' as needed, and users that are sick that day will have their machines remain online throughout the day.

   Is there a way to run 'rush -online'
automatically when someone logs out?
  
    Yes; when a user logs out of the window manager, the sysadmin can configure the following files to run 'rush -online':

      Online At Logout
      Irix               /usr/lib/X11/xdm/Xreset
      Linux/RedHat 6.x   /etc/X11/xdm/TakeConsole
      Mac OSX            For recent versions of OSX (10.5 and up), it's as simple as running:
                             sudo defaults write com.apple.loginwindow LogoutHook /path/to/your/online-script.sh
                         For older releases (10.4 and back) see
                         this Apple documentation on "-LogoutHook".
      Windows            See below on how to use Group Policy (gpedit.msc or through Active Directory)
                         and how to make logon/logout scripts for local users.

    A literal example of what should be added to these files would be:

      Online At Logout Example
      
          /usr/local/rush/bin/rush -online
          logger -t RUSH "Rush online (user logout)"     
          

    Use of logger(1) is optional; it leaves an audit trail in the syslog. Include full path to logger(1) if security is an issue.

    For Windows, you can use Group Policy (gpedit.msc or through Active Directory). An example of this would be: open up Group Policy (snap-in through the MMC), go to:

      Local Computer Policy | User Configuration | Windows Settings | Scripts (Logon/Logoff) | Logoff

    There you can add the batch file that you want to execute upon logoff. This can be a one line batch script that invokes 'rush -online'. For details, see these Microsoft knowledgebase articles:

    You may want to redirect stdout/stderr of your 'rush -online'/offline commands to a log file so that errors can be checked for later, eg:

      Windows Online At Logout Example
      
        rush -online >> c:\temp\rush-logout-online.log 2>&1     
          

   Is there a way to run 'rush -online'
automatically when someone's screensaver pops on?
  
    There probably is, but I don't know how to do it.

    If you have any suggestions on how to do it on various platforms, please send them to:

      Greg Ercolano

   What kinds of security issues are there with rush?  
  • Be sure your rush network is firewalled from the internet. Rush is not designed to run on machines that are open to the internet.

    If you want Rush to work on machines separated by the internet (eg. a remote farm), be sure to use a VPN or other encrypted connections for all traffic passing through the internet, to prevent unwanted users from tapping into the network connections.

  • To avoid root loopholes, be sure all subdirs in the path to the setuid binaries and config files have tight permissions, e.g., if rush is installed in /usr/local/rush/bin:

    Rush Security Permissions
    
    chmod go-w /usr \
    	   /usr/local \
    	   /usr/local/rush \
    	   /usr/local/rush/bin \
    	   /usr/local/rush/bin/* \
    	   /usr/local/rush/etc \
    	   /usr/local/rush/var \
    	   /usr/local/rush/var/*
    
    chmod 4755 /usr/local/rush/bin/{rush,rushd}
    chown 0:0  /usr/local/rush/bin/{rush,rushd}
              
    

  • By default, rush uses reserved port 696 to communicate udp/tcp packets. For secure networks, make sure users do not have root access, to prevent renegade software from exploiting the port.

  • If security is an issue at your site, be sure to check ALL settings, esp. UidRange and GidRange. Also, correctly configure AdminUser. At least read about all of these before accepting the defaults.

    If you want to make changes to these settings, see the Rush Configuration File documentation for more info.

  • Rush daemons will not run jobs as a uid or gid of root. If you really want a job to run as root, make the commands the job invokes setuid, or use 'sudo'.

  • Rush daemons will only trust remote machines that are configured in its host list. Rush will log all connection attempts from machines not configured in the hosts file. Sysadmins can grep the rushd.log files for the string 'SECURITY' to detect security related problems.

  • The 'rush -push' feature helps sysadmins easily distribute the rush.conf, license.dat, hosts, and other such files. This feature can be disabled via the rush.conf file, if the feature is considered a security loophole.
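
    For the path permissions item above, a check along the following lines can list any directory in the path that is still group or world writable. It is a sketch, not part of rush: loose_path_components is a made-up name, and it only reads the mode string printed by 'ls -ld'.

```shell
# Walk from an absolute path up to /, printing every component that is
# group-writable or world-writable (ie. would be changed by 'chmod go-w').
# (Hypothetical helper, not shipped with rush.)
#   $1 = absolute path to start from, eg. /usr/local/rush/bin
loose_path_components() {
    dir=$1
    while :; do
        mode=$(ls -ld "$dir" | awk '{print $1}')
        case "$mode" in
            ?????w*|????????w*) echo "$dir" ;;   # 6th char = group w, 9th = other w
        esac
        case "$dir" in
            /|.) break ;;                        # reached the top
        esac
        dir=$(dirname "$dir")
    done
}

# Example use:
#   loose_path_components /usr/local/rush/bin
```

    No output means every component from the starting directory up to / is already closed to group and other writes.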

   How do I update changes to the rush hosts file
(or rush.conf file) to the network?
  
    Normally you should use 'rushadmin(1)' or 'rush -push hosts +any', as these tools will release across platforms easily and quickly.

    However, if the rush daemons are not yet running (e.g. you are setting up a new installation) then you must use operating system commands to copy the files to the network.

    Under unix, you should use rdist(1) or rsync(1). Some examples:

      Unix: Manually Copying Files
      
          # SEND A NEW rush.conf
          #    (These are Bourne Shell commands..)
          #
          for i in `awk '/^[a-z]/{print $1}' /usr/local/rush/etc/hosts`; do   
             rdist -c /usr/tmp/newconf ${i}:/usr/local/rush/etc/rush.conf
          done
      
          # SEND A NEW RUSH hosts FILE
          #    (These are Bourne Shell commands..)
          #
          for i in `awk '/^[a-z]/{print $1}' /usr/tmp/newhosts`; do
              rdist -c /usr/tmp/newhosts ${i}:/usr/local/rush/etc/hosts
          done
          
          

    To use rsync(1), try replacing 'rdist -c' with 'rsync -avz' in the above examples.
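
    Note that the awk command in the loops above prints the first field as-is. If your rush hosts file uses the 'hostname:interface' form for some machines (eg. tahoe:192.168.2.20), you may want to strip the ':interface' part before handing the name to rdist or rsync. A minimal sketch, assuming the name left of the ':' is the one that resolves on your network (rush_hostnames is a made-up name):

```shell
# Print one hostname per line from a rush hosts file, using the same
# /^[a-z]/ line filter as the loops above, and dropping any ':interface'
# suffix (eg. tahoe:192.168.2.20 -> tahoe).
# (Hypothetical helper, not shipped with rush.)
#   $1 = rush hosts file (defaults to /usr/local/rush/etc/hosts)
rush_hostnames() {
    awk '
        /^[a-z]/ { sub(/:.*/, "", $1); print $1 }   # keep the name left of the colon
    ' "${1:-/usr/local/rush/etc/hosts}"
}

# Example use:
#   for i in `rush_hostnames /usr/tmp/newhosts`; do
#       rdist -c /usr/tmp/newhosts ${i}:/usr/local/rush/etc/hosts
#   done
```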

    Under windows, you can use COPY. Use UNC paths to specify the remote directories, assuming their C: drives are 'shared' correctly:

      Windows: Manually Copying Files
      
          for %i in ( nt1 nt2 nt3 ) do copy c:\rush\etc\hosts \\%i\c\rush\etc   
          
          

   Is there a way to see whose jobs are bumping whom?  
    Grep the $RUSH_DIR/var/rushd.log file for BUMP messages.

   Is there a way to see who's changing other people's jobs?  
    Grep the $RUSH_DIR/var/rushd.log file for SECURITY messages.
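
    For a quick overview of both kinds of message, a small wrapper around grep can count them. This sketch assumes nothing about the log format beyond the two keywords named above; count_log_keywords is a made-up name.

```shell
# Count the rushd.log lines containing each keyword this FAQ suggests
# grepping for.
# (Hypothetical helper, not shipped with rush.)
#   $1 = log file (defaults to $RUSH_DIR/var/rushd.log)
count_log_keywords() {
    log=${1:-$RUSH_DIR/var/rushd.log}
    for word in BUMP SECURITY; do
        printf '%s %s\n' "$word" "$(grep -c "$word" "$log")"
    done
}

# Example use:
#   count_log_keywords
```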

   Can rush be told to use a different network interface,
other than the machine's hostname?
  
    Yes. In the rush hostlist, the hostname can actually be a pair of hostnames separated by a ':', e.g., tahoe:tahoe-eth. This is documented in the hosts file docs.

    The name on the left of the ':' is the hostname of the machine as users will see it (in Cpus reports and jobids), and the name (or IP address) that follows the ':' is the network interface you want rush to actually bind to.

    Example

      Let's say the machine in question is "tahoe", and there are two interfaces,
      
      	192.168.1.20
      	192.168.2.20
         
      ..if you want to have rush bind to the .2.20 address, then in the rush/etc/hosts file, change the line for that machine from:
      
          tahoe                2    512     0       +any,+win
      
      ..to instead read:
      
          tahoe:192.168.2.20   2    512     0       +any,+win
      
      Be sure to 'push' this change to all machines, as they'll all need to have it in order to communicate properly with the host, especially the license server.

      So make this change manually on the machine in question (tahoe), and manually on the license server. Once made to the license server, push the change out from there to all the other machines:

      
          rush -push hosts +any
          rush -reload hosts +any
      
      ..and that should make it take effect.

      After making these changes, restart the daemon on the machine in question (tahoe) and see if it properly binds to the port. Watch the rushd.log on that machine and the license server for errors.

   Where should I get perl for Windows?  
    It is highly recommended you use ActiveState Perl.

    It's definitely the best. Both well integrated and documented specifically for the Windows platform. Highly cross-platform compatible, with excellent Windows-specific modules and many of the standard internet modules, including Mail/FTP/NNTP, etc.

    I've personally tested and used it extensively in various production environments and have found it to be the most stable perl available.

    It's a free download.

   Where can I get rsh/telnet daemons for Windows?  

    I have personally evaled both products, and found them both useful.

    Regarding Denicomp and rsh, it lets you run simple commands on the remote machines using NT's own rsh(1) client. It supports 'rsh hostname command', but does not support 'rsh hostname'. In other words, you can't strike up an interactive session. You get a limited trial to use the software for free, then if you like it you should buy it.

    Regarding Georgia Softwork's telnet server, I have to say it's impressive what it does. You can run interactive dos applications that even do direct screen memory access, and the results will look correct on the telnet client. Compatible with unix telnet clients, as with NT clients. Unfortunately, the software is very expensive. But you get a 30 day trial to test it out.

    There is also freeware available. Most of those I've evaled have extreme limitations, or are easily broken.

    The NT Resource Kit from Microsoft also comes with a telnet server, though I've never tried it.

   Windows: is there a way to restart the rushd service as a normal user?  
    There are two ways I know of.

    I. If you're running under Win2K, you can use the new 'runas' Win2K command. It is similar to su(1) in unix; it lets you run commands as administrator. The following gives you a DOS shell with network administrator privileges regardless of who the current logged in user is:

      runas /user:YOUR_DOMAIN\Administrator cmd
      Password:

    In the DOS shell that appears, run 'net stop rushd' and 'net start rushd'.

    II. Use your domain controller's Remote Services administration software. With a Win2K server:

    1. Start | Programs | Administrative Tools | Active Directory Users And Computers
    2. Select 'Computers'
    3. From the list of computers, right click on the one to control, and choose 'Manage'
    4. Under the "Tree" tab, click "Services and Applications"
    5. Choose "Services", then choose "Rushd" and then use the usual Start/Stop controls

    There's surely something similar under WinNT Server.

   Windows: Is there a way to disable error dialogs?  
    Yes. But it is a registry tweak that affects the entire machine.

    See this Microsoft Knowledge Base Article Q124873 for more info. To paraphrase, this article basically says, along with the usual risk disclaimers regarding manual editing of the registry:

    1. Run Registry Editor (REGEDT32.EXE).

    2. From the HKEY_LOCAL_MACHINE subtree, go to the following key:
      \SYSTEM\CurrentControlSet\Control\Windows\ErrorMode

    3. Select the ErrorMode value.

    4. From the Edit menu, choose DWORD.

    5. Type 0 (zero), 1, or 2 to select the error mode. Regardless of this setting, all errors are written to the system log:

      • 0 - Error message box pops up (default).
      • 1 - No dialog for system errors only.
      • 2 - No dialog for system or other errors.

   I changed the ip addresses of a host on my network,
and now rush can't talk to it?
  
    When you change the IP address of one of the machines, the OS and rush daemons need to be told to flush their caches. Flush the low level OS caches first, so that regular 'ping [hostname]' works correctly, then update rush's.

    The lowest level cache is the ARP cache, eg. 'arp -a', which associates IP addresses to MAC addresses. When ARP is wrong, nothing works. Even 'ping [newipaddress]' may hang or fail, because they'll still be trying the old MAC address. When you change the IP of a machine, all the other machines will still have the old IP in their ARP caches. These caches usually time out after 15 minutes, so by the time you read this article, those caches probably will have flushed out. You can view the arp caches on the remote machines with 'arp -a', and you can remove old IP entries with 'arp -d', and they'll fix themselves. This should get 'ping [newipaddress]' to work correctly.

    The next higher level of cache is DNS hostname/IP caching. If these are wrong, 'ping [ipaddress]' will work, but 'ping [hostname]' will show the wrong IP. Different OS's have different ways to flush these high level caches:

    • OSX (10.1 through 10.4): lookupd -flushcache (See 'man lookupd')
    • OSX (10.5 and up): dscacheutil -flushcache (See 'man dscacheutil')
    • Windows (all releases): ipconfig /flushdns (See 'ipconfig /?')
    • Linux: Most linux distros don't have an IP cache by default, unless 'nscd' was enabled. If 'nscd' is on, you can clear its cache by restarting it, eg. /etc/init.d/nscd restart (See 'man nscd')
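
    If you administer a mixed network from scripts, the list above can be folded into one lookup. The following is a sketch only: dns_flush_command is a made-up name, it prints the command rather than running it, and it assumes that Darwin kernel release 9.x corresponds to OSX 10.5 and that a Windows shell reports a 'uname -s' starting with CYGWIN, MINGW or MSYS.

```shell
# Print the name-cache flush command the list above gives for a platform.
# (Hypothetical helper, not shipped with rush.)
#   $1 = kernel name as printed by 'uname -s' (defaults to this machine's)
#   $2 = kernel release as printed by 'uname -r' (defaults to this machine's)
dns_flush_command() {
    os=${1:-$(uname -s)}
    rel=${2:-$(uname -r)}
    case "$os" in
        Darwin)
            # Assumption: Darwin 9.x corresponds to OSX 10.5.
            if [ "${rel%%.*}" -ge 9 ]; then
                echo "dscacheutil -flushcache"
            else
                echo "lookupd -flushcache"
            fi ;;
        Linux)
            # Only relevant if nscd is enabled.
            echo "/etc/init.d/nscd restart" ;;
        CYGWIN*|MINGW*|MSYS*)
            echo "ipconfig /flushdns" ;;
        *)
            echo "unknown platform: $os" >&2
            return 1 ;;
    esac
}

# Example use:
#   dns_flush_command
```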

    Once the DNS name caching daemons have been fixed, 'ping [hostname]' should work and show the correct IP. Once that's working, you can tell rush to flush the daemon's hostname-to-IP caches by either:

    • (New in 102.42) Use 'rush -reload hosts +any -t 3'
      to tell all the machines to reload their hosts files.

      -- or --

    • Change the date stamp on the license server's $RUSH_DIR/etc/hosts file by just loading it in an editor and then re-saving it, then push the changed file to the network with 'rush -push hosts +any', so that all the daemons see the changed datestamp and reload the hosts file, flushing the caches.

    The rush daemons cache hostname-to-ip-address lookups for all the hosts in the rush hostlist for speed. This prevents load on your DNS or NIS servers, since rush makes many hostname/ip lookups when managing jobs.

    You can view the daemon's hostname/IP cache by invoking 'rush -lah <hostname>'. This shows you the daemon's IP cache, which you can compare to the "realtime" hostname-to-ip-lookups report from 'rush -lah' to check for discrepancies.

    The 'rush -lah <hostname>' report shows the rush hostlist according to the daemon on the named machine, including its cached IP address lookup information. In cases where render nodes can't get a license from the license server because a node's IP address changed, the license server might have an old cache; run that command on the license server, where <hostname> is the hostname of the license server.

   Sometimes I see '???.???.???.???' in 'rush -lah' reports. Is this bad?  
    Yes, it's bad.

    Rush will not operate correctly if it can't do hostname lookups for any machine in the rush hosts file.

    The question marks mean rush is unable to lookup a host's name, and that means the OS's own hostname lookup system is not correct. (eg. 'ping hostname' will fail as well) The more of these there are, the slower rush will operate. You will also notice sluggish or very slow operation in the GUIs, and in the generation of most rush reports.

    Probably other tools like 'ping' will be unable to lookup the hostname. Possible causes:

    • In an environment where static /etc/hosts files are used for name lookups, make sure the hostname is in all your /etc/hosts files.

    • In an NIS environment, same problem likely exists with your 'hosts' map.

    • Do not use DHCP, unless you have static assignments where the leases don't expire. Rush will get confused if a machine changes its IP address, eg. after a reboot. Use static IPs.

    • In a DNS environment, either your DNS server is not responding, not configured, or the host is not in your DNS. Use nslookup(1) to debug the problem.

    • In a Windows environment where WINS is used to do hostname lookups, if the machine is down, WINS can't do a hostname lookup for it. To solve this problem, you can do any *one* of the following:

      • Make *static* IP entries for all rush machines on your WINS server, so the hostnames still lookup even if the machine is down. Use the 'WINS Manager' on your PDC.

      • Maintain static hosts files on the rush machines. Windows has a unix-like hosts file, "C:\WINNT\SYSTEM32\DRIVERS\ETC\HOSTS". If you set it up to contain IP-to-hostname entries for all the rush hosts, you can copy it to all the machines to ensure hostname lookups never fail.

      • Get away from WINS, and use DNS with static IP-to-hostname lookups. This will ensure hostname lookups work even when hosts are down.

   Rush is acting slow; reports take a long time,
and the GUI is sluggish. What's wrong?
  
    Run 'rush -lah' and 'rush -lah localhost' to see if it reports "???.???.???.???" for the ip address of any hostnames. If so, you are having a name lookup problem, and that's causing the problem.

    This is especially a problem on Windows networks if you use WINS instead of DNS. WINS can't do hostname lookups for a machine that is down, which is a good reason not to depend on WINS to dynamically keep track of things.

    To solve this problem, see the possible causes and fixes in the previous item.
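
    If the report is long, you can save it to a file (eg. 'rush -lah > lah.txt') and pick out the affected lines with a few lines of Python. This is only a sketch; the function name is made up, and the only thing it assumes about the report is that unresolved hosts show the "???.???.???.???" string somewhere on their line.

```python
UNRESOLVED = "???.???.???.???"

def unresolved_lines(report_text):
    """Return the lines of a saved 'rush -lah' report that contain the
    question-mark address instead of a real IP address."""
    return [line for line in report_text.splitlines() if UNRESOLVED in line]
```

    Usage would be something like unresolved_lines(open("lah.txt").read()); an empty list means no host in that report had a lookup failure.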

   Can I use DHCP on machines running rush?  
    It is not advised.

    You can do it if you set it up so the leases never expire.

    Rush will not operate correctly if the IP addresses of machines change randomly, or change when they reboot. The best thing to do is assign static IP addresses to all machines running rush.

   How do I know how many rush licenses are checked out?  
    It would be the number of hosts you have in your rush hosts file, which should be the same on all machines.

    You can find out how many you have easily via:

      Counting hosts in Rush
      
          rush -lah | grep '^[0-9]' | wc -l                            
          

    Rush checks licenses out on boot; it does not check licenses out or in while the system is running.

   Rushtop doesn't show cpu use for some of my windows machines?  
    If the daemon isn't running at all, and doesn't respond to 'rush -ping'..
    ..then try restarting the daemon and clearing the daemon log and lock files:

        Restart daemon, removing lock and log  
          C:\> net stop rushd
          C:\> del c:\rush\var\*LCK
          C:\> del c:\rush\var\rushd.log
          C:\> net start rushd
      	    

    ..then check with 'rush -ping'. If there's no response, check the contents of the rushd.log file for error messages. If there are license errors, refer to the rushd.log on the license server, as there might be a problem with the hostname resolution, or an IP address conflict for that machine.

    If after the above the rushd.log doesn't even exist, make sure the c:\rush\var directory is writable by the user the rushd service is configured to run as. You can open up the permissions completely by doing:

      Opening permissions on c:\rush\var
          C:\> net stop rushd
          C:\> cacls c:\rush\var /T /E /C /G everyone:F    
          C:\> attrib -R c:\rush\var\*
          C:\> net start rushd
      	    

    If the machine responds to 'rush -ping' but doesn't show up in rushtop..
    ..then either (a) the host is not in the +any hostgroup (check rush/etc/hosts), or (b) the host has a software firewall that is passing TCP port 696 but not UDP port 696. Open both UDP and TCP port 696 in the software firewall, or disable the firewall altogether.
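
    To check the TCP half of case (b) from another machine, a small Python probe like the following can help. It is a sketch, not a Rush tool, and 'tcp_port_open' is a made-up name. Note that it can only confirm TCP; UDP is connectionless, so a similar probe can't prove that UDP port 696 is getting through the firewall.

```python
import socket

def tcp_port_open(host, port=696, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the hostname did not resolve
        return False
```

    If tcp_port_open("tahoe") returns True but the machine still doesn't show up in rushtop, a firewall rule blocking UDP is a likely suspect.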

    The machine renders OK but doesn't show cpu bars in rushtop..
    If it renders OK, and the machine is definitely in the +any hostgroup, then most likely this is a PDH library problem. PDH is Microsoft Windows' own system library interface for presenting CPU metrics to programs like Rush.

    In Rush version 102.42a7 or later, the error message should show up in rushtop for that machine in an error bar. In older releases, you may have to check the rushd.log for errors, or use 'rush -status' to see the error message on the 'X' line of that command's output, eg:

      Checking 'rush -status' for errors
      C:\> rush -status tahoe
      h tahoe
      d 0 RUSHD 102.42a8 PID=2317     Online 2 0 2 ""
      j erco ontario.14 VECTORIZE Fail 133:33:29 0 100 0 ""
      j erco ontario.15 STAGE Done 112:17:41 100 0 0 ""
      p - - - - - - -
      p - - - - - - -
      X PdhValidatePath('\Memory\Pages'): The specified object is not found on the system 
      	

   Maya renders fail with '--- FAILED: EXITCODE=128'?  
    This is not an issue with rush; this is a Maya licensing problem that occurs only on Windows machines, where two different users can't run Maya on the same machine, even though licenses are available to do so. This is a bug in Maya's licensing for Windows, but workarounds exist.

    The problem and possible workarounds are fully described in this maya issues document.

   Rendering on windows, I get
'CreateProcess(perl ...): The system cannot find the file specified.'?
  
    You'll see this error in the daemon log for a windows machine.
    You may also see it in the frame logs, and/or in the NOTES section of the 'Cpus' report for your job for cpus that are windows machines.

    The cause is either 'perl' is not installed on the machine in question, or perl was recently installed, without restarting the rush daemon.

    If you install perl while the rush daemon is running, you must restart the daemon for it to pick up the addition to the system environment's PATH variable, eg:

            net stop rushd
            net start rushd
       
    Then requeue the frames and try again.
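
    To confirm that 'perl' can be found on the PATH at all, a quick check can be done with Python's shutil.which(). This is only a sketch with a made-up function name. It checks the PATH of the shell you run it from, not the PATH the already-running daemon started with, which is why the restart above is still needed after installing perl.

```python
import shutil

def find_program(name, path=None):
    """Return the full path of 'name' as found on the PATH, or None.

    'path' optionally overrides the PATH to search; when it is None the
    current environment's PATH is used.
    """
    return shutil.which(name, path=path)
```

    For example, find_program("perl") returns None on a machine where perl is not installed or is not in the system path.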

   Rushtop(1) sometimes creates a lot of Arp traffic. Why? And can this be prevented?  
    rushtop(1) can create a lot of Arp traffic if many machines rushtop(1) is trying to contact are powered off.

    One arp request is sent for every machine that is down, so if 20 machines are down, 20 arp packets will be sent every time rushtop(1) tries to update its bar graph.

    In version 102.40h and up, an exponential backoff algorithm was added to 'rush -status' (which rushtop(1) uses), to slowly pull back transmissions to machines that appear to be down.

    To increase the backoff times, you can tune the rush.status_backoff_max value to be a higher number than the default value of 15, which puts a maximum time between transmissions of roughly 45 seconds.
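
    For readers unfamiliar with the term, the following Python sketch shows the general idea of a capped exponential backoff. It is NOT Rush's actual code; the starting delay, the doubling factor and the function name are all assumptions, and only the 45 second ceiling echoes the figure mentioned above. How the rush.status_backoff_max value maps onto seconds is not shown here.

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=45.0):
    """Return the wait in seconds before each of 'attempts' retries.

    Each wait is 'factor' times the previous one, but never more than 'cap'.
    All default values are illustrative, not Rush's real settings.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays
```

    With these made-up defaults, eight retries to a machine that stays down would wait 1, 2, 4, 8, 16, 32, 45 and 45 seconds, so the traffic to a dead host quickly drops to one attempt per 45 seconds instead of one per update.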

    If a series of machines are going to be down for some time, it is best to change the '+any' entry for these machines in the rush/etc/hosts file to '+offline' instead, so that user invocations of rushtop(1) don't try to contact these machines. rushtop(1) by default uses '+any' to determine which hosts to show in its bargraph. Changing the hosts to +offline effectively removes the machines from the "+any" hostgroup, preventing rushtop from trying to contact them.

   Error 1058: The specified service is disabled and cannot be started.  
    You might see this while trying to start the Rushd service.

    Cause: The service that you are trying to start has been disabled for the current hardware profile.

    Resolution: Go into the Services for Rushd, and enable the Hardware Profile (for whatever reason, someone or something disabled it).

    1. Go to Start | Settings | Control Panel | Services | Rushd | Logon As
    2. Go to the "HW Profiles" window, select the current hardware profile and click Enable.
    3. Click OK, then try starting the Rushd service.

   c:/rush/var/.rushd.LCK: Invalid argument (can't open daemon lockfile)  
    You might get this in the rushd.log just after copying the C:/RUSH directory to a new machine from some other machine, and starting the daemon.

    To fix this, delete the lock file. As Administrator:

        net stop rushd
        del c:\rush\var\.rushd.LCK
        net start rushd
        

    The problem was likely a permissions problem caused by the copy operation that brought the c:\rush directory from the source machine, preserving the state of the lock file that was probably 'in use' at the time of the copy.

    Normally the install script removes the lock file; make sure you actually ran the c:\rush\etc\bin\install.bat script after copying over rush, to prevent other problems.

    Once repaired, this shouldn't re-occur, though you should maybe check the directory permissions on the c:\rush\var directory with the DOS 'cacls' command.

    Caveats: Using XCOPY to copy the directory seems to avoid the problem; usually the weird perms only happen when GUI drag+drop copying (from Microsoft's GUI) is used. See the 'Network Install' instructions for the recommended XCOPY command to copy the C:\rush directory to a network of machines.

   Starting rushd under WinNT it says 'can't find PDH.DLL' or 'PDH.LIB'  
    Microsoft didn't include PDH.DLL in some releases of Windows NT.

    You can download the PDH file for winnt from this Microsoft article, or search the Microsoft Knowledge Base for article "Q284996".

    For general information on accessing Microsoft support files, see Microsoft Knowledge Base article Q119591.

   udp: iface bind(192.168.0.32:696): Address already in use  
    Please see this faq item; it's the same thing.

   udp: iface bind() to udp port 696 on 192.168.0.32: Address already in use  
    Please see this faq item; it's the same thing.

   'xxx': valid host is NOT in rush hostlist  
    You'll get this error if remote machine 'xxx' is running the Rushd service, but is not in all the rush/etc/hosts files, causing those machines not to trust 'xxx', and print this error in the logs.

    Solution #1: If you want the machine on the rush network, then add the hostname to the rush hosts files.

    Solution #2: If you are trying to /remove/ that machine from the Rush network, then be sure to /disable/ the daemon on that machine (Windows: disable the service, Unix: comment it out of the boot script, or run the 'rush/etc/bin/uninstall' script)

    Typical causes:

    1. The machine's own IP address does not match what's in DNS or the system's /etc/hosts
    2. The host is simply not listed in the rush/etc/hosts file
    3. The rush hosts files are out of sync on some machines
    4. The host was supposed to be removed from rush, but the service wasn't disabled.
    5. There is a problem with either forward or reverse lookups for the host. eg. the hostname is resolving to one IP address, but the reverse lookup for that number is some other hostname. (Use 'nslookup' to debug)
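
    Cause #5 can be checked with a short Python sketch like the one below, as an alternative to doing it by hand with 'nslookup'. It is not part of Rush; the function name and the 'forward'/'reverse' parameters are made up for illustration, and comparing only the short hostname (the part before the first dot) is an assumption you may want to tighten.

```python
import socket

def forward_reverse_mismatch(hostname,
                             forward=socket.gethostbyname,
                             reverse=socket.gethostbyaddr):
    """Return None if the reverse lookup leads back to 'hostname',
    otherwise a short description of what went wrong."""
    try:
        ip = forward(hostname)
    except OSError as err:
        return "forward lookup of %s failed: %s" % (hostname, err)
    try:
        name, aliases, _addrs = reverse(ip)
    except OSError as err:
        return "reverse lookup of %s failed: %s" % (ip, err)

    def short(n):
        # Compare short names only, ignoring case and any domain suffix
        return n.split(".")[0].lower()

    names = {short(n) for n in [name] + list(aliases)}
    if short(hostname) not in names:
        return "%s -> %s -> %s" % (hostname, ip, name)
    return None
```

    Run it on the machine that printed the error as well as on 'xxx' itself, since the two machines may not agree on the answer.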

    The problem might also be similar to this problem and solutions

   LICENSE TradeUidGid() with lic-server[1.2.3.4] failed:
Error from lic-server[6.7.8.9]: host 1.2.3.4 not in lic-server:/usr/local/rush/etc/hosts
  
    You'll get this error in a client machine's rushd.log.

    This will happen if the client machine '1.2.3.4' can't check out a license from license server 'lic-server' because the actual IP address of 'client' is not resolving to a hostname in lic-server's rush/etc/hosts file.

    Put another way, this error occurs when the client contacts 'lic-server' to check out a license, and lic-server looks at the source IP address from client's connection, but is unable to reverse lookup that IP address into a hostname that can be found in the lic-server's rush/etc/hosts file. The license server rejects connections from IPs not in its rush/etc/hosts file.

    Usually you can debug the problem by running 'rush -lah' (List All Hosts) on both the license server and the client machine, and checking the client's IP address in both reports for a discrepancy.

    If the client's hostname doesn't show up at all in lic-server's 'rush -lah' report, then that's the problem; it either needs to be properly added to the rush/etc/hosts files, or it needs to be properly removed and disabled.

    If the IP addresses shown in 'rush -lah' are different, that's the problem; check your hostname lookup system (DNS, /etc/hosts, LDAP..) to see why the two machines disagree on the IP address.
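
    If you have many hosts to compare, the comparison itself is easy to automate once you have each machine's view as a hostname-to-IP table. The Python sketch below assumes you have already built those two tables yourself (the layout of the 'rush -lah' report is not described here, so no parsing is attempted); the function name is made up.

```python
def ip_discrepancies(server_view, client_view):
    """Compare two {hostname: ip} tables.

    Returns {hostname: (server_ip, client_ip)} for every host whose entries
    differ. A host missing from one table shows up with None on that side.
    """
    diffs = {}
    for host in sorted(set(server_view) | set(client_view)):
        server_ip = server_view.get(host)
        client_ip = client_view.get(host)
        if server_ip != client_ip:
            diffs[host] = (server_ip, client_ip)
    return diffs
```

    An empty result means the two machines agree on every host; any entry with None on the license server's side is a host missing from that machine's view.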

    If those agree, check to see if the license server's daemon has cached an old IP address for the client machine; on the license server, try running 'rush -lah HOSTNAME' (where HOSTNAME is the license server's own hostname) and compare the IP for the client in that report to the same entry in 'rush -lah' (without the 'HOSTNAME'). If those reports disagree on the client's IP, then the daemon has an old address cached; use 'rush -reload hosts' to flush the daemon's cache. If that fixes things (reports now agree), then use 'rush -reload hosts +any' to flush the cache on /all/ machines.

    If both reports show the same IP address, but the IPs don't agree with the error message's IP address, then probably the machine has a different IP address assigned to it than the one your hostname lookup system thinks it should have. Either correct your DNS/LDAP/etc/hosts, or change the machine's actual IP address to match DNS.

    If the machine's actual IP and two 'rush -lah' reports ALL agree, one possibility is the client's IP address was recently changed, and the license server still has the /old/ IP address cached. Try first clearing the license server OS's IP cache with:

    • OSX (10.1 through 10.4): lookupd -flushcache (See 'man lookupd')
    • OSX (10.5 and up): dscacheutil -flushcache (See 'man dscacheutil')
    • Windows (all releases): ipconfig /flushdns (See 'ipconfig /?')
    • Linux: Most linux distros don't have an IP cache by default, unless 'nscd' was enabled. If 'nscd' is on, you can clear its cache by restarting it, eg. /etc/init.d/nscd restart (See 'man nscd')

    ..then clear Rush's own IP cache by running 'rush -reload hosts' on the license server. Then restart the daemon on the client (or just wait 30 seconds), and see if it picks up the license by looking at the client's rushd.log or with 'rush -ping <client>'. It's possible other machines may have the old IP cached, in which case you'd want to repeat this on the other machines as well (eg. 'rush -reload hosts +any -t 3'). Or just reboot them.

    The cause and solution are similar to this problem.

   LICENSE client x.x.x.x sent us garbage(i)  
    You'll see this on the license server, where x.x.x.x is the IP address of one of the remote rush hosts.

    The problem is usually one of the following:

    • The remote machine has more than one network interface, and needs this rush configuration.
    • The remote machine is a recently set up redhat linux machine with this problem.

   can't get etheraddr[#]: xxx  
    You'll see this only on Windows machines. The cause is you don't have NetBIOS enabled. Check your network properties:

    1. Right click on Network Neighborhood, choose Properties

    2. 'NetBIOS' should appear under the "Services" tab. If not, use "Add" to add it. You shouldn't need a disk, but it might prompt for one. Just point it to your C: drive.

    3. 'NetBIOS' should appear under the "Bindings" tab. And your ethernet card should appear bound to NetBIOS.

    Once NetBIOS is enabled, you'll likely be asked to reboot. On reboot, Rushd should then start OK; check your log.

    Usually, there is no reason to have NetBEUI or IPX enabled under networking.

    This problem is often seen with 3Com cards.

   sendto(host:696): Message too long: to host  
    This is fixed in 102.31p and up.

    You'll see this error on Mac OSX with version 102.31n and previous when you try to use 'rush -push' from the mac, or try to use 'rushadmin' from the mac.

    The workaround is to add the following line to the 'start' section of the /Library/StartupItems/Rush/Rush boot script:

           sysctl -w net.inet.udp.maxdgram=64000
       
    Then reboot; the problem should go away. This should not be needed if you're running rush version 102.31p and higher, as the problem should already be fixed in all versions after that.

    Note that if you run the above sysctl command from the command line as root, you can fix the problem right away, eg:

           [root@macosx] # rush -push rush.conf ontario
           rush: sendto(ontario:696): Message too long: to ontario
    
           [root@macosx] # sysctl -w net.inet.udp.maxdgram=64000
           net.inet.udp.maxdgram: 9216 -> 64000
    
           [root@macosx] # rush -push rush.conf on
           ontario[rush.conf]: OK
       
    ..but if you reboot the machine, the problem comes back. This is why the fix is made to the boot script to have it stay in effect.

    You do not need to do this unless you are running a release older than 102.31p

   The service did not start due to a logon failure  
    Windows only.

    You'll get this error if the Rushd Service's user password expired, or was recently changed by someone.

    If the user's password expired, set it up to never expire. If someone changed the password, be sure to configure that new password for the Rushd Service's account settings in the Service Manager.

    You'll get this error while trying to start the rushd service from the Service Manager or from the DOS command line, e.g.:

        C:\> net start rushd
        System error 1069 has occurred.
        The service did not start due to a logon failure.
        

   My DOS shell can't find programs like 'more', 'ping' and 'xcopy'  
    Installs of Houdini have been known to break the default Windows path, such that one no longer can run standard programs like ping, notepad, nslookup, more, etc.

    Fix your system path.

   Under Windows, why are UNC paths better than drive letters?  
    This question is answered in the TD faq here.

   How do I make UNC paths work the same on Windows and Unix?  
    This question is answered in the TD faq here.

   I know drive mapped/drive letters are bad, but what do I do if I need them anyway?  
    The best way to ensure drive maps are available at render time is to add code to the submit script that forces the drives to be mapped at render time, so that bad or missing mappings are 'remapped on the fly' just before rendering, eg:

      Remap drives at render time
      
          # WINDOWS? CHECK IF Z: IS MAPPED
          #    If not, remap it.
          #
          if ( $G::iswindows && ! -d "z:/" )
          {
              my $cmd = "net use z: /delete < nul";
              print "Z: NOT MAPPED -- FORCE UNMAP FIRST: $cmd\n";
              system($cmd);
      
              $cmd = "net use z: \\\\tahoe\\jobs /PERSISTENT:YES < nul";
              print "Z: MAPPING WITH: $cmd\n";
              system($cmd);
          }
          # RENDER COMMANDS HERE
          print "\nExecuting: $command\n";
          my $errmsg;
          my $exitcode = RunCommand($command, \$errmsg);
          [..]
          

    The technique used here is to check if z:/ is a directory, and if it's not, assume the drive is either 'unavailable' (ie. is mapped, but the map has timed out) or 'unmapped'. To solve the 'unavailable' situation, any old mapping is first deleted (net use z: /delete), before remapping it as 'persistent'. The redirect from '< nul' ensures that any 'Y/N?' prompts are 'answered'.

   Is there a way to silence the 'Checkpoint wrote' messages?  
    Yes, in Rush version 102.41 and up you may see these messages.

    You can disable them in the rush.conf file; just change the checkpoint.log setting from '1' to '0'.

    Don't forget to push the rush.conf file to the network, eg. 'rush -push rush.conf +any'.

   Can Rush be installed in a directory other than '/usr/local/rush'?  
    It is not recommended.

    Usually the intention is to install the Rush directory on an NFS drive, which is absolutely not recommended; it is a grand mistake to install network daemons on an NFS drive; it completely defeats the fault tolerance of the system.

    But if you want to install rush on a different local directory, you can either make /usr/local/rush a symbolic link to the new location (the easiest thing to do), or if you really want to avoid /usr/local/rush altogether, then you'll have to:

    • Set the $RUSH_DIR environment variable to the directory, and this variable must be set (a) before the daemon is started and (b) in all user environments.
    • Edit the 'rush/etc/rush.conf' file, and change any references to /usr/local/rush to the path you're using instead.
    You will make things easier on yourself and your users if you just install rush in the usual /usr/local/rush location.

   Can I install rush on an NFS server, to avoid installing locally on each machine?  
    No!

    The install instructions all warn:

      WARNING: As with all daemons and their config files, do *not* install the rush directory or binaries on an NFS mounted drive. Keep rush binaries local to each machine.

    The main reasons:

    1. Having executables demand page over NFS means the daemons will be dependent on the health of the NFS server. NFS 'hangs' can freeze the daemon in an unkillable state, and will make them unresponsive to the interactive apps.

      This also kills the whole fault tolerant 'distributed' design of Rush, which is to not be dependent on the health of any one server. The daemons are carefully written so as not to be dependent on any other server, keeping them responsive even when NFS servers or even license servers are rebooting.

    2. Similarly, Rush regularly monitors files in the rush/etc directory for changes. If files in rush/etc are NFS mounted, an NFS hiccup could cause the daemon to freeze as soon as it tries to touch one of these files.
    3. Rushd reads/writes files in the rush/var directory. If the rush dir is NFS mounted, all the daemons will overwrite each other's logs and state files, causing pandemonium.
    4. Rush uses file locking on files in rush/var to ensure integrity with its data files. NFS file locking is very broken in all releases of Unix (Sun, Irix, Linux, Mac, everything). It hangs programs in an unkillable state when it fails, and it fails often under medium and heavy load. File locking doesn't fit the idempotent 'stateless' design paradigm of NFS, so it is broken by design, and therefore untrustworthy.
    If you disagree with the above, and are sure of your NFS being stable 100% of the time, then at LEAST make sure the rush/var directory is local; the daemons read/write files in that directory, and they simply will not operate correctly unless that dir is local, or at very least unique on each machine.

   I'm getting 'select() on connect(): (111) Connection refused' when I submit jobs, or use 'rush -ping'  
    This means the Rushd daemon or service isn't running.

    Look in the rush/var/rushd.log file for errors from the daemon. Most likely causes:

    • This machine's hostname is not in the rush/etc/hosts file
    • The rush/etc/hosts file is not the same on all machines (it should be)
    • This machine is not in the license server's rush/etc/hosts file
    • The wrong IP address is cached in the license server, because a remote host's IP address recently changed.

    NOTE: If this is a Windows machine, and the rushd.log file is /empty/, even when you try to manually start the daemon, then try these commands as administrator:

        net stop rushd
        del /f c:\rush\var\.rushd.lck
        net start rushd
        
    If the rushd.log still won't show any contents, there might be a permissions problem with the rush directory itself, where the service is running as a user that doesn't have access permission to the c:\rush directory.

   What are the hardware/software prerequisites for Rush?  
    There are some prerequisites for Rush, especially on Windows and Mac networks. Rush assumes there's a professionally configured network; hostname/ip lookups all working, consistently configured user accounts, file servers for the common data with correctly configured access permissions.

    Some specifics:

    • Use DNS for hostname/ip lookups. (Static /etc/hosts are OK too.)
      Do /not/ use the Microsoft "WINS" or Apple "Rendezvous" defaults; these are 'zero administration' name lookup systems that are weak protocols, and not fault tolerant. Their biggest problem is being unable to do name lookups for machines that suddenly turned off, which will cause network wide problems during rendering. DNS does not have this problem.

    • Use static IP assignments for the machines, not DHCP.
      If you have to use DHCP, make sure the leases are set to never expire.

    • Make sure all machines can 'ping' each other by hostname.
      DNS will make this possible on all platforms. All machines must be able to lookup each other's IP addresses based on hostnames to communicate correctly.

    • Unix networks must have consistent user accounts.
      All users submitting jobs should have valid accounts on all machines with consistent uid/gid values. Example: If "fred" has a uid/gid of 100/20 on one machine, his account should have those same uid/gid values on all machines. (exception below)

    • Macs must have static hostnames in /etc/hostconfig.
      Macs must have static hostnames:

      • For OSX 10.10 and up, you can set the hostname with these commands:

            sudo scutil --set ComputerName  "yourhostname"
            sudo scutil --set LocalHostName "yourhostname"
            sudo scutil --set HostName      "yourhostname"
            

      • For OSX 10.9 and older, you can set the hostname in the /etc/hostconfig file. Just remove any existing "HOSTNAME=" lines, and add at the bottom:

            HOSTNAME=yourhostname
            

      After making that change, reboot to make sure it takes effect throughout the system's daemons.

      This avoids the 'hostname.local' Rendezvous/Bonjour stuff. Leave off the .local or .yourdomain from the hostname.. just use the short name. You can configure your DNS domain name (if any) in your client DNS settings for "Search Domains".

    • Windows Rush service runs as a particular user.
      Under Windows only, you will need to configure the Rush service to run as a particular user. This user must have read/write access to the file server. In a Windows WORKGROUP, this means the user must have the same password on all machines including the file server, for consistent access. The user's password must be configured to be non-blank, and non-expiring.

    • Render clients should have the file server mount on boot.
      This is a unix-specific requirement. (Under windows UNC paths can access the file server at all times without setup on the clients.) Unix machines should use NFS to mount the file server on boot, either via 'mount' commands in the boot scripts, or via the automounter. SMB and AFP mounts are usually not suitable, especially under Mac OSX due to this OSX restriction that makes multiuser access to AFP/SMB mounts very unworkable. See this overview of the benefits and drawbacks of NFS/SMB/AFP.

    The requirements for DNS and static IPs are necessary for stability; Rush by definition makes machines into 'cpu servers' for rendering. As 'servers', these machines' IP addresses must be consistent through reboots in order for communications to be re-established reliably. By always rebooting with the same IP address, Rush will be able to resume communications quickly.

    The requirement for static IPs is similar to why Web Servers and File Servers must have static IPs; so that clients can find them quickly when rebooted. Another reason to use static IPs: ARP and IP caching are affected by changes in IP addresses, so DHCP dynamic IP changes can cause 20 minutes of confusion on the network for rebooted machines while Arp tables expire. Also IP address caching in name lookup services can equally cause communication problems.

    Avoidance of DHCP prevents problems with the Rushd network services during startup. If DHCP servers are not immediately responsive after reboot, machines can end up with random or improper IP address and DNS/hostname info, causing network services to bind to the wrong IPs and fail to resolve other hostnames. Use static IPs, DNS server info and hostnames to avoid this.

    Avoiding WINS and Rendezvous/Bonjour is important. These protocols are unable to do hostname lookups for machines that are turned off, which causes network-wide problems with hostname lookups. DNS or even /etc/hosts files are better, as they ensure hostnames resolve even when the remotes are turned off, ensuring responsive name lookups.

    The Unix requirement for consistent uid/gid values is needed for consistent file system permissions when network rendering. Normally Rush runs renders as the user that submitted the job on unix machines. Note: you can avoid adding user accounts to all machines if you force Rush to run all renders as a single user (eg. 'render'), in which case only that user needs a valid account on all machines.

   How do you enable Rush mail messages to work under windows?  

   Windows: rushd starts, but shows /no/ new messages in the rushd.log?  
    If you see no messages in the rushd.log, it's one of two problems:

    • Bad lock file was copied from another machine, and how to fix it
    • Bad permissions on the c:/rush/var and c:/rush/etc directory, and how to fix it

   can't open port lock file '/usr/local/rush/var/nextport: Permission denied  
    Make sure rush/rushd are owned by root and have the SUID bit set. eg:
          chown 0:0  /usr/local/rush/bin/{rush,rushd}
          chmod 4755 /usr/local/rush/bin/{rush,rushd}
      

    When you extracted the rush distribution tar file, you probably forgot to specify the tar(1) 'p' flag.
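
    To check the ownership and SUID bit without eyeballing 'ls -l' output, something like the following Python sketch can be used. It is not part of Rush, and the function names are made up; it simply tests for the same two things the chown/chmod commands above set up (owner uid 0, and the SUID bit in the file mode).

```python
import os
import stat

def is_suid_root(st):
    """True if an os.stat() result describes a root-owned file with the SUID bit set."""
    return st.st_uid == 0 and bool(st.st_mode & stat.S_ISUID)

def not_suid_root(paths):
    """Return the paths (which must exist) that are NOT root-owned SUID files."""
    return [p for p in paths if not is_suid_root(os.stat(p))]
```

    For example, not_suid_root(["/usr/local/rush/bin/rush", "/usr/local/rush/bin/rushd"]) should return an empty list on a correctly installed machine.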

   rush: iface bind(x.x.x.x): Cannot assign requested address  
    The machine is probably confused about its own IP address.

    Run 'hostname' to determine the host's name, then 'ping [the_hostname]'.

    Verify if the IP address ping prints in response is the machine's actual network IP address, which you can verify with the command:

            Windows: ipconfig /all
           Unix/OSX: ifconfig
      
    Also, check that your hostname lookups are configured correctly in either DNS, or if not using DNS, in your (unix)"/etc/hosts" or (windows)"c:\winnt\system32\drivers\etc\hosts" files.
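
    The error itself means the OS refused to bind a socket to that address, which normally happens when the address isn't assigned to any of the machine's network interfaces. You can reproduce the test outside of Rush with a sketch like the one below. The function name is made up, and it binds to port 0 (any free port) so it doesn't interfere with a running daemon.

```python
import socket

def can_bind(ip):
    """Return True if a UDP socket can be bound to 'ip' on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip, 0))   # port 0 lets the OS pick any free port
        return True
    except OSError:
        return False
    finally:
        sock.close()
```

    One way to use it is can_bind(socket.gethostbyname(socket.gethostname())); if that returns False, the hostname is resolving to an address the machine doesn't actually have.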

   unknown uid 100  
    You'll see this error usually in the IRUSH 'Frames' report (rush -lf) or 'Cpu' report (rush -lc).

    You'll get this error under one of two circumstances:

    1. When your unix machines don't have consistent uid/gid values for users
    2. When you submit a job from Windows that uses unix machines, and haven't yet configured ntrushuid and ntrushgid values in the rush.conf file.
Problem #1: Inconsistent UID/GID Description + Solution
    Description: When a unix machine runs a render for a job submitted from another unix machine, it will try to run that job as the user that submitted the job, using the same uid/gid values. NFS also works this way. If the user doesn't exist, or there is no matching account on the render nodes for the uid/gid values of the user who submitted the job, renders will fail with this error.

    Solution: fix the uid/gid values on your machines so each user has the same uid/gid value on each machine. eg. if 'fred' has a uid/gid value of 105/100, then make sure his account has the same uid/gid values on /all/ machines.
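
    If accounts are kept in local /etc/passwd files, you can gather a copy of each machine's file and compare them with a sketch like the following. This is only an illustration with made-up function names. It won't see accounts served from NIS or LDAP, and it only reports users whose uid/gid values differ between machines, not users that are missing from a machine.

```python
def parse_passwd(text):
    """Map username -> (uid, gid) from the text of a passwd(5) file."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue
        try:
            users[fields[0]] = (int(fields[2]), int(fields[3]))
        except ValueError:
            continue   # skip NIS '+' entries and other non-numeric lines
    return users

def inconsistent_users(passwd_by_host):
    """Given {hostname: passwd_text}, return {user: {hostname: (uid, gid)}}
    for every user whose uid/gid values are not the same on all hosts."""
    seen = {}
    for host, text in passwd_by_host.items():
        for user, ids in parse_passwd(text).items():
            seen.setdefault(user, {})[host] = ids
    return {user: hosts for user, hosts in seen.items()
            if len(set(hosts.values())) > 1}
```

    An empty result means every user found in the files has matching uid/gid values wherever the account exists.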

Problem #2: Windows -> Unix Description + Solution
    Description: When a job is submitted from a Windows machine, and asks to render on unix machines, Rush has to assign the job a uid/gid value, because Windows has no concept of these values.

    The default uid/gid values Rush assigns to windows jobs on unix machines are uid=100, gid=100. To change the defaults, edit the ntrushuid and ntrushgid values in the rush.conf file.

    If the unix machine tries to run a job with a uid/gid of 100, but the unix machine doesn't /have/ an account with that uid/gid, then Rush gives this error.

    Solution: The sysadmin should create a 'render' (or "rush") account on all the unix machines, making sure they're all using the same uid/gid values. Either assign that user a uid/gid of 100/100, or use your own custom uid/gid values and configure them for the ntrushuid and ntrushgid values respectively in the rush.conf file. The rush.conf file can be modified with 'rushadmin' and then pushed out to the network with the 'Send' button, or it can be manually edited in a text editor and then pushed to the network with 'rush -push rush.conf +any'.

    This procedure is also covered in the Windows Installation instructions.

    Caveats: The sysadmin must make sure the user permissions of the unix "render" user are compatible with the permissions of the Windows users, so that both platforms have the same permission to each other's files. This way when a windows job tries to render on unix, it uses the uid/gid of the "render" user to run the renders.

   rushd: error while loading shared libraries: libstdc++.so.5: cannot open shared object file..  
    You are probably trying to use rush's 'redhat9' distribution on RHE4 (Redhat Enterprise 4) or FC4 (Fedora Core 4). If so, be sure to install the 'compatibility libraries', in particular, rpm -ivh compat-libstdc++-33-3.2.3-47.3.i386.rpm.

    Or, if you don't have access to that, you can try downloading the libs you need from the password protected customer patches directory; use the same login/password info you used to download/install the Rush software.

   'push xxxx': DeleteFile(c:\rush/etc/hosts): Access is denied.  
              somehost[hosts]: 'push hosts': DeleteFile(c:\rush/etc/hosts): Access is denied.
          somehost[rush.conf]: 'push rush.conf': DeleteFile(c:\rush/etc/rush.conf): Access is denied.
        somehost[license.dat]: 'push license.dat': DeleteFile(c:\rush/etc/license.dat): Access is denied.
    
    This means that the named files in the c:\rush\etc\ directory are not writable by the user the Rushd service is running as. This error is Windows specific; you will see this when pushing updated files to remote Windows machines.

    A few situations can cause this. Examples:

    • Someone manually copied the file.
      Suppose the Rushd service is running as the user 'render', and someone logged in as 'Administrator' used COPY to put a new license.dat file on the machine. The file is now owned by Administrator, and has permissions that only Administrator can overwrite. Later, when someone tries to push out a new license.dat file using 'rush -push', Rushd running as 'render' can't overwrite the file.
    • Someone changed the user the Rushd service is running as.
      Let's say the Rushd service was first installed and configured to run as the user 'Administrator'. When files are pushed, they take on that user's access permissions. Later, someone changes the Rushd service to run as the new user 'render'; the files in the c:\rush\etc directory are still owned by 'Administrator' and are not writable by anyone else.

    The fix is simple: use 'CACLS' to open up the permissions for all files in c:\rush to be writable by the user the Rushd service is running as.

    Let's say the Rushd service is configured to run as the user 'render', then you would run this command as Administrator:

        cd c:\rush && cacls . /t /e /c /g render:f
    

    ..which opens the permissions to the user 'render'. Or, if security isn't a big deal, you can just open up the files to be writable by everyone:

        cd c:\rush && cacls . /t /e /c /g everyone:f
    

   rresvport_NET: All ports in use  
    You will see this error on unix machines, either when submitting jobs, or pushing buttons in irush, or when running simple rush commands like 'rush -lf', 'rush -lj', etc.

    The cause is that the permissions on the 'rush' executable are not 4755 root/root. Verify the permissions are exactly as shown:

        % ls -la /usr/local/rush/bin/{rush,rushd}
        -rwsr-xr-x  1 root root 2441421 Jan  1 00:00 /usr/local/rush/bin/rush
        -rwsr-xr-x  1 root root 3065406 Jan  1 00:00 /usr/local/rush/bin/rushd
        
    The ownerships are important too; they should be either root/root or root/wheel.

    To fix the problem, run these commands while logged in as root in the order shown:

        # chown 0:0  /usr/local/rush/bin/{rush,rushd}
        # chmod 4755 /usr/local/rush/bin/{rush,rushd}
        
    The most common causes of this problem: the rush tar file was not extracted with the 'p' (preserve permissions) flag during installation, or the install script was not run (which normally fixes this problem), or an inexperienced root user recently invoked a runaway 'chmod -R' or 'chown -R' command that affected the entire rush directory tree. If the latter, suspect that the perms on the rush directories might be wrong as well; consider checking that all the rush dirs are chmod 755/chown 0:0.
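    The permission check above can also be scripted. This is a minimal sketch, not a Rush tool: the function name is hypothetical, and it checks only the mode bits and the owning uid. The group is left for the admin to verify by eye, since the gid of 'wheel' differs between systems.

```python
import os
import stat


def rush_binary_problems(mode, uid):
    """Return a list of problems for a binary that should be 4755 and owned by root."""
    problems = []
    perms = stat.S_IMODE(mode)  # strip the file-type bits, keep permission bits
    if perms != 0o4755:
        problems.append("mode is %04o, expected 4755" % perms)
    if uid != 0:
        problems.append("owner uid is %d, expected 0 (root)" % uid)
    return problems


if __name__ == "__main__":
    # Paths as shown in the listing above; adjust if rush is installed elsewhere.
    for path in ("/usr/local/rush/bin/rush", "/usr/local/rush/bin/rushd"):
        try:
            st = os.stat(path)
        except OSError as err:
            print("%s: %s" % (path, err))
            continue
        issues = rush_binary_problems(st.st_mode, st.st_uid)
        print("%s: %s" % (path, "; ".join(issues) if issues else "OK"))
```

    If either binary reports a problem, apply the chown and chmod commands shown above.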

    This problem is similar to this problem.

   Why is it bad to put NFS mounts (or symlinks to NFS mounts) in the root directory?  

    Sometimes people wonder why NFS mounts are created in /mnt or /hosts subdirectories under root, like /mnt/jobs, instead of just /jobs. There are several reasons, but one in particular is very important: to avoid hanging up the entire OS when there's an NFS outage!

    Whenever the operating system wants to access a file or program, say /usr/bin/man, the operating system starts at the root (/), and walks the directory contents of the root directory until it hits 'usr', then walks the directory list of /usr until it finds 'bin', etc.

    If you have a mount (or symlink pointing to a mount) in the root directory, and it just happens to appear earlier in the directory entries than /usr, invoking /usr/bin/man will involve stepping over your mount/symlink while searching for /usr. If the NFS server is down, your app will hang when it tries to step over the mount point/symlink, causing it to freeze up your shell. (Similar to how running 'df' freezes the shell during NFS outages)

    Suppose you're an administrator wanting to log in as root to fix the mounts on the client during an NFS outage. You can't even log in, because the act of logging in opens files in /var and /etc, any of which might trip over your mount and freeze up the login process. In such a scenario, EVERY program trying to do just about anything will hang; daemons periodically opening their config files in /etc or /var could be hung by this.

    By putting mounts and symlinks to mounts in a directory BELOW root, you avoid the problem. For instance, if my server's name is 'tahoe', and it has two volumes I want to mount, 'jobs' and 'admin', then I might use:

    
       /
        |-- bin
        |-- boot
        |    |___ lost+found
        |     
        |-- dev
        :
        |-- tahoe        <-- plain subdirectory
        |    |__ jobs    <-- mount point for tahoe:/raid/jobs
        |    |__ admin   <-- mount point for tahoe:/raid/admin
        |
        |-- tmp
        |-- usr
        |    |__ X11R6
        |    |__ bin
        :    :  
        

    This way the OS can walk over the 'tahoe' subdirectory without touching the NFS mount points inside ('jobs' and 'admin'). Then root can login safely and run commands, even if the NFS server is hung. (Just be sure root's PATH doesn't include any directories that point to an NFS server, or you'll end up with the same problem when root tries to login or run commands. If you have commands you want to access on your NFS server as root, make 'aliases' to the absolute NFS paths, to avoid putting the NFS directory in root's PATH. That way an NFS outage will only affect your aliases, and not ALL commands you invoke.)

    You may already have mounts/symlinks to mounts in the root directory, and have just been lucky things work fine during an NFS outage, because your mounts/symlinks just happen to be 'below' the important entries like /bin. Keep in mind the true order of directory entries in the file system is typically /not/ alphabetical.. the order is often hashed, or dependent on the creation order of the directory entries. (Depends on how your filesystem manages directories)
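    To audit a Linux client for NFS mounts sitting directly in the root directory, you can scan its mount table. This is a minimal sketch under stated assumptions: the function name is hypothetical, the input is text in the /proc/mounts format (device, mount point, filesystem type, options), and it does not detect symlinks in / that point to NFS mounts, which need a separate check.

```python
import posixpath


def root_level_nfs_mounts(mounts_text):
    """Return NFS mount points located directly in / from /proc/mounts-format text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete mount entry
        mount_point, fstype = fields[1], fields[2]
        if fstype not in ("nfs", "nfs4"):
            continue
        # '/jobs' has parent '/', while '/tahoe/jobs' has parent '/tahoe'.
        if mount_point != "/" and posixpath.dirname(mount_point) == "/":
            found.append(mount_point)
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point in root_level_nfs_mounts(f.read()):
            print("NFS mount in root directory:", mount_point)
```

    Any mount point this prints is a candidate for moving under a plain subdirectory, as in the 'tahoe' layout shown earlier.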

   How do I add hosts and remove hosts and render nodes in Rush?  

   When I hit 'Submit' I get '/var/tmp/submit-xxx.pl': No such file or directory  

    A common cause for this is the person submitting didn't invoke the submit script with an absolute path that points to the file server.

    The submit script uses the path it was invoked with to determine the absolute path to its own location. It needs that path so it can submit itself as a job and then run itself on all machines.

    If you try to submit a /copy/ that is on the desktop, or invoke the script by first cd'ing to the directory where it's located and then running ./submit-foo.pl, the script won't be able to determine the absolute path to itself. (Automatic attempts to determine this are surprisingly flawed.) The absolute path needs to be specified when the script is started up, eg. /yourserver/rushscripts/perl/submit-xxx.pl

    Make sure the user's desktop shortcuts that point to the submit scripts are actually aliases or symlinks (and not copies) that point to an absolute path to the submit script on the file server.
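    The absolute-path requirement can be illustrated with a small check on the invocation path. This is only a sketch of the idea, not how the Rush submit scripts are implemented; the function name is hypothetical, and it deliberately rejects relative paths instead of trying to resolve them.

```python
import posixpath


def invoked_with_absolute_path(invoked_as):
    """Return True if a unix-style invocation path is absolute."""
    return posixpath.isabs(invoked_as)


if __name__ == "__main__":
    for example in ("/yourserver/rushscripts/perl/submit-xxx.pl",
                    "./submit-foo.pl"):
        ok = invoked_with_absolute_path(example)
        print("%-45s %s" % (example, "OK" if ok else "relative path, rejected"))
```

    A relative path such as ./submit-foo.pl only makes sense in the submitting user's current directory, so it cannot be handed to other machines as the job's command.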

    For more information on How to Setup Desktop Shortcuts to the submit scripts, see:

   When I install rush on SUSE 10.x I get '*** RUSH INSTALL FAILED!'  

    When installing Rush version 102.42a9 or older on some of the recent versions of SUSE (10.x and up), you may encounter this error while running the install script:

      SUSE Install Errors
      cp: cannot create regular file `/etc/rc.d/init.d/rush': No such file or directory
      [..]
      *** RUSH INSTALL FAILED!
      *** There were 1 error(s).
      	

    This is due to SUSE deviating from the /etc/init.d/ standard for boot scripts in an effort to parallelize booting.

    A simple tweak to the install script will make this work; see this newsgroup article for info.

   Is there a way in rush to limit rendering to certain processors? (processor affinity)  

    Yes, Rush version 102.42a9c (and up) provides an optional (and currently undocumented) 'Affinity' field in the rush/etc/hosts file. There's a newsgroup article that covers its syntax and use.

    In 102.42a9c, this feature only works for Microsoft Windows.
    In a follow up release, it will support linux as well.
    Since OSX does not support processor affinity as of this writing (Nov 2010), affinity values for OSX machines will currently be ignored. When OSX adds support for this in the future, a future release of Rush will incorporate it as well.

    As of this writing, this Rush feature is not documented, save the above newsgroup article. This will change in the future once field testing can determine its proper usage.