RUSH RENDER QUEUE
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.
V 102.42 12/14/04


Systems Administrator
Frequently Asked Questions

  1. On submit, I get 'logdir '/some/path/': no such file or directory', but the dir exists!
  2. What's the best way to verify all the daemons are running?
  3. How do I stop/start the daemons? (Unix/NT)
  4. Is there an example boot script I can use to invoke rush?
  5. Is there a way to run 'rush -online' automatically when someone logs out?
  6. Is there a way to run 'rush -online' automatically when someone's screensaver pops on?
  7. What kinds of security issues are there with rush?
  8. How do I update changes to the rush hosts file (or rush.conf file) to the network?
  9. Is there a way to see whose jobs are bumping whom?
  10. Is there a way to see who's changing other people's jobs?
  11. Can rush be told to use a different network interface, other than the machine's hostname?
  12. Windows: Where can I get perl for Windows?
  13. Windows: Where can I get rsh/telnet daemons for Windows?
  14. Windows: is there a way to restart the rushd service as a normal user?
  15. Windows: Is there a way to disable error dialogs? (General Protection Faults, etc)
  16. I changed the ip addresses of a host on my network, and now rush can't talk to it?
  17. Sometimes I see "???.???.???.???" in 'rush -lah' reports. Is this bad?
  18. Rush is acting slow; reports take a long time, and the GUI is sluggish. What's wrong?
  19. Can I use DHCP on machines running rush?
  20. How do I know how many rush licenses are checked out?
  21. Rushtop doesn't show cpu use for some of my windows machines?
  22. Maya renders fail with "--- FAILED: EXITCODE=128"?
  23. Rendering on windows, I get 'CreateProcess(perl ...): The system cannot find the file specified.'?
  24. Starting rushd under WinNT it says 'can't find PDH.DLL' or 'PDH.LIB'
  25. Starting rushd under Windows it says: 'The service did not start due to a logon failure'
  26. My DOS shell can't find programs like 'more', 'ping' and 'xcopy'
  27. Under Windows, why are UNC paths better than drive letters?
  28. How do I make UNC paths work the same on Windows and Unix?
  29. Is there a way to silence the "Checkpoint wrote" messages?
  30. Can Rush be installed in a directory other than '/usr/local/rush'?
  31. Can I install rush on an NFS server, to avoid installing locally on each machine?
  32. I'm getting 'select() on connect(): (###) Connection refused' when I submit jobs, or use 'rush -ping'
  33. What are the hardware/software pre-requisites for Rush?
  34. How do you enable Rush 'donemail/dumpmail' messages to work?

Common Errors in rushd.log

  1. bind(): Address already in use
  2. Reverse lookup for 'xxx' returned 127.0.0.1 (loopback address) instead of network address.
  3. connect(myhost): Invalid argument
  4. udp: iface bind(#.#.#.#:696): Address already in use
  5. 'xxx': valid host is NOT in rush hostlist
  6. client #.#.#.# sent us garbage(i)
  7. can't get etheraddr[#]: error message
  8. sendto(host:696): Message too long: to host
  9. CreateProcess(perl ...): The system cannot find the file specified.
  10. Rushtop(1) sometimes creates a lot of Arp traffic. Why? And can this be prevented?
  11. Error 1058: The specified service is disabled and cannot be started.
  12. c:/rush/var/.rushd.LCK: Invalid argument (can't open daemon lockfile)

Common Errors from 'rush'

  1. rush: 'connect(myhost): Invalid argument' in my linux rushd.log?
  2. rush: rresvport(): Permission denied
  3. rush: logdir '/some/path/': no such file or directory
  4. rush: bind(): Address already in use
  5. rush: can't open port lock file '/usr/local/rush/var/nextport': Permission denied
  6. rush: iface bind(x.x.x.x): Cannot assign requested address
  7. unknown uid 100
  8. rushd: error while loading shared libraries: libstdc++.so.5: cannot open shared object file..

  I'm getting 'connect(myhost): Invalid argument' in my linux rushd.log?
  

  I'm getting 'Reverse lookup for 'myhost' returned 127.0.0.1 (loopback address)'?  

    Your host is coming back with 127.0.0.1 as its ip address, instead of the network interface's actual address. Fix your machine's /etc/hosts file or hostname lookup system.

    This is a known problem with most, if not all, of the 'Redhat Linux' installer programs.

    To test for the problem, open a shell on that machine and ping its own hostname. If the address returned is 127.x.x.x, that is the problem:

    
        you@tahoe % ping tahoe
        PING tahoe (127.0.0.1) from 127.0.0.1 : 56(84) bytes of data.
        64 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=0.5 ms
        64 bytes from 127.0.0.1: icmp_seq=1 ttl=255 time=0.4 ms
        ^C
    
        you@tahoe % cat /etc/hosts
        127.0.0.1 tahoe localhost           <-- This is the problem
    	

    ..the /etc/hosts entry is wrong. Correct it by making separate entries for 'localhost' and the machine's hostname and IP address, eg:

    
        root@tahoe # cat /etc/hosts    _
        127.0.0.1      localhost        |__ CORRECT
        192.168.0.37   tahoe           _|
    
        root@tahoe # ping tahoe
        PING tahoe (192.168.0.37) from 192.168.0.37 : 56(84) bytes of data.
        64 bytes from 192.168.0.37: icmp_seq=0 ttl=255 time=0.5 ms
        64 bytes from 192.168.0.37: icmp_seq=1 ttl=255 time=0.5 ms
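
    The /etc/hosts check above can also be scripted. What follows is only a minimal sketch, not something that ships with rush: 'loopback_entries' is a hypothetical helper name, and it assumes the hosts file uses '#' for comments. It prints any line that maps the given hostname to a 127.x.x.x address, so no output means the file looks OK for that host.

```shell
#!/bin/sh
# Hypothetical helper (not part of rush): print every line of a hosts file
# that maps the given hostname to a loopback (127.x.x.x) address.
# Usage: loopback_entries HOSTNAME [HOSTSFILE]   (HOSTSFILE defaults to /etc/hosts)
loopback_entries() {
    awk -v host="$1" '
        { sub(/#.*/, "") }                  # strip comments (assumed to start with "#")
        $1 ~ /^127\./ {                     # only look at loopback entries
            for (i = 2; i <= NF; i++)
                if ($i == host) { print; break }
        }
    ' "${2:-/etc/hosts}"
}
```

    For example, 'loopback_entries tahoe' would print the bad '127.0.0.1 tahoe localhost' line shown above, and print nothing once the entries are separated.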
    	

  I'm getting 'rresvport(): Permission denied', what's that mean?  
    Usually one gets this error in the context of running 'rush' from the command line:

    
        % rush -ping
        tahoe: rush: rresvport(): Permission denied
    	

    Rush uses a reserved port to communicate with the daemon, and therefore needs to run SUID root.

    Make sure the SUID bit is on for the rush(1) binary, and the owner is root:

    
        chmod 4755 /usr/local/rush/bin/rush
        chown 0:0  /usr/local/rush/bin/rush
    	

  What does 'bind(): Address already in use' mean?  
    It usually means one of these things.

    1. This is an SGI, and the kernel's NFS is using the port
    2. Something else is using the port.
    3. Two or more rushd's are running. (Not likely in 102.31+)
    4. You recently stop/started the daemon. Problem goes away by itself. (Not likely in 102.31+)

    #1 often occurs if you've just installed rush on an SGI for the first time, and the machine has been up for a while. 'netstat -an' will show a whole slew of UDP listeners on ports between 512 and 1024 all in sequence, one of them being port number 696, the one rush has been assigned by IANA. Some rogue kernel utility is causing this, probably NFS. Usually fuser(1) shows no process associated with the rogue UDP listeners because it's a kernel process. The easiest solution is to simply reboot; when rush starts on boot, it always secures the port it needs well before the kernel gets a chance to step on it.

    #2 Stop the rush daemon, and use 'netstat -an' to see if some other program is using rush's port (normally port #696; see your rush.conf file's serverport setting, in case your site uses a different port number). Look for open UDP or TCP connections on that port, in either the Local or Foreign address.

    If you see port #696 in the 'Foreign' address of the local machine, suspect hung clients on the remotes:

    • rsh over to the remote machine (ie. 'Foreign' host)
    • Kill any 'rush' client processes you see, eg. 'killall rush'
    • Back on the local machine, do a 'netstat -an' to verify the connections are gone or closing.
    • Restart the daemon once all 696 ports have closed

    If the local TCP or UDP port is in use, suspect some system daemon or other is using the port when it shouldn't. Use fuser(1) or similar utility to figure out which process is using the port, or simply reboot.

    If fuser(1) shows no process and it's an SGI, then see #1..

    #3 only happens in older versions of rush (pre-102.31) where more than one rush might be running. Newer versions of rush use a lock file that prevents this.

    Only one daemon should have a PPID of 1 (Parent Process ID). If there's more than one with a PPID of 1, kill the one(s) with the higher PID.

    #4 is common only in the older versions of rush (pre-102.31), and occurs when you stop/start the rushd daemon. This problem fixes itself within 2 minutes automatically. The OS often keeps recently closed TCP listeners unavailable to other processes for a 90 second period. Rush will keep retrying to bind to the port, and eventually succeeds within 2 minutes.
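
    When hunting for whatever is holding the port, it can help to filter the 'netstat -an' output down to just the lines that mention it. A rough sketch follows; 'port_lines' is a hypothetical helper, not a rush command, and it assumes the port appears as the last component of an address column, written either 'addr:port' (Linux style) or 'addr.port' (IRIX/BSD style). Column layouts vary between platforms, so treat the result as a starting point.

```shell
#!/bin/sh
# Hypothetical helper (not part of rush): read 'netstat -an' output on stdin
# and print only the lines where some column ends in ":PORT" or ".PORT".
# Usage: netstat -an | port_lines [PORT]      (PORT defaults to 696)
port_lines() {
    awk -v port="${1:-696}" '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) { print; break }
        }
    '
}
```

    Typical use would be 'netstat -an | port_lines 696', run after stopping the daemon. Any lines that remain are candidates for the hung clients or rogue listeners described above.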

  On submit, I get 'rush: LogDir '//foo/bar': No such file or directory' or 'ERROR: can't create log: //foo/bar': Access is denied.'
  
    A Microsoft Windows specific problem; the typical complaint being 'but the directory does exist!'.

    This is encountered only on windows machines, and can be caused by:

    1. The directory is inaccessible to the user the Rushd service is running as
    2. The password expired for the user the Rushd service is running as
    3. A trailing slash is specified in the directory name (102.40e or older)

    Regarding #1, when you submit a job, it is RUSHD that checks whether it can write to the log directory. It does this test running as the user the RUSHD service is configured to run as (Control Panel | Administrative Tasks | Services | Rushd | Log On As). If the daemon can't access the directory, it responds that the directory does not exist. What windows should say is 'Permission denied', but instead it says "File Not Found". Make sure the permissions on the directory are open (if the server is a Samba server, use 'chmod 777').

    Also, verify the user the RUSHD service is running as is able to access the file server;

    • Login as the user you have the RUSHD service configured to run as
    • Browse to the log directory from a browser, like 'My Computer', 'Network Neighborhood', 'My Network Places', etc.

    ..while doing these steps, see if you are prompted with a dialog asking for a password.

    If so, that's the problem. You shouldn't be seeing this dialog; it's an error message indicating the current user does not have access to the server. Even though you can click past it, RUSHD can't. Fix your server config so this dialog doesn't show up.

    When RUSHD tries to access the disk server, it has no way to supply a password to that dialog; the RUSHD service must be running as a user that has access to the disk server without getting that prompt.

    Regarding #2, verify the Rushd user's password is set to never expire.

    Regarding #3, this is for older versions of rush only. In versions predating Rush 102.40f, if a trailing slash is specified on the directory name, it would complain it can't find the directory. Remove the trailing slash from your "Log Dir" prompt after you browse to the log directory. As of 102.40f, rush strips the trailing slash before doing the test. The need to do this is due to a bug in Microsoft's 'access(2)' and 'stat(2)' functions not being able to handle trailing slashes on directory names.

  What's the best way to verify all the daemons are running?  
    Use:

      rush -ping +any

    This 'pings' all the daemons in the $RUSH_DIR/etc/hosts file with a TCP message.

    If the daemon isn't running, tail(1) the daemon's log file in $RUSH_DIR/var/rushd.log.

  How do I stop/start the daemons? (Unix/NT)  

    Stop/Start Rushd
    Irix                /etc/init.d/rush stop
                        /etc/init.d/rush start
    Linux/RedHat 6.x    /etc/rc.d/init.d/rush stop
                        /etc/rc.d/init.d/rush start
    Windows             net stop rushd
                        net start rushd

    All the daemons can be stopped via:

    rush -dexit +any

  Is there an example boot script I can use to invoke rush?  

  Is there a way to run 'rush -online' automatically when someone logs out?
  
    Yes; when a user logs out of the window manager, the sysadmin can configure the following files to run 'rush -online':

      Online At Logout
      Irix                /usr/lib/X11/xdm/Xreset
      Linux/RedHat 6.x    /etc/X11/xdm/TakeConsole
      Mac OSX             See the Apple documentation on "-LogoutHook".

    A literal example of what should be added to these files would be:

    
        /usr/local/rush/bin/rush -online
        logger -t RUSH "Rush online (user logout)"
    	

    Use of logger(1) is optional; it leaves an audit trail in the syslog. Include the full path to logger(1) if security is an issue.

  Is there a way to run 'rush -online' automatically when someone's screensaver pops on?
  
    There probably is, but I don't know how to do it.

    If you have any suggestions on how to do it on various platforms, please send them to:

      Greg Ercolano

  What kinds of security issues are there with rush?  
  • Be sure your rush network is firewalled from the internet. Rush is not designed to run on machines that are open to the internet.

    If you want Rush to work on machines separated by the internet (eg. a remote farm), be sure all traffic passing through the internet uses encrypted connections, such as a VPN, to prevent unwanted users from tapping into the network connections.

  • To avoid root loopholes, be sure all subdirs in the path to the setuid binaries and config files have tight permissions, e.g., if rush is installed in /usr/local/rush/bin:


chmod go-w /usr \
	   /usr/local \
	   /usr/local/rush \
	   /usr/local/rush/bin \
	   /usr/local/rush/bin/* \
	   /usr/local/rush/etc \
	   /usr/local/rush/var \
	   /usr/local/rush/var/*

chmod 4755 /usr/local/rush/bin/rush
chmod  755 /usr/local/rush/bin/rushd

chown 0:0 /usr/local/rush/bin/rush \
	  /usr/local/rush/bin/rushd
	

  • By default, rush uses reserved port 696 to communicate udp/tcp packets. For secure networks, make sure users do not have root access, to prevent renegade software from exploiting the port.

  • If security is an issue at your site, be sure to check ALL settings, esp. UidRange and GidRange. Also, correctly configure AdminUser. At the very least, read about all of these before accepting the defaults.

    If you want to make changes to these settings, see the Rush Configuration File documentation for more info.

  • Rush daemons will not run jobs as a uid or gid of root. If you really want a job to run as root, make the commands the job invokes setuid, or use 'sudo'.

  • Rush daemons will only trust remote machines that are configured in its host list. Rush will log all connection attempts from machines not configured in the hosts file. Sysadmins can grep the rushd.log files for the string 'SECURITY' to detect security related problems.

  • The 'rush -push' feature helps sysadmins easily distribute the rush.conf, license.dat, hosts, and other such files. This feature can be disabled via the rush.conf file if it is considered a security loophole.
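
    The 'chmod go-w' advice in the list above can be spot-checked with a few lines of shell. This is just a sketch under stated assumptions: 'loose_perms' is a hypothetical helper (it is not part of rush), and it uses only POSIX find(1) primaries. It prints each path given on the command line that is writable by group or other; no output means the paths are tight. Note that a symbolic link is reported based on the link's own mode, not its target's.

```shell
#!/bin/sh
# Hypothetical helper (not part of rush): print each given path that is
# writable by group or by other. -prune keeps find from descending, so
# only the named paths themselves are examined.
# Usage: loose_perms PATH...
loose_perms() {
    for p in "$@"; do
        find "$p" -prune \( -perm -020 -o -perm -002 \) -print
    done
}
```

    For a default install, one might run 'loose_perms /usr /usr/local /usr/local/rush /usr/local/rush/bin /usr/local/rush/etc /usr/local/rush/var' as a quick audit.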

  How do I update changes to the rush hosts file (or rush.conf file) to the network?
  
    Normally you should use 'rushadmin(1)' or 'rush -push hosts +any', as these tools will release across platforms easily and quickly.

    However, if the rush daemons are not yet running (e.g. you are setting up a new installation) then you must use operating system commands to copy the files to the network.

    Under unix, you should use rdist(1). Some examples:

    
        # SEND A NEW rush.conf (csh syntax)
        foreach i ( `awk '/^[a-z]/{print $1}' /usr/local/rush/etc/hosts` )
            rdist -c /usr/tmp/newconf ${i}:/usr/local/rush/etc/rush.conf
        end
    
        # SEND A NEW RUSH hosts (csh syntax)
        foreach i ( `awk '/^[a-z]/{print $1}' /usr/tmp/newhosts` )
            rdist -c /usr/tmp/newhosts ${i}:/usr/local/rush/etc/hosts
        end
    	

    Under windows, you can use COPY. Use UNC paths to specify the remote directories, assuming their C: drives are 'shared' correctly:

    
        for %i in ( nt1 nt2 nt3 ) do copy c:\rush\etc\hosts \\%i\c\rush\etc   
    	

  Is there a way to see whose jobs are bumping whom?  
    Grep the $RUSH_DIR/var/rushd.log file for BUMP messages.

  Is there a way to see who's changing other people's jobs?  
    Grep the $RUSH_DIR/var/rushd.log file for SECURITY messages.
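
    For a quick tally of BUMP or SECURITY messages, a small wrapper around grep(1) can be handy. This is only a sketch: 'count_log_msgs' is a hypothetical helper, the default log path assumes rush is installed in /usr/local/rush when $RUSH_DIR is unset, and nothing is assumed about the layout of the log lines beyond the keyword appearing somewhere in them.

```shell
#!/bin/sh
# Hypothetical helper (not part of rush): count the lines in a rushd.log
# that contain a keyword such as BUMP or SECURITY (plain substring match).
# Usage: count_log_msgs KEYWORD [LOGFILE]
count_log_msgs() {
    # grep -c exits non-zero when the count is 0; '|| true' hides that status.
    grep -cF -- "$1" "${2:-${RUSH_DIR:-/usr/local/rush}/var/rushd.log}" || true
}
```

    eg. 'count_log_msgs SECURITY' prints the number of matching lines in the local daemon's log; use plain grep to read the messages themselves.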

  Can rush be told to use a different network interface, other than the machine's hostname?
  
    Yes. In the rush hostlist, the hostname can actually be a pair of hostnames separated by a ':', e.g., tahoe:tahoe-eth.

    The name on the left of the ':' is the familiar hostname(1) of the machine, and the name that follows the ':' is the alternate network interface you want to use.

    See also the Hosts File: Hostname section on the hostname field.

  Where should I get perl for Windows?  
    It is highly recommended you use ActiveState Perl.

    It's definitely the best. Both well integrated and documented specifically for the Windows platform. Highly cross-platform compatible, with excellent Windows-specific modules and many of the standard internet modules, including Mail/FTP/NNTP, etc.

    I've personally tested and used it extensively in various production environments and have found it to be the most stable perl available.

    It's a free download.

  Where can I get rsh/telnet daemons for Windows?  

    I have personally evaluated two products, Denicomp's rsh daemon and Georgia SoftWorks' telnet server, and found them both useful.

    Regarding Denicomp and rsh, it lets you run simple commands on the remote machines using NT's own rsh(1) client. It supports 'rsh hostname command', but does not support 'rsh hostname'. In other words, you can't strike up an interactive session. You get a limited trial to use the software for free, then if you like it you should buy it.

    Regarding Georgia SoftWorks' telnet server, I have to say it's impressive what it does. You can run interactive dos applications that even do direct screen memory access, and the results will look correct on the telnet client. It is compatible with unix telnet clients as well as NT clients. Unfortunately, the software is very expensive, but you get a 30 day trial to test it out.

    There is also freeware available. Most of those I've evaluated have extreme limitations, or are easily broken.

    The NT Resource Kit from Microsoft also comes with a telnet server, though I've never tried it.

  Windows: is there a way to restart the rushd service as a normal user?  
    There are two ways I know of.

    I. If you're running under Win2K, you can use the new 'runas' Win2K command. Similar to su(1) in unix, it lets you run commands as administrator. The following gives you a DOS shell with network administrator privileges regardless of who the currently logged in user is:

      runas /user:YOUR_DOMAIN\Administrator cmd
      Password:

    In the DOS shell that appears, run 'net stop rushd' and 'net start rushd'.

    II. Use your domain controller's Remote Services administration software. With a Win2K server:

    1. Start | Programs | Administrative Tools | Active Directory Users And Computers
    2. Select 'Computers'
    3. From the list of computers, right click on the one to control, and choose 'Manage'
    4. Under the "Tree" tab, click "Services and Applications"
    5. Choose "Services", then choose "Rushd" and then use the usual Start/Stop controls

    There's surely something similar under WinNT Server.

  Windows: Is there a way to disable error dialogs?  
    Yes. But it is a registry tweak that affects the entire machine.

    See Microsoft Knowledge Base Article Q124873 for more info. To paraphrase, this article basically says, along with the usual risk disclaimers regarding manual editing of the registry:

    1. Run Registry Editor (REGEDT32.EXE).

    2. From the HKEY_LOCAL_MACHINE subtree, go to the following key:
      \SYSTEM\CurrentControlSet\Control\Windows

    3. Select the ErrorMode value.

    4. From the Edit menu, choose DWORD.

    5. Type 0 (zero), 1, or 2 to select the error mode. Regardless of this setting, all errors are written to the system log:

      • 0 - Error message box pops up (default).
      • 1 - No dialog for system errors only.
      • 2 - No dialog for system or other errors.

  I changed the ip addresses of a host on my network, and now rush can't talk to it?
  
    touch(1) all the rush hosts files on the network, so the daemons will reload their IP address caches. This can be done easily by re-releasing the rush hosts file with 'rush -push hosts +any'.

    For speed, the rush daemons cache hostname-to-ip-address lookups for all the hosts in the rush hostlist. This prevents load on your DNS or NIS servers, since rush makes numerous hostname/ip lookups while running jobs.

    When you change the IP address of one of the machines, the rush daemons need to be told to flush their caches. touch(1)ing all the $RUSH_DIR/etc/hosts files changes the date stamp of each file, causing the daemons to think a change was made; they then reload the file and flush their cache.

    You can check to see what any daemon has in its IP cache using 'rush -lah <hostname>'. This will show you the rush hostlist according to the daemon on the named machine, including its cached IP address lookup information.

  Sometimes I see '???.???.???.???' in 'rush -lah' reports. Is this bad?  
    Yes, it's bad.

    Rush will not operate correctly if it can't do hostname lookups for any machine in the rush hosts file.

    The question marks mean rush is unable to lookup a host's name. The more of these there are, the slower rush will operate. You will also notice sluggish or very slow operation in the GUIs, and in the generation of most rush reports.

    Probably other tools like 'ping' will be unable to lookup the hostname. Possible causes:

    • In an environment where static /etc/hosts files are used for name lookups, make sure the hostname is in all your /etc/hosts files.

    • In an NIS environment, same problem likely exists with your 'hosts' map.

    • Do not use DHCP, unless you have static assignments where the leases don't expire. Rush will get confused if a machine changes its IP address, eg. after a reboot. Use static IPs.

    • In a DNS environment, either your DNS server is not responding, not configured, or the host is not in your DNS. Use nslookup(1) to debug the problem.

    • In a Windows environment where WINS is used to do hostname lookups, if the machine is down, WINS can't do a hostname lookup for it. To solve this problem, you can do any *one* of the following:

      • Make *static* IP entries for all rush machines on your WINS server, so the hostnames still lookup even if the machine is down. Use the 'WINS Manager' on your PDC.

      • Maintain static hosts files on the rush machines. Windows has a unix-like hosts file "C:\WINNT\SYSTEM32\DRIVERS\ETC\HOSTS". If it is set up to contain IP-to-hostname entries for all the rush hosts, it can be copied to all the machines to ensure hostname lookups never fail.

      • Get away from WINS, and use DNS with static IP-to-hostname lookups. This will ensure hostname lookups work even when hosts are down.

  Rush is acting slow; reports take a long time, and the GUI is sluggish. What's wrong?
  
    Run 'rush -lah' and 'rush -lah localhost' to see if it reports "???.???.???.???" for the ip address of any hostnames. If so, you are having a name lookup problem, and that's causing the problem.

    This is especially a problem on Windows networks if you use WINS instead of DNS. WINS can't do hostname lookups for a machine that is down. A good reason NOT to be lazy, and depend on WINS to dynamically keep track of things.

    To solve this problem, see the answer to "Sometimes I see '???.???.???.???' in 'rush -lah' reports" above.

  Can I use DHCP on machines running rush?  
    It is not advised.

    You can do it if you set it up so the leases never expire.

    Rush will not operate correctly if the IP addresses of machines change randomly, or change when they reboot. The best thing to do is assign static ip addresses to all machines running rush.

  How do I know how many rush licenses are checked out?  
    It would be the number of hosts you have in your rush hosts list, which should be the same on all machines.

    So basically, it's the number of lines in the rush hosts file, not counting the comments. Or the number of hosts in 'rush -lah'.

    Rush checks licenses out on boot; it does not check out/check in while the system is running.
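
    Counting the hosts by hand gets tedious on a large network, so here is a minimal sketch of the line count described above. 'count_rush_hosts' is a hypothetical helper, not a rush command. It assumes comments in the rush hosts file start with '#', that each remaining non-blank line is one host, and that rush lives in /usr/local/rush when $RUSH_DIR is unset.

```shell
#!/bin/sh
# Hypothetical helper (not part of rush): count the non-blank, non-comment
# lines in a rush hosts file. Comments are assumed to start with "#".
# Usage: count_rush_hosts [HOSTSFILE]
count_rush_hosts() {
    awk '{ sub(/#.*/, "") }        # drop comments
         NF > 0 { n++ }            # count lines that still have content
         END { print n + 0 }' \
        "${1:-${RUSH_DIR:-/usr/local/rush}/etc/hosts}"
}
```

    The number printed should agree with the number of hosts shown by 'rush -lah'; if it doesn't, the hosts files are probably out of sync.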

  Rushtop doesn't show cpu use for some of my windows machines?  
    Either the RUSHD service on those machines isn't running, or it is running, but is configured to run as a user that doesn't have local admin privilege. RUSHD must run with local admin privilege in order to access the machine's cpu usage, otherwise RUSHD can't report it to 'rushtop'.

  Maya renders fail with '--- FAILED: EXITCODE=128'?  
    This is not an issue with rush; this is a Maya licensing problem on only Windows machines, where two different users can't run Maya on the same machine, even though licenses are available to do so. This is a bug in maya's licensing for windows, but workarounds exist.

    A complete description of the problem and possible workarounds can be found in the maya issues document.

  Rendering on windows, I get 'CreateProcess(perl ...): The system cannot find the file specified.'?
  
    You'll see this error in the daemon log for a windows machine.
    You may also see it in the frame logs, and/or in the NOTES section of the 'Cpus' report for your job for cpus that are windows machines.

    The cause is either 'perl' is not installed on the machine in question, or perl was recently installed, without restarting the rush daemon.

    If you install perl while the rush daemon is running, you must restart the daemon for it to pick up the addition to the system environment's PATH variable, eg:

            net stop rushd
            net start rushd
       
    Then requeue the frames and try again.

  Rushtop(1) sometimes creates a lot of Arp traffic. Why? And can this be prevented?  
    rushtop(1) can create a lot of Arp traffic if many of the machines it is trying to contact are powered off.

    One arp request is sent for every machine that is down, so if 20 machines are down, 20 arp packets will be sent every time rushtop(1) tries to update its bar graph.

    In version 102.40h and up, an exponential backoff algorithm was added to 'rush -status' (which rushtop(1) uses), to slowly pull back transmissions to machines that appear to be down.

    To increase the backoff times, you can tune the rush.status_backoff_max value to be a higher number than the default value of 15, which puts a maximum time between transmissions of roughly 45 seconds.

    If a series of machines are going to be down for some time, it is best to change the '+any' entry for these machines in the rush/etc/hosts file to '+offline' instead, so that user invocations of rushtop(1) don't try to contact these machines. rushtop(1) by default uses '+any' to determine which hosts to show in its bargraph. Changing the hosts to +offline effectively removes the machines from the "+any" hostgroup, preventing rushtop from trying to contact them.
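
    For readers unfamiliar with the term, the general idea of a capped exponential backoff can be sketched in a few lines of shell. To be clear, this is a generic illustration and NOT rush's actual algorithm: the function name 'backoff_delay', the starting delay of 1, the doubling factor, and the units are all assumptions made for the example, and the cap merely plays the role that rush.status_backoff_max plays above.

```shell
#!/bin/sh
# Generic illustration only (not rush's implementation): the delay starts
# at 1, doubles after each failed attempt, and is never allowed above MAX.
# Usage: backoff_delay FAILED_ATTEMPTS MAX
backoff_delay() {
    attempt=$1
    max=$2
    delay=1
    while [ "$attempt" -gt 0 ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        attempt=$((attempt - 1))
    done
    # Clamp in case the last doubling overshot the cap.
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    echo "$delay"
}
```

    With a cap of 15, successive failures give delays of 1, 2, 4, 8, then 15 from there on. The point is that hosts that stay down get contacted less and less often, up to the configured maximum.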

  Error 1058: The specified service is disabled and cannot be started.  
    You might see this while trying to start the Rushd service.

    Cause: The service that you are trying to start has been disabled for the current hardware profile.

    Resolution: Go into the Services for Rushd, and enable the Hardware Profile (for whatever reason, someone or something disabled it).

    1. Go to Start | Settings | Control Panel | Services | Rushd | Logon As
    2. Go to the "HW Profiles" window, select the current hardware profile and click Enable.
    3. Click OK, then try starting the Rushd service.

  c:/rush/var/.rushd.LCK: Invalid argument (can't open daemon lockfile)  
    You might get this in the rushd.log just after copying the C:/RUSH directory to a new machine from some other machine, and starting the daemon.

    To fix this, run these commands as Administrator:

        net stop rushd
        del c:\rush\var\.rushd.LCK
        net start rushd
        

    Once repaired, this shouldn't re-occur; the problem was caused by the copy operation.

    The most likely cause was the Rushd on the source machine had the .rushd.LCK file open at the time the copy operation was in progress, copying weird 'file in use' permissions for that file to the new machine.

    Using XCOPY to copy the directory seems to avoid the problem; usually the weird perms only happen when GUI drag+drop copying (from Microsoft's GUI) is used. See the 'Network Install' instructions for the recommended XCOPY command to copy the C:\rush directory to a network of machines.

  Starting rushd under WinNT it says 'can't find PDH.DLL' or 'PDH.LIB'  
    Microsoft didn't include PDH.DLL in some releases of Windows NT.

    You can download the PDH file for winnt from this Microsoft article, or search the Microsoft Knowledge Base for article "Q284996".

    For general information on accessing Microsoft support files, see Microsoft Knowledge Base article Q119591.

  udp: iface bind(10.10.10.115:696): Address already in use  
    Please see the faq item "What does 'bind(): Address already in use' mean?"; it's the same thing.

  'xxx': valid host is NOT in rush hostlist  
    You'll get this error if remote machine 'xxx' is running the Rushd service, but is not in all the rush/etc/hosts files, causing those machines not to trust 'xxx', and print this error in the logs.

    Solution #1: If you want the machine on the rush network, then add the hostname to the rush hosts files.

    Solution #2: If you are trying to /remove/ that machine from the Rush network, then be sure to /disable/ the daemon on that machine (Windows: disable the service, Unix: comment it out of the boot script, or run the 'rush/etc/bin/uninstall' script)

    Typical causes:

    1. The host is simply not listed in the rush/etc/hosts file
    2. The rush hosts files are out of sync on some machines
    3. The host was supposed to be removed from rush, but the service wasn't disabled.
    4. There is a problem with either forward or reverse lookups for the host. eg. the hostname is resolving to one IP address, but the reverse lookup for that number is some other hostname. (Use 'nslookup' to debug)

  LICENSE client x.x.x.x sent us garbage(i)  
    You'll see this on the license server, where x.x.x.x is the IP address of one of the remote rush hosts.

    The problem is usually one of the following:

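    • The remote machine has more than one network interface, and needs the 'hostname:interface' pair in the rush hostlist described in "Can rush be told to use a different network interface, other than the machine's hostname?".
    • The remote machine is a recently set up redhat linux machine with the 'Reverse lookup for 'myhost' returned 127.0.0.1 (loopback address)' problem.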

  can't get etheraddr[#]: xxx  
    You'll see this only on Windows machines. The cause is you don't have NetBIOS enabled. Check your network properties:

    1. Right click on Network Neighborhood, choose Properties

    2. 'NetBIOS' should appear under the "Services" tab. If not, use "Add" to add it. You shouldn't need a disk, but it might prompt for one. Just point it to your C: drive.

    3. 'NetBIOS' should appear under the "Bindings" tab. And your ethernet card should appear bound to NetBIOS.

    Once NetBIOS is enabled, you'll likely be asked to reboot. On reboot, Rushd should then start OK; check your log.

    Usually, there is no reason to have NetBEUI or IPX enabled under networking.

    This problem is often seen with 3Com cards.

  sendto(host:696): Message too long: to host  
    This is fixed in 102.31p and up.

    You'll see this error on Mac OSX with version 102.31n and previous when you try to use 'rush -push' from the mac, or try to use 'rushadmin' from the mac.

    The workaround is to add the following line to the 'start' section of the /System/Library/StartupItems/Rush/Rush boot script:

           sysctl -w net.inet.udp.maxdgram=64000
       
    Then reboot; the problem should go away. This should not be needed if you're running rush version 102.31p and higher, as the problem should already be fixed in all versions after that.

    Note that if you run the above sysctl command from the command line as root, you can fix the problem right away, eg:

           [root@macosx] # rush -push rush.conf ontario
           rush: sendto(ontario:696): Message too long: to ontario
    
           [root@macosx] # sysctl -w net.inet.udp.maxdgram=64000
           net.inet.udp.maxdgram: 9216 -> 64000
    
           [root@macosx] # rush -push rush.conf ontario
           ontario[rush.conf]: OK
       
    ..but if you reboot the machine, the problem comes back. This is why the fix is made to the boot script to have it stay in effect.

    You do not need to do this unless you are running a release older than 102.31p

  The service did not start due to a logon failure  
    Windows only.

    You'll get this error if the Rushd Service's user password expired, or was recently changed by someone.

    If the user's password expired, set it up to never expire. If someone changed the password, be sure to configure that new password for the Rushd Service's account settings in the Service Manager.

    You'll get this error while trying to start the rushd service from the Service Manager or from the DOS command line, e.g.:

    	C:\> net start rushd
    	System error 1069 has occurred.
    	The service did not start due to a logon failure.
        

  My DOS shell can't find programs like 'more', 'ping' and 'xcopy'  
    Installs of Houdini have been known to break the default Windows path, such that one no longer can run standard programs like ping, notepad, nslookup, more, etc.

    Fix your system path.
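
    To see what is missing from the path, a small script can help. This is a hedged Python sketch, not a Rush tool: missing_system_dirs() is a hypothetical helper, and the C:\WINDOWS default is an assumption (older NT/2000 installs typically use C:\WINNT; the SystemRoot environment variable holds the real value).

```python
import ntpath

def missing_system_dirs(path_value, system_root=r"C:\WINDOWS"):
    """Return the standard Windows directories that are absent from a PATH string."""
    # Programs like ping, more, xcopy and nslookup live in <SystemRoot>\system32.
    wanted = [ntpath.join(system_root, "system32"), system_root]

    def norm(p):
        # Ignore case, surrounding quotes and trailing backslashes when comparing.
        return ntpath.normcase(ntpath.normpath(p.strip().strip('"')))

    present = {norm(p) for p in path_value.split(";") if p.strip()}
    return [d for d in wanted if norm(d) not in present]

# Example use on the affected machine:
#   import os
#   print(missing_system_dirs(os.environ["PATH"], os.environ.get("SystemRoot", r"C:\WINDOWS")))
```

    Any directory the sketch reports should be added back to the system PATH variable in the System control panel.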

  Under Windows, why are UNC paths better than drive letters?  
    This question is answered in the TD faq here.

  How do I make UNC paths work the same on Windows and Unix?  
    This question is answered in the TD faq here.

  Is there a way to silence the 'Checkpoint wrote' messages?  
    Yes. These messages appear in Rush version 102.41 and up.

    You can disable them in the rush.conf file; just change the checkpoint.log setting from '1' to '0'.

    Don't forget to push the rush.conf file to the network, eg. 'rush -push rush.conf +any'.

  Can Rush be installed in a directory other than '/usr/local/rush'?  
    It is not recommended.

    Usually the intention is to install the Rush directory on an NFS drive, which is absolutely not recommended; it is a grand mistake to install network daemons on an NFS drive; it completely defeats the fault tolerance of the system.

    But if you want to install rush on a different local directory, you can:

    • Set the $RUSH_DIR environment variable to point to the directory. This variable must be set a) before the daemon is started and b) in all user environments.
    • Edit the 'rush/etc/rush.conf' file, and change any references to /usr/local/rush to be the path you're using instead.
    But you will make things easier on yourself and your users if you just install rush in the usual /usr/local/rush location.

  Can I install rush on an NFS server, to avoid installing locally on each machine?  
    No!

    Please see step#1 of the install instructions, which warns:

      WARNING: As with all daemons and their config files, do *not* install the rush directory or binaries on an NFS mounted drive. Keep rush binaries local to each machine.

    The main reasons:

    1. Having executables demand page over NFS means the daemons will be dependent on the health of the NFS server. NFS 'hangs' can freeze the daemons in an unkillable state, and will make them unresponsive to the interactive apps.

      This also kills the whole fault tolerant 'distributed' design of Rush, which is to not be dependent on the health of any one server. The daemons are carefully written so as not to be dependent on any other server, keeping them responsive even when NFS servers or even license servers are rebooting.

    2. Rushd reads/writes files in the rush/var directory. If the rush dir is NFS mounted, you absolutely must make provisions to ensure at least the var directory is local on each machine, or the daemons will all write over each other's logs.
    3. File locking would be the only way to ensure integrity with the rush data files. NFS file locking is very broken in all releases of Unix (Sun, Irix, Linux, everything). It hangs programs in an unkillable state when it fails, and it fails often under medium and heavy load. File locking doesn't fit the idempotent 'stateless' design paradigm of NFS, so it is broken by design.
    If you disagree with the above, then at LEAST make sure the rush/var directory is local; the daemons read/write files in that directory, and simply will not operate correctly unless that dir is local, or at very least unique on each machine.
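
    One way to confirm that the rush directory (and rush/var in particular) really is local is to look up which mounted filesystem holds it. The sketch below is Python and Linux-oriented: it assumes the mount table is available in the /proc/mounts format (device, mount point, type, options), and the function names and sample mounts are hypothetical. Other Unix flavors would need their own mount table source.

```python
import posixpath

def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount holding 'path', or None if unknown."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # The longest mount point that is a prefix of the path wins;
        # on a tie, the later entry wins, since later mounts cover earlier ones.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type

def is_nfs(path, mounts_text):
    """True if 'path' lives on a filesystem whose type starts with 'nfs' (nfs, nfs4)."""
    fstype = filesystem_type(path, mounts_text)
    return fstype is not None and fstype.startswith("nfs")

# Example use on a Linux render node:
#   with open("/proc/mounts") as f:
#       mounts = f.read()
#   for d in ("/usr/local/rush", "/usr/local/rush/var"):
#       print(d, "NFS!" if is_nfs(d, mounts) else "local")
```

    If either directory reports as NFS, the warnings above apply; at the very least rush/var should come back as local.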

  I'm getting 'select() on connect(): (111) Connection refused' when I submit jobs, or use 'rush -ping'  
    This means the Rushd daemon or service isn't running.

    Look in the rush/var/rushd.log file for errors from the daemon. Most likely causes:

    • This machine's hostname is not in the rush/etc/hosts file
    • The rush/etc/hosts file is not the same on all machines (it should be)
    • This machine is not in the license server's rush/etc/hosts file

    NOTE: If this is a Windows machine, and the rushd.log file is /empty/, even when you try to manually start the daemon, then try these commands as administrator:

            net stop rushd
    	del /f c:\rush\var\.rushd.lck
    	net start rushd
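
    To tell 'daemon not running' apart from other network trouble without involving rush itself, a plain TCP connection attempt is enough. This is a minimal Python sketch; probe() is a hypothetical helper. Port 696 appears in the sendto() error message elsewhere in this FAQ, but that rushd accepts TCP connections on that port is an assumption here, so verify the port for your installation.

```python
import socket

def probe(host, port, timeout=3.0):
    """Return 'open', 'refused', or 'unreachable: <reason>' for a TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The machine answered, but nothing is listening on that port.
        return "refused"
    except OSError as e:
        # Timeouts, name lookup failures, no route to host, etc.
        return f"unreachable: {e}"

# Example use (hostname is a placeholder; 696 is an assumed port, see above):
#   print(probe("ontario", 696))
```

    A result of 'refused' matches the error in this question: the host is up but no daemon is listening. An 'unreachable' result points at the network or name lookups instead.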
        

  What are the hardware/software pre-requisites for Rush?  
There are some prerequisites for Rush, especially on Windows and Mac networks. Rush assumes there's a professionally configured network; hostname/ip lookups all working, consistently configured user accounts, file servers for the common data with correctly configured access permissions.

Operating systems now come with 'zero administration' defaults (WINS, Rendezvous, DHCP) which lead some into thinking there's very little that needs to be configured for a working network. But these defaults are NOT suitable for production; 'zero admin' networks will work sporadically, which is fine for 'home' networks, but not for production.

  • Use DNS for hostname/ip lookups
    Do /not/ use the Microsoft "WINS" or Apple "Rendezvous" defaults; these are 'zero administration' name lookup systems that are weak protocols, and are not fault tolerant. Their biggest problem is being unable to do name lookups for machines that suddenly turned off, which will cause network wide problems during rendering. DNS does not have this problem. Static /etc/hosts will work too (lmhosts on windows), but that becomes unmanageable on networks larger than a dozen or so machines.
  • Use static IP addresses for the machines
    Do not use DHCP (ie. "Automatic" IP assignments). If you must use DHCP, make sure the leases are set to never expire.
  • Verify all machines can 'ping' each other by hostname
    DNS will make this possible on all platforms. All machines must be able to look up each other's IP addresses based on hostnames to communicate correctly.
  • Unix machines must have valid user accounts
    All users submitting jobs should have valid accounts on the rendering machines with consistent uid/gid values. Example: If "fred" has a uid/gid of 100/20 on one machine, his account should have those same values on all machines. (See exception below)
  • Macs must have a statically configured hostname in /etc/hostconfig
    Make sure there's a "HOSTNAME=yourhostname" entry in /etc/hostconfig. Either add the line (Tiger), or change the existing "HOSTNAME=-AUTOMATIC-" (Panther), and reboot before installing rush. This avoids the 'hostname.local' Rendezvous stuff that causes intermittent problems.

The requirements for DNS and static IPs are necessary for stability; Rush by definition makes machines into 'cpu servers' for rendering. As 'servers', these machines' IP addresses must be consistent through reboots for Rush to be able to correctly handle failure tolerance if they are crashed/rebooted during communication transactions. By coming up with the same IP address, Rush will be able to resume communications quickly.

The requirement for static IPs is similar to the reason Web Servers and File Servers must have static IPs; so that clients can find them quickly when rebooted. Arp and IP caches are affected by changes in IP addresses; IP changes can cause ~15 minutes of confusion while the Arp tables expire.

The Unix requirement for consistent uid/gid values is needed for consistent file system permissions when network rendering. Normally Rush runs renders on unix machines as the user that submitted the job. Exception: you /can/ force Rush to run all renders as a single user (usually a user called 'render' or 'rush'; the name is up to you), in which case only that user needs a valid account on all machines.

  How do you enable Rush mail messages to work under windows?  

  can't open port lock file '/usr/local/rush/var/nextport: Permission denied  
    Make sure rush/rushd are owned by root and have the SUID bit set. eg:
          chown 0.0  /usr/local/rush/bin/{rush,rushd}
          chmod 4755 /usr/local/rush/bin/{rush,rushd}
      

    When you extracted the rush distribution tar file, you probably forgot to specify the tar(1) 'p' flag.
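
    The ownership and SUID requirement can be checked from a script before starting the daemon. This is a small Python sketch using the standard os and stat modules; the helper names are hypothetical, and the binary paths in the usage comment are the default locations shown in the chown/chmod commands above.

```python
import os
import stat

def suid_root_problems(st_uid, st_mode):
    """Return a list of problems for a binary that must be owned by root with SUID set."""
    problems = []
    if st_uid != 0:
        problems.append("not owned by root (uid 0)")
    if not st_mode & stat.S_ISUID:
        problems.append("SUID bit not set")
    return problems

def check_binary(path):
    """Stat 'path' and report ownership/SUID problems (empty list means OK)."""
    st = os.stat(path)
    return suid_root_problems(st.st_uid, st.st_mode)

# Example use on a Unix render node:
#   for name in ("rush", "rushd"):
#       path = "/usr/local/rush/bin/" + name
#       print(path, check_binary(path) or "OK")
```

    An empty list for both binaries corresponds to the result of the chown/chmod commands above.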

  rush: iface bind(x.x.x.x): Cannot assign requested address  
    The machine is probably confused about its own IP address.

    Run 'hostname' to determine the host's name, then 'ping [the_hostname]'.

    Verify that the IP address ping prints in response is the machine's actual network IP address, which you can check with the command:

            Windows: ipconfig /all
           Unix/OSX: ifconfig
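
    The same comparison can be scripted: resolve the machine's own hostname, then try to bind a socket to the resulting address. This is a Python sketch of the idea, not how Rush itself performs the bind; can_bind() and check_own_address() are hypothetical names. If the hostname resolves to an address that is not configured on any local interface, the bind fails in the same way the rush error reports.

```python
import socket

def can_bind(ip):
    """True if a UDP socket can be bound to 'ip' on this machine (any free port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.bind((ip, 0))
        return True
    except OSError:
        # Typically 'Cannot assign requested address' when no interface owns the IP.
        return False
    finally:
        s.close()

def check_own_address():
    """Resolve this machine's hostname and test whether the address is really local."""
    name = socket.gethostname()
    ip = socket.gethostbyname(name)
    return name, ip, can_bind(ip)

# Example use:
#   name, ip, ok = check_own_address()
#   print(name, "resolves to", ip, "-", "OK" if ok else "NOT a local address")
```

    If the sketch reports that the address is not local, compare it against the ipconfig/ifconfig output and correct the DNS or hosts entry for the machine.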
      

  unknown uid 100  
    You'll see this error usually in the IRUSH 'Frames' report (rush -lf) or 'Cpu' report (rush -lc).

    You'll get this error under one of two circumstances:

    1. When your unix machines don't have consistent uid/gid values for users
    2. When you submit a job from Windows that uses unix machines, and haven't yet configured ntrushuid and ntrushgid values in the rush.conf file.
Problem #1: Inconsistent UID/GID Description + Solution
    Description: When a unix machine runs a render for a job submitted from another unix machine, it will try to run that job as the user that submitted the job, using the same uid/gid values. NFS also works this way. If the user doesn't exist, or there is no matching account on the render nodes for the uid/gid values of the user who submitted the job, renders will fail with this error.

    Solution: fix the uid/gid values on your machines so each user has the same uid/gid value on each machine. eg. if 'fred' has a uid/gid value of 105/100, then make sure his account has the same uid/gid values on /all/ machines.
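
    Finding the inconsistent accounts by eye gets tedious with more than a few machines. The following Python sketch compares passwd(5)-format account data collected from several hosts and reports users whose uid/gid differ between them. The function names and sample hosts are hypothetical, and how the passwd data is gathered (copying /etc/passwd files, 'getent passwd' output, etc.) is left open.

```python
def parse_passwd(text):
    """Map username -> (uid, gid) from passwd(5)-format text (name:pw:uid:gid:...)."""
    accounts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue
        try:
            accounts[fields[0]] = (int(fields[2]), int(fields[3]))
        except ValueError:
            continue  # skip entries without numeric ids, eg. NIS '+' lines
    return accounts

def inconsistent_accounts(passwd_by_host):
    """Return {user: {host: (uid, gid)}} for users whose ids differ between hosts."""
    seen = {}
    for host, text in passwd_by_host.items():
        for user, ids in parse_passwd(text).items():
            seen.setdefault(user, {})[host] = ids
    return {user: hosts for user, hosts in seen.items()
            if len(set(hosts.values())) > 1}

# Example use, with one passwd file copied from each machine:
#   data = {h: open("passwd." + h).read() for h in ("tahoe", "ontario")}
#   for user, hosts in inconsistent_accounts(data).items():
#       print(user, hosts)
```

    Note that this sketch only flags users whose values differ; a user who is missing entirely from one machine is not reported and would need a separate check.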

Problem #2: Windows -> Unix Description + Solution
    Description: When a job is submitted from a Windows machine, and asks to render on unix machines, Rush has to assign the job a uid/gid value, because Windows has no concept of these values.

    By default Rush assigns uid=100, gid=100. These are the default ntrushuid and ntrushgid values in the rush.conf file, which you should customize.

    If the unix machine tries to run a job with a uid/gid of 100, but the unix machine doesn't /have/ an account with that uid/gid, then Rush gives this error.

    Solution: The sysadmin should create a 'render' (or "rush") account on all the unix machines, making sure they're all using the same uid/gid values. Either assign that user a uid/gid of 100/100, or use your own custom uid/gid values and configure them for the ntrushuid and ntrushgid values respectively in the rush.conf file. The rush.conf file can be modified with 'rushadmin' and then pushed out to the network with the 'Send' button, or it can be manually edited in a text editor and then pushed to the network with 'rush -push rush.conf +any'.

    This procedure is also covered in the Windows Installation instructions.

    Caveats: The sysadmin must make sure the user permissions of the unix "render" user are compatible with the permissions of the Windows users, so that both platforms have the same permission to each other's files. This way when a windows job tries to render on unix, it uses the uid/gid of the "render" user to run the renders.

  rushd: error while loading shared libraries: libstdc++.so.5: cannot open shared object file..  
    You are probably trying to use rush's 'redhat9' distribution on RHE4 (Redhat Enterprise 4) or FC4 (Fedora Core 4). If so, be sure to install the 'compatibility libraries', in particular: rpm -ivh compat-libstdc++-33-3.2.3-47.3.i386.rpm.