RUSH RENDER QUEUE
(C) Copyright 1995,2000 Greg Ercolano. All rights reserved.
V 102.31m 02/02/02
Strikeout text indicates features not yet implemented


Systems Administrator Questions

  1. I'm getting "connect(myhost): Invalid argument" in my linux rushd.log?
  2. I'm getting 'rresvport(): Permission denied', what's that mean?
  3. What does 'bind(): Address already in use' mean?
  4. Rush can't find the user's log directory on submit, but it does exist!
  5. What's the best way to verify all the daemons are running?
  6. How do I stop/start the daemons? (Unix/NT)
  7. Is there an example boot script I can use to invoke rush?
  8. Is there a way to run 'rush -online' automatically when someone logs out?
  9. Is there a way to run 'rush -online' automatically when someone's screensaver pops on?
  10. What kinds of security issues are there with rush?
  11. How do I update changes to the rush hosts file (or rush.conf file) to the network?
  12. Is there a way to see whose jobs are bumping whom?
  13. Is there a way to see who's changing other people's jobs?
  14. Can rush be told to use a different network interface, other than the machine's hostname?
  15. Windows: Where can I get perl for Windows?
  16. Windows: Where can I get rsh/telnet daemons for Windows?
  17. Windows: is there a way to restart the rushd service as a normal user?
  18. Windows: Is there a way to disable error dialogs? (General Protection Faults, etc)
  19. I changed the ip addresses of a host on my network, and now rush can't talk to it?
  20. Sometimes I see "???.???.???.???" in 'rush -lah' reports. Is this bad?
  21. Rush is acting slow; reports take a long time, and the GUI is sluggish. What's wrong?
  22. Can I use DHCP on machines running rush?
  23. How do I know how many rush licenses are checked out?
  24. Rushtop doesn't show cpu use for some of my windows machines?
  25. Maya renders fail with "--- FAILED: EXITCODE=128"?
  26. On submit, I get 'logdir '//Xxx/yyy/logs/': no such file or directory', but the dir exists!
  27. Rendering on windows, I get 'CreateProcess(perl ...): The system cannot find the file specified.'?
  28. Starting rushd under WinNT it says 'can't find PDH.DLL' or 'PDH.LIB'

Common Errors in rushd.log

  1. bind(): Address already in use
  2. connect(myhost): Invalid argument
  3. udp: iface bind(10.10.10.115:696): Address already in use
  4. 'xxx': valid host is NOT in rush hostlist
  5. LICENSE client 157.166.34.210 sent us garbage(i)
  6. can't get etheraddr[#]: error message
  7. sendto(host:696): Message too long: to host
  8. CreateProcess(perl ...): The system cannot find the file specified.

I'm getting 'connect(myhost): Invalid argument' in my linux rushd.log?

    Your host is coming back with 127.0.0.1 as its ip address, instead of the network interface's actual address. Fix your machine's /etc/hosts file.

    This is a known problem with default installations of RedHat linux (7.1, etc.), caused by the installer.

    To test for the problem, open a shell on that machine and ping its own hostname. If the address returned is 127.x.x.x, fix your /etc/hosts file.

    
    you@tahoe % ping tahoe
    PING tahoe (127.0.0.1) from 127.0.0.1 : 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=0.5 ms
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=255 time=0.4 ms
    ^C
    
    you@tahoe % grep tahoe /etc/hosts
    127.0.0.1 tahoe localhost
    		

    ..this is wrong. Correct it by editing /etc/hosts, and making separate entries for 'localhost' and the machine's actual hostname and IP address, eg:

    
    root@tahoe # vi /etc/hosts
    root@tahoe # cat /etc/hosts
    127.0.0.1      localhost
    192.168.0.37   tahoe
    
    root@tahoe # ping tahoe
    PING tahoe (192.168.0.37) from 192.168.0.37 : 56(84) bytes of data.
    64 bytes from 192.168.0.37: icmp_seq=0 ttl=255 time=0.5 ms
    64 bytes from 192.168.0.37: icmp_seq=1 ttl=255 time=0.5 ms
    		
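
    The ping test above can also be scripted. What follows is only a minimal sketch and not part of rush: the function names are made up for illustration, and it assumes getent(1) is available to do the lookup (true on most linux systems, possibly not elsewhere).

```shell
# Hypothetical helpers, not part of rush.

# is_loopback ADDR: succeed if ADDR is in the 127.x.x.x loopback range.
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# check_own_hostname: look up this machine's own hostname and complain
# if it comes back as a loopback address. Assumes getent(1) exists.
check_own_hostname() {
    name=$(hostname)
    addr=$(getent hosts "$name" | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
        echo "$name: lookup failed"
        return 2
    elif is_loopback "$addr"; then
        echo "$name resolves to $addr: fix /etc/hosts"
        return 1
    fi
    echo "$name resolves to $addr: OK"
}
```

    Running 'check_own_hostname' on a machine with the problem should print the 'fix /etc/hosts' message.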

I'm getting 'rresvport(): Permission denied', what's that mean?
    Usually one gets this error in the context of running 'rush' from the command line:

    % rush -ping
    tahoe: rush: rresvport(): Permission denied

    Rush uses a reserved port to communicate with the daemon, and therefore needs to run SUID root.

    Make sure the SUID bit is on for the rush(1) binary, and the owner is root:

    chmod 4755 /usr/local/rush/bin/rush
    chown 0.0 /usr/local/rush/bin/rush
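
    To check the current state of the binary before changing it, something like the following sketch may help. The function names are hypothetical (not part of rush), and it assumes the third column of 'ls -ln' is the owner's numeric uid, which is the usual layout.

```shell
# Hypothetical helpers, not part of rush.

# owner_uid FILE: print the numeric uid of FILE's owner.
owner_uid() {
    ls -lnd "$1" | awk '{print $3; exit}'
}

# check_suid_root FILE: succeed only if FILE has the SUID bit set
# and is owned by root (uid 0).
check_suid_root() {
    if [ ! -u "$1" ]; then
        echo "$1: SUID bit is not set"
        return 1
    fi
    if [ "$(owner_uid "$1")" != "0" ]; then
        echo "$1: owner is not root"
        return 1
    fi
    echo "$1: OK"
}
```

    For example: check_suid_root /usr/local/rush/bin/rush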

What does 'bind(): Address already in use' mean?
    It usually means one of these things.

    1. This is an SGI, and the kernel's NFS is using the port
    2. Something else is using the port.
    3. Two or more rushd's are running. (Not likely in 102.31+)
    4. You recently stopped/started the daemon. The problem goes away by itself. (Not likely in 102.31+)

    #1 often occurs if you've just installed rush on an SGI for the first time, and the machine has been up for a while. 'netstat -an' will show a whole slew of UDP listeners on ports between 512 and 1024 all in sequence, one of them being port number 696, the one rush has been assigned by IANA. Some rogue kernel utility is causing this, probably NFS. Usually fuser(1) shows no process associated with the rogue UDP listeners because it's a kernel process. The easiest solution is to simply reboot; when rush starts on boot, it always secures the port it needs well before the kernel gets a chance to step on it.

    #2 Stop the rush daemon, and use 'netstat -an' to see if some other program is using rush's port (normally port #696; see your rush.conf file's serverport setting, in case your site uses a different port number). Look for open UDP or TCP connections on that port, either in the Local or Foreign address.

    If you see port #696 in the 'Foreign' address of the local machine, suspect hung clients on the remotes:

    • rsh over to the remote machine (ie. 'Foreign' host)
    • Kill any 'rush' client processes you see, eg. 'killall rush'
    • Back on the local machine, do a 'netstat -an' to verify the connections are gone or closing.
    • Restart the daemon once all 696 ports have closed

    If the local TCP or UDP port is in use, suspect some system daemon or other is using the port when it shouldn't. Use fuser(1) or similar utility to figure out which process is using the port, or simply reboot.

    If fuser(1) shows no process and it's an SGI, then see #1.
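
    Picking the relevant lines out of a long 'netstat -an' listing can be scripted. The filter below is a sketch only, with a made-up function name; it assumes addresses are printed either as 'addr:port' (linux style) or 'addr.port' (Irix/BSD style), and prints any line where the local or foreign address ends in the given port.

```shell
# Hypothetical helper, not part of rush.

# port_users PORT: read 'netstat -an' output on stdin and print the
# lines whose local or foreign address ends in ':PORT' or '.PORT'.
port_users() {
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) { print; next }
        }'
}
```

    For example: netstat -an | port_users 696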

    #3 only happens in older versions of rush (pre-102.31) where more than one rushd might be running. Newer versions of rush use a lock file that prevents this.

    Only one daemon should have a PPID of 1 (Parent Process ID). If there's more than one with a PPID of 1, kill the one(s) with the higher PID.
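
    The PPID rule above can be applied to 'ps' output mechanically. This is a sketch under assumptions: the function name is made up, the input is expected to come from 'ps -eo pid,ppid,comm', and the daemon is assumed to show up with the command name 'rushd'. It only prints candidate PIDs; it does not kill anything.

```shell
# Hypothetical helper, not part of rush.

# extra_rushd_pids: read 'ps -eo pid,ppid,comm' output on stdin and
# print the PIDs of rushd processes whose PPID is 1, leaving out the
# one with the lowest PID (the one to keep).
extra_rushd_pids() {
    awk '$2 == 1 && $3 == "rushd" { print $1 }' | sort -n | sed 1d
}
```

    For example: ps -eo pid,ppid,comm | extra_rushd_pids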

    #4 is common only in the older versions of rush (pre-102.31), and occurs when you stop/start the rushd daemon. This problem fixes itself within 2 minutes automatically. The OS often keeps recently closed TCP listeners unavailable to other processes for a 90 second period. Rush will keep retrying to bind to the port, and eventually succeeds within 2 minutes.

On submit, I get 'rush: LogDir '//foo/bar': No such file or directory'
    ..the typical complaint being 'but the directory does exist!'.

    This is encountered on windows machines, and is caused by RUSHD being configured to run as a user that has no access to the network drive containing the log directory.

    When you submit a job, it is RUSHD that does the check to see if it can access the log directory. If the daemon can't access the directory, it responds with:

      rush: LogDir '//foo/bar/bla': No such file or directory

    If you log in as administrator on the workstation, and are prompted for a password to access the network drive (either on login, or when you try to access the log directory through a browser), then you have replicated the problem. RUSHD has no way to enter a password; you have to configure it to run as a user that has access to the network drives.

    The solution is to configure the server to allow the user RUSHD runs as to have access to the file server.

What's the best way to verify all the daemons are running?
    Use:

      rush -ping +any

    This 'pings' all the daemons in the $RUSH_DIR/etc/hosts file with a TCP message.

    If the daemon isn't running, tail(1) the daemon's log file in $RUSH_DIR/var/rushd.log.

How do I stop/start the daemons? (Unix/NT)
    Irix                /etc/init.d/rush stop
                        /etc/init.d/rush start

    Linux/RedHat 6.x    /etc/rc.d/init.d/rush stop
                        /etc/rc.d/init.d/rush start

    Windows NT          net stop rushd
                        net start rushd

    All the daemons can be stopped via:

      rush -dexit +any

Is there an example boot script I can use to invoke rush?
    Yes; see $RUSH_DIR/etc/S99rush.

Is there a way to run 'rush -online' automatically when someone logs out?
    Yes; when a user logs out of the window manager, the sysadmin can configure the following files to run 'rush -online':

    Irix /usr/lib/X11/xdm/Xreset
    Linux/RedHat 6.x /etc/X11/xdm/TakeConsole

    A literal example of what should be added to these files would be:

    /usr/local/rush/bin/rush -online
    logger -t RUSH "Rush online (user logout)"

    Use of logger(1) is optional; it leaves an audit trail in the syslog. Include full path to logger(1) if security is an issue.

Is there a way to run 'rush -online' automatically when someone's screensaver pops on?
    There probably is, but I don't know how to do it.

    If you have any suggestions on how to do it on various platforms, please send me email.

What kinds of security issues are there with rush?
  • To avoid root loopholes, be sure all subdirs in the path to the setuid binaries and config files have tight permissions, e.g., if rush is installed in /usr/local/rush/bin:


chmod go-w /usr \
	   /usr/local \
	   /usr/local/rush \
	   /usr/local/rush/bin \
	   /usr/local/rush/bin/* \
	   /usr/local/rush/etc \
	   /usr/local/rush/var \
	   /usr/local/rush/var/*

chmod 4755 /usr/local/rush/bin/rush
chmod  755 /usr/local/rush/bin/rushd

chown 0.0 /usr/local/rush/bin/rush \
	  /usr/local/rush/bin/rushd
		

  • By default, rush uses reserved port 696 to communicate udp/tcp packets. On secure networks, make sure users do not have root access, to prevent renegade software from exploiting the port.

  • Rush daemons will not run any job as a uid or gid less than 100. You can further restrict which uids/gids rush can run processes as via UidRange and GidRange or even ForceUid/ForceGid.

  • Rush daemons will only trust remote machines that are configured in its host list. Rush will log all connection attempts from machines not configured in the hosts file. Sysadmins can grep the rushd.log files for the string 'SECURITY' to detect security related problems.

  • The new 'rush -push' feature, which helps sysadmins release the rush hosts/rush.conf/license.dat files, can be disabled to close any suspected loopholes with such file distribution.

How do I update changes to the rush hosts file (or rush.conf file) to the network?
    You should use rdist(1), and the changed files will be picked up automatically by the daemons within a minute. Here are some examples:


    # SEND A NEW rush.conf
    foreach i ( `awk '/^[a-z]/{print $1}' /usr/local/rush/etc/hosts` )
        rdist -c /usr/tmp/newconf ${i}:/usr/local/rush/etc/rush.conf
    end

    # SEND A NEW RUSH hosts
    foreach i ( `awk '/^[a-z]/{print $1}' /usr/tmp/newhosts` )
        rdist -c /usr/tmp/newhosts ${i}:/usr/local/rush/etc/hosts
    end
		

    NOTE: When sending out new files, you must use rdist(1), and not cp(1) or rcp(1). rdist(1) uses a special 'tmp-file/rename' technique that prevents the daemon from parsing the file before it has finished being written.
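
    The 'tmp-file/rename' idea in the note above can be illustrated locally. This is a sketch of the general technique, not rdist's actual implementation, and the function name is made up. The copy goes to a temporary name in the same directory, and only the finished file is renamed into place. Since a rename within one filesystem is atomic, a reader sees either the old file or the complete new one, never a half-written file.

```shell
# Hypothetical helper illustrating the technique; not part of rush
# or rdist.

# install_atomically SRC DEST: copy SRC to a temp file next to DEST,
# then rename the finished copy over DEST.
install_atomically() {
    tmp="$2.tmp.$$"
    if ! cp "$1" "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    mv -f "$tmp" "$2"
}
```

    For example: install_atomically /usr/tmp/newconf /usr/local/rush/etc/rush.conf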

Is there a way to see whose jobs are bumping whom?
    Grep the $RUSH_DIR/var/rushd.log file for BUMP messages.

Is there a way to see who's changing other people's jobs?
    Grep the $RUSH_DIR/var/rushd.log file for SECURITY messages.

Can rush be told to use a different network interface, other than the machine's hostname?
    Yes. In the rush hostlist, the hostname can actually be a pair of hostnames separated by a ':', e.g., tahoe:tahoe-eth.

    The name on the left of the ':' is the familiar hostname(1) of the machine, and the name that follows the ':' is the alternate network interface you want to use.

    See also the Hosts File: Hostname section on the hostname field.

Where should I get perl for Windows?
    It is highly recommended you use ActiveState Perl.

    It's definitely the best: well integrated and documented specifically for the Windows platform, and highly cross-platform compatible, with excellent Windows-specific modules and many of the standard internet modules (Mail, FTP, NNTP, etc).

    I've personally tested and used it extensively in various production environments and have found it to be the most stable perl available.

    It's a free download.

Where can I get rsh/telnet daemons for Windows?

    I have personally evaluated both Denicomp's rsh daemon and Georgia SoftWorks' telnet server, and found them both useful.

    Regarding Denicomp and rsh, it lets you run simple commands on the remote machines using NT's own rsh(1) client. It supports 'rsh hostname command', but does not support 'rsh hostname'. In other words, you can't strike up an interactive session. You get a limited trial to use the software for free, then if you like it you should buy it.

    Regarding Georgia SoftWorks' telnet server, I have to say it's impressive what it does. You can run interactive dos applications that even do direct screen memory access, and the results will look correct on the telnet client. It is compatible with unix telnet clients as well as NT clients. Unfortunately, the software is very expensive, but you get a 30 day trial to test it out.

    There is also freeware available. Most of those I've evaluated have extreme limitations, or are easily broken.

    The NT Resource Kit from Microsoft also comes with a telnet server, though I've never tried it.

Windows: is there a way to restart the rushd service as a normal user?
    There are two ways I know of.

    I. If you're running under Win2K, you can use the new Win2K 'runas' command. It's similar to su(1) in unix; it lets you run commands as administrator. The following gives you a DOS shell with network administrator privileges regardless of who is currently logged in:

      runas /user:YOUR_DOMAIN\Administrator cmd
      Password:

    In the DOS shell that appears, run 'net stop rushd' and 'net start rushd'.

    II. Use your domain controller's Remote Services administration software. With a Win2K server:

    1. Start->Programs->Administrative Tools->Active Directory Users And Computers
    2. Select 'Computers'
    3. From the list of computers, right click on the one to control, and choose 'Manage'
    4. Under the "Tree" tab, click "Services and Applications"
    5. Choose "Services", then choose "Rushd" and then use the usual Start/Stop controls

    There's surely something similar under WinNT Server.

Windows: Is there a way to disable error dialogs? (General Protection Faults, etc)
    Yes. But it is a registry tweak that affects the entire machine.

    See Microsoft Knowledge Base Article Q124873 for more info. To paraphrase, the article basically says the following, along with the usual risk disclaimers regarding manual editing of the registry:

    1. Run Registry Editor (REGEDT32.EXE).

    2. From the HKEY_LOCAL_MACHINE subtree, go to the following key:
      \SYSTEM\CurrentControlSet\Control\Windows

    3. Select the ErrorMode value.

    4. From the Edit menu, choose DWORD.

    5. Type 0 (zero), 1, or 2 to select the error mode. Regardless of this setting, all errors are written to the system log:

      • 0 - Error message box pops up (default).
      • 1 - No dialog for system errors only.
      • 2 - No dialog for system or other errors.

I changed the ip addresses of a host on my network, and now rush can't talk to it?
    touch(1) all the rush hosts files on the network, so the daemons will reload their IP address caches.

    For speed, the rush daemons cache hostname-to-ip-address lookups for all the hosts in the rush hostlist. This prevents load on your DNS, NIS and WINS servers, since rush makes numerous hostname/ip lookups when it's running jobs.

    When you change the IP address of one of the machines, the rush daemons need to be told to flush their caches. touch(1)ing all the $RUSH_DIR/etc/hosts files changes the date stamp of the file, causing the daemons to think a change was made; they then reload the file and flush their caches.

    You can check to see what any daemon has in its IP cache using 'rush -lah <hostname>'. This will show you the rush hostlist according to the daemon on the named machine, including its cached IP address lookup information.
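
    On unix hosts, touching the hosts file on every machine could be scripted along these lines. This is a sketch under several assumptions: rush is installed in /usr/local/rush on every host, rsh works without a password, and hostnames in the rush hosts file start with a lowercase letter (the same pattern the rdist examples use). The function name is made up, and any ':interface' suffix on a hostname is stripped. The function only prints the commands, so they can be reviewed first.

```shell
# Hypothetical helper, not part of rush.

# touch_cmds HOSTSFILE: print one 'rsh HOST touch ...' command for
# each host listed in the rush hosts file.
touch_cmds() {
    awk '/^[a-z]/ {
            sub(/:.*/, "", $1)     # drop any ":interface" suffix
            print "rsh " $1 " touch /usr/local/rush/etc/hosts"
         }' "$1"
}
```

    To review the commands, run 'touch_cmds /usr/local/rush/etc/hosts'; to execute them, pipe the output to sh.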

Sometimes I see '???.???.???.???' in 'rush -lah' reports. Is this bad?
    Yes, it's bad.

    Rush will not operate correctly if it can't do hostname lookups for any machine in the rush hosts file.

    The question marks mean rush is unable to look up a host's name. The more of these there are, the slower rush will operate. You will also notice sluggish or very slow operation in the GUIs, and in the generation of most rush reports.

    Other tools like 'ping' will probably be unable to look up the hostname as well. Possible causes:

    • In an environment where static /etc/hosts files are used for name lookups, make sure the hostname is in all your /etc/hosts files.

    • In an NIS environment, same problem likely exists with your 'hosts' map.

    • Do not use DHCP. DHCP is bad for machines running rush. Use static IPs.

    • In a DNS environment, either your DNS server is not responding, not configured, or the host is not in your DNS. Use nslookup(1) to debug the problem.

    • In a Windows environment where WINS is used to do hostname lookups, if the machine is down, WINS can't do a hostname lookup for it. To solve this problem, you can do any *one* of the following:

      • Make *static* IP entries for all rush machines on your WINS server, so the hostnames still lookup even if the machine is down. Use the 'WINS Manager' on your PDC.

      • Maintain static hosts files on the rush machines. Windows has a unix-like hosts file, "C:\WINNT\SYSTEM32\DRIVERS\ETC\HOSTS". Set it up to contain IP-to-hostname entries for all the rush hosts, then copy it to all the machines to ensure hostname lookups never fail.

      • Get away from WINS, and use DNS with static IP-to-hostname lookups. This will ensure hostname lookups work even when hosts are down.

Rush is acting slow; reports take a long time, and the GUI is sluggish. What's wrong?
    Run 'rush -lah' and 'rush -lah localhost' to see if it reports "???.???.???.???" for the ip address of any hostnames. If so, you are having a name lookup problem, and that's causing the problem.

    This is especially a problem on Windows networks if you use WINS instead of DNS, because WINS can't do hostname lookups for a machine that is down. This is a good reason NOT to be lazy and depend on WINS to dynamically keep track of things.

    To solve this problem, see the possible causes listed under "Sometimes I see '???.???.???.???' in 'rush -lah' reports" above.

Can I use DHCP on machines running rush?
    It is not advised.

    You can do it if you set it up so the leases never expire.

    Rush will not operate correctly if the IP addresses of machines change randomly, or change when they reboot. The best thing to do is assign static ip addresses to all machines running rush.

How do I know how many rush licenses are checked out?
    It would be the number of hosts you have in your rush hosts list, which should be the same on all machines.

    So basically, it's the number of lines in the rush hosts file, not counting the comments. Or the number of hosts in 'rush -lah'.

    Rush checks licenses out on boot; it does not check out/check in while the system is running.
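
    Counting the hosts file entries can be done with a one-liner such as the sketch below. It assumes comment lines start with '#' (possibly after leading whitespace) and that every other non-blank line is one host; the function name is made up.

```shell
# Hypothetical helper, not part of rush.

# count_rush_hosts HOSTSFILE: print the number of lines that are
# neither blank nor comments.
count_rush_hosts() {
    awk '!/^[ \t]*(#|$)/ { n++ } END { print n + 0 }' "$1"
}
```

    For example: count_rush_hosts $RUSH_DIR/etc/hosts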

Rushtop doesn't show cpu use for some of my windows machines?
    Either the RUSHD service on those machines isn't running, or it is running but is configured to run as a user that doesn't have local admin privilege. RUSHD must run with local admin privilege in order to access the machine's cpu usage; otherwise RUSHD can't report it to 'rushtop'.

Maya renders fail with '--- FAILED: EXITCODE=128'?
    This is not an issue with rush; this is a Maya licensing problem that occurs only on Windows machines, where two different users can't run Maya on the same machine, even though licenses are available to do so. This is a bug in maya's licensing for windows, but workarounds exist.

    The problem and possible workarounds are fully described in this maya issues document.

On submit, I get 'logdir '//Xxx/yyy/logs/': no such file or directory', but the dir exists!
    This happens if you submit a job to a windows machine if the 'logdir' has a trailing slash.

    Remove the trailing slash on the pathname, and it'll be OK.

    This problem has been resolved in versions 102.31p and up.

    Note: Windows is to blame for this one; the bug is in Microsoft Windows' stat() call, which can't handle pathnames that have trailing slashes. The fix in 102.31p involves rush actively removing the trailing slash before all calls to stat(), which should not be necessary.
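
    For sites on older versions that generate the logdir path in a submit script, the trailing slash can be stripped before submitting. The function below is only a sketch with a made-up name; it is not how rush implements the 102.31p fix internally. It removes any run of trailing slashes, but leaves a path consisting only of slashes as a single '/'.

```shell
# Hypothetical helper, not part of rush.

# strip_trailing_slash PATH: print PATH without trailing slashes.
strip_trailing_slash() {
    printf '%s\n' "$1" | sed 's:\(.\)/*$:\1:'
}
```

    For example, 'strip_trailing_slash //Xxx/yyy/logs/' prints '//Xxx/yyy/logs'.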

Rendering on windows, I get 'CreateProcess(perl ...): The system cannot find the file specified.'?
    You'll see this error in the daemon log for a windows machine.
    You may also see it in the frame logs, and/or in the NOTES section of the 'Cpus' report for your job for cpus that are windows machines.

    The cause is either 'perl' is not installed on the machine in question, or perl was recently installed, without restarting the rush daemon.

    If you install perl while the rush daemon is running, you must restart the daemon for it to pick up the addition to the system environment's PATH variable, eg:

            net stop rushd
            net start rushd
       
    Then requeue the frames and try again.

Starting rushd under WinNT it says 'can't find PDH.DLL' or 'PDH.LIB'
    Microsoft didn't include PDH.DLL in some releases of Windows NT.

    You can download the PDH file for winnt from this Microsoft article, or search the Microsoft Knowledge Base for article "Q284996".

    For general information on accessing Microsoft support files, see Microsoft Knowledge Base article Q119591.

udp: iface bind(10.10.10.115:696): Address already in use
    Please see "What does 'bind(): Address already in use' mean?" above; it's the same thing.

'xxx': valid host is NOT in rush hostlist
    That means 'xxx' is running rush daemons, but that host does not have an entry in the ../rush/etc/hosts file.

    Usually the cause of this is either that the rush hosts files are out of sync, or that a decommissioned machine still has an old rush hosts file and its daemon is configured 'on'.

    Solution: either add the hostname to all the rush hosts files, or disable the rush daemons on the remote machine 'xxx'.

LICENSE client x.x.x.x sent us garbage(i)
    You'll see this on the license server, where x.x.x.x is the IP address of one of the remote rush hosts.

    The problem is usually one of the following:

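    • The remote machine has more than one network interface, and needs the 'hostname:interface' configuration described under "Can rush be told to use a different network interface, other than the machine's hostname?".
    • The remote machine is a recently set up linux RedHat 7.x with the /etc/hosts problem described under "I'm getting 'connect(myhost): Invalid argument' in my linux rushd.log?".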

can't get etheraddr[#]: xxx
    You'll see this only on Windows machines. The cause is you don't have NetBIOS enabled. Check your network properties:

    1. Right click on Network Neighborhood, choose Properties

    2. 'NetBIOS' should appear under the "Services" tab. If not, use "Add" to add it. You shouldn't need a disk, but it might prompt for one. Just point it to your C: drive.

    3. 'NetBIOS' should appear under the "Bindings" tab. And your ethernet card should appear bound to NetBIOS.

    Once NetBIOS is enabled, you'll likely be asked to reboot. On reboot, Rushd should then start OK; check your log.

    Usually, there is no reason to have NetBEUI or IPX enabled under networking.

    This problem is often seen with 3Com cards.

sendto(host:696): Message too long: to host
    This is fixed in 102.31p and up.

    You'll see this error on Mac OSX with version 102.31n and previous when you try to use 'rush -push' from the mac, or try to use 'rushadmin' from the mac.

    The workaround is to add the following line to the 'start' section of the /System/Library/StartupItems/Rush/Rush boot script:

           sysctl -w net.inet.udp.maxdgram=64000
       
    Then reboot; the problem should go away. This should not be needed if you're running rush version 102.31p and higher, as the problem should already be fixed in all versions after that.

    Note that if you run the above sysctl command from the command line as root, you can fix the problem right away, eg:

           [root@macosx] # rush -push rush.conf ontario
           rush: sendto(ontario:696): Message too long: to ontario
    
           [root@macosx] # sysctl -w net.inet.udp.maxdgram=64000
           net.inet.udp.maxdgram: 9216 -> 64000
    
           [root@macosx] # rush -push rush.conf ontario
           ontario[rush.conf]: OK
       
    ..but if you reboot the machine, the problem comes back. This is why the fix is made to the boot script to have it stay in effect.