From: Greg Ercolano <erco@(email surpressed)>
Subject: [Q+A] How do I determine what caused a render node to crash during
   Date: Tue, 04 Apr 2006 15:18:01 -0400

Msg# 1269

> We're working on some large projects, and sometimes our renders crash
> some of our render farm nodes, causing them to hang or freeze up.
> Can you recommend any techniques for determining the cause of the crash?

    A completely frozen box usually means the kernel panicked,
    or a mount to the file server froze up (did you put mounts in
    your root directory? Or make symlinks in the root directory to
    mount points? Bad!), or the box is thrashing to death.

    The most common cause is the render process using so much RAM that
    the box swaps to death. Unix boxes often don't behave well when a
    process uses too much memory.

    The next most common cause is a buggy OS; the render somehow triggers
    a bug in a device driver or the kernel, crashing the machine. Sometimes
    heavy rendering can trigger subtle threading bugs in the kernel
    involving eg. the network card and/or disk drivers, the two most
    heavily used device drivers during rendering.

    Or possibly there's a hardware problem, if the problem is specific
    to one machine (eg. a burned out cpu fan, bad ram, dust problems
    on the mobo, bad drive).

    The first place to look would be the log file for the frame;
    quite possibly there's enough info there to indicate something
    was wrong during the render.

    Next place to look would be the machine's own system log:

	LINUX: /var/log/messages
	OSX: /var/log/system.log
	WINDOWS: Check the "Event Viewer" (My Computer|Manage|Event Viewer)

    For example, under linux, look in /var/log/messages for error messages
    and/or look at the console's last dying messages before rebooting the box.
    I would leave the render nodes in text mode (ie. no X windows),
    and when a box hangs in this way, inspect the console (VGA) for any
    last messages, and become familiar with the magic "SysRq" key sequences
    to inspect the kernel's status (if it's still alive enough to do so), eg:
    http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-8.html

    This implies having an actual keyboard and monitor attached to
    the render node during diagnosis -- I assume you have a KVM or, at the
    very least, a way to manually patch a keyboard/monitor to the blade.
    It's very important to have access to the machine's console to diagnose
    hung boxes before whacking the reset button.
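
    If the box has already been rebooted and the console messages are gone,
    the system log can still be searched after the fact. Here's a minimal
    sketch of that idea. The pattern list is only my assumption of what is
    worth flagging (out-of-memory kills, oopses, panics, NFS timeouts); the
    exact wording of kernel messages varies between versions, so adjust it
    to what your own logs show. The sub name suspicious_lines() is made up
    for this example.

```perl
#!/usr/bin/perl -w
use strict;

# ASSUMPTION: these patterns are examples only; exact kernel message
# wording varies between kernel versions and distributions.
my @patterns = (
    qr/out of memory/i,
    qr/oom-killer/i,
    qr/kernel panic/i,
    qr/\bOops\b/,
    qr/nfs: server .* not responding/i,
);

# Return the lines that match any of the suspicious patterns.
sub suspicious_lines {
    my (@lines) = @_;
    my @hits;
    foreach my $line ( @lines ) {
        foreach my $re ( @patterns ) {
            if ( $line =~ $re ) { push(@hits, $line); last; }
        }
    }
    return @hits;
}

# Scan the log named on the command line, eg. /var/log/messages
if ( @ARGV ) {
    open(my $fh, '<', $ARGV[0]) or die "$0: $ARGV[0]: $!\n";
    print suspicious_lines(<$fh>);
    close($fh);
}
```

    Run it with the log file as the only argument. Keep in mind a hard
    freeze often leaves nothing in the log, which is why the console is
    still the better witness.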

    If you suspect a ram issue with the renders, you might want to
    leave 'vmstat 3' running to monitor ram/virt memory/io while renders
    are running.
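
    For a quick one-shot check of how deep into swap a Linux node is,
    something like the following sketch could be run by hand or from cron.
    It assumes the Linux /proc/meminfo layout ("SwapTotal:" and "SwapFree:"
    lines, in kB); the sub names parse_meminfo() and swap_used_pct() are
    made up for this example. It does nothing on systems without
    /proc/meminfo.

```perl
#!/usr/bin/perl -w
use strict;

# Parse "Name:   12345 kB" lines (the Linux /proc/meminfo layout)
# into a hash of name => kilobytes.
sub parse_meminfo {
    my (@lines) = @_;
    my %kb;
    foreach my $line ( @lines ) {
        if ( $line =~ /^(\w+):\s+(\d+)\s+kB/ ) { $kb{$1} = $2; }
    }
    return %kb;
}

# Percentage of swap currently in use (0 if the box has no swap).
sub swap_used_pct {
    my (%kb) = @_;
    return 0 unless ( $kb{SwapTotal} );
    return 100 * ( $kb{SwapTotal} - $kb{SwapFree} ) / $kb{SwapTotal};
}

# ASSUMPTION: Linux only; silently skipped if /proc/meminfo can't be opened
if ( open(my $fh, '<', '/proc/meminfo') ) {
    my %kb = parse_meminfo(<$fh>);
    close($fh);
    printf("swap in use: %.1f%%\n", swap_used_pct(%kb));
}
```

    A single snapshot says less than a running log, but a node sitting at a
    high swap percentage mid-render is a strong hint.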

    You could implement a boot script that logs the output of vmstat
    to a log file (eg. /var/log/vmstat.log). It can be as simple as
    backgrounding a 'vmstat 3 >> /var/log/vmstat.log' command from a boot
    script, or something more robust like the following perl script, which
    adds date stamps and [optionally] periodic ps(1) reports:

==========================================================================

#!/usr/bin/perl -w
#
# vmstat-logger -- watch vmstat statistics with date stamps
# erco 1.0 04/04/06
#
#     Run this program at boot in the background to keep a log
#     history of virtual memory use and disk i/o. eg:
#
#           /etc/LOCAL-vmstat-logger < /dev/null > /var/log/vmstat-errors.log 2>&1 &
#

use strict;
use POSIX qw(ctime);				# provides ctime() for the date stamps

# CUSTOMIZABLE VALUES
my $logfile = "/var/log/vmstat.log";		# log we append to
my $vmsecs   = 5;				# vmstat sample rate (seconds)
my $datesecs = 300;				# date stamp rate (must be >= $vmsecs, ideally a multiple of it)
my $pssecs   = 60;				# ps(1) report rate (must be >= $vmsecs, ideally a multiple of it)

# OPEN LOG FOR APPENDING
unless ( open(LOG, ">>$logfile") ) {
    print STDERR "$0: $logfile: $!\n";
    exit(1);
}
select(LOG); $|=1; select(STDOUT);		# unbuffered i/o to log

# OPEN CONTINUOUS VMSTAT FOR READING
unless ( open(VMSTAT, "vmstat $vmsecs|") ) {
    print STDERR "$0: vmstat $vmsecs: (could not execute): $!\n";
    exit(1);
}

# LOG CONTINUOUS OUTPUT OF VMSTAT WITH DATESTAMPS
my $count;
my $vmstatout;
for ( $count=0; $vmstatout = <VMSTAT>; $count++ ) {

    # TIME STAMP
    if ( ( $count % ($datesecs/$vmsecs)) == 0 ) {	# log time stamp
        print LOG "DATE: --- " . ctime(time());
    }

#   # PS REPORT
#   if ( ( $count % ($pssecs/$vmsecs)) == 0 ) {		# [OPTIONAL] log ps report
#       print LOG "--- PS REPORT:\n";
#       unless ( open(PS, "ps faux|") ) {
#           print LOG "$0: ps faux: $!\n";
#       } else {
#           while (<PS>) {
#               if ( ! /^root/ ) {			# log all non-root processes
#                   print LOG "PS: $_";
#               }
#           }
#           close(PS);
#       }
#   }

    # VMSTAT REPORT
    print LOG "VMSTAT: $vmstatout";				# log vmstat output
}

# EOF
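
    Once a log like this has been collecting for a while, it can be
    searched for the moments where the box was swapping hard. The following
    is a rough sketch that reads the "DATE:" and "VMSTAT:" lines written by
    the logger and reports samples with heavy swap-in/swap-out. It assumes
    the Linux procps vmstat column order, where 'si' and 'so' are the 7th
    and 8th fields; check the header line in your own log first. The sub
    name find_swap_spikes() and the threshold of 100 are made up for this
    example.

```perl
#!/usr/bin/perl -w
use strict;

# ASSUMPTION: Linux procps vmstat column order, where fields 7 and 8
# (counting from 1) are si/so. Check the header line on your own system.
my $SI_COL = 6;		# zero-based index of 'si'
my $SO_COL = 7;		# zero-based index of 'so'

# Return a list of "datestamp: si=N so=N" strings for every sample
# whose combined swap in/out rate meets or exceeds $threshold.
sub find_swap_spikes {
    my ($threshold, @lines) = @_;
    my $lastdate = "(no date stamp yet)";
    my @hits;
    foreach my $line ( @lines ) {
        chomp($line);
        if ( $line =~ /^DATE: --- (.*)/ ) { $lastdate = $1; next; }
        next unless ( $line =~ /^VMSTAT:\s+(\d.*)/ );	# skip vmstat header lines
        my @f = split(' ', $1);
        next if ( @f <= $SO_COL );			# short/garbled line
        my ($si, $so) = @f[$SI_COL, $SO_COL];
        next unless ( $si =~ /^\d+$/ && $so =~ /^\d+$/ );
        if ( $si + $so >= $threshold ) {
            push(@hits, "$lastdate: si=$si so=$so");
        }
    }
    return @hits;
}

# Scan the log named on the command line, eg. /var/log/vmstat.log
if ( @ARGV ) {
    open(my $fh, '<', $ARGV[0]) or die "$0: $ARGV[0]: $!\n";
    print "$_\n" foreach find_swap_spikes(100, <$fh>);
    close($fh);
}
```

    The date shown is the most recent date stamp before the sample, so it
    is only accurate to within the date stamp interval. If spikes line up
    with the time a node hung, memory use by the render is the first
    suspect.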
