> We're working on some large projects, and sometimes our renders crash
> some of our render farm nodes, causing them to hang or freeze up.
> Can you recommend any techniques for determining the cause of the crash?
A completely frozen box usually means the kernel panicked,
or a mount to the file server froze up (did you put mounts in
your root directory? Or make symlinks in the root directory to
mount points? Bad!), or the box is thrashing to death.
The most common cause is the render process using so much RAM
that it's causing the box to swap to death. Unix boxes often don't behave
well when a process uses too much memory.
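If you want a quick number for how deep into swap a node is, here is a minimal sketch for Linux boxes. It assumes the standard SwapTotal/SwapFree lines in /proc/meminfo; the function name and the 50% threshold in the usage comment are just illustrative, not anything official.

```shell
# swap_used_pct: read /proc/meminfo-style text on stdin and print the
# percentage of swap currently in use (integer, rounded down).
swap_used_pct() {
    awk '
        /^SwapTotal:/ { total = $2 }
        /^SwapFree:/  { free  = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else           print 0      # no swap configured
        }'
}

# Usage on a Linux node (the 50% threshold is an arbitrary example):
#   pct=$(swap_used_pct < /proc/meminfo)
#   [ "$pct" -ge 50 ] && echo "WARNING: $(hostname) swap ${pct}% used"
```

A node that sits at a high percentage while a render is running is a good candidate for the "swapping to death" case.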
The next most common cause is a buggy OS; the render is somehow triggering
a bug in a device driver or the kernel, causing it to crash. Sometimes
heavy rendering can trigger subtle threading bugs in the kernel
involving eg. the network card and/or disk drivers, the two most heavily
used device drivers during rendering.
Or possibly there's a hardware problem, if the problem is specific
to one machine (eg. a burned out cpu fan, bad ram, dust problems
on the mobo, bad drive).
The first place to look would be the log file for the frame;
quite possibly there's enough info there to indicate something
was wrong during the render.
Next place to look would be the machine's own system log:
LINUX: /var/log/messages
OSX: /var/log/system.log
WINDOWS: Check the "Event Viewer" (My Computer|Manage|Event Viewer)
For example, under linux, look in /var/log/messages for error messages
and/or look at the console's last dying messages before rebooting the box.
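To speed up the search through the system log, something like the following could be a starting point. It's only a sketch: the pattern list is my own guess at the usual suspects (OOM killer, oopses, panics, disk I/O errors, NFS servers not responding), so extend it for your own drivers and file servers.

```shell
# scan_syslog FILE: print lines from a system log that match common
# signs of kernel trouble. The pattern list is an assumption, not an
# exhaustive catalog; add your own.
scan_syslog() {
    grep -E -i 'out of memory|oom-killer|oops|kernel panic|i/o error|not responding' "$1"
}

# Usage (as root, on the node that hung):
#   scan_syslog /var/log/messages | tail -n 50
```

Compare the time stamps of any hits against the time the frame's log stopped updating.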
I would leave the render nodes in text mode (ie. no X windows),
and when a box hangs in this way, inspect the console (VGA) for any
last messages, and become familiar with the magic "SysRq" key sequences
to inspect the kernel's status (if it's still alive enough to do so), eg:
http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-8.html
This implies having an actual keyboard and monitor attached to
the render node during diagnosis -- I assume you have a KVM or, at the very
least, a way to manually patch a keyboard/monitor to the blade. It's very
important to have access to the machine's console to diagnose
hung boxes before whacking the reset button.
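The SysRq keys only work if the kernel has them enabled, so it's worth checking that ahead of time, before a box hangs. A small sketch, assuming a Linux node where the setting lives in /proc/sys/kernel/sysrq (the function name is made up for illustration):

```shell
# sysrq_status [FILE]: report whether magic SysRq is enabled.
# FILE defaults to the Linux control file /proc/sys/kernel/sysrq.
sysrq_status() {
    val=$(cat "${1:-/proc/sys/kernel/sysrq}") || return 1
    case "$val" in
        0) echo "disabled" ;;
        1) echo "enabled (all functions)" ;;
        *) echo "partially enabled (bitmask $val)" ;;
    esac
}

# To enable everything until the next reboot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# Then, on the console of a hung node:
#   Alt+SysRq+m   dump memory usage
#   Alt+SysRq+t   dump the task list
#   Alt+SysRq+s   sync disks
#   Alt+SysRq+u   remount filesystems read-only
#   Alt+SysRq+b   reboot immediately
```

If the memory and task dumps still print to the console, the kernel is alive and the problem is more likely swap or a hung mount than a panic.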
If you suspect a ram issue with the renders, you might want to
leave 'vmstat 3' running to monitor ram/virt memory/io while renders
are running.
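When watching vmstat, the columns that matter most for the "swap to death" case are si and so (swap in/out). As a rough sketch, a filter like this would show only the samples where the box is actively swapping. It assumes the Linux procps column layout (si is column 7, so is column 8); other Unixes order the columns differently, so check your own vmstat header first.

```shell
# flag_swapping: read `vmstat N` output on stdin and print only the
# samples where the box is actively swapping (si or so non-zero).
# Assumes the Linux procps layout: si is column 7, so is column 8.
# Header lines are skipped because their first field is not a number.
flag_swapping() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { print "SWAPPING: " $0 }'
}

# Usage:
#   vmstat 3 | flag_swapping
```

Keep in mind the first sample vmstat prints is an average since boot, so a hit on the very first line doesn't mean much.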
You could implement a boot script that logs the output of vmstat
to a log file (eg. /var/log/vmstat.log). It can be as simple as just
backgrounding a 'vmstat 3 >> /var/log/vmstat.log' command from a boot
script, or as robust as the following perl script, which includes
date stamps and [optionally] periodic ps(1) reports:
==========================================================================
#!/usr/bin/perl -w
#
# vmstat-logger -- watch vmstat statistics with date stamps
# erco 1.0 04/04/06
#
# Run this program at boot in the background to keep a log
# history of virtual memory use and disk i/o. eg:
#
# /etc/LOCAL-vmstat-logger < /dev/null > /var/log/vmstat-errors.log 2>&1 &
#
use strict;
use POSIX qw(ctime);    # ctime.pl is no longer shipped with modern perls

# CUSTOMIZABLE VALUES
my $logfile  = "/var/log/vmstat.log";   # log we append to
my $vmsecs   = 5;                       # vmstat sample rate (seconds)
my $datesecs = 300;                     # date stamp rate (must be > $vmsecs)
my $pssecs   = 60;                      # ps(1) report rate (must be > $vmsecs)

# OPEN LOG FOR APPENDING
unless ( open(LOG, ">>$logfile") ) {    # use $logfile, not a hardcoded path
    print STDERR "$0: $logfile: $!\n";
    exit(1);
}
select(LOG); $|=1; select(STDOUT);      # unbuffered i/o to log

# OPEN CONTINUOUS VMSTAT FOR READING
unless ( open(VMSTAT, "vmstat $vmsecs|") ) {
    print STDERR "$0: vmstat $vmsecs: (could not execute): $!\n";
    exit(1);
}

# LOG CONTINUOUS OUTPUT OF VMSTAT WITH DATESTAMPS
my $count;
my $vmstatout;
for ( $count=0; $vmstatout = <VMSTAT>; $count++ ) {
    # TIME STAMP
    if ( ( $count % ($datesecs/$vmsecs)) == 0 ) {       # log time stamp
        print LOG "DATE: --- " . ctime(time());
    }
#   # PS REPORT
#   if ( ( $count % ($pssecs/$vmsecs)) == 0 ) {         # [OPTIONAL] log ps report
#       print LOG "--- PS REPORT:\n";
#       unless ( open(PS, "ps faux|") ) {
#           print LOG "$0: ps faux: $!\n";
#       } else {
#           while (<PS>) {
#               if ( ! /^root/ ) {                      # log all non-root processes
#                   print LOG "PS: $_";
#               }
#           }
#           close(PS);
#       }
#   }
    # VMSTAT REPORT
    print LOG "VMSTAT: $vmstatout";                     # log vmstat output
}
# EOF
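
After a node hangs and gets rebooted, the interesting part of that log is the tail end. Here's a hedged sketch of a helper that prints the last "DATE:" stamp the script wrote and everything logged after it; the function name is hypothetical, and it relies only on the "DATE:" / "VMSTAT:" prefixes the perl script above emits.

```shell
# last_window FILE: print the final "DATE:" stamp in the log and
# every line logged after it (the samples nearest the hang).
last_window() {
    awk '/^DATE: / { n = 0 }      # new stamp: discard the older window
         { buf[n++] = $0 }
         END { for (i = 0; i < n; i++) print buf[i] }' "$1"
}

# Usage:
#   last_window /var/log/vmstat.log
```

If free memory is shrinking and the swap columns are climbing in those last samples, the render's memory use is the likely culprit.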