From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Die frames
   Date: Tue, 08 Nov 2005 14:10:50 -0800
Msg# 1108
Brent Hensarling wrote:
> It's funny, I never noticed that there :) What about the case where a frame is hung on a machine that is not dead, but the frame will not requeue at all?

	When you requeue a running frame, Rush uses the most powerful means
	it can to kill the process. Under Unix, it uses kill -9, and under
	Windows it uses a similar 'most deadly' means of killing the process.

	If the kernel won't let the process go, it's most likely 'hung'
	in a way that is severe.
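
	As a rough illustration of that "hard kill, then check" idea (this is
	not Rush's actual code, just a minimal sketch), the following Python
	uses the standard subprocess module. The function name force_kill and
	the 5 second timeout are made up for the example. A False result means
	the OS never let the process go within the timeout.

```python
import subprocess
import sys


def force_kill(proc: subprocess.Popen, timeout: float = 5.0) -> bool:
    """Hard-kill a child process; return True if it actually exited.

    Popen.kill() sends SIGKILL on POSIX and calls TerminateProcess() on
    Windows. False means the process did not exit within `timeout`
    seconds, e.g. because it is blocked in an uninterruptible kernel wait.
    """
    proc.kill()
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for a render: a child process that just sleeps for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    if force_kill(child):
        print(f"pid {child.pid} is gone (exit status {child.returncode})")
    else:
        print(f"pid {child.pid} survived the kill; check its wait state")
```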

	Under Unix, the common cause is that the file server is down and
	the app is accessing a file through a hard mount. The process won't
	revive unless the remote server comes back or the mount is fixed.

	In some cases, the problem is a buggy network file system
	implementation that is not compatible with the file server, causing
	the mount to hang indefinitely even though the remote server is healthy.
	(In such a case, the bug is probably in the client's NFS implementation.)

	Similar problems with other filesystems/platforms can cause this.
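
	One way to check for a hung mount without hanging your own shell or
	script is to touch the path from a throwaway child process and give
	up after a timeout. The sketch below is only an illustration: the
	name probe_path, the 5 second timeout and the 'ok' / 'error' / 'hung'
	return values are assumptions made up for this example.

```python
import subprocess
import sys


def probe_path(path: str, timeout: float = 5.0) -> str:
    """Return 'ok', 'error', or 'hung' for a stat() of `path`."""
    # Run stat() in a child so a hung mount blocks the child, not the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        status = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait again, the child may be unkillable.
        child.kill()
        return "hung"
    return "ok" if status == 0 else "error"


if __name__ == "__main__":
    # Pass the mount point to test as the first argument (defaults to "/").
    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    print(target, probe_path(target))
```

	Note that the probe deliberately does not wait for the child after
	the timeout, since a child stuck on a hard mount may never exit.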

> We tend to get this all the time, and basically what I do is restart the rush service on the machine that has the hung frame; it then requeues the frame, and I can turn the machine back online and it will start rendering again.

	Probably what happens there is the process is orphaned but remains
	running.

	To solve this problem I would investigate deeply what is causing
	the processes to hang in an unkillable manner.

	For instance, can you kill the render with 'kill -9' (Unix)
	or from the Task Manager (Windows)? My guess is you won't be able to,
	indicating that your OS is at fault. The most likely cause is the
	mounting scheme (buggy NFS/SMB) or a buggy network lookup system
	(LDAP/NIS). I've seen both cause 'unkillable hanging problems'
	that you would experience both from the command line and under rush.

> Will the down command just requeue the frame and, if the machine is still online, let it continue rendering?

	IIRC, down tells the job server to release the frame and requeue it,
	severing the association with the remote.

	However, I believe the remote will still not let it go, because
	the process is still running from its point of view -- the only
	way to really let it go is to kill the render process somehow,
	otherwise you'd have to restart the rushd service, or better yet
	reboot the box, since unkillable processes, esp. under Windows,
	should not be happening. You need to identify the problem with the
	OS if a process is unkillable.

	I'm not familiar with the Windows-specific debugging tools, but
	under Unix I use 'strace -p [pid_of_render_process]' to determine
	what the process is doing, and/or the long reports from 'ps' to
	see what the WCHAN entry shows the process waiting for, and the
	'STAT' column showing what state the process is in (i.e. in an
	unkillable sleep, or some such, which means it's stuck on some
	I/O device).
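
	To make the 'ps' part concrete, here is a minimal sketch that picks
	out processes in uninterruptible sleep, i.e. those whose state column
	(STAT in Linux procps 'ps') starts with 'D', along with their WCHAN.
	It assumes a Linux procps-style 'ps -eo pid,stat,wchan,comm'; other
	platforms may need different options. The function name
	find_uninterruptible is made up for the example.

```python
import subprocess


def find_uninterruptible(ps_output: str) -> list[tuple[int, str, str, str]]:
    """Return (pid, stat, wchan, command) for rows whose STAT starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, stat, wchan, command = fields
        if stat.startswith("D"):  # 'D' = uninterruptible sleep (usually I/O)
            stuck.append((int(pid), stat, wchan, command.strip()))
    return stuck


if __name__ == "__main__":
    # Assumes a Linux procps-style ps; adjust the options for other systems.
    listing = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for pid, stat, wchan, command in find_uninterruptible(listing):
        print(f"{pid}\t{stat}\t{wchan}\t{command}")
```

	A render that shows up here and stays there is waiting inside the
	kernel, and the WCHAN name is usually the best hint as to what on.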


--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)
