From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: several strange Rush behaviours
   Date: Thu, 19 Jul 2012 12:58:08 -0400
Msg# 2258
On 07/19/12 01:51, Abraham Schneider wrote:
> Done     rind10.12202     N_NORM_019_060_comp_v04_as aschneid        %100    %0    0   16:52:52
> Done     rind10.12204     N_NORM_002_010_comp_v16_dl dlaubsch        %100    %0    0   16:32:02
> Done     rind10.12206     N_LOW_048_060_redLog_v00a_mw mwarlimont    %100    %0    0   16:28:48
> Done     rind10.12213     N_NORM_001_050_comp_v38_ts tstern          %100    %0    0   16:06:51
> Done     rind10.12215     N_LOW_048_080_redLog_v00a_mw mwarlimont    %100    %0    0   16:05:03
> Done     rind10.12218     N_NORM_019_060_comp_v04_as aschneid        %100    %0    0   15:08:02
> Fail     rind10.12221     N_NORM_001_010_comp_v01_st stischne         %99    %1    0   00:58:31
> Done     rind10.12223     N_NORM_001_010_comp_v01_st stischne        %100    %0    0   00:55:04
> Run      rind10.12225     N_NORM_001_050_comp_v38_ts tstern           %81    %0    2   00:41:35
> Run      rind10.12226     N_NORM_055_010_comp_v01_mt mwarlimont        %0    %0    0   00:36:58
> Run      rind10.12227     N_NORM_100_020_comp_v21_mt mwarlimont        %0    %0    0   00:34:11
> Done     rind10.12228     N_NORM_001_010_comp_v01_st stischne        %100    %0    0   00:23:08
> Run      rind10.12230     N_NORM_103_010_comp_v14_mt mwarlimont       %36    %0   17   00:19:49
> Run      rind10.12231     N_NORM_022_cfd0046_comp_v102 ppoetsch        %6    %0    0   00:19:37
> 
> ..sometimes something like above happens: job 12225 starts rendering on all online
> machines. But halfway through the rendering, it just stops, or the number
> of CPUs drops significantly, and all the other machines continue
> rendering on a much newer job.

Hi Abraham,

        Wow, you have large jobids! You must have the jobidmax value
	cranked up. Be careful with that (see below).

	Can I see these reports for the rind10.12225/6/7 jobs? eg:

		rush -lf rind10.12225 rind10.12226 rind10.12227
		rush -lc rind10.12225 rind10.12226 rind10.12227

	The '26 and '27 jobs appear to be getting completely skipped over;
	those are the ones that stand out to me.

	All the rest seem like they could be OK; I'd need to see the reports
	to know more.

	The 12225 job doesn't worry me too much, as it's 81% done with
	2 busy frames, so if those are the last two frames in the job,
	that would make sense. But if there are still available frames
	in the Que state with a TRY count of zero, that would be puzzling.

	As you probably know, if a job is rendering its last few frames,
	newly idle cpus will go to the next jobs down. If someone requeues
	all the frames in one of the higher up jobs, then that could bring
	them back down to 0% done, and they'd have to wait for available procs.
	I'll be able to tell from the 'Frames' report; the TRY column will show
	if a frame has already been run before.
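
	If you want to eyeball that yourself before sending the reports, something
	like this would narrow the Frames report down to just the queued frames so
	you can scan the TRY column for nonzero values (a rough sketch; it assumes
	the word 'Que' appears verbatim in the state column of the 'rush -lf' output):

		# show only queued frames; a nonzero TRY means the frame has run before
		rush -lf rind10.12225 | grep Que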

> 2. problem:
> Most of the time, switching a machine/workstation from offline to
> online, it takes from many seconds to several minutes for this machine
> to pick up a frame and start rendering. The machine is shown as 'online'
> instantly, but it just won't start rendering a frame. It's listed as
> 'online' and 'idle' for several minutes. This happens for all of our
> machines, doesn't matter if they are Macs or Linux.

	Can you send me the tasklist for the machine in question, ie:

		rush -tasklist SLOWHOST

	..I want to see how large that report is. If it's really large,
	that might be the reason.

	That report will show the list of jobs it is considering
	giving the idle cpus to, in the order it wants to check them.

	One situation might be that there's a bunch of jobs at the
	top of its list that are being managed by a machine that
	is currently down. In that case rush will try to contact
	that machine to get the job started, and will keep trying
	until a timeout of about a minute or so; then it gives up
	and moves on to the next jobs in the list that are not on
	that unresponsive machine.

	Another possibility is machines rebooting to new IP addresses
	(eg. DHCP assigned machines); that can prevent rushd from
	reaching the job servers to establish jobs, causing the
	situation above.
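
	A quick way to check for that is to ping the job servers the tasklist
	refers to. This is only a rough sketch; it assumes the jobids in the
	tasklist are in the usual 'server.number' form (like rind10.12225), and
	a ping only tells you the machine is up, not that rushd is answering:

		# pull the job server names out of the tasklist and ping each one
		for host in $( rush -tasklist SLOWHOST | grep -o '[A-Za-z0-9_-]*\.[0-9]*' | cut -d. -f1 | sort -u ); do
		    ping -c1 $host > /dev/null && echo "$host answers" || echo "$host NOT RESPONDING"
		done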

	It might be good if you send me the rushd.log from machines
	that act this way; I might be able to tell from that if
	there's a problem.
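
	For instance, something like this would grab the recent end of the log
	(this assumes the default unix install location; adjust the path if your
	rush install lives somewhere else):

		tail -500 /usr/local/rush/var/rushd.log > /tmp/rushd-SLOWHOST.log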

> Any explanation for that?

    Those large jobids might be the culprit; I'm not sure yet.

    When jobidmax is set high, thousands of jobs can remain in the queue,
    which makes the system work extra hard to find jobs that are available
    to run.

    The large max should be OK /as long/ as the 'Jobs' reports are kept trim,
    ie. old jobs are dumped regularly. You don't want to leave old jobs in the
    queue; they take up memory and make the daemon work harder internally,
    since it has to keep considering them in case they've been requeued.

    And if you have several job servers each with very large queues, that
    would exacerbate the problem.

    The reason rush comes with 999 as the max for jobids is to force
    folks to dump old jobs so that the queue doesn't get artificially large.
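
    If you want to trim the queue in one go, something along these lines should
    do it. This is only a rough sketch; it assumes 'rush -lj' prints the jobs
    report shown above (state in the first column, jobid in the second) and that
    'rush -dump <jobid>' is what removes a job on your version, so double check
    before taking out the 'echo':

        # list the jobs, keep only the Done/Fail ones, and dump each of them
        rush -lj | awk '$1 == "Done" || $1 == "Fail" { print $2 }' | while read jobid; do
            echo rush -dump $jobid      # remove the 'echo' to actually dump
        done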

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

