From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: several strange Rush behaviours
   Date: Fri, 20 Jul 2012 13:14:35 -0400
Msg# 2261
On 07/20/12 03:31, Abraham Schneider wrote:
>>        Wow, you have large jobids! You must have the jobidmax value
>>       cranked up. Be careful with that (see below).
>
> Don't worry, it's just the number that I cranked up. Normally there
> aren't more than 200-300 jobs in rush. We dump them regularly.

	Sounds good.

> I think
> I'm using these high numbers because when there are jobs on the farm and
> the maximum number is reached, new jobs will start with 0 again and
> these jobs will be rendered next, despite older jobs waiting in the
> queue. FIFO is based on jobID I think!?

	FIFO is based primarily on the submit time of the job.
	The jobid is only used to break a tie, when two jobs
	are submitted during the same second.
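
	To illustrate that ordering rule, here is a minimal Python sketch.
	It is not Rush's actual scheduler code; the Job class, its field
	names, and the sample submit times are all hypothetical, made up
	just to show 'submit time first, jobid only as a tie-breaker'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical fields, for illustration only (not Rush internals).
    jobid: str        # e.g. "rind10.12225"
    submit_time: int  # submit time in whole seconds


def jobid_number(jobid):
    """Return the numeric part after the last dot of a jobid."""
    return int(jobid.rsplit(".", 1)[1])


def fifo_order(jobs):
    """Sort by submit time; the jobid number only breaks same-second ties."""
    return sorted(jobs, key=lambda j: (j.submit_time, jobid_number(j.jobid)))


jobs = [
    Job("rind10.3", 1000),     # low jobid (wrapped around), but submitted last
    Job("rind10.12226", 900),
    Job("rind10.12225", 900),  # same second as 12226: the jobid breaks the tie
]
ordered = [j.jobid for j in fifo_order(jobs)]
print(ordered)  # ['rind10.12225', 'rind10.12226', 'rind10.3']
```

	Under that rule, a job whose jobid wrapped back to a low number
	still sorts behind older jobs, because its submit time is later.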

>>       Can I see these reports for the rind10.12225/6/7 jobs? eg:
>>
>>               rush -lf rind10.12225 rind10.12226 rind10.12227
>>               rush -lc rind10.12225 rind10.12226 rind10.12227
>
> see separate mail to greg@(email surpressed)

	Great -- see the previous email for an analysis of those.
	Turns out those jobs all seemed OK relative to one another,
	so I requested the later jobs as well.

> Of course I know that. As you can see in the report, 81% done was not at
> the end of the shot, as the shot is quite long. You see the problem part
> where only the machine 'sunrender' is rendering frames.

	Hmm, I saw the sunrenders in the reports, but didn't see
	what was wrong there..?

>>> 2. problem:
>>> Most of the time, switching a machine/workstation from offline to
>>> online, it takes from many seconds to several minutes for this machine
>>> to pick up a frame and start rendering. The machine is shown as
>>> 'online' instantly, but it just won't start rendering a frame.
>>> It's listed as 'online' and 'idle' for several minutes. This happens
>>> for all of our machines, doesn't matter if they are Macs or Linux.
>>
>>       Can you send me the tasklist for the machine in question, ie:
>>
>>               rush -tasklist SLOWHOST
>
> see separate mail

D'oh:
	I think I forgot to go into details on this; I probably should
	have said 'when you online a machine and it's not picking up,
	send the tasklist at that moment when it's not picking up.'

	And in such a case, it would also be useful to see what the state
	of all the jobs on the farm is, eg: 'rush -laj; rush -lac'.

	So perhaps you can resend those three (tasklist/laj/lac) right
	when the machine is onlined and not rendering.

So if it's not large leftover jobs, then it must be something
in the code.

Perhaps what is happening is this: if a machine has been offline
for a while, and many jobs have come and gone but were not dumped,
the scheduler may still have old jobs left behind in it. When the
machine comes online, it walks through all those old jobs to find
the next one to render, and it takes a while to 'catch up'.

I'd better check this, esp. in the context of a FIFO scheduler.

> seems not to be a problem there, the rushd.log of the problem machine
> 'apu' is quite small:
>
> today:
> 07/20,03:00:27 ROTATE     rushd.log rotated. pid=146, 0/3 busy, OFFLINE
> 07/20,03:00:27 ROTATE     apu RUSHD 102.42a9d PID=146  Boot=07/17/12,19:07:25
>
> yesterday:
> 07/19,03:00:16 ROTATE     rushd.log rotated. pid=146, 0/3 busy, OFFLINE
> 07/19,03:00:16 ROTATE     apu RUSHD 102.42a9d PID=146     Boot=07/17/12,19:07:25
> 07/19,18:23:56 SECURITY   Daemon changed to Online by aschneid@itchy[192.168.10.21], Remark:online by aschneid 07/19/12,18:23 via irush state (online) by aschneid@itchy[192.168.10.21]
> 07/19,19:02:03 SECURITY   Daemon changed to Getoff by aschneid@itchy[192.168.10.21], Remark:getoff by aschneid 07/19/12,19:02 via irush state (getoff) by aschneid@itchy[192.168.10.21]
> 07/20,03:00:27 ROTATE     rushd.log rotated by rush.conf: logrotatehour=3

	Yes, those are /really/ healthy logs.
	Scary in fact.

	But it looks like it was only online for about 40 mins (6:23p to 7:02p),
	so it probably didn't do much rendering, if any.
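
	For reference, the online window can be worked out from the two
	SECURITY timestamps in the log above. This is just a rough Python
	sketch: the log lines are abbreviated to their leading fields, the
	parse_stamp() helper is hypothetical, and the year is assumed to
	be 2012 based on the 'Remark' dates in the log.

```python
from datetime import datetime


def parse_stamp(line, year=2012):
    """Parse the leading 'MM/DD,HH:MM:SS' field of a rushd.log line.

    The log field carries no year, so one is assumed (hypothetical helper).
    """
    stamp = line.split()[0]
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d,%H:%M:%S")


# Abbreviated copies of the two SECURITY lines quoted above.
online = "07/19,18:23:56 SECURITY   Daemon changed to Online"
getoff = "07/19,19:02:03 SECURITY   Daemon changed to Getoff"

elapsed = parse_stamp(getoff) - parse_stamp(online)
minutes = elapsed.total_seconds() / 60
print(f"{minutes:.1f} minutes online")  # 38.1 minutes online
```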

-- 
Greg Ercolano, erco@(email surpressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


