From: "Abraham Schneider" <aschneider@(email surpressed)>
Subject: Re: several strange Rush behaviours
   Date: Fri, 20 Jul 2012 06:31:23 -0400
Msg# 2259
On 19.07.2012 at 18:58, Greg Ercolano wrote:

> [posted to rush.general]
>
> On 07/19/12 01:51, Abraham Schneider wrote:
>> Done     rind10.12202     N_NORM_019_060_comp_v04_as aschneid        %100    %0    0   16:52:52
>> Done     rind10.12204     N_NORM_002_010_comp_v16_dl dlaubsch        %100    %0    0   16:32:02
>> Done     rind10.12206     N_LOW_048_060_redLog_v00a_mw mwarlimont    %100    %0    0   16:28:48
>> Done     rind10.12213     N_NORM_001_050_comp_v38_ts tstern          %100    %0    0   16:06:51
>> Done     rind10.12215     N_LOW_048_080_redLog_v00a_mw mwarlimont    %100    %0    0   16:05:03
>> Done     rind10.12218     N_NORM_019_060_comp_v04_as aschneid        %100    %0    0   15:08:02
>> Fail     rind10.12221     N_NORM_001_010_comp_v01_st stischne         %99    %1    0   00:58:31
>> Done     rind10.12223     N_NORM_001_010_comp_v01_st stischne        %100    %0    0   00:55:04
>> Run      rind10.12225     N_NORM_001_050_comp_v38_ts tstern           %81    %0    2   00:41:35
>> Run      rind10.12226     N_NORM_055_010_comp_v01_mt mwarlimont        %0    %0    0   00:36:58
>> Run      rind10.12227     N_NORM_100_020_comp_v21_mt mwarlimont        %0    %0    0   00:34:11
>> Done     rind10.12228     N_NORM_001_010_comp_v01_st stischne        %100    %0    0   00:23:08
>> Run      rind10.12230     N_NORM_103_010_comp_v14_mt mwarlimont       %36    %0   17   00:19:49
>> Run      rind10.12231     N_NORM_022_cfd0046_comp_v102 ppoetsch        %6    %0    0   00:19:37
>>
>> ..sometimes something like the above happens: job 12225 starts rendering on all online
>> machines. But halfway through the rendering, it just stops, or the number
>> of CPUs drops significantly, and all the other machines continue
>> rendering on a much newer job
>
> Hi Abraham,
>
>        Wow, you have large jobids! You must have the jobidmax value
>       cranked up. Be careful with that (see below).

Don't worry, it's just the jobidmax number that I cranked up. Normally there are no more than 200-300 jobs in rush; we dump them regularly. I'm using these high numbers because when there are jobs on the farm and the maximum jobid is reached, new jobs start at 0 again and get rendered next, even though older jobs are still waiting in the queue. FIFO is based on jobid, I think!? So I wanted to avoid having that wrap from the max jobid back to 0 happen too often during the day.
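
To make that wrap-around concrete, here's a tiny shell illustration (made-up jobids, no rush commands involved): if scheduling really goes strictly by numeric jobid, jobs submitted just after the counter wraps to 0 sort ahead of older jobs still waiting near the maximum.

    # Hypothetical jobids still in the queue: two old jobs near jobidmax,
    # plus two new jobs submitted after the counter wrapped back to 0.
    printf '997\n998\n2\n5\n' | sort -n
    # Prints 2, 5, 997, 998 -- so under a strict by-jobid FIFO, the two
    # wrapped jobs would be considered before the older 997 and 998.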

>
>       Can I see these reports for the rind10.12225/6/7 jobs? eg:
>
>               rush -lf rind10.12225 rind10.12226 rind10.12227
>               rush -lc rind10.12225 rind10.12226 rind10.12227

see separate mail to greg@(email suppressed)
>
>       The '26 and '27 jobs appear to be getting completely skipped over,
>       they stand out to me.
>
>       All the rest seem like they could be OK, need to see the reports
>       to know more.
>
>       The 12225 job doesn't worry me too much, as it's 81% done with
>       2 busy frames, so if those are the last two frames in the job,
>       that would make sense. But if there are still available frames
>       in the Que state with a TRY count of zero, that would be puzzling.
>
>       As you probably know, if a job is rendering its last few frames,
>       newly idle cpus will go to the next jobs down. If someone requeues
>       all the frames in one of the higher up jobs, then that could bring
>       them back down to 0% done, and they'd have to wait for available procs.
>       I'll be able to tell from the 'Frames' report; the TRY column will show
>       if a frame has already been run before.

Of course I know that. As you can see in the report, 81% done was not at the end of the shot; the shot is quite long. You can see the problem part, where only the machine 'sunrender' is rendering frames.
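
A rough way to see that from the frames report itself (just re-using the 'rush -lf' report you asked for; this only counts report lines that mention the host, it doesn't assume any particular column layout):

    # Count how many frame lines of job 12225 mention 'sunrender' --
    # a quick check that this host is the only one picking up frames.
    rush -lf rind10.12225 | grep sunrender | wc -l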

>
>> 2. problem:
>> Most of the time, switching a machine/workstation from offline to
>> online, it takes from many seconds to several minutes for this machine
>> to pick up a frame and start rendering. The machine is shown as 'online'
>> instantly, but it just won't start rendering a frame. It's listed as
>> 'online' and 'idle' for several minutes. This happens for all of our
>> machines, doesn't matter if they are Macs or Linux.
>
>       Can you send me the tasklist for the machine in question, ie:
>
>               rush -tasklist SLOWHOST

see separate mail
>
>       ..I want to see how large that report is. If it's really large,
>       that might be the reason.
>
>       That report will show the list of jobs it is considering
>       to give the idle cpus in the order it wants to check.
>
>       One situation might be if there's a bunch of jobs at the
>       top of its list that are being managed by a machine that
>       is currently down. In that case rush will try to contact
>       that machine to get the job started, and will keep trying
>       until a timeout of about a minute or so, then it will give up
>       and move to the next jobs in the list that are not on that
>       unresponsive machine.
>
>       Another possibility is if machines reboot to new IP addresses
>       (eg. DHCP assigned machines), that might cause rushd to not
>       be able to reach job servers to establish jobs, causing the
>       above situation.
>
>       It might be good if you send me the rushd.log from machines
>       that act this way; I might be able to tell from that if
>       there's a problem.
>
That doesn't seem to be the problem; the rushd.log of the problem machine 'apu' is quite small:

today:
07/20,03:00:27 ROTATE     rushd.log rotated. pid=146, 0/3 busy, OFFLINE
07/20,03:00:27 ROTATE     apu RUSHD 102.42a9d PID=146     Boot=07/17/12,19:07:25

yesterday:
07/19,03:00:16 ROTATE     rushd.log rotated. pid=146, 0/3 busy, OFFLINE
07/19,03:00:16 ROTATE     apu RUSHD 102.42a9d PID=146     Boot=07/17/12,19:07:25
07/19,18:23:56 SECURITY   Daemon changed to Online by aschneid@itchy[192.168.10.21], Remark:online by aschneid 07/19/12,18:23 via irush state (online) by aschneid@itchy[192.168.10.21]
07/19,19:02:03 SECURITY   Daemon changed to Getoff by aschneid@itchy[192.168.10.21], Remark:getoff by aschneid 07/19/12,19:02 via irush state (getoff) by aschneid@itchy[192.168.10.21]
07/20,03:00:27 ROTATE     rushd.log rotated by rush.conf: logrotatehour=3


>> Any explanation for that?
>
>    Those large jobids might be the culprit, not sure.
>
>    When you have jobidmax set high, this can mean thousands of jobs can remain
>    in the queue, causing the system to work extra hard to find jobs that are
>    available.
>
>    The large max should be OK /as long/ as the 'Jobs' reports are kept trim.
>    ie. dump old jobs. You don't want to leave old jobs in the queue; they take
>    up memory and make the daemon work harder internally to consider those
>    jobs in case they've been requeued.
>
>    And if you have several job servers each with very large queues, that
>    would exacerbate the problem.
>
>    The reason rush comes with 999 as the max for jobids is to force
>    folks to dump old jobs so that the queue doesn't get artificially large.

As I said above, we try to dump as often as possible. We use only one submit host and I get:

rush -lj rind10 | wc
     266    2130   25912

So there are about 260 jobs in our queue.
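
For what it's worth, a quick way to pick out the dump candidates from that same listing (assuming the 'rush -lj' columns shown at the top of this mail: state in column 1, jobid in column 2):

    # List the jobids of jobs that have already finished (Done or Fail);
    # those are the ones we dump regularly to keep the queue small.
    rush -lj rind10 | awk '$1 == "Done" || $1 == "Fail" { print $2 }'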


Hope that helps.

Abraham



Abraham Schneider
Senior VFX Compositor
 

ARRI Film & TV Services GmbH
Tuerkenstr. 89
D-80799 Muenchen / Germany

Phone (Tel# suppressed) 

EMail aschneider@(email suppressed)
www.arri.de/filmtv
________________________________


ARRI Film & TV Services GmbH
Registered office: Munich; register court: Amtsgericht München
Commercial register number: HRB 69396
Managing directors: Franz Kraus, Dr. Martin Prillmann, Josef Reidinger
