On 19.07.2012 at 18:58, Greg Ercolano wrote:
> [posted to rush.general]
>
> On 07/19/12 01:51, Abraham Schneider wrote:
>> Done rind10.12202 N_NORM_019_060_comp_v04_as aschneid %100 %0 0 16:52:52
>> Done rind10.12204 N_NORM_002_010_comp_v16_dl dlaubsch %100 %0 0 16:32:02
>> Done rind10.12206 N_LOW_048_060_redLog_v00a_mw mwarlimont %100 %0 0 16:28:48
>> Done rind10.12213 N_NORM_001_050_comp_v38_ts tstern %100 %0 0 16:06:51
>> Done rind10.12215 N_LOW_048_080_redLog_v00a_mw mwarlimont %100 %0 0 16:05:03
>> Done rind10.12218 N_NORM_019_060_comp_v04_as aschneid %100 %0 0 15:08:02
>> Fail rind10.12221 N_NORM_001_010_comp_v01_st stischne %99 %1 0 00:58:31
>> Done rind10.12223 N_NORM_001_010_comp_v01_st stischne %100 %0 0 00:55:04
>> Run rind10.12225 N_NORM_001_050_comp_v38_ts tstern %81 %0 2 00:41:35
>> Run rind10.12226 N_NORM_055_010_comp_v01_mt mwarlimont %0 %0 0 00:36:58
>> Run rind10.12227 N_NORM_100_020_comp_v21_mt mwarlimont %0 %0 0 00:34:11
>> Done rind10.12228 N_NORM_001_010_comp_v01_st stischne %100 %0 0 00:23:08
>> Run rind10.12230 N_NORM_103_010_comp_v14_mt mwarlimont %36 %0 17 00:19:49
>> Run rind10.12231 N_NORM_022_cfd0046_comp_v102 ppoetsch %6 %0 0 00:19:37
>>
>> ..sometimes something like above happens: job 12225 starts rendering on all online
>> machines. But halfway through the rendering, it just stops or the amount
>> of CPUs drops significantly and all the other machines continue
>> rendering on a much newer job
>
> Hi Abraham,
>
> Wow, you have large jobids! You must have the jobidmax value
> cranked up. Be careful with that (see below).
Don't worry, it's just the limit that I cranked up. Normally there are no more than 200-300 jobs in rush; we dump them regularly. I raised the limit because when there are jobs on the farm and the maximum jobid is reached, new jobs start again at 0, and those jobs get rendered next, ahead of older jobs still waiting in the queue. FIFO is based on jobid, I think!? So I wanted to avoid having that wrap from the maximum jobid back to 0 happen too often during the day.
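That wraparound effect can be sketched like this (a toy model only, assuming the scheduler simply orders waiting jobs by raw jobid; the actual Rush internals may differ):

```python
# Toy model (not Rush internals): if waiting jobs are served FIFO by raw
# jobid, a wrap from jobidmax back to 0 makes brand-new jobs sort ahead
# of older jobs still sitting in the queue.
JOBIDMAX = 999  # Rush's default limit; raised on this farm


def next_jobid(last_id, jobidmax=JOBIDMAX):
    """Next jobid to assign, wrapping back to 0 past jobidmax."""
    return 0 if last_id >= jobidmax else last_id + 1


def fifo_order(queue):
    """Naive FIFO: serve waiting jobs in ascending raw-jobid order."""
    return sorted(queue)


# Older jobs 997..999 are still waiting when two new jobs wrap to 0 and 1:
queue = [997, 998, 999]
jid = 999
for _ in range(2):
    jid = next_jobid(jid)
    queue.append(jid)

print(fifo_order(queue))  # [0, 1, 997, 998, 999] -- the new jobs jump the line
```

With a much larger jobidmax, the wrap happens rarely, which is exactly the workaround described above.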
>
> Can I see these reports for the rind10.12225/6/7 jobs? eg:
>
> rush -lf rind10.12225 rind10.12226 rind10.12227
> rush -lc rind10.12225 rind10.12226 rind10.12227
see separate mail to greg@(email suppressed)
>
> The '26 and '27 jobs appear to be getting completely skipped over,
> they stand out to me.
>
> All the rest seem like they could be OK, need to see the reports
> to know more.
>
> The 12225 job doesn't worry me too much, as it's 81% done with
> 2 busy frames, so if those are the last two frames in the job,
> that would make sense. But if there are still available frames
> in the Que state with a TRY count of zero, that would be puzzling.
>
> As you probably know, if a job is rendering its last few frames,
> newly idle cpus will go to the next jobs down. If someone requeues
> all the frames in one of the higher up jobs, then that could bring
> them back down to 0% done, and they'd have to wait for available procs.
> I'll be able to tell from the 'Frames' report; the TRY column will show
> if a frame has already been run before.
Of course I know that. As you can see in the report, 81% done was nowhere near the end of the shot, as the shot is quite long. You can see the problem section, where only the machine 'sunrender' is rendering frames.
>
>> Problem 2:
>> Most of the time, when switching a machine/workstation from offline to
>> online, it takes anywhere from several seconds to several minutes for
>> that machine to pick up a frame and start rendering. The machine is
>> shown as 'online' instantly, but it just won't start rendering a frame.
>> It's listed as 'online' and 'idle' for several minutes. This happens on
>> all of our machines, no matter whether they are Macs or Linux boxes.
>
> Can you send me the tasklist for the machine in question, ie:
>
> rush -tasklist SLOWHOST
see separate mail
>
> ..I want to see how large that report is. If it's really large,
> that might be the reason.
>
> That report will show the list of jobs it is considering
> to give the idle cpus in the order it wants to check.
>
> One situation might be if there's a bunch of jobs at the
> top of its list that are being managed by a machine that
> is currently down. In that case rush will try to contact
> that machine to get the job started, and will keep trying
> until a timeout of about a minute or so, then it will give up
> and move to the next jobs in the list that are not on that
> unresponsive machine.
>
> Another possibility is if machines reboot to new IP addresses
> (eg. DHCP assigned machines), that might cause rushd to not
> be able to reach job servers to establish jobs, causing the
> above situation.
>
> It might be good if you send me the rushd.log from machines
> that act this way; I might be able to tell from that if
> there's a problem.
>
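The retry-until-timeout behavior described above can be modeled roughly as follows (a sketch only; the ~60-second timeout and the tasklist ordering are taken from the description, not from Rush source):

```python
# Hypothetical model of the described behavior: an idle host walks its
# tasklist in order, and each job managed by an unreachable job server
# costs up to a full contact timeout before the host moves on. A few dead
# servers at the top of the list can leave a host 'online' but idle for
# minutes.
CONTACT_TIMEOUT = 60  # seconds; approximate, per the description above


def seconds_until_first_frame(tasklist, reachable):
    """Simulated delay before an idle host picks up its first frame.

    tasklist  -- job server hostnames in the order the host checks them
    reachable -- set of hostnames that are currently up
    """
    delay = 0
    for server in tasklist:
        if server in reachable:
            return delay          # job established, frame starts
        delay += CONTACT_TIMEOUT  # burned a full timeout on a dead server
    return delay                  # nothing reachable at all


# Three down servers ahead of the live one -> ~3 minutes of apparent idling:
print(seconds_until_first_frame(
    ["down1", "down2", "down3", "rind10"], reachable={"rind10"}))  # 180
```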
That doesn't seem to be the problem here; the rushd.log of the problem machine 'apu' is quite small:
today:
07/20,03:00:27 ROTATE rushd.log rotated. pid=146, 0/3 busy, OFFLINE
07/20,03:00:27 ROTATE apu RUSHD 102.42a9d PID=146 Boot=07/17/12,19:07:25
yesterday:
07/19,03:00:16 ROTATE rushd.log rotated. pid=146, 0/3 busy, OFFLINE
07/19,03:00:16 ROTATE apu RUSHD 102.42a9d PID=146 Boot=07/17/12,19:07:25
07/19,18:23:56 SECURITY Daemon changed to Online by aschneid@itchy[192.168.10.21], Remark:online by aschneid 07/19/12,18:23 via irush state (online) by aschneid@itchy[192.168.10.21]
07/19,19:02:03 SECURITY Daemon changed to Getoff by aschneid@itchy[192.168.10.21], Remark:getoff by aschneid 07/19/12,19:02 via irush state (getoff) by aschneid@itchy[192.168.10.21]
07/20,03:00:27 ROTATE rushd.log rotated by rush.conf: logrotatehour=3
>> Any explanation for that?
>
> Those large jobids might be the culprit, not sure.
>
> When you have jobidmax set high, this can mean thousands of jobs can remain
> in the queue, causing the system to work extra hard to find jobs that are
> available.
>
> The large max should be OK /as long/ as the 'Jobs' reports are kept trim.
> ie. dump old jobs. You don't want to leave old jobs in the queue; they take
> up memory and make the daemon work harder internally to consider those
> jobs in case they've been requeued.
>
> And if you have several job servers each with very large queues, that
> would exacerbate the problem.
>
> The reason rush comes with 999 as the max for jobids is to force
> folks to dump old jobs so that the queue doesn't get artificially large.
As I said above, we try to dump as often as possible. We use only one submit host, and I get:
rush -lj rind10 | wc
     266    2130   25912
So there are about 260 jobs in our queue.
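For a slightly finer count than `wc`, the saved `rush -lj` output can be tallied by state (assuming the first field of each line is the job state, as in the report at the top of this thread):

```python
# Hedged helper: assumes each 'rush -lj' line starts with the job state
# (Done/Fail/Run/...) followed by the jobid, matching the report format
# shown earlier in this thread.
from collections import Counter


def count_states(report_lines):
    """Tally jobs by state from saved 'rush -lj' output lines."""
    states = Counter()
    for line in report_lines:
        fields = line.split()
        if fields:
            states[fields[0]] += 1
    return states


sample = [
    "Done rind10.12202 N_NORM_019_060_comp_v04_as aschneid %100 %0 0 16:52:52",
    "Fail rind10.12221 N_NORM_001_010_comp_v01_st stischne %99 %1 0 00:58:31",
    "Run rind10.12225 N_NORM_001_050_comp_v38_ts tstern %81 %0 2 00:41:35",
]
print(count_states(sample))  # Counter({'Done': 1, 'Fail': 1, 'Run': 1})
```

Listing how many jobs are Done versus Run makes it easy to see how much of the queue is just waiting to be dumped.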
Hope that helps.
Abraham
Abraham Schneider
Senior VFX Compositor
ARRI Film & TV Services GmbH
Tuerkenstr. 89
D-80799 Muenchen / Germany
Phone (Tel# suppressed)
EMail aschneider@(email suppressed)
www.arri.de/filmtv
________________________________
ARRI Film & TV Services GmbH
Registered office: Munich; Register court: Amtsgericht München
Commercial register number: HRB 69396
Managing directors: Franz Kraus, Dr. Martin Prillmann, Josef Reidinger