On 10/26/11 06:49, Abraham Schneider wrote:
> For Nuke itself, we have enough render licenses to use the whole farm
> for rendering. But for some of the plugins (Furnace, Ocula, ...) we
> have only a limited number of licenses. I'm wondering now how to deal
> with this on a rush renderfarm. I see three possibilities there:
>
> 1. limit the cpus used by the job to the amount of available licenses.
> Seems to be fine, but has two disadvantages: it only works if you have
> 1 job on the farm that uses this plugin. If you start a second job, it
> will try to render on the other free machines and will fail. Second
> problem is that sometimes license servers will not release the
> licenses as fast as the jobs jump from machine to machine. So even if
> I limit the job to the correct amount of cpus, there may be a missing
> license when one machine finishes a frame and a different machine
> wants to start a new frame.
>
> 2. use the hosts file to define groups of machines which only contain
> the correct amount of machines. This should avoid the problems above,
> but handling this is painful. A machine or two may be down, then you
> have to change the hosts file again.
Yes; defining a hostgroup such as +furnace would be one
way to go.
And yes, if one of the machines in the group is taken down,
you'd have to modify that hostgroup's membership. But I'd
think enabling/disabling machines when they're taken down
would be part of regular network administration anyway
(as opposed to a machine that just needs a reboot).
> You have slower and faster
> machines, how do you distribute them to the different groups?
You can make two sub-groups if you want control over
machine speed, e.g.:

    +furnace      -- all the 'furnace' machines
    +furnace_fast -- just the fast ones in the furnace group
    +furnace_slow -- just the slow ones in the furnace group

So if you have a job that needs to keep at least 2 cpus busy
on the fast machines, have that job ask for the +furnace_fast
machines at a higher priority, e.g.:

    +furnace=10@100
    +furnace_fast=2@900
> It's a possible solution but doesn't feel like THE solution :)
A centralized 'license counter' is perhaps what you're wanting,
but it has its own issues; random interactive use counts against
licenses, a single machine would have to be responsible for keeping
track of license counts, etc.
> 3. Use something like the licpause function of Rush. Problems with
> licpause: it pauses the JOB, not the frame/batch frames of the machine
> that has the license problem.
Yes; this is because the job really shouldn't try to pick
up on more machines if the software it's running is out of
licenses; it doesn't make sense to tie up newly available cpus
with a job that will not be able to run.
So the licpause gives newly available cpus a shot at other
jobs when a job can't get more licenses.
> And the normal license pause function of
> the submit-nuke.pl will not work, because some of the plugins will not
> raise an error exit code, so there is a license error,
That should be OK; if you can identify all the license error
messages, the script can check for these messages (even if the
exit code is zero) to detect the license error, and handle it
accordingly.
If you supply me with the complete frame log showing the license
error messages, I can tell you how to add those checks to the script.
Or, send me both the error messages and the script, and I can make
the change for you so you can see how to add your own.
> So because of my very limited Perl knowledge I have two questions:
> - How can I check (for example by doing something like a grep of the
> logfile) for license problems inside of the submit-nuke.pl and raise a
> different exitcode, so the normal licpause function will also work?
There is a global LogCheck() function built into the .common.pl
(which all the scripts load for 'common' functions) that can
be called to 'grep' the log file for certain messages.
It takes retries into account, so that older error messages left
in the same log by earlier retries don't retrigger the check.
If you send the complete frame logs showing the license errors,
I can show you what to change.
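
As a rough illustration of that kind of check, here is a minimal sketch. It is NOT the actual LogCheck() code and is written in Python rather than the Perl of submit-nuke.pl, purely to show the idea. The message patterns, the retry marker, and all function names are hypothetical placeholders; the real patterns have to come from your own frame logs, and the exit code to return is whatever your script's license pause handling expects.

```python
import re

# Hypothetical patterns; replace with the exact messages seen in your frame logs.
LICENSE_PATTERNS = [
    r"license.*(not available|unavailable|failed)",
    r"could not (get|check out).*license",
]

# Hypothetical marker assumed to separate retries within one frame log.
RETRY_MARKER = "--- RETRY ---"


def license_errors(log_text, patterns=LICENSE_PATTERNS, retry_marker=RETRY_MARKER):
    """Return lines from the most recent attempt that match a license pattern."""
    # Only look at text after the last retry marker, so messages left over
    # from earlier attempts in the same log do not retrigger the check.
    latest = log_text.rsplit(retry_marker, 1)[-1]
    regexes = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line for line in latest.splitlines()
            if any(rx.search(line) for rx in regexes)]


def effective_exit_code(render_exit_code, log_text, license_exit_code):
    """Override the render's exit code when the log shows a license error."""
    if license_errors(log_text):
        return license_exit_code
    return render_exit_code
```

The point of the second function is that a plugin exiting with code zero no longer hides the problem: the log decides, not the exit code alone.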
> - what would be a good way to do something like the license pause on a
> per-frame base instead of doing it per job? Any suggestions?
You can do things like sleep() and retry the command
repeatedly until it works; that's not hard. But that ties up
the cpu until a license becomes available. It might be better
if the cpu becomes available to other jobs, in which case
you can just do a sleep and exit(2) so that rush requeues the frame,
allowing the scheduler to 'round robin' select some other job.
(The sleep prevents the scheduler from 'spinning' the requeued
frame too quickly, in case there are no other jobs.)
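
The sleep-then-requeue idea can be sketched like this. Again, it is a Python illustration of the logic only, not code from the Rush submit scripts. The function name and the 30 second delay are assumptions; the exit code 2 is the requeue code mentioned above.

```python
import time

REQUEUE_EXIT_CODE = 2   # per the description above: exit(2) makes rush requeue the frame
BACKOFF_SECONDS = 30    # assumed delay; tune to how fast your license server releases


def requeue_exit_code(license_error_found, render_exit_code,
                      backoff_seconds=BACKOFF_SECONDS, sleep=time.sleep):
    """Pick the frame's exit code, pausing first if a license was missing."""
    if license_error_found:
        # Pause before giving the cpu back, so the requeued frame does not
        # 'spin' when there are no other jobs for the scheduler to pick.
        sleep(backoff_seconds)
        return REQUEUE_EXIT_CODE
    return render_exit_code

# Typical use at the end of a render wrapper (illustrative):
#     sys.exit(requeue_exit_code(found_license_error, exit_code))
```

The `sleep` parameter is only there so the delay can be swapped out when trying the function by hand; in a real wrapper the default is what you'd want.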
--
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)