From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: how to deal with missing Nuke plugin licenses
   Date: Thu, 27 Oct 2011 11:47:36 -0400
Msg# 2139
On 10/26/11 06:49, Abraham Schneider wrote:
> For Nuke itself, we have enough render licenses to use the whole farm
> for rendering. But for some of the plugins (Furnace, Ocula, ...) we
> have only a limited number of licenses. I'm wondering now how to deal
> with this on a Rush render farm. I see three possibilities:
> 
> 1. Limit the cpus used by the job to the number of available licenses.
> This seems fine, but has two disadvantages. First, it only works if a
> single job on the farm uses the plugin; if you start a second job, it
> will try to render on the other free machines and fail. Second,
> license servers sometimes don't release licenses as fast as the job
> jumps from machine to machine. So even if I limit the job to the
> correct number of cpus, a license may be missing when one machine
> finishes a frame and a different machine wants to start a new frame.
> 
> 2. Use the hosts file to define groups of machines that contain only
> the correct number of machines. This should avoid the problems above,
> but it's painful to manage: if a machine or two goes down, you have
> to change the hosts file again.

	Yes; defining a hostgroup such as +furnace would be one
	way to go.

	Yes, if one of the machines in the group is taken down,
	you'd have to modify that hostgroup's membership, but
	I'd think that would be part of regular network
	administration: enabling/disabling machines when
	they're taken down (as opposed to a machine that
	just needs a reboot).

> You have slower and faster
> machines; how do you distribute them among the different groups?

	You can make two sub-groups if you want control over
	machine speed, e.g.:

		+furnace	-- all the 'furnace' machines
		+furnace_fast	-- just the fast ones in the furnace group
		+furnace_slow	-- just the slow ones in the furnace group

	So if you have a job that needs to keep at least 2 cpus busy
	on the fast machines, have that job ask for the +furnace_fast
	machines at a higher priority, e.g.:

		+furnace=10@100
		+furnace_fast=2@900
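
	A small Python sketch of how those two requests break down,
	assuming only the simple 'hostgroup=cpus@priority' form shown
	above. parse_cpu_spec() is a hypothetical helper, not part of
	Rush, and any other cpu spec options Rush accepts are not handled:

```python
import re

# Hypothetical helper: only the plain "+group=cpus@priority" form is handled.
SPEC_RE = re.compile(r"^(\+?[\w.-]+)=(\d+)@(\d+)$")


def parse_cpu_spec(spec):
    """Split a spec like '+furnace=10@100' into (hostgroup, cpus, priority)."""
    m = SPEC_RE.match(spec.strip())
    if m is None:
        raise ValueError(f"unsupported cpu spec: {spec!r}")
    group, cpus, priority = m.groups()
    return group, int(cpus), int(priority)


if __name__ == "__main__":
    specs = ["+furnace=10@100", "+furnace_fast=2@900"]
    # List the highest priority first, so the +furnace_fast request stands out.
    for group, cpus, pri in sorted(map(parse_cpu_spec, specs), key=lambda s: -s[2]):
        print(f"{group}: {cpus} cpus at priority {pri}")
```

	This prints the 2-cpu +furnace_fast request at priority 900 first,
	then the 10-cpu +furnace request at priority 100. The sort only
	makes the ranking visible; it is not a model of the scheduler.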

> It's a possible solution but doesn't feel like THE solution :)

	A centralized 'license counter' is perhaps what you want,
	but it has its own issues: unpredictable interactive use also
	counts against the licenses, a single machine would have to be
	responsible for keeping track of license counts, etc.
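
	To make the trade-off concrete, here is a minimal Python sketch of
	such a counter, assuming a single process owns the count. The
	LicenseCounter class and its methods are hypothetical, not a Rush
	or license-server API, and it only knows about checkouts routed
	through it:

```python
import threading


class LicenseCounter:
    """Hypothetical central count of a fixed pool of plugin licenses."""

    def __init__(self, total):
        self.total = total
        self.held = {}                 # host name -> licenses checked out
        self.lock = threading.Lock()   # one owner process, many callers

    def checkout(self, host):
        """Reserve one license for host; False if the pool looks exhausted."""
        with self.lock:
            if sum(self.held.values()) >= self.total:
                return False
            self.held[host] = self.held.get(host, 0) + 1
            return True

    def release(self, host):
        """Give back one license previously reserved by host."""
        with self.lock:
            if self.held.get(host, 0) > 0:
                self.held[host] -= 1
                if self.held[host] == 0:
                    del self.held[host]


# Two licenses, three render nodes asking for one each.
counter = LicenseCounter(total=2)
print([counter.checkout(h) for h in ("node1", "node2", "node3")])  # [True, True, False]
counter.release("node1")
print(counter.checkout("node3"))  # True
```

	Someone running the plugin interactively takes a license from the
	license server without ever calling checkout(), so the counter can
	report a free license that the server will refuse; and the process
	holding the count becomes a single point of failure.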

> 3. Use something like the licpause function of Rush. Problems with
> licpause: it pauses the JOB, not the frame/batch frames of the machine
> that has the license problem.

	Yes; that's by design. A job shouldn't pick up more machines
	if the software it's running is out of licenses; it doesn't
	make sense to tie up newly available cpus with a job that
	won't be able to run.

	So licpause gives newly available cpus a chance to run other
	jobs when a job can't get more licenses.

> And the normal license pause function of
> submit-nuke.pl won't work, because some of the plugins don't
> return an error exit code even when there is a license error.

	That should be OK; if you can identify all the license error
	messages, the script can check for these messages (even if the
	exit code is zero) to detect the license error, and handle it
	accordingly.
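
	The idea amounts to checking the log text before trusting the
	exit code. A Python sketch of that decision; the two patterns are
	placeholders, since the actual Furnace/Ocula license messages
	aren't shown in this thread, and classify_frame() is a
	hypothetical name:

```python
import re

# Placeholder patterns: replace with the exact license messages
# copied from a real frame log.
LICENSE_ERROR_PATTERNS = [
    re.compile(r"no license available", re.IGNORECASE),
    re.compile(r"license checkout failed", re.IGNORECASE),
]


def classify_frame(exit_code, log_text, patterns=LICENSE_ERROR_PATTERNS):
    """Return 'license', 'fail' or 'ok' for a finished render."""
    # Check the log first: some plugins exit 0 even on a license error.
    if any(p.search(log_text) for p in patterns):
        return "license"
    return "ok" if exit_code == 0 else "fail"


print(classify_frame(0, "Furnace: No license available\n"))  # license
```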

	If you supply me with the complete frame log showing the license
	error messages, I can tell you how to add those checks to the script.
	Or, send me both the error messages and the script, and I can make
	the change for you so you can see how to add your own.


> Because my Perl knowledge is very limited, I have two questions:
> - How can I check (for example by grepping the logfile) for
> license problems inside submit-nuke.pl and return a different
> exit code, so the normal licpause function will also work?

	There is a global LogCheck() function built into .common.pl
	(which all the scripts load for 'common' functions) that can
	be called to 'grep' the log file for certain messages.

	It takes retries into account, so older messages left in the
	same log by earlier retries don't retrigger the error.
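
	The retry handling can be pictured like this: only the part of the
	log written since the most recent attempt is searched. This is a
	Python sketch of the idea, not the actual LogCheck() code from
	.common.pl; log_check() and the per-attempt start marker (a line
	the wrapper would print at the start of every attempt) are
	assumptions:

```python
import re


def log_check(log_text, pattern, start_marker):
    """True if pattern appears in the section written by the latest attempt.

    start_marker is a hypothetical line printed at the start of each
    attempt; text before its last occurrence belongs to earlier retries.
    """
    start = log_text.rfind(start_marker)
    current = log_text if start < 0 else log_text[start:]
    return re.search(pattern, current) is not None


log = (
    "=== attempt start ===\n"
    "ERROR: no license available\n"
    "=== attempt start ===\n"
    "Frame 12 rendered\n"
)
print(log_check(log, r"no license", "=== attempt start ==="))  # False
```

	The license error belongs to the first attempt, so the retry that
	rendered successfully is not flagged. If no marker is found, the
	whole log is searched.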

	With the complete frame logs mentioned above showing the license
	errors, I can show you what to change.

> - What would be a good way to do something like the license pause on a
> per-frame basis instead of per job? Any suggestions?

	You can do things like sleep() and retry the command
	repeatedly until it works; that's not hard. But that ties up
	the cpu until a license becomes available. It might be better
	to make the cpu available to other jobs, in which case
	you can just sleep and exit(2) so that rush requeues the frame,
	allowing the scheduler to 'round robin' select some other job.
	(The sleep prevents the scheduler from 'spinning' the requeued
	frame too quickly in case there are no other jobs.)
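
	A sketch of that 'sleep, then exit(2)' approach as a Python
	wrapper. Exit code 2 for a requeue comes from the description
	above; using 0 for done and 1 for failure, the 30 second delay,
	and the render_frame() name are assumptions for illustration:

```python
import subprocess
import sys
import time

EXIT_DONE = 0       # assumption: normal completion
EXIT_FAIL = 1       # assumption: ordinary failure
EXIT_REQUEUE = 2    # per the post: rush requeues the frame on exit(2)
REQUEUE_DELAY = 30  # seconds; assumption, tune to taste


def render_frame(cmd, is_license_error, delay=REQUEUE_DELAY):
    """Run one frame; on a license error, sleep and ask for a requeue."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    log = result.stdout + result.stderr
    sys.stdout.write(log)       # keep the renderer output in the frame log
    if is_license_error(log):
        time.sleep(delay)       # wait so a lone job's frame doesn't spin
        return EXIT_REQUEUE     # frees the cpu for other jobs
    return EXIT_DONE if result.returncode == 0 else EXIT_FAIL


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python wrapper.py <render command and its arguments>
    sys.exit(render_frame(sys.argv[1:], lambda log: "no license" in log.lower()))
```

	Retrying in place would instead loop around subprocess.run(), which
	holds the cpu for the whole wait; exiting with the requeue code
	lets other jobs use it in the meantime.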


-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

