From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 18:08:41 -0400
Msg# 1585
Antoine Durr wrote:
>> 	Agreed; but if you have e.g. node locked licenses,
>> 	it at least ensures only those boxes get the renders.
> 
> Again, an appropriately named pool would take care of that.

	Yes, submitting to +houdini=2 is nicer; that way, even if the
	node locks are moved around, users don't have to
	change their 'Cpus' specs.

> Jobs should have the notion of "requirements", one of which is a 
> particular license type.   Admittedly, this is tricky because users use 
> the licenses w/out notifying the queue!  So the queue has to figure out 
> what's left, figure out how many it's using, and only allow so many 
> more after that.

	Yes.. some companies make license counting 'wrappers' for their
	renderers, so that interactive use can be tracked and 'predicted'
	as part of a larger, 'reservation' oriented system.

	Others use their 'watcher' to interrogate the third party
	license managers, to see how many licenses are available, and
	modify the cpu allocations to keep that balanced.
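
	As a sketch of the 'wrapper' idea: check the license server
	before launching the render, and bail with a retryable exit if
	nothing's free. The python below assumes a FlexLM style
	'lmutil lmstat' server; the output parsing and the feature name
	are made up, so adjust for your vendor. (And note it still has
	the race condition mentioned below.)

	#!/usr/bin/env python
	# Rough sketch of a license counting 'wrapper': before launching the
	# renderer, ask the license server how many seats are free, and only
	# run if one is available. Assumes a FlexLM style server where
	# 'lmutil lmstat -f <feature>' prints something like:
	#   Users of <feature>: (Total of 10 licenses issued; Total of 4 licenses in use)
	# The exact output format varies by vendor/version, so treat the
	# parsing below as an example only.
	import re, subprocess, sys

	FEATURE = "houdini_render"        # hypothetical feature name

	def free_licenses(feature):
	    out = subprocess.run(["lmutil", "lmstat", "-f", feature],
	                         capture_output=True, text=True).stdout
	    m = re.search(r"Total of (\d+) licenses? issued;\s+"
	                  r"Total of (\d+) licenses? in use", out)
	    if not m:
	        return 0                  # can't tell, play it safe
	    issued, used = int(m.group(1)), int(m.group(2))
	    return issued - used

	if free_licenses(FEATURE) < 1:
	    sys.stderr.write("No %s licenses free, requeue this frame\n" % FEATURE)
	    sys.exit(1)                   # nonzero exit -> let the queue retry later

	sys.exit(subprocess.call(sys.argv[1:]))   # run the real render command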

	Honestly though, most folks with large nets just buy or rent
	the licenses they need so they can make use of the whole farm,
	and not have to juggle that stuff, because even when it's done
	right, there are race conditions with the license counting,
	unless you have some kind of reservation system embedded in
	the license system, or a wrapper that does this.

> When the process completes, it tails the last line of that 
> file, and puts it into the per-frame notes field, so that the users see 
> what kind of memory footprint their job had.

	Yes, that is very useful info, and it's best to simply advertise
	it to the user, so they can submit their job with a correct
	'Ram' value based on their impression of the numbers.

	Often the renderers print those numbers in the log for you,
	so you can just grep them out; maya and mental ray both
	do this.

	Trouble is, the format of these messages can change from one
	rev of the renderer to another. And, depending on the job,
	sometimes there are several of these messages per frame,
	such as when renders are batching, or worse, when complex jobs
	render multiple images per frame ('levels', 'passes', 'comps', etc).
	So it gets a little tricky to implement that in a way that
	works for all situations.
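
	If you do want to grep the numbers out, something along these
	lines works. The regex is hypothetical, since every renderer
	(and every rev) words the line differently; taking the max
	handles the multiple-messages-per-frame case:

	#!/usr/bin/env python
	# Sketch: pull the peak memory figure out of a render log.
	# The pattern below is hypothetical -- expect to maintain the
	# regex per-renderer (and per-rev).
	import re, sys

	PAT = re.compile(r"peak memory[^0-9]*([\d.]+)\s*(MB|GB)", re.IGNORECASE)

	def peak_mb(logfile):
	    peak = 0.0
	    for line in open(logfile):
	        m = PAT.search(line)
	        if not m:
	            continue
	        mb = float(m.group(1)) * (1024 if m.group(2).upper() == "GB" else 1)
	        peak = max(peak, mb)      # several messages per frame? keep the max
	    return peak

	if __name__ == "__main__":
	    print("%.0f MB" % peak_mb(sys.argv[1]))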

	It's a good idea to show that info in the rush 'Frames' reports;
	just watch out that on large networks, running the 'rush -notes'
	command from within the render script /every frame/ may
	"DoS attack" your jobserver, especially if render times are short.

	For folks with large networks, I recommend /against/ running
	commands like 'rush -notes' or 'rush -ljf' in the render scripts
	every frame, for that reason. 'rush -notes' is a high latency TCP
	command that isn't really meant to be run on hundreds of machines
	at once, all hitting the same server; it's only recommended for
	advertising error conditions.

	On a small net like yours, though, it shouldn't be a problem.
	It's only when you get above 50 render nodes or so that this
	becomes an issue; depends on how fast the render times are.

	A few weeks ago I implemented a much lower latency 'rush -exitnotes'
	command which handles sending 'per frame' messages back to the
	jobserver in a reasonable manner: it connects to the 'render node'
	instead of the job server, and the note is passed back to the
	job server as part of the UDP message that delivers the frame's
	exit code. It'll be in version 102.43.

	I'd like to add the 'ram usage indicator' to the submit scripts
	as an option, once 'rush -exitnotes' is fully released.
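
	Once that's out, the render script could push the figure back
	per frame with something like the python sketch below; the exact
	'rush -exitnotes' arguments here are a guess, so check the
	102.43 docs for the real invocation:

	# Sketch of how the render script might advertise the per-frame peak
	# once 'rush -exitnotes' ships. The flags below are a guess; check
	# the 102.43 docs for the real invocation.
	import os, subprocess

	def report_exitnote(text):
	    # RUSH_FRAME is one of the environment variables rush sets
	    # for render scripts
	    frame = os.environ.get("RUSH_FRAME", "?")
	    # Hypothetical invocation; lower latency because it talks to the
	    # local render node, not the jobserver.
	    subprocess.call(["rush", "-exitnotes", "frame %s: %s" % (frame, text)])

	report_exitnote("peak ram 1843 MB")    # e.g. the value grepped from the log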

> This is a pretty critical 
> feature, IMO, as you *really* want to avoid going into swap on a box!

	Yes, definitely.

	Gets tricky to detect swap though, as often when a box looks like
	it's swapping, it's actually just paging out old junk to make room
	in ram that it should have cleared out long ago.

	Best thing is to just know in advance how much ram the job will
	tend to need, and submit with that ram value set. (e.g. the 'Ram:'
	prompt in the submit forms)

	'rushtop' is handy for seeing if a job is using a lot of ram.
	Just render your job on a box, and watch the ram use as the
	render runs to get a feel for how nasty it is.. then submit
	with the 'Ram' field set accordingly.

	Rushtop is the only thing in rush that actually polls ram use,
	and that's global ram use, not process hierarchy ram use.. global
	ram use is about the only ram figure all the OS's deliver
	coherently for our purposes.
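
	Just to show what 'global ram use' polling amounts to, here's a
	linux-only python sketch. Note that deciding what counts as
	"free" (buffers? cache?) is exactly the fuzzy part that differs
	between OS's and kernel revs:

	# Linux-only sketch of 'global ram use' polling -- just /proc/meminfo,
	# nothing per-process. The "what counts as free" arithmetic is the
	# part that changes from kernel to kernel.
	def meminfo():
	    info = {}
	    for line in open("/proc/meminfo"):
	        key, val = line.split(":", 1)
	        info[key] = int(val.split()[0])   # values are in kB
	    return info

	m = meminfo()
	free_kb = m["MemFree"] + m.get("Buffers", 0) + m.get("Cached", 0)
	print("ram in use: %d of %d MB" % ((m["MemTotal"] - free_kb) // 1024,
	                                   m["MemTotal"] // 1024))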

	But the numbers the renderers spit out are the best ones
	by far. Beats polling.

> Ideally, I should be able to submit with a requirement of a certain 
> amount of ram, and have the frames only run on machines that have that 
> much ram left.

	Yes, the 'Ram' submit value can be used for this; the value
	your job submits with, say '10', tells rush each frame uses
	'10' units of ram. The 'RAM' column in the rush/etc/hosts file
	indicates how many units of ram rush thinks each machine has,
	so when it tries to start a frame rendering, it subtracts the
	job's 'ram' value from the total to see if there's room to run it.

	This is all management of static values.. rush doesn't actually
	poll the machine's ram use. Rush assumes if it can use the machine,
	it can use all of it. You can reserve cpus (and ram) using the
	'rush -reserve' command.
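
	In other words, the scheduling side is just integer bookkeeping,
	roughly like this python sketch (the hostnames and unit values
	are made up):

	# Sketch of the static bookkeeping described above: no polling, just
	# "units of ram" arithmetic against the rush/etc/hosts RAM column.
	hosts   = {"tahoe": 16, "vaio": 8}        # hostname -> RAM units (hosts file)
	running = {"tahoe": [10], "vaio": []}     # 'Ram' values of frames already running

	def can_start(host, job_ram):
	    """True if a frame needing 'job_ram' units still fits on 'host'."""
	    return (hosts[host] - sum(running[host])) >= job_ram

	print(can_start("tahoe", 10))   # False: only 16 - 10 = 6 units left
	print(can_start("vaio", 8))     # True:  8 - 0 = 8 units left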

> Yes, that is not failure-free, as doing only 
> spot-checks doesn't tell you that a particular job on a host might 
> suddenly chew up another gig.  But at least it should try.

	My goal was to avoid any features in rush that had too much
	of a 'fuzzy' aspect to them, such as polling ram use.

	Even kernel mailing lists argue endlessly on how free ram
	should be determined, and it often changes from release
	to release. I've had to tweak rushtop several times to take
	into account changes in the different OS's ram calculations.

	It's too bad that most of the OS's (esp unix!) don't let
	a parent program get back the memory use and usr/sys time
	of the accumulated process tree. The counters are there
	in the structures for the accumulated times, but they're
	all zeroed out. All you can reliably get back is the ram/cpu
	time of the immediate process, and that's usually just
	the perl script, which is useless. The only way I've seen
	to get unix to show process hierarchy data is to have
	system wide process accounting (acct(2)) turned on.
	And whenever that's on, the system bogs down because
	process accounting makes giant logs quickly. But it does help
	the OS tally accumulated process tree info internally.
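
	For what it's worth, here's the extent of what unix will hand
	back reliably: the rusage of the immediate child, via wait4().
	A python sketch:

	# Sketch of what unix *will* give you reliably: the rusage of the
	# immediate child, via wait4(). If that child is just a perl/python
	# wrapper that spawns the real renderer, these numbers are near
	# useless, which is the problem described above.
	import os, resource, sys

	pid = os.fork()
	if pid == 0:
	    os.execvp(sys.argv[1], sys.argv[1:])   # e.g. the render command

	_, status, ru = os.wait4(pid, 0)
	# ru_maxrss is kilobytes on linux, bytes on some other unixes
	print("usr=%.1fs sys=%.1fs maxrss=%s" % (ru.ru_utime, ru.ru_stime, ru.ru_maxrss))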

	Surprisingly, Windows is the only OS that seems to tally
	child processes correctly, both ram and cpu ticks, with
	their new 'job objects' stuff.

	I only recently discovered that, and will try some experiments
	to see if it /actually/ works and isn't just a place holder.
	This way the cpu.acct file can finally log the memory use
	and usr/sys time info instead of being all zeroes, as well
	as provide the info for the frames report..!

> Funny, I'm doing the ram-usage polling right now.  I simultaneously 
> launch a memory watcher script, which given a PID, finds all the child 
> PIDs and adds up their memory consumption, writes to a file in the 
> logdir.

	Trouble I've found with snapshotting the proc table (did it at DD)
	is that you run into a few real world problems, enough that it
	can often cause more trouble than it's worth.

	When polling the process hierarchy, you can end up with wild
	snapshots when processes fork, showing double memory use during
	that time. You can try to smooth those out as aberrant data,
	but some renders fork frequently, causing the doubled readings
	to sometimes look valid, throwing the job into a stall.
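
	For reference, the watcher itself is simple enough; a linux
	/proc sketch in python is below. Note it's exactly the kind of
	snapshot that hits the fork-doubling problem:

	# Linux /proc sketch of the 'memory watcher' approach: given a PID,
	# walk the process table, find its descendants, and sum their RSS.
	# Right after a fork, parent and child briefly report the same
	# footprint, so a snapshot taken then can show ~double the real use.
	import os

	def descendants(root):
	    """Return root plus all of its descendant PIDs, from /proc PPid info."""
	    ppid = {}
	    for pid in filter(str.isdigit, os.listdir("/proc")):
	        try:
	            for line in open("/proc/%s/status" % pid):
	                if line.startswith("PPid:"):
	                    ppid[int(pid)] = int(line.split()[1])
	                    break
	        except IOError:                    # process exited mid-scan
	            pass
	    out = {root}
	    changed = True
	    while changed:
	        changed = False
	        for pid, parent in ppid.items():
	            if parent in out and pid not in out:
	                out.add(pid)
	                changed = True
	    return out

	def tree_rss_kb(root):
	    total = 0
	    for pid in descendants(root):
	        try:
	            for line in open("/proc/%d/status" % pid):
	                if line.startswith("VmRSS:"):
	                    total += int(line.split()[1])    # kB
	                    break
	        except IOError:
	            pass
	    return total

	print("%d kB" % tree_rss_kb(os.getpid()))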

	Also, sometimes a single frame would go bananas on ram, causing
	the queue to think the job was going into a phase of high memory
	use. Or sometimes a scene will simply go from a black frame to
	a sudden high memory use, enough to swap. An automated mechanism
	that tries to use this wild data to handle scheduling almost always
	stalls the job, causing folks to simply turn off the feature;
	they'd rather have their render crash a few machines on the
	few frames that go bananas instead of having the job completely
	stall in the middle of the night.

	In rushtop I added an experimental 'paging activity' indicator
	(the 'orange bar') which watches for 'excessive' paging activity,
	and bumps the bar when that happens. This limit was determined
	empirically.. when the orange bar appears, chances are you can
	'feel' the slowness if you're on that box in the form of an
	unresponsive mouse, or similar.
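
	The check itself is just counter deltas; a linux python sketch,
	with the threshold purely empirical like the orange bar's:

	# Linux sketch of the same idea as rushtop's orange bar: sample swap
	# in/out counters from /proc/vmstat and flag 'excessive' paging.
	import time

	def swap_counters():
	    total = 0
	    for line in open("/proc/vmstat"):
	        key, val = line.split()
	        if key in ("pswpin", "pswpout"):
	            total += int(val)
	    return total

	THRESHOLD = 500          # pages swapped per second; tune until it matches
	                         # when the machine actually feels sluggish
	last = swap_counters()
	while True:
	    time.sleep(1)
	    now = swap_counters()
	    if now - last > THRESHOLD:
	        print("excessive paging: %d pages/sec" % (now - last))
	    last = now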

-- 
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)
