From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 16:30:46 -0400
Msg# 1584
On 2007-06-06 13:15:34 -0700, Greg Ercolano <erco@(email surpressed)> said:

Antoine Durr wrote:
maxcpus 2
cpus    tahoe=1@999 ontario=1@999

Beyond their own machine (which is important), I really dislike the
notion of a user having to know or choose which machines their stuff
lands on.

	Agreed; but if you have, e.g., node-locked licenses,
	it at least ensures only those boxes get the renders.

Again, an appropriately named pool would take care of that.


	If you have floaters, then yes, you wouldn't specify
	hostnames, just a cpu cap.

	Thing is, if you submit two jobs that are both Houdini
	jobs and you only have two floating licenses, then you
	need some 'centralized' way to count the licenses; this
	is where a watcher script might come in: it can see that
	there are two Houdini jobs and limit the procs, or tweak
	the priorities, to prevent more than two from rendering
	at once.
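
A rough sketch of that kind of watcher, in Python; list_active_jobs()
and cap_job() below are placeholders for whatever job-listing and
cpu-capping commands a site actually has (they are not Rush calls),
and the job-type/priority fields are likewise assumptions:

#!/usr/bin/env python
"""Watcher sketch: if more Houdini jobs are rendering than there are
floating licenses, throttle the extras.  The helpers below are
placeholders, not Rush commands."""

import time

HOUDINI_LICENSES = 2    # floating Houdini licenses owned
POLL_SECONDS = 60

def list_active_jobs():
    # Placeholder: return (jobid, jobtype, priority) for each job
    # that is currently rendering.
    return [("tahoe.123",   "houdini", 100),
            ("ontario.456", "houdini", 200),
            ("tahoe.124",   "houdini",  50)]

def cap_job(jobid, maxcpus):
    # Placeholder: tell the queue to let this job run at most
    # `maxcpus` procs at once.
    print("cap %s at %d cpus" % (jobid, maxcpus))

def balance():
    houdini = [j for j in list_active_jobs() if j[1] == "houdini"]
    if len(houdini) <= HOUDINI_LICENSES:
        return
    # More Houdini jobs than licenses: the highest-priority jobs each
    # keep a proc, the rest wait at zero until a license frees up.
    houdini.sort(key=lambda j: j[2], reverse=True)
    for jobid, _, _ in houdini[:HOUDINI_LICENSES]:
        cap_job(jobid, 1)
    for jobid, _, _ in houdini[HOUDINI_LICENSES:]:
        cap_job(jobid, 0)

if __name__ == "__main__":
    while True:
        balance()
        time.sleep(POLL_SECONDS)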

Jobs should have the notion of "requirements", one of which is a particular license type. Admittedly, this is tricky, because users check out licenses without notifying the queue! So the queue has to figure out how many licenses are actually left, how many it's using itself, and only allow that many more after that.
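
One way to get the "what's left" number is to ask the license server itself and subtract. A sketch, assuming a FlexLM-style server whose lmstat output contains the usual "Users of <feature>: (Total of X licenses issued; Total of Y licenses in use)" lines -- Houdini's own license daemon reports things differently, so treat the command and the parsing as assumptions:

import re
import subprocess

def flexlm_counts(feature, lmstat_cmd=("lmutil", "lmstat", "-a")):
    """Parse 'Users of <feature>: (Total of X licenses issued;
    Total of Y licenses in use)' out of lmstat and return (X, Y)."""
    out = subprocess.check_output(lmstat_cmd).decode("utf-8", "replace")
    pat = re.compile(r"Users of %s:\s*\(Total of (\d+) licenses? issued;\s*"
                     r"Total of (\d+) licenses? in use\)" % re.escape(feature))
    m = pat.search(out)
    if not m:
        raise RuntimeError("feature %r not found in lmstat output" % feature)
    return int(m.group(1)), int(m.group(2))

def queue_headroom(feature):
    """How many more renders can start right now.  Everything checked
    out anywhere (interactive users included) is already in 'in use'."""
    issued, in_use = flexlm_counts(feature)
    return max(0, issued - in_use)

A watcher would then hold back the next houdini frame whenever queue_headroom("houdini") hits zero, which is the "only allow so many more after that" part.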


	Rush itself doesn't maintain a centralized perspective
	of things, so it can't do centralized stuff like counting.

I'm curious as to the different ways that exist.  What kinds of things
are important to people?

	I've seen a variety of requests too numerous to mention.
	I think Esc had the most complex of all the scheduling
	algorithms I'd come across for The Matrix III. They had
	all kinds of stuff in there: taking RAM into account,
	render times that change over time; I think they were
	even polling RAM use as the renders ran.

Funny, I'm doing the RAM-usage polling right now. Alongside the render I launch a memory watcher script which, given a PID, finds all the child PIDs, adds up their memory consumption, and writes the total to a file in the logdir. When the process completes, it tails the last line of that file and puts it into the per-frame notes field, so that users see what kind of memory footprint their job had. This is a pretty critical feature, IMO, as you *really* want to avoid going into swap on a box!
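
For what it's worth, a stripped-down sketch of that sort of watcher in Python, assuming a Linux-ish host with a standard ps; how the render PID and logdir get handed to the script is site-specific and left out:

#!/usr/bin/env python
"""Memory-watcher sketch: given a top PID, walk its descendants with
ps, total their RSS, and append a sample line to a log file."""

import os
import subprocess
import sys
import time

def descendants(pid):
    """Return pid plus all of its descendant pids, via 'ps -eo pid,ppid'."""
    out = subprocess.check_output(["ps", "-eo", "pid,ppid"]).decode()
    children = {}
    for line in out.splitlines()[1:]:           # skip the header line
        try:
            p, pp = map(int, line.split())
        except ValueError:
            continue
        children.setdefault(pp, []).append(p)
    pids, stack = [], [pid]
    while stack:
        p = stack.pop()
        pids.append(p)
        stack.extend(children.get(p, []))
    return pids

def total_rss_kb(pids):
    """Sum RSS (kB) of the given pids via 'ps -o rss= -p ...'."""
    try:
        out = subprocess.check_output(
            ["ps", "-o", "rss=", "-p", ",".join(map(str, pids))]).decode()
    except subprocess.CalledProcessError as e:  # a pid exited mid-sample
        out = e.output.decode() if e.output else ""
    return sum(int(v) for v in out.split())

def watch(pid, logfile, interval=30):
    with open(logfile, "a") as f:
        while os.path.exists("/proc/%d" % pid):  # Linux-only liveness check
            f.write("%s %d kB\n" % (time.strftime("%H:%M:%S"),
                                    total_rss_kb(descendants(pid))))
            f.flush()
            time.sleep(interval)

if __name__ == "__main__":
    # e.g. memwatch.py <render_pid> <logdir>/mem.<frame>.log
    watch(int(sys.argv[1]), sys.argv[2])

The "tail the last line into the per-frame notes" step would live in the render wrapper when the frame exits; how the notes field actually gets set is a Rush-side detail I won't guess at here.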

Ideally, I should be able to submit with a requirement for a certain amount of RAM, and have the frames only run on machines that have that much RAM left. Granted, that's not failure-free, since spot-checks alone can't tell you that some other job on the host might suddenly chew up another gig. But at least it should try.
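
A pre-frame spot check along those lines can be tiny. A sketch, assuming a Linux host and a made-up "required_mb" figure carried with the job, which a render wrapper reads before deciding to run or requeue the frame:

def available_ram_mb():
    """Best-effort 'RAM left' from /proc/meminfo (Linux): MemAvailable if
    the kernel provides it, otherwise MemFree + Buffers + Cached."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])      # values are in kB
    kb = fields.get("MemAvailable",
                    fields.get("MemFree", 0) +
                    fields.get("Buffers", 0) +
                    fields.get("Cached", 0))
    return kb // 1024

def ok_to_start(required_mb, headroom_mb=512):
    """Spot check before launching a frame: demand the job's stated RAM
    plus some headroom.  Best-effort only; another process can still
    balloon after the check passes."""
    return available_ram_mb() >= required_mb + headroom_mb

The wrapper would then requeue the frame (or sleep and retry) when ok_to_start() returns False.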

-- Antoine

	The guy who was writing it had a lot of high-level goals.

	Trouble with their implementation was they had >500 boxes
	on their farm, and were submitting all the jobs to one
	box..! I warned that was a bad, bad idea from the get-go,
	but they were locked into that for some reason. The whole
	point of Rush's decentralized design is to distribute the
	job load, so funneling every job onto a single box really
	hinders it on a large net. That made it tough for them:
	the central box became overloaded fast, on top of their
	watcher constantly rescheduling things.

	That was a while ago, 2002/2003 IIRC, when boxes and
	networks were slower, and Rush has had some optimizations
	since then.


--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473

