From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 16:30:46 -0400
Msg# 1584
On 2007-06-06 13:15:34 -0700, Greg Ercolano <erco@(email surpressed)> said:

Antoine Durr wrote:
maxcpus 2
cpus    tahoe=1@999 ontario=1@999

Beyond their own machine (which is important), I really dislike the
notion of a user having to know or choose which machines their stuff
lands on.

	Agreed; but if you have, e.g., node-locked licenses,
	it at least ensures only those boxes get the renders.

Again, an appropriately named pool would take care of that.


	If you have floaters, then yes, you wouldn't specify
	hostnames, just a cpu cap.

	Thing is, if you submit two jobs that are both Houdini
	jobs and you only have two floating licenses, then you
	need some 'centralized' way to count the licenses; this
	is where a watcher script might come in: it can see that
	there are two Houdini jobs and limit the procs, or tweak
	the priorities, to prevent more than two from rendering
	at once.
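
A rough sketch of that kind of watcher, in Python; list_active_jobs()
and cap_job() below are placeholders for whatever job-listing and
cpu-capping commands a site actually has (they are not Rush calls),
and the job-type/priority fields are likewise assumptions:

#!/usr/bin/env python
"""Watcher sketch: if more Houdini jobs are rendering than there are
floating licenses, throttle the extras.  The helpers below are
placeholders, not Rush commands."""

import time

HOUDINI_LICENSES = 2    # floating Houdini licenses owned
POLL_SECONDS = 60

def list_active_jobs():
    # Placeholder: return (jobid, jobtype, priority) for each job
    # that is currently rendering.
    return [("tahoe.123",   "houdini", 100),
            ("ontario.456", "houdini", 200),
            ("tahoe.124",   "houdini",  50)]

def cap_job(jobid, maxcpus):
    # Placeholder: tell the queue to let this job run at most
    # `maxcpus` procs at once.
    print("cap %s at %d cpus" % (jobid, maxcpus))

def balance():
    houdini = [j for j in list_active_jobs() if j[1] == "houdini"]
    if len(houdini) <= HOUDINI_LICENSES:
        return
    # More Houdini jobs than licenses: the highest-priority jobs each
    # keep a proc, the rest wait at zero until a license frees up.
    houdini.sort(key=lambda j: j[2], reverse=True)
    for jobid, _, _ in houdini[:HOUDINI_LICENSES]:
        cap_job(jobid, 1)
    for jobid, _, _ in houdini[HOUDINI_LICENSES:]:
        cap_job(jobid, 0)

if __name__ == "__main__":
    while True:
        balance()
        time.sleep(POLL_SECONDS)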

Jobs should have the notion of "requirements", one of which is a particular license type. Admittedly, this is tricky, because users check out licenses without notifying the queue! So the queue has to figure out how many licenses are actually left, how many it's using itself, and only allow that many more after that.
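
One way to get the "what's left" number is to ask the license server itself and subtract. A sketch, assuming a FlexLM-style server whose lmstat output contains the usual "Users of <feature>: (Total of X licenses issued; Total of Y licenses in use)" lines -- Houdini's own license daemon reports things differently, so treat the command and the parsing as assumptions:

import re
import subprocess

def flexlm_counts(feature, lmstat_cmd=("lmutil", "lmstat", "-a")):
    """Parse 'Users of <feature>: (Total of X licenses issued;
    Total of Y licenses in use)' out of lmstat and return (X, Y)."""
    out = subprocess.check_output(lmstat_cmd).decode("utf-8", "replace")
    pat = re.compile(r"Users of %s:\s*\(Total of (\d+) licenses? issued;\s*"
                     r"Total of (\d+) licenses? in use\)" % re.escape(feature))
    m = pat.search(out)
    if not m:
        raise RuntimeError("feature %r not found in lmstat output" % feature)
    return int(m.group(1)), int(m.group(2))

def queue_headroom(feature):
    """How many more renders can start right now.  Everything checked
    out anywhere (interactive users included) is already in 'in use'."""
    issued, in_use = flexlm_counts(feature)
    return max(0, issued - in_use)

A watcher would then hold back the next houdini frame whenever queue_headroom("houdini") hits zero, which is the "only allow so many more after that" part.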


	Rush itself doesn't maintain a centralized perspective
	of things, so it can't do centralized stuff like counting.

I'm curious as to the different ways that exist.  What kinds of things
are important to people?

	I've seen a variety of requests too numerous to mention.
	I think Esc had the most complex of all the scheduling
	algorithms I'd come across for The Matrix III. They had
	all kinds of stuff in there: taking RAM into account,
	render times that change over time; I think they were
	even polling RAM use as the renders ran.

Funny, I'm doing the RAM-usage polling right now. Alongside the render I launch a memory watcher script which, given a PID, finds all the child PIDs, adds up their memory consumption, and writes the total to a file in the logdir. When the process completes, it tails the last line of that file and puts it into the per-frame notes field, so that users see what kind of memory footprint their job had. This is a pretty critical feature, IMO, as you *really* want to avoid going into swap on a box!
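
For what it's worth, a stripped-down sketch of that sort of watcher in Python, assuming a Linux-ish host with a standard ps; how the render PID and logdir get handed to the script is site-specific and left out:

#!/usr/bin/env python
"""Memory-watcher sketch: given a top PID, walk its descendants with
ps, total their RSS, and append a sample line to a log file."""

import os
import subprocess
import sys
import time

def descendants(pid):
    """Return pid plus all of its descendant pids, via 'ps -eo pid,ppid'."""
    out = subprocess.check_output(["ps", "-eo", "pid,ppid"]).decode()
    children = {}
    for line in out.splitlines()[1:]:           # skip the header line
        try:
            p, pp = map(int, line.split())
        except ValueError:
            continue
        children.setdefault(pp, []).append(p)
    pids, stack = [], [pid]
    while stack:
        p = stack.pop()
        pids.append(p)
        stack.extend(children.get(p, []))
    return pids

def total_rss_kb(pids):
    """Sum RSS (kB) of the given pids via 'ps -o rss= -p ...'."""
    try:
        out = subprocess.check_output(
            ["ps", "-o", "rss=", "-p", ",".join(map(str, pids))]).decode()
    except subprocess.CalledProcessError as e:  # a pid exited mid-sample
        out = e.output.decode() if e.output else ""
    return sum(int(v) for v in out.split())

def watch(pid, logfile, interval=30):
    with open(logfile, "a") as f:
        while os.path.exists("/proc/%d" % pid):  # Linux-only liveness check
            f.write("%s %d kB\n" % (time.strftime("%H:%M:%S"),
                                    total_rss_kb(descendants(pid))))
            f.flush()
            time.sleep(interval)

if __name__ == "__main__":
    # e.g. memwatch.py <render_pid> <logdir>/mem.<frame>.log
    watch(int(sys.argv[1]), sys.argv[2])

The "tail the last line into the per-frame notes" step would live in the render wrapper when the frame exits; how the notes field actually gets set is a Rush-side detail I won't guess at here.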

Ideally, I should be able to submit with a requirement for a certain amount of RAM, and have the frames only run on machines that have that much RAM left. Granted, that's not failure-free, since spot-checks alone can't tell you that some other job on the host might suddenly chew up another gig. But at least it should try.
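
A pre-frame spot check along those lines can be tiny. A sketch, assuming a Linux host and a made-up "required_mb" figure carried with the job, which a render wrapper reads before deciding to run or requeue the frame:

def available_ram_mb():
    """Best-effort 'RAM left' from /proc/meminfo (Linux): MemAvailable if
    the kernel provides it, otherwise MemFree + Buffers + Cached."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])      # values are in kB
    kb = fields.get("MemAvailable",
                    fields.get("MemFree", 0) +
                    fields.get("Buffers", 0) +
                    fields.get("Cached", 0))
    return kb // 1024

def ok_to_start(required_mb, headroom_mb=512):
    """Spot check before launching a frame: demand the job's stated RAM
    plus some headroom.  Best-effort only; another process can still
    balloon after the check passes."""
    return available_ram_mb() >= required_mb + headroom_mb

The wrapper would then requeue the frame (or sleep and retry) when ok_to_start() returns False.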

-- Antoine

	The guy who was writing it had a lot of high-level goals.

	Trouble with their implementation was they had >500 boxes
	on their farm, and were submitting all the jobs to one
	box..! I warned that was a bad, bad idea from the get-go,
	but they were locked into that for some reason. The whole
	point of Rush's decentralized design is to distribute the
	job load, so funneling every job onto a single box really
	hinders it on a large net. That made it tough for them:
	the central box became overloaded fast, on top of their
	watcher constantly rescheduling things.

	That was a while ago, 2002/2003 IIRC, when boxes and
	networks were slower, and Rush has had some optimizations
	since then.


--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473

