From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 15:03:29 -0400
Msg# 1580
Antoine Durr wrote:
> Well, the problem with changing the # of cpus allocated to a job is 
> that you stomp on information that could be critical to the job, e.g. 

	I see; one way to communicate that to the watcher would be to
	have the job specify 'maxcpus 4' (a rush submit command); the
	watcher could notice that, honor it as a given limit for
	that particular job, and never assign more cpus than that.
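
	Just as a sketch of that idea (Python; the job dict and its
	'maxcpus'/'busycpus' keys are hypothetical stand-ins for however
	your watcher parses the rush command output, not a shipped
	rush API):

# Hypothetical sketch: cap the watcher's assignments at a job's 'maxcpus'.
def cpus_to_add(job, idle_procs):
    """How many procs the watcher may safely hand this job."""
    limit = job.get("maxcpus")      # honor 'maxcpus N' from the submit script
    if limit is None:
        return idle_procs           # no limit given; normal watcher policy applies
    headroom = max(0, limit - job["busycpus"])  # stay under the job's ceiling
    return min(headroom, idle_procs)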

> I'm running a comp, but don't want to run with more than 4 cpus, or 
> I'll flood the IO bandwidth of the drives.

	Heh; if 4 procs are flooding the fileserver bandwidth,
	time for a new file server ;)

> Or I've only got two Houdini licenses, thus only run two jobs.

	I see; in such cases either maxcpus or even just submitting
	the job to two named machines might help, e.g.:

maxcpus 2
cpus    tahoe=1@999 ontario=1@999

> Thus, the thing that should 
> be tweaked is priority, so that a person deficient in cpus gets a higher 
> priority, and therefore a greater chance of picking up the next 
> available proc.  Of course, then there's no longer a user-settable 
> priority system.  This isn't too bad, as (IMO) all users should have 
> the same priority, and it's up to show management to tweak that.

	Ya, this is kinda why I leave it up to customers to implement
	their own watcher scripts; folks want to schedule things
	their own way.

	You should be able to slap rush around to follow your own rules,
	just be sure not to slap it around too often, or it'll spend more
	time rescheduling things than it will keeping things running.

	I hope that my 'working example', whenever I get it working,
	will be a good starting point, as I intend to show good practices
	in how to monitor/adjust rush at a decent rate, without overloading
	the daemons.
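
	The skeleton of such a watcher is just a slow polling loop;
	a minimal sketch (Python; query_jobs() and rebalance() here are
	hypothetical placeholders for your own site's policy, not
	rush calls):

import time

POLL_SECS = 30   # sample slowly; rescheduling every few seconds just thrashes the daemons

def query_jobs():
    """Hypothetical: gather job state, e.g. by parsing 'rush -lj' output."""
    return []

def rebalance(jobs):
    """Hypothetical: apply your site's balancing policy to the jobs."""
    pass

while True:
    rebalance(query_jobs())
    time.sleep(POLL_SECS)   # pacing is the whole point; don't hammer the daemons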

	Trouble is, I've been busy with some other stuff, but I intend
	to have something in the not-too-distant future.

> Ideally, the priority scheduling should be revised on every cpu 
> assignment and every done frame, so that the next assignment makes the 
> distribution more balanced.

	My take on it is that so many problems come with production
	that no one scheduler can handle them all efficiently...
	it's just too complicated.

	And when the scheduler is so complicated that only a few
	zen gurus can understand it, folks will curse it constantly.

	So rush's view is to just keep procs busy, and let cpucaps and
	priorities manage things. If things are taking too long,
	bump a few of the high-pri procs up a little more ('staircasing'),
	maybe even make them 'k' (kill) priority.
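
	For illustration, the 'staircase' bump might look like this
	(a Python sketch; the step and ceiling values are invented,
	and applying the new priority to the job is left to your own
	wrapper around rush):

# Hypothetical staircasing step: walk a priority tier upward each time a
# job falls behind, escalating to 'k' (kill) priority at the ceiling.
def staircase_bump(pri, step=50, ceiling=990):
    new_pri = min(pri + step, ceiling)
    return "%dk" % new_pri if new_pri >= ceiling else "%d" % new_pri

	So a tier sitting at 900 would step to 950, and the bump
	after that lands at the ceiling and becomes '990k'.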

> The challenge then becomes dealing with 
> fast frames, as you then spend an inordinate amount of time 
> rebalancing.  Thus, every 5 or 10 seconds should be plenty.  However, 
> if a whole slew of cpus can be assigned in that time, the queue could 
> very quickly become out of balance.

	I think if you look at it from the point of view where the
	caps prevent things from getting too crazy between samples,
	you'll find stability.

	Oscillation in a scheduler is a common problem, and I avoid
	all that by using static scheduling that fits the distributed
	nature of the queue.
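
	That said, if your watcher does adjust things between samples,
	one common damping trick (not rush-specific, just standard
	hysteresis) is a deadband, so the watcher only reacts when a
	job drifts well away from its target share; a sketch, with
	invented thresholds:

# Hypothetical deadband: leave a job alone unless its proc count has
# drifted more than DEADBAND procs from its fair share.
DEADBAND = 2

def needs_rebalance(busy_procs, fair_share):
    return abs(busy_procs - fair_share) > DEADBAND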

	Rush has a different approach from the centralized schedulers;
	it takes some time to understand that approach, and not to
	'force fit' a scheduling algorithm that runs counter to its
	design. The idea behind rush's design is to prevent the need
	for micromanaged processing; if you have a comp you want
	to sneak by on a few procs, just give the job a few high-pri
	procs, and the rest at low, e.g.:

cpus +any=2@900k
cpus +any=10@1

	This works best if everyone uses that same technique, so that
	everyone is guaranteed a few procs, and the rest round robin;
	if there's nothing else going on, a job gets as many procs
	as it can handle.


-- 
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)
