From: Antoine Durr <antoine@(email suppressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 15:48:57 -0400
Msg# 1582
On 2007-06-06 12:03:29 -0700, Greg Ercolano <erco@(email suppressed)> said:

Antoine Durr wrote:
Well, the problem with changing the # of cpus allocated to a job is
that you stomp on information that could be critical to the job, e.g.

	I see; one way to communicate that to the watcher would be to
	have the job specify 'maxcpus 4' (rush submit command), and the
	watcher could notice that and honor it as a 'given limit' for
	that particular job, and not assign more cpus than that.
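(And honoring it is cheap; the watcher just clamps whatever it was about to hand out. A rough Python fragment, where the job record is my own invention standing in for whatever the watcher has already parsed out of rush:)

# hypothetical job record the watcher has built from rush's listings
job = {"jobid": "tahoe.123", "maxcpus": 4, "cpus_assigned": 2}

fair_share = 6                                   # what the balancer would otherwise hand out
target = min(fair_share, job["maxcpus"])         # honor the job's own 'maxcpus' as a hard ceiling
to_add = max(0, target - job["cpus_assigned"])   # never push the job above its cap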

I'm running a comp, but don't want to run with more than 4 cpus, or
I'll flood the IO bandwidth of the drives.

	Heh; if 4 procs are flooding the fileserver bandwidth,
	time for a new file server ;)

That was just a number. But with gig-e and Xeon 5355s running 4K comps, you can get I/O limited very quickly.


Or I've only got two Houdini licenses, thus only run two jobs.

	I see; in such cases either maxcpus or even just submitting
	to two named machines might help. eg:

maxcpus 2
cpus    tahoe=1@999 ontario=1@999

Beyond their own machine (which is important), I really dislike the notion of a user having to know or choose which machines their stuff lands on. Users don't (and shouldn't) have control of arbitrary machines. And what if one of those hosts goes down? What if a new host gets added in the middle of the night? A job should not have any innate knowledge of specific machines; it should only know about pools of machines.
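For example, something in the spirit of the submit lines above, but pointed at a pool instead of named hosts ('+farm' here is a hypothetical host group the admins would define; the '+any' examples further down this thread use the same mechanism):

maxcpus 2
cpus    +farm=2@999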


Thus, the thing that should
be tweaked is priority, so that a person who is deficient in cpus gets a
higher priority, and therefore a greater chance of picking up the next
available proc.  Of course, then there's no longer a user-settable
priority system.  This isn't too bad, as (IMO) all users should have
the same priority, and it's up to show management to tweak that.
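(To be concrete about "deficient in cpus": on each pass the balancer could turn the deficit straight into a priority. A toy Python version, with made-up numbers and my own names throughout:)

jobs = {"compA": 8, "compB": 2, "compC": 2}     # jobid -> procs currently held (made up)
fair_share = sum(jobs.values()) / len(jobs)     # 4 procs apiece in this example

for jobid, held in jobs.items():
    deficit = fair_share - held                 # positive = starved, negative = over-served
    priority = int(500 + 100 * deficit)         # nudge a base priority by the deficit
    print(jobid, max(1, min(999, priority)))    # clamp to the 1..999 range used in this thread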

	Ya, this is kinda why I leave it up to customers to implement
	their own watcher scripts, cause folks want to schedule things
	their own ways.

I'm curious as to the different ways that exist. What kinds of things are important to people?


	You should be able to slap rush around to follow your own rules,
	just be sure not to slap it around too often, or it'll spend more
	time rescheduling things than it will keeping things running.

	I hope that my 'working example', whenever I get it working,
	will be a good starting point, as I intend to show good practices
	in how to monitor/adjust rush at a decent rate, without overloading
	the daemons.

	Trouble is, I've been busy on some other stuff, but I intend
	to have something in the not too distant future.

Ideally, the priority scheduling should be revised on every cpu
assignment and every done frame, so that the next assignment makes the
distribution more balanced.

	My take on it is that there are so many problems that come
	with production, that no one scheduler can handle them all
	efficiently.. it's just too complicated.

I think a good way to do this is to keep the load balancing out of the assignment loop (which is inherently the way it is right now). Give the users a few different balancing scripts, with different knobs. The caveat, of course, is that these balancing scripts need to be able to get enough low-level information, and to run frequently enough, to be useful.
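Roughly the shape I have in mind, as a sketch only: the polling is stubbed out with made-up numbers because I don't want to guess at rush's listing formats, and the "knobs" are just the constants at the top.

import time

# --- knobs ----------------------------------------------------------------
INTERVAL = 10       # seconds between balancing passes
BASE_PRI = 500      # priority for a job sitting at its fair share
PRI_STEP = 100      # how hard to push per proc of deficit

def poll_jobs():
    """Return {jobid: procs currently held}.  A real watcher would build this
    by parsing rush's job/cpu listings; the hardcoded dict just keeps the
    sketch self-contained."""
    return {"compA": 8, "compB": 2, "compC": 2}

def desired_priorities(jobs):
    """Deficit-based priorities, clamped to the 1..999 range used in this thread."""
    fair = sum(jobs.values()) / max(1, len(jobs))
    return {jobid: max(1, min(999, int(BASE_PRI + PRI_STEP * (fair - held))))
            for jobid, held in jobs.items()}

if __name__ == "__main__":
    while True:
        wanted = desired_priorities(poll_jobs())
        # a real script would push 'wanted' back into rush here with whatever
        # the actual adjust command is; deliberately left out of the sketch
        print(wanted)
        time.sleep(INTERVAL)

The point being that this runs beside rush rather than inside the assignment loop, so rush never waits on it.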


	And when the scheduler is so complicated that only a few
	zen gurus can understand it, folks will curse it constantly.

I think folks will curse it if it doesn't do what it says it will do. In the meantime, Rush wins by having superb assignment consistency, at the expense of balancing fairness.


	So rush's view is to just keep procs busy, and let cpucaps and
	priority manage things. If things are taking too long,
	bump a few of the high-pri procs ('staircasing') up a little more,
	maybe even make them 'k' (kill) priority.

The only place I think kill priority is warranted is on your own machine. Killing jobs that are mid-way is a great way to waste resources. I love that Rush can be set up so that your own machine is "yours", which is of great comfort to users. Yes, if I want to use my machine, I should be able to kill whatever's on there (since I can do that already with -getoff). But you wouldn't want me to have that power over *your* machine.

I don't think the users should be tasked with determining the most advantageous priorities themselves just to get their frames run. Also, when you get to the point of having 10x as many jobs as there are cpus, the tendency to favor your own can easily outweigh the needs of the studio. *That's* where users start to curse. So yeah, it's worth having something in there so that you get at least one cpu to get your stuff going. At R&H, we had the notion that single-frame jobs got higher priority than multi-frame jobs, so that you could run your single test frames and get them through the queue. And what did I do? I wrote a submission script that submitted all my frames as single-frame jobs! ;-)
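(In rush terms, the guilty wrapper would have looked something like this; the paths and render command are made up, and it assumes the usual title/frames/command/cpus/logdir submit keywords piped into 'rush -submit':)

import subprocess

FRAMES = range(1, 101)                       # made-up frame range
RENDER = "perl /my/show/render.pl"           # made-up render command

for f in FRAMES:
    submit = "\n".join([
        f"title    TESTCOMP_F{f:04d}",
        f"frames   {f}",                     # one frame per job, so each looks 'single-frame'
        f"command  {RENDER}",
        "cpus     +any=1@500",
        "logdir   /my/show/logs",            # made-up log directory
    ])
    # one submission per frame; feed the submit script to rush on stdin
    subprocess.run(["rush", "-submit"], input=submit, text=True, check=True)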


The challenge then becomes dealing with
fast frames, as you then spend an inordinate amount of time
rebalancing.  Thus, rebalancing every 5 or 10 seconds should be plenty.
However, if a whole slew of cpus can be assigned in that window, the
queue could get out of balance very quickly.
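(One cheap guard against thrashing on fast frames is to only act when the spread is actually worth acting on. A tiny sketch, where 'slack' is an arbitrary knob of mine:)

def needs_rebalance(jobs, slack=2):
    """Only rebalance when the best- and worst-off jobs differ by more than
    'slack' procs; 2 is just an example value, not a recommendation."""
    held = list(jobs.values())
    return bool(held) and (max(held) - min(held)) > slack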

	I think if you look at it from the point of view where the
	caps prevent things from getting too crazy between samples,
	you'll find stability.

	Oscillation in a scheduler is a common problem, and I avoid
	all that by having static scheduling to fit the distributed
	nature of the queue.

	Rush has a different approach from the centralized schedulers;
	it takes some time to understand its approach, and not try
	to 'force fit' a scheduling algorithm that's too opposite to
	its design. The idea behind rush's design is to prevent the
	need for micromanaged processing; if you have a comp you want
	to sneak by on a few procs, just give the job a few high-pri
	procs, and the rest at low, eg:

cpus +any=2@900k
cpus +any=10@1

	Works best if everyone is using that same technique, so that
	all are guaranteed a few procs, and the rest round robin..
	if there's nothing else going on, they get as many procs
	as they can handle.

The round-robining still fails when you have a mix of long renders and short frames: the long frames end up with all the cpus.

-- Antoine



--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473

