On 2007-06-06 12:03:29 -0700, Greg Ercolano <erco@(email suppressed)> said:
Antoine Durr wrote:
Well, the problem with changing the # of cpus allocated to a job is
that you stomp on information that could be critical to the job, e.g.
I see; one way to communicate that to the watcher would be to
have the job specify 'maxcpus 4' (rush submit command), and the
watcher could notice that and honor it as a 'given limit' for
that particular job, and not assign more cpus than that.
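A watcher honoring a per-job limit like that might look like the following sketch in Python. The job fields (`maxcpus`, `running`) and the helper name are invented for illustration; they are not Rush's actual API, just the decision logic a watcher script would wrap around it:

```python
# Sketch: cap the cpus a watcher grants a job at the job's own
# self-declared 'maxcpus' limit (field names are hypothetical).

def cpus_to_grant(job, idle_cpus):
    """Return how many idle cpus the watcher may give this job,
    honoring the job's 'maxcpus' limit if one was declared."""
    limit = job.get("maxcpus")          # None means 'no limit declared'
    running = job.get("running", 0)     # cpus the job already holds
    if limit is None:
        return idle_cpus
    headroom = max(0, limit - running)  # never push past the job's limit
    return min(idle_cpus, headroom)

# A comp submitted with 'maxcpus 4' that already holds 3 cpus
# gets only one more, no matter how many machines sit idle.
print(cpus_to_grant({"maxcpus": 4, "running": 3}, idle_cpus=10))  # -> 1
```

The point of keeping the check in the watcher (rather than the submit file alone) is that the watcher can still hand those idle cpus to some other job that has headroom.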
I'm running a comp, but don't want to run with more than 4 cpus, or
I'll flood the IO bandwidth of the drives.
Heh; if 4 procs are flooding the fileserver bandwidth,
time for a new file server ;)
That was just a number. But with gig-E and Xeon 5355s running 4K
comps, you can get I/O limited very quickly.
Or I've only got two Houdini licenses, thus only run two jobs.
I see; in such cases either maxcpus or even just submitting
to two named machines might help. eg:
maxcpus 2
cpus tahoe=1@999 ontario=1@999
Beyond their own machine (which is important) I really dislike the
notion of a user having to know or choose what machines their stuff
lands on. The users don't (and shouldn't) have control of arbitrary
machines. And what if one of those goes down? What if a new host gets
added in the middle of the night? A job should not have any innate
knowledge about specific machines. A job should know about pools of
machines only.
Thus, the thing that should
be tweaked is priority, so that a person deficient in cpus gets a higher
priority, and therefore a greater chance of picking up the next
available proc. Of course, then there's no longer a user-settable
priority system. This isn't too bad, as (IMO) all users should have
the same priority, and it's up to show management to tweak that.
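That deficit-driven idea can be sketched in a few lines of Python; the data shape is invented for illustration, and a real watcher would pull the per-user cpu counts out of the queue state instead:

```python
# Sketch: derive scheduling order from each user's cpu deficit
# rather than from hand-set priorities (hypothetical data shape).

def schedule_order(users):
    """Order users so whoever holds the fewest cpus relative to an
    equal share picks up the next available proc first."""
    total = sum(u["cpus"] for u in users)
    share = total / len(users) if users else 0
    # Larger deficit (equal share minus cpus held) sorts first.
    return sorted(users, key=lambda u: share - u["cpus"], reverse=True)

users = [
    {"name": "ann",  "cpus": 10},
    {"name": "bob",  "cpus": 2},
    {"name": "carl", "cpus": 6},
]
print([u["name"] for u in schedule_order(users)])  # -> ['bob', 'carl', 'ann']
```

With everyone nominally at the same priority, the deficit becomes the only knob, which is exactly the "show management tweaks it" model: management adjusts each user's target share, not per-job priorities.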
Ya, this is kinda why I leave it up to customers to implement
their own watcher scripts, cause folks want to schedule things
their own ways.
I'm curious as to the different ways that exist. What kinds of things
are important to people?
You should be able to slap rush around to follow your own rules,
just be sure not to slap it around too often, or it'll spend more
time rescheduling things than it will keeping things running.
I hope that my 'working example', whenever I get it working,
will be a good starting point, as I intend to show good practices
in how to monitor/adjust rush at a decent rate, without overloading
the daemons.
Trouble is, I've been busy on some other stuff, but I intend
to have something in the not too distant future.
Ideally, the priority scheduling should be revised on every cpu
assignment and every done frame, so that the next assignment makes the
distribution more balanced.
My take on it is that there are so many problems that come
with production that no one scheduler can handle them all
efficiently; it's just too complicated.
I think a good way to do this is to keep the load balancing out of the
assignment loop (which is inherently the way it is right now). Give
the users a few different balancing scripts, with different knobs. The
caveat, of course, is whether these balancing scripts can get enough
low-level information, and run frequently enough, to be useful.
And when the scheduler is so complicated that only a few
zen gurus can understand it, folks will curse it constantly.
I think folks will curse it if it doesn't do what it says it will do.
In the meantime, Rush wins by having superb assignment consistency, at
the expense of balancing fairness.
So rush's view is to just keep procs busy; cpucaps and
priority will manage things. If things are taking too long,
bump a few of the high-pri procs ('staircasing') up a little more,
maybe even make them 'k' (kill) priority.
The only place I think kill priority is warranted is on your own
machine. Killing jobs that are mid-way is a great way to waste
resources. I love that Rush can be set up so that your own machine is
"yours", which is of great comfort to users. Yes, if I want to use my
machine, I should be able to kill whatever's on there (since I can do
that already with -getoff). But you wouldn't want me to have that
power over *your* machine.
I don't think the users should be tasked with determining the most
advantageous priorities themselves just to get their frames run. Also,
when you get to the point of having 10x as many jobs as there are cpus, the
tendency to favor your own can easily outweigh the needs of the
studio. *That's* where users start to curse. So yeah, it helps to have
something in there so that you get at least 1 cpu to get your stuff
going. At R&H, we had the notion that single-frame jobs got higher
priority than multi-frame jobs, so that you could run your single test
frames and get them through the queue. And what did I do? I wrote a
submission script that submitted all my frames as single-frame jobs! ;-)
The challenge then becomes dealing with
fast frames, as you then spend an inordinate amount of time
rebalancing. Thus, every 5 or 10 seconds should be plenty. However,
if a whole slew of cpus can be assigned in that time, the queue could
very quickly become out of balance.
I think if you look at it from the point of view where the
caps prevent things from getting too crazy between samples,
you'll find stability.
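The caps-between-samples point can be made concrete with a small sketch: if no job may gain more than a capped number of cpus per watcher pass, the imbalance that can accumulate before the next sample is bounded by that cap. The function and data shape below are invented for illustration:

```python
# Sketch: per-pass assignment with a cap, so the queue can only
# drift a bounded amount between watcher samples (hypothetical API).

def assign_with_caps(requests, idle, cap):
    """Hand out idle cpus for one watcher pass, letting no single
    job gain more than `cap` cpus before the next sample."""
    granted = {}
    for job, want in requests.items():
        take = min(want, cap, idle)  # cap bounds the per-pass swing
        granted[job] = take
        idle -= take
    return granted

# A greedy 50-cpu comp can only grab 8 cpus this pass, leaving
# room for the rebalance 5 or 10 seconds later to even things out.
print(assign_with_caps({"compA": 50, "compB": 3}, idle=20, cap=8))
# -> {'compA': 8, 'compB': 3}
```

Under this scheme a "whole slew of cpus" freed at once still gets absorbed gradually, which is why a 5-to-10-second sampling interval can stay stable.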
Oscillation in a scheduler is a common problem, and I avoid
all that by having static scheduling to fit the distributed
nature of the queue.
Rush has a different approach from the centralized schedulers;
it takes some time to understand that approach, and to not
'force fit' a scheduling algorithm that runs counter to
its design. The idea behind rush's design is to prevent the
need for micromanaged processing; if you have a comp you want
to sneak by on a few procs, just give the job a few high-pri
procs, and the rest at low, eg:
cpus +any=2@900k
cpus +any=10@1
This works best if everyone is using that same technique, so that
all are guaranteed a few procs, and the rest round robin;
if there's nothing else going on, they get as many procs
as they can handle.
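If everyone is supposed to follow that same two-tier pattern, a submit wrapper can emit the pair of `cpus` lines mechanically rather than trusting each user to type them. The helper below is a sketch; its name and defaults are invented, and only the two output lines follow Rush's actual `cpus` syntax from the example above:

```python
# Sketch: generate the two-tier 'cpus' request lines for a rush
# submit file (helper name and defaults are hypothetical).

def tiered_cpus(guaranteed=2, overflow=10, hi_pri=900, kill=True):
    """Build a few guaranteed high-priority (optionally kill-priority)
    procs, plus low-priority overflow that round-robins with
    everyone else's jobs."""
    suffix = "k" if kill else ""    # 'k' marks kill priority
    return [
        f"cpus +any={guaranteed}@{hi_pri}{suffix}",
        f"cpus +any={overflow}@1",
    ]

for line in tiered_cpus():
    print(line)
# -> cpus +any=2@900k
#    cpus +any=10@1
```

Baking the pattern into the wrapper is also where a site policy like "no kill priority off your own machine" could be enforced, by forcing `kill=False` for shared pools.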
The round robin'ing still fails when you have a mix of long renders and
short frames. The long frames get all the cpus.
-- Antoine
--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473