The cpu accounting file is configured with the rush.conf
file's CpuAcctPath command.
Each time a frame finishes executing, a new entry is created in the Cpu Accounting
file, logging the name of the job, how long the frame ran, etc. The file format is
one record per line, with tab delimited fields in each record.
NOTE:
Do /not/ attempt to symbolically link or mount the rush/var/cpu.acct
log file directlry to an NFS server or remote file system; keep these
files /local/. NFS is the 'kiss of death' for daemons (rush, cron)
if the NFS server hangs or goes down; as soon as the daemon tries to
touch a hung NFS (e.g. rush adding a line to cpu.acct when a frame
finishes), the daemon will hang up completely. In the case of rush,
it will not only make the daemon unresponsive via irush during the
outage, it will also be unkillable if the mounts are 'hard'.
If you want to centralize the data, use the new rush.conf command
cpuacct.dbasedir command to
set a path to the server, and rush will regularly update the file info to a
date stamped directory hierarchy to make it easy to get at the data. Threads
are used when writing to this directory to ensure file server hiccups do not
affect the main daemon thread.
Other ways to manage this data are to use a crontab to sweep the data
to a central location (more or less what cpuacct.dbasedir does
without needing cron to do it), and use some other protocol like rcp, rdist,
rsync, etc. which will timeout if the remote server is not responding.
Here's a small example of a cpu.acct file:
Cpu Accounting File Example
|
r start 940000000 online
p 940001000 tahoe.798 WERNER/C33 erco 0106 superior 100k 122 0 0 0 27823
p 940001123 tahoe.797 KILLER erco 0504 superior 200 121 0 0 0 27846
s 940007123 offline fred@tahoe[192.17.1.34] -
m 940100000
r end 940100000 offline
d stop 940100005 offline
d start 940100008 offline
Where the first letter in each line is one of:
'r' - Log rotation
'p' - Process completed
's' - State change of daemon (online/offline)
'm' - Midnight marker
'd' - Daemon start/stop
|
The first character of each line in the file determines the type of data found on the
rest of the line, and defines what the space delimited fields will contain.
What follows is a description of the different letter codes, and the format of the
data lines for each code..
|
|
This entry is added to the file whenever rush finishes rendering a rush frame (process).
It includes relevant info about the process; its pid, exit code, how long it ran, etc.
New data for job/host ram and host cpu slots was added in 103.06 to assist in generating
more accurate actualization, so that external programs can see when changes were made
to the rush/etc/hosts file 'Ram' and 'Cpus' values, and can determine ram utilization
based on job requests.
'p' Process Entries
|
p 948242783 tahoe.798 WERNER/C33 erco 0106 superior 100k 122 0 0 0 27822 50/100 4
p 948242783 tahoe.798 WERNER/C33 erco 0107 superior 100k 122 0 0 0 27834 50/100 4
p 948242865 tahoe.797 KILLER erco 0504 superior 200 121 0 0 0 27846 25/100 4
- --------- --------- ---------- ---- ---- -------- ---- --- - - - ----- ------ -
| | | | | | | | | | | | | | |
| | | | | | | | | | | | | | (New in 103.06)
| | | | | | | | | | | | | | #Cpu slots for this machine
| | | | | | | | | | | | | | (from rush/etc/hosts file's 'Cpus' column)
| | | | | | | | | | | | | |
| | | | | | | | | | | | | (New in 103.06)
| | | | | | | | | | | | | Ram job requested / Total Ram for this machine
| | | | | | | | | | | | | (Total Ram from rush/etc/hosts file's 'Ram' column)
| | | | | | | | | | | | |
| | | | | | | | | | | | Pid
| | | | | | | | | | | |
| | | | | | | | | | | Exit code
| | | | | | | | | | |
| | | | | | | | | | #Secs User Time
| | | | | | | | | |
| | | | *Job | | | | #Secs System Time
| | | | Owner | | | |
| | | | | | | |
| | | Title of job | | | #Secs Wall Clock Time
| | Jobid | | |
| | | | Priority
| time(2) process started | |
| | Host that ran the process
'p' indicates 'process entry' |
Frame that ran
* The job owner is not necessarily the owner of the process.
Such is the case in windows jobs running frames on unix machines,
or 'forceuid' configured in the rush.conf file.
|
CAVEATS
- 'Exit code' is normally a positive number representing the actual
exit code of the process. This value will be negative if the process
was signaled; the value being the signal number. If the value is negative,
this usually means the process killed, segfaulted, or was bumped by a higher
priority process. Commonly, the 'Exit code' will be one of:
-15 - process killed with SIGTERM; someone probably manually killed it
-9 - process killed with SIGKILL; probably bumped in a priority battle
-3 - process killed with SIGINT; someone sent it a ^C
0 - process did an exit(0); frame Done
1 - process did an exit(1); frame Fail
2 - process did an exit(2); frame Requeue
- Although tempting, it is not recommend to use process execution times
for cpu billing purposes. Wall clock time includes time the process may have
spent waiting for network load. User and System times report the
respective times spent for the Render Script only; not its sub-processes
(e.g., the renderer).
To properly bill for cpu time, you would either need to enable full-on
Unix process accounting to attain accumulated cpu time for all sub-processes
in the user's render script, or, create wrapper scripts that use programs
like timex(2) to monitor the binary execution time of the critical
render/compositor processes.
Tools like timex(2) indicate in their documentation that they must have
Unix process accounting enabled to show sub-process totals.
This is usually prohibitive on production machines, due to disk resources
used by the Unix process accounting system.
|
|
These entries indicate when the account log is rotated out, useful in determining
the range of start and end times the data within this file represents.
'r' Rotation Entries
|
r start 948200000 online
r end 948300001 online
- ----- --------- ------
| | | |
| | | (New in 102.42a9)
| | | Indicates 'online' or 'offline' state of daemon
| | |
| | time(1) file was rotated
| |
| 'start' indicates the time the new log created
| 'end' indicates time log was rotated out (Ocpu.acct files only)
|
'r' indicates log file was rotated, either manually or automatically
|
|
|
An 's' entry is logged when someone changes the online/offline state
of the dameon, indicating what time the change was made, what state
it was changed to (online or offline), by whom, from which machine
the change was issued, and optional comments (if any).
's' State Entries
|
(New in 102.42a9)
s 982330201 online jerry@tahoe[192.17.1.34] -
s 982334241 offline root@meade[192.15.0.177] Offline for maintenance
- --------- ------- ------------------------ -----------------------
| | | | |
| | | | Optional remarks ('-' if none)
| | | |
| | | User@host who invoked the online/offline command
| | |
| | 'online' or 'offline'
| |
| time(1) state was changed
|
's' indicates daemon's online/offline state was changed
|
|
|
Midnight marks are useful for applications to determine days that were
completely idle, such as when a log isn't rotated for several days.
Midnight Marks
|
m 1114326000
- ----------
| |
| time(1) mark occurred
|
'm' indicates a midnight time marker.
|
|
|
Daemon boot messages indicate when the daemon was started/stopped.
Daemon Boot
|
d 1114326000 start "reason"
d 1114326005 stop "reason"
- ---------- ----- --------
| | | |
| | | (Optional) user supplied reason why daemon was stop/started
| | |
| | start|stop
| |
| time(1) mark occurred
|
'd' indicates a daemon boot message
|
|
|