From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: strange retries
   Date: Mon, 26 Jun 2006 06:09:56 -0400
Msg# 1318
Hi Abraham,

	Check whether more than one "rushd" daemon is running on the job server
	for that job (the jobid is "lion.717", so check "lion" for two rushds).

	Under normal circumstances there are sometimes several rushd processes,
	which are children of the main daemon, and usually are gone within 10
	seconds or so.

	The regular parent daemon is the one with a PPID of 1.
	The child daemons have a PPID (Parent PID) equal to the parent's PID.
	Look for a child rushd (one with a PPID greater than "1") that has
	been around for over a minute (e.g. going by the 'START' column of
	the 'ps aux' report).
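
	The check described above can be sketched as a small script. This is
	only an illustration, not part of Rush: the function names and the
	60-second threshold are my own choices, and it assumes a Linux 'ps'
	(procps-ng) that supports the 'etimes' (elapsed seconds) output field,
	which older systems may lack.

```python
import subprocess


def find_rogue_rushd(ps_output, min_age_secs=60):
    """Return (pid, ppid, age_secs) for each rushd that is a child
    (PPID other than 1) and has been alive longer than min_age_secs.

    ps_output is text in the layout of `ps -eo pid,ppid,etimes,comm`.
    Hypothetical helper for illustration; not a Rush tool.
    """
    rogue = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, age, comm = fields
        # Skip the header line and anything else that is not numeric.
        if not (pid.isdigit() and ppid.isdigit() and age.isdigit()):
            continue
        if comm.strip() != "rushd":
            continue
        # The main daemon has PPID 1; short-lived children are normal.
        if int(ppid) != 1 and int(age) > min_age_secs:
            rogue.append((int(pid), int(ppid), int(age)))
    return rogue


def list_rogue_rushd(min_age_secs=60):
    """Run ps on the local machine and report long-lived child rushds."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,etimes,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_rogue_rushd(result.stdout, min_age_secs)
```

	Run list_rogue_rushd() on the job server ("lion" in this case). Any
	PID it reports is a candidate for the rogue child; confirm it by eye
	in the 'ps' output before killing anything.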

	The extra daemon is caused by a bug in the 'rush -ljf <jobid>' command, which
	left an extra daemon behind if <jobid> didn't exist at the time the command
	was issued (or if you hit 'Jobs Full' in irush for a job that didn't exist).

	This bug was fixed in a recent release (102.42a6) which came out a few weeks
	ago. The upgrade is free; installing the new 102.42a6 release solves this
	problem permanently.

	For the short term, identifying the rogue child rushd and killing it will
	stop the retries.



Abraham Schneider wrote:
[posted to rush.general]

Hi!

We have some trouble with our network/pipeline (Linux-based Ethernet servers connected to SAN storage): black rendered frames from shake, corrupted rendered frames and several other problems. We are trying to figure out which part of the pipeline causes these problems, but that's not easy because the problems can't be reproduced easily.

There is one strange thing that I see when rendering with rush and shake: when I look in iRush->Frames I sometimes see retries for some packets. The strange thing is that the log of such a packet looks like there hasn't been any problem or retry.

For example: I have "retry #2 of 5" in the notes of a packet, but the log looks like this:

###
### lion.700: 0006
###
--------------- Rush 102.42a --------------
--      Host: scarecrow
--       Pid: 14415
--     Title: servertest1
--     Jobid: lion.717
--     Frame: 0006
--     Tries: 0
--     Owner: aschneid (1054/2001)
-- RunningAs: aschneid (1054/2001)
--  Priority: 81
--      Nice: 10
--    Tmpdir: /var/tmp/.RUSH_TMP.142
--   LogFile: /mnt/frozone/projects/servertest/servertest1.shk.log/0006
-- Command: perl /mnt/libs/rushlib/submit-shake.pl -render /mnt/frozone/projects/servertest/servertest1.shk 5 300 5 AddNever+Requeue 60000 off -v -motion 1.0 1 -cpus 4 -proxyscale Base
--   Started: Sat Jun 24 04:47:00 2006
------------------------------------------
    SHAKEPATH: /mnt/frozone/projects/servertest/servertest1.shk
  RENDERFLAGS: -v -motion 1.0 1 -cpus 4 -proxyscale Base
  BATCHFRAMES: 5 (6-10)
      RETRIES: 5 (AddNever+Requeue after 5 retries)
   MAXLOGSIZE: 60000
PATH: /usr/nreal/shake/bin:/usr/local/rush/bin:/usr/local/rush/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin

Executing: logtrim -s 60000 -c shake -exec /mnt/frozone/projects/servertest/servertest1.shk -t 6-10 -v -motion 1.0 1 -cpus 4 -proxyscale Base
info: rendering frame 6
info: frame 6 rendered in 26.94s
info: rendering frame 7
info: frame 7 rendered in 29.38s
info: rendering frame 8
info: frame 8 rendered in 27.35s
info: rendering frame 9
info: frame 9 rendered in 23.67s
info: rendering frame 10
info: frame 10 rendered in 26.74s
--- SHAKE SUCCEEDS: EXITCODE=0

Any idea why there was a retry? How can I check what forced the retry? Any other general ideas on how to test my pipeline to find the problematic part?


--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)(new)
