blog.fevrierdorian.com

Resolve network slowdowns due to Nuke, After Effects and other software like them (English Translation)

Friday 19 October 2012 at 10:31

This is an English translation of a post I wrote (in French) a few days ago.

Nuke, After Effects, and probably others can be very "greedy" in terms of disk access. So much so that they can drastically degrade network performance if several workstations are rendering at once.

In this post I propose a quick explanation of the why and how, and a small Python class that can be used as a prototype to solve the problem.

What's the problem?

Nothing beats a concrete case to set the scene:

At the end of Lorax, a large number of "compers" (compositors) were rendering their images all at the same time, and the time estimates given by Nuke were sometimes surprising (3-4 hours left for a simple Nuke script). Monitoring a file being written, I realized its size was increasing very slowly. It was 100 KB, then 300 KB, then after 10 minutes it reached 400 KB, etc. I concluded the network was overloaded...

I remembered that we had the same worries on Tales of the Night. Our infrastructure was certainly much smaller, but 3 After Effects renders were "pumping" the entire network. The solution found at that time was to render the file locally and copy it once finished. The effect was immediate: no more network slowdowns.

So I tried rendering a CG artist's Nuke script locally to see if it reduced the problem. The largest files to read were the source files (because there are many of them), and the final image was actually very light, so I was under no illusions. But once again, the conclusion was clear: the render finished in 10-15 minutes (I'm not kidding) instead of 3-4 hours...

I thought maybe the writing process itself was the problem. However, when I copied the newly rendered image sequence to the server disks, the copy was very fast. So I did this with every artist, one by one, and within a long afternoon the network bottleneck was gone and all the renders had completed.

But that was not a solution. We had to understand why Nuke couldn't write its images quickly over the network. After asking on the nuke-users mailing list (where I could see I was not alone, but The Foundry didn't seem able to change this), I started to "profile" Nuke to find out how it does the job (strace and inotifywatch are your friends).

The conclusions seem obvious, but it's always good to check in practice what you suspect:

With Nuke, if you write a 1920x1080 EXR compressed as zip 1 line, Nuke will do a little less than 900 write accesses on the file. With zip 16 lines, it will do about 70 accesses (1080/16). And uncompressed, that's really 1080 accesses.
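The 1080/16 figure can be checked with a small calculation. This is only a sketch, assuming one write access per block of scanlines; `scanline_blocks` is a hypothetical helper written for this post, not something provided by Nuke or OpenEXR, and measured counts can differ from this simple estimate (as the zip 1 line measurement above shows).

```python
def scanline_blocks(height, lines_per_block):
    """Return how many blocks are needed to cover `height` scanlines.

    Assumption: the writer emits one block per `lines_per_block` scanlines,
    and the last block may be only partially filled.
    """
    # Integer ceiling division, valid in both Python 2 and Python 3.
    return (height + lines_per_block - 1) // lines_per_block


print(scanline_blocks(1080, 16))  # zip 16 lines on a 1080-line image
print(scanline_blocks(1080, 1))   # one block per scanline
```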

In fact, zip 16 lines compression is not efficient if the images are to be read by Nuke. And depending on your network infrastructure, writing line by line can bring it down completely. It's difficult to explain how so few Nuke renders can saturate a network, even a robust one. I feel this is related to multithreading: Nuke reads images (often many at the same time) over the network while it is writing through it.

The most obvious solution is therefore to write the rendered image(s) to the local disk and copy them in one go (one access) to the network disk. If you don't have technical resources (or just time), it's the simplest approach, but on larger projects it can quickly become daunting and (lest we forget) a source of errors.

There are several solutions, and I looked into a prototype that I found interesting because it's easy to implement.
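Done by hand, the "render locally, then copy in one go" approach could look like the sketch below. The function name `publish_frame` and its arguments are hypothetical, chosen for illustration; it assumes the frame has already been fully written to the local disk.

```python
import os
import shutil


def publish_frame(local_path, network_dir):
    """Move one finished frame from the local disk to the network directory.

    Assumption: `local_path` is a complete file and `network_dir` exists.
    """
    dst_path = os.path.join(network_dir, os.path.basename(local_path))
    # shutil.move falls back to copy + delete when the source and the
    # destination are on different filesystems (local disk vs. network share).
    shutil.move(local_path, dst_path)
    return dst_path
```

You would call it once per frame after the render, for example `publish_frame("/a/local/path/toto.0001.exr", "/a/network/path/")`.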

The principle

You launch a Python thread that watches a folder.
Every three seconds, the thread lists the files present in the folder and checks whether their names match a given regular expression (the "pattern" of your file name).
If a file looks like one you want to move once finished, the thread looks for the same file name with ".finished" appended (example: "toto.0001.exr.finished").
If this file exists, the thread moves the original file, removes the ".finished" one, and starts the main loop again.
Once the render is done (so every image is finished), you ask the thread to stop.

As you can see, this method requires that you create a ".finished" file each time an image is finished. This is because it's impossible for the thread to know when an image is actually complete. The creation of this ".finished" file can be handled in a thousand different ways (for Maya, a simple "Post render frame" should do the job), so I will not go into details.
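As one possible illustration, here is a minimal sketch of a helper that creates the marker from Python. The name `mark_finished` is hypothetical; how you hook it into your renderer (a post-frame callback, a wrapper script, etc.) depends on your pipeline and is not shown here.

```python
def mark_finished(image_path):
    """Create an empty "<image_path>.finished" marker next to a rendered image.

    Assumption: this is called only after the image has been fully written.
    """
    marker_path = image_path + ".finished"
    # Opening in write mode creates the (empty) marker file.
    with open(marker_path, "w"):
        pass
    return marker_path
```

For example, `mark_finished("/a/local/path/toto.0001.exr")` creates "/a/local/path/toto.0001.exr.finished".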

The code

Here it is:

import os, re, shutil, threading, time

class MoverThread( threading.Thread ) :

	def __init__( self, dirTocheck, dirToMoveIn, patternToCheck, force=False ) :
		threading.Thread.__init__( self )

		self._terminate = False
		self.dirTocheck = dirTocheck
		self.dirToMoveIn = dirToMoveIn
		self.force = force	# if True, move even when the destination file already exists

		# regex pattern
		self.patternToCheck = patternToCheck
		self.rePattern = re.compile( patternToCheck )

		# sanity check
		if not os.path.isdir(self.dirTocheck) :
			raise Exception( "The given directory (dirTocheck) is not a valid directory -> %s" %  self.dirTocheck )

		if not os.path.isdir(self.dirToMoveIn) :
			raise Exception( "The given directory (dirToMoveIn) is not a valid directory -> %s" %  self.dirToMoveIn )

	def run( self ) :

		filesNotMoved = []

		while not self._terminate :

			# wait 3 seconds before doing anything
			time.sleep( 3 )

			# for every "entry" (file or folder) in the folder, check that it matches the pattern. If it does, look for its ".finished" file
			for entry in os.listdir( self.dirTocheck ) :

				# check the current entry is "compliant" with the given regex
				if not self.rePattern.match( entry ) :
					continue

				srcFilePath = os.path.join( self.dirTocheck, entry )
				dstFilePath = os.path.join( self.dirToMoveIn, entry )

				if os.path.isfile( srcFilePath+".finished" ) :

					# destination file already exists?
					if os.path.isfile( dstFilePath ) and not self.force:

						# don't add the entry if it is already in the list
						if not entry in filesNotMoved :
							filesNotMoved.append( entry )

						continue

					# move the file to its new location
					# (shutil.move, unlike os.rename, also works when the local and network paths are on different filesystems)
					shutil.move( srcFilePath, dstFilePath )
					os.remove( srcFilePath+".finished" )

					print( "File %s moved to %s" % ( entry, self.dirToMoveIn ) )

					break	# restart the while loop instead of continuing over a listing that may be stale (the ".finished" file was just removed)

		print( "Terminated!" )

		for fileNotMoved in filesNotMoved :
			print( "Already exists: Can't move %s to %s" % ( fileNotMoved, self.dirToMoveIn ) )



	def join( self ) :

		# ask the run() loop to stop, then wait for the thread to exit
		self._terminate = True

		threading.Thread.join( self )

As you can see (or not), everything happens in a thread.

It's used like this:

import waitFinishAndCopy
myMoverThread = waitFinishAndCopy.MoverThread("/a/local/path/", "/a/network/path/", r"^toto\.[0-9]{4}\.exr$")
myMoverThread.start()
# start rendering, do rendering, end rendering.
myMoverThread.join()

And voila!

Conclusion

I hope this modest prototype will inspire you if you are experiencing delays on your network.

I also suggest doing some profiling on your core network applications, especially if they are used by many people. Their behavior is always interesting (and sometimes surprising).

Have a nice day!

Dorian