Ask a CG line producer what limits their ability to deliver a movie on time and they will answer: people, compute and storage. So what would happen if they had access to an almost limitless supply of computational power and storage?
Rendering animated movies on the public cloud sounds crazy until you really think about it. As a workload, it is extremely computationally intensive, requires an enormous amount of storage and is subject to dramatic swings in demand due to the nature of the entertainment business (summer blockbusters, Christmas movies, etc.). The cloud offers unprecedented access to compute and data storage resources, available on demand and at very large scale. It would seem to be a match made in heaven. However, as with many things, the devil is in the details.
The price is right
A back-of-the-envelope comparison of on-demand costs with on-premises hardware quickly leads to the conclusion that the cloud is far too expensive for rendering at scale. However, the key to getting the right price is using the cloud provider’s spare capacity.
Imagine you are running a public cloud. You have to be able to quickly provide computational and storage capacity to a customer within 5 minutes of them requesting it. How could you achieve such a feat? The answer is rather low-tech: you must over-provision and maintain healthy excess inventory.
Both Amazon Web Services and Google Cloud Platform provide alternative billing models that leverage this excess capacity. The models of each cloud differ considerably but the result is the same: savings of 80 to 90% over on-demand prices for computational capacity. Of course there is a trade-off: these servers may be taken away if another customer requests the same hardware at on-demand prices.
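To make the size of that discount concrete, here is a minimal sketch of the arithmetic. The job size and the on-demand price per core-hour are hypothetical numbers chosen for illustration, not quotes from either provider; only the 80 to 90% discount range comes from the text above.

```python
def render_cost(core_hours, price_per_core_hour, discount=0.0):
    """Cost of a render job, given a fractional discount off the on-demand price."""
    return core_hours * price_per_core_hour * (1.0 - discount)

# Hypothetical figures, for illustration only.
CORE_HOURS = 1_000_000      # assumed size of the render job
ON_DEMAND_PRICE = 0.05      # assumed dollars per core-hour

on_demand = render_cost(CORE_HOURS, ON_DEMAND_PRICE)
spare_low = render_cost(CORE_HOURS, ON_DEMAND_PRICE, discount=0.80)
spare_high = render_cost(CORE_HOURS, ON_DEMAND_PRICE, discount=0.90)

print(f"on-demand:      ${on_demand:,.0f}")
print(f"spare capacity: ${spare_high:,.0f} to ${spare_low:,.0f}")
```

Whatever the real prices are, the same job on spare capacity costs between a tenth and a fifth of the on-demand bill, which is what changes the comparison with on-premises hardware.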
Cut off one head, two more shall take its place
Thankfully the rendering workload is uniquely fault-tolerant. Any render wrangler will tell you that even if individual servers crash or suffer from a hardware failure, the job will go on. This is mostly due to the way renders are distributed amongst the servers available.
When operating at scale, each scene is split into its component frames. These frames are then distributed to individual servers by a queue management engine. The specific engine depends on the company involved (for example Sun Grid Engine, Backburner, Deadline, Coalition or Torque), but they usually work in a similar fashion. If a server becomes unresponsive, the frame is marked as unfinished and the work of rendering it is handed to another server. This means that rendering is well suited to using excess cloud capacity.
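The requeue behaviour described above can be sketched as a toy simulation. This is not how any of the engines named above is implemented; `render_scene` and its parameters are hypothetical, and a server being reclaimed is modelled as a simple random event.

```python
import random
from collections import deque

def render_scene(num_frames, num_servers, preempt_prob, seed=0):
    """Simulate a frame queue in which frames on lost servers are requeued.

    Returns (finished_frames, total_attempts).
    """
    rng = random.Random(seed)
    queue = deque(range(num_frames))
    finished = set()
    attempts = 0
    while queue:
        # Hand out at most one frame per available server in this round.
        batch = [queue.popleft() for _ in range(min(num_servers, len(queue)))]
        for frame in batch:
            attempts += 1
            if rng.random() < preempt_prob:
                # Server was taken away: mark the frame unfinished and requeue it.
                queue.append(frame)
            else:
                finished.add(frame)
    return finished, attempts

finished, attempts = render_scene(num_frames=100, num_servers=10, preempt_prob=0.2)
print(f"{len(finished)} frames finished in {attempts} attempts")
```

Every frame is eventually rendered; losing servers only costs extra attempts, which is why reclaimable capacity is acceptable for this workload.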
Storage & Network Bandwidth
Rendering requires shared storage for all scene assets and output data. This is typically exposed as an NFS share running on a VM. This works well until the number of render nodes grows to around 250 to 300. At that size the storage becomes a bottleneck and performance suffers. To push past that point we need a clustered file system. We have worked with Intel’s Lustre and Avere Systems’ Cloud NAS; both allow us to scale well beyond 300 servers and still provide ample network bandwidth for the data store.
Thanks to Google’s better network performance (see below) the NFS limit is slightly higher on GCP, but it is still the limiting factor to hitting massive scale.
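A rough way to see why a single NFS server runs out of headroom is to divide its network link evenly among the render nodes. The sketch below assumes the NFS VM is limited only by its network interface and that all nodes read at once, which is a simplification; `per_node_mbps` is a hypothetical helper.

```python
def per_node_mbps(server_gbps, num_nodes):
    """Even share of the NFS server's link per render node, in Mbps."""
    return server_gbps * 1000 / num_nodes

for link_gbps in (10, 16):
    for nodes in (50, 250, 300, 1000):
        share = per_node_mbps(link_gbps, nodes)
        print(f"{link_gbps} Gbps link, {nodes:>4} nodes: {share:6.1f} Mbps per node")
```

Under these assumptions a faster link moves the ceiling but does not remove it, whereas a clustered file system spreads the load across many servers.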
Pick a cloud, any cloud
As mentioned earlier both AWS and GCP provide pricing models that work well with rendering. However, there are some differences that are worth pointing out:
| |AWS|GCP|
|---|---|---|
|Maximum cores/server|40 cores|32 cores|
|Maximum RAM/server|244 GB|208 GB|
|Maximum network bandwidth|10 Gbps|16 Gbps|
These are current as of the date of publication but are likely to change when new larger instances from both vendors reach general availability. Currently AWS holds the crown of most cores but Google is undoubtedly the network bandwidth champ.
Traditionally, render farms are purpose-built datacenters, usually located very close to the animation professionals building the models, textures, lighting and scenes. These assets are relatively heavy (on the order of tens of terabytes). They change frequently during the course of production and may need to be modified at a moment’s notice. This makes transferring them from local systems to the cloud problematic, both because of their sheer size and because of the need to keep them synchronized.
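To get a feel for the size problem, here is a minimal sketch that estimates how long a bulk upload would take. The link speeds and the 80% efficiency factor are assumptions for illustration; real throughput depends on the studio's connection and the transfer tooling.

```python
def transfer_hours(terabytes, link_gbps, efficiency=0.8):
    """Hours needed to move `terabytes` (decimal TB) over a link of `link_gbps`.

    `efficiency` is the assumed fraction of the link actually usable.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

for tb in (10, 50):
    for gbps in (1, 10):
        print(f"{tb} TB over {gbps} Gbps: {transfer_hours(tb, gbps):.1f} hours")
```

Even on a fast link the initial upload is measured in hours or days, so ongoing synchronization of changed assets matters more than the one-time copy.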
The assets also form the core of a studio’s intellectual property and are extremely sensitive in nature. Many studios have very strict security provisions and the MPAA itself has set forth several guidelines for datacenter security.
Since render farms are purpose-built, adding capacity is not easy. Datacenters take years to plan, hardware procurement cycles are slow, and setup, configuration and networking all take time. As a result, the licensing models offered by rendering engine software vendors are typically not designed to be elastic.
Rendering/VFX demand continues to grow at incredible rates. The prevalence of 4K has led to ever-increasing texture sizes and an almost insatiable need for computational resources. New frontiers such as virtual reality and 360° photography are generating even more demand.
As the cloud matures and the proprietary technology being developed at Amazon and Google takes tighter hold, it will be increasingly difficult for on-premises data centers to keep up. The unprecedented scale of the cloud is forcing a rearchitecting of traditional technology, and that is leading to dramatic leaps in performance.
Given the broader industry trends and the track record of the cloud so far, it is very likely that this technology will play a major part in the next wave of rendering & VFX work. We believe that any animation and/or VFX studio needs to get comfortable with the cloud way of doing things because it will almost certainly be their new render farm.