I am adding a new topic to the blog: High Performance Computing and Supercomputing on VMware vSphere. This topic came out of a recent customer discussion about how they could build a distributed cloud for grid computing that scales on par with traditional supercomputers. Coming from a family that has been involved with supercomputers for decades, I found this topic to be of keen interest. As a result, I’ve started to collect a series of notes and white papers on High Performance Computing (HPC) on VMware vSphere and the changes that must be considered to maximize the performance and scalability of a virtual HPC cluster environment.
First, an old but mandatory read on the topic is a paper by Cam Macdonell and Paul Lu from the University of Alberta titled: Pragmatics of Virtual Machines for High-Performance Computing: A Quantitative Study of Basic Overheads
http://www.vmware.com/files/pdf/paullu.vmware.final.pdf
Macdonell and Lu’s paper is a little dated now, but many of the ideas still hold. Points to be aware of are how vSphere 5 has been optimized to reduce overhead and improve performance, and the automated orchestration features that allow for capacity on demand. The overhead concerns raised in Macdonell and Lu’s closing remarks have since been reduced to less than 1% for most compute types, and as CPU performance increases every year, this overhead becomes smaller yet.
Next, VMware employee Jeff Buell posted a blog post on “HPC Application Performance on ESX 4.1: Stream” back in September 2010.
http://blogs.vmware.com/performance/2010/09/hpc-application-performance-on-esx-41-stream.html
Jeff goes to great lengths to identify the factors that optimize performance of the compute cluster. Most notable are two points: writing applications to use local memory ensures optimal memory bandwidth once deployed, and keeping a VM’s compute resources within a single NUMA node optimizes resource utilization. While vSphere can address 1TB of RAM and up to 32 vCPUs from a single VM, optimal performance lies in keeping VMs tuned and sized to run within the limits of the server the VM is hosted on.
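To make the NUMA sizing point concrete, here is a minimal sketch in Python of the kind of check I have in mind. It is my own illustration, not something taken from Jeff's post or from any VMware tool: the names NumaNode, VmSize, fits_in_one_node and nodes_spanned are hypothetical, and the host layout in the usage note is an assumed example. It also ignores hypervisor memory overhead, so treat the result as a rough upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    """One NUMA node of a host (hypothetical description, not a VMware API)."""
    cores: int       # physical cores in the node
    memory_gb: int   # RAM local to the node


@dataclass(frozen=True)
class VmSize:
    """Requested size of a VM (hypothetical description)."""
    vcpus: int
    memory_gb: int


def fits_in_one_node(vm: VmSize, node: NumaNode) -> bool:
    """True when both the vCPUs and the memory fit inside a single NUMA node."""
    return vm.vcpus <= node.cores and vm.memory_gb <= node.memory_gb


def nodes_spanned(vm: VmSize, node: NumaNode) -> int:
    """Minimum number of identical NUMA nodes the VM has to span.

    Uses ceiling division on both CPU and memory and takes the larger value.
    Hypervisor overhead is ignored (simplifying assumption).
    """
    by_cpu = -(-vm.vcpus // node.cores)
    by_mem = -(-vm.memory_gb // node.memory_gb)
    return max(by_cpu, by_mem)
```

As a usage example, assume a host whose NUMA nodes each have 8 cores and 64 GB of local RAM, that is NumaNode(cores=8, memory_gb=64). A VmSize(vcpus=8, memory_gb=48) fits in one node, while VmSize(vcpus=12, memory_gb=48) has to span two nodes and will see some remote memory access. If the workload allows it, splitting that larger VM into two smaller ones keeps each within a node.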
Next, I want to make sure you follow the blog posts of Josh Simons. Josh works in the Office of the CTO at VMware as a strategist specializing in HPC and maintains the VMware Blog posts on HPC here: http://communities.vmware.com/community/vmtn/cto/high-performance
Josh has contributed several videos and discussions on the topic. With the recent Supercomputing 2011 event in Seattle, Josh pulled together several interviews and overviews of technologies that will enable Cloud-based HPC.
In addition, Josh’s 2010 overview of HPC in the Cloud is worth a look.
Lastly, my own observations and comments:
With the recent release of vSphere 5 and Auto Deploy, maintaining a scalable Cloud infrastructure has become considerably simpler. Updating an entire server farm can be reduced from weeks to minutes by leveraging PXE and an image server to refresh entire farms of servers at reboot. By adding solutions such as templates, workflow orchestration, and capacity management, we can now scale up clusters of computers on demand to accommodate almost any size of distributed workload. Adding vCloud Director and vCloud Connector allows us to scale the compute cluster even further, into one or more public cloud providers on demand. In addition, with the new scalability improvements of vSphere 5, we are seeing larger VMs, more addressable RAM, less overhead, and significant I/O gains at the hypervisor. All of these improvements contribute to the greater acceptance of HPC workloads in the Cloud and in a VM.
As I dig deeper into the topic, I hope to contribute some of my own work to the field and to leverage the knowledge of my colleagues so that others can explore this emerging growth area.

