Virtualization Technology Adoption – part 5 – High Performance Computing
Now we’ve come to the earliest stage of market adoption, the Innovators. We have also come to the most interesting of computing classes: supercomputing, also commonly referred to as High Performance Computing (HPC). That HPC is the most interesting class of computing is my perspective only, as are all these posts. If you find my perspective helpful, great; if not, that’s OK too. In either case, let me know in the comments what you like about the model, what you don’t like, or what you might add, change or delete. My analysis is by no means complete, just some rambling based on my own experience and reading. Chris Lau added some useful comments to my post on desktop computing that I had not considered; they added to my understanding and actually fit well with the model. Thanks again, Chris, for reading and adding value.
The use of virtual machines is being considered for HPC, but there are many barriers to its use. I commented on some of these barriers in a good itWorldCanada article by Kathleen Lau on Linux in the Cloud, which included a video interview discussing Canada’s largest supercomputer, a machine built by IBM for SciNet.
I cannot characterize supercomputer programs in such a short post, and there are many dimensions to supercomputing architecture from both the hardware and software perspective. The key element in considering what makes computers super is parallelism. A supercomputer program can partition the problem it is solving into smaller components that execute concurrently on a large number of processors. So if the solver takes 10 hours to complete a problem on one CPU, in an ideal world the problem can be solved in one hour on 10 CPUs – this is linear scalability. Some algorithms scale linearly, more scale sub-linearly, and a few under certain conditions may see super-linear scaling (it can happen if the partition is small enough to fit entirely in on-chip cache memory, for example). HPC users devote a considerable amount of research effort to tuning their systems and software programs to scale well.
In many cases the commercial value of HPC is “time to solution”: if an answer to a simulation is needed in an hour, there is high value in getting the answer on time and little or no value if the answer is late. For example, if the shuttle astronauts find a tear in the heat shield of the space shuttle in orbit, NASA will model that tear in a computer model of the shuttle and simulate a re-entry under those conditions. If that simulation takes a month to complete, it is of little value to that mission. If it takes a day to complete, they may make a risk decision based on the results. If it takes an hour to complete, they can run many simulations with varied parameters to ensure a better analysis.
Time to completion depends largely on all the parallel tasks taking the same amount of time. If one task slows down, then the entire simulation slows down. The tuning of supercomputers focuses on making sure all the processors are ideally balanced, and on optimizing any points of the system that cause slowdown. This is much harder to do with virtual machines.
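To make the arithmetic concrete, here is a minimal sketch of the 10-hour example in Python. It assumes a very simple model in which the run finishes only when its slowest task finishes; the function names and the 50% slowdown on one task are my own illustrative assumptions, not measurements from any real system.

```python
def ideal_time(serial_hours, n_cpus):
    """Linear scalability: the work divides evenly across n_cpus."""
    return serial_hours / n_cpus


def completion_time(task_hours):
    """In this simple model the run finishes when its slowest task finishes."""
    return max(task_hours)


def speedup(serial_hours, parallel_hours):
    """How many times faster the parallel run is than the one-CPU run."""
    return serial_hours / parallel_hours


serial_hours = 10.0
n_cpus = 10

# Perfectly balanced: every task takes the same one hour.
balanced = [ideal_time(serial_hours, n_cpus)] * n_cpus

# Assumed imbalance: a single task runs 50% slower than the rest.
unbalanced = balanced[:-1] + [balanced[-1] * 1.5]

print(completion_time(balanced))    # 1.0 hour, a 10x speedup
print(completion_time(unbalanced))  # 1.5 hours
print(round(speedup(serial_hours, completion_time(unbalanced)), 2))  # 6.67
```

In this toy model, one slow task out of ten drags the speedup from 10x down to under 7x, even though nine of the ten CPUs did their work on time.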
If a parallel algorithm is highly scalable and tolerant of imbalances across the computing resources, then it is an excellent candidate for virtualization. These algorithms have loosely coupled parallelism; they can typically take advantage of very large numbers of processors and are less dependent on balance and interprocess communication. SETI@home is an example of such a program. Where virtualization is an option, consolidation is not the value proposition, because dividing the processor resources across several virtual machines slows down the algorithm, defeating the objective of using more resources to speed up the result. Consolidation in compute servers was a very strong value proposition because their compute capacity was greatly under-utilized. HPC resources, by contrast, squeeze every last ounce of performance from each and every CPU. The value proposition in this case is portability, because the parallel program can run anywhere, even in the “cloud”. Another value proposition is heterogeneity: many supercomputer programs are so highly tuned that they are only supported on certain versions of operating systems. A supercomputer that is to be shared by a diverse research community must then be able to support a wide variety of operating systems and versions, which can be done with virtual machines. The alternative is to reboot the processor with another OS, but that has its own set of issues.
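Why loosely coupled work tolerates imbalance can be sketched with a small simulation. This is only an illustration under assumptions of my own: 100 independent one-hour work units, four workers, one of which runs at half speed (say, because its virtual machine shares a host). The function names are hypothetical, and this does not model how SETI@home or any real scheduler actually works.

```python
import heapq


def static_partition_time(n_units, unit_hours, speeds):
    """Each worker gets an equal share up front; the run ends with the slowest worker."""
    share = n_units // len(speeds)
    return max(share * unit_hours / s for s in speeds)


def work_queue_time(n_units, unit_hours, speeds):
    """Workers pull one independent unit at a time, as soon as they are free."""
    # Heap of (time the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_units):
        t, i = heapq.heappop(free_at)      # next worker to become free
        done = t + unit_hours / speeds[i]  # when it finishes this unit
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]  # assumed: the last worker runs at half speed

print(static_partition_time(100, 1.0, speeds))  # 50.0 hours
print(work_queue_time(100, 1.0, speeds))        # 29.0 hours
```

With a fixed, equal partition the slow worker doubles the run time from 25 to 50 hours. When the units are independent and handed out on demand, the fast workers simply absorb more of the work and the run ends at 29 hours, close to the 25 hours a perfectly balanced set of four workers would need.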
Another trend has appeared in the chart – the value propositions become more sparse earlier on the technology adoption curve. Do more value propositions emerge as you advance through the lifecycle, or do additional value propositions drive the adoption along the curve?
Unfortunately, many HPC programs require tightly coupled parallelism to perform well, and this is not easy to achieve with virtualization. It can be done, but it will take time; the HPC market is smaller than the compute server, desktop or embedded computing markets, so investment in virtualization will not have as strong a commercial return and will likely be done largely by the research organizations themselves. When virtualization achieves better adoption in the HPC market, it will save money for vendors of commercial HPC software, as they will only need to test their programs in a virtual machine, not on a large number of operating systems and versions – I believe this will encourage virtual machines into the early adopter camp, at least for commercial HPC software. “Grand Challenge” supercomputer systems, by their nature, stretch their underlying technologies to the breaking point, so I believe migration to the early adopter phase will be slower for these Top500 machines.
So why do I find supercomputing the most interesting class of computing? Because there we push the envelope of IT: we scale the pieces until they break, we fix them, we scale some more until something else breaks, we fix that and scale some more. We build these machines to simulate problems too costly or too dangerous to perform in the real world. We build computational models of cars and crash them without ever building a prototype, the result being lighter, safer cars that are designed in a shorter interval. We build computational models of weather systems, which allow us to predict the weather in the future to a certain probability of accuracy; as the machines get bigger, as the granularity of the models gets finer and as the algorithms improve, so do the predictions. The mapping of the human genome has enabled a whole new field of science – computational drug discovery; algorithms pore over huge databases of chemical structures looking for pattern matches with human DNA – this has the potential to cut the drug discovery interval in half, and is opening the door to the possibility of designer drugs tuned to our individual DNA. Susan Baldwin, executive director of Compute Canada, spoke to the ORION summit on HPC; she presented the analogy, “If Google is the search engine of the present, then HPC is a search engine for the future”. Supercomputers are being built to solve “grand challenge” science problems where the only method of experimentation is numerical modeling and simulation. Supercomputers will be used to make new discoveries and solve more complex riddles.
Now I get on my soap box. Canada is falling behind in the use of information technology, and it is well documented that as a result there is a growing competitive lag with the US and other advanced nations. Canada’s top supercomputer is ranked 22nd on the most recent Top500 list. Unless a new, larger machine is brought online in the next few weeks, that position will drop with the next list in June. Yes, the US economy is 10x the Canadian economy, so it is not surprising that they are above us on the list, but they have 12 machines on the list above us, and their top machine is 10x bigger – we’re way behind. Even more interesting are the other nations above us on the list: Switzerland, UK, China, Saudi Arabia, Korea, Germany and Russia. You can download the list as a spreadsheet and calculate total capacity in the Top500 ranked systems by country, and it doesn’t position Canada any better. Canada needs to build more supercomputers. We need to build a petaFLOP machine.



