By Simon Karpen, Sr. Technical Consultant

Supercomputing is The International Conference for High Performance Computing (HPC), Networking, Storage and Analysis. Supercomputing ’12 is happening now in Salt Lake City. The primary organizers are IEEE and ACM, and the sponsors include computing and research organizations and companies.

As a Taos Practice Leader, I hold system reliability very near and dear to my heart. We have a great deal of expertise in helping our clients craft reliable, redundant, and resilient systems that meet complex business needs within realistic budgets.

In large clusters, many of these reliability problems are magnified. There is a great deal of ongoing research on how to maintain performance and reliability when scaling from tens of thousands to hundreds of thousands of nodes. Research areas range from the analysis of DRAM failures to improved methods of fault prediction based on mining very large corpora of system logs. Reducing DRAM failures reduced node reboots caused by uncorrectable errors, which provided a meaningful improvement in throughput and a reduction in checkpointing overhead. There is substantial research and innovation in this area, and more will be needed to successfully build 100k-node clusters without excessive overhead.
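To make the checkpointing concern concrete, here is a rough back-of-envelope sketch (my own illustration, not a result presented at the conference) using Young's classic approximation for the optimal checkpoint interval; the per-node MTBF and checkpoint-write time below are assumed values purely for illustration.

```python
import math

def checkpoint_overhead(nodes, node_mtbf_hours, checkpoint_minutes):
    """Estimate the fraction of machine time lost to checkpoint/restart."""
    # Assume independent node failures, so system MTBF shrinks as 1/nodes.
    system_mtbf = node_mtbf_hours * 3600.0 / nodes       # seconds
    dump_time = checkpoint_minutes * 60.0                 # seconds per checkpoint
    interval = math.sqrt(2.0 * dump_time * system_mtbf)   # Young's approximation
    # Lost time = writing checkpoints + expected recomputation after failures.
    overhead = dump_time / interval + interval / (2.0 * system_mtbf)
    return interval, overhead

for nodes in (10_000, 100_000):
    interval, overhead = checkpoint_overhead(nodes, node_mtbf_hours=50_000,
                                             checkpoint_minutes=10)
    print(f"{nodes:>7} nodes: checkpoint every {interval / 60:5.1f} min, "
          f"~{overhead:.0%} of machine time lost")
```

With these assumed numbers, going from 10,000 to 100,000 nodes pushes the lost time from roughly a quarter of the machine to most of it, which is exactly why failure reduction and smarter fault prediction matter so much at scale.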

Perhaps even more critical than outright failures is the growing concern about silent corruption and errors in data. When you are working with multi-petabyte data sets, storage with a silent error rate of one in 10¹³ bits can be a significant problem. There are similar ratios and issues with undetected DRAM errors, even with ECC RAM (not every single-bit error is caught, and there are sources of single-bit errors other than the DRAM itself). In some cases, critical computations are even run on two or three separate systems to verify that the answers are consistent, but the cost makes this impractical for all or even most computations. Advances in filesystem checksumming (e.g. ZFS) have helped in some areas, but a great deal of additional work remains.
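To give a sense of scale (the numbers below are my own illustration, not from any conference talk), here is a quick sketch of how many silent bit errors a 1-in-10¹³ rate implies at petabyte scale, together with a toy version of the end-to-end checksum-on-read approach that filesystems like ZFS use to catch such corruption.

```python
import hashlib

SILENT_ERROR_RATE = 1e-13      # undetected errors per bit, as cited above
PETABYTE_BITS = 8 * 10**15     # bits in one (decimal) petabyte

for petabytes in (1, 10, 50):
    expected = petabytes * PETABYTE_BITS * SILENT_ERROR_RATE
    print(f"{petabytes:>3} PB read once: ~{expected:,.0f} expected silent bit errors")

# Simplified end-to-end checksumming: keep a digest with every block and
# verify it on each read, so corruption is detected instead of silently
# handed back to the application.
def write_block(data: bytes):
    return data, hashlib.sha256(data).hexdigest()

def read_block(data: bytes, stored_digest: str) -> bytes:
    if hashlib.sha256(data).hexdigest() != stored_digest:
        raise IOError("silent corruption detected: checksum mismatch")
    return data

block, digest = write_block(b"simulation output chunk")
assert read_block(block, digest) == b"simulation output chunk"
```

Even at a single petabyte the expected error count is in the hundreds, which is why checksumming at the filesystem or application layer, rather than trusting the storage hardware alone, keeps coming up.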

The Supercomputing conference is also working hard to provide opportunities for traditionally under-represented groups. This is both good business (there is not necessarily a shortage of potential talent, but there is a real shortage of actualized talent) and good for society. The conference provides mentorship support, guidance on networking and professional relationships, and financial support to help those who could not otherwise attend. As a mentor, I had the opportunity to help a very hard-working non-trad undergrad navigate the conference, choose tutorial and workshop sessions, and get comfortable networking with fellow participants.

For research topics and projects that are smaller, still in progress, or simply better suited to the format, there is also a poster session. Topics range from virtual machine migration over long-haul WANs to CPU-versus-GPU scaling for specific codes. There were also many posters on new applications of Hadoop and related technologies, on cloud pricing, and on a recent innovation: electronic posters displaying extremely detailed visualizations of everything from aerosols in the atmosphere to crack formation in a piece of metal.

Finally, while it isn’t directly related to building large-scale infrastructure, Intel brought the bridge from the original USS Enterprise to the conference. We shouldn’t forget to thank science fiction for many of the ideas behind current technologies we take for granted, such as flip phones, iPads, and of course Google Earth.