By Simon Karpen — Taos Practice Leader
Supercomputing is the International Conference for High-Performance Computing (HPC), Networking, Storage, and Analysis. Supercomputing ’12 is happening now in Salt Lake City. The primary organizers are the IEEE and ACM, and sponsorship comes from a range of computing and research organizations and companies.
One of the overriding themes of the conference tutorials and workshops is big data. In the recent past, many HPC problems have been of the “big compute, small data, embarrassingly parallel” variety. These scale well on everything from a Beowulf cluster to, in extreme cases, scatter/gather across individual workstations (e.g. SETI@Home). The current challenges involve extremely data-intensive problems, such as genomics research and climate modeling, which require fundamentally different modes of operation and infrastructure.
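For readers less familiar with that style of workload, here is a minimal sketch (not from the conference material) of the “big compute, small data” pattern: a Monte Carlo estimate of pi in Python, where each worker receives only a sample count, computes independently, and returns a single number. The worker count and sample sizes are arbitrary choices.

```python
#!/usr/bin/env python
# Tiny illustration of the "big compute, small data, embarrassingly
# parallel" pattern: a Monte Carlo estimate of pi. Each worker gets only a
# sample count (tiny input), burns CPU independently, and returns one
# number, so the work scatters cleanly across cores, cluster nodes, or
# volunteer workstations. Worker count and sample sizes are arbitrary.

import random
from multiprocessing import Pool


def hits_in_circle(samples):
    # Count random points in the unit square that land inside the quarter circle.
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


if __name__ == "__main__":
    workers, per_worker = 8, 1000000
    with Pool(workers) as pool:
        total_inside = sum(pool.map(hits_in_circle, [per_worker] * workers))
    print("pi ~= %.5f" % (4.0 * total_inside / (workers * per_worker)))
```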
For small-data problems, much of the software is custom-written and domain-specific. For big data, this is no longer true; the commercial and HPC/research worlds are one and the same. Column-oriented NoSQL data stores, object storage, and of course a wide variety of Hadoop implementations now play a critical role. Many of these big-data problems rely on the ability of tools like Hadoop to parallelize I/O; we’re moving from a CPU-bound world to an I/O-bound world. Thanks to open source, there’s a great deal of tool improvement and cross-pollination between the commercial and research spheres.
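As a rough illustration of how that I/O parallelism looks in practice, here is a minimal Hadoop Streaming-style word count in Python. The file name, HDFS paths, and invocation shown in the comments are assumptions for the sketch; the point is simply that Hadoop splits the input, runs one mapper per split (spreading reads across nodes and disks), and feeds sorted key groups to the reducer.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count (sketch). Hadoop splits the input
# files in HDFS, runs one copy of the mapper per split (so reads are
# parallelized across nodes and disks), sorts the mapper output by key,
# and streams each key group to the reducer.
#
# Hypothetical invocation (paths and jar location are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/text -output /data/wordcount \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py

import sys


def map_phase():
    # Emit "word<TAB>1" for every word on stdin (one input split per mapper).
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)


def reduce_phase():
    # Input arrives sorted by key, so counts for a given word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, int(value)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reduce_phase()
    else:
        map_phase()
```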
Another development that parallels Silicon Valley is the rise of the cloud. Substantial research is happening on public clouds such as AWS, and many organizations are building private clouds with OpenStack or Eucalyptus. In the past, the HPC community built its own tools for infrastructure management; now many of these clouds are managed with common configuration management software such as Chef. Again, there’s a great deal of cross-pollination, shared development, and shared improvement to tools and ideas.
Finally, the disk is officially the new tape. Many big-data tools (including Hadoop) are designed to read and write data on disk in a linear fashion, to ensure adequate performance on cost-effective commodity storage. For workloads requiring random access, solid-state storage rules the roost; an Atom-based system with an SSD can outrun a traditional many-core workstation or server with spinning disks.
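To make the sequential-versus-random contrast concrete, here is a small microbenchmark sketch in Python. The file name, file size, and read counts are arbitrary, and on a real system you would need a file larger than RAM (or a dropped page cache) for the disk behavior to show through.

```python
#!/usr/bin/env python
# Rough sketch of why access pattern matters: time a sequential scan of a
# file versus the same block size read at random offsets. On spinning
# disks the random case is dominated by seek latency; on an SSD the gap
# largely disappears. File name, size, and read counts are arbitrary.

import os
import random
import time

PATH = "testfile.bin"          # hypothetical scratch file
FILE_SIZE = 256 * 1024 * 1024  # 256 MB (use something larger than RAM in practice)
BLOCK = 4096                   # 4 KB reads
N_RANDOM = 4096                # number of random reads to sample

# Create the scratch file if it does not already exist.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(os.urandom(FILE_SIZE))

# Sequential scan: the access pattern Hadoop-style tools are built around.
start = time.time()
with open(PATH, "rb") as f:
    while f.read(8 * 1024 * 1024):
        pass
print("sequential scan: %.2fs" % (time.time() - start))

# Random reads: the pattern where SSDs pull far ahead of spinning disks.
start = time.time()
with open(PATH, "rb") as f:
    for _ in range(N_RANDOM):
        f.seek(random.randrange(0, FILE_SIZE - BLOCK))
        f.read(BLOCK)
print("%d random 4 KB reads: %.2fs" % (N_RANDOM, time.time() - start))
```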