Systems with exascale performance are likely to be deployed in 2020-2023. Getting there will require significant advances in HPC systems. In particular:
- Single-thread performance is no longer increasing; an exascale system will therefore require billions of concurrent threads. It is not clear which applications can scale to that level, nor how a software stack should be structured to manage such a level of parallelism.
- Power restrictions will probably entail the massive use of hybrid architectures, a much lower CPU-to-memory ratio, and significantly enhanced locality. Power management will be very important.
- Higher error rates (and the avoidance of costly error-recovery mechanisms) will require new solutions for resilience.
I am involved in several efforts to define the software research agenda for exascale and to start executing on it. Note that exascale may be the end of the road, at least for systems using silicon technology. We should start thinking seriously about alternative technologies, and about a computer science agenda that is not predicated on Moore's Law.
One major roadblock to exascale is the increase in failure rates, due to the increase in component count, the decrease in component reliability at smaller feature sizes, and the reduced importance of resilience in leading markets, such as mobile. I edited a major report on the problem of resilience at exascale and co-authored surveys of the field.
The Argo project studies operating system and runtime designs for exascale. The goals include
- A hierarchical global OS design for managing parallel resources
- A node OS design that delegates resource management to the runtime, to the extent possible, and executes asynchronous OS code on dedicated cores
- A runtime designed to support large numbers of lightweight tasks
- A distributed, hierarchical resource management infrastructure that uses online feedback and machine learning to continuously adjust resource allocation, including power.
- A global information bus that distributes performance and RAS information so as to support localized performance tuning and fault handling
I am the chief scientist for this large project.
Parallel I/O systems involve a complex layering of libraries, middleware and parallel file systems. All of these layers have multiple settable configuration parameters. Since the impact of these different knobs is hard to understand, users normally stick with the default values, resulting in a significant loss of I/O performance. Autotuning can be used to significantly improve performance (see paper).
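The idea can be sketched with a toy autotuner. The knob names, value ranges, and cost model below are illustrative assumptions, not the actual parameters of any particular I/O stack; a real autotuner would replace `benchmark` with timed runs of an I/O kernel.

```python
import itertools

# Hypothetical tunable knobs spanning the I/O software stack; the
# names, value ranges, and cost model are illustrative only.
SEARCH_SPACE = {
    "stripe_count": [4, 8, 16],
    "stripe_size_mb": [1, 4, 16],
    "collective_buffer_mb": [4, 16],
}

def benchmark(config):
    """Stand-in for a timed I/O run; returns bandwidth in MB/s.
    A real autotuner would execute an I/O kernel with these settings."""
    return (config["stripe_count"] * 10
            + config["stripe_size_mb"] * 5
            + config["collective_buffer_mb"] * 2)

def autotune(space):
    """Exhaustively evaluate every knob combination and keep the best.
    Smarter search (model-based, genetic) scales to larger spaces."""
    keys = sorted(space)
    best_config, best_bw = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        bw = benchmark(config)
        if bw > best_bw:
            best_config, best_bw = config, bw
    return best_config, best_bw

best, bw = autotune(SEARCH_SPACE)
```

Even this exhaustive search makes the point: the tuned configuration is found mechanically, instead of asking users to reason about knob interactions across layers.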
We are exploring a task programming model that provides a global name space, uses blocking one-sided operations for communication, and supports semi-transparent caching of remote data, for applications where collaborative caching is necessary.
This international project focused on enabling climate simulation at exascale. It involved three thrusts: scalability, resilience, and the use of accelerators. It resulted in a better understanding of node performance issues and scalability bottlenecks in the CESM code, new mesh partitioners, the porting of parts of the NICAM code to GPUs, and general techniques for improving resilience. I was the PI for this project.
Intel and Microsoft funded in 2008 a center for research on multicore computing at Illinois. I was the co-director of this center. Among the main research directions in the center were:
- 3D visual interfaces: Improved visual interfaces will enable new applications (such as those we explore in our teleimmersion project) and enhance user-computer interaction. More compute power is needed on clients to support 3D modeling and rendering, gesture and face recognition, etc. The interactive use of such tasks requires low latency, hence requires computing on the client.
- Safe parallelism: We firmly believe that parallel programs should be race-free by design. I also believe that nondeterminism is seldom, if ever, needed in parallel transformational code (i.e., code where parallelism is introduced uniquely to enhance performance, but is not inherent in the problem specification). A strong focus in our research (mine included) is on languages that ensure code is race-free and provide, by default, deterministic behavior.
- Easy performance: The goal of parallelism is performance -- especially scalability (increasing performance with an increasing number of cores). The creation of scalable code should be significantly facilitated by enhanced programming environments.
NCSA won in 2008 a competition to host a $200M petascale supercomputer at U. Illinois. I was one of five co-PIs on this proposal, focusing on the software architecture of Blue Waters. Among the main concerns of this effort were:
- Making performance engineering a reality by developing a more systematic methodology for performance prediction of large codes; the goal is that codes be ready to run on Blue Waters as soon as the machine is in operation; performance modeling should not be an exercise done by a small number of expert teams, but an inherent part of code development.
- Using large SMP nodes effectively: We pushed changes into MPI-3, and looked at possible usages of PGAS languages and restricted OpenMP dialects for that purpose.
- Large-scale resource management: we looked at support for workflow and for better integration of storage management with CPU scheduling.
- Improved routing: The topology of Blue Waters raised very interesting questions about the best way of allocating partitions and of routing in the system.
- System monitoring: We looked at the best way to continuously collect data on the health of the system and its performance, and to continuously monitor the system for anomalous behaviors.
Refactoring for High Performance Computing
Tuning parallel codes for performance is a necessary but tedious activity that is not much helped by current tools. Automated code optimization (e.g., by a compiler) has limited capabilities. Various projects have tried to develop interactive compilers, where the compiler provides feedback to the user that can then guide the compiler's transformations. Success has been limited: one reason is that the intermediate representation used by the compiler is not human-readable, so it is hard to convey to the user the result of the transformations performed. A promising alternative is to use a refactoring infrastructure, as discussed in our recent article: the transformations are source-to-source, but a compiler infrastructure can be used to automate much of the transformation process, and multiple source versions can be maintained as one code artifact. We are currently investigating the use of this approach.
Classification for Performance Tuning
In many cases, there is a choice of multiple algorithms to solve a problem, and each algorithm can be implemented in different ways. One wishes to make the choice that yields the best performance, but this choice may be platform-dependent and input-dependent. Neither choice (of algorithm or of implementation) can be made automatically, e.g., by a compiler.
We can consider the choice of an optimal algorithm or an optimal implementation as a classification problem: each input is tagged by the code that works best for this input; one wishes to find a cheap-to-compute classifier that associates (with high probability) a "good" code with each input. It turns out that this approach works well in practice. One can use machine learning techniques to train a classifier, using inputs for which we found, by brute force, the best code, and then use the classifier at execution time to select the right code.
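A minimal sketch of this approach, with invented codes and features: two hypothetical sort routines play the role of candidate codes, a single "sortedness" feature describes each input, the brute-force training labels are simulated by a rule standing in for actual timings, and a one-nearest-neighbour classifier makes the run-time choice.

```python
import random

random.seed(0)

def sortedness(xs):
    """Cheap input feature: fraction of adjacent pairs already in order."""
    pairs = list(zip(xs, xs[1:]))
    return sum(a <= b for a, b in pairs) / len(pairs)

def best_code_by_brute_force(xs):
    """Stand-in for timing every candidate code on input xs.
    Assumption: insertion sort wins on nearly sorted data."""
    return "insertion_sort" if sortedness(xs) > 0.8 else "merge_sort"

# Training phase: tag sample inputs with the code that won on each.
training = []
for _ in range(50):
    xs = sorted(random.sample(range(1000), 20))
    if random.random() < 0.5:          # half the samples get scrambled
        random.shuffle(xs)
    training.append((sortedness(xs), best_code_by_brute_force(xs)))

def classify(xs, training):
    """Run-time choice: pick the label of the nearest training example
    in feature space (a 1-nearest-neighbour classifier)."""
    f = sortedness(xs)
    _, label = min(training, key=lambda t: abs(t[0] - f))
    return label
```

At execution time the classifier costs one feature evaluation and one lookup, instead of running every candidate code on the actual input.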
We have shown this technique to be effective for the well-studied Frequent Item Mining problem.
Patterns of High Performance Computing
There is significant work on the development of better parallel programming models and environments. However, the evaluation of this work is far from systematic. The purpose of our research is to collect, in a more systematic manner, a library of parallel programming patterns, and use them to evaluate the usefulness and expressiveness of existing and proposed parallel programming models and environments. A community effort to support this work was started at the patHPC workshop that took place in April 2005 at UIUC (work funded by DOE as part of the Center for Programming Models for Scalable Parallel Computing).
DARPA awarded IBM $53.5M in funding to pursue research in High Productivity Computing Systems. The work on the Productive, Easy-to-Use, Reliable Computing System (PERCS) covers chip technology, architecture, OS, compiler and programming environment. UIUC is a partner in this research; the UIUC team includes Profs. Vikram Adve, Ralph Johnson, David Padua, Marc Snir, and Josep Torrellas and their students. I am involved with students in several aspects of this project. In particular, we work on architectural extensions to support key applications; on benchmarks and performance modeling; and on compiler technology in support of semi-automated, run-time program tuning. (Work funded by DARPA; see recent publications.)
IBM Research announced in December 1999 a $100 million research initiative to build “Blue Gene”, a petaflop supercomputer. This level of performance would be achieved by connecting together a massive number (32K) of single-chip multiprocessors, where each chip is a highly parallel, shared memory multiprocessor. The main intended application for this machine is the simulation of protein folding, a process that is key to the understanding of the fundamentals of life.
The exploitation of such a machine will present challenging software problems. In particular, one needs to ensure that the machine will continue to operate for months, even though individual components may fail during this time. One needs to develop new methods for programming and controlling this massive number of concurrently executing parallel threads. One needs to adapt software algorithms to achieve peak performance on a novel architecture with on-chip memory and hardware-supported concurrent multithreading.
I initiated the project and led the system work in the initial phase. We demonstrated a speed-up in excess of 300,000 in a simulation of protein folding in Blue Gene [ICS 2000]. The Blue Gene effort had a large number of contributors, listed as authors in a survey paper. The Blue Gene project underwent significant technical and managerial changes a year after it started and is now focused on a different architecture and different set of applications.
Scalable Parallel System Architecture
Today, most large-scale parallel computers are built by assembling smaller ``commodity'' compute nodes. Two main architectures have emerged:
- Shared Memory Multiprocessors (SMM's), where each processor can access all memory (and all I/O), and all processors' caches are maintained coherent. Such systems are usually controlled by one operating system image. The prevalent parallel programming model is thread parallelism: each processor runs a separate execution thread. The threads run in a common address space, and communicate via shared memory.
- Clusters, where each node has a separate memory and a separate I/O subsystem, and each node is controlled by a separate operating system image. The prevalent parallel programming model is message passing: each processor runs a process in a private address space; processes communicate with each other via messages.
We studied a combined communication adapter that supplements a conventional messaging engine with two additional mechanisms:
- A data mover engine that can be used for memory-to-memory or memory-to-cache copying of data in a large NUMA system. In a conventional NUMA system, data can be moved only via load and store operations. This consumes precious CPU cycles and overloads the critical resource in such a system, namely the CPU-to-memory communication path. A data mover engine can prefetch data so as to hide long latencies without tying up key resources.
- A coherence engine that can be used to maintain coherent shared memory segments across different operating system images. To do so, the coherence engine needs additional protection mechanisms, to ensure that a faulty kernel will not issue wild writes against the memory of another OS image, and to ensure that an error in one system will not bring down other systems. However, much of this protection machinery is already built into messaging engines, as they are used to communicate across separate OS images.
Such a combined adapter can efficiently support a partitionable shared memory multiprocessor. Increasingly, vendors of large NUMA systems provide a dynamic partitioning capability on their systems: the system's physical resources can be partitioned into several physical partitions, each controlled by a separate operating system image, and the partition boundaries can be moved without rebooting the affected OS images. Partitioning provides better fault isolation and enables concurrent maintenance. A combined adapter enables the use of the shared memory interconnect for fast, efficient message passing across partitions.
The main obstacle to the scaling up of NUMA systems is software, not hardware. Therefore, it is very likely that, in coming years, the ability of hardware developers to scale up SMM's will significantly outstrip the ability of OS and subsystem developers to scale up their software. With partitioning, it becomes possible to scale up NUMA systems beyond the number of processors that can be supported by one OS image. The shared memory hardware can be exploited in various ways to support limited sharing for applications and subsystems that span multiple OS images. In particular, one can efficiently support a shared memory programming model for applications that span multiple OS images. In this model, each processor runs a process in a separate address space; shared variables are kept in a segment that is mapped into the address space of each process and is maintained coherent by hardware. Similarly, shared memory can be used to provide efficient global lock services, or an efficient global cache for shared disks.
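The programming model just described can be loosely mimicked in software. The sketch below is a POSIX-only toy, not the hardware mechanism discussed above: each worker runs in its own address space (a forked process), while one shared memory segment is mapped into all of them, so every writer's updates are visible to the parent.

```python
import os
from multiprocessing import shared_memory

# One shared segment, mapped into several separate address spaces.
seg = shared_memory.SharedMemory(create=True, size=4)

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:                 # child: its own address space,
        seg.buf[i] = i * 10      # but the segment mapping is shared
        os._exit(0)              # exit without running cleanup handlers
    pids.append(pid)

for pid in pids:                 # wait for all workers to finish
    os.waitpid(pid, 0)

result = list(seg.buf)           # the parent sees the children's writes
seg.close()
seg.unlink()                     # remove the segment from the system
```

Private variables stay private to each process; only data placed in the mapped segment is shared, which is exactly the "shared segment across separate address spaces" discipline described above.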
One can expect that such partitionable systems will become increasingly prevalent in coming years, providing a convergence point for large-scale computer architectures. Rather than being forced into a choice between ``shared all'' and ``shared nothing'' architectures, users will dynamically select an appropriate level of sharing.
Some of these ideas were demonstrated in the Prism project.
Java for High Performance Numerical Computing
First proposed as a mechanism for enhancing Web content, Java has taken off as a serious general-purpose programming language. Industry and academia alike have expressed great interest in using Java as a programming language for scientific and engineering computations. Such usage has been hampered by the relatively poor performance of Java on numerically intensive applications, and by the lack of key features in the Java language. In the Ninja project, we have developed compiler technology and libraries that lead to Java numerical codes with performance comparable to Fortran or C, the more traditional languages for this field. The Java Grande Forum, reflecting in part results from our own research, has identified five critical Java language and Java Virtual Machine issues related to Java's applicability to solving large computational problems in science and engineering. Unless these issues are resolved, it is unlikely that Java will be a successful language for numerical computing. The five critical issues are:
- Multidimensional arrays: True rectangular multidimensional arrays are the most important data structures for scientific and engineering computing.
- Complex arithmetic: Complex numbers are an essential tool in many areas of science and engineering. Computations with complex numbers need to be supported as efficiently as computations with the primitive real number types, float and double. The issue of high performance computing with complex numbers is directly tied to the next issue.
- Lightweight classes: The excessive overhead associated with manipulation of objects in Java makes it difficult to efficiently support alternative arithmetic systems, such as complex numbers, interval arithmetic, and decimal arithmetic. The ability to manipulate certain objects as having just value semantics is absolutely fundamental to achieve high performance with these alternative arithmetic systems.
- Use of floating point hardware: Achieving the highest possible level of performance on numerical codes typically requires exploiting unique floating point features of each processor. This is often at odds with the Java goal of exact reproducibility of results on every platform.
- Operator overloading: If multidimensional arrays and complex numbers (and other arithmetic systems) are to be implemented in Java as a set of standard packages, then operator overloading is necessary to make the use of these packages more attractive to the application programmer.
Parallel Programming Environments and Tools
My group at IBM provided the overall design and key components of the software architecture for the IBM SP scalable parallel system. Close to half of the compute power of the Top 500 supercomputing sites is now provided by SP systems. The SP is the main platform used by the ASCI program. Much of this work is described in several articles published in the IBM Systems Journal 34(2), 1995, which also list the main contributors. The SP is listed as one of fourteen major innovations at IBM Research.
Some contributions are listed below.
Message Passing Libraries
I worked with several collaborators on the design and development of MPL, the message passing library of native SP communication commands. Based on this work, I contributed to the design of MPI, which has become the industry-standard message passing interface (I successfully wrote around half of the MPI-1 standard, and less successfully wrote several chapters of the MPI-2 standard). Hubertus Franke developed MPI-F, an early, complete, high-performance implementation of MPI-1 on the SP2. Jointly with researchers at the Parallel Systems Group at the NAS facility (NASA Ames Research Center), we designed and implemented MPI-IO, a portable parallel I/O library that evolved to become part of MPI-2.
I continued research on MPI, pushing recently for end-point support and working on interfacing MPI with a task model.
The Vesta parallel file system prototype, developed at IBM Research, provided much of the technology for the first SP parallel file system product.
UTE (Unified Trace Environment) is a powerful, trace-driven tool for studying the performance of parallel programs.
Some Really Old Stuff
A somewhat random selection of old personal research topics.
A Bayesian model of induction uses the following framework: a prior probability function Pr() represents our initial belief Pr(A) in A, for each empirical statement A about the state of our world. Over time, we accumulate evidence, learning that statements e1, e2,..., en actually hold true in our world. As we do so, the prior probability Pr(A) of A is replaced by the conditional probability Pr(A|e1,...,en). A simple example of such a framework would be a "world" that consists of an infinite sequence of coin tosses, and statements that express assertions about this world, such as "the fifth toss is a head", "the ratio between heads and tails converges to 1", etc. Suppose that we systematically observe coin toss after coin toss. We would like to believe that Bayesian induction works, i.e., if statement A actually holds in our world, then we would like Pr(A|e1,...,en) to converge to 1; if statement A is false, then we would like the conditional probability to converge to 0.
The good news: if we are not dogmatic about A, i.e., if we initially assign to A a probability 0<Pr(A)<1, then Bayesian induction works for A. That's as good as one can hope for: clearly, if Pr(A)=0, then Pr(A|e1,...,en)=0; if we assume up front that A cannot be, then no amount of evidence will change our mind. Similarly, if Pr(A)=1, so that we assume up front that A must be, then no amount of evidence will change our mind.
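A small simulation, with invented numbers, illustrates this convergence: the world is a coin that comes up heads with probability 0.8, statement A asserts this bias against the alternative of a fair coin, and a non-dogmatic prior Pr(A)=0.5 is updated by Bayes' rule after each toss.

```python
import random

random.seed(1)

# Two competing hypotheses about the world (a sequence of coin tosses):
# A: heads come up with probability 0.8;  not-A: the coin is fair.
p_heads = {"A": 0.8, "not_A": 0.5}
prior = {"A": 0.5, "not_A": 0.5}      # non-dogmatic: 0 < Pr(A) < 1

def update(belief, toss):
    """One step of Bayes' rule: Pr(H|e) is proportional to Pr(e|H)Pr(H)."""
    like = {h: (p_heads[h] if toss == "H" else 1 - p_heads[h])
            for h in belief}
    post = {h: like[h] * belief[h] for h in belief}
    z = sum(post.values())
    return {h: v / z for h, v in post.items()}

# The actual world: A holds, tosses really are heads with probability 0.8.
belief = dict(prior)
for _ in range(500):
    toss = "H" if random.random() < 0.8 else "T"
    belief = update(belief, toss)
```

After 500 tosses the posterior Pr(A|e1,...,e500) is essentially 1, as the convergence result predicts; starting from Pr(A)=0 instead, every update would leave it at 0.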
The bad news: If the probability function Pr() can be expressed in a formal mathematical notation, then there must be some empirical statement A that is assigned a priori probability zero: we cannot avoid some preconceptions. So, if the world does not fit our preconceptions, then we are in trouble.
A simple example of a "forced preconception" follows. Suppose that, in our simple world of coin tosses, we use the function Pr() to bet on subsequent tosses: if e1,e2,... is the sequence of outcomes of the coin tosses, then after n tosses we bet "head" if Pr("toss n+1 is head"|e1,...,en)>1/2, and "tail" otherwise. Consider the statement A="the bet will be wrong at every coin toss". If the function Pr() can be expressed in a formal mathematical notation, so can this statement. It is an empirical statement about the sequence of coin tosses, and it is quite easy to see that we must have Pr(A)=0. I.e., if we use Pr() to measure our prior belief in the state of the world then, at the very least, we harbor the preconception that Pr() cannot continuously mislead us.
This seminal work is described in a hard-to-read long paper published in the Journal of Symbolic Logic. The work is quoted in Stanford’s online Encyclopedia of Philosophy and seems to continue to generate follow-up research by philosophers interested in the epistemology of inductive inference. (I was flattered to see a 1995 Master thesis from CMU’s dept of philosophy entitled “A Commentary on the First Three Sections of Gaifman and Snir's 1982 JSL Paper Concerning Probabilities Defined on a First Order Empirical Language," by Timothy Herron.) Here is a pointer to a recent presentation on this work.
Conventional abstract computing models assume constant access time to memory. In practice, memory accesses may take 100's of instruction cycles. Caches are key to the performance of modern computers, and good cache locality is key to the performance of algorithms. Therefore, it is important to develop abstract computing models that reflect the reality of a memory hierarchy.
In joint work with Aggarwal and Chandra, we first considered a hierarchical memory model in which the memory hierarchy is represented by assuming that access to address a costs f(a), for some monotonic function f() (most results focus on logarithmic and polynomial cost functions). We showed, in this model, how to develop optimal algorithms for various problems. A simple example is provided by the matrix multiplication problem (assuming the n^3 product computation). A simple recursive algorithm turns out to have optimum locality (within a constant factor): each matrix is split into 2x2 block submatrices; the matrix product is expressed as a product of two 2x2 matrices involving, recursively, products of submatrices. This algorithm does not depend on the exact memory access cost function; Leiserson and co. recently coined the term cache oblivious for such algorithms. Furthermore, within a constant factor, an on-line LRU-type memory management algorithm will work as well as off-line memory management: programmers can be effectively relieved of the need to explicitly copy data from one storage location to another.
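The recursive algorithm can be sketched as follows. This is a plain-Python toy for n-by-n matrices with n a power of two; it omits the base-case cutoff a practical implementation would use to amortize recursion overhead.

```python
def split(m):
    """Split a matrix (list of lists) into its four quadrant submatrices."""
    h = len(m) // 2
    return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
            [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])

def add(a, b):
    """Elementwise matrix addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rmul(a, b):
    """Recursive multiply: the 2x2 block structure keeps both operands
    of each subproblem compact, giving good locality at every level of
    the memory hierarchy, whatever the access cost function f()."""
    n = len(a)
    if n == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    c11 = add(rmul(a11, b11), rmul(a12, b21))
    c12 = add(rmul(a11, b12), rmul(a12, b22))
    c21 = add(rmul(a21, b11), rmul(a22, b21))
    c22 = add(rmul(a21, b12), rmul(a22, b22))
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]   # rejoin quadrants
    bot = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bot
```

Note that no blocking parameter appears anywhere: the recursion adapts to any cache size automatically, which is exactly the cache-oblivious property.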
In subsequent work, we refined the model to reflect the effect of spatial locality: access to contiguous data is cheaper than access to random locations. Interestingly, on-line memory management is not effective in handling spatial locality. Also, some problems do not admit cache oblivious algorithms (recent work by Bilardi and co.).
Shared Memory Programming
Optimizing compilers improve code performance by reordering instruction execution, when such reordering does not change the computation outcome. For sequential programs, it is well understood when such reordering is legal: one has to respect data dependencies and control dependencies. The situation is more complicated for explicitly parallel programs. It is possible to come up with examples of transformations that would be correct when applied to each sequential thread in isolation, but that lead to an incorrect result when the threads execute concurrently and communicate via shared variables, due to violations of the shared memory semantics (we assume sequential consistency). In joint work with Shasha, we developed a formalism that enables the specification of order constraints within each individual executing thread, so that program transformations that respect these constraints within each thread will be correct. This framework also provides conditions for the preservation of the atomicity of compound operations. The recent popularity of Java, a programming language with explicit shared memory parallelism, has brought this set of issues again to the forefront, with an ongoing heated discussion on the correct memory model for Java.
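A classic example of such a transformation (standard in the memory-model literature, not taken from the paper itself) can even be checked mechanically. The sketch below enumerates every sequentially consistent interleaving of two two-instruction threads and shows that a reordering that is legal within each thread in isolation makes the outcome (0, 0) observable, which sequential consistency forbids for the original program.

```python
from itertools import permutations

# Thread programs over shared variables x and y (both initially 0):
T1 = [("write", "x"), ("read", "y")]   # x = 1; r1 = y
T2 = [("write", "y"), ("read", "x")]   # y = 1; r2 = x

def outcomes(t1, t2):
    """All (r1, r2) results over every sequentially consistent
    interleaving, i.e. every merge preserving each thread's order."""
    results = set()
    ops = [(0, op) for op in t1] + [(1, op) for op in t2]
    for perm in set(permutations(ops)):
        # Keep only interleavings that respect per-thread program order.
        if [o for t, o in perm if t == 0] != t1:
            continue
        if [o for t, o in perm if t == 1] != t2:
            continue
        mem, reads = {"x": 0, "y": 0}, {}
        for tid, (action, var) in perm:
            if action == "write":
                mem[var] = 1
            else:
                reads[tid] = mem[var]
        results.add((reads[0], reads[1]))
    return results

# Original program: (0, 0) never occurs under sequential consistency.
orig = outcomes(T1, T2)
# Swap each thread's two ops -- legal per thread, since they touch
# different variables -- and (0, 0) becomes observable.
swapped = outcomes(list(reversed(T1)), list(reversed(T2)))
```

The per-thread dependence analysis sees nothing wrong with the swap; only constraints that span threads, of the kind the formalism above specifies, rule it out.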
The decision tree model is often used to analyze the complexity of computations that are dominated by test-and-branch operations. This model is used to show that sorting requires n log(n) comparisons, or that finding a maximum requires n-1 comparisons. Note that these two lower bounds are different in nature. The first is an information-theoretic argument: there are n! possible outcomes to sorting n elements, therefore a decision tree that sorts has n! leaves, and depth log(n!). The second does not follow from an information-theoretic argument: maximum has n possible outcomes, so that the information-theoretic lower bound is log(n). Rather, it is an adversary argument: the element that was picked as the maximum must have been compared to all other elements; otherwise, one could change the inputs so that another element is the maximum, without changing the outcome of the comparisons. Information-theoretic arguments are valid irrespective of constraints on the input domain (as long as there are still n! possible orderings) and irrespective of the type of predicates used to test and branch. It is not clear that the same holds for problems such as maximum. In fact, as the well-known riddle about finding one fake coin out of n shows, the maximum (or minimum) problem can be solved with fewer comparisons, using more complex tests, when the inputs can have only two values (good heavy coin and fake light coin).
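The gap between the two kinds of bounds is easy to make concrete; the numbers below are just worked arithmetic for n = 16.

```python
import math

def max_with_count(xs):
    """Find the maximum while counting comparisons. The adversary
    argument says n-1 comparisons are necessary; this scan matches it."""
    comparisons = 0
    best = xs[0]
    for x in xs[1:]:
        comparisons += 1
        if x > best:
            best = x
    return best, comparisons

n = 16
_, c = max_with_count(list(range(n)))

# Information-theoretic bound for sorting: ceil(log2(n!)) comparisons.
sort_bound = math.ceil(math.log2(math.factorial(n)))

# For maximum the information-theoretic bound is only ceil(log2 n),
# far below the n-1 comparisons the adversary argument forces.
info_bound_max = math.ceil(math.log2(n))
```

For n = 16 the scan uses 15 comparisons, while the information-theoretic bound for maximum is only 4; for sorting, by contrast, the information-theoretic bound of ceil(log2(16!)) = 45 is essentially the whole story.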
Would the same be true if coins could have more than two weights? In joint work with Moran and Manber, we applied Ramsey's theorem to show that lower bounds obtained in a model where simple comparisons are used also hold in a model where arbitrary binary predicates are used, provided that the input domain is large enough. Thus, if coins can have many different weights, n-1 tests are necessary, even if one is allowed to use arbitrary predicates. Note that Ramsey numbers are really large; this result tells us nothing about the number of comparisons needed when the number of possible values (coin weights) is small.
The NYU Ultracomputer was never an ultracomputer nor a supercomputer: it was a paracomputer. It seems unlikely that paracomputers can be supercomputers.