Systems with exascale performance are likely to be deployed in 2020-2023. Getting there will require significant advances in HPC systems. In particular:
- Single-thread performance is no longer increasing; an exascale system will therefore require billions of concurrent threads. It is not clear which applications can scale to that level, nor how a software stack should be structured to manage such a level of parallelism.
- Power restrictions will probably entail the massive use of hybrid architectures, a much lower CPU-to-memory ratio, and significantly enhanced locality. Power management will be very important.
- Higher error rates (and the avoidance of costly error-recovery mechanisms) will require new solutions for resilience.
I am involved in several efforts to define the software research agenda for exascale and to start executing on it. Note that exascale may be the end of the road, at least for systems using silicon technology. We should start thinking seriously about alternative technologies, and about a computer science agenda that is not predicated on Moore's Law.
One major roadblock to exascale is the increase in failure rates, due to the increase in component count, the decrease in component reliability at smaller feature sizes, and the reduced importance of resilience in leading markets, such as mobile. I edited a major report on the problem of resilience at exascale and co-authored surveys of the field.
The Argo project studies operating system and runtime designs for exascale. The goals include
- A hierarchical global OS design for managing parallel resources
- A node OS design that delegates resource management to the runtime, to the extent possible, and executes asynchronous OS code on dedicated cores
- A runtime designed to support large numbers of lightweight tasks
- A distributed, hierarchical resource management infrastructure that uses online feedback and machine learning to continuously adjust resource allocation, including power.
- A global information bus that distributes performance and RAS information so as to support localized performance tuning and fault handling
I am the chief scientist for this large project.
Parallel I/O systems involve a complex layering of libraries, middleware and parallel file systems. All of these layers have multiple settable configuration parameters. Since the impact of these different knobs is hard to understand, users normally stick with the default values, resulting in a significant loss of I/O performance. Autotuning can be used to significantly improve performance (see paper).
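The idea can be sketched with a toy autotuner. The knob names, value ranges, and cost model below are illustrative assumptions, not the actual parameters of any particular I/O stack; a real autotuner would replace `benchmark` with timed runs of an I/O kernel.

```python
import itertools

# Hypothetical tunable knobs spanning the I/O software stack; the
# names, value ranges, and cost model are illustrative only.
SEARCH_SPACE = {
    "stripe_count": [4, 8, 16],
    "stripe_size_mb": [1, 4, 16],
    "collective_buffer_mb": [4, 16],
}

def benchmark(config):
    """Stand-in for a timed I/O run; returns bandwidth in MB/s.
    A real autotuner would execute an I/O kernel with these settings."""
    return (config["stripe_count"] * 10
            + config["stripe_size_mb"] * 5
            + config["collective_buffer_mb"] * 2)

def autotune(space):
    """Exhaustively evaluate every knob combination and keep the best.
    Smarter search (model-based, genetic) scales to larger spaces."""
    keys = sorted(space)
    best_config, best_bw = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        bw = benchmark(config)
        if bw > best_bw:
            best_config, best_bw = config, bw
    return best_config, best_bw

best, bw = autotune(SEARCH_SPACE)
```

Even this exhaustive search makes the point: the tuned configuration is found mechanically, instead of asking users to reason about knob interactions across layers.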
We are exploring a task programming model that provides a global name space, uses blocking one-sided operations for communication, and supports semi-transparent caching of remote data, for applications where collaborative caching is necessary.
This international project focused on enabling climate simulation at exascale. It involved three thrusts: scalability, resilience, and the use of accelerators. It resulted in a better understanding of node performance issues and scalability bottlenecks in the CESM code, new mesh partitioners, the porting of parts of the NICAM code to GPUs, and general techniques for improving resilience. I was the PI for this project.
Intel and Microsoft funded in 2008 a center for research on multicore computing at Illinois. I was the co-director of this center. Among the main research directions in the center were:
- 3D visual interfaces: Improved visual interfaces will enable new applications (such as those we explore in our teleimmersion project) and enhance user-computer interaction. More compute power is needed on clients to support 3D modeling and rendering, gesture and face recognition, etc. The interactive use of such tasks requires low latency, hence requires computing on the client.
- Safe parallelism: We firmly believe that parallel programs should be race-free by design. I also believe that nondeterminism is seldom, if ever, needed in parallel transformational code (i.e., code where parallelism is introduced uniquely to enhance performance, but is not inherent in the problem specification). A strong focus in our research (mine included) is on languages that ensure code is race-free and provide, by default, deterministic behavior.
- Easy performance: The goal of parallelism is performance -- especially scalability (increasing performance with an increasing number of cores). The creation of scalable code should be significantly facilitated by enhanced programming environments.
NCSA won in 2008 a competition to host a $200M petascale supercomputer at U. Illinois. I was one of five co-PIs on this proposal, focusing on the software architecture of Blue Waters. Among the main concerns of this effort were:
- Making performance engineering a reality by developing a more systematic methodology for performance prediction of large codes; the goal is that codes be ready to run on Blue Waters as soon as the machine is in operation; performance modeling should not be an exercise done by a small number of expert teams, but an inherent part of code development.
- Using large SMP nodes effectively: We pushed changes into MPI-3, and looked at possible usages of PGAS languages and restricted OpenMP dialects for that purpose.
- Large-scale resource management: we looked at support for workflow and for better integration of storage management with CPU scheduling.
- Improved routing: The topology of Blue Waters raised very interesting questions about the best way of allocating partitions and of routing in the system.
- System monitoring: We looked at the best way to continuously collect data on the health of the system and its performance, and to continuously monitor the system for anomalous behaviors.
Refactoring for High Performance Computing
Tuning parallel codes for performance is a necessary but tedious activity that is not much helped by current tools. Automated code optimization (e.g., by a compiler) has limited capabilities. Various projects have tried to develop interactive compilers, where the compiler provides feedback to the user that can then guide the compiler's transformations. Success has been limited: one reason is that the intermediate representation used by the compiler is not human-readable, so it is hard to convey to the user the result of the transformations performed. A promising alternative is to use a refactoring infrastructure, as discussed in our recent article: the transformations are source-to-source, but a compiler infrastructure can be used to automate much of the transformation process, and multiple source versions can be maintained as one code artifact. We are currently investigating the use of this approach.
Classification for Performance Tuning
In many cases, there is a choice of multiple algorithms to solve a problem, and each algorithm can be implemented in different ways. One wishes to make the choice that yields the best performance, but this choice may be platform-dependent and input-dependent. Neither choice (of algorithm or of implementation) can be made automatically, e.g., by a compiler.
We can consider the choice of an optimal algorithm or an optimal implementation as a classification problem: each input is tagged by the code that works best for this input; one wishes to find a cheap-to-compute classifier that associates (with high probability) a "good" code with each input. It turns out that this approach works well in practice. One can use machine learning techniques to train a classifier, using inputs for which we found, by brute force, the best code, and then use the classifier at execution time to select the right code.
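A minimal sketch of this approach, with invented codes and features: two hypothetical sort routines play the role of candidate codes, a single "sortedness" feature describes each input, the brute-force training labels are simulated by a rule standing in for actual timings, and a one-nearest-neighbour classifier makes the run-time choice.

```python
import random

random.seed(0)

def sortedness(xs):
    """Cheap input feature: fraction of adjacent pairs already in order."""
    pairs = list(zip(xs, xs[1:]))
    return sum(a <= b for a, b in pairs) / len(pairs)

def best_code_by_brute_force(xs):
    """Stand-in for timing every candidate code on input xs.
    Assumption: insertion sort wins on nearly sorted data."""
    return "insertion_sort" if sortedness(xs) > 0.8 else "merge_sort"

# Training phase: tag sample inputs with the code that won on each.
training = []
for _ in range(50):
    xs = sorted(random.sample(range(1000), 20))
    if random.random() < 0.5:          # half the samples get scrambled
        random.shuffle(xs)
    training.append((sortedness(xs), best_code_by_brute_force(xs)))

def classify(xs, training):
    """Run-time choice: pick the label of the nearest training example
    in feature space (a 1-nearest-neighbour classifier)."""
    f = sortedness(xs)
    _, label = min(training, key=lambda t: abs(t[0] - f))
    return label
```

At execution time the classifier costs one feature evaluation and one lookup, instead of running every candidate code on the actual input.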
We have shown this technique to be effective for the well-studied Frequent Item Mining problem.
Patterns of High Performance Computing
There is significant work on the development of better parallel programming models and environments. However, the evaluation of this work is far from systematic. The purpose of our research is to collect, in a more systematic manner, a library of parallel programming patterns, and use them to evaluate the usefulness and expressiveness of existing and proposed parallel programming models and environments. A community effort to support this work was started at the patHPC workshop that took place in April 2005 at UIUC (work funded by DOE as part of the Center for Programming Models for Scalable Parallel Computing).
DARPA awarded IBM $53.5M in funding to pursue research in High Productivity Computing Systems. The work on the Productive, Easy-to-Use, Reliable Computing System (PERCS) covers chip technology, architecture, OS, compiler and programming environment. UIUC is a partner in this research; the UIUC team includes Profs. Vikram Adve, Ralph Johnson, David Padua, Marc Snir, and Josep Torrellas and their students. I am involved with students in several aspects of this project. In particular, we work on architectural extensions to support key applications; on benchmarks and performance modeling; and on compiler technology in support of semi-automated, run-time program tuning. (Work funded by DARPA; see recent publications.)
IBM Research announced in December 1999 a $100 million research initiative to build “Blue Gene”, a petaflop supercomputer. This level of performance would be achieved by connecting together a massive number (32K) of single-chip multiprocessors, where each chip is a highly parallel, shared memory multiprocessor. The main intended application for this machine is the simulation of protein folding, a process that is key to the understanding of the fundamentals of life.
The exploitation of such a machine will present challenging software problems. In particular, one needs to ensure that the machine will continue to operate for months, even though individual components may fail during this time. One needs to develop new methods for programming and controlling this massive number of concurrently executing parallel threads. One needs to adapt software algorithms to achieve peak performance on a novel architecture with on-chip memory and hardware-supported concurrent multithreading.
I initiated the project and led the system work in the initial phase. We demonstrated a speed-up in excess of 300,000 in a simulation of protein folding in Blue Gene [ICS 2000]. The Blue Gene effort had a large number of contributors, listed as authors in a survey paper. The Blue Gene project underwent significant technical and managerial changes a year after it started and is now focused on a different architecture and different set of applications.
Scalable Parallel System Architecture
Today, most large-scale parallel computers are built by assembling smaller ``commodity'' compute nodes. Two main architectures have emerged:
- Shared Memory Multiprocessors (SMM's), where each processor can access all memory (and all I/O), and all processors' caches are maintained coherent. Such systems are usually controlled by one operating system image. The prevalent parallel programming model is thread parallelism: each processor runs a separate execution thread. The threads run in a common address space, and communicate via shared memory.
- Clusters, where each node has a separate memory and a separate I/O subsystem, and each node is controlled by a separate operating system image. The prevalent parallel programming model is message passing: each processor runs a process in a private address space; processes communicate with each other via messages.
We studied a combined communication adapter that supplements a conventional messaging engine with two additional mechanisms:
- A data mover engine that can be used for memory-to-memory or memory-to-cache copying of data in a large NUMA system. In a conventional NUMA system, data can be moved only via load and store operations. This consumes precious CPU cycles and overloads the critical resource in such a system, namely the CPU-to-memory communication path. A data mover engine can prefetch data so as to hide long latencies without tying up key resources.
- A coherence engine that can be used to maintain coherent shared memory segments across different operating system images. To do so, the coherence engine needs additional protection mechanisms, to ensure that a faulty kernel will not issue wild writes against the memory of another OS image, and to ensure that an error in one system will not bring down other systems. However, much of this protection machinery is already built into messaging engines, as they are used to communicate across separate OS images.
Such a combined adapter can efficiently support a partitionable shared memory multiprocessor. Increasingly, vendors of large NUMA systems provide a dynamic partitioning capability on their systems: the system's physical resources can be partitioned into several physical partitions, each controlled by a separate operating system image, and the partition boundaries can be moved without rebooting the affected OS images. Partitioning provides better fault isolation and enables concurrent maintenance. A combined adapter enables the use of the shared memory interconnect for fast, efficient message passing across partitions.
The main obstacle to the scaling up of NUMA systems is software, not hardware. Therefore, it is very likely that, in coming years, the ability of hardware developers to scale up SMM's will significantly outstrip the ability of OS and subsystem developers to scale up their software. With partitioning, it becomes possible to scale up NUMA systems beyond the number of processors that can be supported by one OS image. The shared memory hardware can be exploited in various ways to support limited sharing for applications and subsystems that span multiple OS images. In particular, one can efficiently support a shared memory programming model for applications that span multiple OS images. In this model, each processor runs a process in a separate address space; shared variables are kept in a segment that is mapped into the address space of each process and is maintained coherent by hardware. Similarly, shared memory can be used to provide efficient global lock services, or an efficient global cache for shared disks.
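The programming model just described can be loosely mimicked in software. The sketch below is a POSIX-only toy, not the hardware mechanism discussed above: each worker runs in its own address space (a forked process), while one shared memory segment is mapped into all of them, so every writer's updates are visible to the parent.

```python
import os
from multiprocessing import shared_memory

# One shared segment, mapped into several separate address spaces.
seg = shared_memory.SharedMemory(create=True, size=4)

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:                 # child: its own address space,
        seg.buf[i] = i * 10      # but the segment mapping is shared
        os._exit(0)              # exit without running cleanup handlers
    pids.append(pid)

for pid in pids:                 # wait for all workers to finish
    os.waitpid(pid, 0)

result = list(seg.buf)           # the parent sees the children's writes
seg.close()
seg.unlink()                     # remove the segment from the system
```

Private variables stay private to each process; only data placed in the mapped segment is shared, which is exactly the "shared segment across separate address spaces" discipline described above.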
One can expect that such partitionable systems will become increasingly prevalent in coming years, providing a convergence point for large-scale computer architectures. Rather than being forced into a choice between ``shared all'' and ``shared nothing'' architectures, users will dynamically select an appropriate level of sharing.
Some of these ideas were demonstrated in the Prism project.
Java for High Performance Numerical Computing
First proposed as a mechanism for enhancing Web content, Java has taken off as a serious general-purpose programming language. Industry and academia alike have expressed great interest in using Java as a programming language for scientific and engineering computations. Such usage has been hampered by the relatively poor performance of Java on numerically intensive applications, and by the lack of key features in the Java language. In the Ninja project, we have developed compiler technology and libraries that lead to Java numerical codes with performance comparable to Fortran or C, the more traditional languages for this field. The Java Grande Forum, reflecting in part results from our own research, has identified five critical Java language and Java Virtual Machine issues related to Java's applicability to solving large computational problems in science and engineering. Unless these issues are resolved, it is unlikely that Java will be a successful language for numerical computing. The five critical issues are:
- Multidimensional arrays: True rectangular multidimensional arrays are the most important data structures for scientific and engineering computing.
- Complex arithmetic: Complex numbers are an essential tool in many areas of science and engineering. Computations with complex numbers need to be supported as efficiently as computations with the primitive real number types, float and double. The issue of high performance computing with complex numbers is directly tied to the next issue.
- Lightweight classes: The excessive overhead associated with manipulation of objects in Java makes it difficult to efficiently support alternative arithmetic systems, such as complex numbers, interval arithmetic, and decimal arithmetic. The ability to manipulate certain objects as having just value semantics is absolutely fundamental to achieve high performance with these alternative arithmetic systems.
- Use of floating point hardware: Achieving the highest possible level of performance on numerical codes typically requires exploiting unique floating point features of each processor. This is often at odds with the Java goal of exact reproducibility of results on every platform.
- Operator overloading: If multidimensional arrays and complex numbers (and other arithmetic systems) are to be implemented in Java as a set of standard packages, then operator overloading is necessary to make the use of these packages more attractive to the application programmer.
Parallel Programming Environments and Tools
My group at IBM provided the overall design and key components of the software architecture for the IBM SP scalable parallel system. Close to half of the compute power of the Top 500 supercomputing sites is now provided by SP systems. The SP is the main platform used by the ASCI program. Much of this work is described in several articles published in the IBM Systems Journal 34(2), 1995, which also list the main contributors. The SP is listed as one of fourteen major innovations at IBM Research.
Some contributions are listed below.
Message Passing Libraries
I worked with several collaborators on the design and development of MPL, the message passing library of native SP communication commands. Based on this work, I contributed to the design of MPI, which has become the industry-standard message passing interface (I successfully wrote around half of the MPI-1 standard, and less successfully wrote several chapters of the MPI-2 standard). Hubertus Franke developed MPI-F, an early, complete, high-performance implementation of MPI-1 on the SP2. Jointly with researchers at the Parallel Systems Group at the NAS facility (NASA Ames Research Center), we designed and implemented MPI-IO, a portable parallel I/O library that evolved to become part of MPI-2.
I continued research on MPI, pushing recently for end-point support and working on interfacing MPI with a task model.
The Vesta parallel file system prototype, developed at IBM Research, provided much of the technology for the first SP parallel file system product.
UTE (Unified Trace Environment) is a powerful, trace-driven tool for studying the performance of parallel programs.
Some Really Old Stuff
A somewhat random selection of old personal research topics.
A Bayesian model of induction uses the following framework: a prior probability function Pr() represents our initial belief Pr(A) in A, for each empirical statement A about the state of our world. Over time, we accumulate evidence, learning that statements e1, e2,..., en actually hold true in our world. As we do so, the prior probability Pr(A) of A is replaced by the conditional probability Pr(A|e1,...,en). A simple example of such a framework would be a "world" that consists of an infinite sequence of coin tosses, and statements that express assertions about this world, such as "the fifth toss is a head", "the ratio between heads and tails converges to 1", etc. Suppose that we systematically observe coin toss after coin toss. We would like to believe that Bayesian induction works, i.e., if statement A actually holds in our world, then we would like Pr(A|e1,...,en) to converge to 1; if statement A is false, then we would like the conditional probability to converge to 0.
The good news: if we are not dogmatic about A, i.e., if we initially assign to A a probability 0<Pr(A)<1, then Bayesian induction works for A. That's as good as one can hope for: clearly, if Pr(A)=0, then Pr(A|e1,...,en)=0; if we assume up front that A cannot be, then no amount of evidence will change our mind. Similarly, if Pr(A)=1, so that we assume up front that A must be, then no amount of evidence will change our mind.
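A small simulation, with invented numbers, illustrates this convergence: the world is a coin that comes up heads with probability 0.8, statement A asserts this bias against the alternative of a fair coin, and a non-dogmatic prior Pr(A)=0.5 is updated by Bayes' rule after each toss.

```python
import random

random.seed(1)

# Two competing hypotheses about the world (a sequence of coin tosses):
# A: heads come up with probability 0.8;  not-A: the coin is fair.
p_heads = {"A": 0.8, "not_A": 0.5}
prior = {"A": 0.5, "not_A": 0.5}      # non-dogmatic: 0 < Pr(A) < 1

def update(belief, toss):
    """One step of Bayes' rule: Pr(H|e) is proportional to Pr(e|H)Pr(H)."""
    like = {h: (p_heads[h] if toss == "H" else 1 - p_heads[h])
            for h in belief}
    post = {h: like[h] * belief[h] for h in belief}
    z = sum(post.values())
    return {h: v / z for h, v in post.items()}

# The actual world: A holds, tosses really are heads with probability 0.8.
belief = dict(prior)
for _ in range(500):
    toss = "H" if random.random() < 0.8 else "T"
    belief = update(belief, toss)
```

After 500 tosses the posterior Pr(A|e1,...,e500) is essentially 1, as the convergence result predicts; starting from Pr(A)=0 instead, every update would leave it at 0.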
The bad news: If the probability function Pr() can be expressed in a formal mathematical notation, then there must be some empirical statement A that is assigned a priori probability zero: we cannot avoid some preconceptions. So, if the world does not fit our preconceptions, then we are in trouble.
A simple example of a "forced preconception" follows. Suppose that, in our simple world of coin tosses, we use the function Pr() to bet on subsequent tosses: if e1,e2,... is the sequence of outcomes of the coin tosses, then after n tosses we bet "head" if Pr("toss n+1 is head"|e1,...,en)>1/2, and "tail" otherwise. Consider the statement A="the bet will be wrong at every coin toss". If the function Pr() can be expressed in a formal mathematical notation, so can this statement. It is an empirical statement about the sequence of coin tosses, and it is quite easy to see that we must have Pr(A)=0. I.e., if we use Pr() to measure our prior belief in the state of the world then, at the very least, we harbor the preconception that Pr() cannot continuously mislead us.
This seminal work is described in a hard-to-read long paper published in the Journal of Symbolic Logic. The work is quoted in Stanford’s online Encyclopedia of Philosophy and seems to continue to generate follow-up research by philosophers interested in the epistemology of inductive inference. (I was flattered to see a 1995 Master thesis from CMU’s dept of philosophy entitled “A Commentary on the First Three Sections of Gaifman and Snir's 1982 JSL Paper Concerning Probabilities Defined on a First Order Empirical Language," by Timothy Herron.) Here is a pointer to a recent presentation on this work.
Conventional abstract computing models assume constant access time to memory. In practice, memory accesses may take 100's of instruction cycles. Caches are key to the performance of modern computers, and good cache locality is key to the performance of algorithms. Therefore, it is important to develop abstract computing models that reflect the reality of a memory hierarchy.
In joint work with Aggarwal and Chandra, we first considered a hierarchical memory model in which the memory hierarchy is represented by assuming that access to address a costs f(a), for some monotonic function f() (most results focus on logarithmic and polynomial cost functions). We showed, in this model, how to develop optimal algorithms for various problems. A simple example is provided by the matrix multiplication problem (assuming the n^3 product computation). A simple recursive algorithm turns out to have optimum locality (within a constant factor): each matrix is split into 2x2 block submatrices; the matrix product is expressed as a product of two 2x2 matrices involving, recursively, products of submatrices. This algorithm does not depend on the exact memory access cost function; Leiserson and co. recently coined the term cache oblivious for such algorithms. Furthermore, within a constant factor, an on-line LRU-type memory management algorithm will work as well as off-line memory management: programmers can be effectively relieved of the need to explicitly copy data from one storage location to another.
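The recursive algorithm can be sketched as follows. This is a plain-Python toy for n-by-n matrices with n a power of two; it omits the base-case cutoff a practical implementation would use to amortize recursion overhead.

```python
def split(m):
    """Split a matrix (list of lists) into its four quadrant submatrices."""
    h = len(m) // 2
    return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
            [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])

def add(a, b):
    """Elementwise matrix addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rmul(a, b):
    """Recursive multiply: the 2x2 block structure keeps both operands
    of each subproblem compact, giving good locality at every level of
    the memory hierarchy, whatever the access cost function f()."""
    n = len(a)
    if n == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    c11 = add(rmul(a11, b11), rmul(a12, b21))
    c12 = add(rmul(a11, b12), rmul(a12, b22))
    c21 = add(rmul(a21, b11), rmul(a22, b21))
    c22 = add(rmul(a21, b12), rmul(a22, b22))
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]   # rejoin quadrants
    bot = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bot
```

Note that no blocking parameter appears anywhere: the recursion adapts to any cache size automatically, which is exactly the cache-oblivious property.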
In subsequent work, we refined the model to reflect the effect of spatial locality: access to contiguous data is cheaper than access to random locations. Interestingly, on-line memory management is not effective in handling spatial locality. Also, some problems do not admit cache oblivious algorithms (recent work by Bilardi and co.).
Shared Memory Programming
Optimizing compilers improve code performance by reordering instruction execution, when such reordering does not change the computation outcome. For sequential programs, it is well understood when such reordering is legal: one has to respect data dependencies and control dependencies. The situation is more complicated for explicitly parallel programs. It is possible to come up with examples of transformations that would be correct when applied to each sequential thread in isolation, but that lead to an incorrect result when the threads execute concurrently and communicate via shared variables, due to violations of the shared memory semantics (we assume sequential consistency). In joint work with Shasha, we developed a formalism that enables the specification of order constraints within each individual executing thread, so that program transformations that respect these constraints within each thread will be correct. This framework also provides conditions for the preservation of the atomicity of compound operations. The recent popularity of Java, a programming language with explicit shared memory parallelism, has brought this set of issues again to the forefront, with an ongoing heated discussion on the correct memory model for Java.
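A classic example of such a transformation (standard in the memory-model literature, not taken from the paper itself) can even be checked mechanically. The sketch below enumerates every sequentially consistent interleaving of two two-instruction threads and shows that a reordering that is legal within each thread in isolation makes the outcome (0, 0) observable, which sequential consistency forbids for the original program.

```python
from itertools import permutations

# Thread programs over shared variables x and y (both initially 0):
T1 = [("write", "x"), ("read", "y")]   # x = 1; r1 = y
T2 = [("write", "y"), ("read", "x")]   # y = 1; r2 = x

def outcomes(t1, t2):
    """All (r1, r2) results over every sequentially consistent
    interleaving, i.e. every merge preserving each thread's order."""
    results = set()
    ops = [(0, op) for op in t1] + [(1, op) for op in t2]
    for perm in set(permutations(ops)):
        # Keep only interleavings that respect per-thread program order.
        if [o for t, o in perm if t == 0] != t1:
            continue
        if [o for t, o in perm if t == 1] != t2:
            continue
        mem, reads = {"x": 0, "y": 0}, {}
        for tid, (action, var) in perm:
            if action == "write":
                mem[var] = 1
            else:
                reads[tid] = mem[var]
        results.add((reads[0], reads[1]))
    return results

# Original program: (0, 0) never occurs under sequential consistency.
orig = outcomes(T1, T2)
# Swap each thread's two ops -- legal per thread, since they touch
# different variables -- and (0, 0) becomes observable.
swapped = outcomes(list(reversed(T1)), list(reversed(T2)))
```

The per-thread dependence analysis sees nothing wrong with the swap; only constraints that span threads, of the kind the formalism above specifies, rule it out.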
The decision tree model is often used to analyze the complexity of computations that are dominated by test-and-branch operations. This model is used to show that sorting requires n log(n) comparisons, or that finding a maximum requires n-1 comparisons. Note that these two lower bounds are different in nature. The first is an information-theoretic argument: there are n! possible outcomes to sorting n elements, therefore a decision tree that sorts has n! leaves, and depth log(n!). The second does not follow from an information-theoretic argument: maximum has n possible outcomes, so that the information-theoretic lower bound is log(n). Rather, it is an adversary argument: the element that was picked as the maximum must have been compared to all other elements; otherwise, one could change the inputs so that another element is the maximum, without changing the outcome of the comparisons. Information-theoretic arguments are valid irrespective of constraints on the input domain (as long as there are still n! possible orderings) and irrespective of the type of predicates used to test and branch. It is not clear that the same holds for problems such as maximum. In fact, as the well-known riddle about finding one fake coin out of n shows, the maximum (or minimum) problem can be solved with fewer comparisons, using more complex tests, when the inputs can have only two values (good heavy coin and fake light coin).
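The gap between the two kinds of bounds is easy to make concrete; the numbers below are just worked arithmetic for n = 16.

```python
import math

def max_with_count(xs):
    """Find the maximum while counting comparisons. The adversary
    argument says n-1 comparisons are necessary; this scan matches it."""
    comparisons = 0
    best = xs[0]
    for x in xs[1:]:
        comparisons += 1
        if x > best:
            best = x
    return best, comparisons

n = 16
_, c = max_with_count(list(range(n)))

# Information-theoretic bound for sorting: ceil(log2(n!)) comparisons.
sort_bound = math.ceil(math.log2(math.factorial(n)))

# For maximum the information-theoretic bound is only ceil(log2 n),
# far below the n-1 comparisons the adversary argument forces.
info_bound_max = math.ceil(math.log2(n))
```

For n = 16 the scan uses 15 comparisons, while the information-theoretic bound for maximum is only 4; for sorting, by contrast, the information-theoretic bound of ceil(log2(16!)) = 45 is essentially the whole story.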
Would the same be true if coins could have more than two weights? In joint work with Moran and Manber, we applied Ramsey's theorem to show that lower bounds obtained in a model where simple comparisons are used also hold in a model where arbitrary binary predicates are used, provided that the input domain is large enough. Thus, if coins can have many different weights, n-1 tests are necessary, even if one is allowed to use arbitrary predicates. Note that Ramsey numbers are really large; this result tells us nothing about the number of comparisons needed when the number of possible values (coin weights) is small.
The NYU Ultracomputer was never an ultracomputer nor a supercomputer: it was a paracomputer. It seems unlikely that paracomputers can be supercomputers.