Current Projects
Parallel File Systems
Parallel file systems are the Achilles' heel of current supercomputers: their performance is brittle, failures are frequent, and data losses due to crashes are not rare. HPC I/O subsystems are complex because of their large scale, the goal of supporting POSIX semantics, and their many software layers, with I/O libraries above the parallel file system and local file systems below it. Our current work in this area has two components.
- Understand I/O patterns of supercomputing applications. We suspect that few HPC applications require the strict POSIX semantics of the parallel file system, and that the performance of parallel file systems can be much improved by relaxing POSIX semantics. Our research aims to validate this hypothesis.
- Understand crash consistency issues in parallel file systems and I/O libraries. This work has two aspects. The first is to define precisely what requirements we expect a parallel file system or an I/O library to satisfy upon recovery from a crash; weaker or stronger consistency models can be defined. The second is to verify whether different parallel file systems or I/O libraries satisfy these requirements, using a combination of testing and formal methods.
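As a toy illustration of the kind of checking involved (a made-up model, not any real file system), one can enumerate the states a file can be left in when a crash interrupts a sequence of buffered appends, and check them against a chosen recovery requirement; here, that the recovered file is always a prefix of the data written:

```python
from itertools import combinations

def crash_states(writes, in_order=True):
    """Enumerate possible on-disk contents after a crash.

    writes: list of strings appended to the file, in program order.
    in_order=True models a file system that persists appends in order
    (any prefix of the write sequence may survive the crash);
    in_order=False models one that may persist any subset of them.
    """
    n = len(writes)
    if in_order:
        return {"".join(writes[:k]) for k in range(n + 1)}
    states = set()
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            states.add("".join(writes[i] for i in subset))
    return states

def satisfies_prefix_consistency(writes, states):
    """The chosen recovery requirement: every recoverable state
    is a prefix of the fully written data."""
    full = "".join(writes)
    return all(full.startswith(s) for s in states)

writes = ["header;", "record1;", "record2;"]
ordered = crash_states(writes, in_order=True)
reordered = crash_states(writes, in_order=False)
# The in-order model satisfies the requirement; the reordering model
# does not: "record1;" alone may survive, which is not a prefix.
```

A real checker replaces the toy persistence models with the actual behavior of a file system or I/O library, explored by testing or formal methods.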
LCI
The MPI message-passing library was created more than 25 years ago. While its success far surpassed the hopes of its creators, MPI is increasingly ill-matched to the current needs of supercomputers: it does not handle high thread counts and accelerators well; it does not leverage smart NICs well; and it does not match the needs of emerging applications, such as graph analytics, and of emerging asynchronous task programming models. In addition, while MPI was designed as an application programming interface, communication libraries are increasingly used by library and framework designers who have different needs.
LCI is a lightweight communication library that avoids many of the shortcomings of MPI: it performs well with heavy multithreading; it interacts directly with task schedulers in order to avoid the need for polling; it interacts directly with memory managers in order to support irregular communication patterns well; and it reduces the end-to-end latency of message passing by focusing not only on the transport mechanisms but also on the signaling mechanisms. LCI has been integrated with graph analytics frameworks (in particular, D-Galois) and with the PaRSEC asynchronous task framework.
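The signaling idea can be sketched in a few lines of Python (a generic illustration of completion-driven scheduling, not LCI's actual API; `post_receive` and the queue-based scheduler are invented for this sketch):

```python
import queue
import threading

# A toy task scheduler: tasks become runnable only when their
# communication completes, so no task ever spins polling the network.
ready = queue.Queue()

def post_receive(on_complete):
    """Completion-driven receive: the 'network' (a thread standing in
    for NIC progress) delivers the message and then signals the
    scheduler directly, instead of setting a flag that the
    application must poll."""
    def progress():
        message = "payload"  # pretend data arrived from the wire
        ready.put(lambda: on_complete(message))  # hand task to scheduler
    threading.Thread(target=progress).start()

results = []
post_receive(lambda msg: results.append(msg.upper()))
ready.get(timeout=5)()  # the scheduler runs the now-ready task
# results == ["PAYLOAD"]
```

The point of the sketch is the control flow: the completion event creates a runnable task rather than a flag to be polled, which is what allows the communication layer to cooperate with a task scheduler.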
Silent Error Detection
The goal is to detect silent bit flips in iterative scientific computations. We treat the pattern generated by such a bit flip as a perturbation added to the normal evolution of an iterative simulation. A deep neural network is trained to detect such perturbations. The training is application-specific (or stencil-specific) but requires no knowledge of the application code. Errors can be detected many iterations after they occurred, as error propagation creates a specific pattern.
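A toy Python illustration of the underlying phenomenon (the 1-D averaging stencil and the bit-flip injector below are our own stand-ins, not the actual application codes or the neural detector):

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64's binary representation (a silent error)."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", i ^ (1 << bit)))
    return y

def jacobi_step(u):
    """One step of a 1-D three-point averaging stencil (fixed ends)."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3
                     for i in range(1, len(u) - 1)] + [u[-1]]

n, steps, flip_at = 32, 10, 4
clean = [float(i) for i in range(n)]
faulty = clean[:]
for t in range(steps):
    clean = jacobi_step(clean)
    faulty = jacobi_step(faulty)
    if t == flip_at:
        faulty[n // 2] = flip_bit(faulty[n // 2], 52)  # flip an exponent bit

# The perturbation is no longer confined to one cell: after k further
# iterations it has spread to a neighborhood of width about 2k, a
# pattern a trained detector can recognize long after the flip.
perturbed = [i for i in range(n) if abs(clean[i] - faulty[i]) > 1e-12]
```

It is this growing, stencil-shaped difference pattern, rather than the single corrupted value, that the trained network learns to recognize.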
Older Projects
Resilience
One major roadblock to exascale is the increase in failure rates, due to the increase in component count, the decrease in the reliability of components at smaller feature sizes, and the reduced importance of resilience in leading markets, such as mobile. I edited a major report on the problem of resilience at exascale and co-authored surveys of the field.
My research in this area includes improved failure prediction and reduced overheads for asynchronous checkpointing.
Argo
The Argo project studies operating system and runtime designs for exascale. The goals include
- A hierarchical global OS design for managing parallel resources
- A node OS design that delegates resource management to the runtime, to the extent possible, and executes asynchronous OS code on dedicated cores
- A runtime designed to support large numbers of lightweight tasks
- A distributed, hierarchical resource management infrastructure that uses online feedback and machine learning to continuously adjust resource allocation, including power.
- A global information bus that distributes performance and RAS information so as to support localized performance tuning and fault handling
I am the chief scientist for this large project.
I/O autotuning
Parallel I/O systems involve a complex layering of libraries, middleware, and parallel file systems. All of these layers have multiple settable configuration parameters. Since the impact of these different knobs is hard to understand, users normally use default values, resulting in a significant loss of I/O performance. Autotuning can be used to significantly improve performance (see paper).
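A minimal sketch of the idea (the knob names and the cost model below are illustrative inventions; a real autotuner times actual I/O benchmarks):

```python
import random

# Hypothetical knobs, loosely modeled on typical parallel I/O stacks.
knobs = {
    "stripe_count": [1, 4, 16, 64],
    "stripe_size_mb": [1, 4, 16],
    "collective_buffering": [False, True],
}

def measured_io_time(cfg):
    """Stand-in for actually timing an I/O benchmark under cfg."""
    t = 100.0 / cfg["stripe_count"] + 2.0 * cfg["stripe_size_mb"]
    if cfg["collective_buffering"]:
        t *= 0.8
    return t

default = {"stripe_count": 1, "stripe_size_mb": 1,
           "collective_buffering": False}

def autotune(samples=20, seed=0):
    """Random search over the configuration space; exhaustive search is
    feasible for this toy space but not for real ones."""
    rng = random.Random(seed)
    best_cfg, best_t = default, measured_io_time(default)
    for _ in range(samples):
        cfg = {k: rng.choice(v) for k, v in knobs.items()}
        t = measured_io_time(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

best_cfg, best_t = autotune()
```

Even this naive search never does worse than the defaults, which is the practical point: users who keep default values leave easy performance on the table.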
PPL
We are exploring a task programming model that provides a global name space, uses blocking one-sided operations for communication, and supports semi-transparent caching of remote data, for applications where collaborative caching is necessary.
G8 ECS
This international project focused on enabling climate simulation at exascale. It involved three thrusts: scalability, resilience, and the use of accelerators. It resulted in a better understanding of node performance issues and scalability bottlenecks in the CESM code, new mesh partitioners, the porting of parts of the NICAM code to GPUs, and general techniques for improving resilience. I was the PI for this project.
UPCRC
Intel and Microsoft funded in 2008 a center for research on multicore computing at Illinois. I was the co-director of this center. The main research directions in the center included:
- 3D visual interfaces: Improved visual interfaces will enable new applications (such as those we explore in our teleimmersion project) and enhance user-computer interaction. More compute power is needed on clients to support 3D modeling and rendering, gesture and face recognition, etc. The interactive use of such tasks requires low latency, hence requires computing on the client.
- Safe parallelism: We firmly believe that parallel programs should be race-free by design. I also believe that nondeterminism is seldom, if ever, needed in parallel transformational code (i.e., code where parallelism is introduced solely to enhance performance, but is not inherent in the problem specification). A strong focus of our research (mine included) is on languages that ensure code is race-free and provide, by default, deterministic behavior.
- Easy performance: The goal of parallelism is performance -- especially scalability (increasing performance with an increasing number of cores). The creation of scalable code should be significantly facilitated by enhanced programming environments.
Blue Waters
NCSA won in 2008 a competition to host a $200M petascale supercomputer at U. Illinois. I was one of five co-PIs on this proposal, focusing on the software architecture of Blue Waters. Among the main concerns of this effort were:
- Making performance engineering a reality by developing a more systematic methodology for performance prediction of large codes; the goal is that codes be ready to run on Blue Waters as soon as the machine is in operation; performance modeling should not be an exercise done by a small number of expert teams, but an inherent part of code development.
- Using large SMP nodes effectively: We pushed changes in MPI-3, and looked at possible usages of PGAS languages and restricted OpenMP dialects for that purpose.
- Large-scale resource management: We looked at support for workflows and for better integration of storage management with CPU scheduling.
- Improved routing: The topology of Blue Waters raised very interesting questions about the best way of allocating partitions and of routing in the system.
- System monitoring: We looked at the best way to continuously collect data on the health of the system and its performance, and to continuously monitor the system for anomalous behaviors.
Refactoring for High Performance Computing
Tuning parallel codes for performance is a necessary but tedious activity that is not much helped by current tools. Automated code optimization (e.g., by a compiler) has limited capabilities. Various projects have tried to develop interactive compilers, where the compiler provides feedback to the user that can then guide the compiler's transformations. Success has been limited: one reason is that the intermediate representation used by the compiler is not human-readable, and it is hard to feed back to the user the result of the transformations performed. A promising alternative is to use a refactoring infrastructure, as discussed in our recent article: the transformations are source-to-source, but a compiler infrastructure can be used to automate much of the transformation process, and multiple source versions can be maintained as one code artifact. We are currently investigating the use of this approach.
Classification for Performance Tuning
In many cases, there is a choice of multiple algorithms to solve a problem, and each algorithm can be implemented in different ways. One wishes to make the choice that yields the best performance, but this choice may be platform-dependent and input-dependent. Neither choice (of algorithm or of implementation) can be made automatically, e.g., by a compiler.
We can consider the choice of an optimal algorithm or an optimal implementation as a classification problem: each input is tagged by the code that works best for this input; one wishes to find a fast classifier that associates (with high probability) a "good" code with each input. It turns out that this approach works well in practice. One can use machine learning techniques to train a classifier, using inputs for which we found, by brute force, the best code, and then use the classifier at execution time to select the right code. We have shown this technique to be effective for the well-studied Frequent Item Mining problem.
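The approach can be sketched in a few lines (the two "implementations" and their synthetic cost model are invented for illustration; in practice the labels come from brute-force timing and the classifier from standard ML training):

```python
# Two hypothetical implementations whose relative speed depends on the
# input features (size, density); costs stand in for measured times.
def cost(algo, size, density):
    if algo == "sparse":
        return size * density * 10.0  # fast when the input is sparse
    return size * 1.0                 # "dense" code: density-independent

def best_algo(size, density):
    """Brute force: try both codes and keep the faster one."""
    return min(["sparse", "dense"], key=lambda a: cost(a, size, density))

# Training set: inputs labeled by the brute-force winner.
train = [(s, d) for s in (100, 1000, 10000)
                for d in (0.01, 0.05, 0.2, 0.5)]
labels = {x: best_algo(*x) for x in train}

def classify(size, density):
    """1-nearest-neighbor classifier on the input features; at run time
    this cheap lookup replaces any brute-force timing.  Density
    differences are tiny relative to sizes, so they are rescaled."""
    nearest = min(train, key=lambda x: (x[0] - size) ** 2
                                       + (x[1] - density) ** 2 * 1e8)
    return labels[nearest]
```

With this cost model the sparse code wins whenever density < 0.1, and the classifier recovers that decision boundary from the labeled examples alone.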
Patterns of High Performance Computing
There is significant work on the development of better parallel programming models and environments. However, the evaluation of this work is far from systematic. The purpose of our research is to collect, in a more systematic manner, a library of parallel programming patterns, and to use them to evaluate the usefulness and expressiveness of existing and proposed parallel programming models and environments. A community effort in this direction was started at the patHPC workshop that took place in April 2005 at UIUC (work funded by DOE as part of the Center for Programming Models for Scalable Parallel Computing).
HPCS
DARPA awarded IBM $53.5M in funding to pursue research in High Productivity Computing Systems. The work on the Productive, Easy-to-Use, Reliable Computing System (PERCS) covers chip technology, architecture, OS, compiler, and programming environment. UIUC is a partner in this research; the UIUC team includes Profs. Vikram Adve, Ralph Johnson, David Padua, Marc Snir, and Josep Torrellas, and their students. I am involved with students in several aspects of this project. In particular, we work on architectural extensions to support key applications; on benchmarks and performance modeling; and on compiler technology in support of semi-automated, run-time program tuning. (Work funded by DARPA; see recent publications.)
Blue Gene
IBM Research announced in December 1999 a $100 million research initiative to build "Blue Gene", a petaflop supercomputer. This level of performance would be achieved by connecting together a massive number (32K) of single-chip multiprocessors, where each chip is a highly parallel, shared-memory multiprocessor. The main intended application for this machine is the simulation of protein folding, a process that is key to the understanding of the fundamentals of life.
The exploitation of such a machine presents challenging software problems. In particular, one needs to ensure that the machine will continue to operate for months, even though individual components may fail during this time. One needs to develop new methods for programming and controlling this massive number of concurrently executing parallel threads. One needs to adapt software algorithms to achieve peak performance on a novel architecture with on-chip memory and hardware-supported concurrent multithreading.
I initiated the project and led the system work in the initial
phase. We demonstrated a speed-up in excess of 300,000 in a
simulation of protein folding in Blue Gene [ICS 2000]. The Blue Gene effort had a large number of
contributors, listed as authors in a
survey paper. The Blue Gene project
underwent significant technical and managerial changes a
year after it started and is now focused on a different
architecture and different set of applications.
Scalable Parallel System Architecture
Today, most large-scale parallel computers are built by assembling smaller "commodity" compute nodes. Two main architectures have emerged:
- Shared Memory Multiprocessors (SMM's), where each processor can access all memory (and all I/O), and all processors' caches are maintained coherent. Such systems are usually controlled by one operating system image. The prevalent parallel programming model is thread parallelism: each processor runs a separate execution thread. The threads run in a common address space, and communicate via shared memory.
- Clusters, where each node has a separate memory and a separate I/O subsystem, and each node is controlled by a separate operating system image. The prevalent parallel programming model is message passing: each processor runs a process in a private address space; processes communicate with each other via messages.
We studied a communication adapter that combines the messaging engine of clusters with two additional components:
- A data mover engine that can be used for memory-to-memory or memory-to-cache copying of data in a large NUMA system. In a conventional NUMA system, data can be moved only via load and store operations. This consumes precious CPU cycles and overloads the critical resource in such a system, namely the CPU-to-memory communication path. A data mover engine can prefetch data so as to hide long latencies without tying up key resources.
- A coherence engine that can be used to maintain coherent shared memory segments across different operating system images. To do so, the coherence engine needs additional protection mechanisms, to ensure that a faulty kernel will not issue wild writes against the memory of another OS image, and to ensure that an error in one system will not bring down other systems. However, much of this protection machinery is already built into messaging engines, as they are used to communicate across separate OS images.
Such a combined adapter can support efficiently a partitionable
shared memory multiprocessor. Increasingly, vendors of
large NUMA systems provide a dynamic partitioning capability
on their system. The system's physical resources can be
partitioned into several physical partitions, each controlled
by a separate operating system image. The partition boundaries
can be moved without rebooting affected OS images.
Partitioning provides better fault isolation, and enables
concurrent maintenance. A combined adapter enables the use of
the shared memory interconnect for fast, efficient message
passing across partitions. The main obstacle to the scaling up
of NUMA systems is software, not hardware. Therefore, it is
very likely that, in coming years, the ability of hardware
developers to scale up SMM's will significantly outstrip the
ability of OS and subsystem developers to scale up their
software. With partitioning, it becomes possible to scale up
NUMA systems beyond the number of processors that can be
supported by one OS image. The shared memory hardware can be
exploited in various ways to support limited sharing for
applications and subsystems that span multiple OS images. In
particular, one can support efficiently a shared memory
programming model for applications that span multiple OS
images. In this model, each processor runs a process in a
separate address space. Shared variables are kept in a segment
that is mapped into the address space of each process, and is
maintained coherent by hardware. Similarly, shared memory can
be used to provide efficient global lock services, or an
efficient global cache for shared disks.
One can expect that such partitionable systems will become
increasingly prevalent in coming years, providing a
convergence point for large-scale computer architectures.
Rather than being forced into a choice between "shared everything" and "shared nothing" architectures, users will dynamically select an appropriate level of sharing.
Some of these ideas were demonstrated in the Prism project.
Java for High Performance Numerical Computing
First proposed as a mechanism for enhancing Web content, Java has taken off as a serious general-purpose programming language. Industry and academia alike have expressed great interest in using Java as a programming language for scientific and engineering computations. Such usage has been hampered by the relatively inferior performance of Java in numerically intensive computing applications, and by the lack of key features in the Java language. In the Ninja project, we have developed compiler technology and libraries that lead to Java numerical codes with performance comparable to Fortran or C, the more traditional languages for this field. The Java Grande Forum, reflecting in part results from our own research, has identified five critical Java language and Java Virtual Machine issues related to Java's applicability to solving large computational problems in science and engineering. Unless these issues are resolved, it is unlikely that Java will be a successful language for numerical computing. The five critical issues are:
- Multidimensional arrays: True rectangular multidimensional arrays are the most important data structures for scientific and engineering computing.
- Complex arithmetic: Complex numbers are an essential tool in many areas of science and engineering. Computations with complex numbers need to be supported as efficiently as computations with the primitive real number types, float and double. The issue of high performance computing with complex numbers is directly tied to the next issue.
- Lightweight classes: The excessive overhead associated with manipulation of objects in Java makes it difficult to efficiently support alternative arithmetic systems, such as complex numbers, interval arithmetic, and decimal arithmetic. The ability to manipulate certain objects as having just value semantics is absolutely fundamental to achieve high performance with these alternative arithmetic systems.
- Use of floating point hardware: Achieving the highest possible level of performance on numerical codes typically requires exploiting unique floating point features in each processor. This is often at odds with the Java goal of exact reproducibility of results in every platform.
- Operator overloading: If multidimensional arrays and complex numbers (and other arithmetic systems) are to be implemented in Java as a set of standard packages, then operator overloading is necessary to make the use of these packages more attractive to the application programmer.
Parallel Programming Environments and Tools
My group at IBM provided the overall design and key components of the software architecture for the IBM SP scalable parallel system. Close to half of the compute power of the Top 500 supercomputing sites is now provided by SP systems. SP is the main platform used by the ASCI program. Much of this work is described in several articles published in IBM Systems Journal 34(2), 1995, which also list the main contributors. The SP is listed as one of fourteen major innovations at IBM Research.
Some contributions are listed below.
Message Passing Libraries
I worked with several collaborators on the design and development of the MPL message-passing library of native SP communication commands. Based on this work, I contributed to the design of MPI, which has become the industry-standard message-passing interface (I successfully wrote around half of the MPI-1 standard and, less successfully, wrote several chapters of the MPI-2 standard). Hubertus Franke developed MPI-F, an early, complete, high-performance implementation of MPI-1 on the SP2. Jointly with researchers at the Parallel Systems Group at the NAS facility (NASA Ames Research Center), we designed and implemented MPI-IO, a portable parallel I/O library that evolved to become part of MPI-2.
I continued research on MPI, recently pushing for endpoint support and working on interfacing MPI with a task model.
Parallel I/O
The Vesta parallel file system prototype, developed at IBM Research, provided much of the technology for the first SP parallel file system product.
Performance Tools
UTE (Unified Trace Environment) is a powerful, trace-driven tool for studying the performance of parallel programs.
Some Really Old Stuff
A somewhat random selection of old personal research topics.
Bayesian Induction
A Bayesian model of induction uses the following framework: a prior probability function Pr() represents our initial belief Pr(A) in A, for each empirical statement A about the state of our world. Over time, we accumulate evidence, learning that statements e1, e2, ..., en actually hold true in our world. As we do so, the prior probability Pr(A) of A is replaced by the conditional probability Pr(A|e1,...,en). A simple example of such a framework would be a "world" that consists of an infinite sequence of coin tosses, and statements that express assertions about this world, such as "the fifth toss is a head", "the ratio between heads and tails converges to 1", etc. Suppose that we systematically try coin toss after coin toss. We would like to believe that Bayesian induction works, i.e., if statement A actually holds in our world, then we would like Pr(A|e1,...,en) to converge to 1; if statement A is false, then we would like the conditional probability to converge to 0.
The good news: if we were not dogmatic about A, i.e., if we initially assigned to A a probability 0<Pr(A)<1, then Bayesian induction works for A. That's as good as one can hope for: clearly, if Pr(A)=0, then Pr(A|e1,...,en)=0; if we assume up front that A cannot be, then no amount of evidence will change our mind. Similarly, if Pr(A)=1, so that we assume up front that A must be, then no amount of evidence will change our mind.
The bad news: If the probability function Pr()
can be expressed in a formal mathematical notation, then
there must be some empirical statement A that is
assigned a priori probability zero: we cannot avoid some
preconceptions. So, if the world does not fit our
preconceptions, then we are in trouble.
A simple example of a "forced preconception" follows. Suppose that, in our simple world of coin tosses, we use the function Pr() to bet on subsequent tosses: if e1,e2,... is the sequence of outcomes of the coin tosses, then after n tosses we bet "head" if Pr("toss n+1 is head"|e1,...,en)>1/2, and "tail" otherwise. Consider the statement A="the bet will be wrong at every coin toss". If the function Pr() can be expressed in a formal mathematical notation, so can our statement. This is an empirical statement about the sequence of coin tosses, and it is quite easy to see that it must be that Pr(A)=0. I.e., if we use Pr() to measure our prior belief in the state of the world then, at the least, we harbor the preconception that Pr() cannot continuously mislead us.
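A small numerical illustration of the good news (a standard two-hypothesis Bayesian update on coin tosses, not the construction from the paper):

```python
import random

# Two candidate "worlds": the coin's head probability is 0.5 or 0.8.
# A is the statement "the head probability is 0.8"; we give it a
# non-dogmatic prior 0 < Pr(A) < 1 and condition on each observed toss.
def posterior_A(tosses, prior_A=0.5, p_A=0.8, p_notA=0.5):
    like_A = like_notA = 1.0
    for head in tosses:
        like_A *= p_A if head else 1 - p_A
        like_notA *= p_notA if head else 1 - p_notA
    num = prior_A * like_A
    return num / (num + (1 - prior_A) * like_notA)

rng = random.Random(42)
tosses = [rng.random() < 0.8 for _ in range(500)]  # a world where A holds
# Pr(A | e1..en) climbs toward 1 as evidence accumulates.
```

Since A was assigned a nonzero prior, the accumulating evidence drives the conditional probability of the true hypothesis toward 1, exactly as the convergence result promises.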
This seminal work is described in a
hard-to-read long
paper
published in the Journal of Symbolic Logic. The work is quoted
in Stanford’s online Encyclopedia of Philosophy and seems to
continue to generate follow-up research by philosophers
interested in the epistemology of inductive inference. (I was
flattered to see a 1995 Master thesis from CMU’s dept of
philosophy entitled “A Commentary on the First Three Sections
of Gaifman and Snir's 1982 JSL Paper Concerning Probabilities
Defined on a First Order Empirical Language," by Timothy
Herron.) Here is a
pointer to a recent presentation on this work.
Memory Hierarchy
Conventional abstract computing models assume constant access time to memory. In practice, memory accesses may take hundreds of instruction cycles. Caches are key to the performance of modern computers, and good cache locality is key to the performance of algorithms. Therefore, it is important to develop abstract computing models that reflect the reality of a memory hierarchy.
In joint work with Aggarwal and Chandra, we first considered a hierarchical memory model where the memory hierarchy is represented by assuming that access to address a costs f(a), for some monotonic function f() (most results focus on logarithmic and polynomial cost functions). We show, in this model, how to develop optimal algorithms for various problems. A simple example is provided by the matrix multiplication problem (assuming the n^3 product computation). A simple recursive algorithm turns out to have optimum locality (within a constant factor): each matrix is split into 2x2 block submatrices; the matrix product is expressed as a product of two 2x2 matrices involving, recursively, products of submatrices. This algorithm does not depend on the exact memory access cost function; Leiserson and co. later coined the term cache-oblivious for such algorithms. Furthermore, within a constant factor, an on-line LRU-type memory management algorithm works as well as off-line memory management: programmers can be effectively relieved of the need to explicitly copy data from one storage location to another.
In subsequent work, we refined the model to reflect the effect of spatial locality: access to contiguous data is cheaper than access to random locations. Interestingly, on-line memory management is not effective in handling spatial locality. Also, some problems do not admit cache-oblivious algorithms (recent work by Bilardi and co.).
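The recursive blocked algorithm is short enough to sketch in Python (recursing down to 1x1 blocks for clarity; a practical version would stop at a larger base case):

```python
def add(A, B):
    """Elementwise sum of two equal-shape matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(M):
    """Split a 2^k x 2^k matrix into its 2x2 block submatrices."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def matmul(A, B):
    """Cache-oblivious recursive multiplication (n a power of 2): the
    product of the 2x2 block matrices is computed recursively, giving
    optimal locality for any monotone memory access cost function."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    top = [r1 + r2 for r1, r2 in
           zip(add(matmul(A11, B11), matmul(A12, B21)),
               add(matmul(A11, B12), matmul(A12, B22)))]
    bot = [r1 + r2 for r1, r2 in
           zip(add(matmul(A21, B11), matmul(A22, B21)),
               add(matmul(A21, B12), matmul(A22, B22)))]
    return top + bot

M = [[i * 4 + j for j in range(4)] for i in range(4)]
I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
```

Note that no cache or memory parameters appear anywhere in the code; that is precisely what makes the algorithm cache-oblivious.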
Shared Memory Programming
Optimizing compilers improve code performance by reordering instruction execution, when such reordering does not change the computation outcome. For sequential programs, it is well understood when such reordering is legal: one has to respect data dependencies and control dependencies. The situation is more complicated for explicitly parallel programs. It is possible to come up with examples of transformations that would be correct when applied to each sequential thread in isolation, but that lead to an incorrect result when the threads execute concurrently and communicate via shared variables, due to violations of the shared memory semantics (we assume sequential consistency). In joint work with Shasha, we developed a formalism that enables the specification of order constraints within each individual executing thread, so that program transformations that respect these constraints within each thread will be correct. This framework also provides conditions for the preservation of the atomicity of compound operations. The recent popularity of Java, a programming language with explicit shared memory parallelism, has brought this set of issues again to the forefront, with an ongoing heated discussion on the correct memory model for Java.
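The standard example of such a transformation (a textbook case, not taken from our paper) can be checked mechanically: each thread's two statements have no data or control dependence, so swapping them is legal per thread, yet the swap introduces an outcome that sequential consistency forbids:

```python
from itertools import combinations

def interleavings(t1, t2):
    """All merges of two op sequences that preserve each thread's order."""
    n = len(t1) + len(t2)
    for pos in combinations(range(n), len(t1)):
        seq, i1, i2 = [], iter(t1), iter(t2)
        for k in range(n):
            seq.append(next(i1) if k in pos else next(i2))
        yield seq

def run(seq):
    """Execute one interleaving against shared memory; each op (dst, src)
    copies a constant or another variable into dst."""
    mem = {"x": 0, "y": 0, "r1": None, "r2": None}
    for dst, src in seq:
        mem[dst] = mem[src] if isinstance(src, str) else src
    return (mem["r1"], mem["r2"])

t1 = [("x", 1), ("r1", "y")]   # thread 1: x = 1; r1 = y
t2 = [("y", 1), ("r2", "x")]   # thread 2: y = 1; r2 = x
sc = {run(s) for s in interleavings(t1, t2)}
# Under sequential consistency, (r1, r2) == (0, 0) never occurs ...
swapped = {run(s) for s in
           interleavings(list(reversed(t1)), list(reversed(t2)))}
# ... but after the per-thread-legal swap, (0, 0) becomes observable.
```

This is exactly the kind of per-thread-correct, globally incorrect transformation that the order constraints in our formalism rule out.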
Decision Trees
The decision tree model is often used to analyze the complexity of computations that are dominated by test-and-branch operations. This model is used to show that sorting requires n log(n) comparisons, or that finding a maximum requires n-1 comparisons. Note that these two lower bounds are different in nature. The first one is an information-theoretic argument: there are n! possible outcomes to sorting n elements; therefore a decision tree that sorts has n! leaves, and depth log(n!). The second does not follow from an information-theoretic argument: maximum has n possible outcomes, so that the information-theoretic lower bound is log(n). Rather, it is an adversary argument: the element that was picked as a maximum must have been compared to all other elements; otherwise one could change the inputs so that another element is the maximum, without changing the outcome of the comparisons. Information-theoretic arguments are valid irrespective of constraints on the input domain (as long as there are still n! possible orderings) and irrespective of the type of predicates used to test and branch. It is not clear that the same holds for problems such as maximum. In fact, as the well-known riddle about finding one fake coin out of n shows, the maximum (or minimum) problem can be solved with fewer comparisons, using more complex tests, when the inputs can have only two values (good heavy coin and fake light coin).
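The riddle can be made concrete (a Python sketch of the classic balance-scale strategy): when there are only two possible weights, each weighing is a "complex test" on group sums, and about log3(n) of them suffice, instead of the n-1 simple comparisons that arbitrary weights would require.

```python
def find_fake(coins):
    """Locate the single light (fake) coin among otherwise equal coins.
    Each weighing compares the sums of two equal-size groups: a test far
    more complex than a pairwise comparison, which is exactly why the
    n-1 lower bound for simple comparisons does not apply here."""
    lo, hi, weighings = 0, len(coins), 0
    while hi - lo > 1:
        third = (hi - lo + 2) // 3
        a = coins[lo:lo + third]
        b = coins[lo + third:lo + 2 * third]
        weighings += 1
        if sum(a) < sum(b):        # fake is in the first group
            hi = lo + third
        elif sum(b) < sum(a):      # fake is in the second group
            lo, hi = lo + third, lo + 2 * third
        else:                      # balanced: fake is in the remainder
            lo = lo + 2 * third
    return lo, weighings

coins = [2] * 27
coins[13] = 1                      # one light fake among 27 coins
idx, weighings = find_fake(coins)
# finds index 13 in 3 weighings (log3(27)), versus 26 simple comparisons
```

The Ramsey-theoretic result discussed below shows that this kind of saving disappears once the coins can take sufficiently many distinct weights.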
Would the same be true if coins could have more than two weights? In joint work with Moran and Manber, we applied Ramsey's theorem to show that lower bounds obtained in a model where simple comparisons are used hold in a model where arbitrary binary predicates are used, provided that the input domain is large enough. Thus, if coins can have many different weights, n-1 tests are necessary, even if one is allowed to use arbitrary predicates. Note that Ramsey numbers are really large; this result tells us nothing about the number of comparisons needed when the number of possible values (coin weights) is small.
NYU Ultracomputer
The NYU Ultracomputer was never an ultracomputer nor a supercomputer: it was a paracomputer. It seems unlikely that paracomputers can be supercomputers.