Algorithmic fault tolerance is an important open problem for complex numerical algorithms. Substantial advances will have high impact in a wide spectrum of applications – ranging from sensor networks over P2P networks to high performance computing (HPC), since future HPC systems are expected to exhibit much higher fault rates than current systems do. It is an open question how much resilience can be achieved at the algorithmic level and how this influences sustained performance.

The REPEAL project (REsilience vs. PErformance in linear ALgebra) investigates resilient/fault-tolerant parallel algorithms for a range of numerical linear algebra problems. The objectives are to design algorithms which provably produce accurate results (within the limitations of floating-point arithmetic) in the presence of faults, and to gain a better understanding of the resilience-performance trade-off.

We address the following questions: How can existing approaches be improved to handle more general temporal and spatial fault distributions? Which numerical accuracy and sustained performance do resilient algorithms achieve in real computations? Which resilience improvements can be achieved by combining deterministic with randomized approaches? What is the "price" of resilience, i.e., which slow-down has to be expected compared to non-resilient high performance algorithms?

Project Duration: The project has a duration of three years and started on March 1st, 2016.