Redundancy and Recovery Mechanism for Reliable Computations

I found this paper, and even though I haven't read it yet, it looks pretty interesting.

J. Sheaffer, D. Luebke, and K. Skadron. "A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors." In Proceedings of Eurographics/ACM Graphics Hardware 2007 (GH), Aug. 2007, to appear.

http://www.cs.virginia.edu/~skadron/Papers/sheaffer_gh2007.pdf
We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures.
Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5× performance penalty and saves energy for GPGPU, but is completely transparent to general graphics and does not affect the performance of the games that drive the market.
Looks like even in the future, GPGPUs and GPUs can continue to be one and the same.
 
From what I've skimmed, it's an elaboration of the technique used by IBM's mainframe processors, which have dual pipelines that run the same instruction stream.

The GPU variant adds flexibility by allowing the paired-pipeline mode to be deactivated.
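The paper's mechanism is hardware, but the idea (run each computation through two redundant pipelines, compare outputs, and recompute only the elements that mismatch) can be illustrated in software. Below is a toy Python sketch of that scheme; it is not from the paper, and `unreliable_compute`, the fault-injection model, and the error rate are all made up for illustration:

```python
import random

random.seed(0)  # deterministic fault injection for the demo

def unreliable_compute(f, x, error_rate=0.1):
    """Model one pipeline whose output is occasionally silently corrupted."""
    result = f(x)
    if random.random() < error_rate:
        # Inject a transient fault as a random offset. (Real hardware faults
        # are bit flips; a random offset keeps the toy model simple.)
        return result + random.randint(1, 1000)
    return result

def redundant_compute(f, xs, error_rate=0.1):
    """Run every element through two redundant 'pipelines'.

    On a mismatch, only that element is recomputed, mirroring the
    selective-recovery idea: correct results are never redone.
    """
    results = []
    recomputes = 0
    for x in xs:
        while True:
            a = unreliable_compute(f, x, error_rate)
            b = unreliable_compute(f, x, error_rate)
            if a == b:  # pipelines agree, accept the result
                results.append(a)
                break
            recomputes += 1  # disagreement: recompute just this element

    return results, recomputes

vals, redo = redundant_compute(lambda x: x * x, range(10))
print(vals)
```

Note the limitation shared with real dual-pipeline schemes: if both copies are corrupted identically, the comparison passes; redundancy reduces, but cannot eliminate, the chance of a silent error.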

The paper mentions the likely need for raster state and all data paths to be protected by ECC.
That could well become a differentiating feature between GPGPU and GPU products: if not on-chip ECC, then ECC RAM may be kept out of the enthusiast space.
 