It seems to me you still have in mind a somewhat low-level voting system. Here's the system that I would try to build, were I given the chance:
I would have a single large cluster with a large number of compute nodes (depending upon the task). Data would be processed in chunks, with each chunk processed twice, followed by a verification step. Presumably the optimal arrangement would be to have the two copies of each chunk processed by separate nodes, which would require a bit of careful programming to queue the verification step appropriately, but nothing too horrible. Any verification step that fails would then just re-insert the pair of chunks into the queue.
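Roughly, the queuing logic I have in mind, as a single-process Python sketch (process_chunk, the node names, and the round-robin dispatch are all just stand-ins for the real thing):

```python
# Minimal sketch of the duplicate-compute-and-verify scheme described above.
# Everything here is a placeholder; the real system would dispatch chunks
# asynchronously to actual compute nodes.

from collections import deque
from itertools import cycle

def process_chunk(chunk, node_id):
    """Stand-in for the real per-chunk computation on a given node."""
    return sum(chunk)

def run(chunks, node_ids):
    work = deque(enumerate(chunks))   # (chunk_id, chunk) pairs awaiting processing
    nodes = cycle(node_ids)           # trivial round-robin "scheduler"
    results = {}

    while work:
        chunk_id, chunk = work.popleft()

        # Dispatch the same chunk to two *different* nodes.
        node_a = next(nodes)
        node_b = next(nodes)
        while node_b == node_a:
            node_b = next(nodes)

        result_a = process_chunk(chunk, node_a)
        result_b = process_chunk(chunk, node_b)

        # Verification step: accept on agreement, otherwise requeue the pair.
        if result_a == result_b:
            results[chunk_id] = result_a
        else:
            work.append((chunk_id, chunk))

    return results

if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(run(data, node_ids=["node0", "node1", "node2", "node3"]))
```

In the real thing the two copies would be computed asynchronously and the compare step would run wherever is convenient, but the re-insert-on-mismatch behaviour is the important part.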
This all requires, of course, that the data can be divided into wholly discrete chunks that are completely independent of one another. If this is not the case, the verification obviously gets dramatically more challenging. But I doubt that it would ever be impossible.
Doubling the node count is the one cost I can see that carries the highest guaranteed penalty for a GPU system.
If we go with the 200x speedup claim for OPC (ignoring the "up to"):
Let's use a CPU cluster as the baseline, and assume an equivalent GPU system would be 200 times faster at the same footprint, or need only 1/200th the footprint for the same performance.
Doubling the node count for duplication immediately cuts that to 100x.
It's still very good.
GPGPU is not without other extremely large scalar divisors, unfortunately.
With just one GPU, various GPGPU workloads are already CPU-limited just doing the computation and bookkeeping needed to control the slave board.
The speedup with verification overhead is going to be roughly 100 * (1 - (per-chunk verification overhead + re-run cost * mismatch rate)).
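To put rough (entirely made-up) numbers on that, say the compare step costs 5% of a chunk's compute time and 1% of chunk pairs mismatch and have to be redone in full:

```python
# Entirely made-up numbers, just to show the shape of the penalty.
base_speedup = 200 / 2      # node count already doubled for duplication
verify_overhead = 0.05      # compare step as a fraction of per-chunk compute
rerun_cost = 1.0            # a mismatch redoes the whole pair
mismatch_rate = 0.01

effective = base_speedup * (1 - (verify_overhead + rerun_cost * mismatch_rate))
print(effective)            # -> 94.0
```

Under those assumptions the duplication itself is the dominant cost; the verification bookkeeping only shaves off a few more percent unless the mismatch rate gets ugly.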
Some additional performance would be lost with driver overheads and the more complicated software stack.
I don't know how high the penalty could get.
Maybe speedup is between 50x and 90x.
Still, good...
Couple this with the likelihood that the 200x figure was derived from code nobody would trust a multi-million-dollar chip tape-out to without serious elaboration, and that the test probably didn't include all the processes a full HPC installation would need, and there's likely another loss of performance.
It still might work for the narrow case of problems that have two orders of magnitude improvement on GPUs--assuming the 200x speedup is enough to cover potential losses of performance in other parts of the process.
We'd also be ignoring how fragile GPU performance is, which can cut things down by significant factors as well.
Anything even slightly more modest would be right out.
If the 200x gets reduced to something like a 50-80x speedup and we throw in double precision (assuming the GPU FLOPS weren't crap FLOPS to start with), Nvidia would be right out.
AMD might be slightly better, or just as much a waste of time, depending on which variables interfere, even with a theoretical 200x advantage.
GPUs would be much more compelling with ECC and other RAS features.
(They'd be even more compelling if they weren't slave cards, but I digress.)
This leaves out the practicalities of handling more error-prone hardware with very poor monitoring capabilities and fault detection.
For a small compute section, it might not matter. For a full HPC installation, it may be much less practical.