The cost is more than just 75%. If the code is branchy, the cost won't be 4x; in the worst case it will be 256x, with each lane diverging and, even within a lane, each VLIW stream diverging. There is a reason people don't just build a single core with massively wide SIMD, but instead many cores, each with a moderately wide vector width.
In the worst case, where each lane takes a different path, GPUs will be terrible: it doesn't matter whether the SIMD width is 256, 32 or just 2, it will be too slow, period. Use a CPU for those cases; the GPU discussion stops there.
In the more common "bad case", with a great deal of divergence, the performance hit depends on the pattern. If every four sequential threads take a different path, a width of 32 is hit as badly as 256, and both are 4x slower than with no divergence. If every 1024 sequential threads take a different path, neither option takes a performance hit. The case most biased towards 32 is when every 32 sequential threads take a different path, but honestly, I wouldn't mind being that "lucky". On a purely random pattern the performance hit at 256 will be bigger than at 32, but not by much - try it.
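Since "try it" is cheap here, a quick sketch (plain Python; the thread count and the four-way branch are just illustrative numbers, not measurements from any GPU) that counts how many serial passes each SIMD width needs under the patterns above:

```python
import random

random.seed(0)

def passes_needed(paths, width):
    """Average number of serial passes per SIMD group of `width`
    consecutive threads: one pass per distinct path taken inside the
    group (1.0 means no divergence, 4.0 means a 4x slowdown)."""
    groups = range(0, len(paths), width)
    return sum(len(set(paths[i:i + width])) for i in groups) / len(groups)

N = 1 << 16      # number of threads; arbitrary, just needs to be large
PATHS = 4        # a four-way branch, as in the example above

stride4    = [(i // 4)    % PATHS for i in range(N)]   # every 4 threads diverge
stride1024 = [(i // 1024) % PATHS for i in range(N)]   # every 1024 threads diverge
rnd        = [random.randrange(PATHS) for _ in range(N)]

for name, paths in (("stride-4", stride4), ("stride-1024", stride1024), ("random", rnd)):
    for width in (32, 256):
        print(f"{name:12s} width {width:3d}: {passes_needed(paths, width):.3f}x")
```

It shows the three cases from the post: the stride-4 pattern costs 4x at both widths, the stride-1024 pattern costs nothing at either, and the random pattern saturates near 4x for both, with 256 only marginally worse than 32.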
BTW, in the specific case of 256x1.25 ALUs vs 64x5 ALUs, if there is no ILP at all the first will perform as well as the second in the worst case, but much better in the not-so-bad cases.
Where have you seen 3D FFT on AMD GPUs? Linky?
non-public, sorry.
I think I missed that part of the presentation, could you point it out please?
You have to factor in the cost of collecting operands and storing the results. If you've built a pipeline that does gather/scatter, anyway, then you can argue that this is just a gather/scatter problem. But gather/scatter is expensive.
No need for it to be fully associative; a simple scheme that handles the most common patterns will improve performance at low cost. After all, if the cost is too high there's no reason to go SIMD.
We still don't really have a good idea whether ATI's 64-wide is considerably worse than NVidia's 32-wide, or whether Larrabee will make both seem pathetic with 16-wide. That comparison might be moot, e.g. anything wider than about 4 is in a world of hurt.
Just a suggestion: output the pattern from those tests for analysis. A simple script can check the efficiency of several widths in seconds; trying to figure it out from the results of very different GPUs doesn't seem very productive to me...
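Something like this toy checker is what I have in mind (Python; the example pattern is made up - in practice you'd feed it the per-thread path trace dumped by the tests):

```python
def efficiency(paths, width):
    """SIMD efficiency for a given width: useful lane-slots divided by
    issued lane-slots, assuming one full-width pass per distinct path
    taken inside each group of `width` consecutive threads."""
    issued = sum(len(set(paths[i:i + width])) * width
                 for i in range(0, len(paths), width))
    return len(paths) / issued

# made-up example pattern: every 8 consecutive threads share one of 3 paths
example = [(i // 8) % 3 for i in range(1024)]

for width in (2, 4, 8, 16, 32, 64, 128, 256):
    print(f"width {width:3d}: {efficiency(example, width):6.1%}")
```

For that pattern, widths up to 8 run at 100%, 16 drops to 50%, and everything from 32 up sits at 33% - exactly the kind of per-width table that is hard to reverse-engineer from benchmarks on different GPUs.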
R600's design has to support GPUs with only 4 TUs, too, i.e. RV610 - with ALU:TEX of 2.
In R600's design, leaving an entire clock for texture addressing made a lot of sense. TPs ("Thread Processor" - I think that's what AMD decided to call the group of 5 SPs) were aligned with TMUs: there was at most one SIMD core accessing the "TMU core", and it accessed all TMUs at the same time (also the reason why RV610 had only 4 TPs per SIMD core). During those periods the 4th clock was really used.
In RV770 the 4th clock received a few new functions but wasn't really needed anymore; with the TMUs coupled to the SIMD core, those tasks could be accomplished by other means, like new special ALU instructions.
18 operands for 6 MADs, when there's currently only 17, doesn't quite work - but you could argue that one operand is likely to be shared across 2 or more lanes.
You said 17 because it is 12 from the register file plus 5 from forwarding, right? In this case the forwarding grows to 6, hence the 18 operands; reallocating the 4th clock to register reads gives 4 more, 22 in total, 16 just from the register file.
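Just to keep the bookkeeping straight, here are the counts as I understand them (all figures come from this exchange, not from any AMD document):

```python
# Operand supply for issuing 6 MADs per clock, per the discussion above
reg_file = 12          # operands read from the register file per cycle
forward  = 5           # operands currently supplied by forwarding
assert reg_file + forward == 17       # "currently only 17"
assert 6 * 3 == 18                    # 6 MADs want 3 operands each

forward6 = 6                          # forwarding grown to 6...
assert reg_file + forward6 == 18      # ...covers the 18 operands

extra_reads = 4                       # 4th clock reallocated to register reads
assert reg_file + extra_reads + forward6 == 22   # total operand supply
assert reg_file + extra_reads == 16              # register file alone
```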
Don't understand what you mean by per ALU predicates.
Like the predicates in ARM or, better, like the predicate operand in Larrabee. It would allow better handling of vectorized code, especially when a compiler is doing the vectorizing - e.g. in Voxilla's vectorized version of Mandelbrot, he used `inside` as a predicate.
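A minimal sketch of the idea in Python, with an explicit lane loop standing in for the SIMD hardware (the names are mine, not from Voxilla's actual code): `inside` is the per-lane predicate, so already-escaped points are masked off instead of branching.

```python
def mandelbrot_lanes(cs, max_iter=100):
    """Iterate z = z*z + c across a vector of points `cs`, using a
    boolean predicate per lane instead of divergent branches."""
    zs = [0j] * len(cs)
    counts = [0] * len(cs)
    inside = [True] * len(cs)          # the per-lane predicate
    for _ in range(max_iter):
        if not any(inside):            # whole vector masked off: early out
            break
        for lane, p in enumerate(inside):
            if p:                      # predicated execution of this lane
                zs[lane] = zs[lane] * zs[lane] + cs[lane]
                counts[lane] += 1
                if abs(zs[lane]) > 2.0:
                    inside[lane] = False   # point escaped: clear predicate
    return counts

print(mandelbrot_lanes([0j, 2 + 0j, -1 + 0j, 0.3 + 0.5j]))
```

The point is that all lanes run the same instruction stream every iteration; the predicate only gates the write-back, which is exactly what an ARM- or Larrabee-style predicate operand buys a vectorizing compiler.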
Apart from the gather/scatter issue with DWF (which creates an implicit synch point at the entry of each distinct clause, though you can amortise that slightly with time spent on gathers) the other killer is having enough threads available from which to select. ATI might seem happier with more threads in flight, anyway, but I think there just aren't enough. Complex code results in only a handful of threads in flight (due to register allocation). As time goes by I go off DWF more and more. It doesn't scale.
I think scan, used to generate an index of strands to execute a clause and/or to pack the data into a buffer, is more useful. It scales, even though it's only explicit scattering/gathering and synching (i.e. like DWF). Fermi has nice big, real, L1s and multiple concurrent kernel support, so this should work reasonably well. But this is for the really gnarly workloads...
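For reference, the scan-based packing amounts to something like this toy Python sketch (not Fermi code): an exclusive prefix sum over the predicate vector gives each active strand its slot in the packed buffer.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum; returns the running offsets and the total."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out, total

def pack_strands(predicates):
    """Return the indices of strands whose predicate is set, packed
    densely - the 'index of strands to execute a clause' above."""
    flags = [1 if p else 0 for p in predicates]
    offsets, count = exclusive_scan(flags)
    packed = [0] * count
    for strand, (flag, slot) in enumerate(zip(flags, offsets)):
        if flag:
            packed[slot] = strand      # scatter to the scanned offset
    return packed

print(pack_strands([True, False, False, True, True, False, True]))
# packed indices of the set lanes: [0, 3, 4, 6]
```

Everything here is explicit scattering/gathering, which is why it scales: the scan and the scatter are both ordinary data-parallel operations over memory, not a special hardware regrouping mechanism.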
Another suggestion... Let's increase the wavefront width from 64 to 256, keeping the SIMD core at 16, so a wavefront may require up to 16 clocks to execute instead of 4, while the register-read part still only requires 4 clocks. Now, when the wavefront is ready to start executing, look at each ALU and at its threads, and each cycle select one thread with a set predicate to execute. If many predicates are unset, the 256-thread wavefront may execute in as few as 4 clocks; in the worst case it performs like the 64-thread one, but for random patterns it performs better, up to 4 times better. Of course this is more complex than doing nothing at all, but not as complex as a full scan, gather, pack, compress, etc.
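A toy model of the scheme (Python; the interleaved thread-to-ALU mapping and the 4-clock register-read floor are my assumptions, not anything AMD documents):

```python
import random

def wavefront_clocks(predicates, alus=16, floor=4):
    """Clocks for a 256-thread wavefront on a 16-ALU core when each ALU
    retires one of its threads with a set predicate per clock.  Assumes
    threads are interleaved across ALUs and register reads impose a
    4-clock floor (both assumptions of this sketch)."""
    assert len(predicates) == alus * 16
    busiest = max(sum(predicates[a::alus]) for a in range(alus))
    return max(floor, busiest)

# worst case, every predicate set: 16 clocks for 256 threads,
# i.e. the same 4 clocks per 64 threads as today's wavefronts
print(wavefront_clocks([True] * 256))    # 16

random.seed(1)
# a quarter of the predicates set at random: typically well under 16 clocks
sparse = [random.random() < 0.25 for _ in range(256)]
print(wavefront_clocks(sparse))
```

The runtime is the busiest ALU's count of set-predicate threads (clamped at 4), which is why random sparse patterns win while the dense worst case degrades gracefully to today's behaviour.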
DOT4 still needs to be supported.
Sure, and there is space for it there; the point is saving resources for doubles.
Often there's other stuff to do, loop counters, array index computations. And since there aren't any DP transcendentals, some of them can be seeded by an approximation run on T, before initiating some kind of DP-approximation.
Ok, but why not use only 3 ALUs instead of 4?