NVIDIA shows signs ... [2008 - 2017]

If it is too branchy it will be slow on GPUs anyway.

No. It has its branches, but it won't necessarily be slow on GPUs if you write it properly (and that does not include vliw). And I don't see much scope for ILP there either.

Packing 4 lanes together will make the effective SIMD width 256 on AMD. What do you think will happen to that wide SIMD with even a little irregular control flow?:LOL:

How about something as simple as 3D FFT with vliw. I'd like to see people doing it on AMD GPUs; it would be interesting to see the results/algorithms/implementation details etc. Don't think vliw will be a win there (at least with standard data layouts), but I could be wrong.

BTW, do you know any GPU implementation of it?

I am working on it in my free time.
 
That was AMD's implementation. AMD's implementation of matrix multiply (texture-cache optimised) is also slower than prunedtree's.

But didn't they write it in IL directly? If so, what prevented them from using LDS efficiently? At any rate, broken toolchains from AMD so far (dunno about their ocl drivers) make it hard to make informed judgements/guesses.

I thought you might have encountered some other work, away from CAL/IL/Brook+...
No I haven't.
:???: 15% is the "reciprocal" of 659%.

IIUC, that means nv needs 6.59x the area per MADD vis-a-vis amd, right? (where the area considered was just the ALU area, not the whole chip area)

If so, holy cow:oops:, fuck the potential inefficiencies of vliw, why care? It is so damn small wrt competition that ditching it for scalar or x-way vliw (x<5) isn't worth the time.

The Hoshen-Kopelman algorithm you mentioned earlier, at first glance, appears to be fetch/bandwidth limited on a GPU. I suppose the additional analysis you might run alongside the generation of the clusters would up the arithmetic intensity. I couldn't find anyone who's written up a GPU implementation of it.

It seems mem bound because it has never (afaik) been written with GPUs in mind. Hell, people so far don't seem to have considered using the nature of cubical lattices either to speed it up. IOW, it is possible to turn many/major parts of it (at least) compute bound. It will do fantastically well on fermi/lrb because of their caches.

I am not speaking of running additional analysis along side to increase arithmetic intensity.


See prunedtree's thread. It gets used quite heavily (20-50% of non-graphics code? More?) and it isn't just for transcendentals (5-10% of non-graphics code?), which makes it cheaper than a pure transcendental ALU. Its utility in graphics should be even higher.

It is obviously pretty good for graphics; for non-graphics code, the question is as yet unsettled.
 
Isn't it kind of pointless comparing pure ALU density across various architectures? I mean, it's a known fact that the more general circuitry becomes, the more space it needs. After all, you're not doing your math on Cypress vs. Nehalem, are you?

There was this nice slide from a Cray guy at GTC09, "FLOPS are cheap, Communication is expensive", where he stated that a 64-bit FPU in 40nm CMOS would fit into 0.01mm², thus over 1,000 of them would fit in under 112.5mm².
 
Isn't it kind of pointless comparing pure ALU density across various architectures? I mean, it's a known fact that the more general circuitry becomes, the more space it needs. After all, you're not doing your math on Cypress vs. Nehalem, are you?

There was this nice slide from a Cray guy at GTC09, "FLOPS are cheap, Communication is expensive", where he stated that a 64-bit FPU in 40nm CMOS would fit into 0.01mm², thus over 1,000 of them would fit in under 112.5mm².

Yeah, this is actually a key point. One of the reasons why the Earth Simulator is so efficient (relative to its peak performance) is that it has a crazy full crossbar connecting all nodes (there are 640 of them). IIRC the crossbar takes more space than the actual computing nodes. In the upgraded Earth Simulator they replaced the crazy crossbar with a cheaper fat-tree, but that's mostly because the upgraded Earth Simulator has far fewer nodes.
 
Of course, there are the scalar cases (which proved to be far less common than the discussion here implies), typically using far fewer registers, where G80 shines. But even there, there is no reason for the R600 VLIW not to perform well; software may split each lane in 4, bringing utilization to at least 80%, and the big vec4 register file will serve it as well.

And also, there is the fat unit to talk about. There is simply no reason to have transcendental hardware on each ALU - they are not used that often. NVidia has a separate unit for handling them and it increases the cost of the control unit; the VLIW is fine with one slot, no additional complexity.
NVidia reduced the cost of the transcendental unit in G80 by doing attribute interpolation on it, too. And MUL and integer MUL, I think.

Attribute interpolation in pixel shaders increases ILP - and since G80 was doing that, that only increased average ILP for pixel shaders. Attribute interpolation in R800 increases the utilisation of VLIW, too.

I wouldn't take VLIW out of graphics now, but I think 5-issue isn't a good width. My preference for graphics now goes to VLIW in multiples of 3-issue (3, 6, maybe 9, why not?) due to 64-bit math and double precision: a 64-bit mul may be done with 3 32-bit muls (Karatsuba) instead of using 4 slots and leaving one free like on current hardware. Dedicated hardware for DP doesn't seem a good idea, because DP hardware is expensive and only a very small part of code running on GPUs today depends on DP.
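Tangentially, for anyone who wants to play with the three-multiply trick: here's a minimal C sketch of a 64x64 -> 128-bit multiply built from three 32x32 -> 64-bit multiplies (Karatsuba), checked against the compiler's 128-bit type. Function names are mine, and this says nothing about how a real ALU datapath would be wired:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: 64x64 -> 128-bit multiply from three 32x32 -> 64-bit multiplies
   (Karatsuba) instead of the schoolbook method's four. */
static void mul64x64_karatsuba(uint64_t x, uint64_t y,
                               uint64_t *hi, uint64_t *lo)
{
    uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    uint64_t y0 = (uint32_t)y, y1 = y >> 32;

    uint64_t z0 = x0 * y0;                  /* multiply #1 */
    uint64_t z2 = x1 * y1;                  /* multiply #2 */

    /* z1 = x1*y0 + x0*y1 = (x0+x1)*(y0+y1) - z0 - z2. The sums are up to
       33 bits, so peel off their top bits to keep the multiply at 32x32. */
    uint64_t sx = x0 + x1, sy = y0 + y1;
    uint64_t sxl = (uint32_t)sx, syl = (uint32_t)sy;
    uint64_t cx = sx >> 32, cy = sy >> 32;  /* each 0 or 1 */

    uint64_t m = sxl * syl;                 /* multiply #3 */

    /* 128-bit z1 = m + (cx*syl + cy*sxl)*2^32 + cx*cy*2^64, minus z0, z2 */
    uint64_t t = (cx ? syl : 0) + (cy ? sxl : 0);
    uint64_t add = t << 32;
    uint64_t mid_lo = m + add;
    uint64_t mid_hi = cx * cy + (t >> 32) + (mid_lo < add);

    if (mid_lo < z0) mid_hi--;              /* borrow-aware subtraction */
    mid_lo -= z0;
    if (mid_lo < z2) mid_hi--;
    mid_lo -= z2;

    /* assemble z2*2^64 + z1*2^32 + z0 */
    *lo = z0 + (mid_lo << 32);
    *hi = z2 + (mid_lo >> 32) + (mid_hi << 32) + (*lo < z0);
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {     /* check vs GCC/Clang 128-bit */
        uint64_t x = ((uint64_t)rand() << 40) ^ ((uint64_t)rand() << 20) ^ rand();
        uint64_t y = ((uint64_t)rand() << 40) ^ ((uint64_t)rand() << 20) ^ rand();
        uint64_t hi, lo;
        mul64x64_karatsuba(x, y, &hi, &lo);
        __uint128_t ref = (__uint128_t)x * y;
        if (hi != (uint64_t)(ref >> 64) || lo != (uint64_t)ref)
            return printf("mismatch\n"), 1;
    }
    printf("ok\n");
    return 0;
}
```

gcc -O2 should print "ok" (the 128-bit reference type is a GCC/Clang extension). The extra adds and carry handling hint at why hardware might prefer spending the 4th multiplier anyway.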
So you think MAD+MAD+MAD/transcendental is the right mix? I wonder about MAD+MAD+MAD+MAD/transcendental. Would either change make a notable difference in die size, though? The way virtual register file ports are effected using the 4-cycle instructions, coupled with the in-pipe registers (to avoid read-after-write latency penalties) seems to make 5-way very efficient in terms of operand handling (i.e. a capability of 17 operands per instruction for ALUs that can consume, at most, 15 operands). Everything fits together very tightly, so cutting a lane or two from the VLIW means you have to start from scratch, pretty much, to get all the savings.

The DP operation in ATI seems like it might use the dot-product capability of the 4 MAD lanes. Additionally subnormal support seems to have been added for SP and DP in ATI (don't know for sure with regard to DP though). That implies a monster adder or some other tricky stuff I don't know about.

DP operations don't cause the other lanes to idle on ATI. They're still available.

Going forwards none of these GPUs has a dedicated DP unit. NVidia looks like its exception-handling/performance will be the best. Larrabee should be in the same ballpark for performance, but I guess exceptions will be much slower.

Jawed
 
Hmm, variable length VLIW + SMT = win?
AFAIK (which is little) ATI is already variable-length VLIW, 1-5 instructions.
Going to a 9-wide VLIW means scheduling almost 2 threads per clock. If you can keep 4+ threads on the ready-to-run list and go all hyperthreading on them, you could do very simple scheduling to achieve better resource utilization.
Mix up to 9 instructions out of the 5-max VLIW streams into a dynamic 9-max VLIW stream.
Wait, is this superscalar VLIW (a dynamically superscalar, statically superscalar VLIW architecture)?
Seriously, what's this called?
And couldn't this help ATI going forward? Keep the virtual 5-wide arch, make the underlying hardware wider to absorb sub-5-instruction cycles and increase efficiency.
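To make the pick-and-pack idea concrete, a toy C model of a greedy scheduler filling a hypothetical 9-wide hardware word from the sub-5 bundles of a few ready threads (the widths, the bundle trace and all names are invented for illustration):

```c
#include <stdio.h>

#define HW_WIDTH  9   /* hypothetical hardware VLIW width */
#define N_THREADS 4   /* threads on the ready-to-run list */
#define N_BUNDLES 6   /* bundles per thread in the toy trace */

/* Toy model of the pick-and-pack idea: each ready thread exposes its next
   virtual bundle (1-5 slots wide) and a greedy round-robin scheduler packs
   whole bundles into a 9-wide hardware word each clock. */
int main(void)
{
    int trace[N_THREADS][N_BUNDLES] = {   /* bundle widths, invented */
        {2, 1, 4, 3, 5, 1},
        {1, 1, 2, 5, 1, 2},
        {3, 4, 1, 1, 2, 3},
        {5, 2, 2, 1, 4, 1},
    };
    int pos[N_THREADS] = {0};
    int issued = 0, clocks = 0, total = 0;

    for (int t = 0; t < N_THREADS; t++)
        for (int b = 0; b < N_BUNDLES; b++)
            total += trace[t][b];

    while (issued < total) {
        int free_slots = HW_WIDTH;
        for (int t = 0; t < N_THREADS; t++) {   /* at most 1 bundle/thread */
            if (pos[t] < N_BUNDLES && trace[t][pos[t]] <= free_slots) {
                free_slots -= trace[t][pos[t]];
                issued     += trace[t][pos[t]];
                pos[t]++;
            }
        }
        clocks++;
    }
    printf("%d slot-ops in %d clocks: %.0f%% of a %d-wide machine\n",
           issued, clocks, 100.0 * issued / (clocks * HW_WIDTH), HW_WIDTH);
    return 0;
}
```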
http://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing

ATI has a compiler between the IL (which is essentially an extended version of Direct 3D's Shader Model 4 assembly language) and the 5-way VLIW (which is just a portion of the program, as there are also Control Flow instructions and texture/vertex-fetch clauses). So there's no real point in treating the 5-way VLIW as the source instruction stream to be optimised-for on a future chip.

Jawed
 
But didn't they write it in IL directly? If so, what prevented them from using LDS efficiently? At any rate, broken toolchains from AMD so far (dunno about their ocl drivers) make it hard to make informed judgements/guesses.
http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=111171
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=102771

IIUC, that means nv needs 6.59x the area per MADD vis-a-vis amd, right? (where the area considered was just the ALU area, not the whole chip area)

If so, holy cow:oops:, fuck the potential inefficiencies of vliw, why care? It is so damn small wrt competition that ditching it for scalar or x-way vliw (x<5) isn't worth the time.
Yes, holy cow. But that comparison exaggerates things, which is why it's better to normalise to some kind of realisable utilisation.

The double-precision comparison that David Kanter did is quite wild, too - based on the overall GPU size and purely theoretical FLOPs. I adapted it to estimate RV870:

b3da023.png


Fermi seems likely to be very close to RV870. And it'll be supported properly and have more functionality.

It seems mem bound because it has never (afaik) been written with GPUs in mind. Hell, people so far don't seem to have considered using the nature of cubical lattices either to speed it up. IOW, it is possible to turn many/major parts of it (at least) compute bound. It will do fantastically well on fermi/lrb because of their caches.

I am not speaking of running additional analysis along side to increase arithmetic intensity.
Well, dare I say, best of luck with it - it looks pretty important. Algorithms for serial hardware don't necessarily deserve to be ported to parallel hardware...

I've seen references to parallel implementations of this algorithm (but academic firewalls get in the way). I presume they break up the grid into slices which are each processed in parallel. I guess a post-process pass is then used to reconcile clusters that touch slice boundaries, to discover if they cross slice boundaries. I suppose that's sort of another H-K pass. You end up with a hierarchy across the parallelism-splits you need to fit the problem onto the GPU and onto the cores/threads of the GPU.
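For reference, the serial building block being discussed is essentially union-find over cluster labels. A minimal C sketch (my naming, not from any particular paper); in the sliced variant each slice would run hk_label() locally and the reconciliation pass would call the same hk_union() on labels meeting at slice boundaries:

```c
#include <stdlib.h>
#include <stdio.h>

/* Union-find label management at the heart of Hoshen-Kopelman -
   a sketch of the serial building block only. */
static int *parent;

static int hk_find(int x)              /* follow links to the root label */
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]; /* path halving */
        x = parent[x];
    }
    return x;
}

static void hk_union(int a, int b)     /* merge two cluster labels */
{
    parent[hk_find(a)] = hk_find(b);
}

/* One slice's raster-order labeling over an occupancy grid: inherit the
   label of an occupied left/up neighbour, merging when they disagree. */
static void hk_label(const int *occ, int *label, int w, int h)
{
    int next = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int i = y * w + x;
            label[i] = -1;
            if (!occ[i]) continue;
            int left = (x > 0 && occ[i - 1]) ? label[i - 1] : -1;
            int up   = (y > 0 && occ[i - w]) ? label[i - w] : -1;
            if (left < 0 && up < 0) { parent[next] = next; label[i] = next++; }
            else if (up < 0)        label[i] = hk_find(left);
            else if (left < 0)      label[i] = hk_find(up);
            else { hk_union(left, up); label[i] = hk_find(up); }
        }
}

int main(void)
{
    int occ[12] = {1,1,0, 0,1,0, 1,0,0, 1,1,1};   /* 3x4 toy grid */
    int label[12];
    parent = malloc(12 * sizeof *parent);
    hk_label(occ, label, 3, 4);
    for (int i = 0; i < 12; i++)                  /* resolved labels */
        printf("%2d%s", occ[i] ? hk_find(label[i]) : -1,
               i % 3 == 2 ? "\n" : " ");
    free(parent);
    return 0;
}
```

The raster scan is what looks serial; the label merging itself is order-independent, which is what the slice-and-reconcile decomposition exploits.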

Jawed
 
Packing 4 lanes together will make the effective SIMD width 256 on AMD. What do you think will happen to that wide SIMD with even a little irregular control flow?:LOL:
Depends on the pattern: with 4 different paths chosen at random, about 75% of units will be masked, as they would be if the width was 32.

And there are other ways of improving branchy performance without reducing the width of the SIMD core; the simple one I'm thinking about also works better than a small SIMD for the pure random pattern above.

VLIW and wider SIMDs allow for more peak performance; even if they had to execute with half the lanes masked they may still be faster than a pure scalar small SIMD. And here's the point I hate about small SIMDs: if you mask half of them they aren't worth the cost anymore.

How about something as simple as 3D FFT with vliw. I'd like to see people doing it on AMD GPUs; it would be interesting to see the results/algorithms/implementation details etc. Don't think vliw will be a win there (at least with standard data layouts), but I could be wrong.
I never looked at 3D FFT code but I can tell you, it worked just fine.

I am working on it in my free time.
Good, when it is ready we can analyse it.

So you think MAD+MAD+MAD/transcendental is the right mix?
(MAD+MAD+MAD/transcendental)*x to be precise; in the current AMD chip x=2 is the easier "upgrade" path: keep one thin MAD per vector element and add two lanes with full DP MADD capability.

Would either change make a notable difference in die size, though?
For the change that increases issue width, I hope not ;)

The way virtual register file ports are effected using the 4-cycle instructions, coupled with the in-pipe registers (to avoid read-after-write latency penalties) seems to make 5-way very efficient in terms of operand handling (i.e. a capability of 17 operands per instruction for ALUs that can consume, at most, 15 operands). Everything fits together very tightly, so cutting a lane or two from the VLIW means you have to start from scratch, pretty much, to get all the savings.
I know, but all that was done with R600 in mind (look at how TMUs were positioned: a single SIMD core could use all of them). If you look at the current RV870 things don't make as much sense as they did; there is free space for more operands per VLIW, and, if we're going to make changes like that, let's add some form of DWF and per-ALU predicates :smile:

The DP operation in ATI seems like it might use the dot-product capability of the 4 MAD lanes. Additionally subnormal support seems to have been added for SP and DP in ATI (don't know for sure with regard to DP though). That implies a monster adder or some other tricky stuff I don't know about.
Wasn't RV870 supposed to support DOT3? Why spend 4 MAD lanes if 3 are enough?

DP operations don't cause the other lanes to idle on ATI. They're still available.
Yes, but in the case of a DP MUL four lanes are used for the op itself; the fifth cannot be used for doubles.

Going forwards none of these GPUs has a dedicated DP unit. NVidia looks like its exception-handling/performance will be the best. Larrabee should be in the same ballpark for performance, but I guess exceptions will be much slower.
Don't GT200 and Fermi have a dedicated unit for DP?
 
Yes, holy cow. But that comparison exaggerates things, which is why it's better to normalise to some kind of realisable utilisation.
Yeah it is semi-accurate, but even in the worst case it will have a 5x hit in perf, so 659/5 = 132%. And when you already have an ALU design which is 32% more area-efficient than the competition, why bother changing it now?
The double-precision comparison that David Kanter did is quite wild, too - based on the overall GPU size and purely theoretical FLOPs. I adapted it to estimate RV870:

b3da023.png

Love it. :love:

Well, dare I say, best of luck with it - it looks pretty important. Algorithms for serial hardware don't necessarily deserve to be ported to parallel hardware...

It's not serial, at least not meant for serial hw. It just has low ILP, even the ones I have in mind. And small caches suffice (even 8KB isn't too bad).

I've seen references to parallel implementations of this algorithm (but academic firewalls get in the way). I presume they break up the grid into slices which are each processed in parallel. I guess a post-process pass is then used to reconcile clusters that touch slice boundaries, to discover if they cross slice boundaries. I suppose that's sort of another H-K pass. You end up with a hierarchy across the parallelism-splits you need to fit the problem onto the GPU and onto the cores/threads of the GPU.
Sort of, but that is perhaps the last thing on my mind right now. The real pity (aka opportunity) here lies elsewhere.

Depends on the pattern: with 4 different paths chosen at random, about 75% of units will be masked, as they would be if the width was 32.

And there are other ways of improving branchy performance without reducing the width of the SIMD core; the simple one I'm thinking about also works better than a small SIMD for the pure random pattern above.
The cost is more than just 75%. If the code is branchy, then the cost won't be 4x, it will be 256x in the worst case, with each lane diverging and, even in that lane, each vliw stream diverging. There is a reason people don't just build a single core with a massively wide simd, but make many cores, each with a moderately wide vector width.
I never looked at 3D FFT code but I can tell you, it worked just fine.

Where have you seen 3D FFT on AMD GPUs? Linky?


Don't GT200 and Fermi have a dedicated unit for DP?

GT200 has one. Fermi won't.
 
Depends on the pattern: with 4 different paths chosen at random, about 75% of units will be masked, as they would be if the width was 32.
Narrower SIMDs will run the same pattern in fewer cycles. The cost, obviously, is the extra "overhead" hardware, as you need more SIMDs (and/or higher-clocked SIMDs).

And there are other ways of improving branchy performance without reducing the width of the SIMD core; the simple one I'm thinking about also works better than a small SIMD for the pure random pattern above.
You have to factor in the cost of collecting operands and storing the results. If you've built a pipeline that does gather/scatter, anyway, then you can argue that this is just a gather/scatter problem. But gather/scatter is expensive.

VLIW and wider SIMDs allow for more peak performance; even if they had to execute with half the lanes masked they may still be faster than a pure scalar small SIMD. And here's the point I hate about small SIMDs: if you mask half of them they aren't worth the cost anymore.
We still don't really have a good idea whether ATI's 64-wide is considerably worse than NVidia's 32-wide, or whether Larrabee will make both seem pathetic with 16-wide. That comparison might be moot, e.g. anything wider than about 4 is in a world of hurt. All we've got is:

http://www.hardware.fr/articles/770-6/dossier-amd-radeon-hd-5870.html

and Voxilla's Mandelbrot and Julia implementations:

http://forum.beyond3d.com/showthread.php?t=55330
http://forum.beyond3d.com/showthread.php?t=55344

which could in theory be used to see the incoherence penalties. Though there's not much nesting depth there, only depth 2 in the Julia - in the ATI assembly at least.

I know, but all that was done with R600 in mind (look at how TMUs were positioned: a single SIMD core could use all of them). If you look at the current RV870 things don't make as much sense as they did; there is free space for more operands per VLIW, and, if we're going to make changes like that, let's add some form of DWF and per-ALU predicates :smile:
R600's design has to support GPUs with only 4 TUs, too, i.e. RV610 - with ALU:TEX of 2. Yes, there is effectively an excess of texture-specific bandwidth to/from the registers with hardware whose ALU:TEX is higher than 1. But register file bandwidth for texture operations is shared with exports and LDS operations.

18 operands for 6 MADs, when there's currently only 17, doesn't quite work - but you could argue that one operand is likely to be shared across 2 or more lanes.

Don't understand what you mean by per ALU predicates.

Apart from the gather/scatter issue with DWF (which creates an implicit synch point at the entry of each distinct clause, though you can amortise that slightly with time spent on gathers) the other killer is having enough threads available from which to select. ATI might seem happier with more threads in flight, anyway, but I think there just aren't enough. Complex code results in only a handful of threads in flight (due to register allocation). As time goes by I go off DWF more and more. It doesn't scale.

I think scan, used to generate an index of strands to execute a clause and/or to pack the data into a buffer, is more useful. It scales, even though it's only explicit scattering/gathering and synching (i.e. like DWF). Fermi has nice big, real, L1s and multiple concurrent kernel support, so this should work reasonably well. But this is for the really gnarly workloads...
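A minimal serial C sketch of that scan-based packing, just the index generation (on the GPU the prefix sum itself would of course be a parallel scan):

```c
#include <stdio.h>

#define N 16

/* Sketch of the scan-based packing: an exclusive prefix sum over the
   per-strand predicate gives each active strand its slot in a packed
   buffer of work to run the clause on. Serial for clarity. */
int main(void)
{
    int active[N] = {1,0,0,1, 1,1,0,0, 0,1,0,1, 1,0,0,1};
    int slot[N], packed[N];

    int sum = 0;                      /* exclusive scan of the predicate */
    for (int i = 0; i < N; i++) {
        slot[i] = sum;
        sum += active[i];
    }

    for (int i = 0; i < N; i++)       /* scatter the active strand ids */
        if (active[i])
            packed[slot[i]] = i;

    printf("%d strands execute the clause:", sum);
    for (int i = 0; i < sum; i++)
        printf(" %d", packed[i]);
    printf("\n");
    return 0;
}
```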

Wasn't RV870 supposed to support DOT3? Why spend 4 MAD lanes if 3 are enough?
DOT4 still needs to be supported.

Yes, but in the case of a DP MUL four lanes are used for the op itself; the fifth cannot be used for doubles.
Often there's other stuff to do, loop counters, array index computations. And since there aren't any DP transcendentals, some of them can be seeded by an approximation run on T, before initiating some kind of DP-approximation.
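A quick C sketch of the seed-and-refine pattern for a DP reciprocal, with a single-precision divide standing in for the T-unit's RCP approximation; Newton-Raphson roughly doubles the correct bits per iteration. Illustrative only, not ATI's actual sequence:

```c
#include <stdio.h>

/* Sketch of the seed-and-refine pattern: a single-precision reciprocal
   (standing in for a T-unit RCP approximation) refined to full double
   precision with Newton-Raphson, x' = x*(2 - a*x). */
static double rcp_dp(double a)
{
    double x = (double)(1.0f / (float)a);   /* SP seed: ~24 good bits */
    x = x * (2.0 - a * x);                  /* ~48 bits */
    x = x * (2.0 - a * x);                  /* ~53 bits: full DP */
    return x;
}

int main(void)
{
    double a = 3.14159265358979;
    printf("seeded+refined: %.17g\ndirect 1.0/a  : %.17g\n",
           rcp_dp(a), 1.0 / a);
    return 0;
}
```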

Don't GT200 and Fermi have a dedicated unit for DP?
GT200 has a dedicated DP unit, but Fermi re-uses the main ALUs for half-rate DP.

Jawed
 
Yeah it is semi-accurate, but even in the worst case it will have a 5x hit in perf, so 659/5 = 132%. And when you already have an ALU design which is 32% more area-efficient than the competition, why bother changing it now?
NVidia ALUs are clocked ~2x faster than ATI. If you go back to those links I posted, you'll see the normalisations I did. That's how absolute performance of serial scalar on ATI can suck.

It's not serial, at least not meant for serial hw. It just has low ILP, even the ones I have in mind. And small caches suffice (even 8KB isn't too bad).
The columnar raster scan is purely serial in the original algorithm as far as I can tell. I'm just cribbing from this:

http://www.weizmann.ac.il/home/feamit/nodalweek/c_joas_nodalweek.pdf

Jawed
 
NVidia ALUs are clocked ~2x faster than ATI. If you go back to those links I posted, you'll see the normalisations I did. That's how absolute performance of serial scalar on ATI can suck.
I was under the impression that you had normalized for clock speed as well. Even then, AMD needs ~1.5x ILP to match nv, which isn't too terrible, but certainly evens the competition somewhat. Another important factor here is AMD's in-pipeline registers to avoid the RAW latency penalty, which seem to have no counterpart in cell, fermi or lrb.

The columnar raster scan is purely serial in the original algorithm as far as I can tell. I'm just cribbing from this:

http://www.weizmann.ac.il/home/feamit/nodalweek/c_joas_nodalweek.pdf

I have seen that presentation. Like I said, there are better ways, which afaik, are unexplored so far.
 
http://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing

ATI has a compiler between the IL (which is essentially an extended version of Direct 3D's Shader Model 4 assembly language) and the 5-way VLIW (which is just a portion of the program, as there are also Control Flow instructions and texture/vertex-fetch clauses). So there's no real point in treating the 5-way VLIW as the source instruction stream to be optimised-for on a future chip.

Jawed

Nice, thanks. I read about the Itanium architecture years ago, heh.
That is basically what I was thinking about, minus speculation, OoO and convoluted latency/prefetch exercises - relevant to the CPU arena, not so much on a GPU with thousands of threads.
Since the hardware is gonna be running many threads anyway, use just the cheapest of SMT implementations: pick and pack partial virtual VLIW bundles to fill up the real (longer) hardware VLIW. Might have been nice for rv8xx.

By using a standard ISA you avoid software development costs, putting the burden of enhanced efficiency on the hardware. It's a trade-off, passing the hot potato to the hw group. I don't think rewriting your compiler backend every few years is fun, but maybe NV/ATI are getting better at it. How many have they written already? Of course, simulation of future loads might suggest VLIW is not the best way forward.
 
Nice, thanks. I read about the Itanium architecture years ago, heh.
That is basically what I was thinking about, minus speculation, OoO and convoluted latency/prefetch exercises - relevant to the CPU arena, not so much on a GPU with thousands of threads.
Since the hardware is gonna be running many threads anyway, use just the cheapest of SMT implementations: pick and pack partial virtual VLIW bundles to fill up the real (longer) hardware VLIW. Might have been nice for rv8xx.
The D3D assembly or IL is already "VLIW" in a sense - the driver compiler is already picking and packing "VLIW" bundles.

Jawed
 
The cost is more than just 75%. If the code is branchy, then the cost won't be 4x, it will be 256x in the worst case, with each lane diverging and, even in that lane, each vliw stream diverging. There is a reason people don't just build a single core with a massively wide simd, but make many cores, each with a moderately wide vector width.
In the worst case, where each lane goes to a different path, GPUs will suck; it doesn't matter if the SIMD width is 256, 32 or just 2, it will be too slow, period. Use a CPU for those cases and stop the GPU discussion there.

In the more common "bad case" with a great level of divergence, the performance hit will depend on the pattern. If every four sequential threads each go to a different path, a width of 32 will be hit as badly as 256, and both will be 4x slower than if there was no divergence. If every 1024 sequential threads go to a different path, then neither option takes a performance hit. The case most biased towards 32 is when every 32 sequential threads go to a different path, but honestly, I don't mind being so "lucky". On a pure random pattern the performance hit on 256 will be bigger than on 32, but not so much bigger - try it.
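Taking up the "try it": a quick Monte Carlo in C that counts the distinct paths landing in each group of W threads, which is the serialisation factor a W-wide SIMD pays. Thread and path counts are arbitrary choices of mine:

```c
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS (1 << 20)   /* arbitrary */

/* Monte Carlo: every thread picks one of npaths paths uniformly at
   random; a w-wide SIMD serialises over the distinct paths present
   in each group of w threads. */
static double avg_serialisation(const int *path, int w, int npaths)
{
    long total = 0;
    for (int base = 0; base < NTHREADS; base += w) {
        int seen[64] = {0}, distinct = 0;     /* assumes npaths <= 64 */
        for (int i = 0; i < w; i++)
            if (!seen[path[base + i]]++)
                distinct++;
        total += distinct;
    }
    return (double)total / (NTHREADS / w);
}

int main(void)
{
    static int path[NTHREADS];
    const int npaths = 8;
    for (int i = 0; i < NTHREADS; i++)
        path[i] = rand() % npaths;

    const int widths[] = {2, 32, 64, 256};
    for (int i = 0; i < 4; i++)
        printf("width %3d: %.2fx average serialisation\n",
               widths[i], avg_serialisation(path, widths[i], npaths));
    return 0;
}
```

With 8 random paths this prints roughly 7.9x for width 32 and 8.0x for width 256 - both near-fully serialised, which is the "not so much bigger" point above.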

BTW, in the specific case of 256x1.25 ALUs vs 64x5 ALUs, if there is no ILP at all the first will perform as well as the second one in the worst case, but much better for not-so-bad cases.

Where have you seen 3D FFT on AMD GPUs? Linky?
non-public, sorry.

GT200 has one. Fermi won't.
I think I missed this part of the presentation, could you point it out please?

You have to factor in the cost of collecting operands and storing the results. If you've built a pipeline that does gather/scatter, anyway, then you can argue that this is just a gather/scatter problem. But gather/scatter is expensive.
No need to be "fully-associative"; a simple form to handle the most common patterns will improve performance at low cost. After all, if the cost is too high there is no reason to go SIMD.

We still don't really have a good idea whether ATI's 64-wide is considerably worse than NVidia's 32-wide, or whether Larrabee will make both seem pathetic with 16-wide. That comparison might be moot, e.g. anything wider than about 4 is in a world of hurt.
Just a suggestion: output the pattern of those tests for analysis; a simple script may check the efficiency of several widths in seconds. Trying to figure it out from the results of very different GPUs doesn't seem very productive to me...

R600's design has to support GPUs with only 4 TUs, too, i.e. RV610 - with ALU:TEX of 2.
In R600's design leaving an entire clock for texture addressing made a lot of sense. TPs ("Thread Processor" - I think this is what AMD decided to call the group of 5 SPs) were aligned with TMUs; I mean, there was at most one SIMD core accessing the "TMU core", and it accessed all TMUs at the same time (also the reason why RV610 had only 4 TPs per SIMD core). During those periods that 4th clock was really used.

In RV770 the 4th clock received a few new functions but wasn't really needed anymore; with TMUs coupled to the SIMD core those tasks could be accomplished by other means, like new special ALU instructions.

18 operands for 6 MADs, when there's currently only 17, doesn't quite work - but you could argue that one operand is likely to be shared across 2 or more lanes.
You said 17 because it is 12 from the register file plus 5 from forwarding, right? In this case the forwarding grows to 6, so the 18 operands; reallocating the 4th clock to register reads there are 4 more, 22 in total, 16 just from the register file.

Don't understand what you mean by per ALU predicates.
Like the predicates in ARM or, better, like the predicate operand in Larrabee; it would allow for better handling of vectorized code, especially if the vectorizing is done by a compiler - in Voxilla's vectorized version of Mandelbrot he used "inside" as a predicate.

Apart from the gather/scatter issue with DWF (which creates an implicit synch point at the entry of each distinct clause, though you can amortise that slightly with time spent on gathers) the other killer is having enough threads available from which to select. ATI might seem happier with more threads in flight, anyway, but I think there just aren't enough. Complex code results in only a handful of threads in flight (due to register allocation). As time goes by I go off DWF more and more. It doesn't scale.

I think scan, used to generate an index of strands to execute a clause and/or to pack the data into a buffer, is more useful. It scales, even though it's only explicit scattering/gathering and synching (i.e. like DWF). Fermi has nice big, real, L1s and multiple concurrent kernel support, so this should work reasonably well. But this is for the really gnarly workloads...
Another suggestion... Let's increase the wavefront width from 64 to 256 with the SIMD core still at 16, so it may require up to 16 clocks to execute instead of 4, while the register read part still requires only 4 clocks. Now, when ready to start executing the wavefront, look at each ALU and its threads, selecting one thread with its predicate set per cycle. If many predicates aren't set, the 256-thread wavefront may execute in as few as 4 clocks - performing like the 64-thread one in the worst case, but better for random patterns, up to 4 times better. Of course, this is more complex than nothing at all, but not as complex as full scan, gather, pack, compress, etc.
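A toy C model of this suggestion, all numbers hypothetical: each of the 16 lanes owns 16 of the 256 threads and skips clear predicates, so the wavefront costs the maximum per-lane active count, floored at the 4 clocks of the register read:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy model of the predicate-skipping wavefront (numbers hypothetical).
   A 256-thread wavefront runs on a 16-lane SIMD; lane i owns threads
   i, i+16, i+32, ... and each clock executes one thread whose predicate
   is set. The wavefront therefore takes the max per-lane active count,
   but never fewer than the 4 clocks the register read needs. */
int main(void)
{
    const int lanes = 16, threads = 256;
    int wavefront_clocks = 4;                 /* register-read floor */
    for (int lane = 0; lane < lanes; lane++) {
        int active = 0;
        for (int t = lane; t < threads; t += lanes)
            if (rand() % 4 == 0)              /* ~25% of predicates set */
                active++;
        if (active > wavefront_clocks)
            wavefront_clocks = active;
    }
    printf("predicate-skipping: %d clocks (a plain 256-wide wavefront: 16)\n",
           wavefront_clocks);
    return 0;
}
```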

DOT4 still needs to be supported.
Sure, and there is space for it there; the point is saving resources for doubles.

Often there's other stuff to do, loop counters, array index computations. And since there aren't any DP transcendentals, some of them can be seeded by an approximation run on T, before initiating some kind of DP-approximation.
Ok, but why not use only 3 ALUs instead of 4?
 
In the worst case, where each lane goes to a different path, GPUs will suck; it doesn't matter if the SIMD width is 256, 32 or just 2, it will be too slow, period. Use a CPU for those cases and stop the GPU discussion there.

There are things much worse than branching. And GPUs have hw support to handle small divergences relatively painlessly, but I doubt any scheme will scale to 256-wide vectors. Dynamic branching was there in SM3.0 - do you think GPUs have been standing still for the last 4 years? There are limits to what any branch-penalty scheme can do.

Packing branches indiscriminately is a bad idea. That scheme can work only for code that has almost no control flow.

Try packing a simple for loop, but with a thread-specific loop count, into 4 lanes of vliw hw. I think what you'll end up with is a more branchy thread.

non-public, sorry.

And what was the utilization (ie perf/theoretical peak perf)? :) Maybe you should look at that even if you can't tell us.
I think I missed this part of the presentation, could you point it out please?
http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7
 
In the worst case, where each lane goes to a different path, GPUs will suck; it doesn't matter if the SIMD width is 256, 32 or just 2, it will be too slow, period. Use a CPU for those cases and stop the GPU discussion there.
GPUs can still be faster. Also the logical SIMD width affects the penalty depending on how much variation in cycle count there is in the different paths.

No need to be "fully-associative"; a simple form to handle the most common patterns will improve performance at low cost. After all, if the cost is too high there is no reason to go SIMD.
What common patterns?

Just a suggestion: output the pattern of those tests for analysis; a simple script may check the efficiency of several widths in seconds. Trying to figure it out from the results of very different GPUs doesn't seem very productive to me...
To output those patterns you need to run the program. We have near-zero actual data, which doesn't help a discussion of DWF. And further, it takes a substantial effort to simulate DWF to see if it even has any value for simple control flow (which it does). But nothing more complex has been analysed.

In R600's design leaving an entire clock for texture addressing made a lot of sense. TPs ("Thread Processor" - I think this is what AMD decided to call the group of 5 SPs) were aligned with TMUs; I mean, there was at most one SIMD core accessing the "TMU core", and it accessed all TMUs at the same time (also the reason why RV610 had only 4 TPs per SIMD core). During those periods that 4th clock was really used.
This isn't a tweak, this is a total overhaul.

In RV770 the 4th clock received a few new functions but wasn't really needed anymore; with TMUs coupled to the SIMD core those tasks could be accomplished by other means, like new special ALU instructions.
Which is NVidia's approach.

You said 17 because it is 12 from the register file plus 5 from forwarding, right? In this case the forwarding grows to 6, so the 18 operands; reallocating the 4th clock to register reads there are 4 more, 22 in total, 16 just from the register file.
Yes, you're right about the 6th pipeline scalar - I wasn't thinking straight. Routing all register file accesses through ALU instructions seriously undermines the clause-based architecture of the cores, though. It makes it like NVidia's approach in G80 (or like Larrabee, as far as I can tell).

These register file bandwidth issues (peak:typical) may well indicate a serious motivation for a completely new design. In my view that stuff is secondary to generalising the cores for compute (i.e. it's just a component part of a broader solution at the architectural level).

Like the predicates in ARM or, better, like the predicate operand in Larrabee; it would allow for better handling of vectorized code, especially if the vectorizing is done by a compiler - in Voxilla's vectorized version of Mandelbrot he used "inside" as a predicate.
ATI does this already in the hardware; the variable "inside" is just a predicate as far as the hardware is concerned.

Another suggestion... Let's increase the wavefront width from 64 to 256 with the SIMD core still at 16, so it may require up to 16 clocks to execute instead of 4, while the register read part still requires only 4 clocks. Now, when ready to start executing the wavefront, look at each ALU and its threads, selecting one thread with its predicate set per cycle. If many predicates aren't set, the 256-thread wavefront may execute in as few as 4 clocks - performing like the 64-thread one in the worst case, but better for random patterns, up to 4 times better. Of course, this is more complex than nothing at all, but not as complex as full scan, gather, pack, compress, etc.
So you've added hardware but gained no performance.

Sure, and there is space for it there; the point is saving resources for doubles.
Depends on whether doubles are important I'd say. Maybe AMD will make doubles important with a revamp.

Ok, but why not use only 3 ALUs instead of 4?
Maybe you can improve on my understanding of the DP implementation that's in there currently:

http://forum.beyond3d.com/showthread.php?p=1142400#post1142400

because, to me, it looks very low cost (as you'd expect for quarter-rate MUL/MAD).

I'm definitely interested in what's next. I've described some mad things in the past:

http://forum.beyond3d.com/showthread.php?p=890803#post890803
http://forum.beyond3d.com/showthread.php?p=890818#post890818
http://forum.beyond3d.com/showthread.php?p=900211#post900211

I was puzzling for months over register file workings, until this and the following posts:

http://forum.beyond3d.com/showthread.php?p=912889#post912889
http://forum.beyond3d.com/showthread.php?p=913753#post913753

It's bonkers :p It was ages after that until we got a good understanding of G80's register file, operand collector and instruction issue. For me there's still a question mark over the mechanics of operand collection and instruction issue for the multifunction interpolator ALU (transcendental, attribute interpolation and MUL), as the Fermi whitepaper refers to instruction-issue clashes that I wasn't aware of.

Much of that speculation was focused on branching penalties, but I was also thinking in terms of program-counter-blind instruction issue (i.e. different PCs' MAD instruction issuing concurrently). In the end, for branching penalties, I don't think it's worth the effort, because there's not enough threads in flight (and I think it's unlikely AMD will make the register files multiple times larger in order to make this worthwhile). And, since you generally want to make gather/scatter fast for compute, the right solution is something that solves that problem and scales.

For other crazy instruction-issue schemes you have a serious operand gather, resultant scatter problem.

NVidia's just stepped back in complexity with Fermi's operand collector, making something that's simpler than G80's (hmm, maybe not that much simpler if it's tracking operands for multiple instructions per thread?). Fermi's still managing with a low count of threads per core - that's the intriguing part that's puzzling me...

There's so many choices and the trade-offs are really obscure to us armchair speculators :LOL:

Jawed
 
GPUs can still be faster. Also the logical SIMD width affects the penalty depending on how much variation in cycle count there is in the different paths.
I can't see how a G200 can be fast with only 1/32 of its RAW performance...

What common patterns?
Random.

To output those patterns you need to run the program. We have near-zero actual data, which doesn't help a discussion of DWF. And further, it takes a substantial effort to simulate DWF to see if it even has any value for simple control flow (which it does). But nothing more complex has been analysed.
For a trivial Mandelbrot it's simple; I will get some data when time permits... Unfortunately I have to work tomorrow...

Just to remember, PS was faster than scalar CS, right?

ATI does this already in the hardware; the variable "inside" is just a predicate as far as the hardware is concerned.
It's a bool :D
Current hardware has no support for a real predicate register used on a per-ALU basis; having that in hardware would also allow the implementation of the trick you said doesn't increase performance.

So you've added hardware but gained no performance.
:???: It may avoid a performance hit of up to 75%, so it may increase performance up to 4 times.

So, they did a double-precision multiplication using 1 int multiplier and 2 fp multipliers - maybe they already did what I was describing :smile:

Maybe you can improve on my understanding of the DP implementation that's in there currently:
AMD probably did a long multiplication using each of the multipliers in the thin ALUs to do part of it; this is the simplest method and requires few extra transistors. But a 2x-precision multiplier could be done with just 3 1x-precision multipliers; the two methods that make sense to me are Karatsuba and the long multiplication with the lowest multiply replaced by a good guess, both requiring some extra hardware, but cheaper than a full multiplier.

So people already talked about some of what I described here :smile:

Try packing a simple for loop, but with a thread-specific loop count, into 4 lanes of vliw hw. I think what you'll end up with is a more branchy thread.
For the specific case where the ALU utilization is below 25%, packing will result in the same performance in the worst case, but may improve on others. Which case are you thinking of that may see a performance drop?
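To put rough numbers on that, a toy cost model in C: four threads with different (arbitrary) trip counts packed into four VLIW lanes run max(n) iterations with idle lanes masked, versus the same work as independent scalar streams:

```c
#include <stdio.h>

/* Toy cost model for the loop-packing question: four threads with
   different trip counts share the four lanes of one VLIW thread. The
   packed loop runs max(n) iterations with finished lanes masked off,
   versus each thread looping independently as a scalar stream. */
int main(void)
{
    int trip[4] = {3, 17, 5, 9};          /* arbitrary trip counts */
    int max = 0, sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += trip[i];
        if (trip[i] > max) max = trip[i];
    }
    printf("packed VLIW: %d iterations, %d of %d lane-slots useful (%.0f%%)\n",
           max, sum, 4 * max, 100.0 * sum / (4 * max));
    printf("scalar     : %d iterations total, every slot useful\n", sum);
    return 0;
}
```

With these counts the packed version finishes in 17 iterations at 50% lane utilization, while the scalar streams need 34 iterations' worth of work in total - which is the trade-off being argued about.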
 