AMD: R9xx Speculation

Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.
Any particular reason why? They already have texture filtering/decompression in hw. Are you really referring to latency hiding when you say texturing? Or the connection of TMUs to cores?
 
Actually ATI is quite different. There's no latency-hiding for gather misses; all the resulting latency is always experienced by the ALUs. NVidia's operand collection hides that latency, generally.
For indexed register reads, I can see there being gather misses. But why would there be a gather miss for NV, whose reg file has to be statically addressed, AFAIK? Worst case, if there are nasty register access patterns, the compiler could insert NOOPs.
ATI RF is designed, generally, never to induce variable latency. It's the entire basis of the clause-by-clause execution model.
Fine, but why would there be variable latency in the NV RF? It's all statically addressed.
 
:oops: I don't understand the distinction you're making.
Moving to a fully load/store ISA means that instructions that perform computation are explicitly separate from memory-access instructions. As far as the software is concerned, the register memory is more distinct from other memory pools with Fermi than it was prior, and the more robust memory model of modern GPUs would add to the expense of integrating it into every operand access.

This is where you get into a nebulous argument over whether the memory in the operand collector, holding operands for multiple cycles until a warp's worth of operands are all populated, is really the register file :???: In this model the "registers", the constant cache, the shared memory and global memory are all just addressable memories.
The register file is the big collection of SRAM that holds operands that resides on one side of the ALUs. It is quite distinct physically and distinct in how it is treated.

I don't see why the operand collector needs to care about memory at all. The ISA is load/store, so all it needs to track is the readiness of the destination register of a given load. No instruction other than the memory access instructions would know of an address, which is much simpler to handle.
The operand collector would be wasting its time tracking the memory addresses.


Many of these fancy new algorithms (or re-discovered supercomputing principles) push repeatedly on the gather/scatter button at the ALU instruction level.
No new high-performance ISA puts memory operands at the ALU instruction level.
x86 internally splits ALU work off from the memory access because it is such a problem. Register accesses do not generate page faults, access violations, or require paging in memory.

G80->G92->GT200 saw progressively increasing register capacity and/or increasing work-items per SIMD. Fermi actually reverses things a little, I think. In other words it seems to me NVidia hasn't really settled on anything.
What needs to be settled? An example architecture with complex ALU instructions that could source multiple operands directly from memory was VAX.

Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.
The x86 core is what it is. Plenty of other architectures don't try to combine memory loads with ALU work, and the P55C core internally cracks the instructions apart anyway.

One could argue that texturing is still so massively important that it steers GPUs towards large RFs and the ALU-gather-scatter centric argument is merely a distraction, and Intel's stumbling block :p
I think it's mostly a distraction. The register file does very well on its own. The failings GPUs have with spills are more a product of their design. Other designs that degrade more gracefully just spill with loads and stores. It's cheaper and faster than trying to drive a full cache or make the internal scheduling hardware capable of handling memory faults.

That's all very well. But GPU performance falls off a cliff if the context doesn't fit into the RF (don't know how successfully GF100 tackles this). So, what we're looking for is an architecture that degrades gracefully in the face of an increasing context.
Memory operands save little here.
The difference between an x86 instruction with a memory operand and a load/store equivalent is a Load and then the ALU instruction (which the x86 does implicitly anyway).
It saves a bit on the instruction decode bandwidth and Icache pressure, but that is far from the limiting factor for GPU loads, and is not considered a limiting factor without an aggressively OoO speculative processor.
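To make that decomposition concrete, here's a minimal C-style sketch of my own (ordinary host code, not anyone's actual codegen): the compound form folds the memory read into the add, while the load/store form makes the access an explicit, separate step that can miss or fault on its own.

```
#include <stdint.h>

/* "Memory operand" style: one source operand comes straight from memory,
 * so the add is exposed to whatever latency/faults the access incurs. */
int32_t add_mem_operand(int32_t acc, const int32_t *p) {
    return acc + *p;                 /* e.g. x86: add eax, [rbx] */
}

/* Load/store style: the access is an explicit load into a register;
 * the add itself only ever touches registers (fixed latency, no faults). */
int32_t add_load_store(int32_t acc, const int32_t *p) {
    int32_t tmp = *p;                /* ld  tmp, [p]        */
    return acc + tmp;                /* add acc, acc, tmp   */
}
```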

The question is: can register files either keep growing or at the least retain their current size, in the face of ever more-complex workloads?
That is subject to speed and size constraints. It's not better with caches. L1s have stagnated and even begun to shrink.

What happens when GPUs have to support true multiple, heavyweight, contexts all providing real time responsiveness? The stuff we take for granted on CPUs?
Should they?
If you want latency-optimized performance, you don't design a throughput processor.
 
Jawed: The current perf cliff with GPU regs spill/fill is probably a byproduct of the lack of r/w caches more than anything else. Fermi already has L1s (it would be interesting to test how Fermi behaves with regs spill/fill compared to a GT200) and it's likely that AMD GPUs will have r/w caches in a not so distant future (not just for atomics/ROPs..)
 
Any particular reason why? They already have texture filtering/decompression in hw. Are you really referring to latency hiding when you say texturing? Or the connection of TMUs to cores?
It's my suspicion that this is the hardest thing to do right at enthusiast discrete GPU performance levels for a brand new architecture (there's now something like a 20x span in performance from IGP to enthusiast level). The cornerstone is basically cache locality (i.e. enough that's fast enough for AF) along with global latency hiding and enough "optimisations" to keep up with the IHVs, who've optimised the snot out of texturing.

Throughput (32 filtering units at 1.5GHz+, say) isn't the end of the story.

Saving lots of render target bandwidth, something that Larrabee does really sweetly, doesn't help much with making texturing work fast, though screen-space tiled rendering is relevant to texture command ordering - something the other GPUs are doing too.

Jawed
 
For indexed register reads, I can see there being gather misses. But why would there be a gather miss for NV, whose reg file has to be statically addressed, AFAIK? Worst case, if there are nasty register access patterns, the compiler could insert NOOPs.
Search: "operand collector". NVidia (at least historically) doesn't work that way at all.

Fine, but why would there be variable latency in the NV RF? It's all statically addressed.
It's not just the RF, it's all operands (constants, shared memory, global memory) - at least, historically.

Fermi's different (I understand what 3dilettante's referring to now - gather as an explicit, presumably non-blocking, instruction - not much different from a TEX instruction in this sense) and it may be that a lot of the complexity in NVidia's old-style operand collector, including handling variable latencies, has effectively disappeared. The mechanics of this I don't understand.

It may be that Fermi is fixed-latency for RF now and that no operand can come from anywhere other than RF. I don't know much about it.

Maybe that's why you're querying what I said, because you know it's actually fixed-latency?

Jawed
 
It's not just the RF, it's all operands (constants, shared memory, global memory) - at least, historically.
Ah, I hadn't thought of local/const/global memory while thinking about operand collector. I just had reg file in mind. :oops:

Fermi's different (I understand what 3dilettante's referring to now - gather as an explicit, presumably non-blocking, instruction - not much different from a TEX instruction in this sense) and it may be that a lot of the complexity in NVidia's old-style operand collector, including handling variable latencies, has effectively disappeared. The mechanics of this I don't understand.
Well, I think this reduces complexity in a manner somewhat similar to the CISC to load/store architecture transition.
It may be that Fermi is fixed-latency for RF now and that no operand can come from anywhere other than RF. I don't know much about it.
It should be fixed-latency now, IIUC.
 
It's about the money, Lebowski.

AMD currently has a 40nm 5xxx cash cow on their hands at all market segments/price points. Economically, why change that up anytime soon except to ...

a) move to a smaller, more cost effective node.
b) respond to competitive pressures from Nvidia.

If S.I. is on the 40nm node, that would nullify a).

That leaves what Nvidia will be bringing to the table ... after all, if AMD had reason to believe Nvidia wasn't going to bring much heat for the rest of 2010, they would have little reason to do much at all. Maybe do a lower-cost 'hybrid' 5790 part to replace the 5830 (or shove it up to a higher price point) and a 5890~5950 (5790x2?) to hit the $500 price point and take back the single-GPU crown. It would be very time/resource economical to tape out only one 'hybrid' chip if that is all that was really needed; it would also give them working knowledge and experience of most of N.I.'s architecture at 40nm, paving the road and freeing up resources to concentrate on implementing the full N.I. on GF's 28nm. This would also keep product-line confusion to a minimum ... bringing out hybrid S.I. cards that compete with existing 5xxx cards - why? What would they be called? Better to wait and roll out an entire 6xxx product line over a few months like they did with the 5xxx line.
 
I think it's mostly a distraction. The register file does very well on its own. The failings GPUs have with spills are more a product of their design. Other designs that degrade more gracefully just spill with loads and stores. It's cheaper and faster than trying to drive a full cache or make the internal scheduling hardware capable of handling memory faults.
So the question is: why not make GPUs that way? Well, Intel is - why wouldn't AMD and NVidia do the same?

I don't buy the "well, everything looks like x86 to Intel" argument. Besides, in this detail, it seems the only way to go, long-term.

That is subject to speed and size constraints. It's not better with caches. L1s have stagnated and even begun to shrink.
Well the whole deal with cache hierarchies is their spectacularly non-linear (both good and bad) effects on performance.

Should they?
If you want latency-optimized performance, you don't design a throughput processor.
Well, one could argue that Fusion-style GPUs will make any such era short-lived, but: if I'm video-encoding on the GPU (if it ever becomes worth doing :rolleyes:) I don't want my 3D-accelerated UI to stutter.

Jawed
 
Jawed: The current perf cliff with GPU regs spill/fill is probably a byproduct of the lack of r/w caches more than anything else. Fermi already has L1s (it would be interesting to test how Fermi behaves with regs spill/fill compared to a GT200) and it's likely that AMD GPUs will have r/w caches in a not so distant future (not just for atomics/ROPs..)

Well, the caches aren't the entirety of all problems. There is too much wastage as well, even on Fermi. If I need 18KB of local mem per work-group, then I have to waste 30KB of on-chip memory. It can't be used for caches or for my own needs. Sure, you could tune the parameters a bit, but it won't change the basic problem of three separate pools of SRAM sitting on-chip, yet segregated. And caches can't help with that problem.
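For what it's worth, that "tune the parameters a bit" knob on Fermi is the per-kernel choice between a 48KB-shared/16KB-L1 and a 16KB-shared/48KB-L1 split. A minimal CUDA sketch of the point (the kernel and sizes are just illustrative, only the cache-config call is the real API):

```
#include <cuda_runtime.h>

// Hypothetical kernel that statically needs 18KB of shared memory per block.
__global__ void myKernel(float *out)
{
    __shared__ float scratch[18 * 1024 / sizeof(float)];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[threadIdx.x];
}

int main()
{
    // Prefer the 48KB-shared / 16KB-L1 split for this kernel. With 18KB per
    // block, only one or two blocks' shared allocations fit in that partition,
    // and whatever is left over sits idle - it can't be handed to the L1 side.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // ... allocate d_out and launch as usual: myKernel<<<blocks, 256>>>(d_out);
    return 0;
}
```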
 
Jawed: The current perf cliff with GPU regs spill/fill is probably a byproduct of the lack of r/w caches more than anything else. Fermi already has L1s (it would be interesting to test how Fermi behaves with regs spill/fill compared to a GT200) and it's likely that AMD GPUs will have r/w caches in a not so distant future (not just for atomics/ROPs..)
R600 has a small R/W cache that was supposed to support register spill. The register architecture of R600 is supposed to be fully virtualised.

For whatever reason it hasn't turned into anything worth having, though. Or, maybe it does work as intended, but performance is still shit :???:

Yes, I'm definitely interested to see how Fermi works out for real.

Going back to the fixed-latency RF of Fermi, presuming that that's how it works: Why not just hold this stuff as locked lines in shared memory/L1, and then burst it into the ALUs? Both memories are in the 10s+ KB (RF is 128KB, shared/L1 is 64KB), and both are banked 10s of ways.

Then just have in-pipe registers to deal with cycle-to-cycle RAW.
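A rough software analogue of that idea (my sketch only, not a claim about how any hardware works): carve a slice of shared memory into per-thread backing store for values that don't fit in the RF, and pull them back into real registers only around the instructions that consume them.

```
#define SPILL_SLOTS 8

// Each thread parks SPILL_SLOTS extra values in shared memory, indexed by its
// position in the block, and reloads them into registers just before use.
__global__ void spill_to_shared(const float *in, float *out, int n)
{
    extern __shared__ float backing[];          // blockDim.x * SPILL_SLOTS floats
    float *mySlots = &backing[threadIdx.x * SPILL_SLOTS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // "Spill": stash intermediates instead of keeping them all live in the RF.
    for (int s = 0; s < SPILL_SLOTS; ++s)
        mySlots[s] = in[i] * (s + 1);

    // "Fill": bring each value back into a register right before it's consumed.
    float acc = 0.0f;
    for (int s = 0; s < SPILL_SLOTS; ++s)
        acc += mySlots[s];

    out[i] = acc;
}
// Launch with: spill_to_shared<<<blocks, threads, threads * SPILL_SLOTS * sizeof(float)>>>(d_in, d_out, n);
```

The hardware version described above would do this transparently, with the in-pipe registers soaking up the cycle-to-cycle RAW hazards.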

Jawed
 
It's about the money, Lebowski.

AMD currently has a 40nm 5xxx cash cow on their hands at all market segments/price points. Economically, why change that up anytime soon except to ...

a) move to a smaller, more cost effective node.
b) respond to competitive pressures from Nvidia.

If S.I. is on the 40nm node, that would nullify a).

That leaves what Nvidia will be bringing to the table ... after all, if AMD had reason to believe Nvidia wasn't going to bring much heat for the rest of 2010, they would have little reason to do much at all. Maybe do a lower-cost 'hybrid' 5790 part to replace the 5830 (or shove it up to a higher price point) and a 5890~5950 (5790x2?) to hit the $500 price point and take back the single-GPU crown. It would be very time/resource economical to tape out only one 'hybrid' chip if that is all that was really needed; it would also give them working knowledge and experience of most of N.I.'s architecture at 40nm, paving the road and freeing up resources to concentrate on implementing the full N.I. on GF's 28nm. This would also keep product-line confusion to a minimum ... bringing out hybrid S.I. cards that compete with existing 5xxx cards - why? What would they be called? Better to wait and roll out an entire 6xxx product line over a few months like they did with the 5xxx line.

Verifying the NI uncore in silicon and getting a single GPU that can beat the 480 sounds like a mighty big motivation to me.
 
I think it's mostly a distraction. The register file does very well on its own. The failings GPUs have with spills are more a product of their design. Other designs that degrade more gracefully just spill with loads and stores. It's cheaper and faster than trying to drive a full cache or make the internal scheduling hardware capable of handling memory faults.
IMHO, the reg files of today's designs are too wasteful going forward.
 
But you mentioned 64, not 32 (which is the "warp" size in NVidia, although 16 is, at least for some GPUs, the hardware thread size), so I thought you were querying ATI's architecture in comparison with NVidia.
The NVIDIA occupancy spreadsheet says registers are allocated for 2 warps at a time.
Patent documents indicate that GPRs can be allocated horizontally (a warp's single or multiple registers can occupy all banks) or vertically (each register is allocated along a bank) or mixed.
How would you write anything in CUDA which could make use of that kind of flexibility? If registers are allocated en bloc and accessed en bloc, it seems unlikely to me that the register file is capable of scatter/gather.
 
So the question is: why not make GPUs that way? Well, Intel is - why wouldn't AMD and NVidia do the same?
My answer would depend on which part you are addressing.
The weakness of spill/fill for GPUs is the traditionally unclosed write/read path for the shaders, where writes had to go off-chip and then come back again.
Fermi may change this.
Until recently, it was not a deal-breaker, though as the loads become more diverse it will become so.

Improved spill performance may come almost free with the closing of this loop on chip for actual producer/consumer relationships between shaders. A spill would basically make a shader its own consumer, and it would be a less hairy problem to track as this should rarely if ever hit the same kind of problems a true operand read could cause.
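As a concrete way to poke at this on current tooling (my example, not anything from the post above): cap a CUDA kernel's register budget and the compiler turns the excess live values into local-memory loads and stores, which on Fermi run through the L1 rather than going straight off-chip. The kernel below is hypothetical; the flags are real.

```
// Build with something like: nvcc -arch=sm_20 -maxrregcount=16 -Xptxas -v spill.cu
// -Xptxas -v reports the register count and any spill/local-memory usage;
// exactly how much ends up spilling is ultimately up to ptxas.

__global__ void long_live_ranges(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // A chain of dependent values that all stay live until the end, so they
    // can't all sit in 16 registers at once.
    float v[24];
    v[0] = in[i];
    #pragma unroll
    for (int k = 1; k < 24; ++k)
        v[k] = v[k - 1] * 1.0001f + k;       // each value depends on the previous one

    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 24; ++k)
        acc += v[k] * v[23 - k];             // consume them in a register-hostile order

    out[i] = acc;
}
```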

If addressing the feature that a Larrabee vector ALU instruction can include a read from the L1 "for free", then:

I don't buy the "well, everything looks like x86 to Intel" argument.
This would be valid.
I suspect that little feature would not have been included if Larrabee didn't already have x86 as its baseline. The hardware itself goes out of its way to crack the instruction.

Well the whole deal with cache hierarchies is their spectacularly non-linear (both good and bad) effects on performance.
Any significant capacity gain is probably going to happen at the lower and slower cache levels.
If you limit the view to pools of on-chip memory accessible within a single-digit number of cycles, the GPUs and Larrabee become rather close. It becomes a question of what can physically be fast enough for that time frame.

Registers still need to be around in some quantity, and it can be wasteful to lean on memory too much given that each access is significantly more expensive.
It's why I'd like to see a way to access the L1 as a big backup to the reg file, but also magically skip the TLB and exception hardware (it would become like a reg file or scratchpad). Unfortunately, in real life the memory pipe stages are defined in part by that cache hardware, so you can't freely dispense with it without some loss.


Well, one could argue that Fusion-style GPUs will make any such era short-lived, but: if I'm video-encoding on the GPU (if it ever becomes worth doing :rolleyes:) I don't want my 3D-accelerated UI to stutter.
Perhaps some modest improvements in this regard are possible. Fermi makes claim to this, and it seems Physx may not be a complete frame rate killer like it was before.
Even so, its context switch times are still pretty slow compared to a CPU.
Fusion could handle most consumer loads, and if widespread the app could put the UI on the Fusion GPU and the big work on the card.
 
The NVIDIA occupancy spreadsheet says registers are allocated for 2 warps at a time.
Really? Ha.

How would you write anything in CUDA which could make use of that kind of flexibility? If registers are allocated en bloc and accessed en bloc, it seems unlikely to me that the register file is capable of scatter/gather.
This is just the driver's problem, in a bid to optimise for different scenarios.

http://v3.espacenet.com/publication...b&FT=D&date=20091215&CC=US&NR=7634621B1&KC=B1

No idea if Fermi allows this flexibility.

Jawed
 
The NVIDIA occupancy spreadsheet says registers are allocated for 2 warps at a time.

I have been told that - at least on gt200 - the registers per work group are allocated in units of 512. Could be a compiler hack though.
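If that granularity is right, the per-block footprint just rounds up to the next multiple of 512. A back-of-the-envelope helper (my own, assuming the 512-register unit mentioned above and GT200's 16384 registers per SM):

```
#include <stdio.h>

// Round a block's register demand up to the (reported) 512-register allocation unit.
static unsigned regs_allocated(unsigned regs_per_thread, unsigned threads_per_block)
{
    unsigned needed = regs_per_thread * threads_per_block;
    return (needed + 511) / 512 * 512;
}

int main(void)
{
    const unsigned regfile = 16384;                  // registers per SM on GT200
    unsigned alloc = regs_allocated(13, 256);        // e.g. 13 regs/thread, 256 threads
    printf("allocated %u regs per block -> %u blocks fit per SM\n",
           alloc, regfile / alloc);                  // 3584 regs -> 4 blocks
    return 0;
}
```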
 
So the question is: why not make GPUs that way? Well, Intel is - why wouldn't AMD and NVidia do the same?

I don't buy the "well, everything looks like x86 to Intel" argument. Besides, in this detail, it seems the only way to go, long-term.
I'm not so sure ... a fundamental problem with Larrabee's approach is that it can only handle near-certain high-latency events; the cost of pushing/popping is way too high to use for every type of memory access. Everything else has to be rare.

In a throughput-optimized architecture, doesn't that strike you as strange? I would personally want an architecture which is able to deal well with any type of cache miss through vertical multithreading ... Larrabee ain't getting me there, not enough threads.
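Back-of-the-envelope on that "not enough threads" point (my numbers, purely for scale): by Little's law, the work a core needs in flight to cover a miss is roughly

```
\text{in-flight ops} \approx \text{miss latency} \times \text{issue rate}
                     \approx 200\ \text{cycles} \times 1\ \text{op/cycle} = 200
```

so with Larrabee's 4 hardware threads per core, each thread (or its software fibres) would have to expose on the order of 50 independent operations across an unpredicted miss, whereas a GPU keeping dozens of warps/wavefronts resident per core gets that slack by construction.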
 
All told that's potentially quite a lot of interdependent new stuff, e.g. hierarchical-Z needs to scale within screen-space tiles, not merely across tiles.

Exactly. If this is truly 67xx, I don't think they can or will simply yank the NI core out and replace it with the Evergreen shader core. The shader core per se is relatively easy, isn't it? Otherwise they might as well produce something like a Cedar-like NI in 40nm to save some R&D.

Then there's also the question of a revised memory architecture. e.g. bigger L2s. L2s that are generalised like in Fermi. etc.
Also, I reckon it's about time for 8x Z. 4x Z is so 2006.

Quite unlikely. A Fermi-like cache doesn't really translate into real-world performance, Cypress isn't cache-bound, and GPGPU is not SI's forte anyway.

Like I said above, improving GPGPU-related performance is highly unlikely on SI GPUs, whether SI is a stop-over or the next 67x0 - the 57x0 even removed DP capability.

On the other hand, SI/NI might share some miracle-worker (MC/ROP) from Llano to drastically reduce or at least mitigate bandwidth requirements. Otherwise even GDDR5+ won't do it on NI, and SI won't even have faster memory parts. An increase of at least 20% in real-life performance, which should be the minimal expectation six months from now, can't just come out of nowhere.
 
What is all this rubbish about variable latency RFs? The operand collector is really there to handle banking conflicts and improve bandwidth.

Nobody makes variable-latency RFs AFAIK, because it would cause an absolute shit storm for the compiler and the rest of the pipeline. Remember, the whole point of RISC was to try and get as many instructions as possible to have 1-cycle latency (or failing that, fixed latency).

DK
 