AMD: R9xx Speculation

No one dares to make a game that will use more than 1 GB of video memory if there are no 2 GB cards available. Mod packs with high-res textures (I mean really high-res, not shoddy console-port quality) can make a massive difference on 2 GB cards.
And it would finally be time that you stop seeing things like this: http://www.pcgameshardware.de/aid,6...en-jetzt-mit-neuen-Gruseltapeten/Spiele/News/

Metro 2033 is clearly one of those games, though in DX10 it will work fine with 1 GB. I think Crysis Warhead can see a measurable improvement from going to 2 GB at 1920×1200 with 8x AA, too.

There are few such cases now, but they're bound to get more common.
 
Indeed, 1120 divided by 64 ALUs per array gives 17.5 arrays.

Though couldn't the SIMD width be changed from 16?
If the SIMD width were 8, it would give 32 ALUs per SIMD, resulting in 35 arrays on a "Pro / xx50" model, which could have some arrays disabled from the full chip; the full chip could be, for example, 40 arrays, resulting in 1280 ALUs.

Those could easily be reasonable counts for a 6700 lineup.
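
As a back-of-the-envelope check of those numbers, here is a quick Python sketch. Note the 4-wide VLIW units and the 40-array full chip are this thread's speculation, not confirmed specs:

Code:
# Speculative config math from the posts above. The VLIW width of 4 and
# the 40-array full chip are assumptions from this thread, not known specs.
VLIW_WIDTH = 4

def arrays_needed(total_alus, simd_width, vliw_width=VLIW_WIDTH):
    """How many SIMD arrays a given total ALU count implies."""
    return total_alus / (simd_width * vliw_width)

print(arrays_needed(1120, 16))  # 17.5 -- width-16 arrays don't divide evenly
print(arrays_needed(1120, 8))   # 35.0 -- width-8 arrays give a clean count
print(40 * 8 * VLIW_WIDTH)      # 1280 ALUs for a hypothetical 40-array full chip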
 
The integrated low-end models with 40 shaders already have SIMD width of 8, so it would not be a very hard thing to change.
 
Was just thinking if this would also require other, more major changes. The lowend chips haven't had 40 shaders since HD3 series, though, HD4 and 5 series have 80.
 
The 43xx and 54xx have 80 shaders, but even the newest integrated chipsets have only 40.
 
The low-end chips might not have had 40 shaders in total since RV710, but they certainly don't have a SIMD width of 16. RV610/RV615 (and the IGPs) have a SIMD width of 4; RV710/RV730/Cedar have a SIMD width of 8.
I very seriously doubt it's worth the trouble (due to increased control overhead; FWIW, I don't think it's a useful choice for Cedar either). One change this would need is that two SIMDs would have to share a texture unit (could be done), unless you'd want to double those too... And if you do that, you can't have an odd SIMD count (I don't think you can have an odd SIMD count anyway for the dual-rasterizer Cypress either)...
 
An increase from 20 "cores" on the high-end to 35 on the mid-range is not going to happen. The increase in scheduler overhead is far too great for no perceived benefit. The effective SIMD width is 64 (four clocks), they can keep the physical hardware 16 wide and just reduce batch size to 32 (two clocks).
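
To put trivial numbers on that: the batch size is just the physical SIMD width times the number of clocks a batch occupies, so halving the pumping halves the batch with no extra SIMDs for the scheduler to manage.

Code:
# Batch (wavefront) size = physical SIMD width x clocks per batch.
physical_width = 16
print(physical_width * 4)  # 64-wide batches, quad-pumped, as today
print(physical_width * 2)  # 32-wide batches on the same 16-wide hardware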
 
A lot of what the hardware does seems to work in groups of 4 clocks. The register read logic and the loading of data from the TEX units to the SIMD registers all seem to assume that there are 4 cycles to work with.

Reducing the batch size to 32 would keep the physical SIMD width the same, but changing the underlying assumptions each unit makes about the architecture could mean that what feeds that SIMD would look pretty different.
 
Going from quad-pumping to double-pumping would require switching to a dual-ported register file in order to read and write the same number of registers (12 + 4, IIRC) per VLIW instruction as before.
 
Nvidia did it going to GF100 so it's not impossible. The register file has to produce all the inputs for a half-warp each cycle as opposed to half that on G80/GT200. This is assuming of course that AMD even cares about reducing batch size.

I'm not that familiar with AMD's issue logic but why is it necessary to increase register file bandwidth just because you're halving the number of threads and clocks per batch?
 
The operand read process takes 3 cycles. This is actually physically exposed by the VLIW ISA.

The way the texture unit reads are pipelined to feed the ALU registers may not work at 2 cycles. The 4 TEX units would have half the time they normally have to read in values.
 
Nvidia did it going to GF100 so it's not impossible. The register file has to produce all the inputs for a half-warp each cycle as opposed to half that on G80/GT200. This is assuming of course that AMD even cares about reducing batch size.
That's easy because they are fetching operands for two *different* threads which by definition will never collide, so it's just a matter of banking, no need to add a second port.
I'm not that familiar with AMD's issue logic but why is it necessary to increase register file bandwidth just because you're halving the number of threads and clocks per batch?
AFAIU their RF is split into 4 banks and they can read one register per clock per bank *for the entire wavefront* (regs are 64 words wide). In order to read 12 operands you need 3 clock cycles, and I guess the fourth cycle is used to write up to 4 regs back to the RF.
If you only have 2 clocks to read and write data from/to the RF, your effective register bandwidth per VLIW instruction is reduced.
Don't get me wrong, I am not saying there are no ways to work around this limitation (Nvidia's approach, more ports, etc.), but it's probably not as straightforward as it may appear.
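
Rough numbers for that, as a Python sketch using the 4-bank figure from this post and the 12 + 4 registers per VLIW instruction mentioned earlier (the one-port-per-bank baseline is my assumption):

Code:
# Register file accesses available vs. needed per VLIW instruction.
# Figures from the thread: 4 banks, 12 source reads + up to 4 writebacks.
BANKS = 4
NEEDED = 12 + 4  # reads + writes per VLIW instruction

def available(clocks, ports_per_bank=1):
    return BANKS * ports_per_bank * clocks

print(available(4) >= NEEDED)                    # True:  quad-pumped, 16 accesses
print(available(2) >= NEEDED)                    # False: double-pumped, only 8
print(available(2, ports_per_bank=2) >= NEEDED)  # True:  hence the second port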
 
http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf

Slide 9. Exp is shader export and Interp is data written by the fixed-function interpolators into the register file (obviously no longer applies on Evergreen).

In theory that Interp cycle is spare now. Hmm, need to think about that, maybe they're using it. Or maybe that'll be something in the next family?

I think there's another version of that slide with more detail (mentioning LDS on it). Or maybe that's just me remembering the last time I talked about this, since in R700 LDS and TEX are basically sharing a data path into/out from the register file.

As for register bandwidth in NVidia:

http://forum.beyond3d.com/showthread.php?t=58077

Not sure how GF100 pans out.
 
Oh OK, thanks guys. I didn't know operand fetch was pipelined over inputs like that for the whole wavefront; I thought it was done across threads, similar to Nvidia's approach (at least for G80/GT200). Also, Nvidia buffers their operands rather than pulling them on the fly during execution, so that's another wrinkle.

So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts? According to the slide Jawed linked, it seems it's predetermined that the three inputs will be read in three consecutive cycles. So why 4 banks?
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
AFAIK you might need all operands even if you don't have conflicts, although in practice I don't know how often that happens.
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
See the pdf I posted. It's a source per clock. There's no concept of a collision for register reads by ALUs. The compiler works everything out in advance.

There is a collision with shader export for the third clock - that's a scheduling problem amongst RF clients (i.e. the sequencer's problem), and presumably export stalls in favour of ALUs.
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
AMD uses a mostly VLIW architecture. Little that gets done will happen without being set in the instruction encodings.
Nvidia does not expose that low-level detail, so it is allowed to buffer operands as it sees fit.

The way the process is documented, there are 4 banks. Every cycle, one value from each bank is loaded into a single 4-element GPR0, GPR1, or GPR2 register, with the number depending on the cycle.
There are six bank-swizzle bit combinations that determine which GPR a given instruction operand will source from.
If an instruction's operands cannot meet the operand-source and cycle restrictions, they cannot be put in the same VLIW instruction.
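
A toy model of that constraint in Python (an illustrative simplification, not the exact ISA rules): assign each source operand to one of the three read cycles, with at most one value per bank per cycle, the six permutations of the cycles standing in for the six bank-swizzle encodings.

Code:
from itertools import permutations

def find_swizzle(operand_banks):
    """operand_banks: the bank index (0-3) each source operand lives in."""
    for cycles in permutations(range(3), len(operand_banks)):
        used = set()  # (cycle, bank) read slots already claimed
        for cycle, bank in zip(cycles, operand_banks):
            if (cycle, bank) in used:
                break  # two operands want the same bank in the same cycle
            used.add((cycle, bank))
        else:
            return cycles  # legal: operand i is read in cycle cycles[i]
    return None  # no swizzle fits; the ops can't share one VLIW instruction

print(find_swizzle([0, 0, 0]))  # same bank three times -> one read per cycle
print(find_swizzle([1, 3, 1]))  # mixed banks are also schedulable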
 