AMD: R9xx Speculation

No one dares to make a game that will use more than 1 GB of video memory if there are no 2 GB cards available. Mod packs with high-res textures (I mean really high-res, not shoddy console-port quality) can make a massive difference on 2 GB cards.
And it would finally be time that you stop seeing things like this: http://www.pcgameshardware.de/aid,6...en-jetzt-mit-neuen-Gruseltapeten/Spiele/News/

Metro 2033 is clearly one of those games, though in DX10 it will work fine with 1 GB. I think Crysis Warhead can see a measurable improvement from going to 2 GB at 1920×1200 with 8x AA, too.

There are few such cases now, but they're bound to get more common.
 
Indeed, 1120 divided by 64 ALUs per array gives 17.5 arrays.

Though couldn't the SIMD width be changed from 16?
If the SIMD width were 8, it would give 32 ALUs per SIMD, resulting in 35 arrays on a "Pro / xx50" model, which could have some arrays disabled from the full chip; the full chip could be, for example, 40 arrays, resulting in 1280 ALUs.

Those could easily be reasonable counts for a 6700 lineup.
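
As a back-of-the-envelope check of those numbers, here is a quick Python sketch. Note the 4-wide VLIW units and the 40-array full chip are this thread's speculation, not confirmed specs:

Code:
# Speculative config math from the posts above. The VLIW width of 4 and
# the 40-array full chip are assumptions from this thread, not known specs.
VLIW_WIDTH = 4

def arrays_needed(total_alus, simd_width, vliw_width=VLIW_WIDTH):
    """How many SIMD arrays a given total ALU count implies."""
    return total_alus / (simd_width * vliw_width)

print(arrays_needed(1120, 16))  # 17.5 -- width-16 arrays don't divide evenly
print(arrays_needed(1120, 8))   # 35.0 -- width-8 arrays give a clean count
print(40 * 8 * VLIW_WIDTH)      # 1280 ALUs for a hypothetical 40-array full chip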
 
The integrated low-end models with 40 shaders already have SIMD width of 8, so it would not be a very hard thing to change.
 
Was just thinking if this would also require other, more major changes. The lowend chips haven't had 40 shaders since HD3 series, though, HD4 and 5 series have 80.
 
The 43xx and 54xx have 80 shaders, but even the newest integrated chipsets have only 40.
 
The low-end chips might not have had 40 shaders in total since RV710, but they certainly don't have a SIMD width of 16. RV610/RV615 (and the IGPs) have a SIMD width of 4; RV710/RV730/Cedar have a SIMD width of 8.
I very seriously doubt it's worth the trouble (due to increased control overhead; FWIW, I don't think it's a useful choice for Cedar either). One change this would need is that two SIMDs would have to share a texture unit (could be done), unless you'd want to double those too... And if you do that, you can't have an odd SIMD count (I don't think you can have an odd SIMD count anyway for the dual-rasterizer Cypress either)...
 
An increase from 20 "cores" on the high-end to 35 on the mid-range is not going to happen. The increase in scheduler overhead is far too great for no perceived benefit. The effective SIMD width is 64 (four clocks), they can keep the physical hardware 16 wide and just reduce batch size to 32 (two clocks).
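
To put trivial numbers on that: the batch size is just the physical SIMD width times the number of clocks a batch occupies, so halving the pumping halves the batch with no extra SIMDs for the scheduler to manage.

Code:
# Batch (wavefront) size = physical SIMD width x clocks per batch.
physical_width = 16
print(physical_width * 4)  # 64-wide batches, quad-pumped, as today
print(physical_width * 2)  # 32-wide batches on the same 16-wide hardware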
 
A lot of what the hardware does seems to work in groups of 4 clocks. The register read logic and the loading of data from the TEX units to the SIMD registers all seem to assume that there are 4 cycles to work with.

Reducing the batch size to 32 would keep the physical SIMD width the same, but changing the underlying assumptions each unit makes about the architecture could mean that what feeds that SIMD would look pretty different.
 
Going from quad-pumping to double-pumping would require switching to a dual-ported register file in order to read and write the same number of registers (12 + 4, IIRC) per VLIW instruction as before.
 
Nvidia did it going to GF100 so it's not impossible. The register file has to produce all the inputs for a half-warp each cycle as opposed to half that on G80/GT200. This is assuming of course that AMD even cares about reducing batch size.

I'm not that familiar with AMD's issue logic but why is it necessary to increase register file bandwidth just because you're halving the number of threads and clocks per batch?
 
The operand read process takes 3 cycles. This is actually physically exposed by the VLIW ISA.

The way the texture unit reads are pipelined to feed the ALU registers may not work at 2 cycles. The 4 TEX units would have half the time they normally have to read in values.
 
Nvidia did it going to GF100 so it's not impossible. The register file has to produce all the inputs for a half-warp each cycle as opposed to half that on G80/GT200. This is assuming of course that AMD even cares about reducing batch size.
That's easy because they are fetching operands for two *different* threads which by definition will never collide, so it's just a matter of banking, no need to add a second port.
I'm not that familiar with AMD's issue logic but why is it necessary to increase register file bandwidth just because you're halving the number of threads and clocks per batch?
AFAIU their RF is split into 4 banks and they can read one register per clock per bank *for the entire wavefront* (regs are 64 words wide). In order to read 12 operands you need 3 clock cycles, and I guess the fourth cycle is used to write up to 4 regs back to the RF.
If you only have 2 clocks to read and write data from/to the RF, your effective register bandwidth per VLIW instruction is reduced.
Don't get me wrong, I am not saying there are no ways to work around this limitation (Nvidia's approach, more ports, etc.), but it's probably not as straightforward as it may appear.
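
Rough numbers for that, as a Python sketch using the 4-bank figure from this post and the 12 + 4 registers per VLIW instruction mentioned earlier (the one-port-per-bank baseline is my assumption):

Code:
# Register file accesses available vs. needed per VLIW instruction.
# Figures from the thread: 4 banks, 12 source reads + up to 4 writebacks.
BANKS = 4
NEEDED = 12 + 4  # reads + writes per VLIW instruction

def available(clocks, ports_per_bank=1):
    return BANKS * ports_per_bank * clocks

print(available(4) >= NEEDED)                    # True:  quad-pumped, 16 accesses
print(available(2) >= NEEDED)                    # False: double-pumped, only 8
print(available(2, ports_per_bank=2) >= NEEDED)  # True:  hence the second port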
 
http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf

Slide 9. Exp is shader export and Interp is data written by the fixed-function interpolators into the register file (obviously no longer applies on Evergreen).

In theory that Interp cycle is spare now. Hmm, need to think about that, maybe they're using it. Or maybe that'll be something in the next family?

I think there's another version of that slide with more detail (mentioning LDS on it). Or maybe that's just me remembering the last time I talked about this, since in R700 LDS and TEX are basically sharing a data path into/out from the register file.

As for register bandwidth in NVidia:

http://forum.beyond3d.com/showthread.php?t=58077

Not sure how GF100 pans out.
 
Oh OK, thanks guys. I didn't know operand fetch was pipelined over inputs like that for the whole wavefront; I thought it was done across threads, similar to Nvidia's approach (at least for G80/GT200). Also, Nvidia buffers their operands rather than pulling them on the fly during execution, so that's another wrinkle.

So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts? According to the slide Jawed linked, it seems it's predetermined that the three inputs will be read in three consecutive cycles. So why 4 banks?
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
AFAIK you might need all operands even if you don't have conflicts, although in practice I don't know how often that happens.
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
See the pdf I posted. It's a source per clock. There's no concept of a collision for register reads by ALUs. The compiler works everything out in advance.

There is a collision with shader export for the third clock - that's a scheduling problem amongst RF clients (i.e. the sequencer's problem), and presumably export stalls in favour of ALUs.
 
So given that they can theoretically read 4 operands per clock for the entire wavefront, is the 3-cycle requirement just a worst-case scenario in the event of bank conflicts?
AMD uses a mostly VLIW architecture. Little that gets done will happen without being set in the instruction encodings.
Nvidia does not expose that low-level detail, so it is allowed to buffer operands as it sees fit.

The way the process is documented, there are 4 banks. Every cycle, one value from each bank is loaded into a single 4-element GPR0, GPR1, or GPR2 register, with the number depending on the cycle.
There are six bank-swizzle bit combinations that determine which GPR a given instruction operand will source from.
If an instruction's operands cannot meet the operand-source and cycle restrictions, they cannot be put in the same VLIW instruction.
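
A toy model of that constraint in Python (an illustrative simplification, not the exact ISA rules): assign each source operand to one of the three read cycles, with at most one value per bank per cycle, the six permutations of the cycles standing in for the six bank-swizzle encodings.

Code:
from itertools import permutations

def find_swizzle(operand_banks):
    """operand_banks: the bank index (0-3) each source operand lives in."""
    for cycles in permutations(range(3), len(operand_banks)):
        used = set()  # (cycle, bank) read slots already claimed
        for cycle, bank in zip(cycles, operand_banks):
            if (cycle, bank) in used:
                break  # two operands want the same bank in the same cycle
            used.add((cycle, bank))
        else:
            return cycles  # legal: operand i is read in cycle cycles[i]
    return None  # no swizzle fits; the ops can't share one VLIW instruction

print(find_swizzle([0, 0, 0]))  # same bank three times -> one read per cycle
print(find_swizzle([1, 3, 1]))  # mixed banks are also schedulable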
 