AMD: R7xx Speculation

Status
Not open for further replies.
I was just reading up on a number of posts from a few days ago, and I got a bit confused. I hope someone can take the time to explain a few things to me.

From what I've read, R600 (and derivatives) looks a bit like this:
R600 has 4 SIMD arrays of 16 5-wide ALU blocks. Each SIMD unit runs a batch, which would be one thread of instructions on 64 unique data-objects (64 pixels/vertices/primitives). There are only 16 units in the SIMD, so it takes 4 loops to complete one instruction, after that either the next instruction or another thread/batch is scheduled, for instance when the thread has to wait for a texture lookup. Am I correct so far? Please correct me where I'm wrong, I'm definitely not at home in architecture land.
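Just to check my own understanding, here's the arithmetic of that model as a quick Python sketch (the figures are the widely reported R600 numbers, so treat this as a model of the rumours, not a confirmed datapath):

```python
# Sketch of the R600-style execution model described above:
# a 16-wide SIMD runs one instruction of a 64-element batch in 4 clocks.

SIMD_WIDTH = 16        # physical ALU blocks per SIMD array
BATCH_SIZE = 64        # pixels/vertices/primitives per batch (thread)

def clocks_per_instruction(batch_size, simd_width):
    """Clocks one SIMD needs to apply one instruction to a whole batch."""
    return batch_size // simd_width

assert clocks_per_instruction(BATCH_SIZE, SIMD_WIDTH) == 4  # the "4 loops"

# Per-chip ALU throughput: 4 SIMD arrays x 16 blocks x 5-wide ALUs
NUM_SIMDS, ALU_WIDTH = 4, 5
ops_per_clock = NUM_SIMDS * SIMD_WIDTH * ALU_WIDTH
print(ops_per_clock)   # 320 scalar ops/clock, matching R600's "320 SPs"
```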

If the above story is correct, it's got me a bit confused. Why have batches of 64? Wouldn't it be more efficient to use 16 wide batches, thus giving better branching granularity?

Also, I read this post by Jawed:
I've been talking about the way the TU is constructed, hypothesising that it's a monolithic unit in RV670, with each TEX instruction running for 4 clocks. If RV770 is the same, then this enforces a batch size of 128 on the SIMDs (since a TU batch is assumed to be 32 wide * 4 clocks). So the basic design choices restrict the options for SIMD width. Only 5 SIMDs each 32 wide fits.
This got me a bit confused. If the TEX instruction takes 4 clocks, does that mean no other TEX instruction can be executed for 4 clocks? If so, wouldn't that mean each batch (on the ALUs) would have to wait after each TEX instruction? That wouldn't fit with what I thought above about each batch running the same operation for four cycles. Or is a TMU pipelined somehow, accepting a new command each clock, but taking 4 clocks before it's done?

If you don't mind me asking some more questions, how do memory latencies fit into this story? Is the TMU not directly working on the textures in memory? Does it assume the right data is already in some cache, waiting to be used? Otherwise I cannot see how each operation could always take exactly four cycles.
 
From what I've read, R600 (and derivatives) looks a bit like this:
R600 has 4 SIMD arrays of 16 5-wide ALU blocks. Each SIMD unit runs a batch, which would be one thread of instructions on 64 unique data-objects (64 pixels/vertices/primitives). There are only 16 units in the SIMD, so it takes 4 loops to complete one instruction, after that either the next instruction or another thread/batch is scheduled, for instance when the thread has to wait for a texture lookup. Am I correct so far?

You got it.

If the above story is correct, it's got me a bit confused. Why have batches of 64? Wouldn't it be more efficient to use 16 wide batches, thus giving better branching granularity?

Good question and one I'd like to see the answer to myself. I don't recall seeing anything that explicitly limits R600 to a minimum batch size of 64. Maybe instruction dispatch runs at that rate. Or it could just be a design decision to facilitate more effective use of bandwidth. Larger batch = more coherent reads right?

This got me a bit confused. If the TEX instruction takes 4 clocks, does that mean no other TEX instruction can be executed for 4 clocks? If so, wouldn't that mean each batch (on the ALUs) would have to wait after each TEX instruction? That wouldn't fit with what I thought above about each batch running the same operation for four cycles. Or is a TMU pipelined somehow, accepting a new command each clock, but taking 4 clocks before it's done?

If you don't mind me asking some more questions, how do memory latencies fit into this story? Is the TMU not directly working on the textures in memory? Does it assume the right data is already in some cache, waiting to be used? Otherwise I cannot see how each operation could always take exactly four cycles.

Since the TMU's are also threaded my assumption was that the four clocks were required for filtering after the data was retrieved. TMU threads would sleep after requesting the texel info and wake up once it was returned.
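For the pipelining part of the question, here's a toy model of what a pipelined unit would look like: it accepts a new request every clock but each result takes four clocks to emerge, so the 4-clock figure is latency, not throughput. Purely illustrative, not ATI's actual hardware:

```python
from collections import deque

# Toy model of a pipelined texture unit: one new request accepted per
# clock, each result emerging PIPE_DEPTH clocks later. Throughput stays
# one-per-clock despite the 4-clock latency.
PIPE_DEPTH = 4

def run_tmu(requests):
    pipe = deque([None] * PIPE_DEPTH)   # the pipeline stages
    done = []
    clock = 0
    reqs = deque(requests)
    while reqs or any(stage is not None for stage in pipe):
        out = pipe.pop()                # request leaving the pipeline
        if out is not None:
            done.append((clock, out))
        pipe.appendleft(reqs.popleft() if reqs else None)
        clock += 1
    return done

# 4 requests issued back-to-back complete on clocks 4, 5, 6, 7:
print(run_tmu(["t0", "t1", "t2", "t3"]))
```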
 
Why have batches of 64? Wouldn't it be more efficient to use 16 wide batches, thus giving better branching granularity?

Yes, but you can do much more with 4 cycles than with 1.

In a way, you're running your pipeline at 1/4 of the clock speed, which gives you more time to set up data for the next stage.

Maybe the instruction RAMs are not fast enough to run at 900MHz and you need multiple access cycles?
Maybe they fetch different operands in different clock cycles to reduce the number of ports in the register file?

I assume it also has some positive consequences for the number of in-flight batches they need to keep track of for latency hiding, and that it increases texture-fetch coherence.
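A back-of-the-envelope version of that batch-tracking point, with an entirely hypothetical latency figure and instructions-per-batch count just to show the ratio:

```python
import math

# To cover a fixed memory latency the scheduler needs enough runnable
# batches that the SIMD never idles. Wider batches hold the SIMD longer
# per instruction, so fewer of them need tracking. All numbers below
# are illustrative assumptions, not R600 specs.
SIMD_WIDTH = 16
LATENCY = 256           # hypothetical clocks of latency to hide
INSTRS_READY = 4        # runnable ALU instructions per batch, assumed

def batches_needed(batch_size):
    clocks_per_instr = batch_size // SIMD_WIDTH
    clocks_per_batch = INSTRS_READY * clocks_per_instr
    return math.ceil(LATENCY / clocks_per_batch)

print(batches_needed(64))   # 16 batches of 64 to track
print(batches_needed(16))   # 64 batches of 16: 4x the bookkeeping
```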
 
There are only 16 units in the SIMD, so it takes 4 loops to complete one instruction, after that either the next instruction or another thread/batch is scheduled, for instance when the thread has to wait for a texture lookup.
Yeah, the SIMD executes batches A and B in the pattern AAAABBBBAAAABBBB etc. As far as I can tell there's no need for any connection between the two batches, e.g. A could be a pixel shader and B could be a geometry shader. But I don't know of any definitive evidence - merely that this seems to be the model of Xenos too.
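That interleave pattern is trivial to generate, which also makes the scheduling explicit (clocks-per-instruction of 4 assumed from the 64-wide batch on a 16-wide SIMD):

```python
# The AAAABBBB interleave described above: two independent batches share
# one SIMD, each instruction iterating for 4 clocks before control
# switches to the other batch.
def interleave(batches, clocks_per_instr=4, instrs=2):
    schedule = []
    for _ in range(instrs):
        for b in batches:
            schedule.extend([b] * clocks_per_instr)
    return "".join(schedule)

print(interleave(["A", "B"]))   # AAAABBBBAAAABBBB
```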

If the above story is correct, it's got me a bit confused. Why have batches of 64? Wouldn't it be more efficient to use 16 wide batches, thus giving better branching granularity?
It's a trade-off between the complexity of design required to decode instructions and fetch operands versus the lost efficiency when branching occurs.

It's similar to the trade-off made in using a 5-way ALU - instead of, say, 2-way. 5-way generally works well but will lose efficiency when the code is sequential and scalar (worst case is 20% utilisation).

Both the clocking and the width basically mean the ALUs are very dense.
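The 20% worst case above is just lane arithmetic on the 5-way VLIW slot:

```python
# Worst case for a 5-way ALU: fully dependent scalar code can only fill
# 1 of the 5 lanes per instruction slot.
ALU_WIDTH = 5

def utilisation(lanes_filled):
    return lanes_filled / ALU_WIDTH

print(utilisation(1))   # 0.2 -> worst case, sequential scalar code
print(utilisation(5))   # 1.0 -> best case, e.g. vec4 + scalar packed
```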

This got me a bit confused. If the TEX instruction takes 4 clocks, does that mean no other TEX instruction can be executed for 4 clocks? If so, wouldn't that mean each batch (on the ALUs) would have to wait after each TEX instruction? That wouldn't fit with what I thought above about each batch running the same operation for four cycles. Or is a TMU pipelined somehow, accepting a new command each clock, but taking 4 clocks before it's done?
I expect the TU is pipelined in just the same way that the ALU is and I'd go as far as to suggest that the same AAAABBBBAAAABBBB pattern exists there too. I don't know how things like anisotropic filtering or filtering for fp32 formatted texels are scheduled.

The basic idea is to have 10s of batches available. So while batches A and B are being textured, batches C and D are in the ALUs. The ALU could get through a pile of batches before A and B texture results return (depends on the complexity of the texturing clause and how long it is, since a texturing clause can seemingly generate up to 8 texture results per pixel for a single batch). So the SIMD might have run through batches C, D, E, F, G, H. Say that the program contains 20 instructions that can execute before the next texture result is required. So in this case that's 6 batches * 20 instructions * 4 clocks = 480 clocks.
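The clock budget from that example, written out:

```python
# The worked example above: ALU clocks the SIMD stays busy on other
# batches while A and B wait for their texture results.
BATCHES_IN_FLIGHT = 6     # C, D, E, F, G, H
INSTRS_BEFORE_TEX = 20    # instructions runnable before the next texture
CLOCKS_PER_INSTR = 4      # 64-wide batch on a 16-wide SIMD

busy_clocks = BATCHES_IN_FLIGHT * INSTRS_BEFORE_TEX * CLOCKS_PER_INSTR
print(busy_clocks)        # 480 clocks of guaranteed ALU work
```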

All that matters is that there are enough instructions within a big enough set of batches that the SIMD doesn't fall idle waiting for texture results. These instructions are only capable of being executed because they have texture results already or don't need texture results.

If you don't mind me asking some more questions, how do memory latencies fit into this story? Is the TMU not directly working on the textures in memory? Does it assume the right data is already in some cache, waiting to be used? Otherwise I cannot see how each operation could always take exactly four cycles.
There's a 256KB L2 cache.

I think the rasteriser effectively drives the L2 cache, getting it to pre-fetch texels ahead of when they're needed. So, as the rasteriser generates the position of 16 pixels each clock, it gets the interpolators to generate the coordinates for the required texels associated with those pixels and the cache then takes those coordinates, gets the addressing unit to generate addresses and gathers the required texels from memory (if they're not already in L2).
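A toy model of that flow, just to make the gather step concrete: pixel positions become texel addresses, and only the addresses the L2 doesn't already hold go out to memory. The scaling and cache representation here are entirely illustrative, not ATI's real addressing logic:

```python
# Rasteriser -> interpolator -> L2 gather, as a cache-miss filter.
def prefetch(pixel_coords, tex_scale, l2_cache):
    """Return the texel addresses that must be fetched from memory."""
    needed = {(int(x * tex_scale), int(y * tex_scale))
              for (x, y) in pixel_coords}
    misses = needed - l2_cache
    l2_cache |= misses          # gather the missing texels into L2
    return misses

l2 = {(0, 0), (0, 1)}           # texels already resident in L2
pixels = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(sorted(prefetch(pixels, 2, l2)))  # only the two misses are fetched
```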

This is only possible with regular mapped texels - it doesn't work when the shader is required to calculate texture coordinates. In this case of dependent texturing there clearly needs to be some signalling that indicates that the texels are ready to be used, once they're sat in L2.

Jawed
 
Yeah, the SIMD executes batches A and B in the pattern AAAABBBBAAAABBBB etc.

Are you referring to "interleaving mode"? AFAIK interleaving means that in R600 each SIMD receives (from the arbiters) two thread batches per clock, with one sub-SIMD (composed of 8 SPs) working on the first batch and the other sub-SIMD working on the second batch. But in this case (with branching granularity = 64) 8 cycles (not 4) are needed to complete a batch. Where am I wrong? :smile:
 
Are you referring to "interleaving mode"? AFAIK interleaving means that in R600 each SIMD receives (from the arbiters) two thread batches per clock, with one sub-SIMD (composed of 8 SPs) working on the first batch and the other sub-SIMD working on the second batch. But in this case (with branching granularity = 64) 8 cycles (not 4) are needed to complete a batch. Where am I wrong? :smile:
I've never heard of the concept of sub-SIMDs in R6xx. Doesn't really make sense as that makes them two separate SIMDs. Though you could argue that the two would at least be sharing a single register file.

Jawed
 
FUDzilla: http://www.fudzilla.com/index.php?option=com_content&task=view&id=6829&Itemid=1

We learned that despite much talk RV670 won't be going for A12 silicon version. The last version remains A11 as RV770 sets to replace this chip in late Q2.

As late Q2 usually means June, we are not so far off from a new chip, and we learned that in spite of that, partners believe AMD won't have this new faster and better revision of RV670. The current version is as good as it needs to be and it will hold water until RV770 comes into play.

So if you still want to buy a Radeon 3870 or 3850, now is a good time, or you can wait for RV770 and its high-end R700 brother that are set to come in the next three months.
 
Does it make any difference? Did R600 utilize 512 bit bus?
No and currently I don't see how a "small chip" (which is pretty much definite now) with "400SPs/16TUs/16Colour/64Z" (still a guess) could justify it.

Of course, RV770 could have a 128-bit bus...

Jawed
 
512bit memory interface was definitely cheaper to implement than new high-speed memory modules. 4 Z-samples per clock (and possibly much faster MSAA resolve) could be the reason.
 
Hmm, anyone willing to believe RV770, per chip, will utilise ~128GB/s?

Depends both on the particulars of the rest of the chip + clock frequency (i.e. additional resource need), and of course application. Given some of the rumoured specs and clock speed, a doubling is pretty much in line with the current balance.
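For reference, the ~128GB/s figure is just bus width times effective (double-data-rate) memory clock. The GDDR5 speed below is a hypothetical example, not a leaked spec; the R600 line uses its known shipping numbers:

```python
# Bandwidth = bytes per transfer * effective transfer rate.
def bandwidth_gbs(bus_bits, effective_mhz):
    return bus_bits / 8 * effective_mhz * 1e6 / 1e9

# 256-bit bus with hypothetical 4 Gbps GDDR5 (4000 MHz effective):
print(bandwidth_gbs(256, 4000))   # 128.0 GB/s
# R600 for comparison: 512-bit bus, 1656 MHz effective GDDR3:
print(bandwidth_gbs(512, 1656))   # ~106 GB/s
```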

And GDDR5 is going to have horrendous latency, so even though traditional graphics processing is very forgiving of that, it must lead to some reduction in effective bus utilization.
 
How do you know?
According to a local former AMD PR, GDDR3 is 30% more price/performance effective than GDDR4 (for today's 512MB products). I expect the RV770 single-core flagship product to use 1GB of graphics memory, so the difference could be even more significant.

512-bit R600 used a less complex (and smaller) PCB than 384-bit G80. I don't expect that the die space consumed by the extra 256 bits for a 512-bit interface (or 128 bits, if we compare to G80) caused any significant increase in production costs (definitely not higher than the hypothetical usage of faster memory modules, which would bring similar bandwidth using a narrower interface).
 
According to a local former AMD PR, GDDR3 is 30% more price/performance effective than GDDR4 (for today's 512MB products). I expect the RV770 single-core flagship product to use 1GB of graphics memory, so the difference could be even more significant.

512-bit R600 used a less complex (and smaller) PCB than 384-bit G80. I don't expect that the die space consumed by the extra 256 bits for a 512-bit interface (or 128 bits, if we compare to G80) caused any significant increase in production costs (definitely not higher than the hypothetical usage of faster memory modules, which would bring similar bandwidth using a narrower interface).

The difference in price between 512MB of GDDR3 and GDDR4 is $10.
The 512bit PCB is more complex, and the core of the chip is more complex and larger too.

Also GDDR5 has more latency, but a 512bit bus has more ring-bus stops in the core and more latency too.

In the end it is much simpler to pick GDDR5 and keep the core of the graphics chip small and simple.

RV670's 256bit bus is good and R600's 512bit bus was a mistake. So with RV770 you only have to speed up the GDDR.
 
The next in line is the RV770Pro, which should be basically the same, but lower GPU and memory frequencies, and possibly GDDR3 http://www.nordichardware.com/news,7638.html

Hmm, anyone willing to believe RV770, per chip, will utilise ~128GB/s?


I was wondering about the RV770Pro that will utilize GDDR3 memory: why would they not use a 512bit bus instead of 256bit for the RV770Pro, plus a lower memory frequency, just like R600 w/GDDR3, which gave ~105GB/s bandwidth?
 
I was wondering about the RV770Pro that will utilize GDDR3 memory: why would they not use a 512bit bus instead of 256bit for the RV770Pro, plus a lower memory frequency, just like R600 w/GDDR3, which gave ~105GB/s bandwidth?

no-X said:
Because RV770 is not large enough to fit a 512bit interface?

Or probably they (AMD/ATi) have learnt their lesson with R600's 512bit bus already, and it might not be worth going down that route again.

Based on my understanding (which might be wrong), with a 512bit bus it's not only the size of the die that matters but also the complexity of the PCB layout. The narrower-bus decision might indicate that they want to keep cost and complexity as low as possible.
 
Just a wild guess here but could a 512bit bus restrict what they can do in terms of multiple core GPU configs? I.e. perhaps R700 (2xRV770) wouldn't be possible if RV770 utilised a 512bit memory interface?

Too much complexity on the PCB perhaps?
 