AMD: R7xx Speculation

How much would an extra 16 TMUs increase the transistor count?
Dunno, but it'll be interesting to see whether cache size is increased. L2 cache in RV630 is 128KB for 8 TUs while in R600 it's 256KB for 16 TUs. Would cache need to double again?

Does RV630 have enough cache, or is it choked by having only 128KB?

Alternatively, perhaps rasterisation rate is the prime determinant of cache size. As far as I can tell, prefetching is the primary feature of the R6xx cache, and prefetching seems to be driven by the rasteriser. The RBE count tells us what the rasterisation rate is. If the RBE count remains the same then perhaps the cache size would remain at 256KB.
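As a back-of-envelope illustration of the two hypotheses (the per-TU figure simply follows from the numbers quoted above; nothing here is a confirmed hardware detail):

```python
# Two hypotheses for R7xx L2 size, using the figures quoted above
# (RV630: 128KB for 8 TUs, R600: 256KB for 16 TUs -> 16KB per TU).

def l2_if_scaled_per_tu(tu_count, kb_per_tu=16):
    """Hypothesis A: L2 tracks the texture unit count."""
    return tu_count * kb_per_tu

def l2_if_tied_to_rasteriser(r600_l2_kb=256):
    """Hypothesis B: L2 is sized for rasterisation rate, so an
    unchanged RBE count keeps the R600 figure."""
    return r600_l2_kb

print(l2_if_scaled_per_tu(32))      # 512 (KB) if cache doubles with TUs
print(l2_if_tied_to_rasteriser())   # 256 (KB) if it tracks the rasteriser
```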

Jawed
 
A. an additional 16 TMUs, making (16+16) 32 TMUs in total
B. an additional 96 pipelines, making (96+64) 160 units (800 stream processors) in total.
So they're texture-fetch limited now, and they'll increase the TMU count by 2x while increasing the SP count by 2.5x? That's strange, to say the least.
Also, I'm having a hard time believing that +16 TMUs and +96 processors would result in a +160M transistor increase.
 
So they're texture-fetch limited now, and they'll increase the TMU count by 2x while increasing the SP count by 2.5x? That's strange, to say the least.
ATI likes to raise ALU:TEX. Whichever rumour you choose to believe, a constant or lowered ALU:TEX has gotta be worth longer odds...

Jawed
 
ATI likes to raise ALU:TEX.
'Likes' isn't really the right word here.
They need more TMU power right now.
I see no reason for them to increase their ALU power further when it's pretty clear that TMUs are the performance-limiting factor for the R6x0 architecture.
While they can certainly do what they 'like', that would mean that they'll lose again.
 
Not necessarily. As long as they have "enough" TMU power the glut of available floating point performance won't disadvantage them any versus the competition.
 
Not necessarily. As long as they have "enough" TMU power the glut of available floating point performance won't disadvantage them any versus the competition.
What's 'enough power'? =) And 32 is still less than 64 -)
Their 320 ALUs are sitting idle right now while the 16 TMUs are fetching values.
If they increase the TMU count to 32 while increasing the ALU count to 800, that would mean their ALUs will idle even more than they do right now.
So why go with 800 SPs if they'll be idle in games most of the time? For some serious synthetic numbers?
I prefer the 32 TMUs / 480 ALUs rumour -- it just makes more sense for them to lower the ALU:TMU ratio in the R7x0 generation.
32 TMUs / 480 ALUs also seems more plausible from the RV770 POV.
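For reference, the ratio arithmetic behind that argument, counting R6xx SPs as 5-wide units (a sketch; the configurations are just the rumoured ones being debated in this thread):

```python
# ALU:TEX issue-rate ratios for the configurations being debated,
# counting R6xx SPs as 5-wide units (320 SPs = 64 units, etc.).
configs = {
    "R600 (320 SP / 16 TMU)":   (320 // 5, 16),
    "Rumour (800 SP / 32 TMU)": (800 // 5, 32),
    "Rumour (480 SP / 32 TMU)": (480 // 5, 32),
}
for name, (alu_units, tmus) in configs.items():
    print(f"{name}: ALU:TEX = {alu_units / tmus:.0f}:1")
# R600 sits at 4:1; 800/32 raises that to 5:1 (ALUs idle even more on
# texture-bound code), while 480/32 lowers it to 3:1.
```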
 
Not necessarily. As long as they have "enough" TMU power the glut of available floating point performance won't disadvantage them any versus the competition.
To be fair, the difference between 480 and 800 is prolly a chunk of die area. For R520->R580 there was a clear need for more ALUs, as R520 was ALU-limited. R520's ALUs were only about 10% of the die; RV670's might be 30%.

If custom logic can ~double the density of ALUs, then there might be some mileage in this. If that doubling can be applied to ALUs, TUs and RBEs then :oops:

Whatever, I still think the ALU:TEX ratio for the 96/32 rumour is a strong indication it's wrong. 96/24, 128/32 or 160/32: take your pick.

Earlier in the thread I talked about the TU being a single unit that's 16 lanes wide in RV670. This means each texturing instruction runs for four clocks, just like an ALU instruction runs for four clocks. I think constancy of clocking is critical; it's something that's not going to change readily as a GPU evolves. R520 and Xenos also seem to have 4-clock TUs (and they definitely have 4-clock ALUs).

If you want RV770's TU to be four clocks per instruction, then it's simply not possible with the 96/32 combination (I'm disallowing a 3 SIMD configuration for the top GPU). Whereas it's fine with:
  • 96/24 - 4 SIMDs of 6 quads (batch size 96)
  • 128/32 - 4 SIMDs of 8 quads (batch size 128)
  • 160/32 - 5 SIMDs of 8 quads (batch size 128)
If the 32 TUs are split across two units each 16 wide, then you get:
  • 96/32 - 6 SIMDs of 4 quads (batch size 64)
  • 128/32 - 8 SIMDs of 4 quads (batch size 64)
  • 160/32 - 10 SIMDs of 4 quads (batch size 64)
For what it's worth, it'd be nice if batch size stayed at 64.
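As a quick sanity check of those configurations, here's the arithmetic spelled out, assuming batch size = lanes per SIMD x 4 clocks (the rule sketched above):

```python
# Batch size under the 4-clock rule: a SIMD is (units / simds) lanes
# wide and each instruction runs for 4 clocks, so a batch covers
# lanes * 4 elements.
def batch_size(units, simds, clocks=4):
    lanes = units // simds
    return lanes * clocks

# (shader units, SIMD count) for each configuration listed above:
single_tu = [(96, 4), (128, 4), (160, 5)]    # one 16-wide TU
dual_tu   = [(96, 6), (128, 8), (160, 10)]   # two 16-wide TUs
for units, simds in single_tu + dual_tu:
    print(f"{units} units / {simds} SIMDs -> batch size {batch_size(units, simds)}")
```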

Jawed
 
Jawed, can you explain exactly why you would rather the batch size stay at 64? I really don't know much about the matter, just curious to learn why.
 
Keeping granularity down would help minimize some of the problems that overly wide SIMD has with utilization.

I think AMD, if it doesn't implode in the next year, might also have an eye on the future.
Its granularity is already twice that of G80's, and will be four times that of Larrabee's. Larrabee's finer granularity may be limited in other ways, but the picture is fuzzy without more data disclosed.

There could be significant pressure to reduce or stabilize batch sizes in future hardware, and AMD isn't dominant enough to really influence this in the upward direction all by itself. If AMD misses on this, it will find itself penalized.

It's also the case that having, say, 10 SIMDs would help with the number of independent program counters an AMD chip could have running per clock.
G80 has 8 clusters on a chip, and Larrabee will have 24 independent cores.
AMD currently has 4.

Control logic expands faster if the SIMD count rises, but I don't see AMD doing well in GPGPU if it's stuck at 4 SIMDs in 2010.
At least with 8 or 10 SIMDs per chip, an X2 would have 16-20 independent clusters, matching somewhat closely with Larrabee, though at a likely slower clock.
 
Jawed, can you explain exactly why you would rather the batch size stay at 64? I really don't know much about the matter, just curious to learn why.
Code that uses dynamic branching pays a divergence penalty on any SIMD processor. Larger batches increase the chances of paying this penalty.

A simple example is a shadowing shader that wants to soften shadow edges. A pixel is either within the softening region (somewhere near the edge of the "hard shadow" before softening) or outside of that region. The shader uses an If statement to decide, which is "dynamic" because the result of this test varies across the screen. Let's say that when a pixel needs softening it takes 4x longer to compute that pixel.

Because pixels are lumped together in batches, if any one pixel in the batch needs softening then the other pixels in the batch are forced to come along for the ride - this is the penalty of SIMD. Those other pixels don't actually get softened because the SIMD unit deactivates them while doing the softening calculations.

Imagine a shadow cast by a wire fence (contrived example but it'll do). If you're looking for soft areas in the shadow then you can see that with 4x4 pixel blocks (batch size 16) there'll be many blocks where there's no softening required. The "holes" in the shadow formed by the wire mesh are big enough that these small blocks fit. Blocks that touch or cross the shadow from the mesh will do softening.

When the block is bigger, say 8x8 (batch size 64), there's less chance that the holes in the shadow are big enough for these bigger blocks to squeeze in. Consequently more blocks (considered as a percentage of blocks required to cover the screen area) will run the "slow" shadow softening code.

So the overall effect of a bigger batch is that more pixels on the screen will "catch" on the "softening" test, even though the total number of pixels that need softening has not changed.

A simple way to imagine this is if the entire screen was rendered as one huge batch. Then all pixels would run the slow softening code even if only one pixel needed softening.
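A toy simulation of that effect (the random scatter of "needs softening" pixels and the 1%-of-screen figure are made-up stand-ins for the wire-fence pattern, purely for illustration):

```python
import random

def slow_block_fraction(width, height, block, soft_fraction=0.01):
    """Toy model: scatter 'needs softening' pixels over the screen and
    count a block as slow if any one of its pixels diverges."""
    random.seed(0)  # deterministic for the illustration
    soft = [[random.random() < soft_fraction for _ in range(width)]
            for _ in range(height)]
    slow = total = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            total += 1
            if any(soft[y][x]
                   for y in range(by, by + block)
                   for x in range(bx, bx + block)):
                slow += 1  # the whole block pays the 4x softening cost
    return slow / total

# Same screen, same softened pixels, different batch shapes:
for block in (4, 8, 16):  # batch sizes 16, 64, 256
    frac = slow_block_fraction(256, 256, block)
    print(f"{block}x{block} blocks (batch {block*block}): "
          f"{frac:.0%} take the slow path")
```

With the same total number of divergent pixels, the fraction of batches that run the slow path climbs steeply as the block gets bigger, which is exactly the point of the wire-fence example.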

Ideally RV770 would have a smaller batch size, e.g. 16. But as GPUs progress, the cost of a smaller batch size is:
  1. an increase in the number of SIMD units, e.g. from 4 to 16 - which means that the transistor cost for scheduling across all these SIMDs is higher
  2. or reducing the per-instruction duration of the ALU pipeline, e.g. from 4 to 2 clocks - which means that register file fetches have to be "wider but shorter" (since R6xx has to juggle operands in a buffer before they can be used by the ALUs) and that instruction execution itself has to be redesigned for fewer clock cycles (which is more difficult and may not be possible without a complete rethink)
  3. or filling the ALU pipeline with more batches, e.g. from 2 to 4 - which increases the scheduling cost as well as increasing the instruction decoding complexity in order to juggle these extra batches
As far as I can tell NVidia has paid all of these costs, relatively speaking, in G80 (which has a batch size of 16). They mitigated the pipeline costs by having fewer operands to juggle per clock (apparently 4 scalars per clock are juggled into place per "pixel", instead of 15) and by investing in the ability to use custom logic (which makes the cost/area of the ALU lower). There are corner cases for G80's dynamic branching that relate to the number of instructions that can be skipped by the If, and there are corner cases for scheduling caused by register pressure. These are other aspects of the trade-offs associated with batch sizing and general batch scheduling.

It's looking very likely that GT200 will have a batch size of 32, for what it's worth.

Jawed
 
It's also the case that having, say, 10 SIMDs would help with the number of independent program counters an AMD chip could have running per clock.
Each batch needs to have an independent program counter. Why is SIMD count relevant?

Control logic expands faster if the SIMD count rises, but I don't see AMD doing well in GPGPU if it's stuck at 4 SIMDs in 2010.
At least with 8 or 10 SIMDs per chip, an X2 would have 16-20 independent clusters, matching somewhat closely with Larrabee, though at a likely slower clock.
Can you explain why this independence is relevant when the batches themselves are independent of each other?

Do you mean independent programs currently loaded? In theory an SM4 GPU has to support at least one each of VS, GS and PS programs concurrently.

Jawed
 
What's 'enough power'? =) And 32 is still less than 64 -)
Their 320 ALUs are sitting idle right now while the 16 TMUs are fetching values.
If they increase the TMU count to 32 while increasing the ALU count to 800, that would mean their ALUs will idle even more than they do right now.

Yeah, I agree with that 100%. But I'm just saying that having an underutilized shader core isn't necessarily a disadvantage vs the competition. I think G80 is a similar case: it has 'enough' shading power and an abundance of probably underused texturing capability, yet it still dominates because it has sufficient muscle where it counts.
 
To be fair, the difference between 480 and 800 is prolly a chunk of die area. For R520->R580 there was a clear need for more ALUs, as R520 was ALU-limited. R520's ALUs were only about 10% of the die; RV670's might be 30%.

If custom logic can ~double the density of ALUs, then there might be some mileage in this. If that doubling can be applied to ALUs, TUs and RBEs then :oops:

True, I'm also wary of the real-estate cost of such a massive expansion of the shader core. In any case, a 480/32 setup probably won't be too far behind 800/32 in most applications. So AMD might just be doing it because they can, not because it's necessary, as was the case with R520.
 
Each batch needs to have an independent program counter. Why is SIMD count relevant?
The "per clock" modifier is part of my point.
In theory, I could run a billion independent batches on a single SIMD, one at a time.
The maximum number of concurrently executing batches at a given instant is the number of independent SIMDs.

Can you explain why this independence is relevant when the batches themselves are independent of each other?
Hardware that can muster more independent execution is more flexible than hardware that cannot.

Do you mean independent programs currently loaded? In theory an SM4 GPU has to support at least one each of VS, GS and PS programs concurrently.

Jawed
Larrabee's first instantiation could, at least in theory, support 24 actively executing programs of various types concurrently.

Will that amount to a hill of beans for consumer graphics? It doesn't seem like it will early on.
I suspect that flexibility will be very useful more immediately in other places.

My personal aesthetic prefers something closer to parity than being an order of magnitude less flexible (and likely clocked half as fast to boot).
Just keeping the batch size the same while scaling ALU count should encourage at least some increase in AMD's SIMD count.
 
The "per clock" modifier is part of my point.
In theory, I could run a billion independent batches on a single SIMD, one at a time.
The maximum number of concurrently executing batches at a given instant is the number of independent SIMDs.

But those SIMDs are pipelined, so you could certainly have multiple batches in flight on a single SIMD. I don't follow... unless you mean something else?
 
But only N batches would actually be executing.
N being the SIMD count.
The rest are doing the electronic equivalent of setup, or worse, twiddling their thumbs.

Unless the base rule changes that every unit in a SIMD must run in lock-step on the exact same instruction, a SIMD is as fine-grained as the hardware can get (serializing execution or predication doesn't make the hardware more flexible; it just keeps it correct).
 
But only N batches would actually be executing.
N being the SIMD count.

Each batch in G80 is 2 to 4 clocks deep. Considering the pipeline must be deeper than that, how is it that only one batch is executing per SIMD? So on an 8-stage pipeline with a 32-pixel batch (8x4 groups), at one instant in time you can have in a single 8-way SIMD:

stage-1 ---- batch 3 group 2
stage-2 ---- batch 3 group 1
stage-3 ---- batch 2 group 4
stage-4 ---- batch 2 group 3
stage-5 ---- batch 2 group 2
stage-6 ---- batch 2 group 1
stage-7 ---- batch 1 group 4
stage-8 ---- batch 1 group 3

Three batches at various stages in the pipeline. Is that not how it works?
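A little sketch of that fill pattern (stage count and batch/group numbering as in the table above; purely illustrative):

```python
# Model an 8-stage pipeline fed round-robin with batches of 4 groups
# (a 32-pixel batch on an 8-way SIMD). Each cycle a new (batch, group)
# enters stage 1 and everything already in flight shifts down a stage.
from collections import deque

STAGES = 8
GROUPS_PER_BATCH = 4

pipeline = deque([None] * STAGES, maxlen=STAGES)
issued = ((batch, group)
          for batch in range(1, 4)
          for group in range(1, GROUPS_PER_BATCH + 1))

for _ in range(10):                 # run 10 cycles
    pipeline.appendleft(next(issued, None))

for stage, slot in enumerate(pipeline, 1):
    label = f"batch {slot[0]} group {slot[1]}" if slot else "(bubble)"
    print(f"stage-{stage} ---- {label}")
```

Running this reproduces the table above: batches 1, 2 and 3 occupy different stages of the same SIMD at the same instant.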
 
Larrabee's first instantiation could, at least in theory, support 24 actively executing programs of various types concurrently.
It needs to, since things like rasterisation, texture-filtering and Z-testing all have to be able to run as independent programs. These programs are out of sight on a current GPU.

Will that amount to a hill of beans for consumer graphics? It doesn't seem like it will early on.
I suspect that flexibility will be very useful more immediately in other places.
As far as I can tell both G80 and R600 only support one kernel executing in GPGPU mode at any one time (shared by all SIMDs). I expect that to change...

When a GPU can time-slice arbitrary kernels I don't see how you can reduce feasibility/utility/performance down to a statement defined by concurrently active kernels. Flexibility normally comes at the cost of performance.

Now, if you were to argue that GPUs will never time-slice kernels upon individual SIMDs, well that's a different matter. I might have the wrong end of the stick here, expecting GPUs to do so :cry:

Clearly, in general, you've got a problem when pipelining kernels (if you want more than one kernel to be active concurrently) as the on-die resources have to be able to manage the rates at which each pair of kernels exchanges data. That stuff is "automatic" (and hidden) for graphics in a GPU but a proper minefield for a stream programmer. Clearly Larrabee's programmers/architects have no choice but to solve this problem for "arbitrary rendering pipelines" such as SM3 or SM4 :p

Jawed
 
Each batch in G80 is 2 to 4 clocks deep. Considering the pipeline must be deeper than that, how is it that only one batch is executing per SIMD? So on an 8-stage pipeline with a 32-pixel batch (8x4 groups), at one instant in time you can have in a single 8-way SIMD:

stage-1 ---- batch 3 group 2
stage-2 ---- batch 3 group 1
stage-3 ---- batch 2 group 4
stage-4 ---- batch 2 group 3
stage-5 ---- batch 2 group 2
stage-6 ---- batch 2 group 1
stage-7 ---- batch 1 group 4
stage-8 ---- batch 1 group 3

Three batches at various stages in the pipeline. Is that not how it works?
Your definition of execution is broader than mine.
My use of the term execution is limited to the time the instruction is actively engaging an ALU, not just getting to it.

I was looking more at R600 and AMD's use of the word SIMD to describe the 16-element array, which in current instantiations is entirely in lock-step, executing the same instruction.
The SIMD has two batches sequenced, but only one is actively executing in a given cycle.
My concerns were mostly in the comparison between an AMD-style SIMD and Larrabee.

If we go with G80's ALUs running on two batches (halves/groups, I forget the term now) in the same cluster, that still leaves AMD behind if it didn't add more SIMDs.
 
The "per clock" modifier is part of my point.
In theory, I could run a billion independent batches on a single SIMD, one at a time.
The maximum number of concurrently executing batches at a given instant is the number of independent SIMDs.

I could be getting very confused, so excuse me. I understood that the way the ultra-threaded dispatch processor is set up, each SIMD has two arbiters and two sequencers. That would mean each SIMD can run two independent programs at once, kind of like hyper-threading, correct? So in theory, R600 has 8 processors?
 