AMD: R7xx Speculation

Status
Not open for further replies.
TF in R600 is baseline fp16 format (int8 textures are converted to fp16 for filtering) - so the filtering units are big. So you're suggesting a doubling of an already very big unit.
I think you're overestimating the size of the TFs. A simple bilinear filter uses 3 lerps - for 4 channels 12. So for 16 TFs more (granted fetch also needs to be increased from 20 values to 32), you'd need 192 lerps more. Given that a lerp should be a bit more complicated than a mad in hardware, this indeed seems to be a lot compared to the shader core - BUT those are "only" fp16 so should be way cheaper in terms of transistor count than shader alus. Though I've no idea how bilinear filtering is implemented in hardware - there might be neat tricks...
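To make the lerp count above concrete, here is a minimal sketch of a bilinear filter built from three lerps per channel, together with the arithmetic from the post. The function and constant names are illustrative only; as the post itself says, real hardware may use tricks that this does not capture.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b by weight t in [0, 1]."""
    return a + (b - a) * t


def bilinear(t00, t10, t01, t11, fx, fy):
    """Filter one channel from a 2x2 texel footprint using three lerps."""
    top = lerp(t00, t10, fx)      # lerp 1: along x on the top row
    bottom = lerp(t01, t11, fx)   # lerp 2: along x on the bottom row
    return lerp(top, bottom, fy)  # lerp 3: along y between the rows


# Figures taken from the post: 3 lerps per channel, 4 channels, 16 extra TFs.
LERPS_PER_CHANNEL = 3
CHANNELS = 4
EXTRA_TFS = 16

lerps_per_tf = LERPS_PER_CHANNEL * CHANNELS  # lerps in one 4-channel filter
extra_lerps = EXTRA_TFS * lerps_per_tf       # lerps added by 16 more TFs

print(lerps_per_tf, extra_lerps)
```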

And I still don't know how the TU architecture would support this. I suspect it's tied down. The ATI guys have talked in the past about being able to scale lots of components individually (e.g. width of ALUs, number of RBEs) but there's not been any hint of varying TA:TF ratio.
I can't see why it couldn't be changed for rv770, unless you assume r7xx is the exact same architecture as r6xx just with a different configuration.
 
Yes, but AFAIK point sampling in pixel shaders (or fetch4) is quite rare compared to bilinear texturing, so it still doesn't make much sense to me.
Maybe - and I'm stretching far here - the TAs are not actually separate units, but some kind of double-pumped ALUs for INT8 needing two clocks for FP16/32.

Has anyone tried to get additional point-samples out of the TU with FP16-Texturing?
 
I'm saying that TU and ALU batch size is the same and also saying that I'm doubtful batch size will increase beyond 64.

We might have 3 SIMDs each 32 sp wide (total 96sp and 480 ALU) with 32 TFU.
TU and ALU batch size would be the same, but now only 2 clocks (instead of 4) would be needed to complete a batch (still with a 64-item batch).
What do you think?
 
Or the batch width could double. One reason, though I don't know if it's the only reason, for the 64 element batch size on a 16-wide SIMD is that there is some latency required to switch out instructions.
The 4-cycle lifetime of a given instruction (8 cycles for the SIMD) would need a more significant rearchitecting of the chip to be shortened.
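The batch arithmetic being argued over can be sketched as follows, assuming the simple model used in the posts: a batch is issued across the SIMD one slice per clock, so clocks per batch is batch size divided by SIMD width. The function name is made up for illustration, and the model ignores the instruction-switch latency mentioned above.

```python
def clocks_per_batch(batch_size, simd_width):
    """Clocks needed to push one batch through a SIMD of the given width."""
    if batch_size % simd_width != 0:
        raise ValueError("batch size must be a multiple of SIMD width")
    return batch_size // simd_width


# R600-style configuration discussed in the thread: 64-item batch, 16-wide SIMD.
current = clocks_per_batch(64, 16)

# Proposal 1: keep the 64-item batch but widen the SIMD to 32.
wider_simd = clocks_per_batch(64, 32)

# Proposal 2: widen the SIMD to 32 and double the batch to 128 items,
# which keeps the per-instruction lifetime unchanged.
wider_batch = clocks_per_batch(128, 32)

print(current, wider_simd, wider_batch)
```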
 
We might have 3 SIMDs each 32 sp wide (total 96sp and 480 ALU) with 32 TFU.
TU and ALU batch size would be the same, but now only 2 clocks (instead of 4) would be needed to complete a batch (still with a 64-item batch).
What do you think?
Doesn't seem feasible to me. Not only would a shader cluster be able to dispatch twice as many instructions (per 4 clocks), but the point of running the same instructions for 4 clocks is (probably) that you know there won't be any register dependencies etc. for these 4 clocks. You can't just change that easily.
 
I think you're overestimating the size of the TFs. A simple bilinear filter uses 3 lerps - for 4 channels 12. So for 16 TFs more (granted fetch also needs to be increased from 20 values to 32), you'd need 192 lerps more. Given that a lerp should be a bit more complicated than a mad in hardware, this indeed seems to be a lot compared to the shader core - BUT those are "only" fp16 so should be way cheaper in terms of transistor count than shader alus. Though I've no idea how bilinear filtering is implemented in hardware - there might be neat tricks...
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.

This may be purely because of the range of SKUs that an architecture needs to cover, something like a 10-fold range in performance.

e.g. on a high end GPU with 2000 ALU lanes at 1GHz there might not be any need for dedicated TF, but on the $30 GPU, a couple of hundred lanes, even at 1GHz, won't be enough.

As to the actual cost of TF, one of these days perhaps we'll have a thread that tries to get to the bottom of it. I don't know how to split-out the cost of TF from the rest of a TU. I'm hazarding a guess that the whole lot is in the region of 125M transistors in R670 (caches, thread arbitration, instruction issue, point addressing, filtered addressing, fetching point samples, fetching for bilinear, filtering). A fair amount of the TU needs sizing up in order to increase the TA:TF ratio.

I can't see why it couldn't be changed for rv770, unless you assume r7xx is the exact same architecture as r6xx just with a different configuration.
Needless to say, I'm pessimistic about the degree of architectural change in R7xx. ATI's designed a set of knobs (SIMD and TU width, SIMD count, RBE count, MC count, MC width) and will frobnicate them for R7xx.

Jawed
 
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.
But on the other side, having fixed-function pipelines filtering textures while the ALUs idle, waiting for the texture units, is not a good balance. AMD and NV are always pushing towards a higher ALU:TEX instruction ratio, and the easiest way to get there is to do more texturing work on the ALUs.
From that point of view, fetch4 would be in hardware anyway.

So my crazy idea is that, instead of just adding new TMUs, AMD could shift more of the workload to the ALUs, making the TMUs 'cheaper' -> more TMUs in the same die size.
 
But on the other side, having fixed-function pipelines filtering textures while the ALUs idle, waiting for the texture units, is not a good balance. AMD and NV are always pushing towards a higher ALU:TEX instruction ratio, and the easiest way to get there is to do more texturing work on the ALUs.
From that point of view, fetch4 would be in hardware anyway.

So my crazy idea is that, instead of just adding new TMUs, AMD could shift more of the workload to the ALUs, making the TMUs 'cheaper' -> more TMUs in the same die size.
When R600 appeared ATI claimed that fp16 filtering is ~7 times faster than in R580 (where this filtering had to be performed by the ALUs). So, ahem, I don't think ATI'll be changing back to ALU-filtering soon.

Although I expect the performance deficit of doing so would be lower with the current TU architecture, since it supports double the texel bandwidth of R580. At least I presume some of the fp16 filtering performance shortfall in R580 is due to its slow fetch rate. Another chunk of shortfall would be the FLOPs available, of course (i.e. ALU:TEX ratio).

Jawed
 
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.

...

e.g. on a high end GPU with 2000 ALU lanes at 1GHz there might not be any need for dedicated TF, but on the $30 GPU, a couple of hundred lanes, even at 1GHz, won't be enough.

...

Jawed

What's considered a lane in R600 architecture?

How many more arrays of SIMDs in R600 (at core speed) would you guesstimate would be needed to do ALL the texture operations in the shaders (mapping, filtering, everything...) at the same performance they have now? Is that even feasible?
 
What's considered a lane in R600 architecture?
There's 320 of them.

How many more arrays of SIMDs in R600 (at core speed) would you guesstimate would be needed to do ALL the texture operations in the shaders (mapping, filtering, everything...) at the same performance they have now? Is that even feasible?
For today's performance, hmm, you'd prolly want about 1200 lanes I guess, ~2TFLOPs?

In theory ALU performance is going to go up much more quickly than TEX performance. In terms of die space ALUs are prolly a lot denser, per pixel, too. At least in R6xx I think so, anyway.

So, ahem, draw two curves and see where they intersect.

Jawed
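As a rough sanity check on the "1200 lanes, ~2TFLOPs" guess, here is a back-of-the-envelope sketch. It assumes each lane issues one MAD per clock, counted as 2 FLOPs, and it assumes a 742 MHz core clock; neither figure is stated in the thread, so treat both as placeholders.

```python
def peak_flops(lanes, clock_hz, flops_per_lane_per_clock=2):
    """Peak FLOP/s, assuming one MAD (2 FLOPs) per lane per clock."""
    return lanes * flops_per_lane_per_clock * clock_hz


R600_LANES = 320          # lane count given in the post above
ALL_ALU_LANES = 1200      # the post's guess for texturing done entirely on ALUs
ASSUMED_CLOCK_HZ = 742e6  # assumption: core clock, not stated in the thread

r600_peak = peak_flops(R600_LANES, ASSUMED_CLOCK_HZ)
all_alu_peak = peak_flops(ALL_ALU_LANES, ASSUMED_CLOCK_HZ)

# Clock that 1200 lanes would need in order to reach exactly 2 TFLOPs.
clock_for_2tflops = 2e12 / (ALL_ALU_LANES * 2)

print(f"320 lanes:  {r600_peak / 1e9:.1f} GFLOPs")
print(f"1200 lanes: {all_alu_peak / 1e12:.2f} TFLOPs")
print(f"clock needed for 2 TFLOPs: {clock_for_2tflops / 1e6:.0f} MHz")
```

Under these assumptions 1200 lanes land a little below 2 TFLOPs, which is consistent with the post's "~".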
 
Yes, but AFAIK point sampling in pixel shaders (or fetch4) is quite rare compared to bilinear texturing, so it still doesn't make much sense to me.

Fetch4 might be rare, but point sampling is not particularly rare. In a modern engine I'd guess that typically 25% or more of the fetches are point sampled. Most table lookups and screen-space based fetches are point sampled.
 
RV770 has 800 SPs? How? Be creative!:LOL:
We know that R600 doesn't really have 320 SPs, and that there are no "64 5D ALUs" but rather 4 Vec16 ALUs. But the marketing guys want us to think that R600 has 320 SPs while G80 has only 128 SPs...
The first rumors said that RV770 is an 800 SP monster. How? It's magic.
Here is a single R600 5D ALU with four thin SPs and one [strike]fat[/strike] Rys SP.
[image: diagalumb5.png]


4+1=5. 5*64=320 SPs. RV770 seems to have 480 SPs. 480 is 96*5 or 24*5*4.
800 - 480 = 320. Where can we get the other 320 SPs? :?:
Be creative!
Here is R600's TU block. It contains four TU quads:
[image: diagtmupb6.png]

Imagine: every TAP, every FP32 TS and every TFU is an SP. This means every TU quad contains 32 SPs. For R600 that's an extra 128 SPs (4*32). And we get another 192 SPs for the RV770 (6*32=192).

480+192=672.
But we need 800 SPs!
Be creative!
[image: diagroptw1.png]


Every RBE contains 32 SPs! R600 has 4 RBEs (128 SPs), RV770 will have four RBEs (128 SPs which we need).
As you see, the RV770 has 800 SPs: 480 SPs for shading, 192 SPs for texturing and 128 for ROP stuff.

Be creative!
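For anyone who wants to check the "creative" accounting, here is a small tally. The per-unit counts (5 SPs per ALU block, 32 per TU quad, 32 per RBE) are the tongue-in-cheek figures from the post, not real specifications, and the function name is made up for illustration.

```python
def creative_sp_count(alu_blocks, tu_quads, rbes,
                      sps_per_alu_block=5, sps_per_tu_quad=32, sps_per_rbe=32):
    """Add up 'SPs' the way the post does: shading + texturing + ROP."""
    shading = alu_blocks * sps_per_alu_block
    texturing = tu_quads * sps_per_tu_quad
    rop = rbes * sps_per_rbe
    return shading, texturing, rop, shading + texturing + rop


# Rumored RV770 per the post: 96 five-wide ALU blocks, 6 TU quads, 4 RBEs.
rv770 = creative_sp_count(alu_blocks=96, tu_quads=6, rbes=4)

# R600 per the post: 64 five-wide ALU blocks, 4 TU quads, 4 RBEs.
r600 = creative_sp_count(alu_blocks=64, tu_quads=4, rbes=4)

print("RV770:", rv770)
print("R600: ", r600)
```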
 
We might have 3 SIMDs each 32 sp wide (total 96sp and 480 ALU) with 32 TFU.
TU and ALU batch size would be the same, but now only 2 clocks (instead of 4) would be needed to complete a batch (still with a 64-item batch).
What do you think?

32 TUs doesn't necessarily mean 3 clusters with "Vec32" ALUs.

What about this: two independent sampler arrays with four TUs per array (16 + 16)? Or one "fat" sampler array with eight TUs, with these eight TUs working in parallel:
http://www.beyond3d.com/content/reviews/16/8
Samplers (=TUs) A1 and A2 feed Quad A in each SIMD cluster.
 
RV770 has 800 SPs? How? Be creative!:LOL:
We know that R600 doesn't really have 320 SPs, and that there are no "64 5D ALUs" but rather 4 Vec16 ALUs. But the marketing guys want us to think that R600 has 320 SPs while G80 has only 128 SPs...
The first rumors said that RV770 is an 800 SP monster. How? It's magic.
Here is a single R600 5D ALU with four thin SPs and one [strike]fat[/strike] Rys SP.
[image: diagalumb5.png]


4+1=5. 5*64=320 SPs. RV770 seems to have 480 SPs. 480 is 96*5 or 24*5*4.
800 - 480 = 320. Where can we get the other 320 SPs? :?:
Be creative!
Here is R600's TU block. It contains four TU quads:
[image: diagtmupb6.png]

Imagine: every TAP, every FP32 TS and every TFU is an SP. This means every TU quad contains 32 SPs. For R600 that's an extra 128 SPs (4*32). And we get another 192 SPs for the RV770 (6*32=192).

480+192=672.
But we need 800 SPs!
Be creative!
[image: diagroptw1.png]


Every RBE contains 32 SPs! R600 has 4 RBEs (128 SPs), RV770 will have four RBEs (128 SPs which we need).
As you see, the RV770 has 800 SPs: 480 SPs for shading, 192 SPs for texturing and 128 for ROP stuff.

Be creative!

Being creative leads me to the logic that R600 has 576SPs.
:p
 