Seems fake to me, but AMD could have faked it to find some holes.
The numbers definitely add up a lot better though
I think you're overestimating the size of the TFs. A simple bilinear filter uses 3 lerps - for 4 channels 12. So for 16 TFs more (granted fetch also needs to be increased from 20 values to 32), you'd need 192 lerps more. Given that a lerp should be a bit more complicated than a MAD in hardware, this indeed seems to be a lot compared to the shader core - BUT those are "only" fp16 so should be way cheaper in terms of transistor count than shader ALUs. Though I've no idea how bilinear filtering is implemented in hardware - there might be neat tricks...
TF in R600 is baseline fp16 format (int8 textures are converted to fp16 for filtering) - so the filtering units are big. So you're suggesting a doubling of an already very big unit.
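For anyone following the lerp arithmetic above, here is a rough Python sketch of a bilinear filter built from 3 lerps per channel, plus the count for 16 extra filter units. The function names are made up for illustration, and this says nothing about how R600 actually implements filtering in hardware.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b by weight t in [0, 1]."""
    return a + (b - a) * t


def bilinear(c00, c10, c01, c11, fx, fy):
    """Bilinear filter of one channel from 4 texels: exactly 3 lerps."""
    top = lerp(c00, c10, fx)       # lerp 1: along x, upper texel pair
    bottom = lerp(c01, c11, fx)    # lerp 2: along x, lower texel pair
    return lerp(top, bottom, fy)   # lerp 3: along y


LERPS_PER_CHANNEL = 3  # as counted in the post


def extra_lerps(extra_filter_units, channels=4):
    """Lerp units needed for a number of additional 4-channel filter units."""
    return extra_filter_units * channels * LERPS_PER_CHANNEL


print(bilinear(0.0, 1.0, 2.0, 3.0, 0.5, 0.5))  # 1.5
print(extra_lerps(1))    # 12 lerps for one 4-channel bilinear sample
print(extra_lerps(16))   # 192 lerps for 16 more TFs
```

This only reproduces the counting argument; real hardware may well share or merge these operations.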
I can't see why it couldn't be changed for rv770, unless you assume r7xx is the exact same architecture as r6xx just with a different configuration.
And I still don't know how the TU architecture would support this. I suspect it's tied down. The ATI guys have talked in the past about being able to scale lots of components individually (e.g. width of ALUs, number of RBEs) but there's not been any hint of varying TA:TF ratio.
Maybe - and I'm stretching far here - the TAs are not actually separate units, but some kind of double-pumped ALUs for INT8 needing two clocks for FP16/32.
Yes, but AFAIK point sampling in pixel shaders (or fetch4) is quite rare compared to bilinear texturing, so it still doesn't make much sense to me.
I'm saying that TU and ALU batch size is the same and also saying that I'm doubtful batch size will increase beyond 64.
Doesn't seem feasible to me. Not only would a shader cluster have to be able to dispatch twice as many instructions (per 4 clocks), but the point of running the same instructions for 4 clocks is (probably) that you know there won't be any register dependencies etc. for these 4 clocks. You can't just change that easily.
We might have 3 SIMDs, each 32 SPs wide (96 SPs and 480 ALUs in total), with 32 TFUs.
TU and ALU batch size would be the same, but now only 2 clocks (instead of 4) would be necessary to complete a batch (still with a 64-item batch).
What do you think?
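To make the batch arithmetic in the proposal above concrete, here is a small Python sketch. It assumes a batch simply takes batch size divided by SIMD width clocks to issue, and uses "SP" the way that post does (one 5-ALU unit). The helper names are hypothetical.

```python
def clocks_per_batch(batch_size, simd_width):
    """Clocks needed to push one batch through a SIMD of the given width."""
    if batch_size % simd_width != 0:
        raise ValueError("batch size must be a multiple of SIMD width")
    return batch_size // simd_width


def total_alus(simd_count, simd_width, alus_per_sp=5):
    """Total scalar ALUs, assuming 5 ALUs per SP as in R600."""
    return simd_count * simd_width * alus_per_sp


# R600 as discussed in the thread: 4 SIMDs, 16 wide, 64-item batches
print(clocks_per_batch(64, 16), total_alus(4, 16))  # 4 320

# Proposed layout: 3 SIMDs, 32 wide, still 64-item batches
print(clocks_per_batch(64, 32), total_alus(3, 32))  # 2 480
```

Halving the clocks per batch is exactly what the reply objects to, since the 4-clock cadence is (probably) what hides register dependencies.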
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.
Needless to say, I'm pessimistic about the degree of architectural change in R7xx. ATI's designed a set of knobs (SIMD and TU width, SIMD count, RBE count, MC count, MC width) and will frobnicate them for R7xx.
but on the other side, having fixed function pipelines filtering textures while ALUs are idling, waiting for the texture units is not a good balance. and AMD and NV are always pushing towards a higher ALU:TEX instruction ratio. the easiest way to have that is to do more texturing work on the ALU.
When R600 appeared ATI claimed that fp16 filtering is ~7 times faster than in R580 (where this filtering had to be performed by the ALUs). So, ahem, I don't think ATI'll be changing back to ALU-filtering soon.
but on the other side, having fixed function pipelines filtering textures while ALUs are idling, waiting for the texture units is not a good balance. and AMD and NV are always pushing towards a higher ALU:TEX instruction ratio. the easiest way to have that is to do more texturing work on the ALU.
from that point of view a fetch4 would be in hardware anyway.
So my crazy idea is, instead of just adding new TMUs, AMD could shift some more workload to the ALUs, making the TMUs 'cheaper' -> more TMUs on the same die size.
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.
...
e.g. on a high end GPU with 2000 ALU lanes at 1GHz there might not be any need for dedicated TF, but on the $30 GPU, a couple of hundred lanes, even at 1GHz, won't be enough..
...
Jawed
There's 320 of them.
What's considered a lane in R600 architecture?
For today's performance, hmm, you'd probably want about 1200 lanes I guess, ~2TFLOPs?
How many more arrays of SIMDs in R600 (at core speed) would you guesstimate to be needed if you wanted to do ALL the texture operations in the shaders (mapping, filtering, everything...) at the same performance they have now? Is that even feasible?
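As a sanity check on the lane counts thrown around above, here is a back-of-envelope Python sketch. It assumes every lane issues one MAD per clock and that a MAD counts as 2 flops; that convention is my assumption, not something stated in the thread, and the helper names are made up.

```python
FLOPS_PER_MAD = 2  # assumption: one multiply-add per lane per clock = 2 flops


def peak_tflops(lanes, clock_hz):
    """Theoretical peak in TFLOPs for a given lane count and clock."""
    return lanes * FLOPS_PER_MAD * clock_hz / 1e12


def implied_clock_mhz(lanes, target_tflops):
    """Clock (MHz) needed for the given lane count to reach the target."""
    return target_tflops * 1e12 / (lanes * FLOPS_PER_MAD) / 1e6


# The hypothetical high-end part mentioned earlier: 2000 lanes at 1GHz
print(peak_tflops(2000, 1e9))                 # 4.0

# 1200 lanes hitting ~2 TFLOPs implies a clock of roughly 833MHz
print(round(implied_clock_mhz(1200, 2.0), 1))  # 833.3
```

So under that assumption the "1200 lanes, ~2TFLOPs" guess corresponds to a clock in the low 800MHz range.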
800SP at 625MHz is twice RV670?
Or it could be that his math was slightly wrong
800/320 * 625/775 ~= 2.0
800SP at 625MHz is twice RV670?
That's what you are getting at, right?
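The ratio quoted above is easy to check. A quick Python sketch, assuming throughput scales linearly with SP count times clock and nothing else (the function name is just for illustration):

```python
def relative_throughput(sp_a, mhz_a, sp_b, mhz_b):
    """Ratio of SP count x clock for chip A over chip B."""
    return (sp_a * mhz_a) / (sp_b * mhz_b)


# Rumoured 800 SPs at 625MHz versus RV670's 320 SPs at 775MHz
ratio = relative_throughput(800, 625, 320, 775)
print(round(ratio, 3))  # 2.016
```

So "twice RV670" holds to within about 1% on paper.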
RV770 has 800 SPs? How? Be creative!
We know that R600 doesn't really have 320 SPs, and there are actually no "64 5D ALUs" but 4 Vec16-ALUs. Still, all the marketing guys want us to think that R600 has 320 SPs while G80 has only 128 SPs...
The first rumors said the RV770 is an 800 SP monster. How so? It's magic.
Here is a single R600's 5D ALU with four thin SPs and one [strike]fat[/strike] Rys SP.
4+1=5. 5*64=320 SPs. RV770 seems to have 480 SPs. 480 is 96*5 or 24*5*4.
800 - 480 = 320. Where can we get the other 320 SPs?
Be creative!
Here is R600's TU block. It contains four TU quads:
Imagine: every TAP, every FP32 TS and every TFU is an SP. This means every TU quad contains 32 SPs. For R600 that's an extra 128 SPs (4*32). And we get another 192 SPs for the RV770 (6*32=192).
480+192=672.
But we need 800 SPs!
Be creative!
Every RBE contains 32 SPs! R600 has 4 RBEs (128 SPs), RV770 will have four RBEs (128 SPs which we need).
As you see, the RV770 has 800 SPs: 480 SPs for shading, 192 SPs for texturing and 128 for ROP stuff.
Be creative!
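For fun, the "creative" counting scheme above written out as a Python sketch. The unit counts are the rumour-mill numbers from the post, the 32-SPs-per-TU-quad and 32-SPs-per-RBE figures are the post's own tongue-in-cheek assumptions, and the dictionary keys are invented here.

```python
SPS_PER_5D_ALU = 5    # four thin SPs plus one fat one
SPS_PER_TU_QUAD = 32  # the post's "every TAP, TS and TFU is an SP"
SPS_PER_RBE = 32      # the post's "every RBE contains 32 SPs"

# Unit counts as given in the post (RV770 values are rumours)
R600 = {"alus_5d": 64, "tu_quads": 4, "rbes": 4}
RV770 = {"alus_5d": 96, "tu_quads": 6, "rbes": 4}


def creative_sp_count(chip):
    """Return (shading, texturing, rop) SP counts under the joke scheme."""
    shading = chip["alus_5d"] * SPS_PER_5D_ALU
    texturing = chip["tu_quads"] * SPS_PER_TU_QUAD
    rop = chip["rbes"] * SPS_PER_RBE
    return shading, texturing, rop


parts = creative_sp_count(RV770)
print(parts, sum(parts))  # (480, 192, 128) 800
```

Applying the same scheme to R600 gives 320 + 128 + 128, which is the point of the joke: you can reach almost any headline number if you count enough things as SPs.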