AMD: R7xx Speculation

Status
Not open for further replies.
I'm holding steadfastly to 16 TUs.
Really? Or do you mean 16 TUs but with twice the filtering capacity? So instead of 8 TA + 4 TF, each of the 4 quad-units would have 8 TA + 8 TF? I'd agree that would probably make a lot of sense - never really understood why R6xx has twice the TAs anyway (sure, free vertex texture fetch or being able to do a fetch4 alongside "normal" bilinear texturing is nice, but that's optimizing for the special case).
 

Modified compiler, overhauled scheduling and instruction/operand issue hardware. Large increase in control overhead - going from 4 (64/16) processors in RV670 to 10 (160/16) in RV770. All for no obvious benefit. Remember that one of the reasons that ATi can achieve such high ALU density is that each processor is very wide (80 ALUs).
 
Again unlike G80/G9x, RBEs are not directly coupled to memory channels (R600 still only had 4 RBEs despite having 512bit bus), so it should be possible, though I think most consider it unlikely.
Thanks for the update about the shaders. I am aware that R600 still has only 4 RBE blocks like the RV670, but I believe the number is connected with the number of memory channels, not the resulting bus width. R600 and RV670 both have 8 memory channels; in R600 these are 64bit channels, in RV670 these are 32bit channels (as with R520 & R580). On the other hand, all G8x/G9x chips and presumably GT200 as well use 64bit channels, so it then looks as if the number of RBE blocks was fixed.
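
To make the channel arithmetic above concrete, here is a minimal sketch. The helper name `bus_width_bits` is made up for illustration; the channel counts and widths are the ones stated in the post, not independently verified figures.

```python
# Hypothetical helper: total bus width is channel count times channel width.
def bus_width_bits(channels: int, channel_bits: int) -> int:
    return channels * channel_bits

# (channels, bits per channel) as given in the post above.
chips = {"R600": (8, 64), "RV670": (8, 32)}

widths = {name: bus_width_bits(*cfg) for name, cfg in chips.items()}
for name, bits in widths.items():
    print(f"{name}: {bits}-bit bus")
```

Both chips come out with the same channel count but different bus widths, which is the point being argued: the RBE count could track the number of channels rather than the total width.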
 
Thanks for the update about the shaders. I am aware that R600 still has only 4 RBE blocks like the RV670, but I believe the number is connected with the number of memory channels, not the resulting bus width. R600 and RV670 both have 8 memory channels; in R600 these are 64bit channels, in RV670 these are 32bit channels (as with R520 & R580). On the other hand, all G8x/G9x chips and presumably GT200 as well use 64bit channels, so it then looks as if the number of RBE blocks was fixed.

RV620/635 have the very same structure for the RBE block as R600/RV670 and a 128 bit interface with VRAM. Same goes for RV610/615 but with a 64 bit RAM channel. The difference is that ATI chips use the ring bus as the interface with RBEs and memory channels are not directly attached to them, whereas Nvidia chips use a crossbar only between clusters and ROP blocks, so the memory interface itself is sort of "hard-wired" to the ROPs, AFAIK.
 
Ok, then it really looks like 96 vec5 shaders @ 625MHz, which would give about 600 gigaflops, right?

Yeah which would be embarrassing considering that RV670 easily hit 530 gigaflops. All the numbers aren't coming together as yet but we'll know soon enough.
 
Really? Or do you mean 16 TUs but with twice the filtering capacity? So instead of 8 TA + 4 TF, each of the 4 quad-units would have 8 TA + 8 TF? I'd agree that would probably make a lot of sense - never really understood why R6xx has twice the TAs anyway (sure, free vertex texture fetch or being able to do a fetch4 alongside "normal" bilinear texturing is nice, but that's optimizing for the special case).
I think the TUs cost so much that no substantial increase will be seen.

With batched texturing, a 64-object batch requires 4 clocks in RV670 TUs, and this duration is extremely unlikely to change. So then you have to consider a batch size on the TUs that matches the batch size on the ALUs (4 clocks * SIMD width).

So 480 ALU lanes is either 6 SIMDs with 64-object batches or 4 SIMDs with 96-object batches. The latter would lead to 24 TUs, which I think is in transistor budget-busting territory. Plus I think an increase in batch size is unlikely. Finally I expect ALU:TEX ratio to continue increasing, so 6:1 fits the bill. It would be ironic if it turns out that RV670 is generally ALU-bound - like R520 was, such that the TUs are underutilised... Certainly there are plenty of individual shaders in many games that are ALU bound - it's just a question of whether games are globally ALU-bound.
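
The batch arithmetic in the two paragraphs above can be sketched roughly as follows. This is only an illustration of the reasoning in the post: the function name `layout` is made up, and the 5 lanes per object figure is an assumption based on the R6xx-style 5-wide ALU, not something confirmed for RV770.

```python
ALU_LANES = 480          # rumoured RV770 lane count, from the thread
LANES_PER_OBJECT = 5     # assumption: R6xx-style 5-lane ALU per object
CLOCKS_PER_BATCH = 4     # from the post: a batch takes 4 clocks

def layout(batch_size: int) -> dict:
    """Derive SIMD count, TU count and ALU:TEX ratio for a given batch size."""
    simd_width = batch_size // CLOCKS_PER_BATCH          # objects issued per clock
    simds = ALU_LANES // (simd_width * LANES_PER_OBJECT)
    tus = batch_size // CLOCKS_PER_BATCH                 # TUs needed to texture a batch in 4 clocks
    alu_tex = (ALU_LANES // LANES_PER_OBJECT) // tus     # 5-lane ALUs per TU
    return {"simds": simds, "tus": tus, "alu_tex": alu_tex}

layouts = {size: layout(size) for size in (64, 96)}
for size, cfg in layouts.items():
    print(size, cfg)
```

Under these assumptions a 64-object batch gives 6 SIMDs, 16 TUs and a 6:1 ratio, while a 96-object batch gives 4 SIMDs, 24 TUs and 4:1, which matches the two options described.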

It may turn out that RV670 runs out of batches in flight all too easily, so turning theoretically ALU-bound shaders into TEX-bound shaders. Who knows...

As for TAs - the point sampling and filtering parts of the TUs are independent - able to run in parallel. Point sampling is not just about fetching vertex data. I will admit that this part of R6xx is pretty poorly understood - e.g. how much the compiler is able to parallelise unfiltered and filtered fetches for pixel shader code.

Jawed
 
Yeah which would be embarrassing considering that RV670 easily hit 530 gigaflops. All the numbers aren't coming together as yet but we'll know soon enough.
A 725MHz RV770 would have 32% higher FLOPs than an 825MHz RV670.
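
For anyone wanting to check the numbers, here is a quick sketch of the peak-FLOPs arithmetic. It assumes each ALU lane retires one multiply-add (2 flops) per clock, with 480 lanes for RV770 (the rumoured figure) and 320 for RV670; `peak_gflops` is just an illustrative helper name.

```python
FLOPS_PER_LANE_PER_CLOCK = 2   # assumption: one multiply-add per lane per clock

def peak_gflops(lanes: int, clock_mhz: float) -> float:
    """Theoretical peak in GFLOPS for a given lane count and core clock."""
    return lanes * FLOPS_PER_LANE_PER_CLOCK * clock_mhz / 1000.0

rv670_825 = peak_gflops(320, 825)   # RV670 at 825MHz
rv770_625 = peak_gflops(480, 625)   # rumoured RV770 at 625MHz
rv770_725 = peak_gflops(480, 725)   # rumoured RV770 at 725MHz

advantage_pct = (rv770_725 / rv670_825 - 1) * 100
print(rv670_825, rv770_625, rv770_725, round(advantage_pct))
```

With these inputs the 625MHz case lands on the "about 600 gigaflops" quoted earlier in the thread, RV670 at 825MHz comes to 528, and the 725MHz case is roughly 32% ahead of it.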

I'm sceptical about these "low clocks" for HD4850/70 though. Going backwards from RV670 just doesn't sound right to me.

Jawed
 
When Apple lowered the price of the iPhone by $200, what did that tell you about the price of included hardware? Squat!

Last time I checked, cell phones were at least in part subsidized by (usually) co-marketing agreements between service providers and phone manufacturers, whereby the BOM of the phone has no direct correlation to the price which consumers pay.

The same situation does not exist in the world of graphics cards.
 
=>Rangers: Current nVidia chips have too much texturing power for their own good. I haven't tested it myself, but I heard that synthetic texture fill-rate tests can send the GPU temperature and power consumption skyrocketing. In normal usage, the cards don't use the texturing units to their full potential. So it would appear logical for GT200 to have relatively fewer TUs.

Hmm, perhaps that's why my system locks up after a couple of minutes in Oblivion @ 1080p w/8xAA, HDR, and everything else cranked to the max, including foliage/vegetation and shadows. :p
 
I think the TUs cost so much that no substantial increase will be seen.

With batched texturing, a 64-object batch requires 4 clocks in RV670 TUs, and this duration is extremely unlikely to change. So then you have to consider a batch size on the TUs that matches the batch size on the ALUs (4 clocks * SIMD width).

So 480 ALU lanes is either 6 SIMDs with 64-object batches or 4 SIMDs with 96-object batches. The latter would lead to 24 TUs, which I think is in transistor budget-busting territory. Plus I think an increase in batch size is unlikely. Finally I expect ALU:TEX ratio to continue increasing, so 6:1 fits the bill. It would be ironic if it turns out that RV670 is generally ALU-bound - like R520 was, such that the TUs are underutilised... Certainly there are plenty of individual shaders in many games that are ALU bound - it's just a question of whether games are globally ALU-bound.

It may turn out that RV670 runs out of batches in flight all too easily, so turning theoretically ALU-bound shaders into TEX-bound shaders. Who knows...

As for TAs - the point sampling and filtering parts of the TUs are independent - able to run in parallel. Point sampling is not just about fetching vertex data. I will admit that this part of R6xx is pretty poorly understood - e.g. how much the compiler is able to parallelise unfiltered and filtered fetches for pixel shader code.

Jawed

Yet all the rumors are now consistent on 32 TMUs... you should really give up the 16 TMU idea.
 
Actually, it's probably the stupidest rumour in recent (and not so recent) history.

320SPs reserved for "future physics processing capability" with no defined standard API to use for GPU physics calculations is RETARDED. Anyone that believes this rumor is simply not using their brain.

Yeah, it is, but I'm still amazed that launch is 1 month away and no one can get anything concrete on the architecture - very uncharacteristic of an ATI release, at least recently.

What do you mean? 480SPs is definite at this point. The 55nm process is known. We also know just about everything there is to know about the memory (clock, type, amount, # of channels/resulting interface width, etc.). The only thing we don't have absolutely nailed down is the clocks, and there is still some debate about the number and functionality of RBEs/TMUs.

As for the timing, we didn't know what RV670 was until < 1 month to launch.

Thx :)
Now I know the voltage an RV770 ES needs for a 625MHz core clock.

http://www.hardware-infos.com/news.php?news=2074

A12 needs 1.3V for 625MHz; the memory used is HYB18H512321BF-10.

1.3V just for 625MHz on 55nm? Seems rather high to me... I highly doubt that is shipping voltage or clock.
 
I think the TUs cost so much that no substantial increase will be seen.

With batched texturing, a 64-object batch requires 4 clocks in RV670 TUs, and this duration is extremely unlikely to change. So then you have to consider a batch size on the TUs that matches the batch size on the ALUs (4 clocks * SIMD width).

So 480 ALU lanes is either 6 SIMDs with 64-object batches or 4 SIMDs with 96-object batches. The latter would lead to 24 TUs, which I think is in transistor budget-busting territory. Plus I think an increase in batch size is unlikely. Finally I expect ALU:TEX ratio to continue increasing, so 6:1 fits the bill. It would be ironic if it turns out that RV670 is generally ALU-bound - like R520 was, such that the TUs are underutilised... Certainly there are plenty of individual shaders in many games that are ALU bound - it's just a question of whether games are globally ALU-bound.
I really fail to see what the problem is with increasing the TU batch size.

As for TAs - the point sampling and filtering parts of the TUs are independent - able to run in parallel. Point sampling is not just about fetching vertex data. I will admit that this part of R6xx is pretty poorly understood - e.g. how much the compiler is able to parallelise unfiltered and filtered fetches for pixel shader code.
Yes, but AFAIK point sampling in pixel shaders (or fetch4) is quite rare compared to bilinear texturing, so it still doesn't make much sense to me.

I think adding "only" more TF units would make sense - maybe there could be some restrictions on where the results would need to go (either all 8 filtered texels from a "quad-unit" go to the same shader array, or 4+4 must go to different arrays, or something like that), but it should be doable IMHO without screwing up the basic R6xx design principles too much.
 
Oh come now, surely you know how to use Google ;)

In semi-layman's terms: it's a way to see what voltage is necessary to run a given clockspeed for a particular MPU, and it provides the resulting TDP as well, which allows manufacturers to determine the SKU segmentation/productization options for the given product family.
 