I was under the impression that it should rather read like this:
The primary difference is that G80 just has to serialize all instructions on each object in one pixel processor, whereas R6xx, while it can do that under certain circumstances (such as no dependency between scalars) to achieve similar effectiveness, generally also has to parallelize to get maximum utilization.
The cheapest, but not the most efficient, way of doing so would be serializing a second MADD to all 320 ALUs.
Yeah, and I sure hope this is not what AMD is going for with RV770. Isn't that what G71 did? It's probably hard enough to find co-issue opportunities for R6xx... don't think they would want to add dual-issue on top of that!
That's how I tend to look at it as well. You can make the case for higher ALU density with ATI's approach but I don't believe the ability to process vector instructions should be highlighted as an advantage. At least I don't recall any cases where R6xx proves more efficient per flop.
This is interesting. Actually, with the G80 vs. the R600 in geometry performance (which uses a lot of vector instructions), in Dx9 it certainly does have a huge advantage, and it fares well in Dx10 too; but for Dx9 performance I'm not fully convinced the G80 is doing load balancing, because of its comparatively poor performance vs. Dx10.
TBH I don't know what numbers you're referring to here but G8x's synthetic geometry performance isn't shader bound. See G94 vs G92. Do you have a link demonstrating what you're talking about re DX9 vs DX10?
http://www.digit-life.com/articles3/video/rv670-part2-page8.html (Dx10)
http://www.digit-life.com/articles3/video/rv670-part2-page5.html (Dx9)
Sorry, I should have added vertex shader performance in my initial statement too.
Seems like the "800 stream processors" rumor is picking up speed...
Tech Report link which leads to vr-zone and finally finishes up at Chiphell.
I've just thought, two RV770s (R780 is what I like to call it) could have 400 SPs each, arranged in 5 SIMDs. Then 32 TUs is 16 on each RV770, and 32 RBEs would arise the same way. I'm gritting my teeth, as I think this rumour is one of those dreams, but 160:32 is a 5:1 ALU:TEX ratio.
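For anyone who wants to check the arithmetic behind that ratio, here is a quick sketch. It only uses the rumoured figures quoted in this thread (800 SPs, 32 TUs, two chips, 5 SIMDs per chip) plus the assumption that the R6xx-style 5-wide ALU carries over; the variable names are made up for illustration and none of this is a confirmed spec.

```python
# Rumoured totals from the thread; none of these are confirmed specs.
total_sps = 800        # "800 stream processors" rumour
total_tus = 32         # rumoured texture unit count
chips = 2              # the two-RV770 guess
lanes_per_alu = 5      # R6xx-style 5-wide ALU, assumed to carry over
simds_per_chip = 5     # guess from the post above

sps_per_chip = total_sps // chips                                   # 400
tus_per_chip = total_tus // chips                                   # 16
alus_total = total_sps // lanes_per_alu                             # 160
alus_per_simd = sps_per_chip // lanes_per_alu // simds_per_chip     # 16
alu_tex_ratio = alus_total / total_tus                              # 5.0

print(sps_per_chip, tus_per_chip, alus_total, alus_per_simd, alu_tex_ratio)
```

So the 160 in "160:32" is the count of 5-wide ALUs across both chips, not the SP count.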
Because if all 5 units had the same capabilities the compiling/scheduling would be easier.
This was moronic when I said it, but when you said it with some extra explanation & caveats it's not? I suggested this a long time ago - you'd want look-up tables for each lane and then use repeated MADs (e.g. to produce a result every 4 clocks). But the look-up tables are still relatively costly so they'd be anything but "simple units".
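To make the "look-up table plus repeated MADs" idea concrete, here is a minimal software sketch. It is not how any actual GPU evaluates transcendentals; the table size, the quadratic (midpoint Taylor) coefficients and the names `TABLE`, `mad` and `sin_lut` are all assumptions chosen for illustration. It uses two MADs per result rather than the four clocks mentioned above.

```python
import math

SEGMENTS = 64                       # table size is an arbitrary choice
LO, HI = 0.0, math.pi / 2
STEP = (HI - LO) / SEGMENTS

# Per-segment quadratic coefficients: Taylor expansion of sin at the midpoint.
TABLE = []
for i in range(SEGMENTS):
    m = LO + (i + 0.5) * STEP
    TABLE.append((math.sin(m), math.cos(m), -0.5 * math.sin(m)))

def mad(a, b, c):
    """One multiply-add, the only arithmetic the lane is assumed to have."""
    return a * b + c

def sin_lut(x):
    """Approximate sin(x) for LO <= x < HI with one lookup and two MADs."""
    i = min(int((x - LO) / STEP), SEGMENTS - 1)
    c0, c1, c2 = TABLE[i]
    dx = x - (LO + (i + 0.5) * STEP)   # offset from the segment midpoint
    r = mad(c2, dx, c1)                # c2*dx + c1
    return mad(r, dx, c0)              # (c2*dx + c1)*dx + c0

# Worst absolute error over a sweep of the input range.
worst = max(abs(sin_lut(k * HI / 1000) - math.sin(k * HI / 1000))
            for k in range(1000))
print(worst)
```

Even in this toy version the table holds three coefficients per segment for one function on one range, which hints at why per-lane tables would not be cheap.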
Then you get into the question of why have 5 and then you get into questions of register file organisation, batching, clause-scheduling (ALU instructions are issued in groups of a maximum of 32 slots) etc. It would be a complete re-design.
I don't think it's a good idea because the lookup tables are expensive. This was moronic when I said it, but when you said it with some extra explanation & caveats it's not?
Note that I didn't specify that they would be full-blown transcendental units.
You put those words in my mouth.
Obviously would need to compromise on either number of units or complexity of units.
Sounds like a reasonable size to me. Say 3M. How many transistors does one 5D ALU contain? An R5xx ALU with the appropriate reg. array was about 2M. The R6xx ALU is less powerful in theoretical flops, but more effective (scalar). Could we assume 2-3M for one 5D ALU + reg. array?
I'd edge to about 8M, there's L2 cache too. One R5xx TU + one R5xx ROP cost together about 8M. R6xx's TMUs are significantly beefier, so 6M for the TU alone could be close to reality.
I've completely forgotten about that. Will have to rummage for that later. RV770 is rumoured to be ~200M transistors larger than RV670. I'd be very surprised if all those transistors were used only for one SIMD. I think 96 5D ALUs (+64-96M) and 32 texture units (+96M) isn't an unrealistic possibility (still waiting for Ortons future)
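The budget in that last guess can be tallied as below. This is a back-of-the-envelope sketch only: the per-unit costs are the estimates floated in this thread (2-3M per 5D ALU with register array, 6M per texture unit), the RV670 baseline of 64 5D ALUs (320 SPs) and 16 TUs is taken as the starting point, and the variable names are hypothetical.

```python
# All figures are thread estimates, in millions of transistors.
rv670_alus, rv670_tus = 64, 16      # RV670 baseline: 320 SPs = 64 5-wide ALUs, 16 TUs
guess_alus, guess_tus = 96, 32      # configuration guessed above
alu_cost = (2, 3)                   # per 5D ALU + register array (low, high)
tu_cost = 6                         # per texture unit
rumoured_increase = 200             # "~200M larger than RV670"

extra_alus = guess_alus - rv670_alus                      # 32
extra_tus = guess_tus - rv670_tus                         # 16
alu_budget = tuple(extra_alus * c for c in alu_cost)      # (64, 96)
tu_budget = extra_tus * tu_cost                           # 96
total = tuple(a + tu_budget for a in alu_budget)          # (160, 192)
leftover = tuple(rumoured_increase - t for t in total)    # (40, 8)

print(alu_budget, tu_budget, total, leftover)
```

On these numbers the guess uses 160-192M of the rumoured ~200M, leaving somewhere between 8M and 40M for everything else.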
I'm presuming the RBEs will have double the per-clock Z performance and that'll have a stronger influence on bandwidth. Anyway, why use high-speed GDDR5 modules for a 16-TU GPU? That would be quite an expensive overkill...
I thought the 670 ALUs were MIMD, yes?
It's true that they are all just SIMD processors of varying widths. But how exactly would you market the differences between G80 and G71 or R600 and R580? I'm sure you appreciate the added value there. IMO it's not something that should be ignored simply because "they're all SIMD".
I for one think Nvidia's approach of filling their processors with scalar operands from different pixels/vertices/threads etc instead of independent instructions from the same object is bloody fantastic and deserves to be highlighted in some way. Of course some would argue that it isn't true scalar and there's still some dual-issue, instruction reordering and other wizardry to perform in the compiler but it's disingenuous to dismiss the elegance of the whole setup.
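A toy model of that contrast, for illustration only: it ignores latency, register pressure, special-function slots and everything else a real compiler or scheduler deals with. The dependency-graph format and the function names are invented here. The "5-wide" side greedily packs independent scalar ops from one object into bundles; the "scalar" side gives each lane its own pixel, so dependencies inside a pixel never idle a lane.

```python
def vliw_bundles(deps, width=5):
    """Greedy packing: issue up to `width` ready scalar ops per bundle.

    deps maps an op index to the set of op indices it depends on.
    """
    done, remaining, bundles = set(), set(deps), 0
    while remaining:
        ready = sorted(op for op in remaining if deps[op] <= done)[:width]
        if not ready:
            raise ValueError("dependency cycle")
        done.update(ready)
        remaining.difference_update(ready)
        bundles += 1
    return bundles

def vliw_utilization(deps, width=5):
    """Fraction of issue slots filled when packing ops from a single object."""
    return len(deps) / (vliw_bundles(deps, width) * width)

# One lane per pixel, one scalar op per clock: every slot is filled in this
# toy model, whatever the dependencies inside the pixel look like.
SCALAR_UTILIZATION = 1.0

chain = {0: set(), 1: {0}, 2: {1}, 3: {2}}         # fully dependent scalars
vec4 = {i: set() for i in range(4)}                # one vec4-style operation
mixed = {0: set(), 1: set(), 2: set(), 3: {0, 1, 2}}  # vec3, then a dependent scalar

for name, graph in (("chain", chain), ("vec4", vec4), ("mixed", mixed)):
    print(name, vliw_bundles(graph), vliw_utilization(graph), SCALAR_UTILIZATION)
```

The dependent chain fills only one slot in five per bundle, while the vec4 case fills four; the per-pixel scalar arrangement does not care which of the two it is handed.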
I'm wondering where the thoughts of working on different threads or objects in the same SIMD / shader grouping come from.
The problem with calling what GPUs do SIMD is that people will tend to think it works in the same way as desktop processors (which have no MD input parameters for load & store).