FS ATI interview on R500 + Block Diagram

dukmahsik said:
http://www.extremetech.com/article2/0,1558,1818139,00.asp

The 48 ALUs are divided into three SIMD groups of 16. When it reaches the final shader pipe, each of the 16 ALUs has the ability to write out two samples to the 10MB of EDRAM. Thus, the chip is capable of writing out a maximum of 32 samples per clock. At 500MHz, that means a peak fill rate of 16 gigasamples. Each of the ALUs can perform 5 floating-point shader operations. Thus, the peak computational power of the shader units is 240 floating-point shader ops per cycle, or 120 billion shader ops per second at 500MHz.

???

They are changing the way peak ops are calculated again. It is true that the xGPU can perform operations on 48x5 components per cycle, as it does one 4D vector operation plus one scalar operation - but if we start counting that way the RSX would also do way more than the 136 ops per cycle. It is a numbers game and it is getting annoying.
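
For anyone who wants to check the article's arithmetic, here is a minimal Python sketch. It only multiplies out the figures quoted from ExtremeTech above (16 ALUs in the final pipe writing 2 samples each, 48 ALUs at 5 ops each, 500MHz). The constant and function names are my own, and it takes no position on how an "op" ought to be counted.

```python
# Peak-rate arithmetic using only the figures quoted from the article.
CLOCK_HZ = 500_000_000          # 500MHz, as quoted

SAMPLES_PER_CLOCK = 16 * 2      # 16 ALUs x 2 samples each
SHADER_OPS_PER_CLOCK = 48 * 5   # 48 ALUs x 5 floating-point ops each


def per_second(per_clock, clock_hz=CLOCK_HZ):
    """Scale a per-clock peak figure to a per-second peak figure."""
    return per_clock * clock_hz


if __name__ == "__main__":
    print(SAMPLES_PER_CLOCK, per_second(SAMPLES_PER_CLOCK) / 1e9, "gigasamples/s")
    print(SHADER_OPS_PER_CLOCK, per_second(SHADER_OPS_PER_CLOCK) / 1e9, "billion ops/s")
```

Plugging a different per-clock count into `per_second` changes the headline figure proportionally, which is why the totals depend entirely on what gets called an op.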
 
nelg - agreed, this idea that all 48 ALUs are doing the same thing is pretty peculiar. But it does make for an incredibly fine-grained execution of code.

But I wonder what happens when some pixels fail a branch test while others pass it.

It's interesting that the texture units are described as Texture Pipelines. Though I admit to not really understanding what would be pipelined in texturing.

The texture pipelines appear to be organised as a quad, but is that really what's happening there?

Additionally, if you have a 48-way SIMD architecture, is that broken down into quads? Why bother with quads if it's SIMD?

Jawed
 
Jaws said:
Josh378 said:
http://www.firingsquad.com/features/xbox_360_interview/

well confusion solved!!!!

-Josh378

I gotta say this is getting hilarious by the minute!

The units are now 'flops' and not 'shops'!* :p

48 Shader units ~ 196 flops per cycle

@ 500 MHz ~ 98 GFlops

*note: shops ~ shader ops!

Wasn't it thought each ALU could do a vector4 and scalar op per cycle? Aka 10 flops each per cycle? 240Gflops/sec? Where'd that go?
 
Titanio said:
Jaws said:
Josh378 said:
http://www.firingsquad.com/features/xbox_360_interview/

well confusion solved!!!!

-Josh378

I gotta say this is getting hilarious by the minute!

The units are now 'flops' and not 'shops'!* :p

48 Shader units ~ 196 flops per cycle

@ 500 MHz ~ 98 GFlops

*note: shops ~ shader ops!

Wasn't it thought each ALU could do a vector4 and scalar op per cycle? Aka 10 flops each per cycle? 240Gflops/sec? Where'd that go?

AFAIK, that's still the case, assuming FMADD...check the *official* specs from MS...i.e.

48-way vector4 units (i.e. 48 * 4-way SIMD units)
48-way scalar units
 
We now have Microsoft math to look at:
www.majornelson.com

Microsoft is saying 330 million transistors for GPU + eDram.

Hopefully Dave can clear up the bits about all ALUs operating in the same mode and FP blending and buffers. That could be significant.
 
Jaws said:
AFAIK, that's still the case, assuming FMADD...check the *official* specs from MS...i.e.

48-way vector4 units (i.e. 48 * 4-way SIMD units)
48-way scalar units

So where did the other 6 floating point ops go? ;)

Major Nelson's article is too funny..
 
Rockster said:
We now have Microsoft math to look at:
www.majornelson.com

Microsoft is saying 330 million transistors for GPU + eDram.

Hopefully Dave can clear up the bits about all ALUs operating in the same mode and FP blending and buffers. That could be significant.

RO-FL-OPS!
 
Major Nelson said:
- 48-way vec4 shader pipelines
- 48-way scalar pipelines
- 16 dedicated texture fetch pipelines
- 16 programmable vertex fetch and tessellation pipelines

Jawed
 
Titanio said:
So where did the other 6 floating point ops go?

I'd like to guess but my head hurts now, please explain! :)

Rockster said:
Yeah, it's hard to get around the propaganda. I just laugh about it. It could explain some of the RSX counts though.

What's your interpretation of the RSX figures? :)
 
Jawed, all those pipelines should be replaced with units.

Jaws, it was one possible way to explain how they're getting 136 ops/cycle. The rest is pretty much useless.
 
Jaws said:
Titanio said:
So where did the other 6 floating point ops go?

I'd like to guess but my head hurts now, please explain! :)

They're now saying one of those 48 ALUs can do four floating point ops per cycle.

It was thought each ALU could do a vec4 and scalar op simultaneously - aka 10 floating point ops per cycle, right?

So where did the 6 flops go? ;)
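
To spell out the two ways of counting being argued over here, a rough Python sketch. The assumptions are mine: a vec4 plus a scalar lane per ALU, an FMADD counted as 2 flops per lane (as suggested earlier in the thread), and 4 flops per ALU for the interview-style count. All names are made up for illustration.

```python
# Two ways of counting flops per ALU; inputs are assumptions from the thread.
CLOCK_HZ = 500_000_000   # 500MHz
ALUS = 48


def flops_per_alu(vector_lanes, scalar_lanes, flops_per_lane_op):
    """Flops one ALU issues per cycle under a given counting rule."""
    return (vector_lanes + scalar_lanes) * flops_per_lane_op


def gflops(per_alu):
    """Peak GFlops for all ALUs at the quoted clock."""
    return ALUS * per_alu * CLOCK_HZ / 1e9


# vec4 + scalar, with each lane doing a multiply-add (2 flops): assumed.
VEC4_SCALAR_FMADD = flops_per_alu(4, 1, 2)
# The interview-style figure of four flops per ALU per cycle.
INTERVIEW_COUNT = 4

if __name__ == "__main__":
    print(VEC4_SCALAR_FMADD, gflops(VEC4_SCALAR_FMADD))
    print(INTERVIEW_COUNT, gflops(INTERVIEW_COUNT))
    print("difference per ALU:", VEC4_SCALAR_FMADD - INTERVIEW_COUNT)
```

Note that 48 x 4 comes to 192 per cycle (96 GFlops at 500MHz), slightly under the ~196 and ~98 quoted from the interview; the sketch does not try to explain that gap.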
 
I think Tim said it best.

The RSX is around 50+% faster than the xGPU in peak theoretical performance in pretty much every way. How much of this advantage it can retain in real life is a hypothetical question until we see some real hardware.
 
Tim said:
The RSX is around 50+% faster than the xGPU in peak theoretical performance in pretty much every way.

I did not know we knew enough about the RSX to make that statement. What areas is it 50+% faster? Links/numbers?
 
mckmas8808 said:
I think Tim said it best.

The RSX is around 50+% faster than the xGPU in peak theoretical performance in pretty much every way. How much of this advantage it can retain in real life is a hypothetical question until we see some real hardware.

Honestly, we do not know enough about RSX to even begin to make that kind of assumption.
 
Acert93 said:
Tim said:
The RSX is around 50+% faster than the xGPU in peak theoretical performance in pretty much every way.

I did not know we knew enough about the RSX to make that statement. What areas is it 50+% faster? Links/numbers?

You should already know the numbers; they have been posted and linked a million times on this forum over the last few days. Here are a few:

NVIDIA vs. Ati
Z/stencil ops: 48 vs. 32 per cycle (edited wrong numbers)
Texture : 24 vs. 16 per cycle
Pixels: 16 vs. 8 per cycle
Shader ops: 136 vs. 96 per cycle
Clockspeed : 550 vs. 500MHz
(I will try to find the links. Edit: I am not sure all of these are official anyway, but most are.)

But as I said, the xGPU might very well turn out to be fastest in real life. The shader ops numbers are most likely not directly comparable, and the architectures and memory sub-systems of these chips are vastly different, which also impacts real-life performance.

Edit again: Really, I have been messing things up. The xGPU numbers are real factual numbers - the RSX ones are only based on rumors/leaked specs of the G70 (I don't know how I got it into my mind that these numbers were official; I am going to bed now to get some sleep before I make any more of an ass out of myself).
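
A quick way to sanity-check the "50+%" claim against the figures listed above is to take the ratios, both per clock and clock-adjusted. A small Python sketch follows. The RSX side uses the rumoured/leaked numbers from the list, so the output is only as good as those inputs, and the dictionary and function names are mine.

```python
# (NVIDIA, ATI) per-cycle figures exactly as listed in the post above.
# The NVIDIA column is based on rumoured/leaked G70 specs, not official data.
PER_CYCLE = {
    "z_stencil": (48, 32),
    "texture": (24, 16),
    "pixels": (16, 8),
    "shader_ops": (136, 96),
}
CLOCK_MHZ = (550, 500)  # (NVIDIA, ATI)


def per_clock_ratio(metric):
    """NVIDIA/ATI ratio of the per-cycle figure."""
    nv, ati = PER_CYCLE[metric]
    return nv / ati


def per_second_ratio(metric):
    """NVIDIA/ATI ratio after scaling each side by its clock speed."""
    nv, ati = PER_CYCLE[metric]
    return (nv * CLOCK_MHZ[0]) / (ati * CLOCK_MHZ[1])


if __name__ == "__main__":
    for name in PER_CYCLE:
        print(name, round(per_clock_ratio(name), 3), round(per_second_ratio(name), 3))
```

On these inputs the per-clock lead ranges from about 42% (shader ops) to 100% (pixels), and the clock-adjusted lead from about 56% to 120%. None of that says anything about real-life performance, and the shader ops figures may not be counted the same way on both sides.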
 
Jawed said:
It's interesting that the texture units are described as Texture Pipelines. Though I admit to not really understanding what would be pipelined in texturing.

Texture fetches have high latency because they need to access memory. If texturing weren't pipelined, texturing performance would be very low, since the TMU would be blocked waiting for the memory.
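
A toy Python model of why that matters. It assumes a fixed fetch latency and a pipeline that can accept one new fetch per cycle; the 100-cycle latency is a made-up number for illustration, not a figure for any real TMU.

```python
def unpipelined_cycles(num_fetches, latency):
    """Each fetch must finish before the next one starts."""
    return num_fetches * latency


def pipelined_cycles(num_fetches, latency):
    """One fetch is issued per cycle; results stream out after the first latency."""
    if num_fetches == 0:
        return 0
    return latency + (num_fetches - 1)


if __name__ == "__main__":
    fetches, latency = 16, 100  # hypothetical values
    print("unpipelined:", unpipelined_cycles(fetches, latency))
    print("pipelined:  ", pipelined_cycles(fetches, latency))
```

With enough fetches in flight, the pipelined cost per fetch approaches one cycle regardless of the latency, which is the whole point of pipelining the texture path.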
 
Tim said:
NVIDIA vs. Ati
Z/stencil ops: 24 vs. 16 per cycle
Texture : 24 vs. 16 per cycle
Pixels: 16 vs. 8 per cycle
Shader ops: 136 vs. 96 per cycle
Clockspeed : 550 vs. 500MHz
(I will try to find the links. Edit: I am not sure all of these are official anyway, but most are.)

But as I said, the xGPU might very well turn out to be fastest in real life. The shader ops numbers are most likely not directly comparable, and the architectures and memory sub-systems of these chips are vastly different, which also impacts real-life performance.


But that is my point ;) Picking a couple numbers and drawing up theoretical max performance differences, knowing we are missing big chunks of the puzzle, is useless IMO. Real world performance aside, can we say we have enough technical information to begin discussing any realistic metric of %'s of peak theoretical difference in performance?

Giving a peak advantage based on 3 or 4 statistics, on either side, is misleading especially when there is so much confusion on how the R500 operates (think about how much we learned just yesterday... e.g. the ALUs can do either pixel or vertex tasks each cycle but not both on the same cycle).

Without knowing the configuration, workflow, and limitations of each design, throwing out general numbers of peak performance is misleading. Even the stats you list are debatable to some degree and/or uncertain (e.g. the last 2 are almost irrelevant).

I know you agree in principle. You made a good point about the architectures and real world performance. I am just cautioning against throwing out numbers like "50+% peak theoretical performance" when we cannot say for certain if that is true holistically. It would not be fair to say, "MS is throwing out 192 shops now, so it is 40% faster". It is just one fact in a sea of relevant figures.

You know the Inquirer and Spong are going to be plucking this forum dry for information for FUD in their news pieces--best not give them fuel for the fire!

Nothing personal, just kind of tired of all the numbers running around from both sides without getting the "big picture". Sony and MS are actively participating in this... I just want the facts!!! ;)

EDIT:

Tim said:
Edit again: Really, I have been messing things up. The xGPU numbers are real factual numbers - the RSX ones are only based on rumors/leaked specs of the G70 (I don't know how I got it into my mind that these numbers were official; I am going to bed now to get some sleep before I make any more of an ass out of myself).

Not your fault... there is a lot of gunk getting pushed around as fact right now. We are all trying to separate the truth from the hype. You are doing a good job :D
 
I'm confused by something. If the same ALUs are doing both vertex and pixel processing, at what point and how do the triangles get mapped to screen space (i.e. triangle setup)? The ATI block diagram shows scan conversion prior to the ALUs, but that's not possible as the triangles need to be transformed first, right?
 
That diagram seems pretty useless in terms of dataflow (huge gaps, basically). The only value it has is in terms of an overview of the functional blocks.

Jawed
 