Xbox 360 1 Teraflop of Performance Explained

This thread is getting boring, guys, because you're all either burnt out or lazy, so I thought I'd add some spice. Remember this beaut?
[Image: 2005-6-21-16-10-14-654986702.gif]

Now do some math, I'll work on it too, but it might take me a while.
 
Alpha_Spartan said:
Jaws said:
Alpha_Spartan said:
Okay, so now let's talk about how Sony attained their 2TFLOP figure.

See my link above...
I saw it, I just want to see the math.

There's going to be a little variation on what people might explain to you (i.e. with regard to GPU's pixel shaders). But one possible conservative stab at it (same format as with the X360 slide):

CPU:
- 12 flops / clock cycle (PPE) + 8 flops / clock cycle (per SPE)
- (12 x 3.2 GHz x 1 PPE = 38.4 Gflops) + (8 x 3.2 GHz x 7 SPEs = 179.2 Gflops) = 217.6 Gflops

GPU:
Vertex Shaders: 10 flops / clock cycle
- 10 x 550 MHz x 8 Vertex Shaders = 44 Gflops

Pixel Shaders: 16 flops / clock cycle (only counting main ALUs)
- 16 x 550 MHz x 24 Pixel Shaders = 211.2 Gflops

Total programmable GPU flops: 255.2 Gflops

Non-programmable flops: 1544.8 Gflops

Total = 2017.6 Gflops (217.6 CPU + 255.2 + 1544.8 GPU), i.e. roughly 2 Tflops

If you were to count the mini ALUs in the Pixel shaders, that would boost the Pixel Shader flop rating to 20 flops / clock cycle (264 Gflops total for pixel shaders, 308 programmable Gflops total for RSX).

If you were to count the mini ALUs and the 16-bit normalise, that would boost the Pixel Shader flop rating to 27 flops / clock cycle (356.4 Gflops total for pixel shaders, 400.4 programmable Gflops total for RSX).

This is based on PC Watch Impress coverage of G70 and RSX (http://pc.watch.impress.co.jp/docs/2005/0701/kaigai195.htm)
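
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same sums. Every per-clock figure, clock speed and unit count is simply copied from the post above; the function names are made up for illustration, and the 1544.8 Gflops non-programmable figure is plugged in as quoted, not derived.

```python
# Sketch only: all figures below are the ones quoted in the post above,
# not values taken from vendor documentation.

CELL_CLOCK_GHZ = 3.2
RSX_CLOCK_GHZ = 0.55               # 550 MHz
NON_PROGRAMMABLE_GFLOPS = 1544.8   # quoted figure, not derived here


def gflops(flops_per_clock, clock_ghz, units):
    """Peak Gflops = flops per clock x clock in GHz x number of units."""
    return flops_per_clock * clock_ghz * units


def cell_gflops():
    """1 PPE at 12 flops/clock plus 7 SPEs at 8 flops/clock each."""
    return gflops(12, CELL_CLOCK_GHZ, 1) + gflops(8, CELL_CLOCK_GHZ, 7)


def rsx_programmable_gflops(pixel_flops_per_clock=16):
    """8 vertex shaders at 10 flops/clock plus 24 pixel shaders.

    pixel_flops_per_clock: 16 (main ALUs only), 20 (plus mini ALUs)
    or 27 (plus mini ALUs and 16-bit normalise), as in the post.
    """
    vertex = gflops(10, RSX_CLOCK_GHZ, 8)
    pixel = gflops(pixel_flops_per_clock, RSX_CLOCK_GHZ, 24)
    return vertex + pixel


def system_total_gflops(pixel_flops_per_clock=16):
    """CPU + programmable GPU + quoted non-programmable GPU flops."""
    return (cell_gflops()
            + rsx_programmable_gflops(pixel_flops_per_clock)
            + NON_PROGRAMMABLE_GFLOPS)


if __name__ == "__main__":
    print(f"Cell: {cell_gflops():.1f} Gflops")
    for per_clock in (16, 20, 27):
        print(f"RSX programmable at {per_clock} flops/clock: "
              f"{rsx_programmable_gflops(per_clock):.1f} Gflops")
    print(f"System total (conservative): {system_total_gflops():.1f} Gflops")
```

With the conservative 16 flops / clock count the three components add up to 217.6 + 255.2 + 1544.8 = 2017.6 Gflops, and the 27 flops / clock case gives 356.4 + 44 = 400.4 programmable Gflops for RSX.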
 
The constant in all of this seems to be the non-programmable FLOPS. How is that derived?

Secondly, I can see how C1's ProgFLOPS rating is clear cut because theoretically all 48 shaders can EITHER be processing all pixels or all vertices on a given clock.

However, with the RSX am I right to conclude that it can do vertex AND pixel shading on the same clock?
 
Titanio said:
Alpha_Spartan said:
The constant in all of this seems to be the non-programmable FLOPS. How is that derived?

The total figure claimed minus programmable flops ;)
Aye, so we are trusting Microsoft & Sony more than our math. :LOL:
 
So if the PS3 has a 2-to-1 programmable flops advantage, and people still say there will be no visual difference, then why do these companies release the flops numbers?

Do they really mean nothing or are people not giving credit to something? I don't know that's why I'm asking.
 
mckmas8808 said:
So if the PS3 has a 2-to-1 programmable flops advantage, and people still say there will be no visual difference, then why do these companies release the flops numbers?

Do they really mean nothing or are people not giving credit to something? I don't know that's why I'm asking.
It seems as if the RSX is able to do vertex and pixel operations on the same clock.
 
serenity said:
Aye, so we are trusting Microsoft & Sony more than our math. :LOL:

Exactly. It's pretty much impossible to derive the "non-programmable flops" figures.
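
To make the point concrete, here is a small Python sketch of the "claimed total minus programmable" subtraction. The only inputs are numbers already posted in this thread; the 1800 Gflops GPU claim is not quoted anywhere above, it is just what 1544.8 + 255.2 implies, and the function name is hypothetical.

```python
# Sketch: the "non-programmable" figure treated as a residual, not a measurement.

def non_programmable_gflops(claimed_total, programmable):
    """Whatever is left of the claimed figure after the countable flops."""
    return claimed_total - programmable


# Implied by the thread's own numbers (assumption: the claim was GPU-only).
IMPLIED_GPU_CLAIM = 1544.8 + 255.2

# Programmable totals for RSX as counted earlier in the thread.
counts = {
    "main ALUs only": 255.2,
    "plus mini ALUs": 308.0,
}

for label, programmable in counts.items():
    residual = non_programmable_gflops(IMPLIED_GPU_CLAIM, programmable)
    print(f"{label}: {residual:.1f} non-programmable Gflops")
```

The residual moves whenever you change what you count as programmable, which is why it can't be checked independently of the claimed total.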

mckmas8808 said:
So if the PS3 has a 2-to-1 programmable flops advantage, and people still say there will be no visual difference, then why do these companies release the flops numbers?

It's extraordinary how focussed this industry is on technical specs. When a new microwave gets announced, I doubt they go into the little nuts and bolts of its performance and how it works ;) It's just the way it is, really. There's always a "numbers game". Flops happen to be a standard metric, for better or worse.

mckmas8808 said:
Do they really mean nothing or are people not giving credit to something? I don't know that's why I'm asking.

I guess wait and see.

Alpha_Spartan said:
It seems as if the RSX is able to do vertex and pixel operations on the same clock.

Every GPU with discrete vertex and pixel units can. Vertex and pixel shaders operate independently and in parallel.
 
Alpha_Spartan said:
Secondly, I can see how C1's ProgFLOPS rating is clear cut because theoretically all 48 shaders can EITHER be processing all pixels or all vertices on a given clock.

No, there are 64 threads concurrently running on Xenos (48 shader, 16 texture) in any combination of vertex or fragment (pixel) shading - the threads are batched as blocks of 16, vertex-or-fragment:

4-0
3-1
2-2
1-3
0-4
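
One way to read that list, sketched in Python. This assumes each entry is "vertex blocks - fragment blocks" and that there are 64 / 16 = 4 blocks to split; both are my reading of the post, not confirmed Xenos behaviour.

```python
# Sketch: enumerate the vertex/fragment splits of the thread blocks.
TOTAL_THREADS = 64        # figure from the post
THREADS_PER_BLOCK = 16    # figure from the post
BLOCKS = TOTAL_THREADS // THREADS_PER_BLOCK


def block_splits(blocks=BLOCKS):
    """Return (vertex_blocks, fragment_blocks) pairs, all-vertex first."""
    return [(v, blocks - v) for v in range(blocks, -1, -1)]


for vertex_blocks, fragment_blocks in block_splits():
    print(f"{vertex_blocks}-{fragment_blocks}: "
          f"{vertex_blocks * THREADS_PER_BLOCK} vertex threads, "
          f"{fragment_blocks * THREADS_PER_BLOCK} fragment threads")
```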

Jawed
 
Jawed said:
Alpha_Spartan said:
Secondly, I can see how C1's ProgFLOPS rating is clear cut because theoretically all 48 shaders can EITHER be processing all pixels or all vertices on a given clock.

No, there are 64 threads concurrently running on Xenos (48 shader, 16 texture) in any combination of vertex or fragment (pixel) shading - the threads are batched as blocks of 16, vertex-or-fragment:

4-0
3-1
2-2
1-3
0-4

Jawed

I'm VERY confused. I thought the ALUs were allocated to vertex or pixel work, ALU by ALU? If it's done in batches of 16, some may lie idle in a clock?
 
It's definitely in blocks of 16 - there have been quite a few debates on this before. Each 'block' of ALUs, of which there are three (obviously), has to have its ALUs doing the same type of work at any given time.
 
dukmahsik said:
Didn't Dave say the 64-thread figure was incorrect?

There's a lack of clarity here, if I remember correctly. There's the capability to perform 32 texture operations concurrently (16 filtered and 16 point-sampled), and I think it's that that lies at the heart of the doubt.

To be honest I don't know what Dave said, just a vague memory - the search facility is terrible on this forum. It's not clear, for example, if the point-sampling operations are actually operating in a pipeline, or if it simply consists of requesting texture data - the thread is then put aside until that texture data is ready in the cache. So the actual point-sampled texture fetch never executes in a "texture pipeline".

So, 64 concurrent threads is the minimum as far as I can tell.

Jawed
 
Titanio said:
I'm VERY confused. I thought the ALUs were allocated to vertex or pixel work, ALU by ALU? If it's done in batches of 16, some may lie idle in a clock?

The GPU always creates 16-sized groups of vertices or pixels to work on. That's the smallest unit for the purposes of load-balancing. If a triangle is smaller than 16 pixels then Xenos will find another triangle to make up the numbers.

When shading triangle edges, some pixels around the edge of the triangle are effectively "empty". GPUs work in 4s of pixels, organised as a quad: 2x2. So some ALUs will indeed be doing "nothing" (they'll be running code but the results are junked). This problem is common to both earlier GPU designs (e.g. RSX) and Xenos.

Because all GPUs shade pixels in groups of 4, the edge of triangles problem is the same whether the ALUs are grouped in 16s or 4s.
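
A rough Python sketch of the quad effect, counting how many pixels a triangle actually covers versus how many get shaded once every touched 2x2 quad is processed whole. The point-in-triangle test at pixel centres is a simplification chosen for illustration (real rasterisers use proper fill rules), and the function names are made up.

```python
# Sketch: covered pixels vs. pixels shaded when work is issued in 2x2 quads.

def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_utilisation(tri, width, height):
    """Return (covered_pixels, shaded_pixels) for one triangle.

    covered_pixels: pixels whose centre lies inside the triangle.
    shaded_pixels:  4 x the number of 2x2 quads containing a covered pixel.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = 0
    quads = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5   # sample at the pixel centre
            w0 = edge(x0, y0, x1, y1, px, py)
            w1 = edge(x1, y1, x2, y2, px, py)
            w2 = edge(x2, y2, x0, y0, px, py)
            # Accept either winding order.
            inside = ((w0 >= 0 and w1 >= 0 and w2 >= 0) or
                      (w0 <= 0 and w1 <= 0 and w2 <= 0))
            if inside:
                covered += 1
                quads.add((x // 2, y // 2))
    return covered, 4 * len(quads)


if __name__ == "__main__":
    tiny = ((0, 0), (1.2, 0), (0, 1.2))
    covered, shaded = quad_utilisation(tiny, 4, 4)
    print(f"covered={covered}, shaded={shaded}, "
          f"wasted lanes={shaded - covered}")
```

For the tiny triangle in the example only one pixel centre is covered, but a whole quad of four is shaded, so three lanes produce results that get thrown away. That ratio depends on the triangle and the quad, not on whether the ALUs are grouped in 4s or 16s.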

It's quite hard to work out the detailed pros and cons of this 16-per-group - is that also the batch size in Xenos (no idea!)? The batch size in RSX is 1024 pixels, organised into groups of 4.

I've been rummaging for meaningful data on this subject but haven't got anything concrete so far...

Jawed
 
Well done :!: Sadly we don't have a definitive answer to the 64/80 question, or whether it's something else entirely (which appears to be a function of texture processing).

Jawed
 