ATI R500 patent for Xenon GPU?

Jaws said:
Well, the 'leak' mentions that each of the 48 ALUs has a 'vector' and a 'scalar' unit. So that's 48 vector and 48 scalar units. I expect the vector unit to be 4-way SIMD (128 bits, 4 * 32 bits) and both units to be capable of 32-bit single precision, handling both floats and integers.

Different peeps, as I'm learning the hard way, have different ways of counting ALUs. ATI counts 76 non-texturing ALUs in the vertex and pixel shaders of X800.

Reviewers seem to count them as 16 pipelines x2 + 6 pipelines x2 = 44 ALUs.

:?

What's on my mind, apart from anything else, is the transistor budget for the ALUs.

If you assume a fixed number of transistors per dimension and per bit of each ALU, then you get this rather interesting result:

b3d02.jpg

(I've edited the picture to show the vector and scalar parts of Jaws's proposal separately)

And that's before you add in SM3 functionality into the ALUs.

Jawed
 
LOL. Jaws's!!!

But I think it might be overkill and you could be talking about a blown transistor budget. There's a whole load more transistors required for SM3 ALUs than for SM2 ALUs.

Still, we're all guessing...

Jawed
 
Jawed said:
LOL. Jaws's!!!

But I think it might be overkill and you could be talking about a blown transistor budget. There's a whole load more transistors required for SM3 ALUs than for SM2 ALUs.

Still, we're all guessing...

Jawed

Hehhe..
Yeah, I saw the numbers.. :D

But one thing about the MHz number: it could be "much" higher. It is "rumoured" that ATI used the Fast14 design tech to get higher MHz from the GPU (or from some of ATI's newer cards).

It should be very interesting to see how J Allard calculates the "over a teraflop of targeted computer performance"... must be some reaaaal fuzzy math going on there...
 
EndR said:
Hehhe..
Yeah, I saw the numbers.. :D

But one thing about the MHz number: it could be "much" higher. It is "rumoured" that ATI used the Fast14 design tech to get higher MHz from the GPU (or from some of ATI's newer cards).
Yeah, I've been wondering about Fast-14, too. After all it's supposed to be for maths.

I believe the leak of the Xbox 360's spec puts R500 at 500MHz. Maybe Fast-14 can produce some nice "multiplier" on that, in small areas of the die. :)

Jawed
 
Jawed said:
EndR said:
Hehhe..
Yeah, I saw the numbers.. :D

But one thing about the MHz number: it could be "much" higher. It is "rumoured" that ATI used the Fast14 design tech to get higher MHz from the GPU (or from some of ATI's newer cards).
Yeah, I've been wondering about Fast-14, too. After all it's supposed to be for maths.

I believe the leak of the Xbox 360's spec puts R500 at 500MHz. Maybe Fast-14 can produce some nice "multiplier" on that, in small areas of the die. :)

Jawed


The leak does say "500+ MHz"... the "+" is important. It also says 256+ MB of RAM, which now seems to be 512 MB according to some rumours, so anything could happen. How much they will "up" the MHz is up in the air, but according to Intrinsity (those behind Fast14), it would be no problem to achieve 1-2 GHz, and it wouldn't add much to the cost...

More than 500 MHz should be a given...
 
Jawed said:
...
What's on my mind, apart from anything else, is the transistor budget for the ALUs.

If you assume a fixed number of transistors per dimension and per bit of each ALU, then you get this rather interesting result:

b3d02.jpg

(I've edited the picture to show the vector and scalar parts of Jaws's proposal separately)

And that's before you add in SM3 functionality into the ALUs.

Jawed

That's an interesting 'index' to compare execution unit horsepower between different architectures. It's not strictly a transistor count but by reducing everything down to 'bit' ALU units, you've sorta got a,

"bit ALU units ops per cycle"

If I may add the clock to that 'index' to get something more accurate, so it's like a,

"bit ALU units ops per second" or GigabitALUOps

R420 @ 0.5 GHz ~ 4608*0.5 ~ 2304 Gbaops

Jawed R500 @ 0.5 GHz ~ 2304 Gbaops

Jaws R500 @ 0.5 GHz ~ 3840 Gbaops

And for interest, I'll throw a CELL processor into the mix,

8 SPEs + PPE ~ 8 vector + (1 vector + 1 scalar) ~ 9 vector + 1 scalar

CELL @ 4 GHz ~ 1184*4 ~ 4736 Gbaops
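
For anyone who wants to check the arithmetic, here is a rough Python sketch of the index. The function names `bit_alu_units` and `gbaops` are made up for illustration. It assumes a vector unit is 4 x 32 bits and a scalar unit is 32 bits, as proposed earlier in the thread. The 4608 total for R420 is taken as given from the figures above rather than derived.

```python
# Assumed widths from the thread: 4-way SIMD vector unit = 4 x 32 bits,
# scalar unit = 32 bits.
VECTOR_BITS = 4 * 32
SCALAR_BITS = 32


def bit_alu_units(vector_units, scalar_units):
    """Total 'bit ALU units' for a mix of vector and scalar units."""
    return vector_units * VECTOR_BITS + scalar_units * SCALAR_BITS


def gbaops(bit_units, clock_ghz):
    """'Gigabit ALU ops' per second: bit ALU units times clock in GHz."""
    return bit_units * clock_ghz


# Jaws's R500 proposal: 48 vector + 48 scalar units at 0.5 GHz.
jaws_r500 = gbaops(bit_alu_units(48, 48), 0.5)

# CELL as counted in this post: 9 vector + 1 scalar units at 4 GHz.
cell = gbaops(bit_alu_units(9, 1), 4)

# R420: the 4608 bit-unit total is quoted from the post, not derived here.
r420 = gbaops(4608, 0.5)

print(jaws_r500, cell, r420)
```

This reproduces the 3840, 4736 and 2304 figures. It says nothing about real-world performance, only about the size of the index.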

Yes, this is a new unit of measuring performance, Gbaops, to confuse the fanbois! :p

We could also add to these programmable units, the Gbaops for fixed function units too, which would boost the GPU numbers! :p
 
Jawed said:
LOL. Jaws's!!!

But I think it might be overkill and you could be talking about a blown transistor budget. There's a whole load more transistors required for SM3 ALUs than for SM2 ALUs.

Still, we're all guessing...

Jawed

If you're dropping from a 130nm process to a 90nm process, it should give you,

(130/90)^2 ~ 2x

...so approximately 2x as many transistors for a given die area, and this fits nicely with the above index. Increasing the die area is also a possibility, as my guesstimate for R500 is ~240-320 mm² @ 90nm.
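
The shrink arithmetic can be sketched in a couple of lines of Python. The helper name `density_gain` is hypothetical, and this assumes ideal scaling, where area per transistor shrinks with the square of the feature size. Real processes rarely scale that cleanly.

```python
def density_gain(old_nm, new_nm):
    """Ideal transistor-density gain from a linear shrink.

    Area per transistor scales with the square of the feature size,
    so the density gain is (old / new) squared.
    """
    return (old_nm / new_nm) ** 2


gain = density_gain(130, 90)
print(round(gain, 2))  # 2.09
```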
 
EndR said:
Jawed said:
EndR said:
Hehhe..
Yeah, I saw the numbers.. :D

But one thing about the MHz number: it could be "much" higher. It is "rumoured" that ATI used the Fast14 design tech to get higher MHz from the GPU (or from some of ATI's newer cards).
Yeah, I've been wondering about Fast-14, too. After all it's supposed to be for maths.

I believe the leak of the Xbox 360's spec puts R500 at 500MHz. Maybe Fast-14 can produce some nice "multiplier" on that, in small areas of the die. :)

Jawed


The leak does say "500+ MHz"... the "+" is important. It also says 256+ MB of RAM, which now seems to be 512 MB according to some rumours, so anything could happen. How much they will "up" the MHz is up in the air, but according to Intrinsity (those behind Fast14), it would be no problem to achieve 1-2 GHz, and it wouldn't add much to the cost...

More than 500 MHz should be a given...

I doubt you're gonna get anywhere near a jump from 500 MHz to 1-2 GHz for the R500. Fast14 is for power saving, and if you still have the SAME number of execution units on the SAME die size on the SAME process, you're not gonna get anything like that. Unless you're HALVING the number of execution units and DOUBLING the clock, but then you're still getting the same performance, roughly speaking! ;)

Instead of increasing the clock, they've probably just increased the die size to fit more execution units, as the R500 seems to be a transistor/die beast, and saved power that way.
 
I guess I should have asked how much power it takes to do a single multiply-add (32-bit precision IEEE floating point), assuming a 90nm process and a clock speed of, say, 1 GHz.

I was going to scale this number by 240 (48*4 + 48 multiply-add units) to get an estimate of how much power running the proposed number of ALUs at 1-2 GHz would take...
 
Jaws said:
That's an interesting 'index' to compare execution unit horsepower between different architectures. It's not strictly a transistor count but by reducing everything down to 'bit' ALU units, you've sorta got a,

"bit ALU units ops per cycle"
Yes, my aim with this was to compare transistor counts, which is why the right hand column is "Transistor multiplier". I was thinking that per scalar ALU bit there are, say, 10,000 transistors.

Naturally my example excludes the texturing ALUs. I don't know how the sizes of math ALUs and texturing ALUs compare. And what the overheads in terms of memory controllers and registers and whatnot amount to.

There are other limitations too, particularly the fact that R420 is a mix of PS1.4 and PS2.0b, whereas R500 is SM3+. So obviously there are far more transistors per scalar bit op in R500.

We could also add to these programmable units, the Gbaops for fixed function units too, which would boost the GPU numbers! :p
One notable thing that disappears in R500 is fog and blend hardware, as SM3 dictates that this should be performed in code, not as a fixed function stage. Well, that's what Huddy was saying at GDC. Hope I've got that right.

So that's some transistors you won't be spending...

Jawed
 
Jawed said:
...
One notable thing that disappears in R500 is fog and blend hardware, as SM3 dictates that this should be performed in code, not as a fixed function stage. Well, that's what Huddy was saying at GDC. Hope I've got that right.

So that's some transistors you won't be spending...

Jawed

Do you have a link for what the fixed-function requirements for SM3.0 will be?
 
According to Section 2 of this paper:

http://merrimac.stanford.edu/publications/sc03_merrimac.pdf

"In 0.13μm CMOS technology, a 64-bit floating-point unit (FPU) (multiplier and adder) has an area of less than 1mm² and dissipates about 50pJ of energy per operation"

So with 5 such FPUs (1 scalar, 4 arranged in a SIMD configuration) per ALU, and 48 ALUs, my completely uninformed estimate is ~12 Watts at 1 GHz. This is power dissipated solely in the FPUs - it doesn't count anything else (not even register reads).

Unfortunately it's for a 64-bit FPU when GPUs are all using 32 bits. I don't know enough to scale this down to 90nm or to a 32-bit FPU. Can any EEs help out?
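
Spelled out as a small Python sketch, the back-of-envelope estimate looks like this. The name `fpu_power_watts` is made up for illustration. It assumes every FPU retires one operation per clock, and it uses the paper's 50 pJ figure for a 64-bit FPU at 0.13μm unchanged, with no attempt to rescale for 90nm or 32-bit units.

```python
ENERGY_PER_OP_J = 50e-12  # 50 pJ per operation (64-bit FPU, 0.13 um, from the paper)
FPUS_PER_ALU = 5          # 1 scalar + 4 in a SIMD configuration
NUM_ALUS = 48
CLOCK_HZ = 1e9            # 1 GHz


def fpu_power_watts(energy_per_op_j, num_fpus, clock_hz):
    """Power in watts, assuming each FPU does one operation per clock (an assumption)."""
    return energy_per_op_j * num_fpus * clock_hz


total_fpus = FPUS_PER_ALU * NUM_ALUS
power = fpu_power_watts(ENERGY_PER_OP_J, total_fpus, CLOCK_HZ)

print(total_fpus, round(power, 1))
```

Under the same assumptions the estimate scales linearly with clock, so 2 GHz would double it.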

Cheers,
Serge
 