Xenos as Physics Processor?

Jawed said:
R520 has 16 pixel shader pipes and 8 vertex shader pipes - 24 total. Xenos has 48 pipes of roughly the same capability (but running at 500MHz instead of 625MHz).

But Xenos has other serious efficiency gains, such as the unified architecture, and the "unstoppable" ROPs (EDRAM).

So, overall, Xenos should be around twice as fast+ as R520.

I don't understand why people think 7800GTX/RSX is in the same ballpark as Xenos.

Jawed

*cough*

Xenos has 3 pipes, so it must suck compared to R520!

:!:

Geez, it's like a time warp here...

How about comparing R520 32bit programmable shader Flops to Xenos?


R520 estimate for XT (can someone verify...?),

VS ~ [8 x Vec4-madds x 8 Flops/cycle + 8 x scalar-madds x 2 Flops/cycle ] x 0.625 GHz
~ 50 GFlops

PS ~ [16 x vec3-adds x 3 Flops/cycle + 16 x vec3-madds x 6 Flops/cycle + 16 x scalar-adds x 1 Flop/cycle + 16 x scalar-madds x 2 Flop/cycle] x 0.625 GHz

PS ~ [48+96+16+32]x 0.625 GHz
PS ~ [192]x 0.625 GHz
PS ~ 120 GFlops

R520 XT ~ 50 + 120 ~ 170 Gflops, 32bit peak
Xenos ~ 216-240 Gflops, 32bit peak

That's nowhere near twice!
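
For anyone who wants to check the sums, the estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in the post: the unit counts, flops-per-cycle figures and 625MHz clock are the poster's assumptions, the 216-240 range is the quoted Xenos figure, and the names (`peak_gflops`, `R520_VS`, `R520_PS`) are made up for illustration.

```python
# Peak 32-bit programmable shader throughput for R520 (X1800 XT),
# using the per-unit figures from the post (assumed, not vendor data).
R520_CLOCK_GHZ = 0.625

# Each entry: (number of units, flops per unit per cycle)
R520_VS = [(8, 8),   # vec4 MADDs
           (8, 2)]   # scalar MADDs
R520_PS = [(16, 3),  # vec3 ADDs
           (16, 6),  # vec3 MADDs
           (16, 1),  # scalar ADDs
           (16, 2)]  # scalar MADDs


def peak_gflops(units, clock_ghz):
    """Sum flops per cycle over all units, then scale by the clock in GHz."""
    return sum(count * flops for count, flops in units) * clock_ghz


vs = peak_gflops(R520_VS, R520_CLOCK_GHZ)
ps = peak_gflops(R520_PS, R520_CLOCK_GHZ)
r520_total = vs + ps

# Quoted Xenos peak range from the thread.
XENOS_PEAK_RANGE = (216.0, 240.0)
ratios = tuple(x / r520_total for x in XENOS_PEAK_RANGE)

print(f"VS {vs:.0f} + PS {ps:.0f} = {r520_total:.0f} GFLOPs peak")
print(f"Xenos / R520 peak ratio: {ratios[0]:.2f}x to {ratios[1]:.2f}x")
```

On these peak numbers alone the ratio lands between roughly 1.3x and 1.4x. How much of each peak is usable in practice is a separate question that these figures can't answer.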
 
Jaws said:
*cough*

Xenos has 3 pipes, so it must suck compared to R520!

:!:

Geez, it's like a time warp here...

How about comparing R520 32bit programmable shader Flops to Xenos?


R520 estimate for XT (can someone verify...?),

VS ~ [8 x Vec4-madds x 8 Flops/cycle + 8 x scalar-madds x 2 Flops/cycle ] x 0.625 GHz
~ 50 GFlops

PS ~ [16 x vec3-adds x 3 Flops/cycle + 16 x vec3-madds x 6 Flops/cycle + 16 x scalar-adds x 1 Flop/cycle + 16 x scalar-madds x 2 Flop/cycle] x 0.625 GHz

PS ~ [48+96+16+32]x 0.625 GHz
PS ~ [192]x 0.625 GHz
PS ~ 120 GFlops

R520 XT ~ 50 + 120 ~ 170 Gflops, 32bit peak
Xenos ~ 216-240 Gflops, 32bit peak

That's nowhere near twice!

I'm not sure how it *could* be done, but do you account for the higher efficiency and EDRAM mentioned by Jawed in these calculations?

J
 
Bill said:
Each pipe in Xenos is one ALU. Each in R520 is two ALUs (at least). I don't see how they're comparable (but hoping they are).
What makes you think the Xenos pipeline is less capable? We don't have a detailed description of the Xenos pipeline :cry:

I just have a hard time believing ATI could top a 320m transistor part (R520) with a 232m part. If they could, why the hell wouldn't they do that on the desktop!!?? You don't need the EDRAM there.
I'd like to provide a detailed transistor audit of the two - but I can't. I'm trying in various other threads, but it's slow work.

So it's just a guess really. I think you're over-emphasising the ALU counts and forgetting that architecture counts for a lot now, particularly when you have out of order scheduling and unified shading.

Jawed
 
expletive said:
I'm not sure how it *could* be done, but do you account for the higher efficiency and EDRAM mentioned by Jawed in these calculations?

J

You won't be able to determine efficiency from these numbers as these are peak. Real world benchmarks would determine efficiency...
 
Jaws, until you can show that there's no mini-ALU in Xenos's pipeline, you're out on another one of your pseudo-science limbs.

I don't see any reason why Xenos doesn't have the mini-ALU. So if you can come up with a convincing argument...

Jawed
 
Jawed said:
Jaws, until you can show that there's no mini-ALU in Xenos's pipeline, you're out on another one of your pseudo-science limbs.

I don't see any reason why Xenos doesn't have the mini-ALU. So if you can come up with a convincing argument...

Jawed

Err...this is nothing new. The peak shading power, i.e. 32 bit programmable GFlops, has been known for ages. It's 216-240 Gflops depending on who you ask. It's irrelevant whether it has mini-ALUs or not, as they would've been included in those figures. Comparing peak R520 with peak Xenos, the gap is nowhere near the "twice" of your hyperbolic claim.

Back to basics, comparing pipe numbers is useless with these architectures.
 
Jawed said:
What makes you think the Xenos pipeline is less capable? We don't have a detailed description of the Xenos pipeline :cry:

Simple logic dictates that -- what we've seen from the Xenos hasn't been 2x the capabilities of R520 (the "devs haven't had the time!" card doesn't really work -- if Xenos was truly 2x, or anywhere near, the power it would be doing a lot more than 720p at 30fps with 2x AA), and "2x" the power from 2/3 the transistors is a bit absurd (and amazing if true on some planet). Even with ~60% efficiency vs 100%, that would only account for the transistor budget being reduced, not a 2x power gain. It just seems transistor for transistor the theoretical power is going to be about the same -- there is no magic wand to get 2x the capabilities out of the same transistor budget (especially when you have some of the best engineers working on it). It just seems absurd that anyone would think Xenos would be substantially more powerful than stuff in the same generation (or available in the same 6-month window -- R520, G70, RSX) -- I'll grant the efficiency card making up for the transistor difference (and maybe a bit extra even), but I cannot see where you get the colossal power difference outside of that. Logic dictates that 48 "pipes" in 232m transistors (with ~15% redundancy by your calculations) shouldn't beat a ~320m transistor monster (at a higher clockspeed as well)... especially when it's from the same company and engineering talent.

Don't get me wrong -- I would love to be wrong in this case (who wouldn't love a system you could get in 1.5 months that has 2x the power of the most high-end GPU, one you can't even buy for another 3 weeks??), but I just can't believe it. I think part of it might be because in the past 20 years we've never had a two-fold power increase in the same transistor count (often much more like 10-20% if you're lucky) in a given field -- the technology field usually works in evolutions, not revolutions (I'd call 2x the performance in the same transistor count -- counting efficiency as evening the transistor counts -- a revolution). Call me cynical though, please!
 
Like I said, until you can show that Xenos doesn't have a mini-ALU you're barking up the wrong tree.

The mini-ALU is an integral part of the GPU pipeline. It's just not normally counted. I don't know why, but there it is.

Until recently no-one outside of ATI apparently understood that there was the extra ADD capability in R3xx...R4xx due to the mini-ALU.

Have you been counting the mini-ALU in RSX? You do realise it can MUL and ADD (perhaps MAD?), don't you? And that there's two of them?

Acert dug this up earlier today:

http://www.hardspell.com/newsimage/2005-6-21-16-10-14-654986702.gif

I don't see any mention of the mini-ALUs on there. Whoops Jaws, back to square one :devilish:

Also, while you're at it, would you care to explain how a 170GFLOPs X1800XT is as fast as a 313GFLOPs 7800GTX? Or faster? :oops:

Jawed
 
Jawed said:
Like I said, until you can show that Xenos doesn't have a mini-ALU you're barking up the wrong tree.

The mini-ALU is an integral part of the GPU pipeline. It's just not normally counted. I don't know why, but there it is.

Until recently no-one outside of ATI apparently understood that there was the extra ADD capability in R3xx...R4xx due to the mini-ALU.

*shakes head*

Stop clinging onto your mini alu theory. Don't you think if ATI/MS could claim some more PR GFLOPS they would! They have NOT. It's 216-240 GFLOPS 32 bit. PEEEEAAAAKKKKK.

Like I already said, it would be included if it was valid!

Jawed said:
Have you been counting the mini-ALU in RSX? You do realise it can MUL and ADD (perhaps MAD?), don't you? And that there's two of them?

A) THIS has NOTHING to do with RSX.

B) IF you read my post, you'd realise 32 BIT programmable PEAK flops were explicitly stated.

Jawed said:
Acert dug this up earlier today:

http://www.hardspell.com/newsimage/2005-6-21-16-10-14-654986702.gif

I don't see any mention of the mini-ALUs on there. Whoops Jaws, back to square one :devilish:

*Shakes head again*

What has that diagram got to do with my post?

They haven't stated what BIT! That diagram has been analysed many times on this forum...AND their mini alus are the SCALAR units!

Jawed said:
Also, while you're at it, would you care to explain how a 170GFLOPs X1800XT is as fast as a 313GFLOPs 7800GTX? Or faster? :oops:

Jawed

See above. They include 16bit Flops with 32bit flops.

Geez, talk about going over old ground again...this forum has no memory...

Just to reiterate, NO, Xenos is nowhere near TWICE the R520, whatever your latest baseless, hyperbolic claim says.

Get back to reality dude...
 
Jawed said:
Also, while you're at it, would you care to explain how a 170GFLOPs X1800XT is as fast as a 313GFLOPs 7800GTX? Or faster? :oops:
Anyone dealing with theoretical numbers with any sense of due diligence would note that theoretical peaks are just that, theoretical, and they should be taken with a large grain of salt. Architecture and its impact on real-world utilization are the most important aspects of any design. It is what you can use that is important, not what is there. If you cannot realistically utilize the performance on the chip, then for all practical purposes counting it in some fantastical metric of which chip is better is really useless.

Interestingly, ATI has been quoted as saying current GPU architectures, including their own, are only 50-70% efficient.

50-70% of 170GFLOPs is 85-119GFLOPs of "real world utilization". And that does not begin to take the ROPs into consideration.

Last time I checked 119GFLOPs was about half of 240GFLOPs.
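
As a rough check on that arithmetic, here is a small Python sketch. The 50-70% range is the ATI quote mentioned above, and 170 and 240 are the peak figures used earlier in the thread. The helper name `utilized_range` is invented for illustration, and the comparison treats the Xenos figure as fully utilized, which is an assumption rather than a measurement.

```python
def utilized_range(peak_gflops, low=0.50, high=0.70):
    """Return (low, high) usable GFLOPs for a peak figure and a utilization band."""
    return round(peak_gflops * low, 1), round(peak_gflops * high, 1)


R520_PEAK = 170.0   # peak estimate from earlier in the thread
XENOS_PEAK = 240.0  # upper quoted peak; assumed ~100% utilized here

r520_low, r520_high = utilized_range(R520_PEAK)
share_of_xenos = r520_high / XENOS_PEAK

print(f"R520 usable: {r520_low}-{r520_high} GFLOPs")
print(f"Best-case R520 as a fraction of Xenos peak: {share_of_xenos:.2f}")
```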

Bobbler said:
what we've seen from the Xenos hasn't been 2x the capabilities of R520 (the "devs haven't had the time!" card doesn't really work -- if Xenos was truly 2x, or anywhere near, the power it would be doing a lot more than 720p at 30fps with 2x AA),
This assumes that the bottleneck on titles is the GPU. I would say this assumption is wrong. We have already heard of a number of cases of developers offloading tasks to the other CPUs and the framerate improving dramatically.

Xenon is a tricore in-order PPC chip with shared cache. This is a very different environment than the PC/Xbox--where most of the devs come from--which had a single large OOO x86 processor and on the PC had a bit more cache per core.

As for their kits, they got final Beta Kits in August, as confirmed by IGN's recent editorial, and there were numerous delays after E3 in getting material and transition kits out.

Almost all the launch titles are Xbox or PC ports to some degree. Xenos, like any specialized hardware, needs to be taken into consideration in the design stages to get the best performance.

The fact that a number of Xenos features, like hardware tessellation, are not being used by many devs is kind of indicative of the state of affairs: They don't have the time to create custom engines from the ground up, testing what works and does not work with the real hardware and then choosing the right engine path to exploit the strengths of the architecture. We just are not seeing that for obvious reasons.

Xenos ran the ATI R520 demos quite well. I think as developers transition to shader heavy code (which is Xenos' forte) that takes its unique design features into consideration that we will see it perform well.

Judging any console on launch titles is kind of scary. They have been developed with paper specs in hand and not much else. I still remember the PS2 launch, which anyone who remembers it would not classify as a launch that really showcased what was later to be produced on the system.

I guess the proof of who is right will be in the future games in late fall 2006 and into 2007, when we see the first games written for the Xbox 360 architecture from the ground up appear. In this regard I must give some praise to Sony for partnering with an IHV that had SLI with GPUs of similar features to design on. This gives PS3 devs a good head start on the architecture to get the most out of their launch titles.
 
Jawed said:
R520 has 16 pixel shader pipes and 8 vertex shader pipes - 24 total. Xenos has 48 pipes of roughly the same capability (but running at 500MHz instead of 625MHz).

But Xenos has other serious efficiency gains, such as the unified architecture, and the "unstoppable" ROPs (EDRAM).

So, overall, Xenos should be around twice as fast+ as R520.

I don't understand why people think 7800GTX/RSX is in the same ballpark as Xenos.

Jawed

But isn't the Xenos more pixel output limited than the R520???
Since the Xenos technically has 8 output pipes???

500MHz x 8 pipes = 4 Gigapixels/sec
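
That figure is just clock times output pipes. A minimal Python sketch, assuming one pixel per pipe per clock (a simplification; the function name is illustrative only):

```python
def peak_fill_rate_gpixels(clock_mhz, output_pipes):
    """Peak pixel output in gigapixels/sec, assuming 1 pixel per pipe per clock."""
    return clock_mhz * output_pipes / 1000.0


xenos_fill = peak_fill_rate_gpixels(500, 8)
print(f"Xenos peak fill rate: {xenos_fill} Gigapixels/sec")
```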
 
LunchBox said:
But isn't the Xenos more pixel output limited than the R520???
Since the Xenos technically has 8 output pipes???

500MHz x 8 pipes = 4 Gigapixels/sec

Pixel throughput is no longer an issue, especially with a fixed resolution. It's all about math now.
 
GPUs

Acert93 said:
Interestingly, ATI has been quoted as saying current GPU architectures, including their own, are only 50-70% efficient.

50-70% of 170GFLOPs is 85-119GFLOPs of "real world utilization". And that does not begin to take the ROPs into consideration.

Last time I checked 119GFLOPs was about half of 240GFLOPs.

Assuming RSX is overclocked G70 using your comparison method:

7800GTX
PS = 139.3-195.02 Gflops
Total = 156.5-219.1 Gflops
Peak = 313.4 Gflops

RSX
PS = 178.2-249.48 Gflops
Total = 200.2-280.28 Gflops
Peak = 400.4 Gflops

Xenos (assuming 100% efficiency)
Total = 240 Gflops

Also, why is R520 transistor count so high? If Xenos = 240Gflops with 232M, then 320M for only 170Gflops peak sounds inefficient, no? Maybe some information missing?
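
To put a number on that question, peak GFLOPs per million transistors can be worked out from the figures quoted in this thread. This is a sketch only: the transistor counts and peak figures are the ones posted above, not verified specs, and whole-chip transistor counts include logic that does no shader math, so the metric is crude. The names are invented for illustration.

```python
# (peak GFLOPs, transistors in millions), as quoted in the thread
CHIPS = {
    "Xenos": (240.0, 232.0),
    "R520": (170.0, 320.0),
}


def gflops_per_mtransistor(peak_gflops, mtransistors):
    """Crude density metric: peak GFLOPs per million transistors."""
    return peak_gflops / mtransistors


density = {name: gflops_per_mtransistor(*spec) for name, spec in CHIPS.items()}
for name, value in density.items():
    print(f"{name}: {value:.2f} peak GFLOPs per million transistors")
```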

Has anyone read this?

http://www.anandtech.com/video/showdoc.aspx?i=2552&p=10
 
Jaws said:
Just to reiterate, NO, Xenos is nowhere near TWICE the R520, whatever your latest baseless, hyperbolic claim says.

Get back to reality dude...

Full ACK. Even Microsoft does not claim that it's that fast (only faster than 2x6800 U iirc). And they certainly would if there was any indication that it is the case.

And once again, the 48 "Shader units" are nowhere close to the performance of a fixed pixel or vertex pipeline (per "unit"), that's the tradeoff of a unified architecture.
 
ihamoitc2005 said:
Assuming RSX is overclocked G70 using your comparison method:

7800GTX
PS = 139.3-195.02 Gflops
Total = 156.5-219.1 Gflops
Peak = 313.4 Gflops

RSX
PS = 178.2-249.48 Gflops
Total = 200.2-280.28 Gflops
Peak = 400.4 Gflops

Xenos (assuming 100% efficiency)
Total = 240 Gflops

Also, why is R520 transistor count so high? If Xenos = 240Gflops with 232M, then 320M for only 170Gflops peak sounds inefficient, no? Maybe some information missing?

Has anyone read this?

http://www.anandtech.com/video/showdoc.aspx?i=2552&p=10

In regards to those benchmarks it shows the classic strengths of the nVidia Geforce cards and that they have a superior implementation of OpenGL not to mention the benefit of the ultrashadow technology that is in use in Doom3. In any case...

I was wondering about the transistor count also... and looking at the diagrams and die shot it occurs to me that there are two things that may be taking up a lot of that transistor count... one being that "General Purpose Register Array" and the other being that "Render Back End".

But yes I would say that there is some information that is missing.

Nemo80 said:
Full ACK. Even Microsoft does not claim that it's that fast (only faster than 2x6800 U iirc). And they certainly would if there was any indication that it is the case.

And once again, the 48 "Shader units" are nowhere close to the performance of a fixed pixel or vertex pipeline (per "unit"), that's the tradeoff of a unified architecture.

I would have to disagree... a shader unit is a shader unit, and both the Geforce 7800 and XENOS use a Vec4+scalar shader unit. What is different is the arrangement of those shader units between XENOS and the R520 as well as the NV50 GPUs.
 
R520 and G70 differences

If R520 VS & PS units similarly capable as G70 VS & PS, then R520 = 320Gflops no? If similar to G70, then @ 50-70% average utilization R520 = 160-224Gflops

Peak
VS=50 Gflops
PS=270 Gflops

But if not similar what are differences? If VS and PS units very different, then for what functions are so many transistors utilized?

As for USA, most of what is called "efficiency", or better termed "utilization", in USA model merely resistance to slow-down when vertex or pixel shader load increases but not both. When both increase, then slow-down inevitable. No substitute for size. USA merely makes small size less of a liability no?

I think better thread handling should make peak performance more likely. Maybe that is why sometimes R520 even outperforms 7800GTX in some situations, although it could also be because those games need more vertex shader power and R520 VS much faster than G70 VS. OTOH, sometimes even 7800GT outperforms 625MHz R520. This I do not understand. Maybe R520 PS units very different from G70 PS units after all?

7800GT @ 400mhz
VS= 14-19.6 Gflops (using 50-70% quote from below post)
VS = 28 Gflops (peak)
PS= 108-151.2 Gflops (using 50-70% quote from below post)
PS = 216 Gflops (peak)
Total= 122-170.8 Gflops (using 50-70% quote from below post)
Peak= 244 Gflops
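
The figures in this post follow from applying the 50-70% utilization quote to each peak number. A small Python sketch of that bookkeeping is below. The peak values are the ones posted above (the R520 row is the hypothetical "as capable as G70" case), and the function and dictionary names are invented for illustration.

```python
UTILIZATION = (0.50, 0.70)  # the 50-70% quote used in the post

# Peak GFLOPs per stage, as posted.
PEAKS = {
    "R520 (if G70-like)": {"VS": 50.0, "PS": 270.0},
    "7800GT @ 400MHz": {"VS": 28.0, "PS": 216.0},
}


def scaled(peak):
    """Apply the low and high utilization factors to one peak figure."""
    return tuple(round(peak * u, 1) for u in UTILIZATION)


table = {}
for chip, stages in PEAKS.items():
    total_peak = sum(stages.values())
    table[chip] = {
        "VS": scaled(stages["VS"]),
        "PS": scaled(stages["PS"]),
        "Total": scaled(total_peak),
        "Peak": total_peak,
    }

for chip, row in table.items():
    print(chip, row)
```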
 
The GameMaster said:
In regards to those benchmarks it shows the classic strengths of the nVidia Geforce cards and that they have a superior implementation of OpenGL not to mention the benefit of the ultrashadow technology that is in use in Doom3. In any case...

I was wondering about the transistor count also... and looking at the diagrams and die shot it occurs to me that there are two things that may be taking up a lot of that transistor count... one being that "General Purpose Register Array" and the other being that "Render Back End".

But yes I would say that there is some information that is missing.

Thank you for furthering my understanding. Seems API makes a big difference. What are implications of larger general purpose register and render back end?
 
Jaws said:
Stop clinging onto your mini alu theory. Don't you think if ATI/MS could claim some more PR GFLOPS they would! They have NOT. It's 216-240 GFLOPS 32 bit. PEEEEAAAAKKKKK.
Like I said, no-one counts mini-ALUs. NVidia hasn't counted the mini-ALUs in NV40/G70/RSX.

Like I already said, it would be included if it was valid!
Nope, I've never seen it included. It might be because mini-ALUs have such limited applicability and are hardly ever used.

What has that diagram got to do with my post?
I just wanted to show you how, according to your crazy pseudo-science a G70 which has twice the GFLOPs (not including mini-ALUs) of X1800XT (including mini-ALUs) is not twice as fast.

When will you get over the fact that peak GFLOPs are meaningless? You've been peddling this nonsense for 6 months now.

They haven't stated what BIT! That diagram has been analysed many times on this forum...AND their mini alus are the SCALAR units!
Nope, the scalar part of the vec4+scalar (VS) or two vec3+scalar (PS) is not the mini-ALU. You really need to pay attention.

See above. They include 16bit Flops with 32bit flops.
FP16 normalise is not a function of the two mini-ALUs in G70/RSX pipeline. It's an entirely separate function.

Geez, talk about going over old ground again...this forum has no memory...
Well if you insist on polluting discussions with irrelevant GFLOPs nonsense...

Just to reiterate, NO, Xenos is nowhere near TWICE the R520, whatever your latest baseless, hyperbolic claim says.

Get back to reality dude...
The first evidence will come with R580...

Unification of the shader architecture is going to increase utilisation further.

Xenos will be texture-bandwidth limited to the same degree as R520/R580 as both architectures have the same texturing capability (although R520/580 may have 20-40% faster caches). So any texture-limited games will not show any improvement in Xenos.

But games that are not texture bandwidth limited (going forwards this should be the norm for next-gen games) will easily get 100% faster in Xenos over R520. The combination of unified shader efficiency and twice the total pipelines will see to that.

Jawed
 
twice the pipelines?

Jawed said:
But games that are not texture bandwidth limited (going forwards this should be the norm for next-gen games) will easily get 100% faster in Xenos over R520. The combination of unified shader efficiency and twice the total pipelines will see to that.

Xenos has twice the pipelines as R520 but 2/3 transistor count?
 
ihamoitc2005 said:
Xenos has twice the pipelines as R520 but 2/3 transistor count?

I think it's because they took out the unimportant parts...

like what Nvidia is rumoured to be doing with the RSX...

like take out the video accelerator thingamajig...
 