Xenos as Physics Processor?

ihamoitc2005 · Oct 7, 2005

Maybe

LunchBox said:
I think it's because they took out the unimportant parts...

like what Nvidia is rumoured to be doing with the RSX...

like take out the video accelerator thingamajig...

Maybe he is accidentally comparing vertex-shader-like Xenos pipeline with much more powerful pixel-shader pipeline of R520 no? Or maybe I am over-estimating capability of R520 pixel-shader pipelines. RSX PS is dual-issue vec4 and scalar no? What is difference of R520?

Jawed · Oct 7, 2005

Bobbler said:
Simple logic dictates that -- what we've seen from the Xenos hasn't been 2x the capabilities of R520 (the "devs haven't had the time!" card doesn't really work -- if Xenos was truly 2x, or anywhere near, the power it would be doing a lot more than 720p at 30fps with 2x AA), and "2x" the power from 2/3 the transistors is a bit absurd (and amazing if true on some planet).

Devs have had 2 months with XB360 dev kits. Before that they had access to graphics cards with 1/3 or 1/6 of the power. There's a vast array of algorithms that were simply impossible on pre-XB360 dev kits. Of course it's going to take time to extract the power. If you want to dismiss this, then go ahead...

Even with ~60% efficiency vs 100%, that would only account for the transistor budget being reduced, not a 2x power gain. It just seems transistor for transistor the theoretical power is going to be about the same -- there is no magic wand to get 2x the capabilities out of the same transistor budget (especially when you have some of the best engineers working on it).

R520 spends transistors on functions that aren't in Xenos, such as AVIVO, DX6/7/8. R520 also has a monstrous memory controller at its heart that looks to be about 5x bigger than the memory controllers in Xenos.

Additionally Xenos's ROPs are hidden on the EDRAM unit - R520's ROPs could amount to 40/60/80 million transistors (40m is twice Xenos's 20m, since R520 has twice the number of ROPs). They are big muthas.

Xenos also has hardware (including northbridge and Xbox Procedural Synthesis buffering...) that isn't in R520.

What you're partly arguing is that R520's mini-ALUs can't be a feature of Xenos's pipelines because Xenos is so small. The thing is mini-ALUs, because they have such limited functionality, are truly going to be a piddly proportion of the overall transistor budget. In terms of the overall transistor count, they're not relevant.

It just seems absurd that anyone would think Xenos would be substantially more powerful than stuff in the same generation (or available in the same 6-month window -- R520, G70, RSX) -- I'll grant the efficiency card making up for the transistor difference (and maybe a bit extra even), but I cannot see where you get the colossal power difference outside of that. Logic dictates that 48 "pipes" in 232m transistors (with ~15% redundancy by your calculations) shouldn't beat a ~320m transistor monster (at a higher clockspeed as well)... especially when it's from the same company and engineering talent.

15% redundancy? One US is 8% of the die.

Logic dictates? A64 single core is around 110m transistors (depends on cache). P4-600 169m transistors (with humungous cache...).

Somehow, ATI has squeezed 64 shader pipes into about 72m transistors (excluding texturing, scheduling, register file - just the execution pipelines). That's my guess from the die shots.

Until someone can decode the R520 die shot (a RV530 die shot would help enormously - it's 170m transistors) I am definitely shooting in the dark on transistor counts. No doubt about it. Incidentally RV530 has:

75% of the pixel shader pipelines
25% of the texturing pipelines
50% of the ROP pipelines
62.5% of the vertex pipelines
50% of the ring bus capacity
25% of the capability for threads in flight

and is 53% of the die size, roughly.

But each shader pipeline in Xenos is Vec4+scalar, whereas most pipelines in R520 are Vec3+scalar. Logic dictates it's just not going to fit into Xenos. But it does...

The other aspect to remember about Xenos is that it doesn't operate on "quads" of pipelines (one program counter per quad) it operates on 16s of pipelines. So all the overheads to do with shader state, instruction decode and register fetch that you have in R520, per quad, are reduced to 1/4 the overhead in Xenos. Instruction decode and register fetch isn't trivial (most stages of a pipeline are spent in instruction decode and register fetch...). Xenos has 4 groups, R520 has 4 - whereas you'd expect Xenos to have 16 groups if you simply counted Xenos as having 16 quads (I'm including the redundant US).
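The grouping argument above can be put into rough numbers. This is only a sketch: it takes 64 Xenos pipelines (48 plus the redundant 16) and 16 R520 pixel pipelines as quoted in this thread, and the function name is made up for illustration.

```python
def control_groups(pipelines: int, group_size: int) -> int:
    """Number of groups that each need their own program counter,
    instruction decode and register fetch logic."""
    if pipelines % group_size:
        raise ValueError("pipelines must divide evenly into groups")
    return pipelines // group_size


xenos_groups = control_groups(64, 16)   # Xenos works on 16s of pipelines
r520_groups = control_groups(16, 4)     # R520 works on quads
xenos_if_quads = control_groups(64, 4)  # hypothetical: Xenos built from quads

print(xenos_groups, r520_groups, xenos_if_quads)
```

Under these assumptions the per-group overhead is paid 4 times on both chips, instead of 16 times on a quad-based Xenos.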

In conclusion, a comparative audit of transistor counts in the two architectures is still fraught with gaping holes...

Don't get me wrong -- I would love to be wrong in this case (who wouldn't love a system you could get in 1.5 months that has 2x the power of the most high end GPU you can't even buy for another 3 weeks??), but I just can't believe it. I think part of it might be because in the past 20 years we've never had a power increase of 2 fold in the same transistor count (often much more like 10-20% if you're lucky) in a given field -- the technology field usually works in evolutions, not revolutions (I'd call 2x the performance increase in the same transistor count -- counting efficiency as evening the transistor counts -- a revolution). Call me cynical though, please!

What you're forgetting is that Xenos and R580 were roadmapped to be simultaneous releases. R580 has 3x the pixel shader pipelines as R520. The cost, apparently, will be less-efficient dynamic branching because the batch sizes will grow 3x.

Also R600 is the first PC part that's fully unified. Why is it 6-12 months later than Xenos?

I think it's all very well saying "that's incredible, can't be true" when so little is known - but I think the baldest fact, that Xenos has 48 fully functional vec4+scalar pipelines compared with R520's 24 pipelines of similar capability (with DX6/7/8 overheads), is the best clue as to what Xenos can do with shader-limited games.

Jawed

ihamoitc2005 · Oct 7, 2005

Different shader

Jawed said:
I think it's all very well saying "that's incredible, can't be true" when so little is known - but I think the baldest fact, that Xenos has 48 fully functional vec4+scalar pipelines compared with R520's 24 pipelines of similar capability (with DX6/7/8 overheads), is the best clue as to what Xenos can do with shader-limited games.

Xenos
48 x (1 vector alu & 1 scalar alu)
Total: 48 vector alu & 48 scalar alu = 96 alu

R520
VS
8 x (1 vector alu & 1 scalar alu) = 16 alu
PS
16 x (2 vector alu & 2 scalar alu) = 64 alu
Total: 80 alu

Looking at clock speed, R520 actually slightly faster in ALU ops. Only question is, since all Xenos alu = vec4, what to do with vec4 for pixel-shading? Is it advantage or waste?
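A quick sketch of the count above. The post gives no clock speeds, so the 500 MHz (Xenos) and 625 MHz (R520) values are assumptions taken from figures quoted elsewhere in this thread, and counting every ALU as equal is the post's own simplification.

```python
# Clocks in MHz: assumed, not stated in the post above.
XENOS_MHZ = 500
R520_MHZ = 625

# ALU counts as tallied in the post: units x (vector + scalar) ALUs per unit.
xenos_alus = 48 * (1 + 1)
r520_vs_alus = 8 * (1 + 1)
r520_ps_alus = 16 * (2 + 2)
r520_alus = r520_vs_alus + r520_ps_alus

# Billions of ALU issues per second, treating every ALU as one op per clock.
xenos_gops = xenos_alus * XENOS_MHZ / 1000
r520_gops = r520_alus * R520_MHZ / 1000

print(xenos_alus, r520_alus)
print(xenos_gops, r520_gops)
```

With those clocks the totals come out at 48 vs 50 billion ALU issues per second, which is the "slightly faster" margin for R520.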

Mintmaster · Oct 7, 2005

It's really a shame that everyone thinks RSX is around the same speed or faster than Xenos. NVidia has really done a good marketing job.

Just look at the G70. They get 136 ops/cycle by assuming the PS can do what, 5 operations per cycle? If you compare G70 to R300 clock for clock, the performance advantage is maybe ~30%. Even if the Xenos pipelines are only as fast as a 2 generation old ATI architecture (note that R300 is vec3+scalar, Xenos is vec4+scalar, so this is quite conservative), it will still have a ~50% advantage over RSX.

The load sharing with vertex processing is a non-issue, because you rarely have both pixel shaders and vertex shaders under heavy loads simultaneously. I know because I've worked at ATI and studied performance using data unavailable to end users. The main reason you want good vertex performance is for blasting through triangles that have no pixels, i.e. backfaces, off screen triangles, etc.

The only advantages RSX has are fillrate without AA, and filtered texturing. The former doesn't apply for alpha blending (fog, smoke, particles), because RSX get bandwidth bottlenecked. The latter won't apply to shadow-mapped games (e.g. Unreal Engine 3) because you can use the 16 point sampled units for that.

Then you look at HDR, and Xenos has a huge advantage in that it has free AA (versus no AA) as well as a faster FP10 format.

Bobbler said:
Simple logic dictates that -- what we've seen from the Xenos hasn't been 2x the capabilities of R520 (the "devs haven't had the time!" card doesn't really work -- if Xenos was truly 2x, or anywhere near, the power it would be doing a lot more than 720p at 30fps with 2x AA)

Framerate is limited by fillrate, not shader rate, which is about equal to the current generation with 4xAA, and half with 2x or no AA. Not sure why only 2xAA is used, but maybe it has to do with getting used to the tiled rendering or something. Right now, games still rely very much on just texturing performance. Furthermore, we don't know that it is indeed the graphics card that's limiting the framerate, as we all know about the horrors of out of order execution. The developer learning curve is definitely there.

Bobbler said:
, and "2x" the power from 2/3 the transistors is a bit absurd (and amazing if true on some planet). Even with ~60% efficiency vs 100%, that would only account for the transistor budget being reduced, not a 2x power gain. It just seems transistor for transistor the theoretical power is going to be about the same -- there is no magic wand to get 2x the capabilities out of the same transistor budget (especially when you have some of the best engineers working on it). It just seems absurd that anyone would think Xenos would be substantially more powerful than stuff in the same generation (or available in the same 6-month window -- R520, G70, RSX) -- I'll grant the efficiency card making up for the transistor difference (and maybe a bit extra even), but I cannot see where you get the colossal power difference outside of that. Logic dictates that 48 "pipes" in 232m transistors (with ~15% redundancy by your calculations) shouldn't beat a ~320m transistor monster (at a higher clockspeed as well)... especially when it's from the same company and engineering talent.

There are many reasons for this.

First of all, these graphics firms have multiple teams working on different projects in tandem, so you can't say it's the same talent.

Second, just look at NVidia's jump between NV30 and NV40. NV30 was often less than half the speed of R300 in pixel shading unless NVidia hand tuned your shader. NV40 was faster than ATI's next gen. We're talking about a good 4x speed increase with less than 2x the transistors.

Third, the numbers you're quoting aren't comparable. 232M transistors does not include the daughter die, whose eDRAM saves you from needing z-compression, colour-compression with AA, an ultra efficient memory controller, large write caches, etc. The logic on the daughter die also saves blending, z-test, stencil test, and more. G70 and R520 are designed for the PC to use DirectX and OpenGL, so they don't have the flexibility for a radical architecture. Plus, Xenos doesn't need a 2D core, advanced video processing capabilities, and doesn't even need to worry about image output I think.

Fourth, Xenos has a unified architecture and doesn't need vertex shaders. Therefore more die space for general shader processors.

For all these reasons and more, it's more than probable that Xenos really is 2x as fast as current GPU's in pixel shading and 5x faster in heavy vertex shading, even though it 'only has 232M transistors'. Just remember that this doesn't necessarily translate into 2x performance for all scenarios.

Mintmaster · Oct 7, 2005

Jaws,

FLOP ratings are very misleading. Jawed is completely right.

Read my post above. When ATI released R300, They said it was 3-issue: vec3+scalar+texture. G70's 5-issue + free norm would lead you to think it was at least twice as fast, especially at math. This is hardly the case though. It is maybe 30% faster per clock than R300's shaders. Remember, this is comparing 2002 architecture to 2005! R520 was hardly any faster than R300 per clock, and it was labeled as 5-issue.

I fully expect each of Xenos' 48 shader pipelines to be in the same ballpark as RSX's 24 PS pipelines, maybe trailing by 20-25% at most.
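To see what that per-pipeline estimate would imply overall, here is a rough sketch. It reads the 20-25% deficit as a per-clock figure, uses 500 MHz for Xenos, and assumes 550 MHz for RSX (the post does not give a clock); the function name is hypothetical.

```python
XENOS_PIPES, XENOS_MHZ = 48, 500
RSX_PIPES, RSX_MHZ = 24, 550  # RSX clock is an assumption, not from the post


def xenos_vs_rsx(per_pipe_deficit: float) -> float:
    """Ratio of Xenos to RSX shader throughput if each Xenos pipeline
    trails an RSX pixel pipeline by per_pipe_deficit per clock
    (0.25 means 25% slower)."""
    xenos = XENOS_PIPES * (1.0 - per_pipe_deficit) * XENOS_MHZ
    rsx = RSX_PIPES * 1.0 * RSX_MHZ
    return xenos / rsx


for deficit in (0.20, 0.25):
    print(f"{deficit:.0%} slower per pipe -> {xenos_vs_rsx(deficit):.2f}x RSX")
```

Under these assumptions the overall ratio lands between roughly 1.36x and 1.45x, simply because twice the pipeline count outweighs the per-pipeline and clock deficits.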

Jawed · Oct 7, 2005

ihamoitc2005, you've just made the same mistake as Jaws. I can't guarantee it, but I'm fairly sure you'll find a mini-ALU in Xenos's pipeline.

Vec4+scalar for pixel shading looks like it could be a waste. Sadly the amount of publicly available analysis of shaders compiled for existing GPUs - i.e. the use of co-issues and dual-issues - and the utilisation of ALU components is pretty much non-existent.

http://www.beyond3d.com/forum/showthread.php?t=20783

I can't actually dig out the 100-odd instruction shader that's referred to in that thread, that runs at 55% efficiency on NV40.

Jawed

ihamoitc2005 · Oct 7, 2005

Interesting post

Mintmaster said:
Just look at the G70. They get 136 ops/cycle by assuming the PS can do what, 5 operations per cycle? If you compare G70 to R300 clock for clock, the performance advantage is maybe ~30%. Even if the Xenos pipelines are only as fast as a 2 generation old ATI architecture (note that R300 is vec3+scalar, Xenos is vec4+scalar, so this is quite conservative), it will still have a ~50% advantage over RSX.

This claim is difficult to understand. RSX PS = 2xVec4 and 2xScalar, but Xenos US = 1xVec4 and 1xScalar no? So pixel-shader performance/shader unit is 2x for RSX and overall 10% more because of 10% higher clock-speed. But RSX (assuming overclocked G70) still has 8 vertex shader additional with 1xVec4 + 1xScalar.

The load sharing with vertex processing is a non-issue, because you rarely have both pixel shaders and vertex shaders under heavy loads simultaneously.

Assuming all US are on pixel-shading task for entire frame (so CPU must do vertex shader tasks), peak pixel processing performance is 12Gpixel compared to always available 13.2Gpixel for RSX no?
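The 12 vs 13.2 figures appear to come from counting each Xenos unified shader as half of a G70-type pixel shader. A sketch of that arithmetic, assuming 500 MHz for Xenos and 550 MHz for RSX (the 10% clock gap mentioned above); the function name and the "G70-equivalent" unit are illustrative only.

```python
def peak_gpixels(units: int, g70_equiv_per_unit: float, mhz: int) -> float:
    """Peak rate in G70-pixel-shader equivalents, in billions per second."""
    return units * g70_equiv_per_unit * mhz / 1000


# Xenos: 48 unified shaders, each counted as half a G70 pixel shader,
# all assigned to pixel work for the whole frame.
xenos = peak_gpixels(48, 0.5, 500)

# RSX: 24 pixel shaders, always available for pixel work.
rsx = peak_gpixels(24, 1.0, 550)

print(xenos, rsx)
```

Whether a unified shader really is worth half a G70 pixel shader is exactly what the rest of the thread disputes, so this only reproduces the post's own accounting.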

Framerate is limited by fillrate, not shader rate, which is about equal to the current generation with 4xAA, and half with 2x or no AA. Not sure why only 2xAA is used, but maybe it has to do with getting used to the tiled rendering or something.

Fill-rate does not increase with AA. Fill-rate is fill-rate. SSAA or MSAA make no difference in actual pixel-processing speed. SSAA or MSAA makes difference in frame size and both make difference in bandwidth need. But no difference to pixel processing speed.

For all these reasons and more, it's more than probable that Xenos really is 2x as fast as current GPU's in pixel shading and 5x faster in heavy vertex shading. Just remember that this doesn't necessarily translate into 2x performance for all scenarios.

How is it 2x as fast in pixel shading when it has total ALU count = G70 pixel-shaders? Also vertex shading can be very fast if pixel-performance not so important but Xenos is 500M polygon setup rate limited so really polygon performance no different than current PC GPUs.

ihamoitc2005 · Oct 7, 2005

mini-alu = waste

Jawed said:
ihamoitc2005, you've just made the same mistake as Jaws. I can't guarantee it, but I'm fairly sure you'll find a mini-ALU in Xenos's pipeline.

USA architecture is for efficiency, so goal is maximum performance from minimum transistors. It would not be efficient for extra ALU in Xenos unified shader unit. This is because when doing vertex shader operation, this extra ALU capacity will be wasted no? If extra ALU exists as you say, then pixel-shader performance is as much as 24Gpixel through 8 ROP which is 4 Gpixel, which is not efficient. Also, if that is true USA capability, then it would be better to use SSAA and more ROPs for very high quality anti-aliasing no? But such pixel-speed does not exist so this is why only ROPs available and why eDRAM unit is needed to put 2x anti-aliasing. Rest of architecture gives support for single Vec4+Scalar per unified shader.

Vec4+scalar for pixel shading looks like it could be a waste. Sadly the amount of publicly available analysis of shaders compiled for existing GPUs - i.e. the use of co-issues and dual-issues - and the utilisation of ALU components is pretty much non-existent.

http://www.beyond3d.com/forum/showthread.php?t=20783

I can't actually dig out the 100-odd instruction shader that's referred to in that thread, that runs at 55% efficiency on NV40.

Jawed

Yes not much information on this but also keep in mind difference between generalized PC development and closed box environment where developers will cater game to take full advantage of a GPU and also the architecture as whole.

expletive · Oct 7, 2005

Bobbler said:
(the "devs haven't had the time!" card doesn't really work -- if Xenos was truly 2x, or anywhere near, the power it would be doing a lot more than 720p at 30fps with 2x AA),

If you are going to use these numbers I think it's only fair to also mention the games that are using more effects and running faster than their high-end PC counterparts (Oblivion, CoD). I don't think we are seeing the type of performance you mention 'across the board', and it seems that for any game that has a PC counterpart (running on a 7800 GTX) the 360 version is looking better and running faster.

J

Jawed · Oct 7, 2005

ihamoitc2005 said:
USA architecture is for efficiency, so goal is maximum performance from minimum transistors. It would not be efficient for extra ALU in Xenos unified shader unit. This is because when doing vertex shader operation, this extra ALU capacity will be wasted no?

A mini-ALU typically does very minor computations, like x2 or add 1 or clamp to range 0...1. It is normally ignored when talking about GPU power because it is so insignificant.

It has never, to my knowledge, been explicitly included in the calculation of the capability of any GPU. I don't know why you guys think it's so important to count it or not count it. If you're going to count it in one architecture for these fuzzy meaningless GFLOPs and ALU-count comparisons then you need to count it in the other.

I don't take those comparisons seriously, as they completely ignore architectural concepts.

If extra ALU exists as you say, then pixel-shader performance is as much as 24Gpixel through 8 ROP which is 4 Gpixel, which is not efficient.

Eh? Modern GPUs are designed to run lots of shader instructions per pixel. We don't want a GPU to be only capable of 1 instruction per pixel per clock. We want more. Hence lots of pipelines.

Also, if that is true USA capability, then it would be better to use SSAA and more ROPs for very high quality anti-aliasing no?

There may well be transparent texture SSAA in Xenos. Still waiting to find out. SSAA is principally work done by the shaders, not the ROPs.

Yes not much information on this but also keep in mind difference between generalized PC development and closed box environment where developers will cater game to take full advantage of a GPU and also the architecture as whole.

A shadowed light shader (which is what this 100-instruction shader from Far Cry is) running on G70 or RSX will run with the same efficiency. Being in a closed box will not make RSX run this shader more efficiently.

Jawed

j^aws · Oct 7, 2005

Jawed said:
Like I said, no-one counts mini-ALUs. NVidia hasn't counted the mini-ALUs in NV40/G70/RSX.

This has NOTHING to do with NV. And your mini-ALUs aren't going to magically support your claim of Xenos TWICE R520.

Let's make it simpler with MAIN ALUs, ignoring the mini-ALUs for ALL,

D = component ops

R520

PS ~ (2*4D)*16 ~ 128D
VS ~ 5D*8 ~ 40D

R520 ~ 168D * 0.625 GHz ~ 105 billion component ops/sec

Xenos
US ~ 5D*48 ~ 240D

Xenos ~ 240D * 0.5 GHz ~ 120 billion component ops/ sec

G70

PS ~ (2*4D)*24 ~ 192D
VS ~ 5D*8 ~ 40D

G70 ~ 232D * 0.43 GHz ~ 100 billion component ops/sec

NONE are anywhere near TWICE R520.
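The peak figures above can be reproduced with a few lines. This sketch only re-does the post's own arithmetic (main ALUs only, every component op counted equally); the dictionary layout and function name are made up for illustration.

```python
# Per GPU: a list of (unit count, component ops per unit per clock)
# and the clock in GHz, all as posted above.
gpus = {
    "R520":  {"blocks": [(16, 2 * 4), (8, 5)], "ghz": 0.625},
    "Xenos": {"blocks": [(48, 5)], "ghz": 0.5},
    "G70":   {"blocks": [(24, 2 * 4), (8, 5)], "ghz": 0.43},
}


def component_ops(gpu: dict) -> float:
    """Peak component ops in billions per second."""
    per_clock = sum(units * comps for units, comps in gpu["blocks"])
    return per_clock * gpu["ghz"]


rates = {name: component_ops(g) for name, g in gpus.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} billion component ops/sec")
print(f"Xenos / R520 = {rates['Xenos'] / rates['R520']:.2f}")
```

On these peak numbers Xenos comes out about 14% ahead of R520. The figures say nothing about utilisation, which is the part the other posters are arguing about.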

Jawed said:
Nope, I've never seen it included. It might be because mini-ALUs have such limited applicability and are hardly ever used.

I'm talking about ATI/MS not including it in their PR numbers.

Jawed said:
I just wanted to show you how, according to your crazy pseudo-science a G70 which has twice the GFLOPs (not including mini-ALUs) of X1800XT (including mini-ALUs) is not twice as fast.

Err...those numbers are MIXED 16 bit and 32 bit flops which inflate the numbers. I've already stated before that those are inflated numbers and are NOT directly comparable. And it still doesn't support your claim of TWICE R520.

Jawed said:
When will you get over the fact peak GFLOPs are meaningless. You've been peddling this nonsense for 6 months now.

Excuse me but NO. It's not nonsense because the numbers represent a theoretical LIMIT. It shows the LIMITs of an architecture because it's a PEAK and always has been acknowledged as a PEAK. It's pulling efficiency numbers out of thin air like 50% 'this' and TWICE 'that' and 80% this etc... which is complete nonsense...

Jawed said:
Nope, the scalar part of the vec4+scalar (VS) or two vec3+scalar (PS) is not the mini-ALU. You really need to pay attention.

Err...NO. You really need to look at the figures again and see where they get the numbers from, this diagram shows where they get their numbers,

http://www.hardocp.com/images/articles/1119063771Y3O0GyEDBw_3_3_l.jpg

It CLEARLY shows the mini-ALUs as the scalar units.

And can we leave NV out of this now? This was about Xenos being TWICE R520 from your claim...

Jawed said:
FP16 normalise is not a function of the two mini-ALUs in G70/RSX pipeline. It's an entirely separate function.

Err...you're still missing the point. It's there to show you AGAIN where the INFLATED numbers come from because they're NOT ALL 32BIT. In fact those normalise figures contribute nearly twice that of the two mini-ALUs.

Sorry but no, they don't support your R520/Xenos claim being TWICE.

Jawed said:
Well if you insist on polluting discussions with irrelevant GFLOPs nonsense...

Burden of proof is still with YOU. And you've been WRONG before with your claims with the same stubbornness after pages of posts....

Jawed said:
The first evidence will come with R580...

Unification of the shader architecture is going to increase utilisation further.

Xenos will be texture-bandwidth limited to the same degree as R520/R580 as both architectures have the same texturing capability (although R520/580 may have 20-40% faster caches). So any texture-limited games will not show any improvement in Xenos.

But games that are not texture bandwidth limited (going forwards this should be the norm for next-gen games) will easily get 100% faster in Xenos over R520. The combination of unified shader efficiency and twice the total pipelines will see to that.

ATI claim over 95 % efficiency for shader processing for R520 in their PR for ultra threaded pixel shaders and that's where most of the ALUs are in R520. That's in the same ballpark as Xenos. So your claim of Xenos being TWICE is still BS. Both are on 90nm, with Xenos logic less than R520...

I'll leave with this simple comparison without your mini-ALU fixation to support your hyperbole...

R520 ~ 105 billion component ops/sec
Xenos ~ 120 billion component ops/ sec
G70 ~ 100 billion component ops/sec

Xenos is nowhere near TWICE R520. PERIOD.

If you still believe that Xenos has TWICE the "pipeline horsepower" of the R520, then you can live in your fantasy world...

London Geezer · Oct 7, 2005

Am i the only one that's majorly freaked out by Jaws and Jawed arguing like that? I mean half the time it looks like it's the same guy with split personalities arguing with himself. Can't keep up with those names.

Ok sorry, ignore me.

Jawed · Oct 7, 2005

If you look closely at that [H] picture you'll see that each shader unit is "4FP MADs/pixel dual/co-issue" which means vec3+scalar.

The mini-ALU is totally separate. It is not the scalar part of the 4FP MADs/pixel.

The FP normalise is also separate, a function of shader unit 1.

The summary clearly counts only 5 instructions per pixel (co-issue MAD in shader 1 + FP normalise + co-issue MAD in shader 2). That count does not include the mini-ALUs.

Jawed

Shifty Geezer · Oct 7, 2005

This has been mentioned before. Several times. In fact, with such frequency that it should have a mandatory ban of 1 week for anyone else making this point.

London Geezer · Oct 7, 2005

Shifty Geezer said:
This has been mentioned before. Several times. In fact, with such frequency that it should have a mandatory ban of 1 week for anyone else making this point

I'm unbannable didn't u know? I'm like genital herpes, once u get it, u can't get rid of it. Ever.

Acert93 · Oct 7, 2005

Mintmaster said:
For all these reasons and more, it's more than probable that Xenos really is 2x as fast as current GPU's in pixel shading and 5x faster in heavy vertex shading, even though it 'only has 232M transistors'. Just remember that this doesn't necessarily translate into 2x performance for all scenarios.

And that would be the key: designing software to the strength of the architecture (in this case shading) and avoiding bottlenecks (like texturing). Games designed with the texturing abilities of current GPUs in mind and leaning heavily on those would result in no real gain in Xenos. Yet after seeing hardware demos, like Toy Store, which are extremely shader heavy for the graphics it is hard to understand, at the surface, how a game with minimal shaders and effects can chug while a much more advanced (and better looking) engine like Toy Store can look so great. The obvious difference is a LOT of developers are still using techniques that are "happy" on traditional architectures or engines that are redundancy oriented with DX7/DX8 in mind. Seeing how Toy Store offloaded a substantial amount of work to the shaders, and comparing that result to the typically DX7/8 redundant engines on the market it is crystal clear to me that games are NOT currently taking advantage of the hardware on the market, let alone a more unique design like Xenos.

If the Xenos architecture is significantly faster in shading than ATI's previous chips we will need to compare shader limited situation because that is the area where Xenos has focused its performance goals (that and certain fast IQ features).

j^aws · Oct 7, 2005

Mintmaster said:
Jaws,

FLOP ratings are very misleading. Jawed is completely right.

Your opinion. Jawed is WRONG. He has the burden of proof for his claim that Xenos is twice R520. And I see none. And yes I'm fully aware that FLOPS can be misleading. Hence my repeated highlighting of 32BIT FLOPS.

Read my post above. When ATI released R300, They said it was 3-issue: vec3+scalar+texture. G70's 5-issue + free norm would lead you to think it was at least twice as fast, especially at math. This is hardly the case though. It is maybe 30% faster per clock than R300's shaders. Remember, this is comparing 2002 architecture to 2005! R520 was hardly any faster than R300 per clock, and it was labeled as 5-issue.

I appreciate your post. No one's comparing instructions/cycle here because that's even more misleading than FLOPS because they don't specify the TYPE of instructions and its equivalent WORK done. Same with Flops, which is why I specified '32bit FLOPS' and programmable. See my post on component ops/sec...

I fully expect each of Xenos' 48 shader pipelines to be in the same ballpark as RSX's 24 PS pipelines, maybe trailing by 20-25% at most.

24 RSX PS pipes are equivalent to ONE out of THREE SIMD engines in Xenos. i.e. 1/3 of Xenos?

Err...no. I disagree strongly. I've shown why in many other threads. RSX 24 PS units would be 'on par' with ALL of Xenos, peak. Now efficiency is another thing, especially in a 'closed' box. But this has been debated before, so feel free to do a search in the forum...

ihamoitc2005 · Oct 7, 2005

Jawed said:
It has never, to my knowledge, been explicitly included in the calculation of the capability of any GPU.

Are you saying R520 & G70 are not 2 pixel/PS or are you saying Xenos is 2 pixel/US?

Eh? Modern GPUs are designed to run lots of shader instructions per pixel. We don't want a GPU to be only capable of 1 instruction per pixel per clock. We want more. Hence lots of pipelines.

A lot of pipe-lines is good but 48 G70 type pixel-shaders through 8 rops is not good. And therefore this is not the case. It is 48 half-G70 type pixel-shaders = 24 G70 type pixel shaders. Same number of pixels/clock.

There may well be transparent texture SSAA in Xenos. Still waiting to find out. SSAA is principally work done by the shaders, not the ROPs.

Yes but if pixel-shaders increase ROP must increase too. Gap cannot be too large? Only 8 ROP evidence that Xenos US is not = G70 PS in pixel/cycle. Also what about texture units?

A shadowed light shader (which is what this 100-instruction shader from Far Cry is) running on G70 or RSX will run with the same efficiency. Being in a closed box will not make RSX run this shader more efficiently.

Sure, that one may not be effective but another might. Shader performance varies from API to API, developer to developer, shader to shader and GPU to GPU regardless of USA or not-USA. USA can help improve adaptability to variations given certain capability but cannot increase actual capability.

If one looks at rest of GPU architecture, it is clear that Xenos is not ALU doubled as G70 and R520 pixel-shaders are. Hence 1 pixel/US/clock for Xenos vs 2 pixels/PS/clock for G70 and R520.

As for ATI claim of efficiency, we only have to look at R520 performance to see how effective ultra-threading, texture unit architecture, and other Xenos type modifications are. I posted link to benchmark comparison in previous post.

j^aws · Oct 7, 2005

Jawed said:
If you look closely at that [H] picture you'll see that each shader unit is "4FP MADs/pixel dual/co-issue" which means vec3+scalar.

Or vec2+vec2

The mini-ALU is totally separate. It is not the scalar part of the 4FP MADs/pixel.

The FP normalise is also separate, a function of shader unit 1.

The summary clearly counts only 5 instructions per pixel (co-issue MAD in shader 1 + FP normalise + co-issue MAD in shader 2). That count does not include the mini-ALUs.

Jawed

http://www.hardspell.com/newsimage/2005-6-21-16-10-14-654986702.gif

Do the maths from the above table and compare to diagram below,

http://www.hardocp.com/images/articles/1119063771Y3O0GyEDBw_3_3_l.jpg

The scalar flops are from the mini-ALUs...

Shifty Geezer · Oct 7, 2005

ihamoitc2005 said:
A lot of pipe-lines is good but 48 G70 type pixel-shaders through 8 rops is not good. And therefore this is not the case. It is 48 half-G70 type pixel-shaders = 24 G70 type pixel shaders. Same number of pixels/clock.

I think the point is those shaders aren't outputting per instruction, but applying multiple instructions to achieve better shading. On simple shaders the ROPs wouldn't be enough, but on complex shaders they're plenty. And anyone using only simple shaders on next-gen games is going to end up with inferior looking titles.

On the flip side if you had enough ROPs to satisfy the simplest, fastest shaders outputting at full speed, most of the time they'd be sitting idle.
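The trade-off between shader length and ROP count can be sketched with a toy model. It assumes each of the 48 pipelines retires one instruction per clock and each of the 8 ROPs writes one pixel per clock, and it ignores texturing and bandwidth entirely; the function names are hypothetical.

```python
SHADER_PIPES = 48
ROPS = 8


def pixels_per_clock(instructions_per_pixel: int) -> float:
    """Pixels finished per clock: the lower of what the shaders can
    produce and what the ROPs can write out."""
    shader_rate = SHADER_PIPES / instructions_per_pixel
    return float(min(shader_rate, ROPS))


def rop_utilisation(instructions_per_pixel: int) -> float:
    """Fraction of ROP capacity actually used at this shader length."""
    return pixels_per_clock(instructions_per_pixel) / ROPS


for n in (1, 6, 12, 100):
    print(n, pixels_per_clock(n), f"{rop_utilisation(n):.0%}")
```

In this model the ROPs are the limit only for shaders shorter than 6 instructions per pixel. At 12 instructions they are half idle, and at 100 instructions they are almost entirely idle, so adding more ROPs would only help the shortest shaders.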
