Ati on Xenos

pipo · Jun 10, 2005

http://www.bit-tech.net/bits/2005/06/10/richard_huddy_ati/1.html

nAo · Jun 10, 2005

Best quote:

I’d love to say that Nvidia are going to be stuck when it comes to Longhorn. But actually I do think they will have a unified shader architecture by the time WGF2 comes around. This time around, they don’t have the architecture and we do, so they have to knock it and say it isn’t worthwhile

This reminds me of something.. 8)

carpediem · Jun 10, 2005

nAo said:
Best quote:

I’d love to say that Nvidia are going to be stuck when it comes to Longhorn. But actually I do think they will have a unified shader architecture by the time WGF2 comes around. This time around, they don’t have the architecture and we do, so they have to knock it and say it isn’t worthwhile

This reminds me of something.. 8)

Yup, but I assume that the step from SM2.0 -> 3.0 will be much, much smaller than from SM3.0 -> WGF2?

AlNom · Jun 10, 2005

Richard: “Microsoft weren’t focused on hardware backwards compatibility early on… that wasn’t in the specification. They believed that any compatibility they could get would come in through a software layer, and they didn’t want to compromise this generation’s hardware for the sake of last generation’s games.

“They have implemented compatibility purely through emulation (at the CPU level). It looks like emulation profiles for each game are going to be stored on the hard drive, and I imagine that a certain number will ship with the system. They already have the infrastructure to distribute more profiles via Live, and more and more can be made available online periodically.

“Emulating the CPU isn’t really a difficult task. They have three 3GHz cores, so emulating one 733MHz chip is pretty easy. The real bottlenecks in the emulation are GPU calls – calls made specifically by games to the Nvidia hardware in a certain way. General GPU instructions are easy to convert – an instruction to draw a triangle in a certain way will be pretty generic. However, it’s the odd cases, the proprietary routines, that will cause hassle.”

Nice interview, thanks for the linkage.

mckmas8808 · Jun 10, 2005

Yeah good find man. This is a good read.

Cobra101 · Jun 10, 2005

ATI rips on nVidia.

Shock.

pc999 · Jun 10, 2005

I really want DaveB's article, this thing looks sweet

Fafalada · Jun 10, 2005

Interview said:
The problem they have is that CPU power isn’t really what developers need – the bottleneck is really the graphics.

gurgi · Jun 10, 2005

This reads more like a PR piece than a look at the hardware.

Titanio · Jun 10, 2005

Indeed, it really was just a mouthpiece for ATi. But anyway..

Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others.

Is this not true of all chips? If you fit your workload to the architecture, you should get close to the max possible, regardless of architecture. I know Xenos can adapt to the workload rather than vice versa, but this seems to be an odd comment given that this is what an architecture like Xenos is supposed to negate (fitting your work to the architecture).

edit - dumb me, he makes the point in the next sentence

With a unified pipeline we can now devote 100% of the hardware to whichever task is the bottleneck.”

Can anyone confirm the granularity of processing division on Xenos? Can you arbitrarily divide the processing up on a per-ALU basis, or is it per "pipe" (3 of 16 ALUs)? This may have been clarified elsewhere, but I can't recall myself.

j^aws · Jun 10, 2005

Titanio said:
...
Can anyone confirm the granularity of processing division on Xenos? Can you arbitrarily divide the processing up on a per-ALU basis, or is it per "pipe" (3 of 16 ALUs)? This may have been clarified elsewhere, but I can't recall myself.

Well, like everyone else, I'm looking forward to Dave's article for clarification that isn't sprinkled with any PR sugar!

IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...
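The per-cycle figure above can be sanity-checked with a minimal sketch. It assumes a 500 MHz core clock, which is not stated in this thread, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the per-cycle figure quoted above.
# ASSUMPTION: a 500 MHz core clock; the thread itself does not give the clock.
CORE_CLOCK_HZ = 500_000_000
SAMPLES_PER_SECOND = 16_000_000_000  # "16 Giga fragment samples per SECOND"


def samples_per_cycle(samples_per_second: int, clock_hz: int) -> float:
    """Convert a per-second sample rate into samples per clock cycle."""
    return samples_per_second / clock_hz


print(samples_per_cycle(SAMPLES_PER_SECOND, CORE_CLOCK_HZ))  # 32.0
```

Under that assumed clock, 16 G samples per second works out to the "~ 32 per cycle" quoted in the post.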

Titanio · Jun 10, 2005

Jaws said:
IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...

There is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.

j^aws · Jun 10, 2005

Titanio said:
There is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.

There are 3 SIMD engines. Each SIMD engine has 16 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...
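To make the engine-level split described in this post concrete, here is a minimal sketch that enumerates the possible vertex/fragment divisions. It models only the description given here (3 SIMD engines of 16 ALUs, each engine assigned wholly to one workload). The function name and the `allow_all_one_type` flag are hypothetical, and nothing in it is confirmed hardware behaviour.

```python
from itertools import product

SIMD_ENGINES = 3       # from the post: "There are 3 SIMD engines"
ALUS_PER_ENGINE = 16   # from the post: "Each SIMD engine has 16 ALUs"


def engine_level_splits(allow_all_one_type: bool = False):
    """List (vertex_alus, fragment_alus) splits when whole engines are assigned.

    ASSUMPTION: each engine works on exactly one workload type at a time.
    """
    splits = set()
    for assignment in product(("vertex", "fragment"), repeat=SIMD_ENGINES):
        vertex_alus = assignment.count("vertex") * ALUS_PER_ENGINE
        fragment_alus = assignment.count("fragment") * ALUS_PER_ENGINE
        # Optionally exclude the cases where every engine does the same work.
        if not allow_all_one_type and 0 in (vertex_alus, fragment_alus):
            continue
        splits.add((vertex_alus, fragment_alus))
    return sorted(splits)


print(engine_level_splits())                         # [(16, 32), (32, 16)]
print(engine_level_splits(allow_all_one_type=True))  # [(0, 48), (16, 32), (32, 16), (48, 0)]
```

With whole-engine assignment the only mixed splits are 16/32 and 32/16, which is the 2+1 or 1+2 arrangement given in the example above.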

Titanio · Jun 10, 2005

Jaws said:
There are 3 SIMD engines. Each SIMD engine has 48 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...

I get ya now, so the split is made at the per-"pipe" or "SIMD engine" level.

That then does impose some restriction.. it's not as arbitrarily flexible as I first imagined. Eats a little at the 100% utilisation comments.

edit - can the whole chip be working on vertices or pixels, or does it make sense to keep one engine at least working on a different workload to the others (to keep data flowing steadily from vertex shading to pixel shading)?

Rockster · Jun 10, 2005

Jaws, I don't think it's as you describe.

Firstly, ROPs are independent of the ALUs. The ALU core outputs to the memexport block, which packs fragments and sends a max of 8 per clock to the eDRAM module; all additional fragments (32 per clock) are computed within the eDRAM module itself. Not sure how many fragments per clock can get from the ALUs to the memexport block or how large its buffers are. It can not only read and write to the eDRAM module but to system memory as well.

Secondly, I believe the GPU works on groups of 64 pixels or vertices at a time, queuing instructions in the schedulers and then assigning the work. 100% of ALU resources can certainly be devoted either to vertex or pixel work. It may even be a required behavior at some level. Perhaps Dave can clear this up soon.

j^aws · Jun 10, 2005

Titanio said:
Jaws said:

There are 3 SIMD engines. Each SIMD engine has 48 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...

I get ya now, so the split is made at the per-"pipe" or "SIMD engine" level.

That then does impose some restriction.. it's not as arbitrarily flexible as I first imagined. Eats a little at the 100% utilisation comments.

edit - can the whole chip be working on vertices or pixels, or does it make sense to keep one engine at least working on a different workload to the others (to keep data flowing steadily from vertex shading to pixel shading)?

Yeah...fyi...there's a typo in the above I corrected earlier. i.e. each SIMD engine has 16 ALU clusters.

Also, IIRC, Dave mentioned that he saw a more detailed schematic of Xenos that had each of the SIMD engines split in two, i.e. kinda like 6 SIMD engines, each with 8 ALUs.

AFAICS, Xenos can't have all 48 ALUs working on vertices or fragments because it wouldn't allow the auto-load balancing mechanism to work. However, you may be able to override this.

j^aws · Jun 10, 2005

Rockster said:
Jaws, I don't think it's as you describe.

Firstly, ROPs are independent of the ALUs. The ALU core outputs to the memexport block, which packs fragments and sends a max of 8 per clock to the eDRAM module; all additional fragments (32 per clock) are computed within the eDRAM module itself. Not sure how many fragments per clock can get from the ALUs to the memexport block or how large its buffers are. It can not only read and write to the eDRAM module but to system memory as well.

Yes, I'm aware the ROPs are independent of the ALUs. I was making the point that Xenos cannot work on more than 32 fragments per cycle, even though it has 48 ALUs. And then making the point that there are 8 ROPs and not 32 ROPs.

Rockster said:
Secondly, I believe the GPU works on groups of 64 pixels or vertices at a time, queuing instructions in the schedulers and then assigning the work. 100% of ALU resources can certainly be devoted either to vertex or pixel work. It may even be a required behavior at some level. Perhaps Dave can clear this up soon.

Yes, this is not clear to me either... but it sounds like what's in 'flight' as opposed to what's being 'executed' per cycle...

cthellis42 · Jun 10, 2005

Cobra101 said:
ATI rips on nVidia.

Shock.

And you know what? Later...! Later... Wait, wait, you'll never guess... nVidia rips on ATi! DAYUM, it's a hoot!

Megadrive1988 · Jun 10, 2005

I do not understand ROPs. For me it is a new term, even though it has to have been around for years. All I know is that somehow, 8 ROPs means Xenos cannot output more than 8 pixels per clock cycle. But on the upside, those 8 pixels per clock are all 4x FSAA'd without losing fillrate.
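As a rough illustration of how 8 pixels per clock with 4x FSAA relates to the per-second figures discussed earlier in the thread, here is a minimal sketch. It assumes a 500 MHz core clock (not stated in this thread) and treats each ROP as writing one pixel per clock, which is a simplification. The function name is made up for illustration.

```python
ROPS = 8                     # from the thread: "it only has 8 ROPs"
FSAA_SAMPLES = 4             # from the post: "4x FSAA'd"
CORE_CLOCK_HZ = 500_000_000  # ASSUMPTION: 500 MHz, not given in the thread


def fill_rates(rops: int, samples_per_pixel: int, clock_hz: int) -> dict:
    """Peak pixel and sample rates if each ROP emits one pixel per clock."""
    pixels_per_clock = rops
    samples_per_clock = rops * samples_per_pixel
    return {
        "pixels_per_clock": pixels_per_clock,
        "samples_per_clock": samples_per_clock,
        "pixels_per_second": pixels_per_clock * clock_hz,
        "samples_per_second": samples_per_clock * clock_hz,
    }


rates = fill_rates(ROPS, FSAA_SAMPLES, CORE_CLOCK_HZ)
print(rates["samples_per_clock"])   # 32
print(rates["pixels_per_second"])   # 4000000000
print(rates["samples_per_second"])  # 16000000000
```

Under those assumptions, 8 pixels per clock at 4 samples each gives 32 samples per clock, which lines up with the 16 G samples per second figure quoted earlier.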

JasonLD · Jun 10, 2005

Sounds like a PR piece saying why Xbox 360 is superior to PS3... looks like they are trying hard to change the perception that PS3 is the superior hardware..
What is funny is that it was Microsoft who said that it is the games that matter, but they seem to be more focused on spec wars than Sony.. since they have released several "unofficial" articles on why Xbox 360 is the superior platform.
