Ati on Xenos

Best quote:
I’d love to say that Nvidia are going to be stuck when it comes to Longhorn. But actually I do think they will have a unified shader architecture by the time WGF2 comes around. This time around, they don’t have the architecture and we do, so they have to knock it and say it isn’t worthwhile
This reminds me of something.. 8)
 
nAo said:
Best quote:
I’d love to say that Nvidia are going to be stuck when it comes to Longhorn. But actually I do think they will have a unified shader architecture by the time WGF2 comes around. This time around, they don’t have the architecture and we do, so they have to knock it and say it isn’t worthwhile
This reminds me of something.. 8)

Yup, but I assume the step from SM2.0 -> 3.0 will be much, much smaller than the one from SM3.0 -> WGF2?
 
Richard: “Microsoft weren’t focused on hardware backwards compatibility early on… that wasn’t in the specification. They believed that any compatibility they could get would come in through a software layer, and they didn’t want to compromise this generation’s hardware for the sake of last generation’s games.

“They have implemented compatibility purely through emulation (at the CPU level). It looks like emulation profiles for each game are going to be stored on the hard drive, and I imagine that a certain number will ship with the system. They already have the infrastructure to distribute more profiles via Live, and more and more can be made available online periodically.

“Emulating the CPU isn’t really a difficult task. They have three 3GHz cores, so emulating one 733MHz chip is pretty easy. The real bottlenecks in the emulation are GPU calls – calls made specifically by games to the Nvidia hardware in a certain way. General GPU instructions are easy to convert – an instruction to draw a triangle in a certain way will be pretty generic. However, it’s the odd cases, the proprietary routines, that will cause hassle.”


Nice interview, thanks for the linkage. :)
 
Indeed, it really was just a mouthpiece for ATi. But anyway..

Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others.

Is this not true of all chips? If you fit your workload to the architecture, you should get close to the max possible, regardless of architecture. I know Xenos can adapt to the workload rather than vice versa, but this seems to be an odd comment given that this is what an architecture like Xenos is supposed to negate (fitting your work to the architecture).

edit - dumb me, he makes the point in the next sentence :oops:

With a unified pipeline we can now devote 100% of the hardware to whichever task is the bottleneck.”

Can anyone confirm the granularity of processing division on Xenos? Can you arbitrarily divide the processing up on a per-ALU basis, or is it per "pipe" (3 of 16 ALUs)? This may have been clarified elsewhere, but I can't recall myself.
 
Titanio said:
...
Can anyone confirm the granularity of processing division on Xenos? Can you arbitrarily divide the processing up on a per-ALU basis, or is it per "pipe" (3 of 16 ALUs)? This may have been clarified elsewhere, but I can't recall myself.

Well, like everyone else, I'm looking forward to Dave's article for clarification that isn't sprinkled with any PR sugar! :p

IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...
 
Jaws said:
IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...

So there is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.
 
Titanio said:
Jaws said:
IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...

So there is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.

There are 3 SIMD engines. Each SIMD engine has 16 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...
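
To make the split described above concrete, here is a minimal Python sketch. It assumes work can only be assigned per whole SIMD engine, as this post describes; the function name and the `require_both` flag are hypothetical, and whether every engine can be put on a single workload is not something this post settles.

```python
ENGINES = 3            # SIMD engines, per the post above
ALUS_PER_ENGINE = 16   # vec4 + scalar ALUs per engine, per the post above


def engine_splits(engines=ENGINES, alus=ALUS_PER_ENGINE, require_both=True):
    """List (fragment_alus, vertex_alus) for every whole-engine split.

    require_both=True models the assumption that at least one engine
    stays on each workload; set it to False to allow all-or-nothing splits.
    """
    splits = []
    for frag_engines in range(engines + 1):
        vert_engines = engines - frag_engines
        if require_both and 0 in (frag_engines, vert_engines):
            continue
        splits.append((frag_engines * alus, vert_engines * alus))
    return splits


for frag, vert in engine_splits():
    print(f"{frag} ALUs on fragments, {vert} ALUs on vertices")
```

Under that assumption the only mixes are 16/32 and 32/16, which is where the "32 fragments per cycle, peak" figure would come from.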
 
Jaws said:
Titanio said:
Jaws said:
IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...

So there is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.

There are 3 SIMD engines. Each SIMD engine has 48 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...

I get ya now, so the split is made at the per "pipe" or "SIMD engine" level.

That then does impose some restriction..it's not as arbitrarily flexible as I first imagined. Eats a little at the 100% utilisation comments ;)

edit - can the whole chip be working on vertices or pixels, or does it make sense to keep one engine at least working on a different workload to the others (to keep data flowing steadily from vertex shading to pixel shading)?
 
Jaws, I don't think it's as you describe.

Firstly, the ROPs are independent of the ALUs. The ALU core outputs to the memexport block, which packs fragments and sends a max of 8 per clock to the eDRAM module; all additional fragments (32 per clock) are computed within the eDRAM module itself. I'm not sure how many fragments per clock can get from the ALUs to the memexport block, or how large its buffers are. It can read and write not only to the eDRAM module but to system memory as well.

Secondly, I believe the GPU works on groups of 64 pixels or vertices at a time, queuing instructions in the schedulers and then assigning the work. 100% of ALU resources can certainly be devoted either to vertex or pixel work. It may even be a required behavior at some level. Perhaps Dave can clear this up soon.
 
Titanio said:
Jaws said:
Titanio said:
Jaws said:
IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...

So there is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.

There are 3 SIMD engines. Each SIMD engine has 48 ALUs. Each ALU is a vec4 + scalar unit.

So each SIMD engine can work on either vertices or fragments.

E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.

The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.

EDIT: typos...

I get ya now, so the split is made at the per "pipe" or "SIMD engine" level.

That then does impose some restriction..it's not as arbitrarily flexible as I first imagined. Eats a little at the 100% utilisation comments ;)

edit - can the whole chip be working on vertices or pixels, or does it make sense to keep one engine at least working on a different workload to the others (to keep data flowing steadily from vertex shading to pixel shading)?

Yeah...fyi...there's a typo in the above I corrected earlier. i.e. each SIMD engine has 16 ALU clusters.

Also, IIRC, Dave mentioned that he saw a more detailed schematic of Xenos that had each of the SIMD engines split in two, i.e. kinda like 6 SIMD engines, each with 8 ALUs.

AFAICS, Xenos can't have all 48 ALUs working on vertices or fragments because it wouldn't allow the auto-load balancing mechanism to work. However, you may be able to override this.
 
Rockster said:
Jaws, I don't think it's as you describe.

Firstly, the ROPs are independent of the ALUs. The ALU core outputs to the memexport block, which packs fragments and sends a max of 8 per clock to the eDRAM module; all additional fragments (32 per clock) are computed within the eDRAM module itself. I'm not sure how many fragments per clock can get from the ALUs to the memexport block, or how large its buffers are. It can read and write not only to the eDRAM module but to system memory as well.

Yes, I'm aware the ROPs are independent of the ALUs. I was making the point that Xenos cannot work on more than 32 fragments per cycle, even though it has 48 ALUs, and then the point that there are 8 ROPs and not 32 ROPs.

Rockster said:
Secondly, I believe the GPU works on groups of 64 pixels or vertices at a time, queuing instructions in the schedulers and then assigning the work. 100% of ALU resources can certainly be devoted either to vertex or pixel work. It may even be a required behavior at some level. Perhaps Dave can clear this up soon.

Yes, this isn't clear to me either...but it sounds like what's in 'flight' as opposed to what's being 'executed' per cycle...
 
I do not understand ROPs. For me it is a new term, even though it must have been around for years. All I know is that, somehow, 8 ROPs means Xenos cannot output more than 8 pixels per clock cycle. But on the upside, those 8 pixels per clock are all 4x FSAA'd without losing fillrate.
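
A rough back-of-envelope in Python shows how the numbers quoted in this thread fit together. It only combines figures posted above (8 ROPs, 4x FSAA, 16 Giga samples per second); the clock it prints is implied by those figures, not a confirmed spec, and the variable names are just for illustration.

```python
ROPS = 8                    # pixels written per clock, per this thread
SAMPLES_PER_PIXEL = 4       # 4x FSAA, per this thread
SAMPLES_PER_SECOND = 16e9   # "16 Giga fragment samples per second", quoted earlier

# Samples written per clock if every pixel carries 4 FSAA samples.
samples_per_clock = ROPS * SAMPLES_PER_PIXEL

# Clock speed implied by the two thread figures (an inference, not a spec).
implied_clock_mhz = SAMPLES_PER_SECOND / samples_per_clock / 1e6

print(samples_per_clock)    # 32
print(implied_clock_mhz)    # 500.0
```

On those assumptions, the "32 per cycle" figure lines up with 8 pixels times 4 samples at the ROP end, rather than with 32 ALUs doing shading work.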
 
Sounds like a PR piece saying why Xbox360 is superior to PS3... looks like they are trying hard to change the perception that PS3 is the superior hardware..
What is funny is that it was Microsoft who said that it is the games that matter, but they seem to be more focused on spec wars than Sony, since they have released several "unofficial" articles on why Xbox360 is the superior platform.
 