nAo said:Best quote:
“I’d love to say that Nvidia are going to be stuck when it comes to Longhorn. But actually I do think they will have a unified shader architecture by the time WGF2 comes around. This time around, they don’t have the architecture and we do, so they have to knock it and say it isn’t worthwhile.”
This reminds me of something... 8)
Richard: “Microsoft weren’t focused on hardware backwards compatibility early on… that wasn’t in the specification. They believed that any compatibility they could get would come in through a software layer, and they didn’t want to compromise this generation’s hardware for the sake of last generation’s games.
“They have implemented compatibility purely through emulation (at the CPU level). It looks like emulation profiles for each game are going to be stored on the hard drive, and I imagine that a certain number will ship with the system. They already have the infrastructure to distribute more profiles via Live, and more and more can be made available online periodically.
“Emulating the CPU isn’t really a difficult task. They have three 3GHz cores, so emulating one 733MHz chip is pretty easy. The real bottlenecks in the emulation are GPU calls – calls made specifically by games to the Nvidia hardware in a certain way. General GPU instructions are easy to convert – an instruction to draw a triangle in a certain way will be pretty generic. However, it’s the odd cases, the proprietary routines, that will cause hassle.”
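The profile mechanism Richard describes can be sketched roughly as follows. This is a hypothetical illustration, not Microsoft's actual system: the profile fields, the local store, and the `fetch_from_live` callback are all my assumptions.

```python
# Hypothetical per-game emulation profiles, as the interview describes:
# some ship on the hard drive, more can be fetched via Live and cached.
# All names and fields here are illustrative assumptions.

SHIPPED_PROFILES = {
    "GAME_A": {"gpu_quirks": ["proprietary_draw_path"], "cpu_patches": 3},
    "GAME_B": {"gpu_quirks": [], "cpu_patches": 1},
}

def load_profile(title_id, local_store, fetch_from_live):
    """Prefer a profile already on the hard drive; otherwise try Live."""
    if title_id in local_store:
        return local_store[title_id]
    profile = fetch_from_live(title_id)       # may return None if offline
    if profile is not None:
        local_store[title_id] = profile       # cache for the next boot
    return profile
```

The point of the structure is the one the interview makes: the console works with whatever profiles it shipped with, and the Live infrastructure only adds to the local set over time.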
Interview said:The problem they have is that CPU power isn’t really what developers need – the bottleneck is really the graphics.
Providing developers throw instructions at our architecture in the right way, Xenos can run at 100% efficiency all the time, rather than having some pipeline instructions waiting for others.
With a unified pipeline we can now devote 100% of the hardware to whichever task is the bottleneck.”
Titanio said:...
Can anyone confirm the granularity of processing division on Xenos? Can you arbitrarily divide the processing up on a per-ALU basis, or is it per "pipe" (3 of 16 ALUs)? This may have been clarified elsewhere, but I can't recall myself.
Jaws said:IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...
Titanio said:Jaws said:IIRC, Xenos handles 16 Giga fragment samples per SECOND, ~ 32 fragment samples per CYCLE. So I can't see the 48 ALU clusters (48 Vec4 + 48 Scalar) ALL working on fragments or vertices per cycle. Unless I've missed something, 32 ALUs, peak, would work on fragments and 16 ALUs on vertices and vice versa...
There is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.
Jaws said:Titanio said:There is (or may be) an upper limit on how many ALUs can work on vertices or pixels? No more than 16 at any one time on vertices, no more than 32 on pixels? :? I'm all confused.
There are 3 SIMD engines. Each SIMD engine has 16 ALUs (48 in total). Each ALU is a vec4 + scalar unit.
So each SIMD engine can work on either vertices or fragments.
E.g. 2 SIMD engines on fragments and 1 SIMD engine on vertices OR, 1 SIMD engine on fragments and 2 SIMD engines on vertices.
The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle. It's also why Xenos has been referred to as a 32 pipeline 'equivalent' GPU, because it works on 32 fragments per cycle, peak. However it only has 8 ROPs.
EDIT: typos...
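The per-cycle load balancing Jaws describes can be sketched as below. The greedy "most work per engine" policy and the queue-depth inputs are my assumptions for illustration; the real arbiter's heuristics aren't public.

```python
# Sketch of arbitrating 3 SIMD engines between vertex and fragment work
# each cycle, per the post above. The greedy policy is an assumption,
# not the real hardware arbiter.

def assign_engines(vertex_work, fragment_work, engines=3):
    """Split the engines so each side's work-per-engine is balanced."""
    v_engines = f_engines = 0
    for _ in range(engines):
        # Give the next engine to the side with more outstanding work
        # per engine it already holds.
        v_load = vertex_work / (v_engines + 1)
        f_load = fragment_work / (f_engines + 1)
        if v_load > f_load:
            v_engines += 1
        else:
            f_engines += 1
    return v_engines, f_engines
```

A heavily fragment-bound frame gets all three engines: `assign_engines(10, 90)` returns `(0, 3)`, and the split can collapse entirely to either side.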
Titanio said:Jaws said:There are 3 SIMD engines... So each SIMD engine can work on either vertices or fragments. The 3 SIMD engines then auto-load balance between fragments and vertices on any given clock cycle.
I get ya now, so the split is made at the per-"pipe" or "SIMD engine" level.
That then does impose some restriction... it's not as arbitrarily flexible as I first imagined. It eats a little into the 100% utilisation comments.
edit - can the whole chip be working on vertices or pixels, or does it make sense to keep one engine at least working on a different workload to the others (to keep data flowing steadily from vertex shading to pixel shading)?
Rockster said:Jaws, I don't think it's as you describe.
Firstly, the ROPs are independent of the ALUs. The ALU core outputs to the memexport block, which packs fragments and sends a maximum of 8 per clock to the eDRAM module; all additional fragment operations (32 per clock) are computed within the eDRAM module itself. I'm not sure how many fragments per clock can get from the ALUs to the memexport block, or how large its buffers are. It can read and write not only to the eDRAM module but to system memory as well.
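The 8-fragments-per-clock link Rockster describes implies a simple throughput relation; a minimal sketch, assuming a full burst from the shader core and ignoring the (unknown) buffer depths:

```python
# Sketch of the memexport -> eDRAM link described above: the shader core
# can produce up to 32 fragments per clock, but at most 8 packed
# fragments cross to the eDRAM module each clock; the rest wait in the
# memexport buffers, whose size isn't known.

def clocks_to_drain(fragments, link_width=8):
    """Clocks needed to move a burst of fragments over the 8/clock link."""
    return -(-fragments // link_width)   # ceiling division
```

So a peak 32-fragment burst needs 4 clocks to cross, which is consistent with the post's point that the extra per-sample work has to happen inside the eDRAM module itself.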
Rockster said:Secondly, I believe the GPU works on groups of 64 pixels or vertices at a time, queuing instructions in the schedulers and then assigning the work. 100% of ALU resources can certainly be devoted either to vertex or pixel work. It may even be a required behavior at some level. Perhaps Dave can clear this up soon.
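Rockster's 64-element grouping could be sketched as below; the batch record layout and the `kind` tag are assumptions for illustration only.

```python
# Sketch of the 64-wide batching described above: pixels or vertices are
# grouped into fixed-size batches, each tagged with its work type, before
# the scheduler hands a batch to a SIMD engine.

def make_batches(items, kind, batch_size=64):
    """Group a stream of pixels or vertices into fixed-size batches."""
    return [
        {"kind": kind, "items": items[i:i + batch_size]}
        for i in range(0, len(items), batch_size)
    ]
```

Since each batch carries a single work type, devoting 100% of ALU resources to vertex or pixel work is just a matter of which batches the scheduler issues, matching Rockster's point.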
And you know what? Later...! Later... Wait, wait, you'll never guess... nVidia rips on ATi! DAYUM, it's a hoot!
Cobra101 said:ATI rips on nVidia.
Shock.