ATI - PS3 is Unrefined

pegisys said:
And I think in an ATI interview they said Microsoft had a big hand in designing the GPU (it might have been the article that started this thread).

Well of course they had a big hand, they put up all the money and the product was for them.

But as far as the actual engineering and architecting of microprocessors, MS has little to nothing to contribute. The best engineers in these fields do work for hardware companies not software companies.
 
tema said:
Xenos is an unrefined lower-clocked R580 with edram. :D

No, the eDRAM does not only have post-rendering gfx functionality, but it is also an embedded frame buffer for tile-based rendering. R580 will not do that (I think). TBR is very BW efficient.

Dave's article
They found a way to use this Z only pass to assist with tiling the screen to optimise the eDRAM utilisation. During the Z only rendering pass the max extents within the screen space of each object is calculated and saved in order to alleviate the necessity for calculation of the geometry multiple times. Each command is tagged with a header of which screen tile(s) it will affect. After the Z only rendering pass the Hierarchical Z Buffer is fully populated for the entire screen which results in the render order not being an issue. When rendering a particular tile the command fetching processor looks at the header that was applied in the Z only rendering pass to see whether its resultant data will fall into the tile it is currently processing and if so it will queue it, if not it will discard it until the next tile is ready to render. This process is repeated for each tile that requires rendering. Once the first tile has been fully rendered the tile can be resolved (FSAA down-sample) and that tile of the back-buffer data can be written to system RAM; the next tile can begin rendering whilst the first is still being resolved. In essence this process has similarities with tile based deferred rendering, except that it is not deferring for a frame and that the "tile" it is operating on is orders of magnitude larger than most other tilers have utilised before.
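For anyone who finds the tagging scheme easier to follow as code, here is a minimal sketch of the idea described above: a Z-only pass records which tiles each command's screen-space extents touch, and the per-tile pass then queues only the commands tagged for that tile. It is an illustration only, not how the hardware or its driver is implemented. The names `Command`, `z_only_pass`, `commands_for_tile`, `render_frame` and the tile size are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tile size in pixels, chosen for illustration; not a Xenos figure.
TILE_W, TILE_H = 640, 240


@dataclass
class Command:
    name: str
    extents: tuple  # (x_min, y_min, x_max, y_max), inclusive screen-space bounds
    tiles: set = field(default_factory=set)  # the "header" filled in by the Z-only pass


def z_only_pass(commands):
    """Tag each command with every tile its screen-space extents overlap."""
    for cmd in commands:
        x0, y0, x1, y1 = cmd.extents
        for ty in range(y0 // TILE_H, y1 // TILE_H + 1):
            for tx in range(x0 // TILE_W, x1 // TILE_W + 1):
                cmd.tiles.add((tx, ty))


def commands_for_tile(commands, tile):
    """Queue only the commands whose header names this tile; skip the rest."""
    return [cmd.name for cmd in commands if tile in cmd.tiles]


def render_frame(commands, screen_w, screen_h):
    """Run the tagging pass once, then build the command queue for each tile."""
    z_only_pass(commands)
    tiles_x = -(-screen_w // TILE_W)  # ceiling division
    tiles_y = -(-screen_h // TILE_H)
    frame = {}
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            # Real hardware would render, resolve (FSAA down-sample) and write
            # the tile to system RAM here; this sketch only records the queue.
            frame[(tx, ty)] = commands_for_tile(commands, (tx, ty))
    return frame
```

With a full-screen object and a small one in the top-left corner, only the first tile's queue contains both; every other tile skips the small object without recomputing its geometry.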
 
inefficient said:
Well of course they had a big hand, they put up all the money and the product was for them.

But as far as the actual engineering and architecting of microprocessors, MS has little to nothing to contribute. The best engineers in these fields do work for hardware companies not software companies.

Bzzzt. Wrong.

I'll see if I can find you a quote.

Edit - There you go:

Microsoft chip veterans Larry Yang, Jeff Andrews and Nick Baker -- former 3DO hardware engineers who joined Microsoft through the WebTV acquisition -- told IBM exactly what they needed the chip to do. In fact, that's why IBM's top engineer on the project is considered the "chief engineer," while Microsoft's Jeff Andrews is considered the chief architect.

"At engineering meetings, it was impossible to tell who were the Microsoft engineers and who were the IBM engineers," Comfort said.

http://blogs.mercurynews.com/aei/2005/10/ibm_describes_x.html

More backgrounds here: http://blogs.mercurynews.com/aei/2005/08/a_walk_through_.html
 
OK, asked ATI over XMas if the Scalar processor on each of Xenos's ALUs was a duplicate of each of the components of the vector processor, hence whether the FLOPS rating was 216 or 240, and it appears not - in fact, it seems to be more of a special function processor:

ATI Engineering said:
You could claim either are correct depending on what you want to call a FLOP.

It is not as clean (i.e. the scalar engine cannot do a MUL-ADD) which would give us a straightforward 240 GFlops.

It really comes down to how you want to term operations like LOG(x), 1/X, 1/SQRT(X), SIN(X), COS(X). If we want to say that each of these is only a single FLOP, then 216 GFlops is the “correct” number, however, these operations are NOT single flops on any standard CPU (they take 3-6x as long on an Intel CPU as standard MUL or ADD) and are actually comprised in our implementation of multiple floating point muls and adds to achieve them, so you could claim as many as perhaps 6 flops for these operations (although some are not pure IEEE floating point operations).

So, 216 is the absolute lowest number and really does not do the scalar engine justice, but in a world where people only want to talk about muls and adds, is simple.

On the other end, a number as high as 6 flops for the scalar engine would give you a total of 336 GFlops.
 
On the other end, a number as high as 6 flops for the scalar engine would give you a total of 336 GFlops.
And Nvidia used the same 'trick' to inflate their flops figures IIRC
 
Dave Baumann said:
OK, asked ATI over XMas if the Scalar processor was a duplicate of each of the components of the vector processor, hence whether the FLOPS rating was 216 or 240, and it appears not - in fact, it seems to be more of a special function processor:

Thanks Dave for getting a definitive answer. I recall having a discussion with Jawed and implying the same thing in another thread, i.e. that the scalar unit will also act like a mini-ALU ( aka SFU, not the secondary ALU in G70 etc.).

For consistency in calculations these SFU ops were omitted for G70, R520 etc. So applying the same to Xenos, I'd get

8 flops/ALU x 48 ALUs x 0.5 GHz ~ 192 GFlops

EDIT:

It is still strange that the scalar unit cannot do a MADD...
 
The Scalar ALU will act as a co-issue ALU for the functions that it supports, which includes ADD / MUL (much the same as NVIDIA's Vector ALU will co-issue two instructions when it can).
 
Dave Baumann said:
The Scalar ALU will act as a co-issue ALU for the functions that it supports, which includes ADD / MUL (much the same as NVIDIA's Vector ALU will co-issue two instructions when it can).

Ah okay, I read that as it only being capable of SFU ops. Indeed then it is 216 GF (9 flops/ALU) if counted consistently. Thanks for clearing all this up...
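The figures quoted in this thread (192, 216, 240 and 336 GFlops) all come from the same product and differ only in how many flops the scalar unit is credited with per cycle. A quick sketch of that arithmetic, assuming 48 ALUs at 0.5 GHz and 8 flops per cycle for a vec4 MADD as used above; the function name `peak_gflops` is just for illustration:

```python
ALUS = 48          # ALU count used in the thread
CLOCK_GHZ = 0.5    # 500 MHz
VECTOR_FLOPS = 8   # vec4 MADD: 4 lanes x (mul + add)


def peak_gflops(scalar_flops):
    """Peak GFlops when the scalar unit is credited with `scalar_flops` per cycle."""
    return (VECTOR_FLOPS + scalar_flops) * ALUS * CLOCK_GHZ


for label, scalar_flops in [
    ("scalar unit ignored", 0),
    ("scalar op counted as 1 flop", 1),
    ("scalar unit as a MADD (which it cannot do)", 2),
    ("special function counted as 6 flops", 6),
]:
    print(f"{label}: {peak_gflops(scalar_flops):.0f} GFlops")
```

Which line is "right" depends entirely on the counting convention, which is the point ATI Engineering makes in the quote.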
 
Lysander said:
No, the eDRAM does not only have post-rendering gfx functionality
What post-rendering functionality would that be?

The eDRAM has no rendering functionality at all, it's just a big, wide DRAM array (4k bits aggregate bus width?), the rendering functionality is separate logic, and as that logic is what does the actual rendering, I don't see how it could be post-rendering functionality.

but it is also an embedded frame buffer for tile-based rendering. R580 will not do that (I think). TBR is very BW efficient.
What R580 will do (or not) hasn't been announced yet. And Xenos isn't any more bandwidth efficient than any other current GPU: it can't do hidden surface removal through deferred rendering. Instead it (optionally) uses a Z-only rendering pass, which can be done (and IS done in some current PC game titles) on any 3D accelerator.
 
Guden Oden said:
What post-rendering functionality would that be? The eDRAM has no rendering functionality at all
Yeah, I was talking about the core logic on the daughter die. "Post-rendering" in the sense of applying HDR and AA after the tile has already been rendered on the shaders.

Doesn't Xenos do the Z-depth and overdraw calculation before tile rendering on the shaders? Why wouldn't it use that for hidden surface removal?
 
ATI Engineering said:
these operations are NOT single flops on any standard CPU (they take 3-6x as long on an Intel CPU as standard MUL or ADD)
nAo said:
And Nvidia used the same 'trick' to inflate their flops figures IIRC
Indeed. Sony should take note from ATI and NVIDIA and revise the flop rating of a certain CPU to 11.4 GFlops.
 
Dave,

Can you clarify when Xenos is pixel shading, if the ALUs can co-issue/issue like below?

1) vec3 (madds) + scalar (non-madds) ~ 7 flops/ALU

2) vec2 (madds) + vec2 (madds) ~ 8 flops/ALU

3) vec4 (madds) ~ 8 flops/ALU

I would think 1 is fine but not sure about 2 and 3?
 
1

Jaws said:
Dave,

Can you clarify when Xenos is pixel shading, if the ALUs can co-issue/issue like below?

1) vec3 (madds) + scalar (non-madds) ~ 7 flops/ALU

2) vec2 (madds) + vec2 (madds) ~ 8 flops/ALU

3) vec4 (madds) ~ 8 flops/ALU

I would think 1 is fine but not sure about 2 and 3?

I feel 1 is the correct choice in this multiple-choice question, but I wonder what the techno-masters will say. Maybe it is different for different passes.
 
Co-issue means two instructions can be issued in parallel in the same cycle. Xenos's structure is different from current pixel shaders in that while most PC pixel shader pipelines are only Vector (that can optionally "co-issue" some non-vector combinations) Xenos can co-issue a full vector with a scalar instruction, i.e. its combinations will be Vec4 + Scalar, Vec3 + Scalar, Vec2 + Scalar, Scalar + Scalar (with the vector ALU being on the left of the +) irrespective of pixel or vertex operations. AFAIK full vector instructions are by far the most frequent, with Vec3 and scalar after; Vec2 isn't very frequent at all.
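To put numbers on those combinations, here is a small sketch that lists each vector + scalar pairing with the components it occupies and a flop count per cycle. The counting rule is an assumption taken from the earlier posts, not an ATI specification: each vector lane is counted as a MADD (2 flops) and the co-issued scalar op as 1 flop. The names `COMBOS`, `components` and `flops_per_cycle` are illustrative.

```python
# (vector width, scalar width) for each co-issue combination named above.
COMBOS = [(4, 1), (3, 1), (2, 1), (1, 1)]


def components(vec_width, scalar_width=1):
    """Total components processed per cycle by the vector + scalar pair."""
    return vec_width + scalar_width


def flops_per_cycle(vec_width):
    """Assumed count: MADD (2 flops) per vector lane, 1 flop for the scalar op."""
    return 2 * vec_width + 1


for vec_width, scalar_width in COMBOS:
    left = f"Vec{vec_width}" if vec_width > 1 else "Scalar"
    print(f"{left} + Scalar: {components(vec_width, scalar_width)} components, "
          f"{flops_per_cycle(vec_width)} flops/cycle (assumed counting)")
```

Under that assumption Vec4 + Scalar gives the 9 flops/ALU behind the 216 GFlops figure, and Vec3 + Scalar gives the 7 flops/ALU from option 1 in the question above.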
 
Techno-Master

Dave Baumann said:
Co-issue means two instructions can be issued in parallel in the same cycle. Xenos's structure is different from current pixel shaders in that while most PC pixel shader pipelines are only Vector (that can optionally "co-issue" some non-vector combinations) Xenos can co-issue a full vector with a scalar instruction, i.e. its combinations will be Vec4 + Scalar, Vec3 + Scalar, Vec2 + Scalar, Scalar + Scalar (with the vector ALU being on the left of the +) irrespective of pixel or vertex operations. AFAIK full vector instructions are by far the most frequent, with Vec3 and scalar after; Vec2 isn't very frequent at all.

Thank you for your explanation. With the extra scalar unit available alongside a full vector, maybe some new possibilities open up that could change the distribution of instruction combinations?
 
Such an organisation is not a function of capability but performance. It's up to the instruction scheduler / optimiser to effectively schedule the instructions to best make use of the available ALU resources; probably about all it will mean is more scalar instructions can be performed without wasting component processing resources on a Vector ALU.
 
From the Microcode PowerPoint:


–Can co-issue 1 vector4 and 1 scalar op per cycle
mul r0,r1,r2 // vector operation
+ rsq r3.x,r4.x // scalar operation


–Special "_prev" scalar operations use results of previous scalar operations:
rsq r3._,r0.x // scalar result is retained
mul r0,r1,r2
+ adds_prev r4.x,r5.x // Adds result of rsq to r5.x

To be quite honest I don't understand why the "_prev" modifier is so noteworthy.

Jawed
 
Dave Baumann said:
Co-issue means two instructions can be issued in parallel in the same cycle. Xenos's structure is different from current pixel shaders in that while most PC pixel shader pipelines are only Vector (that can optionally "co-issue" some non-vector combinations) Xenos can co-issue a full vector with a scalar instruction, i.e. its combinations will be Vec4 + Scalar, Vec3 + Scalar, Vec2 + Scalar, Scalar + Scalar (with the vector ALU being on the left of the +) irrespective of pixel or vertex operations. AFAIK full vector instructions are by far the most frequent, with Vec3 and scalar after; Vec2 isn't very frequent at all.

Alright, so basically I could've asked my question more simply, i.e. that pixel ops can be a vector/scalar instruction mix up to 5D, the same as vertex ops, right?
 