Can this SONY patent be the PixelEngine in the PS3's GPU?

Reading Fig. 6 in the patent, there are 256 SALPs (serial operation pipelines), and each SALP consists of 32 SALCs (serial arithmetic-logic circuits)... can this be thought of as 256 pipelines with 32 stages each? A total of 8192 SALCs! Whether that's per PixelEngine, per VS (*4), or for the entire GPU, the patent isn't clear. These SALCs do seem to be tiny though, usually operating on 1-3 bits...

This large number of SALCs reminds me of the thread units we talked about over a year ago. Basically, thousands of these things. :oops:
 
DeanoC said:
Is it just me or is this just describing a pipeline system?

How I read it...

Each set of 32 serial ALUs combines to provide 1 operation per cycle on a 4-channel high-precision ALU (FP24 would need 8 3-bit ALUs per channel). Less precise data gets more operations per cycle.

Each one of the 256 units then operates on a single fragment as it passes through the programmed rasterisation steps.

The actual number of fragments issued per cycle depends on the rasterisation complexity. Given that the scissor test would take 8 cycles and depth buffering a minimum of 4 cycles, you get an idea of how amazingly complex rasterisation is. If we are generous and reckon 50 cycles for a fragment to go from the end of the fragment shader to the framebuffer (this is far too low; think about stencil, colour, and depth operations), we would get 5 actual fragments per cycle.

A modern PC video card has something like 1000 fragments in progress and can output up to 16 per cycle, so this is actually fairly un-parallel by graphics standards...
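The "5 fragments per cycle" estimate above is just pipeline count divided by pipeline depth in cycles; a minimal sketch, with the 256-pipeline and 50-cycle figures taken from the post:

```python
# Back-of-envelope throughput for a deep pipeline array. The 256
# pipelines and ~50 cycles/fragment are the post's assumed numbers.
def fragments_per_cycle(pipelines: int, cycles_per_fragment: int) -> float:
    """Steady-state fragment throughput once the pipelines are full."""
    return pipelines / cycles_per_fragment

print(fragments_per_cycle(256, 50))  # 5.12 -- "about 5 fragments/cycle"
```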

Well, I would leave the bulk of the computationally heavy workload to the APUs, and we could always wish for these 256 SALCs to be per PE; with 4 PEs in a VS we would have about 20 fragments per cycle processed.

I do not think that Rasterization will be the bottleneck next-generation.

Edit:

Each set of 32 serial ALUs combines to provide 1 operation per cycle on a 4-channel high-precision ALU (FP24 would need 8 3-bit ALUs per channel). Less precise data gets more operations per cycle.

32 serial ALUs = ( 8 × 3-bit ALUs ) * ( 4 channels ).

Right?

So, this would be a SALP? We would process each channel with a separate portion of the SALP, right?

How about a SALP per PE? :D

4 SALPs per VS, then, and we would get 5 fragments (by your calculation) per PE per cycle, or about 20 in total.

So we are talking about 128 3-bit ALUs connected in 4 serial pipelines.
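The counting above can be sketched in a few lines. Note that the 3-bit SALC width, FP24 precision, 4 channels, and 4 SALPs per VS are this thread's working assumptions, not confirmed patent figures:

```python
# SALC/SALP counting as speculated in the thread (assumed, not from
# the patent text): FP24 per channel, 3-bit serial ALU stages.
SALC_BITS = 3
CHANNELS = 4
FP24_BITS = 24

salcs_per_channel = FP24_BITS // SALC_BITS     # 8 serial 3-bit stages
salcs_per_salp = salcs_per_channel * CHANNELS  # 32 SALCs = one SALP
salps_per_vs = 4                               # one SALP per PE, 4 PEs
total_salcs = salcs_per_salp * salps_per_vs    # 128 3-bit ALUs in 4 pipes

print(salcs_per_channel, salcs_per_salp, total_salcs)  # 8 32 128
```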

As I said, once you finish and test a single 3-bit SALC and go on to finish a full SALP, then 4 SALPs is not too much of a jump for the engineering team.

Something tells me that, thanks to the high redundancy in CELL and in this SALC/SALP structure, the engineers dedicated to yield maximization should have at least some more breathing room.
 
Question: why are we still worrying about and practicing borderline numerology instead of worrying about functionality, efficiency, and what it does for you? First of all, nobody is limited by sheer rasterization anymore. As Dr. Kirk stated, it utilized <5% of a contemporary IC, and I have yet to see a single sign or statement about being raster-limited moving forward -- already the NV40 saves logic by assuming that texture ops will be diminishing in importance vis-à-vis shader ops.

And even assuming you spend 500 clocks/fragment and it takes 2 clock cycles to process a fragment per virtual pipeline, it doesn't mean much without knowing the absolute clock. If it clocks at 1 GHz, that's 2 GPixel/sec, assuming this Fig. 6 is indicative of a Pixel Engine in the 'other' Fig. 6; which is logical considering the task and the construct outlined.

Isn't the fact that you can dynamically (re)assign and manipulate the resources, and have such variable output, more important than sheer numbers? Especially when comparing it against, say, an R300/420, or even the NV40.
 
Panajev2001a said:
Well, I would leave the bulk of the computationally heavy workload to the APUs, and we could always wish for these 256 SALCs to be per PE; with 4 PEs in a VS we would have about 20 fragments per cycle processed.
I think we'll have many more fragments per cycle processed on a 4-VS Visualizer (or Realizer :) ).
At this time I (you?) need to process 4 vertices at once to fully exploit VU1's power, and I have (almost) single-cycle memory access with no wait states (VU1 memory accesses should have higher priority than GIF reads and VIF writes, IIRC).
In a 4 GHz chip, which will surely have longer pipelines than the PS2 VUs, what will we do on a shader that requires random accesses to external memory (TMUs? sigh..)? Either Sony provides us with a novel shading mechanism, or we'll have to process a LOT of fragments at the same time in order to hide latencies.
OK, we already discussed this in another thread some weeks ago. My idea was to convert most of the random memory accesses used to fetch texels into streamlined accesses. That's possible in many cases... but what about things such as dependent texture reads, or anything with dynamic mapping coordinates (e.g. env maps)?
Are we doomed? :)
Obviously we aren't... but I'd very much like to know what STI will do to avoid this kind of problem.
Or maybe they will do nothing about it... just like Sony did with clipping or mipmap LOD selection on the GS.. o_O

ciao,
Marco
 
Vince said:
Isn't the fact that you can dynamically (re)assign and manipulate the resources, and have such variable output, more important than sheer numbers? Especially when comparing it against, say, an R300/420, or even the NV40.
I just remembered that Sony has filed a patent (with the Japanese patent office... eheh :) ) about dynamically shifting workload between geometry and shading units... too bad I can't read Japanese, and the online translation sucks.

ciao,
Marco
 
Vince said:
Question: why are we still worrying about and practicing borderline numerology instead of worrying about functionality, efficiency, and what it does for you? First of all, nobody is limited by sheer rasterization anymore. As Dr. Kirk stated, it utilized <5% of a contemporary IC, and I have yet to see a single sign or statement about being raster-limited moving forward -- already the NV40 saves logic by assuming that texture ops will be diminishing in importance vis-à-vis shader ops.

Because we're examining what this is and why. This implies an architecture that scales nicely: low-complexity fragment operations and low-range data increase fillrate significantly. High-complexity operations take longer, but will likely be submitted less often (a complex pixel shader takes more cycles, so fewer fragments are generated).

Also, the fact that this hints (just hints, mind) at high gfx clocks is interesting. Traditionally (including the PS2 GS), gfx clocks have been low, relying on the natural parallelism in graphics to achieve high throughput. The next question you should be asking is why? Why increase clocks over repeating functional blocks on the chip?

Raster-bound is easy; shadows being the usual suspect. Simple shaders with fairly complex fragment operations. But if this is programmable, then things like NV-style depth scissor should be easy to add to help reclaim those valuable cycles.


Don't be so damn defensive... I'm here to discuss gfx architecture, not to insult anybody's favorite console maker. I work on consoles every day; I hate them all equally ;-)

Vince said:
Isn't the fact that you can dynamically (re)assign and manipulate the resources, and have such variable output, more important than sheer numbers? Especially when comparing it against, say, an R300/420, or even the NV40.

So you know that PCs don't have a similar architecture for their fragment ops.
The actual hardware is NOTHING like the model presented by DirectX.
 
DeanoC said:
Don't be so damn defensive... I'm here to discuss gfx architecture, not to insult anybody's favorite console maker. I work on consoles every day; I hate them all equally ;-)

Heh. First of all, I was talking to Panajev, since it seems like he always has to justify a number and make it look bigger. I mean, there's nothing wrong in what you stated that needs a response to placate the people around here who love to post specs with bigger numbers over understanding why and how they apply.

Hell, I was basically agreeing with you, so I'm sorry if I came off as defensive. I think it's a smart concept all in all, although I see it applied strictly to rasterization, not shading tasks as you implied.

PS. Thanks for the heads up on the last part.
 
archie4oz said:
artists apparently usually cheat in CG and just overlap pieces.

A lot! Welding patches is a total PITA...

Actually, you cannot weld NURBS patches... All you can do is ask your 3D app to keep the edges of the patches stitched together, after you've done everything you can to build surfaces with tangency and such. And hope that your app won't fail (which it usually does a few times).

Another problem with NURBS is texturing. You either use the built-in UV parametrization, which will stretch and requires a texture per patch; or you have to use projections, with all their additional overhead.

Trims, fillets and such are also expensive, and even more problematic. No wonder the CG industry switched to subdivs...
 
What does vertex valence mean? In all my work with polygonal modeling for subdivs (for prerendered stuff), I've never really had to care about this as far as I know... or?
 
Vince said:
DeanoC said:
Don't be so damn defensive... I'm here to discuss gfx architecture, not to insult anybody's favorite console maker. I work on consoles every day; I hate them all equally ;-)

Heh. First of all, I was talking to Panajev, since it seems like he always has to justify a number and make it look bigger. I mean, there's nothing wrong in what you stated that needs a response to placate the people around here who love to post specs with bigger numbers over understanding why and how they apply.
Hell, I was basically agreeing with you, so I'm sorry if I came off as defensive. I think it's a smart concept all in all, although I see it applied strictly to rasterization, not shading tasks as you implied.

PS. Thanks for the heads-up on the last part.

My apologies then for misreading you. Sorry.

Actually, I didn't make it clear, but I also meant non-shading tasks. I think you're right that this is logically post-pixel-shader (logically, because the fast-Z paths make it movable in practice). I think it's covering the traditional fixed-function depth/alpha/scissor/stencil ops that go on after the colour value pops out of the shader.

I think lots of people underestimate how complex this is, due to the hidden way it's presented under OGL/DX. This seems to be a nice way to manage it (I especially like that it appears to allow simple ops to be accelerated easily); it would probably give you the double stencil-only rates (the 32x0 of NV40) for shadows, but completely programmable (i.e. a double colour-only rate, like PS2).
 
Laa-Yosh said:
What does vertex valence mean? In all my work with polygonal modeling for subdivs (for prerendered stuff), I've never really had to care about this as far as I know... or?
Given a vertex, its valence is the number of edges that share that vertex.
In a triangle mesh a regular vertex has valence 6; in a quad mesh a regular vertex has valence 4...
 
Laa-Yosh said:
What does vertex valence mean? In all my work with polygonal modeling for subdivs (for prerendered stuff), I've never really had to care about this as far as I know... or?

Valence is a count of how many 'things' something is attached to, vertex valence being the number of edges each vertex is attached to.

It is massively important to subdivision surfaces, because the algorithms require special cases when vertex valence is not regular (what's regular depends on the type of subdivision surface). Vertices with non-ideal valence are known as extraordinary vertices, as they often require special-case code to subdivide; the mathematical properties that make subdivision surfaces so good to work with usually break down completely as well.

In practice they are/were what caused a number of artifacts (ripples, pinching, etc.) in early subdivision surface algorithms.

Higher valence also usually increases the cost of subdividing (extra edge walks, eigen-transforms, and/or memory).
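As an illustration of the definition above, here is a hypothetical helper that counts edge valence from a plain triangle list and could be used to flag extraordinary vertices (for a regular interior vertex of a triangle mesh, the expected valence is 6):

```python
# Compute vertex valence (number of distinct edges touching each vertex)
# from a triangle list given as (a, b, c) index tuples.
def vertex_valences(triangles):
    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # undirected, deduplicated
    valence = {}
    for u, v in edges:
        valence[u] = valence.get(u, 0) + 1
        valence[v] = valence.get(v, 0) + 1
    return valence

# Two triangles sharing edge (0, 2): the shared vertices have valence 3.
tris = [(0, 1, 2), (0, 2, 3)]
print(sorted(vertex_valences(tris).items()))  # [(0, 3), (1, 2), (2, 3), (3, 2)]
```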
 
Would this architecture really force you into implementing multiplication serially? The latency for that sucks, which isn't justified by the area advantage over a matrix multiplier, AFAICS.
 
A subdivision surfaces scheme can be implemented with just a bunch of vector-matrix multiplies (I'm not referring to the classical representation of a subdivision scheme with a subdivision matrix..) with no knowledge whatsoever of vertex valence, connectivity, or subdivision depth. There is no recursion, and complexity is linear.
Unfortunately one needs some huge tables, even if there are a lot of ways to minimize table size... all based on the topological classification of the triangles of the base mesh.
 
Heh. First of all, I was talking to Panajev, since it seems like he always has to justify a number and make it look bigger. I mean, there's nothing wrong in what you stated that needs a response to placate the people around here who love to post specs with bigger numbers over understanding why and how they apply.

Borderline numerology?

Justifying a number and making it look bigger? Have you ever thought that what I do is part of understanding what a number is about?

If I tell someone that the Earth orbits the Sun at 30 km/s and they reply that it is nothing because their car can do much more than 30, guess what: I will wonder whether they understand the difference between covering 30 km in a second and in an hour (no, I am not saying this was the case between DeanoC and me; this is just a very, very loosely related example). Guess what: context matters. I am just trying to understand the number in its context.

The speed of the SALC-based Pixel Engines is important, just as the TMUs not being embedded with the APUs is important.

NV40 and R42x both have TMUs embedded in their Pixel Shaders, and in the case of the NV40 you have TMUs in the Vertex Shaders too.

We are working out what the SALC/SALP applies to, but the question is now shifting to other subjects.

What will we do to hide latencies in the Pixel Shaders, especially when we constantly need texture samples (which these SALPs would provide) and the APUs need to perform multi-cycle context switching to pass from one pixel to the next (the old pixel is stalling and a new one has to start processing)?

You want to understand how it applies, right? So do I, but thinking a little bit ahead, you can also think about implementation and about what the projected performance might mean.

Thank you for the label. I do not want to understand why and how it applies? Well, at least I know you do not know something about me.

Darn, what the fuck have you eaten tonight? I know you have somewhat of an attitude, but geez, calm down... you can say the same things in many different ways.
 
MfA said:
Would this architecture really force you into implementing multiplication serially? Latency for that sucks, which isnt justified by the area advantage over a matrix multiplier AFAICS.

Scalar (1-3 bit) multiplication, addition, subtraction, and division is provided in the SALC (FP and FX, I think).

Using 1-bit SALCs you would need 16 SALCs to do a 4-bit multiply, and you would have 8 output bits to pick up.
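The 16-SALC figure follows from the partial-product array of a multiplier: each of the 4×4 partial products is a single 1-bit multiply (an AND), and summing the shifted partial products yields the 8-bit result. A sketch, assuming simple unsigned operands:

```python
# A 4-bit multiply built from 16 1-bit multiplies (one AND gate each),
# illustrating why 16 1-bit units and an 8-bit result were cited above.
def mul4(a: int, b: int) -> int:
    assert 0 <= a < 16 and 0 <= b < 16
    result = 0
    for i in range(4):          # bit i of a
        for j in range(4):      # bit j of b
            pp = ((a >> i) & 1) & ((b >> j) & 1)  # one 1-bit multiply
            result += pp << (i + j)               # shifted accumulate
    return result               # fits in 8 bits: max 15 * 15 = 225

print(mul4(13, 11))  # 143
```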
 
Also, the fact that this hints (just hints, mind) at high gfx clocks is interesting. Traditionally (including the PS2 GS), gfx clocks have been low, relying on the natural parallelism in graphics to achieve high throughput. The next question you should be asking is why? Why increase clocks over repeating functional blocks on the chip?

Raster-bound is easy; shadows being the usual suspect. Simple shaders with fairly complex fragment operations. But if this is programmable, then things like NV-style depth scissor should be easy to add to help reclaim those valuable cycles.

If the APUs cannot automatically hide the context-switch latency when jumping from pixel to pixel on a stall (due to a texture fetch... hopefully the time it takes an APU to switch to another pixel is less than the time the texture fetch takes to complete), what we can do, like nAo said, is submit the texture fetches as early as we can, and in groups, to take advantage of the high clock speed of the SALPs. (This is loosely similar to one of the ideas behind the double-pumped ALUs in the Pentium 4: they allow dependent instructions to be processed every 0.5 cycles, to unwind cases in which you have filled the ROB with instructions stalling on memory accesses and want to get them all out in a hurry.)

It is interesting what nAo says about PS2's VU1... even with 16 KB of D-Memory with single-cycle access (basically like accessing registers: a 0-cycle load-use penalty for dependent instructions), you still need to be processing vertices in groups of 4 at a time.

I do not see texture fetches in Pixel Shaders becoming less frequent, and I see a potential latency issue rearing its ugly head, as they say.

What are the APUs going to do? How fast can they jump from pixel to pixel? How fast, in terms of latency, could a group of SALCs provide a bilinearly filtered texture sample to an APU, and how many SALCs would we use?

There are lots of questions to be asked.

I like what this SALC/SALP structure brings, and I am looking forward to understanding how it would be used and how it would impact the Visualizer if used for the Pixel Engines, as we are suggesting.
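For reference on the bilinear question above: one bilinear sample is 4 texel fetches plus 3 linear interpolations, so the latency-bound part is the fetches, not the math. A minimal sketch, with `texture` standing in as a hypothetical row-major 2D array of floats:

```python
# Bilinear texture sampling: 4 fetches (the latency-bound part) and
# 3 lerps (cheap ALU work). Coordinates are fractional texel coords.
def lerp(a, b, t):
    return a + (b - a) * t

def bilinear(texture, u, v):
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    t00 = texture[y0][x0]          # fetch 1
    t10 = texture[y0][x0 + 1]      # fetch 2
    t01 = texture[y0 + 1][x0]      # fetch 3
    t11 = texture[y0 + 1][x0 + 1]  # fetch 4
    top = lerp(t00, t10, fx)       # lerp 1
    bottom = lerp(t01, t11, fx)    # lerp 2
    return lerp(top, bottom, fy)   # lerp 3

tex = [[0.0, 1.0], [1.0, 2.0]]
print(bilinear(tex, 0.5, 0.5))  # 1.0
```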
 
I do not see texture fetches in Pixel Shaders becoming less frequent and I see a potential latency issue rearing its ugly head as they say.

OK, so let's say I execute a 10-instruction shader; probably 30%-plus of those (probably more) would be texture fetches.

Now let's say I can afford to spend 100 instructions on a pixel: what percentage of those instructions are likely to be texture fetches?

Now let's say I can afford to spend >1000 instructions on a pixel..... you see what I'm getting at.

Current pixel shaders are a lot like early-80s programming: if you can do it with a table (texture fetch), you do; memory accesses are cheap, and you can do a lot of math with one table lookup.

Once you can afford a lot of math instructions, or the latency on texture fetches starts to rise (and it will, if clock rates go up), you'll solve the problem differently.

This brings me back to How many ops/pixel will we see on nextgen platforms :p
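The scaling argument above can be put in numbers; the per-shader fetch counts below are illustrative guesses, not measurements:

```python
# If the number of texture fetches per pixel stays roughly fixed while
# ALU budgets grow, fetches shrink as a fraction of the shader.
# The (total, fetches) pairs are illustrative, not measured.
def fetch_fraction(total_instructions: int, texture_fetches: int) -> float:
    return texture_fetches / total_instructions

for total, fetches in [(10, 3), (100, 6), (1000, 10)]:
    print(total, f"{fetch_fraction(total, fetches):.1%}")
# 10-instruction shader: 30% fetches; 1000-instruction shader: 1%
```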
 
ERP said:
I do not see texture fetches in Pixel Shaders becoming less frequent, and I see a potential latency issue rearing its ugly head, as they say.

OK, so let's say I execute a 10-instruction shader; probably 30%-plus of those (probably more) would be texture fetches.

Now let's say I can afford to spend 100 instructions on a pixel: what percentage of those instructions are likely to be texture fetches?

Now let's say I can afford to spend >1000 instructions on a pixel..... you see what I'm getting at.

Current pixel shaders are a lot like early-80s programming: if you can do it with a table (texture fetch), you do; memory accesses are cheap, and you can do a lot of math with one table lookup.

Once you can afford a lot of math instructions, or the latency on texture fetches starts to rise (and it will, if clock rates go up), you'll solve the problem differently.

This brings me back to: how many ops/pixel will we see on next-gen platforms :p

See, I understand what you are saying, but why do we see so much time in offline CG (more or less 30% of rendering time) spent in texture I/O?

This is what I am getting at... just because we can afford a lot of math ops per fragment (a renderfarm able to spend many hours on a single frame can afford quite long shaders), it does not mean that our texture fetches will disappear.

Of course, I am not telling you anything that you do not either know or have already experienced first-hand yourself.

It is true that in offline CG they have quite a lot of time per frame and latency can be handled quite well, but what do we do with the Visualizer?

I am interested in what you propose... you are talking about an approach that won't be seen in Xbox 2 (texture fetches in the Vertex or Pixel Shaders will not be a huge problem, as we have TMUs quite closely tied to the Shader ALUs, and the Shader ALUs are multi-threaded) and something that people are not used to in offline CG.

How much, as we try to march towards the quality of offline CG, can we substitute math ops for textures and the texture fetches they require?

Latency is a main issue in all computing, and in-order, single-threaded APUs do not seem so latency-happy: of course we can work things out playing around with the APUs, but how much efficiency are we losing?

How much faster would we be able to push the APUs if each APU in the VS had a small TMU attached, with its own texture cache allowing low-latency texture fetches?
 
See, I understand what you are saying, but why do we see so much time in offline CG (more or less 30% of rendering time) spent in texture I/O?

Because texture I/O is incredibly slow in offline rendering: best case it's coming out of main memory, more likely off disk. It's the primary reason that RenderMan uses a tile-based system (to maximise texture locality).

In a real-time system you're working with somewhat different constraints. Yes, texture latency is still huge compared to ALU ops, but a lot of the texture ops in current shaders are there to approximate calculations, not as part of the art assets. Really, how many textures are we likely to see combined on a pixel?

I've been looking at lighting models that take >25 dot products per pixel, and that doesn't include transforming all the inputs into the right space. Once you start doing lighting at a pixel level and looking at better lighting models, you can easily spend 100s of ALU ops per pixel; texture complexity, even if it weren't constrained by memory, just isn't going to explode like that.

You can increase the speed of your ALUs with clock rate, but doing so increases texture latency, so you need more ALUs if you want to continue to hide it. At some point you just have to say textures will take multiple clocks in the shader.
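The last point is essentially a Little's-law argument: the number of fragments you must keep in flight to hide a fetch scales with its latency in cycles, so raising the clock (more cycles per fetch) demands more parallel work. A sketch with illustrative numbers, not real hardware figures:

```python
# Little's-law style estimate: in-flight work = latency * issue rate.
# Both parameters below are illustrative assumptions.
def fragments_in_flight(fetch_latency_cycles: int,
                        fetches_issued_per_cycle: int) -> int:
    """Fragments needed in flight so the ALUs never stall on a fetch
    (assuming roughly one outstanding fetch per fragment)."""
    return fetch_latency_cycles * fetches_issued_per_cycle

print(fragments_in_flight(200, 4))  # 800 fragments at 200-cycle latency
print(fragments_in_flight(400, 4))  # 1600 -- doubling latency doubles it
```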
 