Chalnot, I am sorry we are misunderstanding each other, I was not comparing EE's VUs to the Vertex Shaders.
I was treating them as a super-set of them, a set that also includes PPP functionality.
As far as our discussion goes, for the tool designers (compilers, libraries, etc.) optimizing for the VUs is not incredibly different from optimizing for a PPP + Vertex Shaders.
The VUs might be called slow, but even DirectX 8 Vertex Shaders take 4 cycles to do a basic Transform with perspective divide, and the VUs take only 7 (VU1 can be sped up to 5 using the additional FDIV present in the EFU).
That is not incredibly slow.
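To make it concrete, this is the per-vertex work those cycle counts refer to: four dot products for the 4x4 matrix transform, then a divide by w. A minimal sketch in Python (the function name and layout are my own, purely for illustration):

```python
# Sketch of the per-vertex work being counted above: a 4x4 matrix
# transform followed by a perspective divide. On a VU the four dot
# products map to MUL/MADD-style ops and the divide to FDIV (or to
# the EFU's extra divider on VU1).
def transform_vertex(m, v):
    # m: 4x4 row-major matrix, v: (x, y, z, w) vertex
    x, y, z, w = (sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
    # perspective divide: project into normalized device coordinates
    return (x / w, y / w, z / w)
```

Every vertex goes through exactly this sequence, which is why the matrix transform and the divide dominate the cycle budget on both architectures.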
If you had to write code for the PPP and the Vertex Shaders in an ASM-level language for both, rather than your neat HLSL, you would feel pain similar to what VU coders do: it might be tough, but it is possible.
Better tools are coming to help you code for those Vector Units, and I do not think the technology on the software side has stopped at the point it is now either.
The worst thing for PlayStation 2 programmers seems to be the efficiency of the R5900i's memory accesses, due to the small L1 cache and the lack of an L2 cache.
If you had two VU1s in parallel and the bandwidth to feed them both, I do not think developers would have tons of problems splitting the T&L work between them.
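The reason the split is easy is that once vertices reach the T&L stage they can be processed independently, so the vertex list can simply be partitioned. A minimal sketch of the idea, with hypothetical names and the units simulated in sequence:

```python
# Hedged sketch of splitting T&L work across parallel units: each
# vertex is independent at this stage, so the list is cut into one
# batch per unit and the transformed results are concatenated back
# in order. Here the "units" run one after another for illustration.
def split_tnl(vertices, transform, units=2):
    chunk = (len(vertices) + units - 1) // units  # ceiling division
    batches = [vertices[i:i + chunk] for i in range(0, len(vertices), chunk)]
    # each batch would be uploaded to its own VU1; the per-vertex
    # transform never needs to see any other batch
    return [transform(v) for batch in batches for v in batch]
```

Since no batch depends on another, the only real cost of the split is feeding both units with data, which is why I stressed bandwidth above.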
VU0 was not primarily intended for T&L jobs; that was VU1's work, and that is the way it has been used: a lot of early titles started using VU0 in micro-mode as if it were an SH-4, but then shifted the T&L code onto VU1, and a lot of them did not push VU0 much for a good while.
I agree with your point regarding the independence of each Vertex from the others allowing for easier parallelism, but I would hardly say the EE was built for General Purpose Processing rather than for multi-media number-crunching.
When PPPs come to the realm of PC GPUs you will have a similar situation, unless you leave the host CPU to do all that work for the VS and we keep the current scenario.
Whether we have those kinds of Vector Units on the CPU's chip or on the GPU's chip makes no real difference: one day PC GPUs will seek the same degree of programmability in the Geometry processing part of the pipeline; they are already looking into that.
In the end, there is one major difference that will always make a GPU more efficient, and that's simply that each vertex passed to the GPU is assumed to be independent of all other vertices. This independence makes parallelism almost trivial. The more general-purpose nature of the PS2 means that the hardware and software designers can't assume such independence, making it much more challenging to make use of all units.
Parallelism can still be exploited once Vertex data has passed beyond the reach of a PPP and has to be transformed and lit.
Also, we can still tile the screen, clip triangles and process them in tiles, assigning one or more tiles to each Vector Processor.
We can also exploit parallelism at the surface/object level: working around those problems can be done.
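To illustrate the tiling route: bin each triangle into the screen tiles its bounding box overlaps, then hand each tile's triangle list to a vector processor. This is only a sketch of the partitioning scheme; all names are hypothetical, and tiles are dealt out round-robin for simplicity:

```python
# Hedged sketch of screen tiling for parallel triangle processing:
# triangles are bucketed into a grid of square tiles by their
# screen-space bounding box, then the tiles are distributed across
# n_procs vector processors round-robin.
def bucket_triangles(triangles, screen_w, screen_h, tile, n_procs):
    cols = (screen_w + tile - 1) // tile
    rows = (screen_h + tile - 1) // tile
    buckets = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri in triangles:
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # clamp the bounding box to the tile grid
        x0, x1 = int(min(xs)) // tile, min(int(max(xs)) // tile, cols - 1)
        y0, y1 = int(min(ys)) // tile, min(int(max(ys)) // tile, rows - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                buckets[(tx, ty)].append(tri)
    # assign whole tiles to processors round-robin
    work = [[] for _ in range(n_procs)]
    for i, key in enumerate(sorted(buckets)):
        work[i % n_procs].extend(buckets[key])
    return work
```

A triangle spanning several tiles lands in each of them, so some clipping or duplicate work is the price paid for keeping the tiles fully independent of one another.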