nVidia shader patent (REYES/Raytracing/GI) destined for PS3?

well..AFAIK CELL tries to solve at least a couple of problems (A and D)
I'd add another problem:
E: special scalar unit count and latency
On SIMD units, scalar operations such as reciprocal or square root are most of the time deadly slow (much longer latency than fmadd ops, and not fully pipelined). That could be mitigated in the same way DeanoC proposes to solve problem C. Another solution is to have more RCP/SQRT units per SIMD unit (see PS2 VU1).

ciao,
Marco
 
so if I said
1. multiple SIMD ALUs per core
2. 4-way double precision SIMD

Which lines would you put smiley faces next to?
 
DeanoC said:
Gubbi said:
Do you have a concrete example? Better how?

Inquiring minds want to know :)

Any details and I'd have to kill you :(

But taking a general tack, it's fairly easy to see how we can improve a SIMD unit for the modern CPU landscape.

Problem A: RAM speeds
A SIMD unit can use a lot of RAM (a handful of 4x4 float matrices quickly adds up to 1/2KB). RISC memory units are too slow (load/store as separate instructions); what we want is old-fashioned CISC direct to/from memory. Of course we actually want a small pool of very fast RAM. Let's call that the "register pool", which saves any embarrassment from RISC fans :)
So the solution to problem A is to have so many registers that they use the same amount of memory as 8-bit computers used to have. The Cell SPU reportedly has 128 128-bit registers (2KB), which sounds a good figure.

OK, so you want more registers; see below for why that might not be *that* good an idea. And you do want to keep values in registers: go to memory and the latency will kill you. Even with super-fast SRAM you're looking at 3 cycles each for loads and stores (6 in total), compared to one or two cycles for a result-forwarding network.

DeanoC said:
Problem B: RAM speeds
O.K., even with lots of registers I have to read/write stuff sometimes. If I'm going to, it would be good to compress everything, say using a decoder like the one fitted to every vertex shader (including the PS2's) to unpack/pack data.
So the solution to problem B is to have dedicated instructions/units for packing/unpacking the formats most likely to be encountered by GPUs and CPUs.

A valid point. In AltiVec you'd have to pack/unpack in software, using a few instructions.
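For what it's worth, a minimal sketch of the software route with AltiVec intrinsics (the function name and the 1.7 fixed-point scaling are my own illustrative assumptions):

Code:
#include <altivec.h>

/* Illustrative only: expand the first four packed signed bytes of a vector
 * (e.g. a compressed vertex normal) into floats in software, since there is
 * no load-and-convert instruction. Assumes 1.7 fixed-point, hence the /128. */
vector float unpack_s8_to_float(vector signed char packed)
{
    vector signed short s16 = vec_unpackh(packed); /* sign-extend s8  -> s16 */
    vector signed int   s32 = vec_unpackh(s16);    /* sign-extend s16 -> s32 */
    return vec_ctf(s32, 7);                        /* s32 -> float, scaled by 1/128 */
}

A dedicated unpack unit would do the same job as part of the load itself.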

DeanoC said:
Problem C: RAM speeds
Still, sometimes we are going to stall due to memory latency, so if that happens let's make sure we have some thread contexts we can switch to and see if they could be doing something useful.
So the solution to problem C is to have multiple thread contexts per core. If one thread stalls, switch to another and do some useful work.

I agree, multiple thread contexts could boost throughput. But IMO this clashes with your request in A: each 128-register context would be 2KB, so supporting a second context would be prohibitively expensive. I'd wager you'd be better off with four 32-register contexts and 64 rename registers, and then still have die area to spare.
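Rough arithmetic behind those figures (my numbers, assuming 128-bit, i.e. 16-byte, registers):

\[
2 \times 128 \times 16\,\mathrm{B} = 4\,\mathrm{KB}
\qquad\text{vs.}\qquad
(4 \times 32 + 64) \times 16\,\mathrm{B} = 3\,\mathrm{KB}
\]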

DeanoC said:
Problem D: We need to pretend that FLOPs figures are really important.
SIMD ALUs are cheap
<snip>
It's just that finding more than about 2 non-graphical, math-intensive tasks (physics and sound are the obvious candidates) gets real hard quickly.
But you'd want to dice each of these into chunks that can be processed in parallel by all your cores, right?

Cheers
Gubbi
 
psurge said:
so if I said
1. multiple SIMD ALUs per core
2. 4-way double precision SIMD

Which lines would you put smiley faces next to?

1. The problem is keeping a single SIMD ALU per core busy, so adding another one might not be the best way to spend those transistors.

2. Doubles aren't ready for very high performance SIMD yet IMHO (4-way would increase each register to 256 bits...).
 
Gubbi said:
I agree, multiple thread contexts could boost throughput. But IMO this clashes with your request in A: each 128-register context would be 2KB, so supporting a second context would be prohibitively expensive. I'd wager you'd be better off with four 32-register contexts and 64 rename registers, and then still have die area to spare.
Register renaming doesn't help with large data sets; SIMD code often works on large sets of matrices (think about skinning or solving a large LCP).
You need direct access to a fairly large memory space, and renaming doesn't help there. Hence having 2K of very fast direct-access RAM is a good route.
Another problem is that register renaming (like OOOE) isn't something console processors do.
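To make the skinning case concrete, here is a minimal sketch (bone count and layout are made-up numbers, not anyone's real engine); the point is that the matrix palette alone is a few KB that wants to sit right next to the SIMD unit:

Code:
/* Illustrative skinning inner loop: a palette of 64 bone matrices
 * (64 x 64 bytes = 4 KB) is touched for every vertex, so renaming a
 * handful of architectural registers doesn't help - you want the whole
 * palette in fast local storage. */
#define NUM_BONES 64

typedef struct { float m[4][4]; } Mat4;
typedef struct { float p[4]; int bone[4]; float weight[4]; } Vertex;

static Mat4 palette[NUM_BONES];   /* ~4 KB working set */

static void skin_vertex(const Vertex *v, float out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = 0.0f;
    for (int b = 0; b < 4; ++b) {                 /* 4 bone influences */
        const Mat4 *m = &palette[v->bone[b]];
        for (int r = 0; r < 4; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < 4; ++c)
                acc += m->m[r][c] * v->p[c];
            out[r] += v->weight[b] * acc;
        }
    }
}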

Gubbi said:
But you'd want to dice each of these into chunks that can be processed in parallel by all your cores, right?
It's not that easy in practice; I agree it's the right way, but there is a lot of inertia toward large threads rather than task-based chunks.
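A minimal sketch of what "task-based chunks" means in practice (plain pthreads, all names and sizes made up): the physics update is diced into independent chunks that any core can pick up, instead of living in one big physics thread.

Code:
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define NUM_BODIES  4096
#define CHUNK       256   /* bodies per task chunk */

static float pos[NUM_BODIES], vel[NUM_BODIES];

/* One independent chunk of the physics update. */
static void update_chunk(size_t first)
{
    for (size_t i = first; i < first + CHUNK; ++i)
        pos[i] += vel[i] * (1.0f / 60.0f);
}

/* Each worker walks the chunk list with a stride, so all cores share the work. */
static void *worker(void *arg)
{
    size_t id = (size_t)arg;
    for (size_t c = id * CHUNK; c < NUM_BODIES; c += NUM_THREADS * CHUNK)
        update_chunk(c);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (size_t i = 0; i < NUM_THREADS; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (size_t i = 0; i < NUM_THREADS; ++i)
        pthread_join(t[i], NULL);
    return 0;
}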
 
What about the addition of an L1 D-cache dedicated to the SIMD unit? I imagine it would be able to support reading and writing 128 bits per cycle... Support for non-allocating memory operations would reduce cache thrashing for streaming data.
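As a rough illustration of the non-allocating idea on today's hardware (SSE streaming stores; the function name is made up, and dst must be 16-byte aligned):

Code:
#include <xmmintrin.h>
#include <stddef.h>

/* Scale a large array, writing results with non-temporal stores so the
 * output buffer doesn't evict the working set from the cache. */
void scale_stream(float *dst, const float *src, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(&src[i]), vk);
        _mm_stream_ps(&dst[i], v);   /* store does not allocate in the cache */
    }
    _mm_sfence();                    /* make the streamed stores globally visible */
}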
 
nAo said:
On SIMD units, scalar operations such as reciprocal or square root are most of the time deadly slow (much longer latency than fmadd ops, and not fully pipelined).

Eh, RCPSS has a 3-cycle latency on K7/K8, that's pretty fast IMO. The SIMD version RCPPS has a 4-cycle latency. Use Newton-Raphson (the refinement runs fully pipelined through the multiply/add units) to increase precision if you need it.
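A minimal sketch of that refinement with SSE intrinsics (the wrapper name is just illustrative): one Newton-Raphson step on the RCPPS estimate, x1 = x0*(2 - a*x0), using only pipelined multiplies and subtracts.

Code:
#include <xmmintrin.h>

/* ~12-bit RCPPS estimate refined to roughly 22-23 bits with one NR step. */
static inline __m128 rcp_nr_ps(__m128 a)
{
    __m128 x0  = _mm_rcp_ps(a);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(a, x0)));
}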

Cheers
Gubbi
 
psurge said:
What about the addition of an L1 D-cache dedicated to the SIMD unit? I imagine it would be able to support reading and writing 128 bits per cycle... Support for non-allocating memory operations would reduce cache thrashing for streaming data.
like this?
A load access pattern of the DMA mechanism is detected. At least one potential load of data is predicted based on the load access pattern. In response to the prediction, the data is prefetched from a system memory to a cache before a DMA command requests the data.

It's about a prefetch mechanism coupled with an L1 cache that feeds a SIMD unit. (CELL stuff.. of course ;) )
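A toy sketch of the idea in that abstract (everything here, including prefetch_to_cache, is a hypothetical stand-in for whatever the hardware actually does): watch successive load addresses, and once a constant stride shows up, fetch the next block before it is asked for.

Code:
#include <stdint.h>

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confident;   /* saw the same stride twice in a row */
} StridePredictor;

static void prefetch_to_cache(uint64_t addr) { (void)addr; /* HW stand-in */ }

static void observe_load(StridePredictor *p, uint64_t addr)
{
    int64_t s = (int64_t)(addr - p->last_addr);
    p->confident = (s != 0 && s == p->stride);
    p->stride    = s;
    p->last_addr = addr;
    if (p->confident)
        prefetch_to_cache(addr + (uint64_t)s);   /* fetch the predicted next block early */
}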

ciao,
Marco
 
Gubbi said:
Eh, RCPSS has a 3-cycle latency on K7/K8, that's pretty fast IMO. The SIMD version RCPPS has a 4-cycle latency.
Wow! that's fast :) that unit should be huge ;)
The nv20 vertex shader RCP units have a 6-cycle latency AFAIK; the PS2 VU1 has a 7-cycle latency DIV unit and an 11-cycle latency reciprocal unit IIRC.

ciao,
Marco
 
DeanoC said:
Register renaming doesn't help with large data sets; SIMD code often works on large sets of matrices (think about skinning or solving a large LCP).
You need direct access to a fairly large memory space, and renaming doesn't help there. Hence having 2K of very fast direct-access RAM is a good route.
Completely correct, you of course need enough registers to hold your problem's data. How many matrices are used in skinning these days?

As for solving LCPs you'd probably never have enough registers for that (arbitrarily large data set).

DeanoC said:
Another problem is that register renaming (like OOOE) isn't something console processors do.
Both XCPU and Gekko have register renaming.

Cheers
Gubbi
 
nAo said:
Gubbi said:
Eh, RCPSS has a 3-cycle latency on K7/K8, that's pretty fast IMO. The SIMD version RCPPS has a 4-cycle latency.
Wow! that's fast :) that unit should be huge ;)
The nv20 vertex shader RCP units have a 6-cycle latency AFAIK; the PS2 VU1 has a 7-cycle latency DIV unit and an 11-cycle latency reciprocal unit IIRC.

Not big at all. It's implemented using a small lookup table. And you need to do one or two NR iterations on the result to get (almost) full precision. So the total latency would be comparable to the others, but pipelined (much higher throughput).

Cheers
Gubbi
 
nAo -

Similar, but that's all about the prefetch mechanism (to the cache). What I propose is more like this:
Code:
_____       __________       __________       ____________
     |     |          |     |          |     |            |
     |     | L2 cache | <-> | L1 cache | <-> | SIMD Units |
 Mem | <-> |          | (1) |          | (1) | Load/Store |
     |     |          |      ----------      |            |
     |     |          |                      |            |
     |     |          | <------------------> |            |
     |      ----------           (2)         |            |
-----                                         ------------
  |                       __________               |
   <-------------------- | Prefetch | <------------
                         |    (3)   |     (addresses)
                          ----------


 (1) standard load/store
 (2) non-allocating load/store
 (3) Prefetch hardware (prefetch to L2 or L1),
     with HW speculation as well as software 
     instructions

The L2 makes this less relevant to Cell and more relevant to my imaginary XeCPU. I imagine the L2 cache being organized into 4 banks,
so that 4 cores could load/store to it simultaneously in the absence of bank conflicts.
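A tiny sketch of the bank selection I have in mind (purely illustrative: 64-byte lines, bits [7:6] of the address pick one of the 4 banks, so consecutive cache lines from different cores spread across banks):

Code:
/* Hypothetical 4-bank L2 address mapping. Two cores only conflict when
 * they hit lines that map to the same bank in the same cycle. */
static inline unsigned l2_bank(unsigned long addr)
{
    return (unsigned)(addr >> 6) & 0x3u;   /* cache-line index modulo 4 banks */
}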
 
If the GPU pushes 3 Gpoly/s, then in one frame it pushes 50 Mpoly,
with each triangle about 1-10 pixels in size.

That's 50-500 megapixels per frame through the pixel shader, which is too much.


With deferred rendering only 1 megapixel per frame has to be computed with the shader.

This technique is 50-500 times faster!!!!!!!!!

Will nvidia use it?????
 
version said:
If the GPU pushes 3 Gpoly/s, then in one frame it pushes 50 Mpoly,
with each triangle about 1-10 pixels in size.

That's 50-500 megapixels per frame through the pixel shader, which is too much.


With deferred rendering only 1 megapixel per frame has to be computed with the shader.

This technique is 50-500 times faster!!!!!!!!!

Will nvidia use it?????

Don't you think your overdraw is a little high? Optimally there should really be at most 2x or 3x overdraw on average, unless the game engine isn't doing any sorting at all.
 
Cryect said:
version said:
If the GPU pushes 3 Gpoly/s, then in one frame it pushes 50 Mpoly,
with each triangle about 1-10 pixels in size.

That's 50-500 megapixels per frame through the pixel shader, which is too much.


With deferred rendering only 1 megapixel per frame has to be computed with the shader.

This technique is 50-500 times faster!!!!!!!!!

Will nvidia use it?????

Don't you think your overdraw is a little high? Optimally there should really be at most 2x or 3x overdraw on average, unless the game engine isn't doing any sorting at all.


OK, to continue your idea: the GPU does 50-500 million Z compares and puts 2-3 million pixels through the shader.

With deferred rendering it does 50-500 million Z compares and only 1 million pixels through the shader.
My conclusion: the GPU should have more simple rasterizing units for Z compares and fewer pixel shader pipelines.
 
DeanoC said:
Gubbi said:
Do you have a concrete example? Better how?

Inquiring minds want to know :)

Any details and I'd have to kill you :(

But taking a general tack, it's fairly easy to see how we can improve a SIMD unit for the modern CPU landscape.

Problem A: RAM speeds
A SIMD unit can use a lot of RAM (a handful of 4x4 float matrices quickly adds up to 1/2KB). RISC memory units are too slow (load/store as separate instructions); what we want is old-fashioned CISC direct to/from memory. Of course we actually want a small pool of very fast RAM. Let's call that the "register pool", which saves any embarrassment from RISC fans :)
So the solution to problem A is to have so many registers that they use the same amount of memory as 8-bit computers used to have. The Cell SPU reportedly has 128 128-bit registers (2KB), which sounds a good figure.

Problem B: RAM speeds
O.K., even with lots of registers I have to read/write stuff sometimes. If I'm going to, it would be good to compress everything, say using a decoder like the one fitted to every vertex shader (including the PS2's) to unpack/pack data.
So the solution to problem B is to have dedicated instructions/units for packing/unpacking the formats most likely to be encountered by GPUs and CPUs.

Problem C: RAM speeds
Still, sometimes we are going to stall due to memory latency, so if that happens let's make sure we have some thread contexts we can switch to and see if they could be doing something useful.
So the solution to problem C is to have multiple thread contexts per core. If one thread stalls, switch to another and do some useful work.

Problem D: We need to pretend that FLOPs figures are really important.
SIMD ALUs are cheap, so let's have a few. Makes the paper figures look good, even though the real problems are A, B and C.
So the solution to problem D is to have N SIMD cores.
Note: I'm being overly sarcastic ;-) There are lots of good reasons why having multiple cores is a good thing. It's just that finding more than about 2 non-graphical, math-intensive tasks (physics and sound are the obvious candidates) gets real hard quickly.

A good SIMD unit will address at least 2 of these, a really good one will address all 4... The last two are really CPU architecture issues, but the SIMD units have to be integrated into the thread architecture to get good performance.


Been spending much time on Xenon lately?
 
version said:
OK, to continue your idea: the GPU does 50-500 million Z compares and puts 2-3 million pixels through the shader.

With deferred rendering it does 50-500 million Z compares and only 1 million pixels through the shader.
My conclusion: the GPU should have more simple rasterizing units for Z compares and fewer pixel shader pipelines.

Z compares will be nowhere near that order if they use any sort of hierarchical Z-buffer. Nvidia definitely has their own hierarchical Z-buffer and calls it Z Occlusion Culling. It works on blocks up to the size of a processing block for each pixel pipeline, which on recent Nvidia cards is 32x32, so you can toss out 32x32 = 1024 pixels at once with one compare if they are all hidden, and throw out other pixels without wasting Z-buffer bandwidth if they aren't all hidden (then who knows exactly how their table breaks down after that). Then again it might not gain Z-buffer bandwidth, since it would surprise me if entire 32x32 blocks are loaded at once; but outside Nvidia I doubt anyone knows exactly how they are doing it.


Edit: Thinking about it, I'm not sure how much my idea holds true for small polygons. It works great for larger polygons, and hierarchical Z-buffers do help reduce Z bandwidth; if polygons are sent in meshes you can probably get similar performance to having one larger polygon.
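For illustration, a toy version of the per-tile test I'm describing (sizes and names are my own, not Nvidia's actual scheme): keep the farthest depth stored in each 32x32 tile, and reject a primitive's whole tile with one compare when its nearest depth is still behind that.

Code:
#define TILE      32
#define SCREEN_W  1024
#define SCREEN_H  768
#define NUM_TILES ((SCREEN_W / TILE) * (SCREEN_H / TILE))

/* Farthest (largest) depth currently stored in each 32x32 tile. */
static float tile_max_z[NUM_TILES];

/* Returns 1 if every pixel the primitive could touch in this tile is hidden,
 * i.e. 1024 pixels rejected with a single compare. */
static int tile_fully_hidden(int tile_index, float prim_min_z)
{
    return prim_min_z >= tile_max_z[tile_index];
}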
 
Cryect said:
version said:
OK, to continue your idea: the GPU does 50-500 million Z compares and puts 2-3 million pixels through the shader.

With deferred rendering it does 50-500 million Z compares and only 1 million pixels through the shader.
My conclusion: the GPU should have more simple rasterizing units for Z compares and fewer pixel shader pipelines.

Z compares will be nowhere near that order if they use any sort of hierarchical Z-buffer. Nvidia definitely has their own hierarchical Z-buffer and calls it Z Occlusion Culling. It works on blocks up to the size of a processing block for each pixel pipeline, which on recent Nvidia cards is 32x32, so you can toss out 32x32 = 1024 pixels at once with one compare if they are all hidden, and throw out other pixels without wasting Z-buffer bandwidth if they aren't all hidden (then who knows exactly how their table breaks down after that). Then again it might not gain Z-buffer bandwidth, since it would surprise me if entire 32x32 blocks are loaded at once; but outside Nvidia I doubt anyone knows exactly how they are doing it.


Edit: Thinking about it, I'm not sure how much my idea holds true for small polygons. It works great for larger polygons, and hierarchical Z-buffers do help reduce Z bandwidth; if polygons are sent in meshes you can probably get similar performance to having one larger polygon.


A hierarchical Z-buffer is not so good for lots of small polygons.
 
ERP said:
DeanoC said:
Gubbi said:
Do you have a concrete example? Better how?

Inquiring minds want to know :)

Any details and I'd have to kill you :(

But taking a general tack, it's fairly easy to see how we can improve a SIMD unit for the modern CPU landscape.
(...)

A good SIMD unit will address at least 2 of these, a really good one will address all 4... The last two are really CPU architecture issues, but the SIMD units have to be integrated into the thread architecture to get good performance.


Been spending much time on Xenon lately?

Why do you think that :?: :LOL: ;)
 
version said:
A hierarchical Z-buffer is not so good for lots of small polygons.

Yeah, that's what my edit was about, but if programmers use poly meshes it's possible to process whole parts of the mesh like a larger polygon, to a certain degree. It's more complex of course, but it allows for gains elsewhere, as most optimizations do.
 