RSX: memory bottlenecks?

j^aws · Jan 25, 2006

trinibwoy said:
Really? Do you have a source for that info? I thought G70 was SIMD across all quads.

I'm pretty sure Jawed means that 'all' the quads together act as an MIMD 'cluster', but 'each' quad is actually still SIMD...

trinibwoy · Jan 25, 2006

Jaws said:
I'm pretty sure Jawed means that 'all' the quads together act as an MIMD 'cluster', but 'each' quad is actually still SIMD...

Yeah, I got that meaning, but my earlier understanding was that all 24 shaders acted as one SIMD cluster. Didn't know each quad was independent.

Jawed · Jan 25, 2006

trinibwoy said:
Really? Do you have a source for that info? I thought G70 was SIMD across all quads.

No, this is a critical, and mostly un-heralded change in G70.

It even shows itself in synthetic tests (! damn, they do have their uses).

Basically the thread size in G70 is nominally 1024 fragments (as opposed to 4096, 1024-per quad, in NV40 where all the quads are in lockstep).

Yes, that does mean that G70 is significantly better at dynamic branching than NV40 - but 1024 fragments (32x32) is a long way from the granularity of R520: 4x4 or R580: 12x4 or Xenos: 8x8.
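To put rough numbers on the granularity point, here is a minimal sketch that counts how many batches contain pixels that disagree on a branch, for the 32x32, 8x8 and 4x4 sizes quoted above. The 128x128 test image and the diagonal branch condition are assumptions made purely for illustration; this is not a model of how any of these GPUs actually schedules work.

```python
def divergent_fraction(mask, tile_w, tile_h):
    """Fraction of tiles whose pixels disagree on the branch condition.

    A divergent tile has to execute both sides of the branch, so a lower
    fraction means less wasted shading work.
    """
    height, width = len(mask), len(mask[0])
    tiles = 0
    divergent = 0
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            outcomes = {
                mask[y][x]
                for y in range(y0, min(y0 + tile_h, height))
                for x in range(x0, min(x0 + tile_w, width))
            }
            tiles += 1
            if len(outcomes) > 1:
                divergent += 1
    return divergent / tiles


# Hypothetical test image: the branch is taken above a diagonal edge.
SIZE = 128
mask = [[x > y for x in range(SIZE)] for y in range(SIZE)]

for label, (w, h) in {"32x32": (32, 32), "8x8": (8, 8), "4x4": (4, 4)}.items():
    print(label, divergent_fraction(mask, w, h))
```

With this particular edge, a quarter of the 32x32 batches straddle the branch, against about 6% of the 8x8 batches and about 3% of the 4x4 ones. Real scenes will differ, but the trend is the point.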

I feel somewhat foolish for saying that G70 is single-threaded in its fragment shaders

Jawed

zidane1strife · Jan 26, 2006

Jawed said:
No, this is a critical, and mostly un-heralded change in G70.

It even shows itself in synthetic tests (! damn, they do have their uses).

Basically the thread size in G70 is nominally 1024 fragments (as opposed to 4096, 1024-per quad, in NV40 where all the quads are in lockstep).

Yes, that does mean that G70 is significantly better at dynamic branching than NV40 - but 1024 fragments (32x32) is a long way from the granularity of R520: 4x4 or R580: 12x4 or Xenos: 8x8.

I feel somewhat foolish for saying that G70 is single-threaded in its fragment shaders

Jawed

Makes me curious: what advantages do they get from keeping it that large? Any advantage? Hardware cost of handling smaller threads? Design difficulty?

KimB · Jan 26, 2006

zidane1strife said:
Makes me curious, what advantages do they have keeping it that large?

1. I'm sure it saves transistors.
2. It may require a more significant architectural change than nVidia was willing to go for until their next generation architecture.

tema · Jan 26, 2006

Shifty Geezer said:
Kutaragi's commented a fair bit on BC. He's said they can use Cell for a software emu but he wants hardware assistance to get things 'perfect' and accommodate developers doing things in a less than usual manner. He's also said there's no eDRAM in RSX as you'd need loads of eDRAM if you aren't tile rendering to fit 1080p.

From this it's pretty safe to say, assuming things haven't changed, that there's a degree of hardware BC and no eDRAM or full PS2 chipset; otherwise why waste time with software emu? Even if the PS2 chipset is only 5 quid, that'd be like 500 million pounds over the life of PS3. If they can save the cash by using PS3's hardware it'd make sense to do so.

The 48 GB/s seems the limiting factor, but PS2 didn't support hardware compression. At an average 4x compression the GDDR BW is plenty enough. I don't know what the difference in latency would be though between this and PS2's eDRAM. My guess is for some GS emulation on RSX, which accounts in part for RSX's long development from just a G70, and software EE emulation on Cell.
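The compression argument above is simple arithmetic, sketched here. It uses only the two figures from the post (48 GB/s and an average 4x compression ratio); the function name is made up for illustration, and latency, which the post flags as an open question, is not modelled.

```python
def required_raw_bandwidth(effective_gb_s, compression_ratio):
    """Raw memory bandwidth needed to deliver a given effective bandwidth.

    Assumes the compression ratio holds on average, which real content
    will not always achieve.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return effective_gb_s / compression_ratio


PS2_EDRAM_GB_S = 48.0   # figure quoted in the post
AVG_COMPRESSION = 4.0   # "average 4x compression" from the post

needed = required_raw_bandwidth(PS2_EDRAM_GB_S, AVG_COMPRESSION)
print(f"raw bandwidth needed: {needed} GB/s")
```

Under those assumptions the external memory would need to sustain about 12 GB/s to match the 48 GB/s figure.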

Onimusha3 for the PC, http://onimusha3.typhoongames.com/demo/onim3_demo.zip
Nothing was left out or downgraded when I did a side-by-side comparison. Trilinear-filtered textures, locked 60 FPS: a perfect-plus port.
Video captured at 15 FPS with Fraps.
http://www.flurl.com/uploaded/Onimusha3_PC_51193.html

My PC card is a 128MB AGP8x 6600GT, 14GB/s.

London Geezer · Jan 26, 2006

tema said:
Onimusha3 for the PC, http://onimusha3.typhoongames.com/demo/onim3_demo.zip
Nothing was left out or downgraded when I did a side-by-side comparison. Trilinear-filtered textures, locked 60 FPS: a perfect-plus port.
Video captured at 15 FPS with Fraps.
*pic*
My PC card is a 128MB AGP8x 6600GT, 14GB/s.

You do understand we're talking about emulation (backward compatibility), right? Not conversions.

With emulation, you have to make sure that the PS2 game (the one you buy in the shops) works on something that's not a PS2. Therefore your hardware/emulator has to basically translate the coding and make it work properly, without changing the software - because the software is the same you'd put on your PS2 and doesn't change.

You can't put your PS2 game in a PC and hope it works without some serious emulation.

One way to help that is doing it "the X360 way" and give customers updates for each selected Xbox game, so that it works on the new hardware, but that requires work and obviously restricts your library to only the games with the patch.

Conversions are just... conversions of a game from one platform to another. The software can be totally different, and it should be different in order to take advantage of each platform (in the case of Onimusha 3 PC and any other conversion ever created).

Totally different scenarios.

tema · Jan 26, 2006

I heard Sony have a high level emulation wrapper.

Onimusha 3 doesn't look like a conversion port; can somebody look at the files?

Shifty Geezer · Jan 26, 2006

You're way off the mark, Tema. If you want to talk Onimusha, take it to the game forum. The only way anyone in this thread will be interested is if you can provide pics of PS2 Onimusha running on PS3 hardware.

London Geezer · Jan 26, 2006

tema said:
I heard Sony have a high level emulation wrapper. Onimusha 3 doesn't look like a conversion port; can somebody look at the files?

Onimusha 3 for PC is Onimusha 3 for PC. It's a conversion using PS2 art assets (looks like it). The coding is different because PS2 code obviously wouldn't run properly on PC architectures.

Jawed · Jan 26, 2006

zidane1strife said:
Makes me curious: what advantages do they get from keeping it that large? Any advantage? Hardware cost of handling smaller threads? Design difficulty?

It's basically a tweak to an architecture that is fundamentally single-threaded. G70's architecture is, in effect, 6-threaded.

As far as I can see, the only way to hide texturing latency whilst also having small threads (which are normally large in order to hide texturing latency) and having useful per-pixel dynamic branching is to go multi-threaded - emphasis on "multi". 10s of threads, at least, seem to be needed.
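A back-of-the-envelope sketch of why small threads push you towards tens of threads: while one thread waits on a texture fetch, the others have to supply enough ALU work to cover the wait. The model and both cycle counts below are hypothetical, chosen only to show the trend, and are not measurements of any GPU.

```python
import math


def threads_to_hide_latency(fetch_latency_cycles, alu_cycles_per_thread):
    """Threads needed so ALU work from the others covers one thread's fetch.

    Simple model: (n - 1) * alu_cycles_per_thread >= fetch_latency_cycles.
    """
    if fetch_latency_cycles < 0 or alu_cycles_per_thread <= 0:
        raise ValueError("latency must be >= 0 and ALU cycles must be > 0")
    return math.ceil(fetch_latency_cycles / alu_cycles_per_thread) + 1


LATENCY = 200  # hypothetical texture-fetch latency in cycles

# A big thread keeps the ALUs busy for a long time between fetches...
few_big = threads_to_hide_latency(LATENCY, alu_cycles_per_thread=256)
# ...while a small thread runs out of work quickly.
many_small = threads_to_hide_latency(LATENCY, alu_cycles_per_thread=8)

print(few_big, many_small)
```

With these made-up numbers, two large threads are enough, whereas the small threads need 26 in flight.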

---

There is an interesting aside here, to do with how a thread treats triangles. If you're rendering lots of small triangles (say around 4x4 pixels in size), then you really want to put lots of them together into one batch, if possible. NVidia's architectures do this - up to 20 triangles can share a thread.

Whereas I'm not sure that ATI's do... Obviously ATI's latest architectures are not hurt so badly by this, because in a thread that's only 48 or 64 fragments, a 16-fragment triangle all on its own is nowhere near as bad as a 16 fragment triangle occupying a 256-fragment thread. I am speculating, though. It certainly wouldn't make sense for ATI's threading to work like this (particularly on the older GPUs like R420), but the fact of screen-space tiling implies the possibility to me...
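The occupancy argument can be checked with a few lines of arithmetic. This sketch uses the thread sizes mentioned in the thread (64, 256 and G70's nominal 1024) and the "up to 20 triangles per thread" figure; the `occupancy` helper is hypothetical and ignores everything except fragment counts.

```python
def occupancy(triangle_fragments, thread_size, triangles_per_thread=1):
    """Fraction of a thread's fragment slots that do useful work."""
    if thread_size <= 0:
        raise ValueError("thread_size must be positive")
    used = min(triangle_fragments * triangles_per_thread, thread_size)
    return used / thread_size


SMALL_TRIANGLE = 16  # a 4x4-pixel triangle, as in the post

print(occupancy(SMALL_TRIANGLE, 64))        # alone in a 64-fragment thread
print(occupancy(SMALL_TRIANGLE, 256))       # alone in a 256-fragment thread
print(occupancy(SMALL_TRIANGLE, 1024, 20))  # 20 triangles sharing 1024 slots
```

A lone 16-fragment triangle fills 25% of a 64-fragment thread but only about 6% of a 256-fragment one, while packing 20 such triangles into a 1024-fragment thread brings occupancy back to roughly 31%.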

Jawed

j^aws · Jan 26, 2006

Jawed said:
It's basically a tweak to an architecture that is fundamentally single-threaded. G70's architecture is, in effect, 6-threaded.
...

In addition to 6 threads across 6 SIMD fragment quad units, there are further 8 MIMD vertex units, running another 8 threads capable of per vertex dynamic branching. The G70 architecture would then have 14 threads of execution...

So in simplistic terms, the G70 would have 14 processors, with the 8 vertex and 6 fragment quads tuned differently for dynamic branching performance. This seems to be mirrored with CELL, with 1 MIMD PPE and 7 SIMD SPUs, tuned differently for dynamic branching performance.

Incidentally, Xenos makes a tradeoff with worse vertex dynamic branching performance than MIMD VS units, with its 3 SIMD unified processors, i.e. vertex batches are 64 vertices instead of 1... but XeCPU has 3 MIMD PPEs which would be better for vertex dynamic branching...

Edit: typos...

KimB · Jan 26, 2006

Jaws said:
In addition to 6 threads across 6 SIMD fragment quad units, there are further 8 MIMD vertex units, running another 8 threads capable of per vertex dynamic branching. The G70 architecture would then have 14 threads of execution...

That doesn't really help with pixel processing, though, because the pixel and vertex processors are completely independent.

scificube · Jan 26, 2006

So which is more valuable...DB in pixel shaders or vertex shaders? Or should I ask what can you do with DB in either case? POM....what other techniques could be done with it? (That are usable in real time of course.)

j^aws · Jan 26, 2006

Chalnoth said:
That doesn't really help with pixel processing, though, because the pixel and vertex processors are completely independent.

Of course, never said it would. I was just simply classifying different, 'independent' programmable processors and threads...

PS3:

1 MIMD PPE
7 SIMD SPU
8 MIMD VS
6 SIMD Quad

X360:

3 MIMD PPE
3 SIMD US

Kinda like choosing the right tool for the right job...

Jawed · Jan 26, 2006

Well the 6 series has been out for nearly two years now - surely there must be a game that's using dynamic branching in the vertex shaders?

For instance, one could write a shader that looped through a certain number of vertex lights, determine which ones might influence a particular vertex, and then pass down the index of each relevant light to the pixel shader. The pixel shader could then use this 'light index' to determine which light parameters to apply. The pixel shader would then loop over the active lights, then use dynamic branching to exit the shader early once all lights are processed.

Most light types only apply to the front side of an object, the side facing the light. Therefore, you can use both vertex and pixel branching to skip processing for lights that the shader detects as facing away from the light. This can save significant processing time, and speed up the shader. Similar speedups can be used to skip processing of character bone animation as well as many similar algorithms.

From: http://www.microsoft.com/whdc/winhec/partners/shadermodel30_nvidia.mspx
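A CPU-side sketch of the idea in that quote, written in Python rather than shader code: one step picks out the lights a surface actually faces, and the second step loops only over those. The function names, the dictionary layout for lights and the simple Lambert term are all assumptions for illustration, not anything taken from the linked paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def active_light_indices(normal, position, lights):
    """'Vertex shader' step: keep indices of lights the surface faces."""
    active = []
    for index, light in enumerate(lights):
        to_light = tuple(l - p for l, p in zip(light["position"], position))
        if dot(normal, to_light) > 0.0:  # branch: skip back-facing lights
            active.append(index)
    return active


def shade(normal, position, lights, active):
    """'Pixel shader' step: loop over the active lights only."""
    total = 0.0
    for index in active:
        light = lights[index]
        to_light = tuple(l - p for l, p in zip(light["position"], position))
        length = dot(to_light, to_light) ** 0.5
        total += light["intensity"] * max(dot(normal, to_light) / length, 0.0)
    return total


# Hypothetical scene: one light in front of the surface, one behind it.
lights = [
    {"position": (0.0, 0.0, 5.0), "intensity": 1.0},
    {"position": (0.0, 0.0, -5.0), "intensity": 2.0},
]
normal, position = (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)

active = active_light_indices(normal, position, lights)
print(active, shade(normal, position, lights, active))
```

Here the light behind the surface is dropped in the first step, so the second loop runs once instead of twice. On real hardware the saving depends on the batch granularity discussed earlier in the thread.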

Alternatively, maybe creation/deletion of vertices and geometry shading will be the catalyst for dynamic branching in "vertex" shaders.

Jawed

Bobbler · Jan 26, 2006

On the BC problem:
Since the bandwidth (48 GB/s) problem seems to be the only real issue that PS2 games would likely have in being run on a PS3... why don't they just use some form of compression? It's possible the GPU will have enough extra power (in idle shaders?) to compress all the accesses to what would be PS2's framebuffer. PPE and two SPEs would likely take care of EE, and GS could be taken care of with RSX and some use of compression. Or hell, the Cell might be able to do it all -- PPE/SPE1/SPE2 for EE, and an SPE3 for GS (?) and then an SPE4 taking care of compression to pass SPE3's data to the XDR pool.

Just a thought... not sure how practical it would be though, as I don't know the fine details of what things PS2 could do that would likely be tricky to emulate (outside of its 48 GB/s pipe).

Shifty Geezer · Jan 26, 2006

Jaws said:
This seems to be mirrored with CELL, with 1 MIMD PPE and 7 SIMD SPUs, tuned differently for dynamic branching performance.

Why do you class SPUs as SIMD? Each is capable of processing data independently of each other and with multiple instructions per data element.

rounin · Jan 26, 2006

A search on Google brought me to this page, with some guy asking a PCSX2 dev about eDRAM's role and getting a response about how it's not just bandwidth but also the PS2's CPU complexity.

j^aws · Jan 26, 2006

Shifty Geezer said:
Why do you class SPUs as SIMD? Each is capable of processing data independently of each other and with multiple instructions per data element.

A group of SPUs can act as an MIMD 'cluster', but the classification is for 'each', 'independent' processor. An SPU would still issue a Single Instruction to Multiple (4 component) Data elements... even though it can dual issue, it's still a 'Single' instruction stream being applied to those data elements...
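The distinction being drawn can be illustrated with a toy model: one instruction stream, where every instruction is applied to all four components of a register at once. This is only a sketch of the SIMD classification; the register names and the two-instruction program are invented, and it says nothing about real SPU instruction encoding or dual issue.

```python
import operator

LANES = 4  # four components per register, as described above

OPS = {"add": operator.add, "mul": operator.mul}


def run(program, registers):
    """Execute a single instruction stream over 4-wide registers.

    Each instruction is (op, src_a, src_b, dst) and touches all lanes.
    """
    regs = dict(registers)
    for op, src_a, src_b, dst in program:
        a, b = regs[src_a], regs[src_b]
        assert len(a) == len(b) == LANES
        regs[dst] = tuple(OPS[op](x, y) for x, y in zip(a, b))
    return regs


# Hypothetical program: r2 = r0 + r1, then r3 = r2 * r2.
program = [("add", "r0", "r1", "r2"), ("mul", "r2", "r2", "r3")]
result = run(program, {"r0": (1, 2, 3, 4), "r1": (10, 20, 30, 40)})
print(result["r2"], result["r3"])
```

Several such units, each running its own `program`, would together behave like the MIMD 'cluster' described above, while each one on its own remains SIMD.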
