More creativity with PS3 framebuffer needed?

Well there's a consensus that NVidia uses a single pipeline and that there's an upper limit on the number of fragments in flight. I merely came to agree with that consensus late :smile: - don't shoot the messenger.

You should scan the two threads I've linked above - you'll see there's quite a bit of evidence.

The patent you linked to provides two things:
  1. MIMD quad fragment pipelines (as opposed to NV4x style SIMD across all fragment pipelines)
  2. staggered-start of fragment processing (to stagger texture memory accesses, decreasing peak load)
Anyway, regardless of all that, show us some performance comparisons!!!

Jawed
 
Jawed said:
  1. MIMD quad fragment pipelines (as opposed to NV4x style SIMD across all fragment pipelines)
IIRC you can find that even in Dave's first article about G70.
It's not what I'd call MIMD, since every quad is processing MANY pixels in lockstep.
Jawed said:
  Anyway, regardless of all that, show us some performance comparisons!!!
I can't kill you all..:)
 
Shifty Geezer said:
I thought the Sony ninjas take care of that :oops:

And who do you think they are..?

Ninja Theory may not be so theoretical.. :devilish: ;)

Anyway, this thread took a turn for the interesting, even if some things remain a mystery. Insightful conversation, though - it'll take me a while to figure it all out!
 
scificube said:
CBE_architecture10.pdf pg.34 of 319 said:
3.2.1 Local Storage Access
The CBEA allows the local storage of an SPU to have an alias in the real address space in the main storage
domain. This allows other processors in the main storage domain to access local storage through appropriately
mapped effective address space. It also allows external devices, such as a graphics device, to directly
access the local storage.

Thanks for this info. I'm sure I read it before in one of the CELL docs but wasn't sure which.

The bolded text strongly suggests RSX is capable of directly reading from the SPUs' Local Stores. This info seems to be overlooked, yet I'd say it's as significant as Xenos being capable of directly reading from the XeCPU L2 cache... I'd also expect RSX to be able to directly read from the PPE's L2 cache too... and, IIRC, without the tiling/L2 issue Xenos has...
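
As an aside on what that aliasing looks like from software: the Linux-on-Cell userland (libspe2) exposes exactly this kind of mapping to other processors. A minimal sketch, assuming a standard libspe2 install - whether the PS3's GPU path uses anything like it is exactly the open question here:

```cpp
#include <libspe2.h>
#include <cstdio>

int main() {
    // Create an SPE context; its 256 KB Local Store can then be aliased
    // into this process's effective address space, per CBEA section 3.2.1.
    spe_context_ptr_t spe = spe_context_create(0, nullptr);
    if (!spe) return 1;

    // Effective-address alias of the Local Store: any bus master that can
    // see this mapping (in principle, a graphics device) can read it.
    void* ls = spe_ls_area_get(spe);
    std::printf("Local Store aliased at %p\n", ls);

    spe_context_destroy(spe);
    return 0;
}
```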
 
I expect Xenos and RSX to both have similar mechanisms for accessing cache / LS from the CPU.

Still, the main use for this will be streaming vertices to the GPU. The CPU assembles a list of vertices and puts them in the cache. The GPU reads the next one from the list when it's ready (it can get backed up at any time by pixel processing). This is why the GPU needs to be the master in determining what to read, and why the CPU can't simply send things over whenever it wants.
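
To make the master/slave relationship concrete, here's a minimal sketch of that kind of ring (all names and sizes are hypothetical, real code would need proper memory barriers, and this is not the actual Xenos/RSX interface):

```cpp
#include <cstddef>

// Hypothetical vertex record and ring size -- illustrative only.
struct Vertex { float pos[4]; float uv[2]; };

constexpr std::size_t kRingSize = 1024;   // entries, power of two
Vertex ring[kRingSize];                   // imagine this in locked cache / LS
std::size_t writeIdx = 0;                 // advanced only by the CPU
std::size_t readIdx  = 0;                 // advanced only by the GPU

// CPU side: can only append, and must back off when the ring is full.
bool cpuPush(const Vertex& v) {
    if (writeIdx - readIdx == kRingSize) return false;   // full
    ring[writeIdx % kRingSize] = v;
    ++writeIdx;
    return true;
}

// GPU side (conceptually): pulls the next vertex only when pixel work allows.
bool gpuPull(Vertex& out) {
    if (readIdx == writeIdx) return false;               // empty
    out = ring[readIdx % kRingSize];
    ++readIdx;
    return true;
}
```

The point above falls out of the structure: only gpuPull advances readIdx, so the GPU dictates when data moves, and the most the CPU can do is stall on "full".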

Not that you're suggesting this here, but FYI this does not imply that either Xenos or RSX will do something like randomly texture from these locations. That is much more complicated than simply proceeding down a list. I suppose it may be possible, but I'd expect performance issues since it isn't regular streaming. IMO we'd also have heard something about it by now.
 
nAo said:
The number of pixels in flight is not hardcoded at all; basically there is a small processor that assembles pixel batches, and it does not stop until some on-chip resource is no longer available. When a resource runs out it puts a marker at the end of a segment, that's all.
I think Jawed considered the register file as part of the pipeline, and that's where a lot of this confusion is coming from. It's just a different (and equally valid, IMO) way of looking at it. But you're right in that the number of pixels in there is not hardcoded. An upper limit does exist, though, obviously.

Anyway, I think the important thing for everyone reading this to note is that within each shader quad, with only one thread in flight the number of pixels in flight equals the pixels per batch. If you reduce the batch size to 50 quads, then you only have 50 cycles to absorb latency.
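
As a back-of-the-envelope illustration (one quad issued per clock and a ~200-cycle fetch are assumed round numbers, not measured figures):

```cpp
// With a single batch in flight and one quad issued per clock, the cycles
// available to cover a fetch equal the batch size in quads.
int cyclesToAbsorb(int batchQuads) { return batchQuads; }

// Toy numbers: a 50-quad batch against a ~200-cycle texture fetch leaves
// about 200 - cyclesToAbsorb(50) = 150 cycles of exposed stall.
```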

This can help dynamic branching by increasing the likelihood that all pixels in a batch take the same branch, but it hurts it by increasing the time between fetching data and being able to use it. If it's a purely mathematical shader, it's probably worth it, but otherwise it probably isn't. POM (parallax occlusion mapping), for example, has a texture lookup inside the loop; I can't imagine how you'd net a gain there.
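
To see why smaller batches help coherence, here's a toy model; the independence assumption is mine (real pixels are spatially correlated, which only makes small batches look better):

```cpp
#include <cmath>

// Probability that all n pixels in a batch take the same branch, assuming
// each independently takes the 'taken' path with probability p.
double coherentBatchProbability(int n, double p) {
    return std::pow(p, n) + std::pow(1.0 - p, n);
}
// e.g. with p = 0.9: n = 200 gives ~7e-10, n = 50 gives ~0.005 --
// smaller batches are far more likely to stay coherent.
```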

There's also a big difference between ATI and NVidia in their ability to hide texture latency with non-dependent math ops. For NVidia it's: time = A * (# fetches) + B * (# math ops)
For ATI it's: time = max(A * (# fetches), B * (# math ops))

This behaviour suggests that for NVidia the texture fetch result must come back before the next instruction issues, irrespective of whether that instruction needs the data. I think this could be the reason ATI doesn't get blown away by NVidia (esp. with AF) despite its deficiency in peak texturing ability.
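
Transcribing those two cost models directly (A and B are per-fetch and per-math-op costs; the constants are placeholders, only the add-vs-max structure is the observed behaviour):

```cpp
#include <algorithm>

// NVidia: fetch and math costs accumulate serially.
double timeNvidia(int fetches, int mathOps, double A, double B) {
    return A * fetches + B * mathOps;
}

// ATI: non-dependent math overlaps fetches, so the longer stream dominates.
double timeAti(int fetches, int mathOps, double A, double B) {
    return std::max(A * fetches, B * mathOps);
}
```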
 
I think it's safe to assume that XPS (Xenos reading out of Xenon L2) is an essentially serial process. The data flow here is from RAM into Xenon L1, program execution on Xenon, then output data written directly into locked L2 and XPS'd into Xenos's XPS buffer. It's a streaming paradigm.

It's worth noting that Xenon and Xenos communicate through a ring-buffer, each having its own pointer. There are gates defined around the ring for synchronisation purposes, but apart from "full" and "empty" Xenon and Xenos are free-running.
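
In ring terms those two gates reduce to simple pointer comparisons (same hypothetical two-pointer scheme as the sketch above; the size is a placeholder):

```cpp
#include <cstddef>

constexpr std::size_t kEntries = 4096;   // assumed power-of-two ring size

// Producer (Xenon) owns head, consumer (Xenos) owns tail; both increase
// monotonically, so unsigned subtraction survives wraparound.
bool ringFull (std::size_t head, std::size_t tail) { return head - tail == kEntries; }
bool ringEmpty(std::size_t head, std::size_t tail) { return head == tail; }
// Between these two conditions both chips run freely, each advancing
// only its own pointer.
```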

Jaws, care to explain what the tiling/L2 issue is?

Jawed
 
nAo said:
You can't increase batch size because you'd need to cut your shader's resource usage.
If your shader needs to use 4 registers, most of the time there's nothing you can do about it.
You can easily and artificially use or declare more registers than what you need, but you can't use less.. if you want your shader to still work.
I've already done my part, today I'm resting.. :)

You can do this, but you lose the ability to hide memory latency from texture fetches.
People here tend to simplify things dramatically: any shader can be limited by textures, interpolators or ALU ops, and having fewer threads in flight will in practice just reduce performance on most real-world shaders. The fact that your branches may then be "cheaper" isn't really much of a consolation.
 
ERP said:
You can do this, but you lose the ability to hide memory latency from texture fetches.
I know, this is what I wrote in this very same thread:
Assuming the fragment shader pipeline is fully pipelined, the pipeline length is virtually one clock cycle; hence as long as you can hide your memory latency it does not matter whether your segment is filled with more or fewer fragments.
People here tend to simplify things dramatically: any shader can be limited by textures, interpolators or ALU ops, and having fewer threads in flight will in practice just reduce performance on most real-world shaders. The fact that your branches may then be "cheaper" isn't really much of a consolation.
As I already explained, the number of pixels in flight you can have is not hardcoded; it doesn't make any sense to say this architecture supports N registers and that if you use more than that performance suddenly decreases.
If you use more registers you don't lose the ability to hide memory latency, you decrease your ability to hide memory latency; that's a huge difference.
In some cases you should be able to increase dynamic branching efficiency at no cost, or at a very small cost.
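
Put as arithmetic (the register-file size below is a made-up figure; only the proportionality matters):

```cpp
// More registers per pixel -> fewer pixels resident -> fewer cycles of
// memory latency covered. Nothing falls off a cliff at some fixed N.
int pixelsInFlight(int registerFileEntries, int regsPerPixel) {
    return registerFileEntries / regsPerPixel;
}
// e.g. 1024 entries: 4 regs/pixel -> 256 pixels; 8 regs/pixel -> 128.
```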

Marco
 
For what it's worth, automatic predicated tiling will cause XPS to be re-run per tile. That's why you'd program your own tiling algorithm, determining the tiles before calling heavy-duty XPS routines.

Jawed
 