More creativity with PS3 framebuffer needed?

Jawed said:
Additionally, the whole point of the Cell BE architecture is ultra-high bandwidth streaming - low latency is not the design target. Hence the DMA-list architecture etc. etc.

That'd be true whether it was 300, 400, 500, 600...1000 cycles to memory, or whatever. I think the "approaching 1000 cycles" bit is possibly illustrating a trend rather than necessarily referring to what one implementation has to deal with now (and obviously quoting a figure at the high end is more conducive to the presented argument). You might be right, but I don't think it's entirely clear.

SPM - RSX doesn't just have to go through the ring bus. It has to go through FlexIO and the XDR buses as well.
 
SPM said:
So it is just speculation.

To achieve coherent access to XDR, surely direct access to the memory similar to direct bus read/writes are required, as opposed to incoherent access which presumably is done with DMA.
Coherent accesses peek through caches: they have to ensure the caches of multiple units stay valid at all times. I.e. if both Cell and RSX hold a memory location in their private caches, and Cell writes to that location, it notifies RSX to invalidate that cacheline. Non-coherent access doesn't do this.
DMA just means the transfer can happen without taking processing resources away from the CPU itself (no "mov src, dest" loop).
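If it helps, here's a toy Python sketch of the difference (the classes and names are mine and purely illustrative - real snooping hardware obviously isn't a dictionary):

```python
# Toy model of coherent vs. non-coherent sharing between two caching agents
# (e.g. Cell and RSX). Structure and names are illustrative only.

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}          # address -> value currently held

class CoherentBus:
    """Writes snoop every other agent and invalidate their stale copies."""
    def __init__(self, agents):
        self.agents = agents

    def write(self, writer, addr, value):
        writer.lines[addr] = value
        for other in self.agents:
            if other is not writer and addr in other.lines:
                del other.lines[addr]   # invalidate the stale cacheline

class NonCoherentBus:
    """No snooping: software (or DMA setup) must flush/invalidate explicitly."""
    def __init__(self, agents):
        self.agents = agents

    def write(self, writer, addr, value):
        writer.lines[addr] = value      # other agents may now read stale data

cell, rsx = Cache("Cell"), Cache("RSX")

rsx.lines[0x1000] = "old texel"
CoherentBus([cell, rsx]).write(cell, 0x1000, "new texel")
assert 0x1000 not in rsx.lines          # RSX's copy was invalidated for it

rsx.lines[0x1000] = "old texel"
NonCoherentBus([cell, rsx]).write(cell, 0x1000, "newer texel")
assert rsx.lines[0x1000] == "old texel" # stale: software has to clean up
```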

SPM said:
Yes, RSX access to XDR has to go through the ring bus, but so has PPE and SPE access to XDR. The latency for these is pretty low otherwise you would see horrendous performance issues. If the Flex i/o interface connecting RSX to the ring bus is configured for coherent access, then RSX is just another node on the ring bus just like the PPE and the eight SPEs, and the latency for RSX accessing XDR should not be very different.
And there you have RSX needing Cell's blessing to access XDR RAM. The ring bus has priorities, and I don't believe external devices will be high up on those. Further, Cell is optimised for accessing memory in big blocks, so picking a pixel from here and a pixel from there isn't going to get you peak performance on FlexIO, the ring bus or XDR RAM.
Quite different from GPUs, which have a bus controller optimised for smaller chunks (Google "crossbar memory controller" if you like).

SPM said:
There may be a slightly higher latency going for RSX accessing XDR, but surely that has to be very low. Remember the ring bus has two paths in either direction and it doesn't relay data from one node to the next - it jumps directly between the two nodes. The latency is therefore going to be low - similar to a bus. If latency is low enough to feed the PPE and SPEs without a major performance hit, then surely it is easily enough for textures maps.
You are again assuming it isn't doing anything else at the same time. Sure, it will be a lot better than a GPU accessing system RAM on current PCs, but it will be a good deal worse than its "native" GDDR RAM.
 
Hmm, yeah, that's why he said "with main memory latency approaching 1000 cycles". Don't be ridiculous, Titanio.

Jawed
 
Jawed said:
Hmm, yeah, that's why he said "with main memory latency approaching 1000 cycles". Don't be ridiculous, Titanio.

What I'm saying is that the problem would exist whether it's 300, 400, etc. It's all too much.

I'm saying: do we know if this figure is illustrative, or indicative of a 3.2GHz Cell?

In other presentations we've seen illustrative figures of 100+ cycles used to make the same point.

Like I said, I don't think it's clear if that figure is purely illustrative, or a physical measurement of latency within a particular system.
 
Dave Baumann said:
The batch size of G7x is tuned to the latencies of the memories it's using, being GDDR3 - changing that would require a change to the command processor (which they haven't done) and/or an increase in batch size (which would impact other areas).
The batch size is not tuned; in fact it's highly dynamic. What is 'tuned' is the amount of available on-chip storage (and other resources):
INCREASED SCALABILITY IN THE FRAGMENT SHADING PIPELINE

In this embodiment, the number of attributes per geometric primitive can be increased at the expense of reducing the maximum number of geometric primitives that can be stored and associated with fragments in a segment, thereby possibly, but not necessarily, limiting the number of fragment groups in a segment, or vice-versa.
Because each fragment is executed by a separate instance of the fragment shader program, each fragment requires its own set of data registers. In an embodiment, each fragment shader pipeline can dynamically allocate its pool of data registers among the fragment groups of a segment. For example, if the fragment shader pipeline includes 880 data registers, then a segment using four data registers per fragment group to execute its fragment shader program can include up to 220 fragment groups. Similarly, a segment using five data registers per fragment group can include up to 176 fragment groups. As the fragment shader distributor 400 processes fragment groups, the segmenter 410 changes the register counter 425 to reflect the number of data registers needed by each fragment group to execute the fragment shader program.

So on a G7x GPU it's not that difficult to control pixel batch (pardon, segment!) size.
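For what it's worth, the arithmetic in that patent excerpt is trivial to reproduce (a minimal sketch; 880 is the patent's example pool size, not a confirmed RSX figure):

```python
# Fragment groups per segment given a fixed per-pipeline register pool,
# using the patent's example figure of 880 data registers.

def fragment_groups_per_segment(register_pool=880, regs_per_group=4):
    """How many fragment groups fit in one segment."""
    return register_pool // regs_per_group

print(fragment_groups_per_segment(regs_per_group=4))   # 220
print(fragment_groups_per_segment(regs_per_group=5))   # 176
```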
 
You can be sure that if RSX latency against XDR was good, rather than poor, we'd be hearing about it.

Ultimately, 1000 cycles sounds like a bigger problem to have overcome than a mere 500 cycles, so sure, he could have over-egged the number for effect. But for RSX latency against XDR to be lower than its latency against GDDR3, it's going to have to be in the region of 500 cycles, not 1000. So RSX loses, no matter what.

Since XDR is intrinsic to Cell, just how many varieties of system do you think he might have had to choose from when coming up with "approaching 1000 cycles"? Do you think he had a 5GHz Cell in mind when he used that number? Slower Cells, running at 2.4GHz, are going to show lower "latency cycle counts" so 1000 is definitely at the upper end of the scale.

If anything, "approaching 1000" simply seems to me like a conclusion based on the performance of a 2.4GHz Cell extrapolated to 3.2GHz - magazine articles are written way in advance (the Kahle PDF is from the July/September 05 issue of IBM's Journal of Research and Development) so they probably only had access to a slow Cell at that point. Ah, there it is, the paper was "Received February 19, 2005".
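Back-of-envelope, assuming the wall-clock latency is fixed and only the clock changes (the 300ns figure below is invented purely for illustration, not a measured XDR/Cell latency):

```python
# If the wall-clock latency to main memory is fixed, the latency measured
# in processor cycles scales with clock speed. 300 ns is a made-up figure
# used only to show the scaling, not a real XDR/Cell measurement.

def latency_cycles(latency_ns, clock_ghz):
    return latency_ns * clock_ghz

print(latency_cycles(300, 2.4))   # 720.0 cycles on a 2.4 GHz Cell
print(latency_cycles(300, 3.2))   # 960.0 cycles -> "approaching 1000"
```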

Jawed
 
nAo said:
In theory, yes.

I am not sure it would be easy to find a situation where it would make sense, but hey it gives you another toy to play with :D.

In theory, if you knew your shader was not using branching at all you could even increase batch size... do you see cases in which you would want to do that?

One more thing: GO FOR IT, MANZETTI!!!!!!!!
 
Panajev2001a said:
In theory, if you knew your shader was not using branching at all you could even increase batch size... do you see cases in which you would want to do that?
You can't increase batch size, because you'd need to cut your shader's resource usage.
If your shader needs to use 4 registers most of the time, there's nothing you can do about it.
You can easily and artificially use or declare more registers than you need, but you can't use fewer.. not if you want your shader to still work.
One more thing: GO FOR IT, MANZETTI!!!!!!!!
I've already gone for it; today I'm resting.. :)
 
nAo said:
So on a G7x GPU is not that difficult to control pixel batches (pardon, segments!) size.
But 176 fragments take as long to shade as 220 ;) You've just junked 25% of your performance.

This need to allocate registers per fragment from a fixed pool of memory is obvious. If it weren't like this, then a hard-coded limit of a maximum of four registers per fragment would exist, which would be in breach of SM3. Either that, or Bob wouldn't have scoffed at the idea of having megabytes of register file:

http://www.beyond3d.com/forum/showpost.php?p=723479&postcount=70

and I wouldn't have rejoined that Xenos already does:

http://www.beyond3d.com/forum/showpost.php?p=723497&postcount=73

If you exceed the nominal limit (4 FP32s for G7x), then performance falls off a cliff. It also explains why G7x performance is still heavily dependent on the use of FP16 in long and complex shaders - FP16 provides 2 registers for every FP32 register, so the more of the former a shader uses, the lower the chance of exceeding the 4-register limit (which becomes, in effect, an 8-register limit if all registers are FP16).
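To put rough numbers on the FP16 point (a toy budget calculation - the 4-FP32 limit is just the nominal figure I'm arguing for above, nothing official):

```python
# Register budget per fragment, counting an FP32 register as two FP16 slots.
# Purely illustrative of the "4 FP32 = 8 FP16" trade-off discussed above.

FP32_SLOTS_PER_FRAGMENT = 4          # nominal G7x design limit, as argued above

def fits_in_budget(fp32_regs, fp16_regs):
    # each FP32 register occupies two half-precision slots
    used_slots = 2 * fp32_regs + fp16_regs
    return used_slots <= 2 * FP32_SLOTS_PER_FRAGMENT

print(fits_in_budget(fp32_regs=4, fp16_regs=0))   # True  (4 FP32)
print(fits_in_budget(fp32_regs=0, fp16_regs=8))   # True  (8 FP16)
print(fits_in_budget(fp32_regs=5, fp16_regs=0))   # False (over budget)
```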

Jawed
 
nAo said:
You can easily and artificially use or declare more registers then what you need, but you can't use less..
The question is, how do you persuade the GPU to allocate these registers? It should just optimise them away.

Jawed
 
Jawed said:
But 176 fragments take as long to shade as 220 ;) You've just junked 25% of your performance.
I don't think so: if 176 cycles are enough to cover your memory latency, the GPU doesn't run any slower at all.
If you exceed the nominal limit (4 FP32s for G7x), then performance falls off a cliff.
Have you actually measured this effect?
Are you sure you're not confusing register file bandwidth with the maximum number of registers a shader can use?
It doesn't make sense to say "if you use more than N registers your performance will drop", because memory latency can vary - there are many players juggling the same memory!
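Here's a speculative sketch of what I mean by "covering" latency (the one-fragment-per-clock issue rate and the latency numbers are assumptions of mine, not anything from the patents):

```python
# Toy model of latency hiding by fragments in flight: if a fragment's texture
# result is needed again only after every other fragment in the segment has
# had a turn, a segment of S fragments hides roughly S cycles of latency.
# One fragment issued per clock is assumed purely for illustration.

def effective_throughput(fragments_in_flight, mem_latency_cycles):
    """Fragments retired per clock; 1.0 means latency is fully hidden."""
    if fragments_in_flight >= mem_latency_cycles:
        return 1.0
    return fragments_in_flight / mem_latency_cycles

print(effective_throughput(220, 170))   # 1.0  - latency hidden
print(effective_throughput(176, 170))   # 1.0  - still hidden, no loss
print(effective_throughput(176, 400))   # 0.44 - latency now exposed
```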
 
Jawed said:
The question is, how do you persuade the GPU to allocate these registers? It should just optimise them away.

Jawed
If I were a hardware designer I wouldn't put a register optimiser in my GPU; that's software's job.
 
nAo said:
I don't think so: if 176 cycles are enough to cover your memory latency, the GPU doesn't run any slower at all.

Have you actually measured this effect?
Are you sure you're not confusing register file bandwidth with the maximum number of registers a shader can use?
It's impossible to say "if you use more than N registers your performance will drop", because memory latency can vary as there are many players juggling the same memory!
The problem is that the actual fragment shader pipeline is fixed in length. So a decrease in active fragments just means the command processor inserts "no op" bubbles.

Jawed
 
Jawed said:
The problem is that the actual fragment shader pipeline is fixed in length. So a decrease in active fragments just means the command processor inserts "no op" bubbles.
The fragment shader pipeline length is fixed, but it's obviously (think about branching!) only as long as it needs to be to execute one pass; after that, fragments are constantly recirculated through the pipeline (as the patents show..).
Assuming the fragment shader pipeline is fully pipelined, its effective length is virtually one clock cycle, hence as long as you can hide your memory latency it does not matter whether your segment is filled with more or fewer fragments.
Though I'd expect that every additional segment that has to be processed carries some constant cost you can't hide - but I'm really speculating here.
 
nAo said:
If I were a hardware designer I wouldn't put a register optimiser in my GPU; that's software's job.
Well I don't have a Cell dev kit to hand, so I don't know what kind of software might be interceding between your "to the metal" code and what the hardware actually executes.

On PC there's always a driver, so that's likely always going to optimise out such dummy registers.

By the way, I hypothesised, a while back,

http://www.beyond3d.com/forum/showpost.php?p=715055&postcount=17

about this mechanism of using dummy registers in order to improve dynamic branching efficiency - but I'd like to see evidence of this working before blindly accepting that it works like that. It's a pretty interesting work-around :D I suppose NVidia has the option to perform some interesting in-driver (on PC) shader replacements for games that do use DB to tweak performance.

Supposedly there are fewer stages in the G71 pipeline, due to the 90nm process - perhaps this actually means that G71 has nominally fewer fragments in flight than G70. And RSX would be the same as G71. That might explain 220 as opposed to 256.

(Though, separately, Bob noted that there are actually about 800-odd fragments in flight for NV4x/G70 per quad - so, erm G71/RSX may have a yet-smaller count due to 90nm pipeline-shortening.)

---

An alternative point of view is that the fixed length of the G7x/RSX pipeline is enough to hide double the memory latency of typical GDDR3. So even with half the fragments in flight, texturing latency would still be completely hidden by the pipeline. Though that still doesn't solve the problem that arithmetic operations will now proceed at half-rate.

It would, on the other hand, mean that longer-latency texturing from XDR memory wouldn't have a negative impact on RSX. Without knowing how tightly bound to latency G7x/RSX's pipeline is, it's hard to know whether XDR texturing would tip performance into doom and gloom.

To be honest, it seems to me there's a fair chance that XDR texturing won't adversely affect RSX performance.

Jawed
 
Jawed said:
Well I don't have a Cell dev kit to hand, so I don't know what kind of software might be interceding between your "to the metal" code and what the hardware actually executes.
This is not related to Cell or PS3 at all, I'm talking about what you can (theoretically) do on a G71.
about this mechanism of using dummy registers in order to improve dynamic branching efficiency - but I'd like to see evidence of this working before blindly accepting that it works like that. It's a pretty interesting work-around :D
Well.. if you have a G70 you can test it by building your own shaders and running some tests.. :)
Supposedly there are less stages in the G71 pipeline, due to 90nm tech - perhaps this actually means that G71 has nominally less fragments in flight than G70. And RSX would be the same as G71. That might explain 220 as opposed to 256.
I still can't see this relation between pipeline length and number of fragments in flight; IMHO they're very loosely related, if related at all.
(Though, separately, Bob noted that there are actually about 800-odd fragments in flight for NV4x/G70 per quad - so, erm G71/RSX may have a yet-smaller count due to 90nm pipeline-shortening.)
I have no idea how many fragments in flight per quad one can have, but I'm sure you can't give a fixed number on such an architecture; it just does not make any sense, as there are many variables (not just the register count) that can influence it.
An alternative point of view is that the fixed length of the G7x/RSX pipeline is enough to hide double the memory latency of typical GDDR3.
Pipeline length does not matter..
Though that still doesn't solve the problem that arithmetic operations will now proceed at half-rate.
No, they will not, I already explained to you why this is not the case.
 
nAo said:
Well.. if you have a G70 you can test it by building your own shaders and running some tests.. :)
I don't have any NVidia hardware. I guess when you confirm the hypothesis with results then we can all heave a big sigh of relief. As I said, I've already hypothesised about this mechanism - it needs testing, and you need to keep your fingers crossed that the driver doesn't simply optimise out the dummy registers.

I still can't see this relation between pipeline length and number of fragments in flight; IMHO they're very loosely related, if related at all.
I have no idea how many fragments in flight per quad one can have, but I'm sure you can't give a fixed number on such an architecture; it just does not make any sense, as there are many variables (not just the register count) that can influence it.
It's a limit on the number of fragments in flight. 256 stages in a pipeline (which is acting like a loop) makes for 256 fragments in flight, maximum. Excess register usage, beyond the nominal design limit (e.g. 4 in G7x), will force the command processor to issue fewer than 256 fragments in flight:

http://www.beyond3d.com/forum/showpost.php?p=702940&postcount=106

In that thread I learnt (the hard way, sigh) that I completely misunderstood what NVidia is doing with its pipeline design - Xmas helped a lot there. Those stage counts are guesses. But the length of the pipeline puts an upper limit on the number of fragments in flight (which is why I refer to the "nominal count of fragments in flight" and the "nominal per-fragment register count").

Anyway, I'm looking forward to your results from testing this stuff :!:

Jawed
 
Jawed said:
I don't have any NVidia hardware. I guess when you confirm the hypothesis with results then we can all heave a big sigh of relief. As I said, I've already hypothesised about this mechanism - it needs testing, and you need to keep your fingers crossed that the driver doesn't simply optimise out the dummy registers.
Just don't use dummy registers.

It's a limit on the number of fragments in flight. 256 stages in a pipeline (which is acting like a loop) makes for 256 fragments in flight, maximum. Excess register usage, beyond the nominal design limit (e.g. 4 in G7x), will force the command processor to issue fewer than 256 fragments in flight:

http://www.beyond3d.com/forum/showpost.php?p=702940&postcount=106

In that thread I learnt (the hard way, sigh) that I completely misunderstood what NVidia is doing with its pipeline design - Xmas helped a lot there.
No offense but I believe you're still misunderstanding it.
Just read the patents, it's relatively simple how it works!
The number of pixels in flight is not hardcoded at all; basically there is a small processor that assembles pixel batches, and it does not stop until some on-chip resource is no longer available. When some resource runs out it puts a marker at the end of the segment, that's all.
According to the patents there's no uber pixel-shading pipeline with a given amount of memory at each stage to store temporary registers or other stuff; there's in fact a big per-quad register file.
Those stage counts are guesses. But the length of the pipeline puts an upper limit on the number of fragments in flight (which is why I refer to the "nominal count of fragments in flight" and the "nominal per-fragment register count").
Again, no:)
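
Roughly how I read the segmenter described in those patents - a speculative Python sketch, not the actual hardware logic, and 880 is just the patent's example figure:

```python
# Speculative sketch of the segmenter: keep adding fragment groups to the
# current segment until some on-chip resource (here, just the register pool)
# would be exhausted, then emit an end-of-segment marker and start over.

REGISTER_POOL = 880                     # per-pipeline pool, patent's example

def build_segments(fragment_groups):
    """fragment_groups: list of per-group register counts."""
    segments, current, regs_used = [], [], 0
    for regs in fragment_groups:
        if regs_used + regs > REGISTER_POOL:
            segments.append(current)    # "end of segment" marker
            current, regs_used = [], 0
        current.append(regs)
        regs_used += regs
    if current:
        segments.append(current)
    return segments

# 300 groups needing 4 registers each -> segments of 220 and 80 groups
print([len(s) for s in build_segments([4] * 300)])   # [220, 80]
```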
 