More creativity with PS3 framebuffer needed?

scificube

There is a debate about whether RSX can really take advantage of "aggregate" bandwidth due to the expected situation with the framebuffer.

I'm wondering if this is more a matter of creativity than limitation, so I've thought of some ideas and I'd like to know whether or not they're possible, and to get some thoughts on their merit.

----------

A common idea is placing texture and vertex data in XDR so that RSX could further utilize the available aggregate bandwidth.

Why not place the front buffer in XDR as well? RSX would access the front buffer for full-screen post-processing effects such as DoF, HDR blooms and such, and the number of accesses to the front buffer should eat some decent bandwidth for these effects. Wouldn't this save on ROP work and ultimately leave more bandwidth for the back buffer? Lastly, wouldn't this be a good place for the front buffer if you elected to use Cell for post-processing effects (additive on top of RSX even) vs. having to contend with the added latency of going to VRAM with Cell to the same end?
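To illustrate why the buffer's location matters for Cell-side post-processing, here's a minimal, purely illustrative C++ sketch of the access pattern. dmaGet/dmaPut and applyBloom are made-up stand-ins for the SPU's real MFC DMA calls and for the actual filter, and the tile size is just an example that would fit comfortably in an SPU's local store; whichever pool the buffer lives in is the bus that absorbs all of this traffic.

Code:
#include <cstddef>
#include <cstdint>

// Hypothetical helpers: stand-ins for the SPU's MFC DMA calls and the filter kernel.
void dmaGet(void* localStore, std::uint64_t effectiveAddr, std::size_t bytes);
void dmaPut(const void* localStore, std::uint64_t effectiveAddr, std::size_t bytes);
void applyBloom(std::uint32_t* pixels, std::size_t count);

constexpr std::size_t TILE_PIXELS = 4096;   // 16 KB of 32-bit pixels per tile

// Stream the buffer through local store one tile at a time (assumes the pixel
// count is a multiple of the tile size, which is fine for a sketch).
void postProcessBuffer(std::uint64_t bufferAddr, std::size_t totalPixels)
{
    std::uint32_t tile[TILE_PIXELS];

    for (std::size_t offset = 0; offset < totalPixels; offset += TILE_PIXELS) {
        dmaGet(tile, bufferAddr + offset * 4, TILE_PIXELS * 4);   // pull one tile in
        applyBloom(tile, TILE_PIXELS);                            // e.g. threshold + blur
        dmaPut(tile, bufferAddr + offset * 4, TILE_PIXELS * 4);   // write it back out
    }
}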

----------

Why not exchange capacity in RAM for bandwidth to RAM by using more than one framebuffer... or rather back buffer? (Actually, if VRAM bandwidth is not saturated, it could be looked at as eating capacity to free capacity in memory.)

This would probably require some work to pull off effectively and efficiently, but maybe something like the following could work.

1) Both buffers are exact copies.
2) Use split-screen rendering to avoid doing double work.
 - Split the screen unevenly; the VRAM buffer is used for the majority of what's in the frustum because latency to VRAM is lower, etc.; the buffer in XDR is conversely used for less of the scene.
 - Use scissor tests to split the rendering.
 - The split is set dynamically based on load to maximise utilization of the aggregate bandwidth.
3) Resolve to the front buffer in later stages and continue with post-processing, etc. (a rough sketch of the scissor split appears after the alternative below).

Alternative:
1) Buffers are not exact copies, e.g. one is FP16, one is FP32 or INT8, etc.
2) Screen is not split.
 - Buffers are used for different stages in the rendering pipeline: opaque geometry, high-quality blending effects, low-quality blending effects, etc.
 - To be effective, at least 2 stages in the rendering pipeline must be active.
 - Stages with higher back-buffer accesses are handled with the buffer in VRAM; stages with lower back-buffer accesses are handled with the buffer in XDR.
 - If possible, have more stages active in the pipeline to further utilize the aggregate bandwidth to RSX.
3) Resolve to the front buffer as above, or use post-processing as an active pipelined stage being worked on concurrently.
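As a rough sketch of the scissor-split variant, assuming (and this is a big assumption) that RSX could target a render buffer in either pool at all: desktop OpenGL on a 3.x-style context stands in for RSX's real API, drawScene is a hypothetical function that submits the scene's draw calls, and fboVram/fboXdr are imagined render targets living in GDDR3 and XDR respectively.

Code:
#include <GL/gl.h>

// Hypothetical: submits the scene's draw calls; note it gets called twice.
void drawScene();

void renderSplitFrame(GLuint fboVram, GLuint fboXdr, int width, int height, float split)
{
    int splitX = static_cast<int>(width * split);   // split chosen per frame from load

    glEnable(GL_SCISSOR_TEST);

    // Majority of the frustum goes to the VRAM buffer (lower latency).
    glBindFramebuffer(GL_FRAMEBUFFER, fboVram);
    glScissor(0, 0, splitX, height);
    drawScene();

    // The remainder goes to the XDR buffer, spending bandwidth on the other bus.
    glBindFramebuffer(GL_FRAMEBUFFER, fboXdr);
    glScissor(splitX, 0, width - splitX, height);
    drawScene();

    glDisable(GL_SCISSOR_TEST);
    // Later: stitch both regions into the front buffer and carry on with post-processing.
}

Note that as written this submits the geometry twice and drives the two buffers one after the other rather than in parallel, which is essentially the objection raised later in the thread.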

-----------------

Cell assisted TBDR of a portion of the scene to lower load on RSX and thus RSX's bandwidth needs

1) LSs act as an on-chip cache, like in the Kyro series
2) Cell transforms all geometry for a portion of the scene
3) Ray casts that portion of the scene for occlusion culling (a toy sketch of this step follows the list)
4) Skins and shades geometry
5) Combine or overlay the Cell-rendered portion of the scene with the RSX-generated front buffer representing the rest of the scene in the frustum, and output.
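To make step 3 a bit more concrete, here is a toy, CPU-style sketch of per-tile ray casting for visibility, in plain C++ and nothing PS3-specific: tris is assumed to already be the set of triangles binned to the tile and resident in local store, and all the structure names are made up for illustration.

Code:
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Triangle { Vec3 v0, v1, v2; int id; };

// Moller-Trumbore: distance along the ray to the triangle, or +inf on a miss.
static float intersect(Vec3 orig, Vec3 dir, const Triangle& t)
{
    const float kEps  = 1e-6f;
    const float kMiss = std::numeric_limits<float>::infinity();
    Vec3 e1 = sub(t.v1, t.v0), e2 = sub(t.v2, t.v0);
    Vec3 p  = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < kEps) return kMiss;
    float inv = 1.0f / det;
    Vec3 s = sub(orig, t.v0);
    float u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return kMiss;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return kMiss;
    float dist = dot(e2, q) * inv;
    return dist > kEps ? dist : kMiss;
}

// For one tile of pixels, record which triangle is visible at each pixel;
// only the surviving triangles would then need to be shaded.
void rayCastTile(Vec3 eye, const std::vector<Vec3>& pixelDirs,
                 const std::vector<Triangle>& tris, std::vector<int>& visibleId)
{
    visibleId.assign(pixelDirs.size(), -1);
    for (std::size_t i = 0; i < pixelDirs.size(); ++i) {
        float nearest = std::numeric_limits<float>::infinity();
        for (const Triangle& t : tris) {
            float d = intersect(eye, pixelDirs[i], t);
            if (d < nearest) { nearest = d; visibleId[i] = t.id; }
        }
    }
}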

Cell may be able to handle all of this given that it doesn't have to render the whole scene, or even most of it. That helps offset the cost of ray casting (kept simple: used only for visibility tests and simple lighting, with the rest of the graphics work done by other means) and the fact that Cell is still not a GPU, so some pixel-shader-like effects, etc. will be handled less efficiently by Cell.

Alternative 1: Don't allocate a portion of the scene to Cell, but rather intelligently select a set of objects or environmental geometry whose cost stays just below whatever processing budget you have left available on Cell. Selection could be dynamic, or these objects could be tagged in advance to remove overhead, as long as things stay within a range of predictability.
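A minimal sketch of what that dynamic selection could look like, with an entirely made-up per-object cost estimate and per-frame Cell budget (none of these names come from any real SDK):

Code:
#include <algorithm>
#include <vector>

struct DrawItem {
    int   id;
    float estimatedCost;   // hypothetical: measured or tagged offline
    bool  cellFriendly;    // tagged in advance, e.g. simple shading only
};

// Pick objects for Cell until the remaining budget for this frame runs out;
// everything else stays on RSX.
void splitWork(std::vector<DrawItem>& items, float cellBudget,
               std::vector<DrawItem>& cellList, std::vector<DrawItem>& rsxList)
{
    // Cheapest-first so as many items as possible fit under the budget.
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.estimatedCost < b.estimatedCost; });

    float used = 0.0f;
    for (const DrawItem& item : items) {
        if (item.cellFriendly && used + item.estimatedCost <= cellBudget) {
            cellList.push_back(item);
            used += item.estimatedCost;
        } else {
            rsxList.push_back(item);
        }
    }
}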

Cell will need to overlay what it's responsible for into the front buffer image before it's displayed, and to do so correctly its work will most likely have to be finished just before any full-screen post-processing effects like DoF are calculated, and after any work leading up to post-processing begins. If post-processing isn't used, and such effects instead happen elsewhere along the pipeline, this idea may not be viable... not to say it necessarily is in the first place.

I've wondered whether selective rendering by Cell of any sort could be used in conjunction with RSX's output to really make cut-scenes even more outstanding than expected. Err... I guess that's a question?

Alternative 2: Have Cell handle a stage or two in the rendering pipeline to save RSX work, and have Cell consume XDR bandwidth, leaving RSX more available bandwidth to VRAM. RSX gets pre-baked data that it can continue work on.

Alternative 3: Have Cell handle only part of the work of a stage (or stages), via method 1 or alternative 1.

Alternative 4: If TBDR won't work then by some other means.

-----------------------------------

Well that's my mental exercise for the day...any sense to anything I said?

Any interesting ideas out there still?
 
Some of these ideas were mentioned in a lengthy discussion here.

There are several problems with your ideas.

First of all, I'm skeptical of the use of FlexIO as an efficient random access bus for various reasons. One of them is that FlexIO isn't a bus going to memory, so you don't have address lines. RSX must send a packet to CELL which contains address information for it to know what data to fetch or write. Then CELL must translate it into a request for the memory controller, fetch the data from XDR, and transmit it back to RSX. The data going through address and control lines isn't trivial, nor is this additional latency.

For texture data in XDR, not only are you going to have the above problem, but what are you going to put in the rest of the GDDR3 besides the framebuffer? RAM is a very valuable resource on consoles. Most of the textures will have to be in GDDR3.

Vertex data I fully expect to be put in XDR, but it doesn't occupy much bandwidth. When it does (i.e. high vertex to pixel ratio), most of the time you're not drawing many pixels, so you have bandwidth to spare anyway at that time. In any case, vertex traffic is usually a non-issue. I know for a fact that IHV drivers often place vertex buffers in AGP.

For the front buffer, I'm not sure that's an option. Doesn't RSX have to output the image to the screen? That's all the front buffer is used for. In any case, RAMDAC bandwidth usage is small potatoes.

Split framebuffers won't work unless they're interleaved, and I seriously doubt RSX is capable of that. If you split the way you're saying, e.g. with a scissor test, you can only render to one buffer at a time, so you're not saving any bandwidth. One bus will sit idle. Splitting by type of workload won't work either, because you can't give RSX two independent sets of polygons to render separately with different renderstates. No GPU is capable of this, and the benefits of such a feature would be very limited while the cost would be huge.

TBDR from the LS seems unlikely. I don't see how RSX can have random read-write access to the LS, and if it could, remember the issues I mentioned above. It would make sense if RSX was built to be a TBDR and had an on-chip tile buffer, but we know it isn't. TBDR also needs quite a bit of specialized hardware to quickly determine which tiles a polygon belongs to, and it must be able to change renderstates very quickly. There are some good TBDR threads here.

If I may say one more thing, there's been a common mistake by several members here. To get 35GB/s on FlexIO, you need a constant stream of data flowing all the time at the peak rate. There simply is no way to split any given batch evenly like that. Heck, the bandwidth split can vary substantially within a single batch. The reason the link is so fast is to handle bursts quickly. Loading textures on the fly or copying frames for CELL to work on won't take much time.
 
I'll check out the thread you linked :smile:

With my TBDR idea I was looking at a CBEA document which states that with memory aliasing any processor can get direct access to data in an LS, and it went so far as to specifically mention a GPU. I primarily thought of TBDR because of the Kyro II, which used ray casting and relied on a very fast cache; I thought Cell could handle 1-bounce ray casting and then the LS would save on bandwidth because it would act as the fast cache the Kyro II had.

CBE_architecture10.pdf pg.34 of 319 said:
3.2.1 Local Storage Access
The CBEA allows the local storage of an SPU to have an alias in the real address space in the main storage
domain. This allows other processors in the main storage domain to access local storage through appropriately
mapped effective address space. It also allows external devices, such as a graphics device, to directly
access the local storage.

This would seem to allow streaming directly from Cell's LSs to RSX. In any case, what I suggest doesn't seem practical. That's ok. I know I've got lots to learn.

----------

If RSX can't act on two different render states then the split rendering idea is indeed a dud. Would there be a way to work it with MRTs?

I messed up on the render states anyway by thinking a scissor test would save doing double work, but without two render states whatever is outside the rectangle simply doesn't get rendered, and you're right, the bandwidth doesn't get used.

Looks like the idea is a dud.

----------

I wasn't so much thinking that the 35GB/s would be utilized by a constant stream; rather, I was thinking of ways to get at the aggregate bandwidth with more than transformed geometry and textures. The framebuffer would be the biggest possible win if someone could figure out how to do it and work around the latency from XDR.

Also, I didn't mean to imply all textures would be in XDR, but rather just some of them. VRAM should of course be consumed by graphic related data and textures would be more than happy to eat up space there.

As for latency...

Cell clock / RSX clock = normalized speed: 3,200/550 ≈ 5.82 Cell cycles for every RSX cycle.

If XDR latency to Cell is 140 cycles then: 140/5.82 ≈ 24 RSX cycles.
24 cycles + latency to Cell + latency back from Cell. Latency to and from Cell should be low, no?

Is this where your skepticism comes from? How does this compare with the latency to VRAM that GPUs typically have to deal with? Wouldn't RSX's rumored larger caches allow this latency to be absorbed, given that work would be within the cache more often while fetching could be done in the background? Or doesn't it work like that?

---------

It sure seems like a lot of effort to put a FlexIO interface on RSX. Are the only wins transformed geometry and procedural geometry/textures? I suppose the pipe is wide so it can handle big bursts, but it still seems an awful lot of work.

-------

Oh yeah...this stuff's over my head. I'm just exploring.
 
scificube said:
A common idea is placing texture and vertex data in XDR so that RSX could further utilize the available aggregate bandwidth.

Why not place the front buffer in XDR as well? RSX would access the front buffer for full-screen post-processing effects such as DoF, HDR blooms and such, and the number of accesses to the front buffer should eat some decent bandwidth for these effects. Wouldn't this save on ROP work and ultimately leave more bandwidth for the back buffer? Lastly, wouldn't this be a good place for the front buffer if you elected to use Cell for post-processing effects (additive on top of RSX even) vs. having to contend with the added latency of going to VRAM with Cell to the same end?

A developer here suggested the possibility of having multiple buffers in different locations, but didn't elaborate much beyond that.

Aside from that, I think the most likely candidates for buffers in XDR would be those being worked on primarily by Cell.

If you wish to have Cell access a RSX buffer for post-processing, it might make more sense to have RSX operate on it in VRAM and then have it read in by the SPUs.

scificube said:
Cell assisted TBDR of a portion of the scene to lower load on RSX and thus RSX's bandwidth needs

1) LS's act as on chip cache like in Kyro series
2) Cell transforms all geometry for a portion of the scene
3) Ray casts that portion of the scene for occlusion culling
4) skins, and shades geometry
5) Combine or overlay Cell rendered portion of the scene with RSX generated front buffer representing the rest of the scene in the frustum and output.

I assume you're referring here to a Cell-rendered frame to be composited with RSX's, and yes that is possible. I'm not sure how general the split could be between what is rendered on RSX and what is rendered on Cell, but you could be smart about what you place on both, to take best advantage of both.

scificube said:
Alternative 2: Have Cell handle a stage or two in the rendering pipeline to save RSX work, and have Cell consume XDR bandwidth, leaving RSX more available bandwidth to VRAM. RSX gets pre-baked data that it can continue work on.

It'd be possible for Cell to compute a look-up table that could be fed into a RSX shader as a texture, to allow RSX to skip some work, if that's what you're getting at.
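For what it's worth, here's a minimal sketch of the look-up-table idea, with desktop OpenGL (3.x-style) used purely as a stand-in for whatever the PS3's real API exposes; the CPU-side generation of the table stands in for whatever the SPUs would actually compute. The pixel shader then does one texture fetch instead of evaluating the function per pixel.

Code:
#include <GL/gl.h>
#include <vector>

// Upload a 256-entry table (e.g. a tone-mapping or attenuation curve) as a
// 256x1 float texture; the pixel shader samples it using the value it would
// otherwise have computed as the texture coordinate.
GLuint uploadLutAsTexture(const std::vector<float>& lut /* 256 RGBA entries = 1024 floats */)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, 256, 1, 0,
                 GL_RGBA, GL_FLOAT, lut.data());
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}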

On a general note, of course you cannot use all of VRAM bandwidth just for the framebuffer; it'd be an enormous waste of RAM, since you need to allocate some bandwidth for the rest of the memory. However, I'm not yet convinced that we know, publicly, about the mechanisms and performance characteristics of RSX access to XDR. I see assumptions being made about that quite a lot, but various suggestions and comments made by people actually working with the system don't always coincide with those assumptions.
 
Thanks Titanio for the comments.

Actually, I got the idea of multiple buffers from a sideline conversation about NAO32. With that technique multiple buffers are used for different stages in the rendering pipeline, except that the buffers are converted in place for each stage, or rather pass: NAO32 for opaque geometry, FP16 for blending, and IIRC INT8 for LDR.

Mintmaster has stated I can't have two sets of geometry with two renderstates on any modern GPU. I wonder, though, could I still accomplish the goal with one set of geometry but retain two renderstates? What I'm thinking is that the z-depth should be the same for both buffers whereas the color depth is not. It's just a thought at this point... gotta do some more digging I suppose, or someone will have mercy on me.

----------------

I was talking about compositing images from both Cell and RSX in some way. When Mintmaster mentioned that RSX must output the image, it became clear the front buffer should remain in VRAM to avoid having to ship it back to RSX just to output the final image. So I understand what you're saying, that it's probably smarter to leave the front buffer in VRAM.

One approach was an attempt at Cell handling a portion of scene in view from start to finish, but I thought it might be more intelligent to use Cell for only certain passes in the rendering pipelines so I took a stab at that too. Clumsily of course ;)

I really hadn't thought of storing data in a texture and passing it to RSX... I suppose it would have made too much sense to use the most basic data storage structure in graphics... I mean... why would you do that? ...sigh. Oh well :smile:
 
scificube said:
This would seem to allow streaming directly from Cell's LSs to RSX. In any case, it seems what I suggest doesn't seem practical. That's ok. I know I've got lots to learn.
The thing is that writing pixels and reading texels isn't streaming. It's random access. Feeding a vertex pipe with an index buffer or vertex list is streaming, and thus it can be done easily. From the quote you mentioned, "other processors in the main storage domain" is unlikely to include RSX. The alias seems to be internal.

If RSX can't act on two different render states then the split rendering idea is indeed a dud. Would there be a way to work it with MRTs?
MRTs are a possibility, with RGBA split with RG in one buffer and B or BA in the other, but you halve the fillrate that way (though fillrate is in abundance). The bigger issue is compositing them together, which would lose most if not all of your gains. Still, I'd be surprised if RSX can render directly to XDR over FlexIO, and would be shocked if it didn't have very heavy penalties.
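Purely to illustrate that channel split in generic desktop OpenGL terms (not a claim about what RSX exposes): two colour attachments, one imagined to live in GDDR3 and the other in XDR, with the fragment shader writing RG to output 0 and BA to output 1. The compositing pass that recombines them is exactly where the gains would evaporate.

Code:
#include <GL/gl.h>

void bindSplitTargets(GLuint fbo, GLuint texVramRG, GLuint texXdrBA)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, texVramRG, 0);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                           GL_TEXTURE_2D, texXdrBA, 0);

    // One pass writes to both targets; a later pass has to merge RG and BA again.
    const GLenum buffers[] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
    glDrawBuffers(2, buffers);
}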

Also, I didn't mean to imply all textures would be in XDR, but rather just some of them. VRAM should of course be consumed by graphic related data and textures would be more than happy to eat up space there.
The thing is that this means you can't offload a big portion of texture bandwidth. Size isn't necessarily proportional to BW, but consider how texture BW is usually less than FB BW, and if only some textures can be accessed from XDR, and only some of the time, you're really talking about a small improvement, if any.

If XDR latency to Cell is 140 cycles then: 140/5.82 ≈ 24 RSX cycles.
24 cycles + latency to Cell + latency back from Cell. Latency to and from Cell should be low, no?
I don't think 140 cycles is right. Didn't nAo refute this? Could be lower, actually. I don't think you can ignore FlexIO latency, as like I said, it isn't like the bus between RAM and a processor. Unfortunately, I have no idea what the figure is.

I'm not sure what a GPU's access time to GDDR3 is, but it's something around 100 cycles from the point of view of the pixel shader. This is because the memory controller bundles requests from many pixels in order to use burst access, which helps greatly in getting high efficiency. Retrieving only a few bytes at a time from random locations is slow, and doesn't use the cache very well either.

It sure seems like a lot of effort to put a FlexIO interface on RSX. Are the only wins transformed geometry and procedural geometry/textures? I suppose the pipe is wide so it can handle big bursts, but it still seems an awful lot of work.
Textures are accessed randomly, so procedural textures on CELL don't save you bandwidth either. They're great for variety and compression, but you have to store them somewhere once you generate them. Besides, even on a beast like CELL, procedural generation won't get close to the speed of regular texturing. Something like the oft-quoted Warhawk cloud simulation isn't really replacing a texture either. It's another effect that you'll overlay on the background.

Like I said before, FlexIO is great for streaming data transfers really fast. When the GPU needs to decide what it needs on the fly, then FlexIO's use is diminished. But even if you do 50MB of transfer over FlexIO per frame (which is a lot - fourteen 720p images!), i.e. 3GB/s @ 60fps, it's better for the transfers to happen at 35GB/s so as to limit the transfer time to about 10% of the frame (50MB at 35GB/s is roughly 1.4ms of a 16.7ms frame). It won't be easy to keep RSX doing other things efficiently while these transfers are going on, so getting them done as fast as possible is important. The same is true for getting through culled and clipped vertices, where you get spikes in your vertex BW load. You want to get through these as fast as possible as the rest of RSX will often be waiting for more visible polygons to rasterize.

So a fast FlexIO is very useful even if you don't have a sustained load on it.

Here's a little tidbit of info to show how unreasonable it is to expect FlexIO saturation: you need 250MB transferred from RSX to CELL and 333MB the other way round every frame @ 60 fps (the 15GB/s and 20GB/s links divided by 60, respectively). A 720p image is 3.7MB. What in god's name are you going to transfer, and how do you keep the rest of PS3 working during this time? Even hitting one tenth of this number is unlikely.
 
Thanks Mintmaster that was very informative.

I wasn't thinking procedural textures would be immediately used (I really don't think procedural textures are really all that and a bag of chips in the first place but just nice to have)...just decompressed and flushed as needed. I was asking more "is that all it's good for?" You did answer my question regardless so thank you.

I realize that random access is the key, and what I was trying to ascertain was a way to split up those accesses to the framebuffer so as to reduce contention for VRAM bandwidth from handling all of them. I realise the framebuffer is small, but it is accessed so very often that it inhibits the requests of other things from being serviced. That was all I was after. I really wasn't trying to get after 35GB/s of data.

I did realise the XDR latency figure was probably incorrect, because the LS latency is 6 cycles, not 8, to my understanding. I also saw where nAo issued the proverbial "bollocks" declaration on the slide's authenticity, and later in that thread another slide was shown stating the numbers were only there for illustrative purposes. I only elected to use 140 cycles as the latency to XDR because I had no better guess than that slide, as I assumed the slide wouldn't be radically off the mark. I only wanted to explore how bad the latency would be and how it would compare to latency from the VRAM.

If the latency is somewhere around 24 cycles in addition to whatever latency there is directly to and from Cell, I need to understand why the pixel shaders could absorb 100 cycles of latency from GDDR3 and not from XDR. Is it that you expect latency directly between RSX and Cell to be greater than approximately 76 cycles? Or is it that you don't think Cell's MFC/DMA engines could queue up and handle all the outstanding requests (of course with Cell's approval)?

-----

As for your question...again I wasn't after that amount of "data" per second...that's more than XDR itself!...every second! I realise sustained transfer of new data is not plausible. I was more after accessing the same data over and over again and altering it every so often.

-----

It would be nice to know if RSX's caches have in fact gotten larger and just how much they alleviate bandwidth concerns. As the saying goes... the world may never know.
 
We may never know, but I think G71 is a very good guide. It only makes sense that NVidia popped RSX's pipelines straight into G71, as they're both 90 nm and NV might as well make good use of all the design effort. They obviously put a lot of effort into trimming down G70 into G71. One more thing: Remember that caches do not alleviate bandwidth concerns unless it's some sort of special case like Xenos' eDRAM or a TBDR's cache. They merely make memory access more efficient. I think G71 is already doing a very good job at that.

Regarding latency, I didn't say it takes ~100 cycles to access stuff from GDDR3. First you have to do some math to figure out which memory location to access. Then the memory controller piles up memory requests and reorganizes them into contiguous (or nearly so) blocks during that time. The urgency of the request is also noted, as the pipelines can only completely hide latency for a certain number of cycles. Based on some complex heuristics, data is transferred from the GDDR3 in one intelligent burst after another. There's a lot of black magic going on there, and I think this is what makes it so hard for there to be a viable third player in the GPU market. Finally, once you have your data, you need to do the texture filtering too.

Just think about how every cycle you can have up to 16 z read requests (maybe 64?), 16 z write requests, 16 colour write requests, 24 texture requests (each requesting 4 texels of data from each of two mipmaps), and some vertex data too! Then you have to worry about open pages in the DRAM, accessing memory efficiently, etc.

Another 50 cycles can really mess this up. Moreover, you don't know what Cell is doing with the XDR either, and it could delay the loading of the data for a little while also. It's really impossible to make any sort of theoretical prediction of what the impact will be. I do know for a fact that GPUs, when built around a particular latency, can have big drops in efficiency when the latency is changed.
 
Sorry for all the questions... just trying to understand you, that's all.

The way I was looking at on-chip caches is that if data were in them and used over and over again, then this should have the effect of reducing the number of times you have to go out to memory. I thought this was a relevant point in the discussion about whether Xenos is able to handle AF given its bandwidth to GDDR3... there was a long discussion about it, so that's where I was coming from. I thought it was rumored the caches were enlarged on RSX to help deal with the relatively tight bandwidth available (although even this seems debatable... every day something new!). I know it's the prevailing thinking that RSX is basically G71, and I'm not expecting wondrous changes myself either, but if Sony is willing to take on a FlexIO interface then I don't see larger caches as being out of the question. Not a prediction... just saying it could've happened... but of course, we may never know. Since the ROPs seem to have gotten smaller... maybe some of those transistors went somewhere else...

Well anyway, I do understand well where you're coming from, so I won't bother you any more with questions. Till another time then. ;)
 
Titanio said:
A developer here suggested the possibility of having multiple buffers in different locations, but didn't elaborate much beyond that.

Asides from that, I think the most likely candidates for buffers in XDR would be those being worked on by primarily by Cell.

If you wish to have Cell access a RSX buffer for post-processing, it might make more sense to have RSX operate on it in VRAM and then have it read in by the SPUs.



I assume you're referring here to a Cell-rendered frame to be composited with RSX's, and yes that is possible. I'm not sure how general the split could be between what is rendered on RSX and what is rendered on Cell, but you could be smart about what you place on both, to take best advantage of both.



It'd be possible for Cell to compute a look-up table that could be fed into a RSX shader as a texture, to allow RSX to skip some work, if that's what you're getting at.

On a general note, of course you cannot use all of VRAM bandwidth just for the framebuffer, it'd be an enormous waste of RAM. You need to allocate some bandwidth for the rest of the memory. However, I'm not yet convinced that we know, publically, about the mechanisms and performance characteristics of RSX access to XDR. I see assumptions being made about that quite a lot, but various suggestions and comments made by people actually working with the system don't always coincide with those assumptions.

One thing I wonder, with all these assumptions and that to my knowledge we don't know yet(?), is whether any of the caches are coherent between Cell and RSX. As I'm not very skilled at these things, I could imagine that since Cell was being made first, nVidia/Sony tried to match them better, or something like that. I don't know... ignore me if it makes zero sense!
 
If RSX had its own 64-bit path to XDR then all problems concerning latency between RSX and XDR would be alleviated.
 
MBDF said:
If RSX had its own 64-bit path to XDR then all problems concerning latency between RSX and XDR would be alleviated.

I was more talking/wondering, since we go through Cell via FlexIO, whether tweaking the caches on RSX makes sense from a performance POV, either in the XDR memory controller and/or down through the whole chip. I think that RSX would have to adjust itself to the specs of Cell, as Cell was already done.
 
overclocked said:
One thing I wonder, with all these assumptions and that to my knowledge we don't know yet(?), is whether any of the caches are coherent between Cell and RSX.

I'm not sure, but IIRC the part of FlexIO they are using to connect Cell and RSX is coherent.

About cache and bandwidth - it has been hinted that RSX's texture cache has been increased beyond the increases seen even in G71. Does this help bandwidth? For cache-friendly operations, sure. You can probably feed your GPU in a way that makes better use of caches also. But aside from that, larger caches could be part of a broader strategy to allow for higher levels of threading on the GPU, at the programmer's discretion. That in turn could help hide latency for access to XDR, which in turn could make use of more bandwidth on that side of the system more viable for RSX.

Also, just a note about things like Warhawk's cloud rendering on Cell not 'replacing' texturing. I might not have quite grasped the thrust of the argument, but whatever about replacing texturing or not, such techniques can free up texturing bandwidth (some framebuffer bw consumption moved onto the SPUs = more main memory bandwidth for other things).
 
I am starting to wonder whether the assumption that XDR RAM cannot be used by RSX via the FlexIO interface the same way as GDDR RAM, because of excessive latency, is actually correct.

http://www.cotsjournalonline.com/home/article.php?id=100412&pg=1

The Cell BE processor has well-balanced I/O both internally and externally. The nine processing elements are connected internally via an element interconnect bus (EIB) capable of moving 96 bytes/cycle. Also connected to the EIB is an external XDR DRAM memory controller capable of peak rates exceeding 25 Gbytes/s to main memory. The final I/O interface for the processor is a Flex I/O interface. This interface has both a coherent interface for connecting two Cell BE processors together and a non-coherent I/O interface, which combine to provide 60 Gbytes/s of bandwidth.

Also look at
http://ntrg.cs.tcd.ie/undergrad/4ba2.05/group12/index.html
Where is SCI used today and why is it used?
SCI has one main advantage over its competitors, namely that not only is it a System Area Network, it also allows remote memory accesses. Thus, SCI is suitable for both message passing and shared memory programming on clusters.

So scalable coherent interfaces are used for shared memory in clusters, and FlexIO is also designed to connect two Cell chips together to allow coherent memory access via the FlexIO interface.

I am assuming coherent access means access without excessive latency, in the same way as direct access via a local bus would allow.

Wouldn't it be sensible in this case for Sony to simply add a little bit of interface hardware to the RSX to enable it to also do coherent access to the XDR RAM?

Am I right in assuming this would allow texture data as well as geometry data to be placed in the XDR RAM, and so overcome the bandwidth limitations of RSX's 128-bit GDDR bus?
 
Coherency hasn't really got anything to do with latency, AFAIK. It's about the values seen by two processors in a single memory location, and making sure that view is... coherent, the same, at all times.
 
Mintmaster said:
Another 50 cycles can really mess this up. Moreover, you don't know what Cell is doing with the XDR either, and it could delay the loading of the data for a little while also. It's really impossible to make any sort of theoretical prediction of what the impact will be. I do know for a fact that GPUs, when built around a particular latency, can have big drops in efficiency when the latency is changed.
I was talking to Tamasi about texture latency hiding and how ATI are doing it by juggling multiple batches, and Tamasi said that's one way to do it; the other is to make the threads sufficiently long - which is why NVIDIA are using such large batch sizes at the moment, because they don't handle many different threads in a pipeline. The batch size of G7x is tuned to the latencies of the memories it's using, being GDDR3 - changing that would require a change to the command processor (which they haven't done) and/or an increase in batch size (which would impact other areas).
 
Titanio said:
Coherency hasn't really got anything to do with latency, AFAIK. It's about the values seen by two processors in a single memory location, and making sure that view is... coherent, the same, at all times.

Hmmm. What I can figure is that coherent shared memory is used in 16-way or greater (NUMA) servers. In those machines the remote memory is certainly directly accessible to the CPU, just as memory would be accessed on a local bus. The ring bus itself is straight out of the architectures used for shared memory NUMA servers.

The SPE and PPE communication with main memory (XDR and GDDR via RSX) is on the basis of shared coherent memory via the ring bus.

If the RSX can access the ring bus on the basis of shared coherent memory, then why should RSX access to XDR be any different from PPE or SPE access to XDR or GDDR in terms of latency?

My question is, where did the assumption that excessive latencies will prevent the RSX using XDR for texture maps come from? Did someone actually announce this or is this someone's assumption based on the assumption that Flex i/o is an i/o interface rather than a bus?
 
SPM said:
Hmmm. What I can figure is that coherent shared memory is used in 16 way or greater (NUMA) servers. In those machines certainly the remote memory is directly accessible to the CPU as memory would be accessed on a local bus. The ring bus itself is straight out of the architectures used for shared memory NUMA servers.

The SPE and PPE communication with main memory (XDR and GDDR via RSX) is on the basis of shared coherent memory via the ring bus.

If the RSX can access the ring bus on the basis of shared coherent memory, then why should the RSX access to XDR including latencies be any different to PPE or SPE access to XDR or GDDR in terms of latencies?

Coherency again has nothing to do with the latency to a certain pool of memory. It simply ensures that multiple processors looking at a pool of memory see the same values in memory. For example, one processor might read a value from main memory and that gets stored in its cache. A second processor might then read the same value into its cache. Then the first processor may write a new value out to that memory location - but without coherency, the value the second processor is seeing is the old one, because that's what it has in its cache. Coherency relates to shared access of memory from multiple processors, but it bears no relation to the latencies different processors might experience when accessing that memory.

That's my understanding of it at least, if I'm wrong, someone feel free to correct me.

SPM said:
My question is, where did the assumption that excessive latencies will prevent the RSX using XDR for texture maps come from? Did someone actually announce this or is this someone's assumption based on the assumption that Flex i/o is an i/o interface rather than a bus?

It's not too much to assume that there is added latency for RSX accessing XDR versus GDDR3. GDDR3 is local to it, it's closer. To access XDR, RSX has to go through FlexIO, through Cell, to the other end where the XDR memory controller is, which in turn has to go out over the XDR bus to memory. That's obviously going to take more time than going locally to GDDR3. The open question isn't whether XDR access will bear more latency, but how that might impact use of that memory on RSX's part. That's where people start speculating.
 
Titanio said:
It's not too much to assume that there is added latency for RSX accessing XDR versus GDDR3. GDDR3 is local to it, it's closer. To access XDR, RSX has to go through FlexIO, through Cell, to the other end where the XDR memory controller is, which in turn has to go out over the XDR bus to memory. That's obviously going to take more time than going locally to GDDR3. The assumption isn't that XDR access will bear more latency, but how that might impact use of that memory, on RSX's part. That's where people start speculating.

So it is just speculation.

To achieve coherent access to XDR, surely direct access to the memory, similar to direct bus reads/writes, is required, as opposed to incoherent access, which presumably is done with DMA.

Yes, RSX access to XDR has to go through the ring bus, but so has PPE and SPE access to XDR. The latency for these is pretty low otherwise you would see horrendous performance issues. If the Flex i/o interface connecting RSX to the ring bus is configured for coherent access, then RSX is just another node on the ring bus just like the PPE and the eight SPEs, and the latency for RSX accessing XDR should not be very different.

There may be a slightly higher latency for RSX accessing XDR, but surely that has to be very low. Remember the ring bus has two paths in either direction and it doesn't relay data from one node to the next - it jumps directly between the two nodes. The latency is therefore going to be low - similar to a bus. If latency is low enough to feed the PPE and SPEs without a major performance hit, then surely it is easily enough for texture maps.
 
The mere fact that XDR latency in PS3 is a secret is pretty much proof, in my view, that it's high.

Additionally, the whole point of the Cell BE architecture is ultra-high bandwidth streaming - low latency is not the design target. Hence the DMA-list architecture etc. etc.

The local store organization introduces another level of memory hierarchy beyond the registers that provide local storage of data in most processor architectures. This is to provide a mechanism to combat the "memory wall," since it allows for a large number of memory transactions to be in flight simultaneously without requiring the deep speculation that drives high degrees of inefficiency on other processors. With main memory latency approaching a thousand cycles, the few cycles it takes to set up a DMA command becomes acceptable overhead to access main memory. Obviously, this organization of the processor can provide good support for streaming, but because the local store is large enough to store more than a simple streaming kernel, a wide variety of programming models can be supported, as discussed later.

Google for Kahle.pdf

If you assume XDR has 1000 cycles of latency at 3.2GHz, then you're looking at roughly double the latency of GDDR3 as seen from RSX's point of view (1000 Cell cycles is about 170 RSX cycles at 550MHz, versus the ~100 GDDR3 cycles mentioned earlier).

Jawed
 