Why no new "memory architecture" for Xenos?

Jawed

We keep hearing about R520's new memory "architecture" or "controller" or whatever, without really having the foggiest what it is.

All the same, isn't it kinda curious that this isn't a feature of Xenos - if both Xenos and R520 were developed sort of simultaneously and apparently share other new technologies?

I suppose the most obvious difference between the two is Xenos's function as XB360's memory controller - it's not just GPU memory it's interfacing with.

Does this give us any clues as to R520's new memory tech?

Jawed
 
It has a completely custom memory interface with 10MB of embedded memory.
 
Xenos's 10MB EDRAM is solely for render targets/backbuffer usage.

The rest of the memory, where textures are stored for example, is accessed through an apparently conventional GDDR3 interface.

I'm just curious what aspects of the memory system in R520 might be improved by a new design, and why those aspects aren't relevant in Xenos.

Perhaps the fact that they're not relevant in Xenos gives us a clue as to what they are. e.g. it might be an optimisation for render target/backbuffer memory access patterns.

So while Xenos has its dedicated EDRAM interface ("HSIO" on the die photo), R520 needs something else. :?:

And, whatever happened to the rumours about R520 having an internal ring bus architecture?

Jawed
 
Jawed said:
e.g. it might be an optimisation for render target/backbuffer memory access patterns.

Not too long ago a graph was posted here which showed, over a slice of time, which functions were responsible for what percentage of main memory reads/writes in a certain game (I've forgotten which one). IIRC the vast majority of memory accesses were not to textures or vertex data (which was particularly tiny, to my surprise) but for render target reads/writes. And since Xenos' working FB/ZB is located in EDRAM, main memory RT access shouldn't be an issue.
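
For a rough sense of scale, here's a back-of-envelope sketch of why colour/Z traffic can eat so much of an external bus when the render target lives in main memory. All the numbers below (resolution, overdraw, sample count, frame rate) are my own illustrative assumptions, not values taken from that graph:

```python
# Purely illustrative estimate of external-bus traffic generated by a render
# target kept in main memory. Every number here is a hypothetical assumption.

width, height = 1280, 720        # assumed render resolution
overdraw = 3.0                   # assumed average times each pixel is touched
msaa_samples = 4                 # assumed multisample count
fps = 60                         # assumed frame rate

# bytes moved per pixel touch, per sample: colour write, Z read, Z write,
# plus a destination colour read when blending
bytes_per_sample = 4 + 4 + 4 + 4

bytes_per_frame = width * height * overdraw * msaa_samples * bytes_per_sample
rt_traffic_gb_s = bytes_per_frame * fps / 1e9

print(f"render-target traffic: ~{rt_traffic_gb_s:.1f} GB/s")   # ~10.6 GB/s
```

None of that has to touch the external bus once the working colour/Z buffers sit in EDRAM.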

But then again I could be totally wrong...
 
Jawed said:
I think you're referring to these graphs:

http://www.beyond3d.com/forum/showpost.php?p=500420&postcount=5

Which RoOoBo generates from his simulation of a GPU. Very impressive work.

Jawed

Impressive maybe but also quite buggy.

Also, the graph was for a configuration with 2 quad ROPs and 2 quad shader units (or, in common language, a typical 8-pipe GPU). With the more ambitious configuration I'm currently reviewing, with 2 quad ROPs and 6 shader units (translated: a not-so-conventional 8 ROP pipe and 24 shader pipe GPU), and a simulator with a few fewer bugs, the amount of bandwidth consumed by texture data is quite a bit larger.
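
To make the quad-to-pipe translation explicit, a trivial sketch (my own notation, not the simulator's actual configuration format):

```python
# The two configurations compared above, with the quad -> pipe maths spelled
# out. "Quad" units process 4 pixels at a time, hence the factor of 4.

configs = {
    "old graph": {"rop_quads": 2, "shader_quads": 2},   # the typical 8-pipe GPU
    "current":   {"rop_quads": 2, "shader_quads": 6},   # 8 ROP / 24 shader pipes
}

for name, cfg in configs.items():
    rop_pipes = cfg["rop_quads"] * 4
    shader_pipes = cfg["shader_quads"] * 4
    print(f"{name}: {rop_pipes} ROP pipes, {shader_pipes} shader pipes")
```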

That frame was from a small trace of the UT2004 Primeval map, which seems to use quite a lot of blending (I guess for rendering all that grass terrain), and the color buffer doesn't support compression. Meanwhile most of the trace's textures are in compressed formats (DXT1 or DXT3). I'm currently using a different trace from a timedemo for the Primeval map (the 3DCenter timedemo) and the distribution seems to be different even for the 8-pipe configuration. It's likely because of errors in the simulator.

In any case it's doubtful that the frame is representative of other current or past games or graphics applications.

An example frame for the 8-pipe configuration:
[Image: 2sh1w-mem-f330.png]


Same frame for the 24/8 pipe configuration:
[Image: uni6sh2w-mem-f330.png]


(the images may not load until 10-20 minutes after I post because of some weird behaviour of the website, so don't ask why they aren't working)

Edit: For when the graphs show up ... The numbers aren't normalized and the amount of data read or written is sampled every 10K cycles, so 600000 actually means 600 KB, and the top theoretical bandwidth for both configurations is 64 bytes/cycle (the simulator is currently limited to simulating memory working at the same frequency as the GPU, not double data rate).
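
In case it helps reading the graphs, a minimal sketch of how the raw per-sample numbers relate to that 64 bytes/cycle ceiling (this is just the arithmetic from the paragraph above, not simulator code):

```python
# How to read the unnormalized numbers in the graphs.

BYTES_PER_CYCLE = 64          # top theoretical bandwidth of both configurations
CYCLES_PER_SAMPLE = 10_000    # the data is sampled every 10K cycles

peak_bytes_per_sample = BYTES_PER_CYCLE * CYCLES_PER_SAMPLE   # 640,000 bytes

def utilisation(bytes_in_sample: int) -> float:
    """Fraction of peak bandwidth used in one 10K-cycle sample window."""
    return bytes_in_sample / peak_bytes_per_sample

# A reading of 600000 on the graph means 600 KB moved in 10K cycles,
# i.e. close to the 640 KB theoretical maximum for that window.
print(f"{utilisation(600_000):.1%}")   # -> 93.8%
```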

Edit2: Try to find the errors ;)
 
What could be really nice, RoOoBo, would be a comparison of vertex usage vs. pixel usage with your project, to see how important unified pipelines could become, with different graphs spaced at different amounts of cycles (i.e. one can compensate for an imbalance in pipeline usage by having a large cache between the vertex and pixel pipelines... so the question is, how large is needed?).
 
Chalnoth said:
What could be really nice, RoOoBo, would be a comparison of vertex usage vs. pixel usage with your project, to see how important unified pipelines could become, with different graphs spaced at different amounts of cycles (i.e. one can compensate for an imbalance in pipeline usage by having a large cache between the vertex and pixel pipelines... so the question is, how large is needed?).

Well, that's first-class research material and it's more likely to show up in a submission to a computer architecture conference than on a web forum, even if it's one with the quality of Beyond3D ;).

In any case, the UT2004 Primeval map is mainly fragment limited, with some small regions (~10%) being fillrate limited (see the start of the frame above) or vertex limited (near the end, look at the blue vertex data zone; that zone has 0% fragment processing, as it seems most of those triangles are small or are culled). Even with those conditions, when I compare (using the simulator, which may not be an accurate or correct representation of real GPUs) a classic configuration with 4 vertex shaders against the same configuration implementing a unified shader architecture (num. unified shaders = num. fragment shaders, and everything else the same), there is a relatively small performance improvement of around 5% to 10%, depending on the frame.

Applications heavily limited by vertex processing will improve further. In the ideal case way further (imagine 24 unified shaders against 4 or 8 vertex shaders), but in many cases (or at least with what is implemented in the simulator) it will become limited by the geometry stages (either data fetch or after shading) if the shader program isn't very large. Applications with insignificant vertex processing may see a very small performance loss (as there are fewer processing units) or show no performance difference.
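
A toy model of why the gain stays small for this kind of frame; this is my own back-of-the-envelope, not how the simulator works, and the workload split is made up:

```python
# Hypothetical frame split into regions, each with (vertex_work, fragment_work)
# in arbitrary shader work units: one big fragment-dominated region and one
# vertex-only region where dedicated fragment pipes would sit idle.
regions = [(40.0, 900.0), (120.0, 0.0)]

def dedicated_time(num_vs, num_ps):
    # Dedicated pools: each region is bound by whichever pool is slower.
    return sum(max(v / num_vs, f / num_ps) for v, f in regions)

def unified_time(num_unified):
    # Idealised unified pool: all work shares the same units (ignoring
    # scheduling overhead and the geometry-stage limits mentioned above).
    return sum((v + f) / num_unified for v, f in regions)

t_ded = dedicated_time(num_vs=4, num_ps=8)   # 112.5 + 30.0 = 142.5
t_uni = unified_time(num_unified=8)          # 117.5 + 15.0 = 132.5
print(f"speedup from unification: {t_ded / t_uni:.2f}x")   # ~1.08x
```

Make the frame almost entirely fragment work and the two numbers converge (or the unified case even loses slightly, since it has fewer total units), which matches the last case above.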

The OpenGL framework doesn't support texture fetch in the vertex program, so I can't test whether unified shading is better for that case (and I think there is no OpenGL game using that feature either).
 
RoOoBo said:
The OpenGL framework doesn't support texture fetch in the vertex program, so I can't test whether unified shading is better for that case (and I think there is no OpenGL game using that feature either).
Well, NV_vertex_program_3 supports texture fetches. But you may not want to bother to use that.
 
Presumably the peak in vertex data towards the end of the frame render is due to the smoke effect from the rockets.

The scene itself seems fairly low in poly count and there doesn't appear to be much grass being blended - though I do know from painful experience that this map brings my Radeon 32MB SDR to its knees even at 320x240 :cry:

Jawed
 
Jawed said:
Presumably the peak in vertex data towards the end of the frame render is due to the smoke effect from the rockets.

The scene itself seems fairly low in poly count and there doesn't appear to be much grass being blended - though I do know from painful experience that this map brings my Radeon 32MB SDR to its knees even at 320x240 :cry:

Jawed

The frame for the graphs and the frame in the screenshot aren't the same frame (there are about 100 frames of difference). Frame 439 is a frame I had already uploaded to the server, while the data is for frame 330 (I have the image somewhere but I'm too lazy to upload it). I doubt it's the smoke, as that region appears in frames without smoke, weapon beams or even players. About the polygon count: frame 330 has around 240K vertices and 80K triangles (the map doesn't seem to use triangle strips).
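
As a trivial check, those two figures are consistent with no vertex reuse at all: without strips or indexed sharing each triangle submits its own three vertices.

```python
# Quick consistency check of the numbers above.
triangles = 80_000
vertices_without_reuse = triangles * 3
print(vertices_without_reuse)   # 240,000 -- matches the ~240K vertices quoted
```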
 
Chalnoth said:
Well, NV_vertex_program_3 supports texture fetches. But you may not want to bother to use that.

Framework means the simulator framework, not the API itself. Because of lack of time we try to avoid implementing specific extensions (or even unused API features) if not required. We should be targeting glSlang in any case rather than new extensions of the pseudo-assembly shader language.
 
Looking at the Xenos die photo:

[Image: MSGPU700.jpg]


The HSIO interface to the EDRAM, at the top edge of the die, appears to be the same size as one of the GDDR3 interfaces (at the bottom of the die, and most of the right hand side).

Each of the GDDR3 interfaces is 64-bit. I think...

I suppose one way of interpreting this is that the HSIO is a 64-bit interface. To support 32GB/s it would need an effective rate of 4GHz.
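
The arithmetic behind that figure, assuming the HSIO link really is 64 bits wide and has to carry the whole 32GB/s by itself:

```python
# Required effective transfer rate for a 64-bit link carrying 32 GB/s.
bus_width_bits = 64
target_bandwidth_gb_per_s = 32

bytes_per_transfer = bus_width_bits // 8                      # 8 bytes
effective_rate_ghz = target_bandwidth_gb_per_s / bytes_per_transfer

print(f"required effective rate: {effective_rate_ghz:.0f} GHz")   # -> 4 GHz
```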

Does that sound plausible? Sounds pretty ridiculous to me. So what's actually going on there?

Jawed
 