XBOX 2 GRAPHICS DETAILS EMERGE .. not much actually

Inane_Dork said:
Factor in how much color and depth compression ATi can do with 4x MSAA, and one-quarter of the screen is about the worst-case scenario.
We don't know of a single implementation of color compression for FP render targets at this time. Color and Z-buffer compression schemes, as we know them from current hardware, don't save memory, just bandwidth. With eDRAM one potentially has tons of bandwidth to spare.
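
To make that distinction concrete, here's a minimal sketch (a hypothetical layout I made up, not ATI's actual scheme) of why lossless tile compression cuts bus traffic but not footprint: every tile keeps its full worst-case slot in memory, because any given tile may turn out incompressible, and the header only tells the hardware how many bytes it actually has to move.

[code]
/* Hypothetical tile layout: the memory footprint is fixed at the
   worst case, only the bytes crossing the bus shrink. */
#include <stdint.h>

#define TILE_PIXELS (8 * 8)
#define TILE_BYTES  (TILE_PIXELS * 4)  /* full slot, always reserved */

typedef struct {
    uint8_t  compressed;    /* 1 = tile stored in compact form   */
    uint16_t stored_bytes;  /* bytes to transfer (<= TILE_BYTES) */
} TileHeader;

/* Bus traffic for reading one tile: the slot in memory never
   shrinks, only the transfer size does. */
static uint32_t tile_read_cost(const TileHeader *h)
{
    return h->compressed ? h->stored_bytes : TILE_BYTES;
}
[/code]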

Or look at it this way. Cards already have caches for the depth buffer. That must make most, if not all, cases faster. Why would adding eDRAM slow it down?
Because a block of eDRAM is just that: DRAM! That's not a cache; they act differently. Do current GPUs, in the general case, have cache problems (low cache hit ratios)? If that's not the case, I can't see why the hardware would need a huge pool of embedded DRAM to act like a cache - GPUs already have caches, and presumably their size and policy are designed according to the GPU's architecture and needs. That's why I believe the eDRAM will be used to store a portion of the render targets. eDRAM will guarantee all the bandwidth needed for <free> MSAA, maybe even on fp16 render targets.
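
To put rough numbers on that (mine, purely illustrative): at 500 Mpixels/s with 4x MSAA and blending on an fp16 target (8 bytes per sample), color traffic alone is 500M x 4 x 8 x 2 (read + write) = 32 GB/sec before you even touch Z - enough to swamp a ~22 GB/sec external bus, but easy for a wide eDRAM interface.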

ciao,
Marco
 
MfA said:
You need to store all the Z values all the time (otherwise intersecting surfaces won't work correctly). You can get away with reading fewer values though, which is basically what the hierarchical Z-buffer does.
Hmm, ok. The Z values within a pixel (or even within a larger tile) should usually be very close together, though, leading to a high compression rate. So it should still be possible to move parts of the compressed Z-buffer out to normal RAM without too much of a performance hit.
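
A toy version of that idea (my own sketch, not any shipping hardware's format): within an 8x8 tile the Z values of a single surface are nearly planar, so a base value plus small per-pixel deltas usually suffices.

[code]
/* Toy delta scheme: a tile compresses if all of its 24-bit Z values
   fit in one base value plus 8-bit deltas (roughly 3:1 here). */
#include <stdint.h>

#define TILE 64  /* 8x8 pixels */

static int tile_compresses(const uint32_t z[TILE], uint32_t *base)
{
    uint32_t lo = z[0];
    for (int i = 1; i < TILE; i++)
        if (z[i] < lo) lo = z[i];
    for (int i = 0; i < TILE; i++)
        if (z[i] - lo > 0xFF)
            return 0;  /* e.g. an edge between distant surfaces */
    *base = lo;
    return 1;  /* store base + 64 one-byte deltas */
}
[/code]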
 
Storing the framebuffer in eDRAM frees up the external memory bus for textures, vertices, shader programs, and (hopefully) [semi-]arbitrary external memory loads/stores from shader programs.

Also, I see no reason why tiling can't be left to the application - at 10MB, the eDRAM is large enough to allow complete screen coverage with just a few tiles. My guess is that the performance benefit of never hitting external memory for render targets would make any application/T&L overhead seem negligible. Necessary external buffers could be allocated at screen resolution (instead of at N x the resolution, to account for MSAA).
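
A quick back-of-the-envelope check of the "few tiles" claim, assuming 720p and 32-bit color + 32-bit Z/stencil per sample (my assumptions, not confirmed Xenon formats):

[code]
#include <stdio.h>

int main(void)
{
    const long edram = 10L * 1024 * 1024;  /* 10 MB eDRAM          */
    const long w = 1280, h = 720;          /* assumed resolution   */
    const long bps = 4 + 4;                /* color + Z per sample */

    for (long msaa = 1; msaa <= 4; msaa *= 2) {
        long fb    = w * h * msaa * bps;
        long tiles = (fb + edram - 1) / edram;  /* round up */
        printf("%ldx MSAA: %ld bytes -> %ld tile(s)\n", msaa, fb, tiles);
    }
    return 0;
}
[/code]

With those numbers, 1x fits in a single tile, 2x needs two, and 4x needs three - a handful either way.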

The lack of bus contention for anything other than render targets probably has several benefits (I'm not a hardware guy, so these are guesses, but):

- significantly reduced memory latency for render target access
- more deterministic latency

These things probably allow the GPU designers to make the on-chip caches for color/Z/stencil smaller and/or more efficient.

Serge
 
I would still like to see Xenon's main external memory bandwidth be in the 50+ GB/sec range, rather than the 22+ GB/sec shown in the Xenon diagrams & documents.

Assuming a 256-bit bus plus GDDR3, or even plain DDR2, 50+ GB/sec should be achievable.
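
The arithmetic behind that (clock figures are my assumptions): a 256-bit bus moves 32 bytes per transfer, so at an effective 1.6 GT/s (~800 MHz GDDR3) that's 32 x 1.6 = 51.2 GB/sec, while the 22+ GB/sec figure matches a 128-bit bus at ~1.4 GT/s (16 x 1.4 = 22.4 GB/sec).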

Then the large eDRAM bandwidth takes a lot of the load off of the still very substantial main memory bandwidth.


But even if Xenon's main memory bandwidth is only 22+ GB/sec, I would expect them to find ways to do amazing things thanks to the eDRAM bandwidth. 22+ GB/sec is still a nice leap from the 2-3 GB/sec of main memory bandwidth in the GC and PS2.
 
psurge said:
Also, I see no reason why tiling can't be left to the application - at 10MB, the eDRAM is large enough to allow complete screen coverage with just a few tiles. My guess is that the performance benefit of never hitting external memory for render targets would make any application/T&L overhead seem negligible.
I think so too, especially considering that the pixel processing load will greatly outweigh T&L in most next-gen apps (many times more than it does in today's pixel-heavy console games) - so the T&L overhead could very well hide completely behind the pipeline and not even register in many cases.

As for the benefits of render targets with no bus contention: knowing deterministic numbers for all your postprocessing, render-to-texture and other draw buffer ops is just, well - cool ;) And I'm sure anyone who has spent a decent amount of time working with hardware that already offers that would agree.
 
Faf... was there a hint buried in there somewhere :D?

It also occurred to me that with really large tiles (relative to PVR) and large poly counts (relative to this gen), traversing a hierarchy of bounding volumes is the way to go for assigning geometry to tiles. Since this involves pointer chasing, as well as data typically handled by the application (AI/physics are moving models around in the spatial hierarchy), leaving it up to the CPU and the programmer sounds like a much better idea than a semi-programmable unit on the GPU.
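
For what it's worth, a minimal sketch of that CPU-side traversal (all names hypothetical, nothing here is from a real Xenon API): walk the hierarchy once per tile and collect the models whose projected bounds touch it.

[code]
/* Hypothetical BVH walk assigning models to a screen-space tile. */
typedef struct { int x0, y0, x1, y1; } Rect;  /* screen-space AABB */

typedef struct BvhNode {
    Rect            bounds;    /* projected bounds of the subtree */
    struct BvhNode *child[2];  /* both NULL -> leaf               */
    int             model_id;  /* valid at leaves                 */
} BvhNode;

static int overlaps(Rect a, Rect b)
{
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

/* Appends every model overlapping `tile` to out[]; returns new count. */
static int collect(const BvhNode *n, Rect tile, int *out, int count)
{
    if (!n || !overlaps(n->bounds, tile))
        return count;                /* reject the whole subtree */
    if (!n->child[0] && !n->child[1])
        out[count++] = n->model_id;  /* leaf: draw in this tile  */
    else {
        count = collect(n->child[0], tile, out, count);
        count = collect(n->child[1], tile, out, count);
    }
    return count;
}
[/code]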

The other interesting thing was that on those leaked/fake diagrams, the eDRAM chip contains all the blending/ROP units. This again seems like a great idea since the number of included units and the frequency of that chip can be tuned for a desired fillrate. The "GPU" would have more space for shader units and could be clocked higher for greater arithmetic throughput per pixel/vertex without creating ridiculous and unobtainable peak fillrates.

Wouldn't the scheme also save quite a bit of power?

Regards,
Serge
 
eDRAM will guarantee all the bandwidth needed for <free> MSAA, maybe even on fp16 render targets.

MSAA has no performance cost besides memory bandwidth? How much does it increase the bandwidth costs by?
 
That's not entirely true. MSAA does have other costs - there is always a slight fillrate penalty, though only at poly edges, and it's massively smaller than the bandwidth penalty (although it should probably scale with perfect compression). There is also the cost of generating the multiple samples - the ROPs on current graphics boards can only generate two samples per cycle, so 2 cycles are needed for 4x MSAA (3 for 6x) - if you haven't got any bodge-ups in your pipeline this should be hidden by the length of other ops (trilinear / aniso / multitexturing / fragment shading).
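
Dave's cycle counts, reduced to toy arithmetic (illustrative numbers only): the ROP needs ceil(samples / 2) cycles per pixel, and the hit disappears whenever the shading or filtering work takes at least that long.

[code]
/* 4x -> 2 ROP cycles, 6x -> 3; hidden if the shader side is slower. */
static int rop_cycles(int msaa_samples)
{
    return (msaa_samples + 1) / 2;
}

static int pixel_cycles(int shader_cycles, int msaa_samples)
{
    int rop = rop_cycles(msaa_samples);
    return shader_cycles > rop ? shader_cycles : rop;  /* they overlap */
}
[/code]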
 
psurge said:
The other interesting thing was that on those leaked/fake diagrams, the eDRAM chip contains all the blending/ROP units. This again seems like a great idea since the number of included units and the frequency of that chip can be tuned for a desired fillrate. The "GPU" would have more space for shader units and could be clocked higher for greater arithmetic throughput per pixel/vertex without creating ridiculous and unobtainable peak fillrates.

Wouldn't the scheme also save quite a bit of power?

Regards,
Serge


Isn't that what Mitsubishi's 3DRAM did? Just taking eDRAM and placing it in a smart architectural way to get the most out of it?
 
Alstrong said:
So 2x MSAA is essentially free :?:

Well, bear in mind that shaders are getting into the hundreds of instructions, meaning they take many, many cycles to execute per pixel, which also means other ops can be hidden behind them while they execute; even the texture filter units are only bilinear, so anything with only one unit per pipeline will require two clocks to do a trilinear filter.
 
DaveBaumann said:
Alstrong said:
So 2x MSAA is essentially free :?:

Well, bear in mind that shaders are getting into the hundreds of instructions, meaning they take many, many cycles to execute per pixel, which also means other ops can be hidden behind them while they execute; even the texture filter units are only bilinear, so anything with only one unit per pipeline will require two clocks to do a trilinear filter.

Uh? What, does that mean trilinear filtering is essentially free too, since another part of the chip has already been slowed down, or does it slow down even more because of it?
 
The cost of the two cycles required for 4x MSAA can be hidden by the cost of the two cycles for the trilinear filtering (in this very simple example).
 
DaveBaumann said:
The cost of the two cycles required for 4x MSAA can be hidden by the cost of the two cycles for the trilinear filtering (in this very simple example).

Ok, but you mentioned shaders - I was wondering if that meant that in a game with very complex shaders that take many cycles, the shaders would hide the performance hit of the MSAA? Are shaders used to do MSAA? Just wondering, since MSAA came around at the same time as shaders.
 
I was wondering if that meant that in a game with very complex shaders that take many cycles, the shaders would hide the performance hit of the MSAA?

Yes. Take a look here at a very shader-heavy test and MSAA levels. Bear in mind, though, that poly edges will require more processing.

Are shaders used to do MSAA?

No. The ROPs at the back end of the pipeline have special hardware for MSAA - the only time things need to be reprocessed through the entire pixel pipeline is at poly edges.

Just wondering, since MSAA came around at the same time as shaders.

Entirely coincidental on consumer hardware - this was just the period when both could be afforded on the processes of the time. Multisampling was a well-known and understood technique under OpenGL long before that.
 
DaveBaumann said:
I was wondering if that meant that in a game with very complex shaders that take many cycles, the shaders would hide the performance hit of the MSAA?

Yes. Take a look here at a very shader-heavy test and MSAA levels. Bear in mind, though, that poly edges will require more processing.

Are shaders used to do MSAA?

No. The ROPs at the back end of the pipeline have special hardware for MSAA - the only time things need to be reprocessed through the entire pixel pipeline is at poly edges.

Just wondering, since MSAA came around at the same time as shaders.

Entirely coincidental on consumer hardware - this was just the period when both could be afforded on the processes of the time. Multisampling was a well-known and understood technique under OpenGL long before that.

Ah, I see - but the limited memory bandwidth on current video cards is still what keeps the performance hit from being low in most situations.
BTW, what handles anisotropic filtering? Does that have a fillrate hit?
 