Predict: The Next Generation Console Tech

I am expecting games to look like high end PC games and beyond on release.
And then you go on to say not to expect anything above (likely below) 18 CUs...
Think about what console games look like today on GF7-class and similarly powered, slightly more advanced GPUs (not quite DX10, even if they surpass 10.1 and possibly 11 in certain features) and CPUs of that era. Now think about how much power an 18 CU GCN part offers compared to what they have now, and assume the CPUs will get a nice boost too.

You can't compare what an 18 CU GCN part does in a PC to what it would do in a console.
 
As far as compute capability, I would imagine that the usable compute capability of Xenos and RSX would be approximately equivalent to a modern 3 CU GPU at the same clock. And I think that's very generous, as it would assume that GCN is only architecturally better/more efficient by about 25% over a ~2005 GPU.

So, I think even a modest GPU consisting of around 16 CUs at a faster clock will be an order of magnitude upgrade over previous gen. I think too many are focusing on the PC space versus a comparison to the existing consoles.
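As a rough sanity check on that order-of-magnitude claim (purely back-of-the-envelope, and the GCN clock here is just an assumption): Xenos is usually quoted at 48 ALUs x 5 lanes x 2 ops (MADD) x 500 MHz = 240 GFLOPS, while 16 GCN CUs at, say, 800 MHz would be 16 x 64 x 2 x 0.8 GHz ≈ 1.6 TFLOPS, i.e. roughly 7x in raw throughput before any per-CU efficiency gain is counted.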
 
While CU and FLOP counts are a nice, easy-to-grasp "power" metric, they are really only meaningful if you have the bandwidth to feed them.
From what little I've seen, compute shaders are more often constrained by memory than by ALU count.
But there are also (a few) problems which are computationally bound. And since the aggregate bandwidth to the caches scales with the number of CUs, it can be cheaper to increase the CU count than to increase the external memory bandwidth and still get similar performance for everything that has at least some small-scale data reuse, even if you are already in the range of diminishing returns. A GCN CU is quite small at ~5.5 mm² (that's actually Tahiti, with 1/4 DP rate and ECC everywhere; Pitcairn is probably <~5 mm²), while the pad area alone for a 64-bit partition of Tahiti's memory interface measures >~10 mm², with the actual controllers and ROPs coming on top of that. On average it is probably always a bit cheaper to have a bit more raw flops than one would deem necessary for a "balanced" design.
If you're bandwidth constrained, doubling the CU count isn't going to help much. You could probably take the current PC figures as what NVidia/ATI believe are useful ALU-to-bandwidth ratios, though both are likely optimized for current PC games.
Yes, it may not help much, but as long as it helps somewhat and only inflates the (in the long term) decreasing wafer fabrication costs while reducing the more stable board costs and the number of memory chips needed, it may still be the better long-term (performance/$) choice. And for a console, you can also try something a bit more forward looking than focusing on current PC games (like the Xbox 360 had the first unified shader design). ;)
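To put those area numbers into a quick (hedged) example: on the figures above, two extra CUs cost roughly 11 mm² of logic, while one extra 64-bit GDDR5 channel costs more than 10 mm² of pad area alone, plus the controller and ROPs on die, extra board routing, and two more 32-bit memory chips on the BOM for the life of the console.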
 
hm... so 154GB/s for a fully operational Pitcairn.
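(Presumably 256-bit GDDR5 at 4.8 Gbps per pin: 32 bytes x 4.8 GT/s = 153.6 GB/s.)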

Sort of my point, fantasizing about lots of FLOPS is fine, but it's only useful if you can feed it.
I think I said earlier that I could imagine 2 designs, one with 2x the raw flop count with minimal visual difference.
FLOPS are easy to latch onto as a measure of raw power, but really it's a small part of the performance puzzle.

What about ROP considerations (with MSAA)?

Part of the equation, look at what you're getting out of 8 ROP parts in the 360/PS3, but again the useful number is limited by the amount of available bandwidth.
 
As far as compute capability, I would imagine that the usable compute capability of Xenos and RSX would be approximately equivalent to a modern 3 CU GPU at the same clock. And I think that's very generous, as it would assume that GCN is only architecturally better/more efficient by about 25% over a ~2005 GPU.

So, I think even a modest GPU consisting of around 16 CUs at a faster clock will be an order of magnitude upgrade over previous gen. I think too many are focusing on the PC space versus a comparison to the existing consoles.
Would that also require an order of magnitude improvement in memory bandwidth, latency, and CPU power to scale everything linearly?
 
As far as compute capability, I would imagine that the usable compute capability of Xenos and RSX would be approximately equivalent to a modern 3 CU GPU at the same clock. And I think that's very generous, as it would assume that GCN is only architecturally better/more efficient by about 25% over a ~2005 GPU.
Xenos is actually relatively comparable (from a pure shader perspective) to an R600 design with three SIMDs (R700 through R900 didn't change too much in this respect), if one factors in the reduced efficiency of Xenos' vec4+1 layout compared to the more flexible VLIW5/4 arrangement (if I were forced to take a shot from the hip, I would estimate the advantage of the later designs at ~20% on average, though of course it varies significantly with the workload), while the relative texture fetch bandwidth is reduced a bit (at least for the versions with full-size SIMDs). GCN is a different breed which probably adds again something in the range of 30% on average, bringing the total advantage over Xenos to roughly 1.2 × 1.3 ≈ 1.56, i.e. ~50-60% per CU/SIMD. But of course this applies mainly to "old fashioned" tasks. In more modern workloads it could easily exceed these numbers, even without taking into account the added flexibility and the new algorithms it enables.
But it is very hard to generalize such comparisons without looking at a specific workload. It can never be more than a very rough rule of thumb for something which does not run into a hard limitation for the majority of the runtime. I actually doubt it makes a lot of sense trying to quantify this in a seemingly accurate way (it never will be accurate).
IMO, the best achievable generalization is that it will be significantly (close to an order of magnitude) faster in almost all situations, while additionally offering a far more modern architecture whose extra features help with, or outright enable, implementing new techniques.
 
Sort of my point, fantasizing about lots of FLOPS is fine, but it's only useful if you can feed it.
I think I said earlier that I could imagine 2 designs, one with 2x the raw flop count with minimal visual difference.
FLOPS are easy to latch onto as a measure of raw power, but really it's a small part of the performance puzzle.

Part of the equation, look at what you're getting out of 8 ROP parts in the 360/PS3, but again the useful number is limited by the amount of available bandwidth.

Are the ALUs and ROPs working simultaneously (or potentially)? i.e. is the bandwidth you'd want a summation of the ALU and ROP figures, or should we only be looking at the worst case between the two? (Assuming no eDRAM split.)

The 256GB/s calculation for Xenos ROPs @4xAA seems rather disturbing as we head to 16 or 32 ROPs, though I suppose you could make the case that devs can just forget about MSAA for consoles.

hm... on the other hand... that 256GB/s is 128GB/s read + 128GB/s write. Does that mean a bidirectional bus of 128GB/s is adequate?
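For reference, one way to arrive at that figure: 8 ROPs x 4 samples x (4 B colour + 4 B depth) x 2 (read + write) x 500 MHz = 256 GB/s, i.e. the 128 GB/s each way mentioned above. Scaled naively, 32 ROPs at ~800 MHz with 4xAA and no compression would want on the order of 1.6 TB/s, which is why the colour/Z compression discussed below matters.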
 
Are the ALUs and ROPs working simultaneously (or potentially)? i.e. is the bandwidth you'd want a summation of the ALU and ROP figures, or should we only be looking at the worst case between the two? (Assuming no eDRAM split.)

The 256GB/s calculation for Xenos ROPs @4xAA seems rather disturbing as we head to 16 or 32 ROPs, though I suppose you could make the case that devs can just forget about MSAA for consoles.

hm... on the other hand... that 256GB/s is 128GB/s read + 128GB/s write. Does that mean a bidirectional bus of 128GB/s is adequate?
The eDRAM of Xenos didn't use any compression (as other GPUs do). You have to compare that with the bandwidth to the ROP tile caches (Color + Z) of recent GPUs. I don't have any clue about the size of these caches, i.e. how much reuse one may achieve on average. But one purpose of these caches is that the external bandwidth is only used for compressed data (and with 4x MSAA one probably easily achieves compression ratios of >2:1).
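Put another way: at a >2:1 ratio on the ROP traffic, the 256 GB/s of raw colour+Z traffic in the 4xAA example above would translate into something under 128 GB/s of actual external bandwidth, before texture and buffer reads are added on top.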
 
Are the ALUs and ROPs working simultaneously (or potentially)? i.e. is the bandwidth you'd want a summation of the ALU and ROP figures, or should we only be looking at the worst case between the two? (Assuming no eDRAM split.)

The 256GB/s calculation for Xenos ROPs @4xAA seems rather disturbing as we head to 16 or 32 ROPs, though I suppose you could make the case that devs can just forget about MSAA for consoles.

hm... on the other hand... that 256GB/s is 128GB/s read + 128GB/s write. Does that mean a bidirectional bus of 128GB/s is adequate?

Yes, everything works at the same time.
I'm just suggesting that in a lot of ways you can see what ATI/Nvidia believe are the sweet spots by looking at PC parts. But you can't look at any part of it in isolation, and BW is an integral part of any performance equation.

In fact, I could argue that if you're looking to figure out how something will perform, the memory subsystem is the first thing you look at.

The Xenos bandwidth, though, doesn't include any compression; because of the eDRAM it doesn't have much of the BW-saving hardware that's common today, and MSAA is one of those excellent cases for compression.
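As a concrete example of reading that sweet spot off a current PC part (figures for the desktop HD 7870, i.e. full Pitcairn): 20 CUs x 64 x 2 x 1 GHz = 2.56 TFLOPS against 153.6 GB/s works out to roughly 16-17 flops per byte of external bandwidth.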
 
The Xenos bandwidth, though, doesn't include any compression; because of the eDRAM it doesn't have much of the BW-saving hardware that's common today, and MSAA is one of those excellent cases for compression.

The eDRAM of Xenos didn't use any compression (as other GPUs do). You have to compare that with the bandwidth to the ROP tile caches (Color + Z) of recent GPUs. I don't have any clue about the size of these caches, i.e. how much reuse one may achieve on average. But one purpose of these caches is that the external bandwidth is only used for compressed data (and with 4x MSAA one probably easily achieves compression ratios of >2:1).

Depth compression must be pretty good? (only asking because ROPs have been extended to quad-rate for rv7xx+ or octal rate for GT200+).
 
Depth compression must be pretty good? (only asking because ROPs have been extended to quad-rate for rv7xx+ or octal rate for GT200+).
It's easy to come up with efficient and easy-to-implement depth compression schemes, as most areas of a depth buffer form a continuous surface (triangles are connected to each other and have linear slopes). For example, you could split the depth buffer into 8x8 tiles (256 bytes uncompressed). During rendering, remember the minimum and maximum depth value of each tile (this data is also very useful for depth-culling bigger blocks of data, often called Hi-Z by GPU manufacturers). As you know the minimum and maximum values of each tile, you can easily determine how many bits you need to store the range. Most 8x8 blocks do not have that large a depth range, so most blocks would compress pretty well (while some blocks would of course be left uncompressed).

A better variation of the previous scheme would be to fit a plane (just take three values in opposite corners of the block) and store the differences from it instead. This often requires fewer bits, as surfaces tend to have near-constant slopes (at least over areas as small as 8x8). This could offer (roughly estimated) up to a 4x compression ratio.
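A minimal sketch of that plane-fit idea, purely for illustration (the 8x8 tile, 32-bit samples and the particular corner choice are just the assumptions from the post, not how any real GPU does it):

[code]
#include <cstdint>
#include <cmath>

// Hypothetical 8x8 depth tile, 32 bits per sample = 256 bytes uncompressed.
// Returns how many bits per residual would be needed if every sample is
// predicted from a plane fitted through three corners of the tile.
int planeFitResidualBits(const uint32_t depth[8][8])
{
    // Plane through three corners: origin value plus the two edge slopes.
    const double d00    = depth[0][0];
    const double slopeX = (double(depth[0][7]) - d00) / 7.0;
    const double slopeY = (double(depth[7][0]) - d00) / 7.0;

    int64_t maxAbsErr = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
        {
            const int64_t predicted = (int64_t)std::llround(d00 + x * slopeX + y * slopeY);
            const int64_t err       = (int64_t)depth[y][x] - predicted;
            const int64_t absErr    = err < 0 ? -err : err;
            if (absErr > maxAbsErr) maxAbsErr = absErr;
        }

    // Smallest signed width that covers [-maxAbsErr, +maxAbsErr].
    int bits = 1;
    while (bits < 32 && (int64_t(1) << (bits - 1)) <= maxAbsErr)
        ++bits;

    // A flat or gently sloped tile needs only a few bits per sample (plus the
    // three 32-bit corner values and a small per-tile header), which is where
    // the roughly 4:1 figure comes from; a noisy tile falls back to 32 bits.
    return bits;
}
[/code]

(Comparing that against the simple min/max range from the first scheme and keeping whichever encoding is smaller would be the obvious next step.)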

I am not sure if entropy encoding techniques / LZ variants are suitable for such high-speed fixed-function hardware, but if they are, you could use a compression scheme similar to the one we use to compress/decompress our terrain height maps (height and depth are similar). For the first scanline of data, you guess that the next height map pixel lies on the straight slope formed by the two earlier pixels in the same scanline. The same goes for the first pixel of each scanline (but using the first pixels of the two earlier rows instead). For all other pixels, form a triangle from two pixels of the row above and the last pixel in the same scanline (these pixels will also be available during decompression, if you decompress in scanline order), and guess that the new height map pixel lies on this triangle. Compress the errors with fixed-length entropy encoding or your preferred LZ variant. This technique results in around a 20x compression ratio for well-behaved height fields. It would be even better for depth buffer compression, as the algorithm would guess depth values inside each triangle perfectly (pretty much storing all triangle interiors for free).
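And a simplified, single-channel sketch of that prediction pass (the entropy/LZ stage that would consume the residuals is left out, and the exact choice of the three neighbours is just one reading of the description above):

[code]
#include <cstdint>
#include <vector>

// Prediction pass only: produce per-sample residuals for a later entropy/LZ
// stage. General case predicts each sample from the plane through its left,
// above and above-left neighbours (left + above - above_left); the first row
// and first column instead extend the linear slope of the two previous samples.
std::vector<int32_t> predictResiduals(const std::vector<uint16_t>& height,
                                      int width, int rows)
{
    std::vector<int32_t> residual(height.size());
    auto at = [&](int x, int y) { return int32_t(height[y * width + x]); };

    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < width; ++x)
        {
            int32_t predicted;
            if (y == 0 && x >= 2)        // first scanline: extrapolate from the two previous pixels
                predicted = 2 * at(x - 1, 0) - at(x - 2, 0);
            else if (x == 0 && y >= 2)   // first pixel of a row: extrapolate from the two rows above
                predicted = 2 * at(0, y - 1) - at(0, y - 2);
            else if (x >= 1 && y >= 1)   // general case: plane through the three decoded neighbours
                predicted = at(x - 1, y) + at(x, y - 1) - at(x - 1, y - 1);
            else                         // the first couple of samples have no slope to extend yet
                predicted = (x == 0 && y == 0) ? 0 : at(x > 0 ? x - 1 : 0, y > 0 ? y - 1 : 0);

            residual[y * width + x] = at(x, y) - predicted;
        }
    return residual;   // small, sign-clustered values that an LZ/entropy stage can pack tightly
}
[/code]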

Of course no lossless algorithm can compress all types of data. You always have to be prepared for the case that all depth buffer blocks contain uncompressed data, so these kinds of techniques are best for saving bandwidth, not for minimizing memory usage (as you still need to allocate room for the worst case). I am not sure what kind of algorithms are used in real GPUs, but they should be at least as good as the first two simple techniques mentioned (and thus likely result in at least a 4x average compression ratio).
 
It's easy to come up with efficient and easy-to-implement depth compression schemes, as most areas of a depth buffer form a continuous surface (triangles are connected to each other and have linear slopes). For example, you could split the depth buffer into 8x8 tiles (256 bytes uncompressed). During rendering, remember the minimum and maximum depth value of each tile (this data is also very useful for depth-culling bigger blocks of data, often called Hi-Z by GPU manufacturers). As you know the minimum and maximum values of each tile, you can easily determine how many bits you need to store the range. Most 8x8 blocks do not have that large a depth range, so most blocks would compress pretty well (while some blocks would of course be left uncompressed).

A better variation of the previous scheme would be to fit a plane (just take three values in opposite corners of the block) and store the differences from it instead. This often requires fewer bits, as surfaces tend to have near-constant slopes (at least over areas as small as 8x8). This could offer (roughly estimated) up to a 4x compression ratio.

I have no idea how modern hardware does depth compression, but the early solutions were much like you describe, though tiles were typically smaller than you suggest, with the addition of a basic classification per tile, i.e. keep one value if all the pixels in a tile are at identical depth, tiles that can be compressed to 4 bits per depth value, uncompressed tiles, etc.
There was also on chip memory dedicated to the compression, so under some circumstances a Z operation would require no external memory access.

I'd imagine any entropy coding for Z would be impractical.
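Purely as an illustration of that classification idea (the real modes and bit budgets in shipping GPUs aren't public, so the names and fields here are made up):

[code]
#include <cstdint>

// Hypothetical per-tile metadata for a simple depth compression scheme of the
// kind described above; real GPUs' classifications are vendor-specific.
enum class ZTileMode : uint8_t {
    Cleared,        // every sample equals the clear value: nothing stored
    Constant,       // all samples at identical depth: one value stored
    PlaneDelta4,    // plane fit plus a 4-bit residual per sample
    Uncompressed    // fall back to full 32 bits per sample
};

struct ZTileInfo {
    ZTileMode mode;
    uint32_t  minDepth;   // min/max also feed Hi-Z style coarse rejection
    uint32_t  maxDepth;
};
[/code]

With a table like that held on chip, as mentioned above, a fully rejected or constant tile never needs to touch external memory at all.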
 
Crytek's CEO Cevat Yerli inadvertently let slip that the new consoles might feature 8 or 10 GB of RAM. :smile:

http://www.videogamer.com/xbox360/c...essors_will_impact_next-gen_specs_crytek.html

When asked what he would like Microsoft and Sony to factor into their next-generation consoles, Yerli stated that more memory would allow developers to do "so many more techniques and tricks".

"As a person who likes to drive technology-meets-game design as art, you can never have enough memory. Ever. Simple as that," Yerli told us. "Memory is the single most important thing that is always going to be underbalanced - I've never seen a console where the memory was the right balance.

"Xbox 360, underbalanced. PlayStation 3, underbalanced. Simply because memory is the most expensive part, hence I wish there would be cheaper ways of doing memory so that memory doesn't become an issue anymore.

"If they find ways to cheapen the cost to a degree they could triple or quadruple their memory. Just say, 'Hey we're going to have 32 gigs of memory'. That would be quite amazing because memory can do so many more techniques and tricks."


Triple or quadruple the amount of RAM to 32GBs? Could that be a hint as to what to expect from next-generation consoles? Both Xbox 360 and PlayStation 3 only carry 512MB of RAM, with the latter split equally between system and video resources.
 