Predict: The Next Generation Console Tech

Status
Not open for further replies.
You say >200 GB/s like that is supposed to be an impressive number.
It is an impressive number, considering a Radeon 6970 has 176 GB/s. I also think it's much more impressive to have 2-4 GB of memory at 200 GB/s compared to having a few tens of MB at 256 GB/s.
 
What kind of bandwidth would be available with on-die RAM at this point, and is it needed above and beyond the 200-250 GB/sec that a shared memory pool would offer?
 
Time once was... lol

Anyway, those numbers were assuming a simple forward renderer (depth + backbuffer, 32bpp/RGBA8). The numbers would be higher with multiple render targets & FP16.
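As a rough sanity check on that kind of estimate, here is a back-of-envelope sketch of framebuffer traffic for a simple forward renderer. The resolution, frame rate, and overdraw factor are assumed figures for illustration, not numbers from the post:

```python
# Framebuffer bandwidth estimate for a simple forward renderer.
# Assumed: 1920x1080 at 60 fps with an overdraw factor of 4.
width, height = 1920, 1080
fps = 60
overdraw = 4

depth_bytes = 4   # 32-bit Z per fragment
color_bytes = 4   # RGBA8 backbuffer

# Each fragment: depth read + depth write + colour write.
bytes_per_frame = width * height * overdraw * (2 * depth_bytes + color_bytes)
gb_per_sec = bytes_per_frame * fps / 1e9

# With 4 FP16 render targets the colour traffic grows 8x (4 targets x 8 bytes).
bytes_per_frame_mrt = width * height * overdraw * (2 * depth_bytes + 4 * 8)
gb_per_sec_mrt = bytes_per_frame_mrt * fps / 1e9

print(f"forward: {gb_per_sec:.1f} GB/s, MRT+FP16: {gb_per_sec_mrt:.1f} GB/s")
```

Framebuffer traffic alone is only a small slice of the total — texturing and geometry dominate — but it shows how quickly the MRT/FP16 caveat multiplies the numbers.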

can you update the speculations in that post?

and please use less technical words :S
 
As far as the next xbox is concerned, I'd say the writing is rather on the wall... (SoC from Nvidia)

Nvidia announcing *now* a SoC which will utterly destroy current *phones* in six months makes them a shoo-in for a system that has to compete favorably with PCs in 2013-2014? That was quite the leap of faith.
 
The 4-way ALU clusters are part of a 16-wide SIMD.
A 16-ALU cluster would have 256 units in the SIMD.
It would also be more challenging to connect 16 units to the per-cluster register file.

What if they keep the same SIMD width? So four 16-ALU clusters, for 64 units per SIMD.
Due to the elimination of redundant transistors, SIMDs should be smaller.

As for AMD, well, they reduced the width to 4 rather than increasing it beyond 5. I think that may be a clue as to the current direction of shader workloads. AMD said that average lane utilisation was in fact around 3.5, I believe, which shows that 5-wide was indeed too wide.
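A quick illustration of what that utilisation figure implies (the 3.5 average is the number quoted above; everything else follows from it):

```python
# Average occupied VLIW slots per issue, per the AMD figure quoted above.
avg_lanes = 3.5

# Fraction of issue slots doing useful work at each VLIW width.
utilisation = {width: avg_lanes / width for width in (5, 4)}
for width, util in sorted(utilisation.items(), reverse=True):
    print(f"VLIW-{width}: {util:.0%} of slots busy on average")
```

Going from 5-wide to 4-wide lifts average slot utilisation from 70% to 87.5% on the same workloads, which is presumably the clue referred to above.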

In the PC space you need an architecture that performs at its "best" at launch, or within a meaningful lifespan (one year?). In the console space you want an architecture that is more future-oriented, and the trend is increasing shader workloads... and computing. A future-oriented console architecture would probably have a much higher ALU:texture ratio, and maybe eliminate some fixed function (well, given the failure of Larrabee we will have to stick with TMUs, but we might get rid of ROPs at this point, and maybe also the fixed-function tessellation unit).

For example:
32 SIMD 64-wide (4*16 ALU cluster)
64 TMUs
128bit MC
32 to 64 MB of L3 cache on die.
At 28nm, it shouldn't be much over 200-250 mm^2.
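Totting up that configuration (the clock speed below is an assumption for illustration, not part of the proposal):

```python
# Totals for the proposed 32 SIMD x 64-ALU configuration.
simds = 32
alus_per_simd = 64          # 4 clusters x 16 ALUs
clock_ghz = 0.8             # assumed clock, for illustration only

total_alus = simds * alus_per_simd
# One FMA per ALU per cycle counts as 2 flops.
gflops = total_alus * 2 * clock_ghz
print(f"{total_alus} ALUs, ~{gflops:.0f} GFLOPS at {clock_ghz} GHz")
```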
I think XDR2 would have a higher cost.


What kind of bandwidth would be available with on-die RAM at this point, and is it needed above and beyond the 200-250 GB/sec that a shared memory pool would offer?

Internal crossbar can have a bandwidth as high as 1 Terabyte/s.

BTW: Tim Sweeney (Epic) said that for a real leap in game technology a huge step in bandwidth is needed... on the order of a terabyte/s. Since there is no news about the Terabyte Bandwidth Initiative from Rambus, the only way to achieve this is with a large cache.
And someone from DICE, in one of his presentations, said that it's time to move to 16-wide ALUs.
 
NVIDIA and Microsoft simultaneously dropping out of the PCGA, Microsoft hiding Fermi-specific shading-language assembly instructions from the DX 11 documentation... together with the mobile cooperation, I would say the writing is on the wall.
 
assen said:
Nvidia announcing *now* a SoC which will utterly destroy current *phones* in six months makes them a shoo-in for a system that has to compete favorably with PCs in 2013-2014? That was quite the leap of faith.
Maxwell says hi...
 
What if they keep the same SIMD width? So four 16-ALU clusters, for 64 units per SIMD.
Due to the elimination of redundant transistors, SIMDs should be smaller.
SIMD-width has more to do with the number of clusters, not ALUs.
The helpful thing about having a 4 or 5 ALU cluster is that those counts are roughly what is needed to synthesize the range of more complex operations found in the ISA, with the added benefit of allowing the compiler to repurpose the components of those instructions to perform 4 or 5 simpler operations.
There isn't an instruction that needs 16 simpler operations, and it would be extremely challenging to find enough ILP to fill a 16-wide VLIW instruction bundle.

The instruction bundles would be roughly 4 times as wide, and with the lack of ILP, mostly unused.
The register file would be badly taxed by this as well. It is designed to provide peak read bandwidth for 4 FMADD operations. It would be more complex to supply 16.
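The register-read pressure is easy to put numbers on. Assuming three source operands per FMADD and ignoring operand reuse or forwarding (a rough count, not a port design):

```python
def reads_per_cycle(alus, srcs_per_op=3):
    """Register-file reads needed per cycle if every ALU issues an FMADD."""
    return alus * srcs_per_op

print(reads_per_cycle(4))   # 4-ALU cluster
print(reads_per_cycle(16))  # 16-ALU cluster
```

Four times the ALUs means four times the read bandwidth — 12 versus 48 operands per cycle — which is what makes the wider register file so much harder to build.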
The lack of banking means routing data is much harder.

I'm not sure if the 16 ALU-cluster SIMD would be smaller. It would most likely be 1/4 utilized for almost all workloads.
 
NVIDIA and Microsoft simultaneously dropping out of the PCGA, Microsoft hiding Fermi-specific shading-language assembly instructions from the DX 11 documentation... together with the mobile cooperation, I would say the writing is on the wall.

The PCGA drop-out is something I saw news about.
Is there more information on the assembly instructions?
 
Internal crossbar can have a bandwidth as high as 1 Terabyte/s.

BTW: Tim Sweeney (Epic) said that for a real leap in game technology a huge step in bandwidth is needed... on the order of a terabyte/s. Since there is no news about the Terabyte Bandwidth Initiative from Rambus, the only way to achieve this is with a large cache.
And someone from DICE, in one of his presentations, said that it's time to move to 16-wide ALUs.

Well if that is the case, does that not seal the deal? If you have to have it and the only way to get it is through on-die memory then ... ? (Not an accusation, just saying the decision may be made for the manufacturers.)
 
The only way to get such a huge boost in throughput from cache is with tile based rendering, unless the entire frame fits of course (and of course even then with really heavy unique texturing you're still going to be external bandwidth limited).
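Whether the whole frame fits is easy to check against the 32-64 MB cache sizes floated earlier (the resolution and bit depths below are assumptions for illustration):

```python
def framebuffer_mb(width, height, bytes_per_pixel):
    """Size of a full-frame render target in MiB."""
    return width * height * bytes_per_pixel / 2**20

# RGBA8 colour + 32-bit depth = 8 bytes/pixel at 1080p.
base = framebuffer_mb(1920, 1080, 8)
print(f"1080p colour+depth: {base:.1f} MB")      # fits in 32 MB
print(f"with 4x MSAA:       {base * 4:.1f} MB")  # marginal even at 64 MB
```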

Using generic cache as a render target (like Larrabee) is a bit of a waste ... for small tri's (which is where things are heading) you want really fine grained access (32 bit banked ports, similar to GPU local memory only with some FIFOs to even out the load on the banks) and normal caching would add lots of overhead for that.
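A minimal sketch of the fine-grained banked addressing being described, assuming 32 banks of 32-bit ports (both counts are assumptions, not from the post):

```python
NUM_BANKS = 32
WORD_BYTES = 4  # 32-bit ports

def bank_of(addr):
    """Bank serving a byte address: consecutive words hit consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS

# Eight adjacent 32-bit pixels land in eight different banks,
# so small-triangle writes stay conflict-free.
print([bank_of(4 * i) for i in range(8)])  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

The FIFOs mentioned above would then sit in front of each bank to absorb whatever residual conflicts remain.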
 
Using generic cache as a render target (like Larrabee) is a bit of a waste ... for small tri's (which is where things are heading) you want really fine grained access (32 bit banked ports, similar to GPU local memory only with some FIFOs to even out the load on the banks) and normal caching would add lots of overhead for that.

If you are doing one work item per fragment, why not keep the render target data in registers? Pixels aren't going to talk so why use a shared resource?

If the shading is happening in OpenCL or equivalent, then it doesn't matter anyway.

Using generic cache obviously wastes some area and/or power compared to a dedicated shared mem, but the unification of reg/shared/cached mem is a VERY BIG DEAL, imo. The increase in flexibility of the mem hierarchy is totally worth it for me.
 
Pixels do talk ... z-buffering, blending, atomics.

PS. just because the memory space is unified doesn't mean there can't be specialized caches.
 