Predict: The Next Generation Console Tech

It's a good design feature for a GPU using eDRAM. The alternative is to sit the ROPs outside the eDRAM and have them share the same GPU<>eDRAM bus. I'd be surprised if MS don't go with the same design, but maybe there's a downside I'm unaware of? Only I'd expect the eDRAM to be advanced further to have full GPU access for read/write.

I expect MS to use eDRAM as a very high bandwidth victim cache for main memory. Possibly with controls to segment the eDRAM for various uses, ie. lock parts for texture caching, render targets, compute jobs etc.

That way you can allocate massive buffers for expensive AA (expensive in memory terms, like MSAA) in main memory but only have to store the actual fragments used in eDRAM.
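To put a rough number on "expensive in memory terms" (purely my own illustrative figures, assuming a 1080p target with 4xMSAA and 32-bit colour plus 32-bit depth/stencil):

```python
# Rough MSAA render-target footprint (illustrative assumptions, not a leaked spec)
width, height = 1920, 1080       # assumed 1080p render target
samples = 4                      # assumed 4xMSAA
bytes_per_sample = 4 + 4         # 32-bit colour + 32-bit depth/stencil

msaa_bytes = width * height * samples * bytes_per_sample
no_msaa_bytes = width * height * bytes_per_sample
print(f"4xMSAA: {msaa_bytes / 2**20:.1f} MiB, no MSAA: {no_msaa_bytes / 2**20:.1f} MiB")
# -> ~63.3 MiB vs ~15.8 MiB, which is why you'd rather keep only the touched
#    fragments in a small eDRAM pool and spill the rest to main memory
```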

Cheers
 
I guess it makes sense that it's a monster because of all the added chips. ARM is in there too, isn't it? And hell, maybe even that PowerPC CPU for back-compat.

What added chips? Most extra things probably won't take up much space. If we're talking an ARM TrustZone implementation for example, then you should consider that it probably takes up less than 1 mm². AMD's UVD and VCE probably don't take up much space either.
 
McHuj said:
IMO, it isn't so much the complexity of adding instructions, but moving the work elsewhere to a more efficient processor. A DSP will be a simple in-order core, probably clocked under 500 MHz, and may only be fixed point (although I don't know if modern audio DSPs have moved on to floating point yet or not). It's going to be a tiny core in comparison to a CPU core. For DSP-type workloads, its power efficiency can be orders of magnitude better than the CPU core's. I'd guess that AMD/MS would just license a core from a DSP company and they wouldn't be reinventing the wheel.

Chances are the DSP will execute out of its own cache or local memory space. So not only do you get the benefit of off-loading all the compute to it, you're not polluting the main core's caches with that data either. It's a win-win, imo.

TI's high-end DSPs have fast float engines, last time I checked. You can get hybrid designs pretty easily now; something like a Cortex-Mx with a DSP engine + decompression unit could be pretty tiny and very capable...
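To illustrate what "fixed point" means in practice, here's a tiny sketch (in Python, not any real DSP's instruction set) of the Q15 multiply-accumulate that audio DSPs spend most of their time doing:

```python
# Illustrative Q15 fixed-point multiply-accumulate (not any particular DSP's ISA)
def q15(x: float) -> int:
    """Convert a float in [-1, 1) to a signed 16-bit Q15 integer."""
    return max(-32768, min(32767, int(round(x * 32768))))

def q15_mac(acc: int, a: int, b: int) -> int:
    """acc += a * b where a and b are Q15; the Q30 product is shifted back to Q15."""
    return acc + ((a * b) >> 15)

# Mix two samples at 0.5 gain each
acc = 0
acc = q15_mac(acc, q15(0.5), q15(0.8))    # 0.5 * 0.8 = 0.4
acc = q15_mac(acc, q15(0.5), q15(-0.2))   # 0.5 * -0.2 = -0.1
print(acc / 32768)                        # ~0.3
```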

Thanks to both of you for your explanations.

If I am not wrong, you could have a tiny DSP+blitter chip (http://forum.beyond3d.com/showpost.php?p=1693561&postcount=47) with direct access to resources without bothering the CPU.
 
I'm skeptical we'll see a 256-bit bus for the DDR3, for size reasons. Am I wrong?

That would be only ~34 GB/s at 128-bit.

Perhaps all the supposed specialized hardware is precisely to work around the extremely low bandwidth.
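For reference, the back-of-envelope math behind those figures (DDR3-2133 is just my assumption for the transfer rate):

```python
# Peak DDR3 bandwidth = bus width in bytes * transfer rate (assumed DDR3-2133)
def peak_bw_gbs(bus_bits: int, transfer_mts: int = 2133) -> float:
    return bus_bits / 8 * transfer_mts / 1000   # GB/s

print(peak_bw_gbs(128))   # ~34.1 GB/s on a 128-bit bus
print(peak_bw_gbs(256))   # ~68.3 GB/s on a 256-bit bus
```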
 
To answer both of you, the rendering of pixels is a large consumer of bandwidth. Moving that to the smart eDRAM is equivalent to saving that BW from main RAM, but not the same as adding the BWs together (in the early days of PS3 vs. XB360 we had the ludicrous PR bandwidth wars on top of the FLOPS wars and the pre-render wars). Xenos's internal BW was exactly what was needed to never starve the ROPs, but the ROPs wouldn't be consuming that much BW all the time, so it wouldn't be sitting at the 256 GB/s peak in practice.
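For anyone who doesn't remember where the 256 GB/s Xenos figure comes from, this is the usual back-of-envelope derivation (peak numbers only, which is exactly the point):

```python
# Xenos eDRAM internal peak bandwidth, back-of-envelope
pixels_per_clock = 8            # Xenos ROPs output 8 pixels per clock
msaa_samples = 4                # up to 4xMSAA handled inside the eDRAM
bytes_per_sample = 8 + 8        # 4B colour + 4B Z/stencil, read plus write
clock_hz = 500e6                # 500 MHz GPU clock

peak = pixels_per_clock * msaa_samples * bytes_per_sample * clock_hz
print(peak / 1e9, "GB/s")       # -> 256 GB/s, only when every ROP is busy every clock
```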

It's a good design feature for a GPU using eDRAM. The alternative is to sit the ROPs outside the eDRAM and have them share the same GPU<>eDRAM bus. I'd be surprised if MS don't go with the same design, but maybe there's a downside I'm unaware of? Only I'd expect the eDRAM to be advanced further to have full GPU access for read/write.

I see. So if I am getting this correctly, as these pixel operations are the main consumers of bandwidth, moving them to the eDRAM enables them to have all the bandwidth they want and need. So does that mean that the main bandwidth was enough for the other graphics operations, given the size and capability of the GPU?
 
I expect MS to use eDRAM as a very high bandwidth victim cache for main memory.
I suppose a lot depends on the BW of the eDRAM. POWER7 provides 100 GB/s. Wii U, OTOH, appears to just be using the eDRAM at maybe around the 20-30 GB/s mark. I assume MS are going high-end on the eDRAM, perhaps mitigating the value of ROPs in the eDRAM, but that still strikes me as a smart move.
 
I'm skeptical we'll see a 256-bit bus for the DDR3, for size reasons. Am I wrong?

That would be only ~34 GB/s at 128-bit.

Perhaps all the supposed specialized hardware is precisely to work around the extremely low bandwidth.
Oh hey specialguy, what 128? Again? :LOL:
 
I'm skeptical we'll see a 256-bit bus for the DDR3, for size reasons. Am I wrong?

That would be only ~34 GB/s at 128-bit.

Perhaps all the supposed specialized hardware is precisely to work around the extremely low bandwidth.

It's possible. 34 GB/sec would be about 50% more BW than the 360. But if you look at the Wii U, it's getting comparable results to the 360 with only 12.8 GB/sec of main memory BW. That tells me it's a lot more efficient than the older consoles. Perhaps 34 GB/sec (~1 GB per frame at 30 fps) would be sufficient.
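Spelling out the per-frame figure and the comparison to the 360 (using the 360's 22.4 GB/s GDDR3 as the baseline):

```python
# Per-frame bandwidth budget at 34 GB/s, compared to the 360's main memory
bw = 34.0                 # GB/s, the 128-bit DDR3 figure above
xbox360_bw = 22.4         # GB/s, 128-bit GDDR3 @ 1400 MT/s

print(bw / 30, "GB per frame at 30 fps")    # ~1.13 GB/frame
print(bw / 60, "GB per frame at 60 fps")    # ~0.57 GB/frame
print(f"{bw / xbox360_bw:.2f}x the 360's main memory BW")   # ~1.52x
```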
 
There are too many users on GAF who can't read...

McHuj said:
It's possible. 34 GB/sec would be about 50% more BW than the 360. But if you look at the Wii U, it's getting comparable results to the 360 with only 12.8 GB/sec of main memory BW. That tells me it's a lot more efficient than the older consoles. Perhaps 34 GB/sec (~1 GB per frame at 30 fps) would be sufficient.

But that would still only just be double the amount of assets compared to current gen games.
 
It's possible. 34 GB/sec would be about 50% more BW than the 360. But if you look at the Wii U, it's getting comparable results to the 360 with only 12.8 GB/sec of main memory BW. That tells me it's a lot more efficient than the older consoles. Perhaps 34 GB/sec (~1 GB per frame at 30 fps) would be sufficient.

Or it could mean that anything BW intensive was avoided this gen. Is that how it'll be next gen?
 
I'm skeptical we'll see a 256-bit bus for the DDR3, for size reasons. Am I wrong?

That would be only ~34 GB/s at 128-bit.

Perhaps all the supposed specialized hardware is precisely to work around the extremely low bandwidth.

I think the 256-bit bus would fit well; the cost of the bus itself you just eat. There's the I/O as well, but it's pretty serial in nature (even display I/O?); you could maybe get away with 4 PCIe 2.0 lanes or an 8-bit wide HyperTransport 3.1 link to a southbridge.

Die size can be pretty big: you have the dual Jaguar, the on-die RAM, the GPU, and some redundancy. Desktop/laptop APUs are not small chips either.

For the 20nm shrink we might worry more. It would maybe work with a combination of factors: having a not-so-small chip at 28nm in the first place, using GPU redundancy, even a bit of empty space, and the shrink maybe not giving that big of a density increase (that single-nanometer figure hides a reality that gets more and more tricky each time, I think. Edit: TSMC claims a 1.9x density improvement, so that latter idea of mine might be bunk).
 
See my post:
http://forum.beyond3d.com/showpost.php?p=1693204&postcount=18196


That's why I also think that if it were 1.6 or 1.7 TF, it would be too close in power for both your source and lherre to say it's definitively less than PS4; they'd say it was slightly less or about the same, etc., if there was only a 10% difference in FLOPS.

That and the fact that I haven't seen any Durango rumours mentioning a GPU in the 1.6-1.8 range, most of the rumours with FLOPS point to 1.2 TF or thereabouts (or go wildly over 2+, 3+ etc)

Cool. Kinda sounds like a Crossfire setup, and would explain the uniqueness of the design being mentioned rather consistently over time. If that's the case it's more believable, though for me it also goes back to the AMD FLOPS vs. nVidia FLOPS debate, and that still wouldn't be close to a 680 if the main GPU is comparable to an 8000-series 7770 equivalent. It should surpass the PS4, though, based on what's known so far. Only time will tell how much, if this is the case.
 
DDR4 banking is different than in DDR3. In some pathological cases, I can see a workload running much slower on DDR4 than it does on DDR3 of the same specified speed. I don't see them doing that.

Which is fine, because they don't have to. DDR4 is not in some way inherently more expensive. It's made in the same facilities by the same companies as DDR3, and each die is about the same size. Its market price is simply a reflection of the fact that there is little demand or supply. If you placed a large order for DDR4 for the summer, you'd probably get a price reasonably close to the DDR3 price.

Same facilities, but what about the production tooling?
They would have to supply many millions of chips for the launch and the following months; then Intel launches Haswell-EX in 2014, which will routinely be used to build computers with 1TB of RAM (128x the amount of an Xbox).

So, a big demand on the new tech, and you're asking them to supply it at the price they sell DDR3 for, which is at a loss.
That's my reasoning (but I don't mind being wrong, if so be it :))
 
I see. So if I am getting this correctly, as these pixel operations are the main consumers of bandwidth, moving them to the eDRAM enables them to have all the bandwidth they want and need. So does that mean that the main bandwidth was enough for the other graphics operations, given the size and capability of the GPU?

An old rule of thumb is that the ROPs take half of the total BW, so adding a separate fast pool effectively doubles the total bandwidth (regardless of how fast the second pool is). Deferred rendering makes this more complex.
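Putting illustrative numbers on that rule of thumb (the 68 GB/s figure is just an example, not a rumored spec):

```python
# If ROP traffic is ~half of total demand, moving it to its own pool roughly
# doubles what the rest of the GPU can pull from main memory.
main_bw = 68.0              # GB/s, hypothetical main memory bandwidth
rop_share = 0.5             # rule of thumb: ROPs consume ~half the bandwidth

without_edram = main_bw * (1 - rop_share)   # left for textures, geometry, etc.
with_edram = main_bw                        # ROP traffic now goes to the eDRAM
print(without_edram, "->", with_edram, "GB/s available to everything else")
```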

On the subject of the 1TF GPU in Durango, isn't the HD8800M a 1TF GPU designed for laptops?

The 8800M is just a binned Cape Verde (=HD7700) with a new name.

Same facilities, but what about the production tooling?
They would have to supply many millions of chips for the launch and the following months; then Intel launches Haswell-EX in 2014, which will routinely be used to build computers with 1TB of RAM (128x the amount of an Xbox).

As I understand it, retooling for DDR4 just needs new testers and masks. There's no reason you can't churn them out from the same fab with minimal changes between runs.

So, a big demand on the new tech, and you're asking them to supply it at the price they sell DDR3 for, which is at a loss.
That's my reasoning (but I don't mind being wrong, if so be it :))

The reason DDR3 is so cheap is that there is oversupply because of the state of the economy. This also makes all the foundries desperate for orders. They'd be happy to sell you DDR4 at a reasonable markup.
 
Aegis has stated that Durango has a monster SoC and that both next-gen consoles consume around 170-200W (he said that in two different statements on GAF).

The current rumored specs leave quite a large space for our "mystery sauce".

Which is simply incompatible with the rumors of 8 jaguar cores at 1.6 GHz and a 1.2 TF GPU. That alone can be achieved in less than 125W. So, if that's correct, either our ~1.2TF number is wrong or there's a lot of custom stuff on this SoC eating 30W+ by itself.
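For context on where a ~1.2 TF figure could come from, GCN FLOPS fall straight out of shader count times clock; the configs below are just plausible examples, not leaked specs:

```python
# GCN single-precision TFLOPS = shader ALUs * 2 ops per cycle (FMA) * clock
def gcn_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000

print(gcn_tflops(768, 0.8))    # 768 SPs @ 800 MHz  -> ~1.23 TF
print(gcn_tflops(640, 1.0))    # Cape Verde / HD 7770 -> ~1.28 TF (an 80 W desktop card)
print(gcn_tflops(1152, 0.8))   # 1152 SPs @ 800 MHz -> ~1.84 TF (the Orbis-like rumour)
```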

A few questions based on the rumors we've heard.

The consensus seems to be that MS will go with motherboard-based RAM and not stacking, and rumors peg them as having the lower-power SoC, with a lower-power CPU and GPU compared to what Orbis is rumored to have (at least a comparable CPU, 1.8TF GPU). Given Sony is likely more risk- and cost-averse because of their financial situation, how does it make sense that they would want to take on the engineering (with heat and new technology), cost, and schedule risk of putting RAM on top of a hot part, as opposed to a conventional method?

Second, what are the use cases where a single fast traditional RAM bus (say providing 192+ GB/s) beats a slower traditional RAM bus (say under 70 GB/s) paired with a fast local RAM (say >512 GB/s, 64 MB)? And vice versa?

Also forgot to ask, adding on to the above: do the various memory schemes have any benefits or downsides when talking about utilizing PRT? What about adding SVO capabilities? Too early?
 
1T-SRAM, which, as you said, is basically EDRAM, sounds good. Wikipedia only reports the densities up to 45 nm. If we assume at 28 nm the density, including the overhead, is roughly ~0.08 mm^2 per Mbit, 100 MB of 1T-SRAM would be roughly 64 mm^2. It should be doable in a large SoC.
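Spelling that estimate out (the ~0.08 mm²/Mbit density at 28 nm is an extrapolation, as noted):

```python
# 1T-SRAM area estimate at 28 nm; density is an extrapolated assumption
density_mm2_per_mbit = 0.08
capacity_mbytes = 100

area_mm2 = capacity_mbytes * 8 * density_mm2_per_mbit
print(area_mm2, "mm^2")        # -> 64 mm^2
```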



How expensive will DDR3/DDR4 be 3-4 years from now, though? The long term cost might be in favor of DDR4.



If there is a large chunk of fast 1T-SRAM/EDRAM, it should be fine. Otherwise, it will be hard to compete with the PS4 if it has ~192 GB/s of main memory bandwidth.



The best option, in my opinion, would be a high-speed ring bus with attached:
- the GPU second-level caches*
- the CPU second-level caches
- the 1T-SRAM/EDRAM
- the 256-bit wide memory interface.
In this way the 1T-SRAM/EDRAM could be mapped to a specific memory address range and accessed by both the CPU and GPU for any intent and purpose.

*The current bandwidth of the L2 caches in GCN is 64B/cycle per 64-bit channel. At 1 GHz with 4 channels (256-bit), the peak is 256 GB/s. Considering the chip is supposedly highly customized, I wouldn't rule out a doubling of the bandwidth to 128B/cycle per 64-bit channel, though. With a sufficiently fast ring bus, this would allow up to 512 GB/s of bandwidth from the 1T-SRAM/EDRAM to the GPU.
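The arithmetic behind those cache-bandwidth numbers, for the record:

```python
# GCN L2 bandwidth = bytes per cycle per channel * 64-bit channels * clock
def l2_bw_gbs(bytes_per_cycle: int, channels: int, clock_ghz: float) -> float:
    return bytes_per_cycle * channels * clock_ghz

print(l2_bw_gbs(64, 4, 1.0))    # stock GCN: 64 B/cycle x 4 channels @ 1 GHz -> 256 GB/s
print(l2_bw_gbs(128, 4, 1.0))   # hypothetical doubled width -> 512 GB/s
```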

High speed ring bus connecting SRAM sounds like an improved Cell EIB. ^_^
The key differences are shared memory, and better integrated GPU.

The rumored specialized units (audio DSP and other pseudo-GPU units, also on ring bus) remind me of the SPUs with their LocalStores too, 'specially when Microsoft referred to the SPUs as DSPs early this generation.

In Cell's original design, there is also a Super Companion Chip hooked up to the ring bus (via FlexIO) to prioritize requests into DDR memory and I/O devices to ensure QoS.


I'm also curious about the rumored MS gaming portable/tablet. Will it share design elements with the home console so that developers only need to make "one" general version.
 
Which is simply incompatible with the rumors of 8 jaguar cores at 1.6 GHz and a 1.2 TF GPU. That alone can be achieved in less than 125W. So, if that's correct, either our ~1.2TF number is wrong or there's a lot of custom stuff on this SoC eating 30W+ by itself.
It's not so bad. If the PSU has 80% efficiency, with 7W for the HDD, 7W for the Blu-ray drive, and 5W for misc hardware:
170W to 200W power consumption means there's about 115W to 140W left for the SoC and its memory.
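Written out, with the PSU efficiency and per-component draws being the assumptions above:

```python
# Power left for the SoC + memory after PSU losses and fixed components (assumed figures)
psu_efficiency = 0.80
fixed_draw_w = 7 + 7 + 5        # HDD + Blu-ray drive + misc, watts

for wall_w in (170, 200):
    dc_w = wall_w * psu_efficiency
    print(wall_w, "W at the wall ->", round(dc_w - fixed_draw_w), "W for SoC + memory")
# 170 W -> ~117 W, 200 W -> ~141 W
```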
 