Predict: The Next Generation Console Tech

(big die size + GDDR5) vs (smaller die + edram + DDR3)

I see two cost reductions on the right (DDR3, small die) and only one on the left (minus EDRAM).

With DDR3 so dang cheap they could pack in a ton too. Like 8GB.

I'm just not technical enough to know if DDR3+EDRAM could really be enough for the bandwidth needs. Isn't DDR3 pitifully slow? Even with the EDRAM wouldn't it hinder things?
 
Isn't DDR3 pitifully slow?

hm... I suppose if they picked DDR3-1866, that gives them 14.9GB/s per 64-bit memory channel. Still not very flattering. DDR3-2500 might be too expensive, I don't know. It's beyond JEDEC spec.
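
For anyone wanting to poke at the arithmetic, here's the back-of-envelope I'm using (peak theoretical numbers only, and DDR3-xxxx is the transfer rate in MT/s, not the actual clock):

[code]
# Peak DDR3 bandwidth: transfers per second * bytes per transfer per channel.
def ddr3_bandwidth_gbs(transfer_rate_mts, channels=1):
    bytes_per_transfer = 8            # 64-bit channel
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9

print(ddr3_bandwidth_gbs(1866))       # ~14.9 GB/s per 64-bit channel
print(ddr3_bandwidth_gbs(2133))       # ~17.1 GB/s, top JEDEC bin
[/code]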

Even with the EDRAM wouldn't it hinder things?
Right... that'll depend on just how functional the edram is, I guess (more than just a place for framebuffers). Maybe one of you devs can chime in here on some hypothetical edram functionality, though I wonder if that's just too prohibitive from a hardware design POV. *shrug*

They can certainly mitigate main memory texture bandwidth reqs by increasing texture cache size on the GPU itself. Xenos was pretty pitiful there, so main mem would have to be hit up more often, sucking up more main mem bandwidth.
 
They can certainly mitigate main memory texture bandwidth reqs by increasing texture cache size on the GPU itself. Xenos was pretty pitiful there, so main mem would have to be hit up more often, sucking up more main mem bandwidth.

Yeah, wouldn't virtual texturing and a stupidly large texture cache massively reduce texture accesses to main memory? Wasn't it something like 7 MB being enough to texture a 1280 x 720 frame if you're using a texture created just for that frame?

If the CPU had access to the texture cache maybe you could keep your up-to-date virtual texture constantly in cache. And if all your framebuffer(s) could stay in embedded video memory then almost all your graphics data reads/writes would stay away from main memory.
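
For what it's worth, here's a rough sanity check on that 7 MB figure; the assumptions (one unique texel per screen pixel, 4 bytes per texel, a mip chain on top, some slack for page borders and overdraw) are mine, not from any particular engine:

[code]
# Back-of-envelope working set for virtual texturing at 720p.
# Assumptions (mine): ~1 unique texel per screen pixel, 4 bytes per texel,
# ~1.33x for the mip chain, 1.5x slack for page borders/filtering/overdraw.
width, height = 1280, 720
bytes_per_texel = 4
mip_overhead = 4.0 / 3.0
slack = 1.5

working_set_mb = width * height * bytes_per_texel * mip_overhead * slack / 2**20
print(round(working_set_mb, 1))   # ~7.0 MB, in the ballpark of the figure quoted
[/code]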
 
that gives them 14.9GB/s per 64-bit memory channel.

There seems to be quad-channel DDR3 coming into being in the PC world. Given a couple more years it might be viable for a console.

That could give 60 GB/s of bandwidth. Woeful compared to today's graphics cards, of course.

Let's max it out: at DDR3-2400 it would be ~76 GB/s.
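
Same formula as in the earlier post, just scaled out to four channels (DDR3-2400 being outside the normal JEDEC bins):

[code]
# Quad-channel DDR3 peak bandwidth: MT/s * 8 bytes * 4 channels.
for rate_mts in (1866, 2133, 2400):
    print(rate_mts, rate_mts * 1e6 * 8 * 4 / 1e9)   # ~59.7, ~68.3, ~76.8 GB/s
[/code]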

I'm not sure if quad channel is cost effective from a board design standpoint.

Seems like around half of what current upper-range GPUs sport.

The X1800 XT was ATI's top card around the release of the X360, and had 48 GB/s. So the 360's 22.4 GB/s on its main bus was around half... so that seems reasonable, I guess. Not that I expect this to happen.
 
The X1800 XT had 48GB/s, the 7800 GTX 512MB had 54.4GB/s. I wouldn't be surprised if next-gen has just half the bandwidth of the high-end GPUs of its time.

Yeah, but that was with the help of EDRAM in the 360's case (and the PS3 had 40+GB/s of BW combined), which I would expect again in this scenario.
 
In regard to this article, first, it's not coming from a reputable website so it's safe to discard. The statements coming from Clubic about Nintendo production issues were way more solid, coming from one of the biggest French websites.
Then, in regard to "dual core GPU": they state "double GPU" in the paper, and a proper translation would be "dual GPU setup" (think Crossfire/SLI).

Catching up with the talk going on about DDR3 and EDRAM, I still find it bothersome that the size of the EDRAM is likely to grow faster than the bandwidth to the main pool of RAM, which means that resolves to main RAM will take longer, and the same the other way around (like fetching the framebuffer back to the GPU for post-processing purposes).
There is also something else that bothers me; I've asked about it more than once without getting a proper explanation. In modern GPUs the ROPs/RBEs are tied to the L2, right? And the ROPs/RBEs along with the L2 are used for atomic operations, right? I don't know whether, when a modern GPU wants to read a render target, it always does so from the VRAM or whether it sometimes does so from the L2. Modern GPUs can read from and write to a render target, right? It means the communication between a hypothetical main die and the daughter die has to be faster and the EDRAM has to be more complex (I mean, if you're done with the RT you resolve to main RAM at +x ms and fetch it to the shader core at +x ms, but what if you're not?).
From my outsider POV I see more implications in having separate ROPs/RBEs plus a form of VRAM than with last-gen hardware, as modern GPUs are able to do more things than in 2005/6. Basically you would want the EDRAM to act as a tiny, high-bandwidth VRAM, which in the 360, for example, it doesn't, and that doesn't look trivial to me. Could someone explain more on the matter?
 
There seems to be quad-channel DDR3 coming into being in the PC world. Given a couple more years it might be viable for a console.
...
I'm not sure if quad channel is cost effective from a board design standpoint.
Well, that also necessitates a pretty large amount of die area along the border.

e.g. Lynnfield's triple channel memory controller is almost entirely along its longer dimension (~22.01mm x 13.45mm -> 296mm^2). A rough calc would put a quad-channel interface up to 26mm of real-estate.

RV770's 256-bit GDDR5 memory I/O takes up about 2.5 sides of the chip (16mm x 16mm -> 256mm^2). So... ~24mm of real-estate (pretty close to the above example for just rough eyeball estimates here).
 
Well, that also necessitates a pretty large amount of die area along the border.

e.g. Lynnfield's triple channel memory controller is almost entirely along its longer dimension (~22.01mm x 13.45mm -> 296mm^2). A rough calc would put a quad-channel interface up to 26mm of real-estate.

RV770's 256-bit GDDR5 memory I/O takes up about 2.5 sides of the chip (16mm x 16mm -> 256mm^2). So... ~24mm of real-estate (pretty close to the above example for just rough eyeball estimates here).
That's why I want a pretty big SoC: start on a 256-bit bus with the slowest GDDR5, then do an Nvidia and move to a 192-bit bus and, if there is a second shrink, a 128-bit one. The GDDR5 roadmap should allow for such a replacement even before the aforementioned shrinks, but delaying until a die shrink is a good bet so you never use the top speed and thus the most expensive kind of RAM (and you don't redesign the chip too many times, with the matching R&D, testing and production costs).
 
That's why I want a pretty big SoC: start on a 256-bit bus with the slowest GDDR5, then do an Nvidia and move to a 192-bit bus and, if there is a second shrink, a 128-bit one. The GDDR5 roadmap should allow for such a replacement even before the aforementioned shrinks, but delaying until a die shrink is a good bet so you never use the top speed and thus the most expensive kind of RAM (and you don't redesign the chip too many times, with the matching R&D, testing and production costs).

I'm not sure I understand. To replace a 256-bit bus with 192-bit, you'd need the GDDR5 speed to increase by 33%, and then a subsequent increase by 50% when moving to 128-bit. What GDDR5 speeds are you thinking of starting with?

You'd probably have to do something funky with the # of RAM chips (when you think about 192-bit) so that you can still maintain your target RAM amount whilst decreasing the available I/O. The roadmap for GDDR5 is looking pretty slow as far as hitting beyond 2Gbit. (1 GDDR5 chip per 16-bit or 32-bit I/O is doable; the implications are in the wiring.)

----------
hm... so let's say you started with
256-bit, 8x2Gbit chips -> 2GB
192-bit, 12x derp or 6x derp
128-bit, 8x2Gbit chips (clamshell mode) or 4x4Gbit (non-existent on the roadmap last I checked, might be wrong)

----------

I'm not sure that's a wise thing to do as far as making sure the game runs as intended, similar to how they kept the FSB the same when designing the 360 Slim CGPU. *shrug*
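
To spell out the chip-count arithmetic behind those configurations (my sketch; I'm assuming each GDDR5 device drives 32 bits of I/O, or 16 bits each when two devices share a channel in clamshell mode, so the 192-bit line would be either 6 devices or 12 in clamshell):

[code]
# Bus width and capacity for some GDDR5 configurations.
# Assumption: 32-bit I/O per device, or 16-bit per device in clamshell mode.
def config(chips, density_gbit, clamshell=False):
    bus_bits = chips * (16 if clamshell else 32)
    capacity_gb = chips * density_gbit / 8.0
    return bus_bits, capacity_gb

print(config(8, 2))                    # (256, 2.0)  256-bit, 2GB
print(config(6, 2))                    # (192, 1.5)  192-bit, 1.5GB
print(config(12, 2, clamshell=True))   # (192, 3.0)  192-bit, 3GB
print(config(8, 2, clamshell=True))    # (128, 2.0)  128-bit, 2GB
print(config(4, 4))                    # (128, 2.0)  128-bit, 2GB with 4Gbit parts
[/code]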
 
I'm not sure I understand. To replace a 256-bit bus with 192-bit, you'd need the GDDR5 speed to increase by 33%, and then a subsequent increase by 50% when moving to 128-bit. What GDDR5 speeds are you thinking of starting with?

I'm not sure that's a wise thing to do as far as making sure the game runs as intended, similar to how they kept the FSB the same when designing the 360 Slim CGPU.
The slowest and cheapest, which should be around 800 MHz (800 MHz being the best case). It would provide north of 100GB/s of bandwidth which, taking into account today's RBE/ROP efficiency, should do the trick.
Actually Nvidia often uses slow RAM with wide buses whereas AMD goes with faster ram and narrower buses. From here:
The GTX470 uses GDDR5 clocked at 837MHz (slowest in this pool).
The HD6970 uses GDDR5 clocked at 1375 MHz (fastest in the pool).
It would already be possible to move from a 256-bit bus to a 192-bit one, but using expensive RAM.
My belief is that mimicking Nvidia could be the way to go: the RAM is there to allow an aggressive bus-size reduction (along with die-size shrinks and a reduction in the number of memory chips), and that allows you to use really cheap RAM, pass on what could be a complex EDRAM implementation (another bus, R&D, cost of the daughter die, etc.) and offer a leaner system to program for.
At this stage, a wide bus + cheap RAM looks a lot less like a headache than the alternative. The slowest DDR3 has to be cheaper than GDDR5, but not by that much, and you may not want to use the slowest DDR3 anyway.
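
As a check on the "north of 100GB/s" claim: GDDR5 transfers 4 bits per pin per quoted (command) clock, so 800MHz on a 256-bit bus works out as below (the GTX470/HD6970 lines use their actual 320-bit/256-bit buses):

[code]
# GDDR5 peak bandwidth: data rate is 4x the quoted command clock.
def gddr5_bandwidth_gbs(clock_mhz, bus_bits):
    return clock_mhz * 4e6 * bus_bits / 8 / 1e9

print(gddr5_bandwidth_gbs(800, 256))    # 102.4 GB/s -> "north of 100GB/s"
print(gddr5_bandwidth_gbs(837, 320))    # ~134 GB/s  (GTX470, 320-bit bus)
print(gddr5_bandwidth_gbs(1375, 256))   # 176 GB/s   (HD6970, 256-bit bus)
[/code]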
 
My belief is that mimicking Nvidia could be the way to go: the RAM is there to allow an aggressive bus-size reduction (along with die-size shrinks and a reduction in the number of memory chips),

I'm not so sure that's as simple a plan as you make it sound.

At each step, as you know, it's essential to redesign the entire chip layout aside from the simpler optical shrinks/half-nodes. But changing the memory I/O so drastically isn't that easy because now you have to make sure the rest of the chip can accommodate it, and who knows how the next process node will scale (possibility of wasted die area or just really strange layout). It's just another complicating factor, and we have seen how crappy and slow the scaling has been already this past generation.

FWIW, I think they would have to skip 192-bit on the path to 128-bit in your example, if only because you get a strange mismatch in memory chips and the targeted total memory amount.

Take Lynnfield for instance... It physically has triple channel, but Intel certainly didn't want to spend the time to redesign the layout (at the same process node even) for the majority of products that are limited to just the dual channel.

Meanwhile, you're dealing with changes to memory performance/characteristics. 256-bit @ X-MHz != 128-bit@ 2X-MHz.


---------

Actually, if you take the RV770 for example, the 256-bit I/O basically dictated a certain size for the chip so that you could get something rectangular. AMD was able to fit in two more SIMDs as a result of that, but of course, you won't have that luxury for the path of die shrinks. It would just be wasted die area.
 
I'm not so sure that's as simple a plan as you make it sound.

At each step, as you know, it's essential to redesign the entire chip layout aside from the simpler optical shrinks/half-nodes. But changing the memory I/O so drastically isn't that easy because now you have to make sure the rest of the chip can accommodate it, and who knows how the next process node will scale (possibility of wasted die area or just really strange layout). It's just another complicating factor, and we have seen how crappy and slow the scaling has been already this past generation.
I agree it's not completely trivial. In a previous post (quite some pages ago) I stated that a given manufacturer would have to plan in advance for things like the number of ROPs/RBEs as well as the L2 size. I was considering either a clean approach or simply disabling some ROPs/RBEs (expecting a possibly positive impact on yields).
As an example, say you have 4 "RBE blocks + L2", so 32 RBEs going by AMD's current chips; you would have to disable 2 RBEs per block and part of each quarter of the L2, or go with smaller blocks including only 6 RBEs and less cache. That's just an example they could fine-tune.

I don't get your part about "bad scaling": the bigger the chip remains, the easier it is to fit the memory controller / various I/O.
FWIW, I think they would have to skip 192-bit on the path to 128-bit in your example, if only because you get a strange mismatch in memory chips and the targeted total memory amount.
I wondered about that, and I'm happy you're digging further because I could not answer it by myself. When memory chips were of a smaller size it was maybe easier to go with "odd" amounts of RAM. I don't know how memory chip sizes scale (in granularity). I don't know what arrangement could be a good match, or if simply having, at some point (most likely the 192-bit bus revision), some RAM (most likely not much) unused/hidden from the developer would be much of a problem. Honest/naive question.
Take Lynnfield for instance... It physically has triple channel, but Intel certainly didn't want to spend the time to redesign the layout (at the same process node even) for the majority of products that are limited to just the dual channel.
Well, it's a custom design; trade-offs could be made (see above; I'm not sure, but it seems they could have options).
Meanwhile, you're dealing with changes to memory performance/characteristics. 256-bit @ X-MHz != 128-bit@ 2X-MHz.
Well, I find this less of a problem; both the 360's and the PS3's characteristics have remained untouched over their life spans, but that's not true for the PSP or the PS2 if memory serves right. For convenience's sake, marginal differences between revisions should not be ruled out.
Actually, if you take the RV770 for example, the 256-bit I/O basically dictated a certain size for the chip so that you could get something rectangular. AMD was able to fit in two more SIMDs as a result of that, but of course, you won't have that luxury for the path of die shrinks. It would just be wasted die area.
How is that different for a smaller chip using a narrower bus? The I/O will remain constant while the execution units / cores get tinier.

I acknowledge that the approach has issues that need to be considered seriously, but the same is true when you consider multiple chips, and so multiple shrinks, multiple buses, multiple cooling systems, etc. It's clearly disputable and I don't pretend to have the answer, but it looks to me worth the discussion, as I don't buy into the "it doesn't scale" argument (in regard to bus size).

[edit]
Actually, my talk about ROPs/RBEs is off. I think that chips like Llano already offer the solution: the ROPs/RBEs are not strictly tied to the memory controller (as in current AMD GPUs), since the CPU also has to access RAM, so there is "something" in between that arbitrates how much bandwidth to give a given unit. So the number of "RBEs/ROPs per block tied to a memory controller", which is problematic when you jump from 4 to 3 (256-bit to 192-bit and so on), is irrelevant.
Actually, even latencies (most likely improving overall with faster RAM) could be adapted to remain constant (so marginal differences/improvements could be good enough while remaining the cheapest option).[/edit]
 
That is an argument that has bugged me for some time:
Why can't Microsoft redesign an ultra-cheap 360 using a 64-bit bus and GDDR5?
Of course adapting the memory controller with more cache to compensate for the difference.
As I understand it, at this moment GDDR3 is costly because all the production shifted to the newer mainstream standard.

I remember that Nintendo did something similar with the DS, and (I'm not sure) the PStwo did it too.
If it's an option, can't a future home console harmonize quantity, speed and cost by changing memory configuration and standard down the road?
 
Changing the memory controller entirely (GDDR5 instead of GDDR3) isn't that trivial. Completely different signalling... and I've already mentioned the compatibility issue (RE: FSB replacement to maintain 100% equal functioning, and even transistor scaling itself can present issues). It's just more risk when it's not necessary (at least yet). Also keep in mind that Valhalla integrated the CPU and mother GPU dice, so it's still big enough that 128-bit is fine.

I don't know if they plan on going to 28 or 32nm for 360 at this point considering how late and expensive those will be for quite awhile until there's actually good fab capacity or until the process even matures. It's not just a simple shrink and the thing is cheaper. There are a lot more factors that can go into price than just wafer cost divided by the number of good chips.

You might also want to consider that the GDDR3 memory interface isn't the only I/O. There's also the massive link to eDRAM, and there's not really a whole lot they can do about it until they integrate both chips, which may be cost prohibitive anyway given the different and more costly method of manufacturing eDRAM. The eDRAM itself still appears to be on 65nm rather than 45nm (unless the 45nm design is considerably broken in terms of scaling).

----------
Also keep in mind that 512MB of GDDR5 @ 1Gbit is still 4 chips, so even if the costs were equal per chip compared to 1Gbit GDDR3, they wouldn't be saving anything. So what's the point?

2Gbit GDDR5 is still relatively new, so I wouldn't expect that to be cheaper at all.

As I understand it, at this moment GDDR3 is costly because all the production shifted to the newer mainstream standard.
Unless MS has a deal in place, but we'll never know. It's also still a production that is keeping the fabs busy if that's the case (since it's on an older process than GDDR5, and so would still carry some cost advantages to some extent).

Who knows, maybe the memory suppliers have surplus they need to get rid of (in the same vein as all those really low capacity HDDs).

Either way, GDDR3 hasn't completely disappeared. It may have for the high end, but the low end is using either that or DDR3, and those SKUs are in much higher supply relatively.

*shrug*
 
Hi Alstrong,
I had to move yesterday just after editing my post; after reading it again today, it appears that the last-minute editing made the whole thing shaky, to say the least.
I was posting with the assumption that, as in current AMD/ATI GPUs, RBE partitions in an SoC would be strictly tied to a memory controller; they should not be. That's a good thing, as the RBE partitions are decoupled and don't have to be changed down the road. So with this in mind I'll try to give a clearer answer through various points.

1) I don't expect half nodes in the case of an SoC; I don't expect the SoC to be made on a bulk process but on a GF/IBM SOI process. Usually there is no half node for these processes.

2) So jumping from one process node to another is in a way a new implementation. As you say, changing the I/O is extra work, there is no disputing it, but it's extra work that has to be compared to the effort of reimplementing multiple chips (2, possibly 3) on a new process.

3) In regard to bad scaling, I maintain that if the scaling is poor and so the chip remains bigger than expected, fitting the smaller I/O should not be a problem.
Overall, in regard to scaling, I would say it's a problem shared among "fixed designs" that have to go through more than one shrink. You may want to redesign, but the "fixed" nature of the design prevents you from doing so. As an example, the Cell @45nm is bigger than it could be because the EIB didn't scale as well as the logic. It kind of loops back to the "possibly wasted die area" argument; it's not specific to what I'm proposing.

4) I also want to point to the answer you gave to Fehu about using a 64-bit bus for a hypothetical 360 32nm shrink, specifically "the massive link to EDRAM". Like other I/O it doesn't shrink at all. So I want to loop back to the previous argument (not from you, though) about how a "wide bus doesn't scale (down)". It's another reason why I don't think it holds: you won't go lower than a 128-bit bus (so it really doesn't scale), and you actually have another bus/link which shares the same characteristic. All together I believe it's really disputable which solution doesn't "scale". A wide bus can be narrowed assuming the proper RAM (and some effort, but as I said in the previous points, both designs need that).

5) About memory characteristics, aka "256-bit @ X MHz != 128-bit @ 2X MHz".
It's indeed not disputable. It should have an impact on latency, power and possibly the "granularity" at which the "kind of integrated north bridge" allocates bandwidth to the various units.
In our case we're looking at first a 33% and then a 50% increase in RAM clock speed. Latency characteristics degrade as the RAM is clocked higher, but adjusted for the clock speed it should be a (marginal) win in this regard. So if moving to a more serial handling of the memory requests (writes or reads, as you have fewer "lanes" to spread the requests over) doesn't scale linearly with the clock-speed increase, the gain in latency could cover this degradation.
Marginal differences should not be a problem as long as they're positive (even if that means pumping the RAM frequency a tad higher). It could imply starting with slightly downclocked RAM in the first revision (say a bit under 800 MHz).
Truth is I can't tell, but I'm sure that's the kind of thing engineers should be able to predict pretty accurately.

6) The number of memory chips.
I missed this part in one of your posts:
hm... so let's say you started with
256-bit, 8x2Gbit chips -> 2GB
192-bit, 12x derp or 6x derp
128-bit, 8x2Gbit chips (clamshell mode) or 4x4Gbit (non-existent on the roadmap last I checked, might be wrong)
It looks like you added it just after I clicked the "quote" button.
Here I can't tell if it's doable or not. It could be the most troublesome part. I expected memory chip sizes to go up another round, so I expected some wasted/unused RAM with the 192-bit configuration. Something like:
256-bit, 8x2Gbit chips -> 2GB
(I remember the talk about this limitation and it's fine with me; 4GB would take ages to load, implies DDR3 and so EDRAM and, from my POV, extra costs. After an almost 10-year wait I want the system to be reasonably affordable, so I don't want to wait an extra 1 or 2 years before jumping in.)
192-bit, 6x4Gbit chips -> 3GB (wasted RAM) or 3x4Gbit + 3x2Gbit -> ~2.3GB (less wasted RAM, may not be possible)
128-bit, 4x4Gbit chips -> 2GB
The point is that, whereas I remember the density issue for GDDR5 ruling out a 4GB configuration, I thought that was only by launch time; I did not expect the roadmap to "stop".
So again that might be the most troublesome part of the approach. And I don't know the performance implications and feasibility of using the memory chips in the configurations you gave as reference.
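
Just to check the mixed-density option above (same 32-bit-per-device assumption as before; whether a real memory controller would accept mixed densities like that is a separate question):

[code]
# Mixed-density 192-bit option: 3x 4Gbit + 3x 2Gbit devices, 32-bit I/O each.
densities_gbit = [4, 4, 4, 2, 2, 2]
bus_bits = 32 * len(densities_gbit)
capacity_gb = sum(densities_gbit) / 8.0
print(bus_bits, capacity_gb)            # 192-bit, 2.25GB (the "~2.3GB" above)
[/code]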
 
As a follow-up I tried to research the roadmap for GDDR5 and it's kind of hellish!
I could not find anything. The most data I found was from Samsung, and they speak of their current product, i.e. a 2Gbit chip on a 40nm process (as well as some other links which don't offer more information). Actually they don't give a roadmap at all, or I failed to find it.
It seems that most RAM manufacturers communicate about their advances in either DDR3 or DDR4, which is understandable as that's where the volume is.
At the same time they are now using a 30nm node for DDR3 (and testing DDR4 chips). It's pretty new (~1 year) and demand for DDR3 is strong. Still, whether they communicate on it or not, I find it difficult to believe that they won't transition their GDDR5 production to this node. Their production capacity at 30nm should grow.
I mean, it's unclear when DDR4 will land; the wiki is unclear too: either the first products supporting it could launch in 2013 and reach mass adoption by 2015, or the whole thing could be delayed until 2015. In any case it's quite an extended period of time, and initially servers are likely to eat most of the production. So I think it's a reasonable bet to expect GDDR5 to last a bit longer and so to go through further improvements (in speed and density). We may not know, but a company like MS or Sony has easier means of knowing where RAM manufacturers are heading.
 
Well, that also necessitates a pretty large amount of die area along the border.

e.g. Lynnfield's triple channel memory controller is almost entirely along its longer dimension (~22.01mm x 13.45mm -> 296mm^2). A rough calc would put a quad-channel interface up to 26mm of real-estate.

RV770's 256-bit GDDR5 memory I/O takes up about 2.5 sides of the chip (16mm x 16mm -> 256mm^2). So... ~24mm of real-estate (pretty close to the above example for just rough eyeball estimates here).

RV770 actually has both GDDR5 and GDDR3 I/O, which probably take more space than just GDDR5 would?
 
I'll have to look at your posts later today, lio. No time right now. :p

RV770 actually has both GDDR5 and GDDR3 I/O, which probably take more space than just GDDR5 would?

It'd take more space towards the interior. You need perimeter for physical I/O.
 