Predict: The Next Generation Console Tech

The 360 GPU was ~ 180 mm^2, the CPU a touch smaller. I don't know how big the CGPU in Valhalla is, but it's probably under 200 mm^2. At no point has a "big chip" been the answer to MS's technical or cost issues and you would expect there to be a reason for that.

G71 was a shrink and tweak of G70, which was 334 mm2 and therefore utterly dwarfed Xenos. X1800XT was also much bigger as you say (288 mm2 vs 180 mm2) and the x1900XT which came out a few months later was ~ 350 mm2. Xenos really wasn't a big chip by enthusiast standards even in 2005/2006. MS could have gone much bigger but they didn't.

I doubt that adding the area of two chips together can simply give you a total die area that you can "spend" as you see fit next generation for the same cost, and so I don't see "Xenos + daughter die = PC GPU XXXX" as actually having any meaning.

The 360 GPU was not 180 mm^2, we covered that. Figure perhaps 200 mm^2, or 260 mm^2 with the EDRAM daughter die. A design without EDRAM (possible, imo) could therefore allocate the full 260 mm^2 if we constrain ourselves exactly to Xenos's budget. RSX was 240 mm^2 with no EDRAM.

Adding the areas together doesn't give you that by itself, and we obviously have no idea what the manufacturers are up to. They could decide to spend less, they could decide to spend more; imo they will shift more silicon to the GPU this time. If they have a bigger budget plus more of the balance going to the GPU, then it could be quite a bit bigger; if not, it might be smaller. We can certainly look to the beginning of this gen for what's reasonable and possible.

I'd also point out that top-end PC GPUs have gotten bigger, so consoles might be expected to follow. Or they might not. Just that it's plausible the limits from last time may be higher this time.


so I don't see "Xenos + daughter die = PC GPU XXXX" as actually having any meaning.

You're the one who brought it up. I don't see how it doesn't have some meaning. Of course we're just speculating in extreme generalities, as that's all we can do at this point.

I don't know how big the CGPU in Valhalla is, but it's probably under 200 mm^2.

Two nodes later doesn't have any bearing on a hypothetical starting limit. Whatever we start at will get shrunk accordingly later.
 
I doubt that adding the area of two chips together can simply give you a total die area that you can "spend" as you see fit next generation for the same cost, and so I don't see "Xenos + daughter die = PC GPU XXXX" as actually having any meaning.

Actually, silicon budget in terms of die area is about the only meaningful metric we have to speculate with right now.
 
The 360 GPU was not 180 mm^2, we covered that. Figure perhaps 200 mm^2, or 260 mm^2 with the EDRAM daughter die. A design without EDRAM (possible, imo) could therefore allocate the full 260 mm^2 if we constrain ourselves exactly to Xenos's budget. RSX was 240 mm^2 with no EDRAM.

My point is that I don't see how you can just "allocate" a separate chip, probably made using a different manufacturing process and serving an entirely different purpose, into a more conventional PC GPU. Not having the edram would have repercussions on the ROPs and the memory bus, and yield may not scale linearly with die size.

RSX is a swollen G71 with added redundancy and other stuff. It's also still considerably smaller than enthusiast PC GPUs - X1900 XT was 350 mm2 and G80 was a wtf worthy 480 mm2.

Adding the areas together doesn't give you that by itself, and we obviously have no idea what the manufacturers are up to. They could decide to spend less, they could decide to spend more; imo they will shift more silicon to the GPU this time. If they have a bigger budget plus more of the balance going to the GPU, then it could be quite a bit bigger; if not, it might be smaller. We can certainly look to the beginning of this gen for what's reasonable and possible.

I think next generation neither Sony nor MS will want to wait for the first shrink to turn their systems into things people want to buy. The Wii you own runs games better than the PS360 you don't.

I'd also point out that top-end PC GPUs have gotten bigger, so consoles might be expected to follow. Or they might not. Just that it's plausible the limits from last time may be higher this time.

PC GPUs haven't gotten much bigger since 2006. AMD are still around the same size as the X1900 XT (after the 2900, who can blame them) and I don't think Nvidia have beaten G80 yet.

Two nodes later doesn't have any bearing on a hypothetical starting limit. Whatever we start at will get shrunk accordingly later.

They waited two nodes to do it instead of using a monster 90nm or huge 65nm chip, although perhaps there were other reasons unrelated to die size.
 
PC GPUs haven't gotten much bigger since 2006. AMD are still around the same size as the X1900 XT (after the 2900, who can blame them) and I don't think Nvidia have beaten G80 yet.

I'm not sure if you mean something else by "beat", but GT200 was 579mm2 and big Fermi (GTX 480/580) was about 520mm2. However, your point still stands: they haven't become much bigger.
 
I'm not sure if you mean something else by "beat", but GT200 was 579mm2 and big Fermi (GTX 480/580) was about 520mm2. However, your point still stands: they haven't become much bigger.

I think at 20% bigger it's fair to say GT200 "beats" G80 then. :eek:
 
They waited two nodes to do it instead of using a monster 90nm or huge 65nm chip, although perhaps there were other reasons unrelated to die size.

I wonder if production capacity could be the answer. One can only reserve so many wafers, as there is also strong pressure on bleeding-edge process capacity.

Thing is, along with yield predictions they may consider the maximum viable chip size. If the chip is too big, there is no way they will match projected production figures, and therefore sales goals.

Early in a system's life, I wonder to what extent production capacity is as much of a concern as price reduction.

Looking forward, the pressure on bleeding-edge processes has only grown since 2005/6 thanks to high demand for mobile products, and that trend isn't going away anytime soon.

That might be an explanation for the noise about MS producing "something" both at GF and IBM.

We may be taking the problem in the "wrong" direction: we consider the price of the chip based on yields and size, without any regard for wafer availability.
Manufacturers may work the other way around: look at production capacity in wafers, then try to meet their production goals out of the wafers available and their budget.

Most likely it's a blend of both approaches, but that's still different from the approach here, where performance and cost are the only factors. Production capacity has to be a strong concern, and that's truer now than in 2005/6.

Maybe in 2005 MS found that the sweet spot was ~175 mm^2 chips (both Xenon and Xenos are in this ballpark), not only because that's what they needed from a performance point of view, but because that could have been as far as they were willing to go from a production capacity point of view.
Again, I believe it's a blend of both: trade-offs are made between performance, price and production capacity.

We should not leave production capacity out of our speculation; no manufacturer will launch a product if, for example, they are certain that production won't exceed 4 million units in the first year. No matter how good the product is, they need to grow a significant user base fast.

If we look at the PS360, both claimed production capacity was a constraint at some point, and that was with ~175 mm^2 chips. Now, taking into account diminishing yields with bigger chips and fewer chips per wafer... I believe anyone can come to their own conclusion, or continue to ignore quite a bit of evidence. For me, it means my idea of a big SoC (300+ mm^2) may not be workable from a mass production point of view :(
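To put rough numbers on the "fewer and worse-yielding chips per wafer" point, here's a back-of-the-envelope sketch in Python. The defect density is a placeholder of my own, and the simple Poisson yield model and dies-per-wafer approximation are just common rules of thumb, so treat the outputs as illustrative only.

Code:
import math

def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Rough gross die count: wafer area / die area minus an edge-loss term.
    A common approximation; real numbers depend on die aspect ratio,
    scribe lines, exclusion zones, etc."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Simple Poisson yield model: yield falls exponentially with die area."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

defect_density = 0.4  # defects/cm^2 -- placeholder, not a real foundry figure
for area in (175, 250, 300):
    gross = gross_dies_per_wafer(area)
    good = gross * poisson_yield(area, defect_density)
    print(f"{area} mm^2: ~{gross} candidates/wafer, ~{good:.0f} good dies")

In this toy model, going from ~175 mm^2 to ~300 mm^2 roughly triples the wafer cost per good die, which is why a fixed wafer allocation translates so directly into a cap on launch-window console volumes.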
 
Another thing I would happily see senior members discuss is "transactional memory".
I tried to read David Kanter's latest piece about Haswell's transactional memory, but it's too technical for me.

It happens that the Power A2 / Blue Gene/Q supports some forms of this, namely speculative multithreading (SpMT) and transactional memory (TM).
D. Kanter doesn't discuss IBM's implementation much, but I wonder if this could constitute an incentive for manufacturers to pass on OoO execution.

The Power A2 / Blue Gene/Q cores are super tiny, and the implementation of TM / SpMT doesn't seem to take much space or power. It could offer great bang per buck/Watt, especially for manufacturers concerned about cost, production capacity, etc.

I think about this part of the article:
David Kanter said:
Speculative multithreading (SpMT) focuses on using multiple hardware threads (or multiple cores) to work together and accelerate a single software thread. In essence, the single software thread is speculatively split into multiple threads that can be executed in parallel. Transactions are a natural fit for speculative threads, since it offers an easy way to rollback incorrect speculation. The key advantage of SpMT is that it enables existing single threaded code (which is the vast majority of all software) to reap the benefit of multi-core processors.

The article can be read here.

Long story short: could that be a cheap way to raise the performance of sucky but cheap in-order processors, and so represent a strong incentive for manufacturers to stick with in-order cores?
And would IBM's implementation fit the bill?
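To make the "speculate, detect conflicts, roll back" idea a bit more concrete, here's a toy software sketch of transactional semantics in Python. It's my own illustration (VersionedCell and run_transaction are invented names), not IBM's hardware scheme, which does the buffering and conflict detection in silicon; but the programming model it enables looks roughly like this:

Code:
import threading

class VersionedCell:
    """A shared memory cell with a version counter for conflict detection."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

def run_transaction(body, max_retries=10):
    """Run body(read, write) speculatively; commit only if nothing we read
    has changed in the meantime, otherwise discard the writes and retry."""
    for _ in range(max_retries):
        read_set = {}    # cell -> version observed at first read
        write_set = {}   # cell -> buffered new value (not yet visible)

        def read(cell):
            if cell in write_set:              # read-your-own-writes
                return write_set[cell]
            read_set.setdefault(cell, cell.version)
            return cell.value

        def write(cell, value):
            write_set[cell] = value

        body(read, write)                      # speculative execution

        # Commit: lock in a fixed order, validate read versions, publish.
        cells = sorted(set(read_set) | set(write_set), key=id)
        for c in cells:
            c.lock.acquire()
        try:
            if all(c.version == v for c, v in read_set.items()):
                for c, value in write_set.items():
                    c.value = value
                    c.version += 1
                return True                    # committed
            # Conflict: someone changed a cell we read -- roll back, retry.
        finally:
            for c in cells:
                c.lock.release()
    return False                               # gave up after max_retries

# Usage: the body is written as if it ran alone; no application-level locks.
a, b = VersionedCell(100), VersionedCell(0)
run_transaction(lambda read, write: (write(a, read(a) - 10),
                                     write(b, read(b) + 10)))

The appeal for simple in-order cores is exactly that last part: the programmer writes straight-line code, and the conflict detection and rollback machinery (hardware in Blue Gene/Q's case, software in this sketch) deals with the concurrency.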
 
The problem with speculative multithreading is the communication latency.
Found a bit more info on IBM's implementation, though possibly still not technical enough:
http://arstechnica.com/hardware/new...r-break-time-for-multithreaded-revolution.ars

It implies that IBM's implementation is pretty "free" performance-wise, but only works on a per-core basis.
Arstechnica said:
Will it deliver?

The implementation of the transactional memory itself is complex. Ruud Haring, who presented IBM's work at Hot Chips, claimed that "a lot of neat trickery" was required to make it work, and that it was a work of "sheer genius." After careful design work, the system was first built using FPGAs (chips that can be reconfigured in software) and, remarkably, it worked correctly first time. As complex as it is, the implementation still has its restrictions: notably, it doesn't offer any kind of multiprocessor transactional support. This isn't an issue for the specialized Sequoia, but it would be a problem for conventional multiprocessor machines: threads running on different CPUs could make concurrent modifications to shared data, and the transactional memory system won't detect that.
Edit: I mean, given that the functionality applies only to the threads of a single core, could that help with latencies? The scheme only seems to rely on hardware present within each core: no modification to the L2, no core-to-core communication.
 
In the generation after the one to come, I'd certainly expect at least one vendor to provide hardware assistance for transactional memory and garbage collection in the style of Azul's Vega, to allow efficient code to be written in a nicer language than C++; I think the most plausible candidate is Cray's Chapel. I'd also expect graphics to move into software. That will require a CPU roughly like Tilera's: a mesh of many small, power-efficient cores, with Larrabee-like 512-bit vector units that can scatter/gather. Stacked DRAMs should allow at least a byte of bandwidth per flop.
 
I wonder if production capacity could be the answer. One can only reserve so many wafers, as there is also strong pressure on bleeding-edge process capacity.

Thing is, along with yield predictions they may consider the maximum viable chip size. If the chip is too big, there is no way they will match projected production figures, and therefore sales goals.

Early in a system's life, I wonder to what extent production capacity is as much of a concern as price reduction.

Looking forward, the pressure on bleeding-edge processes has only grown since 2005/6 thanks to high demand for mobile products, and that trend isn't going away anytime soon.

That might be an explanation for the noise about MS producing "something" both at GF and IBM.

We may be taking the problem in the "wrong" direction: we consider the price of the chip based on yields and size, without any regard for wafer availability.
Manufacturers may work the other way around: look at production capacity in wafers, then try to meet their production goals out of the wafers available and their budget.

Most likely it's a blend of both approaches, but that's still different from the approach here, where performance and cost are the only factors. Production capacity has to be a strong concern, and that's truer now than in 2005/6.

Maybe in 2005 MS found that the sweet spot was ~175 mm^2 chips (both Xenon and Xenos are in this ballpark), not only because that's what they needed from a performance point of view, but because that could have been as far as they were willing to go from a production capacity point of view.
Again, I believe it's a blend of both: trade-offs are made between performance, price and production capacity.

We should not leave production capacity out of our speculation; no manufacturer will launch a product if, for example, they are certain that production won't exceed 4 million units in the first year. No matter how good the product is, they need to grow a significant user base fast.

If we look at the PS360, both claimed production capacity was a constraint at some point, and that was with ~175 mm^2 chips. Now, taking into account diminishing yields with bigger chips and fewer chips per wafer... I believe anyone can come to their own conclusion, or continue to ignore quite a bit of evidence. For me, it means my idea of a big SoC (300+ mm^2) may not be workable from a mass production point of view :(

Good point WRT manufacturing capability.

I'd add a note here though:

MS had manufacturing issues for the first 6 months, but it's not clear that chip production was the problem (though it wouldn't be surprising, as final silicon wasn't available until a few months prior to launch). Sony had issues with Blu-ray diodes; again, I'm not sure chip manufacturing was the bottleneck.

One other thing, MS was around 170mm2 per chip, but Sony was around 250mm2 per chip. As I said above, I don't recall Sony having an issue manufacturing chips.

I'm sure both Sony and MS would like to have issues with producing enough chips to meet demand, but that may not be the case. Especially if people in general are so ho hum on cutting edge graphics as people seem to suggest.

Either way, I don't think a ~250mm2 chip would be an issue for mass production on 28nm, especially if neither one is focusing on 2012. If we're looking at 2013, I don't think either Sony or MS would have a second thought about producing >250mm2 chips on 28nm.
 
Going to ask a question a little bit outside the mold of typical discussion...

Can someone specify (in single-precision FLOPS) what floating point capability a next-gen console CPU should be targeting? And maybe put that answer, along with an explanation, in the context of the rest of the hardware. Thanks.
 
Going to ask a question a little bit outside the mold of typical discussion...

Can someone specify (in single-precision FLOPS) what floating point capability a next-gen console CPU should be targeting? And maybe put that answer, along with an explanation, in the context of the rest of the hardware. Thanks.

I'm sure one of the devs here could provide a more specific answer, but I'd say in general not to expect much improvement on the CPU side. The reason is that for many calculation-intensive cases a GPGPU could provide much more perf/mm^2 than a CPU, and where those calculations aren't needed, the GPU budget will be put to good use producing better graphics, which are an easier sell than a more realistic cloth simulation or a better AI algorithm.

Essentially, the design manifesto will be to invest more in the graphics side while ensuring that the graphics side is also more flexible.

What this means for the cpu is anyone's guess.

For me it means the importance of CPU performance is lessened, and thus usability concerns are a higher priority: things like backwards compatibility, programmer familiarity, codebase compatibility, and reduced R&D costs. In my mind, this translates to scaled-up versions of the existing CPU architectures.

For MS, this likely means a simple doubling of Xenon (6 cores, 12 threads, 2MB cache, and improved VMX).

For Sony, this is a bit trickier, as the things Cell does well GPGPU also does well; thus doubling Cell may not be the most efficient use of the die budget. I'd expect a 4 PPE, 6 SPE Cell variant for compatibility and performance.

The Flops that either CPU would end up with is mostly irrelevant as both machines would be reliant more on the GPGPU for their Flop prowess and bragging rights.
 
I'm sure one of the devs here could provide a more specific answer, but I'd say in general not to expect much improvement on the CPU side. The reason is that for many calculation-intensive cases a GPGPU could provide much more perf/mm^2 than a CPU, and where those calculations aren't needed, the GPU budget will be put to good use producing better graphics, which are an easier sell than a more realistic cloth simulation or a better AI algorithm.

Essentially, the design manifesto will be to invest more in the graphics side while ensuring that the graphics side is also more flexible.

What this means for the cpu is anyone's guess.

For me it means the importance of CPU performance is lessened, and thus usability concerns are a higher priority: things like backwards compatibility, programmer familiarity, codebase compatibility, and reduced R&D costs. In my mind, this translates to scaled-up versions of the existing CPU architectures.

For MS, this likely means a simple doubling of Xenon (6 cores, 12 threads, 2MB cache, and improved VMX).

For Sony, this is a bit trickier, as the things Cell does well GPGPU also does well; thus doubling Cell may not be the most efficient use of the die budget. I'd expect a 4 PPE, 6 SPE Cell variant for compatibility and performance.

The Flops that either CPU would end up with is mostly irrelevant as both machines would be reliant more on the GPGPU for their Flop prowess and bragging rights.

Thanks, I appreciate that excellent response and explanation.

For Sony, this is a bit trickier, as the things Cell does well GPGPU also does well; thus doubling Cell may not be the most efficient use of the die budget. I'd expect a 4 PPE, 6 SPE Cell variant for compatibility and performance.

This was actually behind my interest in the question, spurred by the recent Forbes article that AMD is the graphics supplier for Sony's next-generation console. (Charlie Demerjian of Semiaccurate.com, who has been behind recent next-gen rumors, reaffirmed the validity of the Forbes claim, stating that he has known AMD is behind Sony's console for over a year and would 'spill more beans' soon, for what that's worth.)

I was thinking that Sony might forgo backwards compatibility (seeing as it's switching from Nvidia to AMD anyway?) and might go for something simpler and more efficient in the CPU department, like 4 or 6 PowerPC A2 cores or another suitable Power core solution, and stay away from Cell entirely if there are significant cost, die area and thermal budget gains to be had by doing so?

Things may not work this way in the real world, but I wouldn't be surprised if AMD wanted nothing to do with Cell, on several levels. If this chip is an SoC, then adding Cell is going to complicate it significantly. For one, how would manufacturing of a Cell-based SoC work? Global Foundries has never touched the Cell processor. Future die shrinks would require more complicated redesigns as well, owing to the complexity of Cell. Lastly, this might be my own interpretation, but I think as a matter of pride AMD wouldn't want to combine its tech with Cell; it's really contrary to its vision of heterogeneous and GPGPU computing.

In any case, it's my feeling that Cell is done in consoles if anyone wants to give that some thought and expand upon it..
 
Going to ask a question a little bit outside the mold of typical discussion...

Can someone specify (in single-precision FLOPS) what floating point capability a next-gen console CPU should be targeting? And maybe put that answer, along with an explanation, in the context of the rest of the hardware. Thanks.

Well, starting with MS, a trio of 7th-gen Power cores would be a good bet. That'd be 12 OoO threads coming in at around 105 GFLOPS (though that may be DP). Adding VMX128 should bump the theoretical number up, and there's the possibility it could include VMX256 (which is rumored to be coming with POWER8). Or you could just double Xenon's flops, which would put it at around 230 GFLOPS, which sounds realistic.

As for Sony, IMO they'll match MS thread for thread, so I'd expect a similar trio of 4-thread cores along with a similar 230 GFLOPS. Then it's a question of including SPEs for BC. They could just include an array of 6-8 SPEs, but that may be more work than you'd expect. The other option would be to include a cluster of 4 SPEs with each core (effectively giving you a trio of cut-down Cells). That'd be a lot less work and leave you with a little monster coming in at around 550 GFLOPS.
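Since most of these figures fall out of the same peak-throughput arithmetic, here's a quick sanity check. The flops-per-cycle values are my assumptions about how the commonly quoted peak numbers are derived, not confirmed specs, and peak figures say nothing about sustained performance.

Code:
def peak_gflops(units, flops_per_cycle_per_unit, clock_ghz):
    """Theoretical peak only: units x flops/cycle x clock (GHz)."""
    return units * flops_per_cycle_per_unit * clock_ghz

# The commonly quoted ~115 GFLOPS for Xenon works out to ~12 SP flops per
# cycle per core at 3.2 GHz; doubling the core count gives the ~230 above.
print(peak_gflops(3, 12, 3.2))         # ~115 GFLOPS
print(peak_gflops(6, 12, 3.2))         # ~230 GFLOPS

# Each SPE is usually rated at a 4-wide SP multiply-add, i.e. 8 flops/cycle
# or 25.6 GFLOPS at 3.2 GHz, so ~230 GFLOPS of cores plus 12 SPEs lands
# near the "around 550" figure.
print(230 + peak_gflops(12, 8, 3.2))   # ~537 GFLOPS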
 
I was thinking that Sony might forgo backwards compatibility (seeing as it's switching from Nvidia to AMD anyway?) and might go for something simpler and more efficient in the CPU department, like 4 or 6 PowerPC A2 cores or another suitable Power core solution, and stay away from Cell entirely if there are significant cost, die area and thermal budget gains to be had by doing so?

I look at the intended use case and come to a very different conclusion.

If CPU performance is no longer key to getting overall performance up, then the choice of which CPU to use becomes less important. That being the case, the investments by all involved in Cell would lead me to believe it would be the best architecture to choose.

As I said, there are BC issues from the consumer's side, but there are also libraries which developers have built, and years of experience in designing software for Cell. Granted, many complaints were lodged against Cell along the way, but solutions to many of those problems were found.

Thus the Cell architecture in and of itself isn't the performance inhibitor it was at the outset of the generation. But in an effort to make developers' lives easier, I'd suggest bumping the PPE count up to 4 (8 threads), with VMX units, and also leaving the 6 SPEs on board for backwards compatibility with existing software, while also providing an additional workforce for the many other tasks that PS3 devs have found uses for this gen.

Those 6 SPEs would hardly be sitting idle when running new PS4 games. They could be used for sound, physics, post-processing effects, decompression, procedural generation, or whatever else developers might find a use for.

Developers that either don't want to bother learning Cell (if they haven't already) or that don't like coding for SPEs can simply use them for sound and focus on the 4-core, 8-thread PPE block for the rest of their code.



If the idea behind these consoles is to become more service-oriented, one good way to go about that is servicing the existing software libraries that consumers already have (used or not, it is still the right gesture when presenting a "platform", as iOS and others have already done).

As long as the architecture isn't holding the machine back in a significant way, I see no reason for them to dump Cell. And with further emphasis on the GPU end and away from the CPU end, I'd say whatever possible edge may be found in a known alternate architecture is irrelevant.

What is relevant is getting new consumers on board. Having a CPU that is 10-20% faster per mm2 will not make up for losing compatibility with a significant portion of online software purchases.



For the GPU, if indeed they are turning to AMD, they can surely license or pay a royalty to emulate RSX on future machines. But again, much like the alternate CPU situation, AMD and Nvidia are both very close in performance. Usability should trump a 10-20% edge in performance, and with that, I'd say the only reason for them to dump Nvidia would be if Nvidia were being unreasonable in contract negotiations for PS4 GPU designs.

Either way, I'd expect the issue to be resolved and BC to become a standard for both Sony and MS as they are both actively looking to create Platform ecosystems. And one of the keys to doing that is with software compatibility.
 
The other option would be to include a cluster of 4 SPEs with each core (effectively giving you a trio of cut-down Cells). That'd be a lot less work and leave you with a little monster coming in at around 550 GFLOPS.

Monster indeed!

That would be rather large and potentially cut down on the GPU budget in a significant way (either that or bring the BOM up a good chunk leading to either less profit down the road or higher MSRP).

If having a 4 PPE 6 SPE Cell would prove problematic, I'd think it would be easier for them to literally do a pair of Cell twins.

Having said that, I can't think of a reason they couldn't do a 4PPE 6SPE Cell.

1 PPE is attached to the 6 SPEs just as it is now in Cell, and the trio of additional PPEs sits next to the first PPE. Have a large shared cache for the 4 PPEs and it's a wrap.

I could be wrong on the above, but from what I've seen in other designs it seems quite possible.
 
Well, starting with MS, a trio of 7th-gen Power cores would be a good bet. That'd be 12 OoO threads coming in at around 105 GFLOPS (though that may be DP). Adding VMX128 should bump the theoretical number up, and there's the possibility it could include VMX256 (which is rumored to be coming with POWER8). Or you could just double Xenon's flops, which would put it at around 230 GFLOPS, which sounds realistic.

Is the PowerPC A2 a suitable next-gen console core? I just envisioned a similar quad-core PPC A2 chip with 2 quad floating point units (as equipped in Blue Gene) per core, coming to around 204 GFLOPS @ 3.2GHz under 55 watts.
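Plugging that configuration into the same peak formula, and assuming each quad floating point unit does a 4-wide fused multiply-add (8 flops per cycle, my assumption rather than a quoted spec), does land right on that figure:

Code:
cores, quad_fpus_per_core, flops_per_cycle_per_fpu, clock_ghz = 4, 2, 8, 3.2
print(cores * quad_fpus_per_core * flops_per_cycle_per_fpu * clock_ghz)  # 204.8 GFLOPS peak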


Thus the Cell architecture in and of itself isn't the performance inhibitor it was at the outset of the generation. But in an effort to make developers' lives easier, I'd suggest bumping the PPE count up to 4 (8 threads), with VMX units, and also leaving the 6 SPEs on board for backwards compatibility with existing software, while also providing an additional workforce for the many other tasks that PS3 devs have found uses for this gen.

Aside from the 6 SPEs for compatibility, how many SPEs do the 4 PPEs include? Or are the VMX units taking their place? I just want to understand the concept. And does this still fit under the cost, die size, and thermal requirements for a GPU-dominant console?
 