Predict: The Next Generation Console Tech

Your theory is all nice, but a system like this has two problems: network bandwidth and latency. I can't imagine a fast FPS having most of its processing done far away from the place where the results get shown to the player.

The computing chores you are talking about would not occur in real time, but the results of those calculations would be coordinated in real time. ;)

It will be like interacting with a Disney-Pixar film. Consoles needn't worry about the nuts and bolts of programming (like rendering scenes from Ratatouille in microseconds); all they really need are frames of animation. Online servers will busy themselves with the production work, using the information players send and the results of days, weeks, even months of calculation.
 
I would hope they would go with a 512-bit bus, but I could have sworn we all thought they were going to go with a 256-bit bus in the 360. Aren't they using a 128-bit bus now?

Yes, exactly. Most figured Xbox 2 would use a 256-bit bus, but I think that when the Xenon system specs got leaked in mid-2004, they showed a 128-bit bus, or bandwidth in line with what a 128-bit bus could provide.

I hope there is enough distance in time (2011-2012), and enough desire/need to go with something very high-performance for a 3rd-gen Xbox, to require the leap to a 512-bit bus. Certainly by then, 1024-bit buses will be used in high-end PC graphics cards, if not multiple 512-bit buses. Alternatively, Microsoft would have to find memory technology that provides hundreds of GB/sec over a 128-bit or 256-bit bus.

That's just main memory bandwidth -- I think there will still be a need for incredibly high bandwidth for graphics, in the TB/sec range.

Going back to main memory bandwidth, I might have it all wrong. If Rambus is going to be able to provide 1 TB/sec by 2010, does that mean large buses of 512-bit and up won't be needed? 1 TB/sec is more than I expected: that's about 44 times Xbox 360 (1000 / 22.4). All that bandwidth will be needed (developers will still be wanting more, beyond 1 TB/sec); it's not just about a given resolution + anti-aliasing.
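For anyone who wants to check the arithmetic, here is a quick sketch. The 22.4 GB/s comes from Xbox 360's 128-bit GDDR3 bus at 700MHz (1.4GHz effective); the 1 TB/sec figure is just the Rambus projection mentioned above.

```python
# Peak bandwidth = (bus width in bytes) x (effective data rate)
x360_gbs = (128 / 8) * 1.4         # 128-bit bus, 1.4 GT/s GDDR3 -> 22.4 GB/s
rambus_2010_gbs = 1000.0           # the projected 1 TB/sec figure
print(x360_gbs)                    # 22.4
print(rambus_2010_gbs / x360_gbs)  # ~44.6x Xbox 360's main memory bandwidth
```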
 

A wider bus is not necessarily faster. More wires in a bus mean more crosstalk and more signals that can get out of phase, requiring a longer time for them all to settle before the bus can be read. Rambus's speed advances in XDR RAM come from using fewer wires per bus and point-to-point (1:1) wire connections between the RAM chip and the CPU chip. Eliminating taps on the bus reduces signal reflections, so the signal on the bus settles down more quickly and can be read much more rapidly.

Hence 256-bit and 512-bit buses may not be the way to go in future.
 


That's exactly why I edited in the last paragraph of my post. Even though I don't understand all the signalling, crosstalk and wiring stuff, I know that Rambus gets more bandwidth out of narrow buses.
 
300-400 GB/s of bandwidth will still be huge in 2011!
I don't think developers need that much bandwidth, even for 1080p games at 16xAA and 16xAF.
 

But it seems like it would be nice to have the system bottleneck back at the drive (DVD, BR, HD-DVD, HDD, etc.). Of course, tons of bandwidth is useless if the devs don't find a way to use it all, or if the GPU isn't fast enough.
 
XDR RAM in the PS3 uses a 64-bit bus, correct? So maybe a 256-bit bus in 2011/2012 might be enough with XDR2.

256-bit bus, XDR2 @ 500MHz (8.0GHz effective) = 256GB/s
256-bit bus, XDR2 @ 800MHz (12.8GHz effective) = 409.6GB/s
256-bit bus, XDR2 @ 1066MHz (17.056GHz effective) = 545.792GB/s

Now say they keep the same formula for PS4 and have separate memory pools: Cell2, or whatever, could have 1 or 2 gigs of XDR2 at, say, 800MHz on a 256-bit bus, yielding over 400GB/s of bandwidth.
That would be a 16x increase in bandwidth and a 4 to 8 times increase in RAM for the CPU. It would seem logical to have Nvidia build their next GPU for PS4 around Rambus as well. So if they keep the separate buses for PS4, I think around 800GB/s of total system bandwidth wouldn't be out of the question. That's assuming the XDR2 is clocked at 800MHz, a somewhat realistic clock rate in four years' time, methinks.
Also, I don't think next-gen consoles will use a UMA if total system memory is 4GB; on the other hand, if it's 2GB of system memory then UMA would be good.
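Just to show where those numbers come from, here is the same arithmetic as a small sketch. It assumes, as the figures above do, that XDR2 transfers at 16x its base clock; peak bandwidth is simply bus width in bytes times effective data rate.

```python
# Peak bandwidth for a 256-bit XDR2 bus at various base clocks,
# assuming a 16x data-rate multiplier as in the figures above.
def xdr2_peak_gbs(bus_bits, base_clock_mhz, multiplier=16):
    effective_ghz = base_clock_mhz * multiplier / 1000.0
    return (bus_bits / 8) * effective_ghz   # GB/s

for mhz in (500, 800, 1066):
    print(mhz, xdr2_peak_gbs(256, mhz))
# 500 -> 256.0, 800 -> 409.6, 1066 -> 545.792 GB/s
```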
 


If Rambus is going to have 1 TB/sec of bandwidth by 2010, I'm sure PS4 will have at least that much external main memory bandwidth. XDR2 would be considered too old for PS4, since it's a current high-end memory technology and we are probably 5 years away from getting PS4.

400 GB/sec isn't going to be enough, IMO.

Also, on a separate note, Nvidia said in 2004 (well over 3 years ago) that 3D games would require 3 TB/sec of bandwidth.
 

Nvidia didn't say that; they just looked at the rate of progress between 1994 and 2004 and applied the same coefficient to 2014 (10 Tflop GPU, 3TB/s, 32GB of framebuffer memory, 270Gpps... and even a 100GHz CPU, 44GHz memory, a 30TB HDD, etc.). It was more to point out how big the past/present evolution has been than to forecast the future.
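In other words, the 2014 numbers are just a straight reapplication of the previous decade's growth factor. As a toy illustration (the 1994 and 2004 baseline figures below are my own rough assumptions for the sake of the example, not numbers from the thread or from Nvidia):

```python
# Project a 2014 figure by reapplying the 1994->2004 growth factor,
# as described above. Baseline values are illustrative assumptions only.
def project_2014(value_1994, value_2004):
    growth_per_decade = value_2004 / value_1994
    return value_2004 * growth_per_decade

# e.g. graphics memory bandwidth in GB/s: ~0.4 in 1994, ~35 in 2004 (assumed)
print(project_2014(0.4, 35.0))   # ~3062 GB/s, i.e. roughly the 3 TB/s quoted
```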
 
But an X360 without eDRAM would have a 256-bit unified bus, or two 128-bit buses; we compare in "equivalence" for more relevance :)

I'm fairly certain the eDRAM has a 2048-bit wide internal bus. At 500MHz, bi-directional, that gives the 256GB/s. :)
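The arithmetic behind that figure, as a quick sketch (just the numbers from the post: 2048 bits wide, 500MHz, counting both directions):

```python
# Xenos eDRAM bandwidth implied above: 2048-bit bus at 500 MHz, read + write
bytes_per_cycle = 2048 / 8            # 256 bytes per direction per cycle
one_way_gbs = bytes_per_cycle * 0.5   # 500 MHz -> 128 GB/s each way
print(one_way_gbs, 2 * one_way_gbs)   # 128.0, 256.0 GB/s bi-directional
```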
 

For real-time global illumination and a really complex, dynamic 3D world, we will need huge, incredible raw power. A 10-teraflop GPU is a very prudent estimate; I expect a 50-teraflop GPU in 2014, and a totally different approach to real-time 3D rendering. No more ROPs/TMUs/shaders, but a super-unified architecture that rebalances the work every single frame.
 
One little question: which kind of units are tessellation and geometry likely to be processed by in upcoming GPUs?
EDIT
My question is not innocent. Nao has hinted many times that SIMD units like the SPEs could do a great job at tessellation, so I want to know if this kind of unit could also do well with geometry shaders.
It could be interesting if upcoming GPUs include more "general purpose" SIMD units than the (more and more flexible) ones dedicated to graphics tasks.

It could give us some clues about how heavily MS is likely to invest its silicon budget in the GPU.
 

Little by little, I think GPUs will merge with CPUs.
I won't be surprised if in 2017 each core of a processor can process data and even graphics. For example, does a game have massive physics and quite good graphics? A driver assigns 60% of the cores to physics and 40% to graphics.
Does a game have incredible graphics but normal physics? The driver assigns 80% of the cores to graphics and 20% to physics. And inside this task division, a subdivision: TMUs, ROPs or shaders? The driver manages the resources needed by every single game.
So a processor would be completely used. Can I call this kind of architecture SUA: Super-Unified Architecture? :LOL:
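Purely as a toy sketch of that idea (the core count, workload names and ratios below are made up for illustration; no such driver exists):

```python
# Hypothetical "driver" dividing a unified pool of cores between workloads,
# along the lines described above. All numbers are arbitrary examples.
def assign_cores(total_cores, shares):
    """shares: workload name -> fraction of the core budget."""
    return {task: round(total_cores * frac) for task, frac in shares.items()}

# a physics-heavy game vs. a graphics-heavy game, on an imaginary 32-core part
print(assign_cores(32, {"physics": 0.60, "graphics": 0.40}))
print(assign_cores(32, {"graphics": 0.80, "physics": 0.20}))
```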
 

I wasn't looking that far ahead :LOL:
But if we want to make the most out of a given silicon budget every single cycle, it makes sense, unless for a given task a too-general-purpose execution unit is really too slow.
 

Now try designing one...
The needs of different tasks can be so different that with current technology it wouldn't be possible, or at least is highly improbable. That's why there's all the talk about heterogeneous processors these days: you build a processor with different cores, each suited to a different task. One for physics, one for the GPU, one for general purpose, etc...
 
Without a lot of extra logic and resources, POWER6 would suffer a huge dip due to its lower efficiency.

It's a trade-off: while its performance per MHz dropped, its performance per watt doubled. Aggressive OoO is a huge power hog; dropping it allowed them to add other things which increased performance and functionality without increasing power.

I think it would be an interesting comparison to see how much closer POWER5+ would be if it had the same expansion of bandwidth, better process, and time to refactor some implementation-specific faults, like its 2-cycle result forwarding.
I doubt they would have even got close to the performance of POWER6. P6 added a lot of extra stuff which has nothing to do with being in-order: the I/O system was all changed, there were a load of new instructions (AltiVec, decimal FP) as well as reliability features. If this had been added to POWER5+ it would have left little room for clock speed gains. P5+ didn't hit its clock speed goals as it was, so while the new process and design techniques would help, I can't see the same sorts of gains being made.


Unless you use the real-world applications that are either the exact same applications or operate similarly to the real-world application exemplars in SPEC.
...and you use the same compiler with the same level of tuning and assuming it's not data set sensitive.


The key point is that there are measures of performance where a spectacularly aggressive in-order design loses to an OoO chip, and in some measures it loses incredibly badly: such as power and cost.
One is designed to run largely single-threaded apps in a desktop box. The other is designed to run largely multithreaded apps and sit in a large multiprocessor box. The trade-offs involved are completely different, so comparisons of power and cost are meaningless. E.g. P6 has a hefty I/O system needed to communicate with a large number of other processors; that alone means the power figures are going to be very different.

There are also other factors: the high clock on P6 increases leakage, and 40% of its power goes to this. Intel has a lower clock rate, so leakage is vastly reduced.

Anyway, I note you quoted SPECint; how about quoting SPECfp rate? Very different story there.

What rule would that be?
SPEC doesn't allow source code to be changed. Since Cell relies on new code for its performance, SPEC can't be used as a performance indicator with any reliability.

The circuit design techniques used in the PPE and Xenon did inform the final circuit design for POWER6.
POWER6 will have been in development for years; it probably started before Xenon did. While there could have been some knowledge flow, I doubt it would have been of any significance, because the circuit design techniques will have come from the R&D division.

It's not entirely coincidence that they all are on high-performance SOI processes. IBM has low-power and bulk processes as well, and they would have been cheaper.
They used SOI for the speed: it's more complex, but it allows higher clock speeds. There are other advantages (it reduces leakage and can also reduce die size quite significantly).

That would be true, but the point you used, specifically tying OoO with more register file read ports was wrong.
In the context in which I made that comment it was correct; that the same is true for superscalar wasn't relevant, and I clarified this.

Furthermore, using POWER6 as an example is fraught with danger because the chip uses a gigantic amount of resources to make up for its being in-order. It does a number of things that would make no sense for an SPE.
I was simply saying that going OoO doesn't increase performance in all cases, and quoted an example where doing the opposite has given advantages. This won't happen for all applications or all processors, of course, but it does illustrate the point. In-order x86 processors tend to suck, and I don't expect that to change anytime soon.

Since clock scaling leads to rapid climbs in power consumption and future processes are making it increasingly difficult to yield great clocks, irrespective of design, console chips will likely have modest gains in clock speed, if any.

I'm sort of inclined to agree, but I'd like to see how IBM's high-k affects things. If they can crank the clock one last time, I suspect they will try. I don't think they'll just up the clock, though; they'll add more cores first, then see how high they can take them.
 
It's a trade-off: while its performance per MHz dropped, its performance per watt doubled. Aggressive OoO is a huge power hog; dropping it allowed them to add other things which increased performance and functionality without increasing power.
That's if you want to qualify things by saying that aggressive OoO is a huge power hog.
OoO doesn't appear to be a huge power hog for other example chips, in part because other factors can intrude.

I doubt they would have even got close to the performance of POWER6. P6 added a lot of extra stuff which has nothing to do with being in-order: the I/O system was all changed,
Massively increased bandwidth and lower interconnect latency are rather helpful for an in-order core that has a much lower tolerance for latency.

POWER6 without the enhanced infrastructure would suffer on all performance fronts. The cache size and enhanced I/O contribute a measurable amount to the performance of the chip.

there were a load of new instructions (AltiVec, decimal FP) as well as reliability features. If this had been added to POWER5+ it would have left little room for clock speed gains. P5+ didn't hit its clock speed goals as it was, so while the new process and design techniques would help, I can't see the same sorts of gains being made.
None of those things directly impacts clock speed, however.
IBM's reasons for going the route it did are many, but if POWER5 had been given a number of the gifts POWER6 enjoys, the gap would not be all that great.

...and you use the same compiler with the same level of tuning and assuming it's not data set sensitive.
POWER6 is privileged in that it has an entire platform, from CPU, system, and compiler to OS stack, tailored specifically for it.

A number of competitors do not have that advantage.

One is designed to run largely single-threaded apps in a desktop box. The other is designed to run largely multithreaded apps and sit in a large multiprocessor box. The trade-offs involved are completely different, so comparisons of power and cost are meaningless. E.g. P6 has a hefty I/O system needed to communicate with a large number of other processors; that alone means the power figures are going to be very different.
That does not mean POWER6 would be as impressive a performer if it didn't have those other compensating advantages.

There are also other factors: the high clock on P6 increases leakage, and 40% of its power goes to this. Intel has a lower clock rate, so leakage is vastly reduced.
It can go with a lower clock rate because it is more efficient per clock. Both OoO and clock speed scaling hit diminishing returns when taken to the extreme.

Anyway, I note you quoted SPECint; how about quoting SPECfp rate? Very different story there.
In part due to a number of significant ISA and implementation peculiarities on both designs that make it hard to tease out in-order or OoO as the primary cause.

POWER6 will have been in development for years; it probably started before Xenon did. While there could have been some knowledge flow, I doubt it would have been of any significance, because the circuit design techniques will have come from the R&D division.
The alliance between Sony, IBM and Toshiba started design work on Cell in 2001.
At that point, IBM had released POWER4, and was in the thick of designing POWER5.

A good portion of the design effort between the three in-order chips from IBM would have overlapped. It is not a coincidence that all three designs had the same goal of minimizing the logic complexity of pipeline stages to the degree they did.

IBM did benefit by having Sony and Microsoft foot the bill for work that would help IBM everywhere else.

In the context in which I made that comment it was correct; that the same is true for superscalar wasn't relevant, and I clarified this.
An OoO scalar pipeline would have the same number of register ports as a scalar in-order pipeline. A dual-pipeline OoO design has enough read ports for the two instructions it can issue at a time, same as an in-order design, because neither OoO nor in-order needs more operands than are required at the issue stage.

The width of the pipeline is the primary issue, as superscalar dependency checking and wide bypass networks scale quadratically.
A modest OoO implementation would not scale that badly.
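A rough back-of-the-envelope sketch of that quadratic growth with issue width (purely illustrative counts, not a model of any particular CPU):

```python
# How dependency-check comparators and bypass paths grow with issue width.
def dependency_comparators(issue_width, src_operands=2):
    # each instruction in an issue group checks its sources against the
    # destinations of every earlier instruction in the same group
    return sum(i * src_operands for i in range(issue_width))

def bypass_paths(issue_width, forwarding_stages=3, src_operands=2):
    # every result produced per cycle can feed every source operand of every
    # co-issued instruction, for each stage able to forward a result
    return issue_width * issue_width * src_operands * forwarding_stages

for w in (1, 2, 4, 8):
    print(w, dependency_comparators(w), bypass_paths(w))
```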
 