Predict: The Next Generation Console Tech

Reading the reviews made me sad. Where's the AMD of old? I mean, seriously, they're not even competing right now, they're just screwing around.
There was a blog post about this from a former engineer. It seems all of the "AMD of old" has left.
 
Something that I never understood about IPC and multithreading is when they say for example Dual Issue and Multithreaded x 2. Sorry if I sound like an idiot but I need to know if they are talking about 2 instructions per cycle shared between 2 threads or 2 instructions per thread.
 
From some reviews I read, I saw that Bulldozer actually has quite a significant advantage in Integer performance. As I am not a games programmer, I can't really tell if that's good for gaming, though.


I remember MS touting Xenon's superior integer performance over Cell as an advantage early in the gen. Take with salt.
 
Something that I never understood about IPC and multithreading is when they say for example Dual Issue and Multithreaded x 2. Sorry if I sound like an idiot but I need to know if they are talking about 2 instructions per cycle shared between 2 threads or 2 instructions per thread.
I believe it depends on the type of multithreading. Depending on the architecture, a CPU may or may not issue instructions from the available threads concurrently. I remember reading that Xenon/the PPU used a simple form of multithreading (round-robin / barrel processing), so only instructions from a single thread are issued at any one time. I believe Intel's SMT is different, as instructions from the two different threads can be issued concurrently.
A more complex, Intel-like implementation makes a lot of sense as CPUs are wide and have a lot of execution units. Still, I read in some presentations that the Power A2 should be able to issue instructions concurrently (I'm not sure; IBM was not that precise in this regard, and we may learn more as the chip ships).
 
I like the sound of a pp6 derivative: in-order, with 4-way SMT.
4 cores x 4 threads for a total of 16 threads sounds nice.

Why? Can't we finally have proper OoO CPUs in consoles, so they wouldn't suck quite so much?

Especially given that IBM now has a complete design that's more or less tailor-made for consoles, with both physical and synthesizable solutions (476FP and 470S). I'd much rather have a single core of those than 16 cores of in-order crap.

Something that I never understood about IPC and multithreading is when they say for example Dual Issue and Multithreaded x 2. Sorry if I sound like an idiot but I need to know if they are talking about 2 instructions per cycle shared between 2 threads or 2 instructions per thread.

Dual issue means two instructions issued, period. Whether they need to come from the same thread, or can contain an instruction from both threads, depends on design. It can even be both, just in different parts of the pipeline. For example, in Intel SNB, the frontend will read and decode up to 4 instructions per clock from a single thread, potentially swapping threads every cycle. Down from there, the instructions can be mixed and matched in whatever way the CPU chooses, and after their registers are renamed, the execution pipeline doesn't even really know or care what thread they are from.
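To make the difference concrete, here's a toy issue model (entirely illustrative; the function names and per-cycle policies are my own simplification, not any real core's): a dual-issue core draining two thread queues, once with both slots tied to a single thread per cycle (round-robin style, as the Xenon PPU is described above) and once with slots fillable from either thread (SMT style).

```cpp
#include <algorithm>
#include <cassert>

// Toy dual-issue model (illustrative only): `a` and `b` are the number of
// pending instructions in two hardware threads; the core issues at most
// 2 instructions per cycle. Returns cycles needed to drain both threads.

// Round-robin / barrel style: both issue slots come from one thread per
// cycle, and the active thread swaps every cycle even if it has no work.
int cycles_round_robin(int a, int b) {
    int cycles = 0, turn = 0;
    while (a > 0 || b > 0) {
        int &t = (turn == 0) ? a : b;
        t = std::max(0, t - 2);   // up to 2 instructions from the active thread
        turn ^= 1;                // swap threads every cycle
        ++cycles;
    }
    return cycles;
}

// SMT style: the two slots may be filled from either thread, so an empty
// or stalled thread never wastes a whole cycle.
int cycles_smt(int a, int b) {
    int cycles = 0;
    while (a > 0 || b > 0) {
        int slots = 2;
        int fromA = std::min(a, 1); a -= fromA; slots -= fromA;
        int fromB = std::min(b, slots); b -= fromB; slots -= fromB;
        int extra = std::min(a, slots); a -= extra;  // leftover slots go back to A
        ++cycles;
    }
    return cycles;
}
```

With one thread idle, the round-robin core wastes every other cycle (cycles_round_robin(4, 0) is 3, against 2 for the SMT version), which is the price the barrel approach pays for its simplicity.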
 
So, I think it's fair to assume we won't have a Bulldozer as a CPU...
I hope so; it's a big, inefficient, hot, power-hungry chip, not what we need. Still, I find that, messy implementation aside, AMD has somehow proved that CMT can be a good concept.
I read this review of Bulldozer (among others), and this page somewhat proves that the concept is valid. But taking the chip as a whole, with its sub-par results, I'm not sure this will be enough to convince competing engineering teams to follow AMD. It depends on what one wants from a CPU: CMT, properly implemented, may be a good option for raising a chip's integer throughput at a reasonable cost, and it also helps amortize the cost of a decent front end.
I don't know which philosophy Sony and MS will choose for their designs: a significant rise in IPC, or pushing further ahead with TLP.
Anyway, that's idle talk, as I can't see IBM going for this approach.
Sometimes I hope Intel's sales volumes will crumble and they'll be desperate to keep their facilities busy... as they are so far ahead of the curve in CPU performance (damn, a Core i3 keeps up with a 4-module / 8-core Bulldozer in most games :oops: ).

At least Bulldozer gives us some concrete hints about GlobalFoundries' 32nm process density: with a lot of cache, they managed to pack 2 billion transistors onto a 315mm² die.
 
CVG next-gen feature with Saber Interactive CEO Mathew Karch

http://www.computerandvideogames.co...al-things-id-be-shocked-if-it-isnt-out-first/


The next-generation of consoles will do great things. We're limited in what we can do right now in terms of games and that comes primarily from the power of the processors.

The best way to put it is it's kind of like being given a Lego set with 100 blocks and a set with 1000 - you can do a lot more with the second set. You have more wiggle room and more blocks to make something big and great.
 
It's unclear what he means by "power of the processors"; it sounds like he's speaking of CPU power, but it could also be (taking GPGPU into account) overall processing power.

I'm not sure what he means by this:
New consoles are going to be awesome for that; they're going to enable you to do new things in terms of design, and if you can do them in smaller chunks, great.
That sounds a lot like fine-grained multitasking. I wonder if that could be interpreted as TLP being the focus of the CPU design.
 
I've heard other developers mention many cores.
Maybe IBM has promised some anemic many-core CPU monster for the future. Imagine a Bobcat-type CPU with full-speed L2 cache, running at a modest 3GHz, but with 16 cores on die. Sound nice?
 
I've heard other developers mention many cores.
Maybe IBM has promised some anemic many-core CPU monster for the future. Imagine a Bobcat-type CPU with full-speed L2 cache, running at a modest 3GHz, but with 16 cores on die. Sound nice?

Do you mean the PPC 470? It has 4 cores per cluster, up to 4 clusters on chip.

Also, bobcat isn't exactly anemic compared to what we are used to in the console world. Quite the opposite.
 
I don't think 3D chips will find their place in the next incarnation of the Xbox or PlayStation, but I think it's a given for the ones that appear beyond 2014.

Pretty interesting article on the topic.

I found this particularly encouraging.

The Joint Electron Device Engineering Council is pioneering a Wide I/O standard for 3-D ICs that’s due by year’s end. The Jedec spec will support 512-bit-wide interfaces.
 
I found this particularly encouraging.
> The Joint Electron Device Engineering Council is pioneering a Wide I/O standard for 3-D ICs that’s due by year’s end. The Jedec spec will support 512-bit-wide interfaces.

That's actually not nearly as interesting as it sounds. JEDEC Wide I/O is meant for mobile and embedded devices, and it uses the large width to compensate for not running the interface faster than the RAM chips themselves, as most present memory standards do. Presently the fastest Wide I/O spec is 512 bits × 200MHz, or 12.8GB/s. This makes sense: when the RAM chips are integrated, wide interfaces are cheap, and wide-and-slow takes only a fraction of the power of narrow-and-fast.

But I'm pretty sure they'd have real trouble fitting eight of those on a single shrunk-once next-gen console CPU, so they can't really match the bandwidth of GDDR5 (or DDR4).
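The arithmetic behind those figures (assuming single data rate, which is what Wide I/O uses; the helper name is mine):

```cpp
#include <cassert>

// Peak bandwidth of a single-data-rate memory interface:
// (width in bits / 8 bits per byte) * clock in transfers per second.
long long bandwidth_bytes_per_s(long long width_bits, long long clock_hz) {
    return width_bits / 8 * clock_hz;
}
```

512 bits at 200MHz is 64 bytes per transfer times 200M transfers/s, i.e. 12.8GB/s per stack, which is why you'd need many stacks to approach what GDDR5 delivers.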
 
Do you mean the PPC 470? It has 4 cores per cluster, up to 4 clusters on chip.

Also, bobcat isn't exactly anemic compared to what we are used to in the console world. Quite the opposite.

PPC 470 looks nice. It's OoO but designed to scale only to 2GHz on 45nm, probably due to its very short pipeline (9 stages, according to Wikipedia). It seems very power-efficient: only 1.6W at 1.6GHz! Would it be good enough for consoles as-is?

Looking at the LSI Axxia ACP3448 implementation, I would say no. Maybe VSX and other features to speed up graphics work would make it more suitable. Any idea of its die size in various configurations?
 
Do you mean the PPC 470? It has 4 cores per cluster, up to 4 clusters on chip.

Also, bobcat isn't exactly anemic compared to what we are used to in the console world. Quite the opposite.
We had this discussion quite some pages ago, and whereas Bobcat most likely beats the Xenon PPU in IPC, it's unclear how high it could clock and thus how much of the clock speed difference it could make up.
Drifting from the Bobcat-specific case, I believe manufacturers face a tough choice: they have to balance contradictory goals:
1. Higher IPC weighted by clock speed (measured in MIPS?).
2. Higher power efficiency and lower power consumption (per core, at least).
Point 2 would call for significantly lower clock speeds than in today's systems. That in turn calls for a substantial increase in IPC just to match today's single-threaded performance, and that has a silicon cost, possibly a significant one, to beat current systems while running at much lower clocks (for power consumption).
So bigger cores mean fewer cores (especially in the case of an APU). I'm not sure what kind of sustained throughput manufacturers are aiming for, but my hunch is to consider lower or equal single-threaded performance an option.
So we could be looking either at simple in-order cores with 4-way SMT (most likely lower single-threaded performance), possibly close relatives of the Power A2/EN architecture, or at simple OoO cores with 2-way SMT (roughly equal single-threaded performance, with a somewhat lower core count or a bigger area devoted to CPU cores).

If they want significantly higher single-threaded performance, 4 cores may be the maximum for a SoC; more for a discrete CPU chip.
I guess it depends on the kind of throughput they want the CPU cores to achieve, and on how much they deem it relevant to offload some tasks to the GPU.

Criticizing Cell is not relevant to the topic, but honestly, to be fair, I believe neither Microsoft nor Sony made the most of their silicon budget. It would be funny/interesting to discuss what better designs might have looked like, but manufacturers can't fully predict how their systems will be used, etc. It didn't turn out badly in the long run.
 
Personally I would prefer an architecture that is simple enough to achieve good throughput in highly multithreaded workloads. Single-threaded performance is not that important for games, since all current game engines are highly multithreaded (most are job-based, and scale pretty much perfectly with increased thread counts). Adding extra cores/threads (beyond the current six we have in Xbox 360) wouldn't increase development time at all. 2/4-way SMT would fit the simple-core / high-thread-count approach perfectly. It would help keep the (simple) pipelines filled at all times at a very low silicon cost, and would lower the manual optimization burden on developers. I would prefer out-of-order execution, since it solves so many pipeline stall cases efficiently, but with 4-way SMT, in-order execution wouldn't be that much of a burden either. Still, instruction reordering and LHS stall fixing can be pretty time-consuming to do manually.
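For what it's worth, the job-based scaling described here can be sketched in a few lines (a deliberately minimal sketch; real engines use per-thread queues and work stealing rather than one shared atomic counter, and all names are mine):

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Minimal job-system sketch: a frame is split into many small independent
// jobs; worker threads pull job indices from a shared counter until the
// frame's job list is drained, then everyone joins at the frame barrier.
void run_jobs(const std::vector<std::function<void()>> &jobs, int workers) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // fetch_add hands each worker a unique job index
            for (size_t i; (i = next.fetch_add(1)) < jobs.size();)
                jobs[i]();          // jobs must be independent of each other
        });
    for (auto &t : pool) t.join();  // frame barrier
}
```

Because each worker just pulls the next available job, scaling to more cores/threads is only a matter of passing a larger `workers` count; the game code submitting jobs doesn't change at all.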

I would prefer a single address space (unified memory), because it makes things much easier. It allows efficient low latency communication between the CPU cores/threads and between the CPU and the GPU (mixed CPU/GPU calculation will be really popular in the future games). To reduce the traffic to shared main memory, all execution units should have efficient low latency caches. In addition to automatic cache control, there should be solid instruction set for manual cache control. I wouldn't mind cache lines getting larger or not having exotic fully automatic prefetchers, since most talented developers are already using structures that are cache-optimized, and have predictable access patterns.
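As an illustration of the "cache-optimized structures" point (the layout and names are my own example, not from the post): when a pass only reads positions, packing the hot fields contiguously cuts the number of cache lines swept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: 32 bytes per particle, so a position-only pass
// still drags the cold velocity/age bytes through the cache.
struct ParticleAoS {
    float x, y, z;                   // position (hot for this pass)
    float vx, vy, vz, age, pad;      // cold, but shares the cache lines
};

// Structure-of-arrays: hot fields packed contiguously, cold fields
// stay out of the hot loop's cache lines entirely.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz, age;
};

// Cache lines swept when reading one 4-byte field from each of n records
// laid out with the given stride (field at offset 0, 64-byte lines).
std::size_t lines_touched(std::size_t n, std::size_t stride) {
    return (n * stride + 63) / 64;
}
```

For 1024 particles, the AoS pass sweeps 512 cache lines while the SoA pass sweeps 64: an 8x reduction in memory traffic for the same predictable, prefetch-friendly access pattern.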
 
Nice, thanks for the input sebbbi. I was wondering: looking at PC game benchmarks, Sandy Bridge is able to pull much higher FPS, all else being equal (apart from the chipset), than Bulldozer and Phenom II.

Why is this? And would that fact alone (a stronger CPU) increase visual fidelity, or are those extra frames wasted, unable to be used for anything purposeful? What I mean by that is: can the CPU be given other tasks to improve the virtual world, whether AI, physics, or more particle effects, while targeting a reasonable 60fps?
 
Single threaded performance is not that important for games, since all current game engines are highly multithreaded (most are job based, and scale pretty much perfectly along increased thread counts).
Agreed with the sentiment, but the scaling of current game workloads leaves a lot to be desired. I've yet to see a game that makes good use of more than a quad core on PC for instance and most only show moderate increases from dual -> quad.

Good work stealing schedulers and job systems should scale really well assuming a game engine has a reasonable task granularity, but for whatever reason I've not seen any really good poster child for this in practice. Anyone know of a game that scales really well on PC? On consoles it's a bit harder to know since you can't exactly go buy a 12 core 360 or something.

Agreed on going out-of-order. It's just a lot easier to deal with and not really an unreasonable hardware cost IMHO.

Agreed on your memory hierarchy ideas as well. From experience with CPUs and Larrabee I can say that caches + cache manipulation instructions (evict, prefetch#, temporal/non-temporal, exclusive) provide the most powerful set of tools for expressing a broad range of memory access patterns. It is nice to have a basic hardware prefetcher though as it works well for a large fraction of your code that it is otherwise tedious to insert software prefetching code for. As you note, it handles stuff like streaming arrays/structures in and out very efficiently.
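A minimal sketch of the software-prefetch side of this, using the GCC/Clang `__builtin_prefetch` builtin (the 16-element prefetch distance is a placeholder one would tune per platform; the function is my own example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Streaming pass with manual software prefetching: while summing element i,
// hint the hardware to start fetching element i+16 so the memory latency
// overlaps with useful work instead of stalling the loop.
long long sum_with_prefetch(const std::vector<int> &v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i + 16 < v.size())
            // rw=0: prefetch for read; locality=0: non-temporal, i.e. the
            // data can be evicted right after use (the streaming case above)
            __builtin_prefetch(&v[i + 16], 0, 0);
        s += v[i];
    }
    return s;
}
```

For a plain forward stream like this a hardware prefetcher would do fine on its own, which is exactly the point above: the manual instructions earn their keep on the access patterns the hardware can't predict.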
 