The Case for ARM in Next-Gen Consoles?

I'm intrigued by the notion of a fat cluster of ARM cores as a main CPU, though, and how performant such a part could be.

Throwing lots of cores on a chip is the easy part; Intel demonstrated proof-of-concept silicon with 200 P5 cores on a chip four or five years ago.
The problem is the memory architecture: 200 cores banging on a single FSB isn't going to work. Sure, you could probably demonstrate astonishing IPS for programs running on whatever local memory they have, but that's not really indicative of useful workloads.
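
(To put rough numbers on it, with purely hypothetical figures: even a generous 20 GB/s shared bus split across 200 cores is only ~100 MB/s per core, so anything that doesn't fit in whatever local memory each core has is going to crawl.)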
 
...sure you could probably demonstrate astonishing IPS for programs running on whatever local memory they have, but that's not really indicative of useful workloads.

Isn't that the whole concept of Cell and why they chose that direction instead of a cache for each SPE?

Sure, it was a PITA for devs initially, but I'd figure that between familiarity, middleware, and other toolsets, such a programming environment wouldn't be that problematic, or that different from Cell.

No?
 
Reusability of the Microsoft 720 ARM chip in other hardware, e.g. tablets, desktops, laptops, and/or set-top boxes, would be another huge advantage extending from this.

Microsoft could then turn around and generate revenue by licensing the chip out to other device makers.

PPC has no life in the consumer space beyond consoles. ARM is the total opposite.
Fair question. How about designing a next-generation system as a software thing, i.e. a virtual machine?
You have your hypervisor/virtualizer running on any architecture/system you want, as long as it meets the performance requirements of the virtual machine/next-gen system :)
 
I somewhat agree with you and Duck: for gaming, consoles rule, but their marginal value is going down as TVs start to ship with Ice Cream Sandwich. A lot of the services console manufacturers included, or planned to include, in their systems are being made irrelevant by market evolution.
For any important business, being stuck to a business plan of seven years or more is madness, especially now, when we are in many regards at a turning point.

There are ways to achieve greater stability in sales while combating things like smarter TVs (I don't see them taking off; I think most people like their phones smart and their TVs dumb ... aside from a Netflix streamer app, which people would love, as long as the interface didn't suck and it was priced the same as competing TVs without the feature).

They could go the cable-box route, where they talk to prospective cable companies across the regions and sell them on the idea of a cable box/Xbox combo unit. Sell it to the cable companies for roughly the same cost they are paying now for an STB; the cable companies gain another feather in their cap vs. dish/satellite, while the game company gets a huge boost in user base/potential customers.

ARM would be instrumental in such a setup, but not necessarily as the core CPU.
 
Fair question. How about designing a next-generation system as a software thing, i.e. a virtual machine?
You have your hypervisor/virtualizer running on any architecture/system you want, as long as it meets the performance requirements of the virtual machine/next-gen system :)

I like the concept, but I don't want next-gen limping along in a VMware construct. Too many potential performance issues.
 
Does Microsoft possess the engineering capability to design a next-gen console CPU core based on an ARMv8 64-bit design on its own, or would it have to contract outside help?
 
Isn't that the whole concept of Cell and why they chose that direction instead of a cache for each SPE?

Sure, it was a PITA for devs initially, but I'd figure that between familiarity, middleware, and other toolsets, such a programming environment wouldn't be that problematic, or that different from Cell.

No?

There's a huge difference between 8 cores and 200, or even 32.
 
Does Microsoft possess the engineering capability to design a next-gen console CPU core based on an ARMv8 64-bit design on its own, or would it have to contract outside help?

Could they do it?
Probably, although their hardware group is smaller than it used to be.
They would most likely either buy a company with relevant experience or work with an external party.
 
This is only my opinion, but FWIW I was never a Cell fan; I think the design was a dead end, and one that had been done before at smaller scales.

If we're talking about the future of computing, which is clearly parallelism, I think you end up trending toward where the GPU market is and where Cray went with the XMT. You budget computational resources, then have many hardware threads that share those resources. The threads let you hide communication latency (on and off chip, or even off box) and get close to actually exploiting the computational resources. Having said that, this assumes you can get enough parallelism out of a problem to make it worthwhile, and not all problems fall into that bucket.

If we're talking about a CPU for a next-gen console, then I think good single-threaded performance for general code is important. Whether that means reducing core counts or whatever doesn't really matter.
I think GPUs are at a point where they can do a lot of the computational heavy lifting, and outside of a very narrow set of problems in games, what game code (what most code) does is move memory around.
 
This is only my opinion, but FWIW I was never a Cell fan; I think the design was a dead end, and one that had been done before at smaller scales.

If we're talking about the future of computing, which is clearly parallelism, I think you end up trending toward where the GPU market is and where Cray went with the XMT. You budget computational resources, then have many hardware threads that share those resources. The threads let you hide communication latency (on and off chip, or even off box) and get close to actually exploiting the computational resources. Having said that, this assumes you can get enough parallelism out of a problem to make it worthwhile, and not all problems fall into that bucket.

If we're talking about a CPU for a next-gen console, then I think good single-threaded performance for general code is important. Whether that means reducing core counts or whatever doesn't really matter.
I think GPUs are at a point where they can do a lot of the computational heavy lifting, and outside of a very narrow set of problems in games, what game code (what most code) does is move memory around.

Interesting. I agree with your take on this, but I'm not seeing how you translate that to your stance in the BC thread. I digress...

___________________

I'm not sure how much you know WRT CPU design, but bear with me:

8 cores, no SMT, 400mm2 of logic, w/ 1MB cache per core
vs.
1 core, 8-way SMT, 400mm2 of logic, w/ 8MB cache

I would think the single core with 8-way SMT would be the victor here, assuming it had enough execution silicon and a scheduler that could feed it properly.

The obvious benefit in my mind would be that the monolithic core could put all of its resources into single-threaded performance, or split its resources accordingly to handle up to 8 smaller tasks simultaneously.
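
(To put hypothetical numbers on it, just to frame the question: 8 cores with 2 FP pipes each is the same 16 pipes in total as 1 core with 16 FP pipes; the difference I'm imagining is that the monolithic core could hand all 16 to a single thread when only one heavy task is running, instead of 14 of them sitting idle in other cores.)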

Am I missing something?

Why is such an approach not followed by bigger CPU designs from Intel/IBM/ARM/AMD?
 
Am I missing something?

Why is such an approach not followed by bigger CPU designs from Intel/IBM/ARM/AMD?

POWER7 relevant specs:

Big, heavy OoOE cores.
45nm
1.2 billion transistors.
567 mm2
Power diss. ~200 watts.
Peak performance 264 GFLOPS per chip (8 cores).
GFLOPS/watt 1.32
GFLOPS/mm2 0.46

PowerA2 based Blue Gene/Q relevant specs:

Small in-order cores.
45nm
1.47 billion transistors.
359 mm2
Power diss. 55 watts.
Peak performance 205 GFLOPS per chip (18 cores).
GFLOPS/watt 3.72
GFLOPS/mm2 0.57

Small multicore: ~181.8% more efficient per watt.
Small multicore: ~23.9% more efficient per mm2.

Big fast core needs a lot more "support" to achieve big fast-ness.
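
For anyone who wants to redo the arithmetic, here's the back-of-envelope in Python (only the numbers from the two lists above go in, nothing else is assumed):

# Quick check of the GFLOPS/watt and GFLOPS/mm2 figures above
power7 = dict(gflops=264.0, watts=200.0, mm2=567.0)  # POWER7, 8 cores, 45nm
bgq    = dict(gflops=205.0, watts=55.0,  mm2=359.0)  # PowerA2 Blue Gene/Q, 18 cores, 45nm

p7_w,  p7_a  = power7["gflops"] / power7["watts"], power7["gflops"] / power7["mm2"]
bgq_w, bgq_a = bgq["gflops"] / bgq["watts"],       bgq["gflops"] / bgq["mm2"]

print(f"POWER7:      {p7_w:.2f} GFLOPS/W, {p7_a:.2f} GFLOPS/mm2")
print(f"Blue Gene/Q: {bgq_w:.2f} GFLOPS/W, {bgq_a:.2f} GFLOPS/mm2")
print(f"Small cores: {100 * (bgq_w / p7_w - 1):.1f}% more per watt, "
      f"{100 * (bgq_a / p7_a - 1):.1f}% more per mm2")
# Prints roughly 182% and 23%; the tiny differences from the figures above
# are just intermediate rounding.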
 
Some time ago, in a thread about Sony's business model, I said that they are missing the point and that they should have a clear goal like "Android everywhere" and aim to be the leader in a few years. I overshot by a lot; Lenovo and others are already getting there with TVs that ship in China with Ice Cream Sandwich running. This will soon open casual gaming to many households, with casual games you can play on your phone (OS-agnostic on top of it), your tablet PC, and soon (if not now) on your TV. I think Sony is already lagging too far behind the competition in this regard / I wonder if they still have the know-how to beat Samsung, Lenovo and the like on that front. Clearly Sony has no choice but to be a bigger fish in a tinier pond, but it's not at all as good a situation as it sounds.

How would you control the Android games on your TV?
 
Interesting. I agree with your take on this, but I'm not seeing how you translate that to your stance in the BC thread. I digress...

___________________

I'm not sure how much you know WRT CPU design, but bear with me:

8 cores, no SMT, 400mm2 of logic, w/ 1MB cache per core
vs.
1 core, 8-way SMT, 400mm2 of logic, w/ 8MB cache

I would think the single core with 8-way SMT would be the victor here, assuming it had enough execution silicon and a scheduler that could feed it properly.

The obvious benefit in my mind would be that the monolithic core could put all of its resources into single-threaded performance, or split its resources accordingly to handle up to 8 smaller tasks simultaneously.

Am I missing something?

Why is such an approach not followed by bigger CPU designs from Intel/IBM/ARM/AMD?

My personal view on backwards compatibility is that I'd rather they just ignored it: it means nothing to me personally, and I'd rather it wasn't a consideration in hardware design. But those types of decisions are way above my pay grade.

I am not a CPU designer, but I would suggest that, given the same computational resources, the 8-way shared computation design would be more complex and show no better peak performance numbers, and in code not written to exploit the design I think you'd see minimal if any performance advantage. I base this on how long it took GPUs to pool their computation units.

When I talk about lots of hardware threads, I'm talking about cores with 100+ hardware threads capable of hiding 1000+ cycle latencies to memory (or network).
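
(Back-of-envelope with made-up but plausible numbers: if a thread manages ~10 cycles of useful work between memory references and a miss costs ~1000 cycles, you need on the order of 1000 / 10 = 100 runnable threads just to keep the execution units from going idle.)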

The problem with any of these large-scale parallel designs is that, outside of a small number of niche problem areas, software just isn't at a point where it can exploit them. And I personally don't see the "paradigm shift" coming soon.
 
My personal view on backwards compatibility is that I'd rather they just ignored it: it means nothing to me personally, and I'd rather it wasn't a consideration in hardware design. But those types of decisions are way above my pay grade.

Right, but I was more looking at it from the standpoint of: the CPU is losing importance, so why waste valuable time/money/resources on changing the architecture for something that, in the end, will not have much impact on the final result, where the heavy lifting will be done by the GPU.

Thus, scale the existing arch, and keep BC.

I am not a CPU designer, but I would suggest that, given the same computational resources, the 8-way shared computation design would be more complex and show no better peak performance numbers, and in code not written to exploit the design I think you'd see minimal if any performance advantage. I base this on how long it took GPUs to pool their computation units.

When I talk about lots of hardware threads, I'm talking about cores with 100+ hardware threads capable of hiding 1000+ cycle latencies to memory (or network).

How about the 32MB of eDRAM IBM has in their latest designs? Change the structure from a cache to a scratchpad/local store, and that would seem to be pretty useful in working around the latency issue for many use cases.

No?

The problem with any of these large-scale parallel designs is that, outside of a small number of niche problem areas, software just isn't at a point where it can exploit them. And I personally don't see the "paradigm shift" coming soon.

I could see that being an issue. That's why I was asking the more knowledgeable members of B3D *ahem* :D to pipe up about what kind of advantage a 50-core ARM might bring vs. a similarly sized 50-core Power CPU or other alternative hardware.
 
Right, but I was more looking at it from the standpoint of: the CPU is losing importance, so why waste valuable time/money/resources on changing the architecture for something that, in the end, will not have much impact on the final result, where the heavy lifting will be done by the GPU.

Thus, scale the existing arch, and keep BC.

I didn't say any of that; both current console CPUs have, relatively speaking, appalling single-threaded performance for their clock speed.
What I said was that single-threaded performance is important.
That does not discount all multi-core designs; it states the need for the cores to perform well on large blocks of single-threaded code.
If you go back and check my post history since the 360/PS3 were announced, you'll see that I've consistently stated that the CPUs in question were designed around stellar paper specifications, with no real eye to real-world performance.
It shouldn't be about teraflops or MIPS; it should be about being able to build something that works well without having to micromanage instruction scheduling.
This is why I advocate approaches like the Cray XMT that try to address the real performance issues, which are largely about cache misses.
You'll also note my continued push for programmers to have a better grasp of hardware architecture and what their programs actually do; I'm also a big believer in doing things right the first time.
But games are 3-million-plus-line applications costing tens of millions of dollars. It's a different development environment than it was even 10 years ago.

How about the 32MB of eDRAM IBM has in their latest designs? Change the structure from a cache to a scratchpad/local store, and that would seem to be pretty useful in working around the latency issue for many use cases.
No?

I don't like the idea of local stores outside of a small set of specific problems.
Even ignoring the concurrency issues with scratchpads, they will never be "big enough", and then you're back to shuffling data around.
If you assume that writing large-scale software is already challenging (and judging by some of the buggy crap that ships on all platforms, I think we have sufficient evidence to back up that hypothesis), why force developers to manage a scratchpad to get decent performance?
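
Even the simplest useful pattern, streaming data through the scratchpad with double buffering, puts the programmer on the hook for bookkeeping a cache does for free. A toy Python sketch of the shape of it (dma_start/dma_wait/process are made-up stand-ins, not any real API):

# Toy sketch of manual double-buffered streaming through a local store.
CHUNK = 64 * 1024                     # how much fits comfortably in the scratchpad

def stream_process(data, dma_start, dma_wait, process):
    # Split the data into scratchpad-sized chunks.
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    if not chunks:
        return
    pending = [None, None]            # in-flight "transfer" handles, one per buffer

    pending[0] = dma_start(chunks[0])                # prefetch the first chunk
    for i in range(len(chunks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(chunks):
            pending[nxt] = dma_start(chunks[i + 1])  # kick off the next transfer early...
        buf = dma_wait(pending[cur])                 # ...then wait only for the current one
        process(buf)                                 # compute overlaps the in-flight transfer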
 
POWER7 relevant specs:

Big, heavy OoOE cores.
45nm
1.2 billion transistors.
567 mm2
Power diss. ~200 watts.
Peak performance 264 GFLOPS per chip (8 cores).
GFLOPS/watt 1.32
GFLOPS/mm2 0.46

PowerA2 based Blue Gene/Q relevant specs:

Small in-order cores.
45nm
1.47 billion transistors.
359 mm2
Power diss. 55 watts.
Peak performance 205 GFLOPS per chip (18 cores).
GFLOPS/watt 3.72
GFLOPS/mm2 0.57

Small multicore: ~181.8% more efficient per watt.
Small multicore: ~23.9% more efficient per mm2.

Big fast core needs a lot more "support" to achieve big fast-ness.
Then there is the real-world benchmark: how close they come to their max FLOPS figures, running which applications, etc.
Comparing CPUs on GFLOPS figures (peak theoretical ones, on top of it) is doomed; 7 years of bad sex may follow lol
 
How would you control the Android games on your TV?
Well, in the case of Lenovo (they are the first and only one for now), the TV comes with both a remote control and a controller.
My Tegra 2-based phone still hasn't been upgraded to ICS, so I can't confirm, but I read that ICS provides better support for peripherals like external keyboards, controllers, and possibly mice.
Android is a full-blown OS, still evolving and evolving fast; I don't think we would miss options. I would not be surprised if the best fit would simply be another ICS device.
 
I didn't say any of that; both current console CPUs have, relatively speaking, appalling single-threaded performance for their clock speed.
What I said was that single-threaded performance is important.
That does not discount all multi-core designs; it states the need for the cores to perform well on large blocks of single-threaded code.
If you go back and check my post history since the 360/PS3 were announced, you'll see that I've consistently stated that the CPUs in question were designed around stellar paper specifications, with no real eye to real-world performance.

Gotcha. Sorry, I misunderstood this statement:
...If we're talking about the future of computing, which is clearly parallelism, I think you end up trending toward where the GPU market is...



It shouldn't be about teraflops or MIPS; it should be about being able to build something that works well without having to micromanage instruction scheduling.

I understand the issues most had with the SPEs' local store, but I thought the SDK for Xenos was relatively developer-friendly ... :???:

I don't like the idea of local stores outside of a small set of specific problems.
Even ignoring the concurrency issues with scratchpads, they will never be "big enough", and then you're back to shuffling data around.
If you assume that writing large-scale software is already challenging (and judging by some of the buggy crap that ships on all platforms, I think we have sufficient evidence to back up that hypothesis), why force developers to manage a scratchpad to get decent performance?

Interesting.

What do you think the chances are of Sony or MS addressing this issue in their next-gen designs?
 
The Threadstorm processor in the Cray XMT is interesting indeed.

I'd never heard of this "barrel processor" concept before. It's an interesting method of dealing with large latency issues and stalls.

I also like the rotation of thread execution, where the CPU is constantly working on an op (assuming it isn't flooded with hundreds of thread calls to main memory at the same time).
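
Roughly how I picture the rotation working, as a toy model (Python; the miss rate and latency are numbers I made up, and this has nothing to do with the actual Threadstorm internals):

# Toy round-robin ("barrel") scheduler: every cycle, issue from the next thread
# that isn't waiting on memory. With enough threads, the core almost never stalls.
import random

THREADS, CYCLES, MISS_LATENCY, MISS_RATE = 128, 100_000, 1000, 0.02

ready_at = [0] * THREADS          # cycle at which each thread can issue again
issued = 0
for cycle in range(CYCLES):
    for t in range(THREADS):      # rotate through threads, take the first ready one
        tid = (cycle + t) % THREADS
        if ready_at[tid] <= cycle:
            issued += 1
            if random.random() < MISS_RATE:           # this op missed to memory
                ready_at[tid] = cycle + MISS_LATENCY  # park the thread until the data returns
            break                 # one issue slot per cycle

print(f"utilisation: {issued / CYCLES:.1%}")  # with 128 threads, typically ~100%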


I'd think such an approach could be adapted to different ISAs, and if so, well, that would be interesting.

If not, licensing the Cray Threadstorm seems plausible...

...Threadstorm chip supports 128 threads. Better yet, each Threadstorm draws just 30 watts, or about a third that of a high-end x86 CPU. In addition, the XMT supports fine-grained synchronization in the hardware, in order to hide latencies across the threads.

http://www.hpcwire.com/hpcwire/2011-01-26/cray_pushes_xmt_supercomputer_into_the_limelight.html
 