Predict: The Next Generation Console Tech

The weird thing is that the fundamental reason why Cell is what it is, and any trouble that may cause, did not go away.

Yes, those who don't understand history will always be with us.


Cell designers did not toss the traditional memory model out by accident.

No, they did it out of either willful ignorance or plain ignorance.

It is a willful exchange for more core scalability,

This isn't true. The memory model has little to do with the scalability. In fact, if anything they chose the less scalable path.

It is a very deliberate shuffling of complexity out of the hardware (which incurs material costs per piece), into the software (which incurs one-time costs but is then either free or very cheap to replicate en masse).

No, it's not a one-time cost. It is a continual cost with each project and each minor change that is made. There is no flexibility in the model and, as such, even minor changes can completely break the existing code and require complete rewrites.

Sony is in the business to mass-manufacture things.

Sony is in the business of marketing things. They farm out a large part of their manufacturing.


Coherent caches may be "nicer", but the cost of coherency scales exponentially with the number of cores.

linearly actually.
 
I don't know where you are getting this darker days talk wrt development from. It's just as dark now as it was then.

I think you're viewing it from an architectural perspective rather than a tools/experience perspective; Sony's tools are mature enough at this point (night and day compared to launch), and there are enough Cell-centric developers embedded in the major houses, that I don't think Cell is any longer a barrier in and of itself to title development. I think the steady reduction in Cell-bitching stories out of the general developer populace points to as much.

I'm not making light of it via that last statement; I am among the first to acknowledge the challenges the architecture brought to the majority of the industry, but the majority of the pain is in the past rather than the present. And I do think that if assessed anew today, it would hold up better than x number of years ago. That is not to say that it would win its place in PS4 de facto, but that it would hold itself up better in terms of attitudes from the dev community.
 
As does the snooping bandwidth necessary per core without tag directories, so the actual area devoted to it increases quadratically with the number of cores.

linearly with increasing cores. If you aren't getting linear, then you are doing it wrong.
 
I don't know where you are getting this darker days talk wrt development from. It's just as dark now as it was then.

It's better than launch or pre-launch. Sony has tools that plug in to VS, a job manager for SPUs (after everybody wrote their own, of course), improved SPU debugging, etc. The fundamental issues are still there, of course, but the biggest problems for the PS3 in terms of ports are the memory split and the GPU. Of course, that's not necessarily to suggest that Cell will appear in a successor.
 
linearly with increasing cores. If you aren't getting linear, then you are doing it wrong.
Not even the necessary aggregate bandwidth for "normal" communication scales linearly with the most efficient topology, a mesh, even though the individual bandwidth needs per core for that traffic are constant.

Snooping, meanwhile, has increasing bandwidth needs per core as the number of cores increases, and a network topology which scales far worse. The network overhead per core in a ring bus topology wouldn't stay constant (which is necessary for true linear scaling) even without coherency, and would increase faster per core than with a mesh; snooping traffic just makes it worse. The Larrabee team didn't go for a hierarchical ring with tag directories in between for a laugh for their next generation Larrabee ... it was necessity. All that extra complexity did most certainly come with an increased overhead per core (and increased overhead per core means super-linear scaling).
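
To put rough numbers on that, here's a toy back-of-envelope model in Python of per-core snoop bandwidth for pure broadcast snooping versus a directory that only forwards to actual sharers. The miss rate, sharer count and line size are made up purely for illustration, not taken from any real chip:

[CODE]
# Back-of-envelope model of per-core coherence bandwidth vs. core count.
# All constants below are assumptions for illustration only.

LINE_BYTES = 64          # coherence granularity (assumed)
EVENTS_PER_CORE = 10e6   # coherence events per core per second (assumed)
AVG_SHARERS = 2          # caches actually holding a given line on average (assumed)

def broadcast_bw_per_core(n_cores):
    # Pure broadcast snooping: each core must examine traffic from every other
    # core, so per-core bandwidth grows with N and aggregate grows ~N^2.
    return (n_cores - 1) * EVENTS_PER_CORE * LINE_BYTES

def directory_bw_per_core(n_cores):
    # Directory/filtered scheme: each event only reaches the actual sharers,
    # so per-core bandwidth stays roughly flat and aggregate grows ~N.
    return AVG_SHARERS * EVENTS_PER_CORE * LINE_BYTES

for n in (8, 32, 128):
    print(f"{n:4d} cores: broadcast {broadcast_bw_per_core(n)/1e9:6.1f} GB/s/core, "
          f"directory {directory_bw_per_core(n)/1e9:4.1f} GB/s/core")
[/CODE]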

Anyway, it's not me who says it scales poorly ... it's the SCC group from Intel :p
 
Yeah, sorry, quadratically. My bad.

Of course it's possible to limit size increases below the "naive" scaling curve by tweaking the topology, but that will limit your comms bandwidth and/or introduce more latency. I put "naive" in quotes because that scaling is what maintains performance, so it's not "wrong" per se.

E.g. you can couple pairs or triplets of nodes plus a downlink to some arbitrating node (which tends to be the next cache level up in practice). This gives you "free" scaling beyond two/three nodes, because you'll have no more than three/four links meeting at any node in the resulting tree, but a tree doesn't perform like a star topology either.

You should also ask yourself if you'd really put double-digit MBs of L3 cache in there if it weren't for inter-core communication one way or another.
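
For what it's worth, here's a toy Python comparison of that pairing-tree idea against a flat star: fan-in per node stays bounded (the "free" scaling), but bisection width and hop depth don't look anything like a star's. All numbers are purely illustrative:

[CODE]
# Toy comparison of a binary pairing tree vs. a star hub, to illustrate the
# trade-off above. "Fan-in" = links meeting at a node; "bisection" = links cut
# to split the leaves in half; "depth" = worst-case hops to the top.
import math

def binary_tree_stats(n_leaves):
    max_fan_in = 3                        # 2 child links + 1 uplink per arbiter, independent of N
    bisection  = 1                        # one link near the root separates half the leaves
    depth      = math.ceil(math.log2(n_leaves))
    return max_fan_in, bisection, depth

def star_stats(n_leaves):
    return n_leaves, n_leaves // 2, 1     # hub fan-in grows with N, which the tree avoids

for n in (8, 32, 128):
    print(n, "tree:", binary_tree_stats(n), "star:", star_stats(n))
[/CODE]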
 
The Larrabee team didn't go for a hierarchical ring with tag directories in between for a laugh for their next generation Larrabee ... it was necessity. All that extra complexity did most certainly come with an increased overhead per core (and increased overhead per core means super-linear scaling).
Was it ever confirmed that Larrabee would have directories at ring cross-links?
I suppose it could have had snoop filters rather than a full directory-based coherency scheme. It didn't sound like Intel had actually changed the old-fashioned broadcast scheme of the old cores, and a real directory would need more changes to that part of the system.
 
No, you're right ... I was mis-remembering, they only showed tag directories on the cores and no filtering at the crosslinks. So access to shared data still causes global broadcasts ... guess I was just being optimistic.
 
Snooping, meanwhile, has increasing bandwidth needs per core as the number of cores increases, and a network topology which scales far worse.

Then you are doing it wrong. There are reasons why a variety of different coherence topologies/tactics have been invented, used, and studied over the years. Given the right coherence system, it is effectively linear to sub-linear.

Anyway, it's not me who says it scales poorly ... it's the SCC group from Intel :p

wake me up when anyone has anything close to a couple hundred nodes per chip.
 
No, you're right ... I was mis-remembering, they only showed tag directories on the cores and no filtering at the crosslinks. So access to shared data still causes global broadcasts ... guess I was just being optimistic.

Why would anyone ever need to do global broadcasts?
 
Because if the data is shared, it can be shared by every single cache on the chip; without a snoop filter between the rings, broadcasting invalidates or updates is the only option on a write.
 
Then you are doing it wrong. There are reasons why a variety of different coherence topologies/tactics have been invented, used, and studied over the years.
Because the pre-existing ones didn't scale so well. There is a reason the SCC guys think software-managed coherency on a packet-switched on-chip network is the future too.
wake me up when anyone has anything close to a couple hundred nodes per chip.
Well, how many do you think next gen Larrabee will have when/if it finally comes out?
 
Because if the data is shared, it can be shared by every single cache on the chip; without a snoop filter between the rings, broadcasting invalidates or updates is the only option on a write.

MfA, read the hotchips paper/slides on Nehalem-EX.
 
Because the pre-existing ones didn't scale so well. There is a reason the SCC guys think software-managed coherency on a packet-switched on-chip network is the future too.

No, because different configurations and levels of a hierarchy work better with different coherence protocols and different combinations of coherence protocols.

My axiom: the only reason anyone thinks software coherency is the future is that they don't think they will have to write the software.

We already have coherency algorithms that scale to 100+ nodes. If you need 100+ nodes, it can be done. They've been proven in silicon, and they have been proven to scale. The bigger issue is interconnects in general when you get up to large node counts and don't have an extremely simple workload.
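
For anyone following along, the basic idea that makes those schemes scale is the directory: track which caches actually hold a line, so a write only contacts the real sharers instead of broadcasting to everyone. A minimal Python sketch of the textbook idea (not any particular product's protocol; the names and numbers are made up):

[CODE]
# Minimal sketch of directory-based coherence: a home directory remembers the
# sharers of each line, so a write sends point-to-point invalidates to just
# those caches. Purely illustrative of the textbook scheme.

class Directory:
    def __init__(self, n_caches):
        self.n_caches = n_caches
        self.sharers = {}                     # line address -> set of cache ids

    def read(self, cache_id, addr):
        self.sharers.setdefault(addr, set()).add(cache_id)

    def write(self, cache_id, addr):
        targets = self.sharers.get(addr, set()) - {cache_id}
        self.sharers[addr] = {cache_id}       # writer becomes the sole owner
        return sorted(targets)                # invalidates sent: O(|sharers|), not O(n_caches)

d = Directory(n_caches=128)
d.read(3, 0x1000)
d.read(7, 0x1000)
print("invalidates sent to caches:", d.write(0, 0x1000))   # -> [3, 7], not all 128
[/CODE]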

Well, how many do you think next gen Larrabee will have when/if it finally comes out?

Let me know when GPUs cross the 50 mark... The current gen are sitting at 16 and 10.

But the future of graphics really is coherence, as pin bandwidth isn't going to scale at the historical rates. The only option is to utilize coherence and caches to minimize off chip requirements. Software is hard enough as is without having to go to the next step and manage every bit of coherence as well.
 
No, because different configurations and levels of a hierarchy work better with different coherence protocols and different combinations of coherence protocols.

so, in essence, what's best depends on what you want your chip to be best at.

just like right now, you may have Sun's Niagara best at massively parallel and simple general purpose tasks, x86 or POWER best at heavy general purpose FP load, Cell good at some things, GPU great at other things, or likewise for SMP versus clusters.

We may imagine a chip down the road as 10% area for OOOe SMP cores, 25% as "coherent manycore" and the rest as "non-coherent manycore"?
 
just like right now, you may have Sun's Niagara best at massively parallel and simple general purpose tasks
Why would Niagara not be good for complex tasks? AFAICS, 8-way SMT should be a good compromise for OoOE. Of course, you'd need 128 threads in flight for this.

But why would Niagara be a poor fit if you had 128 complex threads running concurrently?
 
We already have coherency algorithms that scale to 100+ nodes.
CPUs have such pitifully little bandwidth needs though, and the cores are so huge.

Let's say the Larrabee team had put a banked L0 in there with full-speed scatter/gather in the absence of bank conflicts (like GPU local memory). Would you really want to use generic hardware coherency for the scatter/gather accesses? IMO you want to at least be able to tell the hardware where the data is (if not already exclusively owned) through the page table ... not so much software coherency as software-assisted coherency.

Coherency is the future, hardware coherency might even be the future ... the classical hardware enforced sequential consistency you get from snooping with ring/tree buses almost certainly isn't. As you say, interconnect is a problem ... and ring/tree buses are a handicap, so the software will have to get smarter.

Cypress has 20 SIMDs BTW.
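
To make the page-table idea a bit more concrete, here's a purely hypothetical Python sketch of what "software-assisted" could mean: software marks pages with an ownership hint, and accesses to pages marked private skip the coherence machinery. This is just an illustration of the argument above, not a description of Larrabee or any real page-table format:

[CODE]
# Hypothetical sketch of software-assisted coherency via per-page hints.
# The hint names and page size are assumptions for illustration only.

PAGE_SIZE = 4096
PRIVATE, SHARED = "private", "shared"

class PageHints:
    def __init__(self):
        self.hints = {}                            # page number -> hint

    def mark(self, addr, hint):
        self.hints[addr // PAGE_SIZE] = hint       # software declares ownership

    def access(self, addr):
        if self.hints.get(addr // PAGE_SIZE, SHARED) == PRIVATE:
            return "local access, no coherence traffic"
        return "coherence lookup required"

pt = PageHints()
pt.mark(0x10000, PRIVATE)                          # e.g. a per-core scratch buffer
print(pt.access(0x10040))                          # -> local access, no coherence traffic
print(pt.access(0x90000))                          # -> coherence lookup required
[/CODE]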
 
Why would Niagara not be good for complex tasks? AFAICS, 8-way SMT should be a good compromise for OoOE. Of course, you'd need 128 threads in flight for this.

But why would Niagara be a poor fit if you had 128 complex threads running concurrently?

To clarify, what I wanted to say is that it's the best [chip] at doing many simple tasks, rather than best at X, therefore sucky at Y and Z.
(the grammar was poor)
 
CPUs have such pitifully little bandwidth needs though, and the cores are so huge.

Pretty much immaterial.

Let's say the Larrabee team had put a banked L0 in there with full-speed scatter/gather in the absence of bank conflicts (like GPU local memory). Would you really want to use generic hardware coherency for the scatter/gather accesses? IMO you want to at least be able to tell the hardware where the data is (if not already exclusively owned) through the page table ... not so much software coherency as software-assisted coherency.

Then you are just wasting all your time updating and keeping the page tables coherent, which is orders of magnitude more painful than keeping memory coherent.

S/G accesses are just normal memory accesses, with all the pluses and minuses that entails. Without lower-level caches and coherency the impact is the same. While S/G can be beneficial from a software perspective, it is far from ideal from a performance perspective on any modern hardware. The last hardware that supported full-bandwidth S/G was the CRAY vector machines.

Modern designs primarily use S/G as a software convenience and method for extracting higher MLP.
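
To illustrate the "just normal memory accesses" point, here's a trivial Python model that only counts which cache lines a gather's indices land on; the line size, element size and indices are arbitrary:

[CODE]
# A gather is just a bundle of indexed loads; this only counts the distinct
# cache lines those loads touch. Constants are arbitrary, for illustration.

LINE = 64          # bytes per cache line (assumed)
ELEM = 4           # bytes per element (assumed)

def lines_touched(base, indices):
    return {(base + i * ELEM) // LINE for i in indices}

unit_stride = range(16)                  # coalesces into one line
scattered   = [0, 37, 512, 9, 1024, 65]  # each load can be its own miss

print(len(lines_touched(0, unit_stride)))   # -> 1
print(len(lines_touched(0, scattered)))     # -> 5
[/CODE]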

Coherency is the future, hardware coherency might even be the future ... the classical hardware enforced sequential consistency you get from snooping with ring/tree buses almost certainly isn't. As you say, interconnect is a problem ... and ring/tree buses are a handicap, so the software will have to get smarter.

From a programming model perspective, everything which has tried to get away from a sequential consistency model has failed. There are very good reasons why they have failed. Nothing fundamental has changed to affect those reasons.

As far as snoops, etc. I'm not sure you really understand the options in modern coherency protocols.

Cypress has 20 SIMDs BTW.

yeah, for some reason I was thinking 160 instead of 80.
 