Predict: The Next Generation Console Tech

Why would Niagara not be good for complex tasks? AFAICS, 8-way SMT should be a good compromise compared to OoOE. Of course, you'd need 128 threads in flight for this.

But why would Niagara be a poor fit if you had 128 complex threads running concurrently?

Because it has very poor single-thread performance; whenever any of the threads is stalled, the other threads will continue to execute very, very slowly. Complex code tends to have a lot of stalls.
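
To make "complex code tends to have a lot of stalls" concrete, here's a minimal sketch (illustrative C++, not from any real codebase) of the pointer-chasing pattern typical of such code; each load depends on the previous one, so a simple in-order strand just sits on every miss and can only hide the latency if the core switches to another runnable thread:

#include <cstddef>

// Illustrative only: each iteration's load depends on the previous one,
// so there is no instruction-level parallelism for an in-order core to
// exploit. A stalled Niagara-style strand makes no progress until the
// miss returns; the core stays busy only if other threads are runnable.
struct Node {
    Node* next;
    int   payload;
};

int sum_list(const Node* head) {
    int sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        sum += n->payload;   // next iteration can't start until n->next arrives
    return sum;
}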
 
From a programming model perspective, everything that has tried to get away from a sequential consistency model has failed. There are very good reasons why those attempts failed. Nothing fundamental has changed to affect those reasons.

Two things

a) We'll see how well Fermi's caches are accepted by the devs.

b) Massively parallel code, i.e. code meant for GPUs, has so far not needed that kind of hardware coherency. As a programmer, you want to get rid of shared mutable state as much as possible, not carry it around. Why do you think functional programming is making a comeback?
 
b) Massively parallel code, i.e. code meant for GPUs, has so far not needed that kind of hardware coherency. As a programmer, you want to get rid of shared mutable state as much as possible, not carry it around. Why do you think functional programming is making a comeback?

Actually it has: it has explicitly made it such that there is no sharing of written data. Which works fine until you actually need to do something that requires producer/consumer communication.
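
For what "requires producer/consumer" means in practice, here's a minimal sketch (hypothetical C++; the class name is illustrative): one thread pushes results that another thread must pick up, so some shared, mutated state (the queue) and synchronization around it are unavoidable no matter how functional the rest of the code is.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Illustrative sketch: the queue itself is shared mutable state, and the
// lock/condition variable are exactly the kind of cross-thread agreement
// that pure data-parallel (shared-nothing) code gets to avoid.
template <typename T>
class WorkQueue {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex              m_;
    std::condition_variable cv_;
    std::queue<T>           q_;
};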
 
S/G (scatter/gather) accesses are just normal memory accesses, with all the pluses and minuses that entails.
Orders of magnitude do tend to make a difference and it can touch over an order of magnitude more cache lines per core per cycle (and thus cause over an order of magnitude more invalidates on writes for the network, snoop filters and caches to deal with).
Modern designs primarily use S/G as a software convenience and method for extracting higher MLP.
If each strand, in Larrabee lingo, is independently walking a complex data structure, then there's not always a way to converge them. In some cases, like raytracing, you can kinda/sorta optimize for wide accesses by dynamically trying to merge rays into ray packets ... but that's a huge headache (certainly not making programming easier there). If you are rasterizing micro-polygons in each strand, then there is no real way to merge the accesses, period.

Call it extracting MLP if you want; I call optimizing for narrow accesses a necessity as triangles get smaller and rendering gets less coherent (multiple-bounce raytracing).
From a programming model perspective, everything that has tried to get away from a sequential consistency model has failed.
Message passing isn't quite dead yet.
 
Orders of magnitude do tend to make a difference and it can touch over an order of magnitude more cache lines per core per cycle (and thus cause over an order of magnitude more invalidates on writes for the network, snoop filters and caches to deal with).

You are still fundamentally limited by cache/memory bandwidth.
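
To put rough, illustrative numbers on the order-of-magnitude claim: a 16-wide gather of 4-byte elements that happens to hit 16 distinct cache lines pulls in 16 x 64 B = 1 KiB of line traffic for 64 B of useful data, while a unit-stride load of the same 16 elements touches a single 64 B line. That is roughly the 16x difference in lines touched per instruction, and on the write side the same factor shows up as extra invalidates for the network, snoop filters and caches to absorb.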

If each strand, in Larrabee lingo, is independently walking a complex data structure, then there's not always a way to converge them. In some cases, like raytracing, you can kinda/sorta optimize for wide accesses by dynamically trying to merge rays into ray packets ... but that's a huge headache (certainly not making programming easier there). If you are rasterizing micro-polygons in each strand, then there is no real way to merge the accesses, period.

If any architecture is walking a complex data structure X-way parallel, it's going to suck unless main memory is 8b ultra-high-speed SRAM. This is true for Nvidia/ATI/etc. S/G in general is good for striped offsets and rare random accesses, but if used as a general random-source gather mechanism, it's going to suck. These aren't 1970s/1980s vector machines.
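
A minimal sketch of the two cases being contrasted (illustrative C++; the function names are made up): a gather over a fixed stride keeps the access pattern regular and prefetchable, while a gather driven by arbitrary table indices can land every lane in a different cache line, which is the case that "is going to suck".

#include <cstddef>
#include <cstdint>

// Illustrative contrast between "good" and "bad" uses of scatter/gather.
void gather_strided(const float* base, std::size_t stride,
                    float* out, int lanes) {
    for (int i = 0; i < lanes; ++i)
        out[i] = base[static_cast<std::size_t>(i) * stride]; // regular, predictable
}

void gather_random(const float* table, const std::uint32_t* idx,
                   float* out, int lanes) {
    for (int i = 0; i < lanes; ++i)
        out[i] = table[idx[i]];  // each lane can miss in a different cache line
}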

Call it extracting MLP if you want; I call optimizing for narrow accesses a necessity as triangles get smaller and rendering gets less coherent (multiple-bounce raytracing).

Yet, unless you are getting cache-line reuse, you are going to need to make algorithmic changes to get things to work/scale.

Message passing isn't quite dead yet.

MPI has an EXTREMELY strict sequential consistency model.
 
Yet, unless you are getting cache-line reuse, you are going to need to make algorithmic changes to get things to work/scale.
Locality is more easily maintained than strictly horizontal access ... take tiling with micro-polygons, local but not horizontal.
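
A small sketch of what "local but not horizontal" can look like (illustrative C++; the MicroPoly struct is hypothetical): all the writes from one bucket of micro-polygons stay inside a single screen tile, so cache locality is fine, but across strands they are scattered points rather than one contiguous horizontal run a wide load/store could cover.

#include <vector>

// Illustrative sketch: shading one tile's bucket of micro-polygons.
// Every write lands within the tile's footprint (good locality), but the
// writes are scattered points, not a single horizontal vector-wide run.
struct MicroPoly { int x, y; float color; };  // hypothetical, for illustration

void shade_bucket(float* fb, int pitch, const std::vector<MicroPoly>& bucket) {
    for (const MicroPoly& p : bucket)
        fb[p.y * pitch + p.x] += p.color;     // scattered within one tile
}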

MPI requires ordering between the source and the destination, which is easily enough maintained, but there is no risk of, say, two caches thinking they are both owners because strict ordering of cache updates was not maintained between all the caches on the chip ... MPI works fine over a packet-switched mesh, or a flattened butterfly, or whatever; snooping is a lot pickier.
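
As a minimal sketch of the ordering MPI actually guarantees (standard MPI point-to-point calls, trivial example): two sends from the same source to the same destination on the same communicator and tag are received in the order they were posted, and nothing about that requires the two ranks' caches to agree on anything.

#include <mpi.h>
#include <cstdio>

// Minimal sketch of MPI's point-to-point ordering guarantee: the two
// sends from rank 0 to rank 1 are matched in the order they were posted.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int first = 1, second = 2;
        MPI_Send(&first,  1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&second, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int a = 0, b = 0;
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d then %d\n", a, b);  // always 1 then 2
    }

    MPI_Finalize();
    return 0;
}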
 
Locality is more easily maintained than strictly horizontal access ... take tiling with micro-polygons, local but not horizontal.

Which is fine if you want a single-datum-wide cache. But otherwise, you need linearized spatial locality. And sure, you can go with a single-datum-wide cache, but now your cache size has grown 8-16x.

MPI requires ordering between the source and the destination, which is easily enough maintained, but there is no risk of, say, two caches thinking they are both owners because strict ordering of cache updates was not maintained between all the caches on the chip ... MPI works fine over a packet-switched mesh, or a flattened butterfly, or whatever; snooping is a lot pickier.

Snooping is a bad term to use, as it generally refers to a particular sub-type of coherence protocol.

And please do not confuse coherency and consistency/MOM (memory ordering model). They are two unrelated things.

Cache coherency works fine over effectively any network topology you want to use.
 
Which is fine if you want a single-datum-wide cache. But otherwise, you need linearized spatial locality. And sure, you can go with a single-datum-wide cache, but now your cache size has grown 8-16x.
You can use the same compromise GPUs use: multiple banks, so you don't need a 32-ported, dword-wide monstrosity of a cache. You don't get full-speed scatter and gather, but it does help quite a bit on average. It's also very useful for things like column-wise access in image processing (with the appropriate stride). Faster and more convenient than transposition.
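
A back-of-envelope sketch of the banking idea (illustrative C++; the bank count and address mapping are assumptions, not any specific design): each 4-byte word maps to one of N banks, and a gather or a column-wise walk completes in one pass only when the lanes land in different banks, which is why the row pitch/stride matters.

#include <cstdio>

// Illustrative only: assumed dword-banked array with kBanks banks.
constexpr unsigned kBanks = 16;

unsigned bank_of(unsigned byte_addr) {
    return (byte_addr / 4) % kBanks;   // word index modulo bank count
}

int main() {
    // Column-wise walk: successive rows are one pitch apart. A pitch of
    // 257 dwords (coprime with 16 banks) spreads the column across banks,
    // so the accesses don't all serialize on one bank.
    const unsigned pitch_bytes = 257 * 4;
    for (unsigned row = 0; row < 8; ++row)
        std::printf("row %u -> bank %u\n", row, bank_of(row * pitch_bytes));
    return 0;
}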

I really don't think you want to hit the normal coherency mechanisms with the number of invalidates that scatters to a dword-banked cache could generate, though; you'd have to over-dimension everything.
Cache coherency works fine over effectively any network topology you want to use.
At the cost of using directories.
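
To give that cost a rough, illustrative size: a full-map directory with one presence bit per core per 64-byte line needs a 64-bit vector per line on a 64-core chip, i.e. about 8 bytes of directory state (plus a few state bits) for every 64 bytes tracked, roughly 12% overhead, and it grows with core count unless you move to coarse or sparse directory formats.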
 
Actually it has, it has explicitly made it such that there is no sharing of written data.
Which is exactly what you want in massively parallel codes.

As for the latency of append/consume buffers, I think it can be reduced without having to resort to fully coherent caches. Also, I haven't seen (so far) programmers complaining about it. Maybe it's too early for that. We'll see.
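
For reference, a minimal sketch of what an append buffer reduces to (illustrative C++; real GPU implementations keep the counter in dedicated hardware): an atomically bumped counter hands out slots and ordinary writes fill them. Only the counter has to be globally agreed on; the written data just needs to be visible after a later barrier, which is why full inter-core cache coherence isn't strictly required.

#include <atomic>
#include <cstddef>

// Illustrative sketch of an append buffer. Writers reserve slots with one
// atomic add; the slot contents are plain stores. A consumer would read
// the final count after a barrier/fence and then walk data[0..count).
struct AppendBuffer {
    float*                   data;      // backing storage, provided elsewhere
    std::size_t              capacity;
    std::atomic<std::size_t> count{0};

    bool append(float v) {
        std::size_t slot = count.fetch_add(1, std::memory_order_relaxed);
        if (slot >= capacity)
            return false;               // full (counter may overshoot; fine for a sketch)
        data[slot] = v;
        return true;
    }
};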
 
Which is exactly what you want in massively parallel codes.

Sure, there are a lot of things I *want*; it's what you actually get that is the problem. Outside of a small subset, there is actual data sharing and interaction.

As for the latency of append/consume buffers, I think it can be reduced without having to resort to fully coherent caches. Also, I haven't seen (so far) programmers complaining about it. Maybe it's too early for that. We'll see.

Um, there is a reason why people who can afford it don't use Ethernet for their interconnect.
 
Sure, there are a lot of things I *want*; it's what you actually get that is the problem. Outside of a small subset, there is actual data sharing and interaction.
Well, the scaling of your code is always going to be limited by the serialization/interaction. But IMHO, in consumer apps (and in HPC, to whatever extent I have seen it, even if limited), the apps that *need* performance already have a *lot* of parallelism.

Um, there is a reason why people who can afford it don't use Ethernet for their interconnect.
Not sure what you are trying to convey. :???:
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance? I guess the RAM?

GPU, as well as RAM, should help a bit but won't create miracles when the rest is just slow by today's standards, or even yesterday's.
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance?
Better performance in what area? ;) I presume you mean 'look better', and more RAM would definitely help there. However, PS3 would also benefit from more BW for IQ improvements. XB360 could have done with more eDRAM to solve its FB issues, while a more flexible eDRAM architecture (fully bidirectional access for the GPU) would be hugely beneficial.

However, these machines were built to a budget, and all in all they're looking pretty balanced to me. PS3 is somewhat screwed by the broken scaler. Tiling hasn't worked out the way we hoped on XB360. Still, the machines don't appear starved in one particular aspect, unlike maybe last gen: the XB was bandwidth starved, and the PS2 was RAM starved.
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance? I guess the RAM?

Easily the GPU. Of all the components in these machines, their GPUs are now the most outdated. After that would be the RAM. Their CPUs, oddly enough, are still 'ok' now that everything has been rewritten to use the vector units.
 
Yeah Shifty, I'm kinda surprised that they didn't fix that when they made the PS3 Slim. It would have given them a perfect spot to do it and also given people an excuse to buy the new system.
 
Maybe a dumb question (I take the risk):

Is it possible for MS and Sony to increase performance via a software update, in effect "exchanging/upgrading a component" in that sense?
Maybe overclocking or something like that?
 