Predict: The Next Generation Console Tech

Why would Niagara not be good for complex tasks? AFAICS, 8-way SMT should be a good compromise compared to OoOE. Of course, you'd need 128 threads in flight for this.

But why would Niagara be a poor fit if you had 128 complex threads running concurrently?

Because it has very poor single-thread performance; whenever any of the threads is stalled, the other threads will continue to execute very, very slowly. Complex code tends to have a lot of stalls.
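
To make "complex code tends to have a lot of stalls" concrete, here's a minimal sketch (illustrative C++, not from any real codebase) of the pointer-chasing pattern typical of such code; each load depends on the previous one, so a simple in-order strand just sits on every miss and can only hide the latency if the core switches to another runnable thread:

#include <cstddef>

// Illustrative only: each iteration's load depends on the previous one,
// so there is no instruction-level parallelism for an in-order core to
// exploit. A stalled Niagara-style strand makes no progress until the
// miss returns; the core stays busy only if other threads are runnable.
struct Node {
    Node* next;
    int   payload;
};

int sum_list(const Node* head) {
    int sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        sum += n->payload;   // next iteration can't start until n->next arrives
    return sum;
}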
 
From a programming model perspective, everything that has tried to get away from a sequential consistency model has failed. There are very good reasons why those attempts failed. Nothing fundamental has changed to affect those reasons.

Two things

a) We'll see how well Fermi's caches are accepted by the devs.

b) Massively parallel code, i.e. code meant for GPUs, has so far not needed that kind of hardware coherency. As a programmer, you want to get rid of shared mutable state as much as possible, not carry it around. Why do you think functional programming is making a comeback?
 
b) Massively parallel code, i.e. code meant for GPUs, has so far not needed that kind of hardware coherency. As a programmer, you want to get rid of shared mutable state as much as possible, not carry it around. Why do you think functional programming is making a comeback?

Actually it has: it has explicitly made it such that there is no sharing of written data. Which works fine until you actually need to do something that requires producer/consumer communication.
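
For what "requires producer/consumer" means in practice, here's a minimal sketch (hypothetical C++; the class name is illustrative): one thread pushes results that another thread must pick up, so some shared, mutated state (the queue) and synchronization around it are unavoidable no matter how functional the rest of the code is.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Illustrative sketch: the queue itself is shared mutable state, and the
// lock/condition variable are exactly the kind of cross-thread agreement
// that pure data-parallel (shared-nothing) code gets to avoid.
template <typename T>
class WorkQueue {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex              m_;
    std::condition_variable cv_;
    std::queue<T>           q_;
};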
 
S/G (scatter/gather) accesses are just normal memory accesses, with all the pluses and minuses that entails.
Orders of magnitude do tend to make a difference and it can touch over an order of magnitude more cache lines per core per cycle (and thus cause over an order of magnitude more invalidates on writes for the network, snoop filters and caches to deal with).
Modern designs primarily use S/G as a software convenience and method for extracting higher MLP.
If each strand, in Larrabee lingo, is independently walking a complex data structure, then there's not always a way to converge them. In some cases, like raytracing, you can kinda/sorta optimize for wide accesses by dynamically trying to merge rays into ray packets ... but that's a huge headache (certainly not making programming easier there). If you are rasterizing micro-polygons in each strand, then there is no real way to merge the accesses, period.

Call it extracting MLP if you want; I call optimizing for narrow accesses a necessity as triangles get smaller and rendering gets less coherent (multiple-bounce raytracing).
From a programming model perspective, everything that has tried to get away from a sequential consistency model has failed.
Message passing isn't quite dead yet.
 
Orders of magnitude do tend to make a difference and it can touch over an order of magnitude more cache lines per core per cycle (and thus cause over an order of magnitude more invalidates on writes for the network, snoop filters and caches to deal with).

You are still fundamentally limited by cache/memory bandwidth.
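
To put rough, illustrative numbers on the order-of-magnitude claim: a 16-wide gather of 4-byte elements that happens to hit 16 distinct cache lines pulls in 16 x 64 B = 1 KiB of line traffic for 64 B of useful data, while a unit-stride load of the same 16 elements touches a single 64 B line. That is roughly the 16x difference in lines touched per instruction, and on the write side the same factor shows up as extra invalidates for the network, snoop filters and caches to absorb.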

If each strand, in Larrabee lingo, is independently walking a complex data structure, then there's not always a way to converge them. In some cases, like raytracing, you can kinda/sorta optimize for wide accesses by dynamically trying to merge rays into ray packets ... but that's a huge headache (certainly not making programming easier there). If you are rasterizing micro-polygons in each strand, then there is no real way to merge the accesses, period.

If any architecture is walking a complex data structure X-way parallel, it's going to suck unless main memory is 8b ultra-high-speed SRAM. This is true for Nvidia/ATI/etc. S/G in general is good for striped offsets and rare random accesses, but if used as a general random-source gather mechanism, it's going to suck. These aren't 1970s/1980s vector machines.
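
A minimal sketch of the two cases being contrasted (illustrative C++; the function names are made up): a gather over a fixed stride keeps the access pattern regular and prefetchable, while a gather driven by arbitrary table indices can land every lane in a different cache line, which is the case that "is going to suck".

#include <cstddef>
#include <cstdint>

// Illustrative contrast between "good" and "bad" uses of scatter/gather.
void gather_strided(const float* base, std::size_t stride,
                    float* out, int lanes) {
    for (int i = 0; i < lanes; ++i)
        out[i] = base[static_cast<std::size_t>(i) * stride]; // regular, predictable
}

void gather_random(const float* table, const std::uint32_t* idx,
                   float* out, int lanes) {
    for (int i = 0; i < lanes; ++i)
        out[i] = table[idx[i]];  // each lane can miss in a different cache line
}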

Call it extracting MLP if you want; I call optimizing for narrow accesses a necessity as triangles get smaller and rendering gets less coherent (multiple-bounce raytracing).

Yet, unless you are getting cache-line reuse, you are going to need to make algorithmic changes to get things to work/scale.

Message passing isn't quite dead yet.

MPI has an EXTREMELY strict sequential consistency model.
 
Yet, unless you are getting cache-line reuse, you are going to need to make algorithmic changes to get things to work/scale.
Locality is more easily maintained than strictly horizontal access ... take tiling with micro-polygons, local but not horizontal.
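
A small sketch of what "local but not horizontal" can look like (illustrative C++; the MicroPoly struct is hypothetical): all the writes from one bucket of micro-polygons stay inside a single screen tile, so cache locality is fine, but across strands they are scattered points rather than one contiguous horizontal run a wide load/store could cover.

#include <vector>

// Illustrative sketch: shading one tile's bucket of micro-polygons.
// Every write lands within the tile's footprint (good locality), but the
// writes are scattered points, not a single horizontal vector-wide run.
struct MicroPoly { int x, y; float color; };  // hypothetical, for illustration

void shade_bucket(float* fb, int pitch, const std::vector<MicroPoly>& bucket) {
    for (const MicroPoly& p : bucket)
        fb[p.y * pitch + p.x] += p.color;     // scattered within one tile
}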

MPI requires ordering between the source and the destination, which is easily enough maintained, but there is no risk of, say, two caches thinking they are both owners because strict ordering of cache updates was not maintained between all the caches on the chip ... MPI works fine over a packet-switched mesh, or a flattened butterfly, or whatever; snooping is a lot pickier.
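
As a minimal sketch of the ordering MPI actually guarantees (standard MPI point-to-point calls, trivial example): two sends from the same source to the same destination on the same communicator and tag are received in the order they were posted, and nothing about that requires the two ranks' caches to agree on anything.

#include <mpi.h>
#include <cstdio>

// Minimal sketch of MPI's point-to-point ordering guarantee: the two
// sends from rank 0 to rank 1 are matched in the order they were posted.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int first = 1, second = 2;
        MPI_Send(&first,  1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&second, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int a = 0, b = 0;
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d then %d\n", a, b);  // always 1 then 2
    }

    MPI_Finalize();
    return 0;
}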
 
Locality is more easily maintained than strictly horizontal access ... take tiling with micro-polygons, local but not horizontal.

Which is fine if you want a single-datum-wide cache. But otherwise, you need linearized spatial locality. And sure, you can go with a single-datum-wide cache, but now your cache size has grown 8-16x.

MPI requires ordering between the source and the destination, which is easily enough maintained, but there is no risk of, say, two caches thinking they are both owners because strict ordering of cache updates was not maintained between all the caches on the chip ... MPI works fine over a packet-switched mesh, or a flattened butterfly, or whatever; snooping is a lot pickier.

Snooping is a bad term to use, as it generally refers to a particular sub-type of coherence protocol.

And please do not confuse coherency and consistency/MOM (memory ordering model). They are two unrelated things.

Cache coherency works fine over effectively any network topology you want to use.
 
Which is fine if you want a single-datum-wide cache. But otherwise, you need linearized spatial locality. And sure, you can go with a single-datum-wide cache, but now your cache size has grown 8-16x.
You can use the same compromise GPUs use: multiple banks, so you don't need a 32-ported, dword-wide monstrosity of a cache. You don't get full-speed scatter and gather, but it does help quite a bit on average. It's also very useful for things like column-wise access in image processing (with the appropriate stride). Faster and more convenient than transposition.
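
A back-of-envelope sketch of the banking idea (illustrative C++; the bank count and address mapping are assumptions, not any specific design): each 4-byte word maps to one of N banks, and a gather or a column-wise walk completes in one pass only when the lanes land in different banks, which is why the row pitch/stride matters.

#include <cstdio>

// Illustrative only: assumed dword-banked array with kBanks banks.
constexpr unsigned kBanks = 16;

unsigned bank_of(unsigned byte_addr) {
    return (byte_addr / 4) % kBanks;   // word index modulo bank count
}

int main() {
    // Column-wise walk: successive rows are one pitch apart. A pitch of
    // 257 dwords (coprime with 16 banks) spreads the column across banks,
    // so the accesses don't all serialize on one bank.
    const unsigned pitch_bytes = 257 * 4;
    for (unsigned row = 0; row < 8; ++row)
        std::printf("row %u -> bank %u\n", row, bank_of(row * pitch_bytes));
    return 0;
}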

I really don't think you want to hit the normal coherency mechanisms with the number of invalidates that scatters to a dword-banked cache could generate, though; you'd have to over-dimension everything.
Cache coherency works fine over effectively any network topology you want to use.
At the cost of using directories.
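
To give that cost a rough, illustrative size: a full-map directory with one presence bit per core per 64-byte line needs a 64-bit vector per line on a 64-core chip, i.e. about 8 bytes of directory state (plus a few state bits) for every 64 bytes tracked, roughly 12% overhead, and it grows with core count unless you move to coarse or sparse directory formats.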
 
Actually it has, it has explicitly made it such that there is no sharing of written data.
Which is exactly what you want in massively parallel codes.

As for the latency of append/consume buffers, I think it can be reduced without having to resort to fully coherent caches. Also, I haven't seen (so far) programmers complaining about it. Maybe it's too early for that. We'll see.
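
For reference, a minimal sketch of what an append buffer reduces to (illustrative C++; real GPU implementations keep the counter in dedicated hardware): an atomically bumped counter hands out slots and ordinary writes fill them. Only the counter has to be globally agreed on; the written data just needs to be visible after a later barrier, which is why full inter-core cache coherence isn't strictly required.

#include <atomic>
#include <cstddef>

// Illustrative sketch of an append buffer. Writers reserve slots with one
// atomic add; the slot contents are plain stores. A consumer would read
// the final count after a barrier/fence and then walk data[0..count).
struct AppendBuffer {
    float*                   data;      // backing storage, provided elsewhere
    std::size_t              capacity;
    std::atomic<std::size_t> count{0};

    bool append(float v) {
        std::size_t slot = count.fetch_add(1, std::memory_order_relaxed);
        if (slot >= capacity)
            return false;               // full (counter may overshoot; fine for a sketch)
        data[slot] = v;
        return true;
    }
};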
 
Which is exactly what you want in massively parallel codes.

Sure, there are a lot of things I *want*; it's what you actually get that is the problem. Outside of a small subset, there is actual data sharing and interaction.

As for the latency of append/consume buffers, I think it can be reduced without having to resort to fully coherent caches. Also, I haven't seen (so far) programmers complaining about it. Maybe it's too early for that. We'll see.

Um, there is a reason why people who can afford it don't use Ethernet for their interconnect.
 
Sure, there are a lot of things I *want*; it's what you actually get that is the problem. Outside of a small subset, there is actual data sharing and interaction.
Well, the scaling of your code is always going to be limited by the serialization/interaction. But IMHO, in consumer apps (and in HPC, to whatever extent I have seen it, even if limited), the apps that *need* performance already have a *lot* of parallelism.

Um, there is a reason why people who can afford it don't use Ethernet for their interconnect.
Not sure what you are trying to convey. :???:
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance? I guess the RAM?

GPU, as well as RAM, should help a bit but won't create miracles when the rest is just slow by today's standards, or even yesterday's.
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance?
Better performance in what area? ;) I presume you mean 'look better', and more RAM would definitely help there. However, PS3 would also benefit from more BW for IQ improvements. XB360 could have done with more eDRAM to solve its FB issues, while a more flexible eDRAM architecture (fully bidirectional access for the GPU) would be hugely beneficial.

However, these machines were built to a budget, and all in all they're looking pretty balanced to me. PS3 is somewhat screwed by the broken scaler. Tiling hasn't worked out the way we hoped on XB360. Still, the machines don't appear starved in one particular aspect, unlike maybe last gen: the XB was bandwidth starved, and the PS2 was RAM starved.
 
Perhaps not the right location, but what single component upgrade would give the PS3/360 considerably better performance? I guess the RAM?

Easily the GPU. Of all the components in these machines, their GPUs are now the most outdated. After that would be the RAM. Their CPUs, oddly enough, are still 'ok' now that everything has been rewritten to use the vector units.
 
Yeah Shifty, I'm kinda surprised that they didn't fix that when they made the PS3 Slim. It would have given them a perfect spot to do it and also given people an excuse to buy the new system.
 
Maybe a dumb question (I take the risk):

Is it possible for MS and Sony to increase performance via a software update, in effect "exchanging/upgrading a component" in that sense?
Maybe overclocking or something like that?
 