Will L2 cache on X360's CPU be a major hurdle?

chachi said:
Gubbi said:
Except, of course, oodles of registers. :)

I do think, though, that function inlining and loop unrolling are a step backwards.
Since memory latency is a bitch, you can hoist loads, and loop unrolling gives you a speed increase as well as something to do while waiting, so I don't think it'll be put out to pasture just yet. ;)

Yes, my comment was in response to ERP's comment about the fact that there is no out-of-order execution in upcoming console CPUs, unlike the XCPU and Gekko. So they'll need loop unrolling and inlining.

I think it's a net loss going from out-of-order to in-order in performance, but also in logic. We've been over this a thousand times already though, so I'll stop. :)

Cheers
Gubbi
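The load hoisting plus unrolling Gubbi describes is a transformation a compiler (or a hand-scheduler) applies on an in-order core; a purely illustrative sketch in Python (function names are made up, and Python obviously doesn't expose load latency, this only shows the shape of the transformation):

```python
def dot(a, b):
    # Rolled loop: one multiply-add per iteration, so each pair of
    # loads sits on the critical path before the multiply can issue.
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    # Unrolled by 4 with four independent accumulators: the eight
    # loads per iteration can all be hoisted ahead of the multiplies,
    # overlapping memory latency with arithmetic on an in-order core.
    # Assumes len(a) is a multiple of 4.
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):
        a0, a1, a2, a3 = a[i], a[i + 1], a[i + 2], a[i + 3]
        b0, b1, b2, b3 = b[i], b[i + 1], b[i + 2], b[i + 3]
        s0 += a0 * b0
        s1 += a1 * b1
        s2 += a2 * b2
        s3 += a3 * b3
    return s0 + s1 + s2 + s3
```

Both versions compute the same dot product; the unrolled one simply exposes more independent work per iteration for the scheduler to hide latency behind.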
 
Gubbi said:
I think it's a net loss going from out-of-order to in-order in performance, but also in logic.
Frankly I'd quite like to have OOOE on the PPE as well (and I do think it should be possible, it being a single core on Cell and all) - but IMO this is the wrong premise to look at things from.
Consoles are inherently in-order; they have been for roughly 20 years. There's no "switch from OOOE" happening, because the switch to OOOE never really happened in the first place.
We've had a grand total of two CPUs with it, and while they WERE faster than competing CPUs, that had a lot more to do with having 10-20x larger caches and 2-3 times higher clock speeds than their nearest competitors, rather than with the presence of OOOE.
People with any sort of real experience on consoles are very familiar with in-order issues, it's not like this is some kind of direction change for them.

Anyway, I suspect there's a decent chance that Revolution will have an OOOE PPC core (or cores), along with a lower clock speed, so that may well give us a clearer picture as to whether Sony and MS made a sensible decision or not.
 
Gubbi said:
Yes, my comment was in response to ERP's comment about the fact that there is no out-of-order execution in upcoming console CPUs, unlike the XCPU and Gekko. So they'll need loop unrolling and inlining.

I think it's a net loss going from out-of-order to in-order in performance, but also in logic. We've been over this a thousand times already though, so I'll stop. :)
Ah, that makes sense. I thought you meant it more in terms of not wanting to bloat code size and overload the cache.

I'd agree with the second part if these were anything but console processors, but since they're a closed system with code written specifically for them, it's not as big of a deal. It'll take some getting used to, but it's not like you're going to just throw generic spaghetti code at them and hope for the best. The one thing that would be bad is if this level of optimization hurt cross-platform code, but it usually improves performance on processors with OOOE too; it's just easier to bump the minimum system requirements. :)
 
Guden Oden said:
Caches are divided up into what is known as cache lines, each line equipped with a "tag" which tells the cache controller logic which sequence of addresses in memory is stored in that particular cache line. When a read request comes in, the cache controller reads through its tags to see if that address is stored or not and acts accordingly. If it is, the data is delivered straight to the CPU. If it isn't, a request for it is generated and then the CPU has to wait for it to come in.
Does anyone know the granularity of LS accesses? Is it a byte, dword, 128 bits, etc.?
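The tag-matching Guden Oden describes can be sketched as a toy direct-mapped cache; the line size and line count below are arbitrary illustrative values, not those of any real CPU:

```python
LINE_SIZE = 64   # bytes per cache line (a common size; assumption)
NUM_LINES = 4    # tiny cache, just for illustration

class DirectMappedCache:
    def __init__(self):
        # Each entry holds the tag of the address range currently
        # stored in that line, or None if the line is empty.
        self.tags = [None] * NUM_LINES

    def access(self, addr):
        line_addr = addr // LINE_SIZE    # which LINE_SIZE-byte block
        index = line_addr % NUM_LINES    # the one line it can live in
        tag = line_addr // NUM_LINES     # identifies the block
        if self.tags[index] == tag:
            return "hit"                 # data goes straight to the CPU
        self.tags[index] = tag           # miss: fetch the line, CPU waits
        return "miss"
```

A real cache checks the tags in parallel hardware rather than a loop, and is usually set-associative rather than direct-mapped, but the hit/miss decision is the same idea.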
 
Shifty Geezer said:
I don't know how effective it is in real terms. I always thought it very good, but Deano's blog, posted here by SanGreal, pointed to investigations showing that accessing the same data again doesn't happen very often, so cache isn't too useful. Managed storage doesn't have that problem, but has the faff of the dev having to manage it. XeCPU can prefetch data this way, I believe.

Deano's data is somewhat off...

Cache provides a facility to exploit both temporal and spatial locality in a dynamic way.

There is no benefit that a LS provides that a cache doesn't also provide. In addition, a cache can capture dynamic, non-linear spatial locality, while a LS must be explicitly pre-allocated. You can pre-allocate a cache too, but it doesn't require this pre-allocation.

Given the choice between a design with a LS and a design with a modern cache, I and pretty much any architect or programmer would choose the cache.

Aaron Spink
speaking for myself.
 
aaronspink said:
Given the choice between a design with a LS and a design with a modern cache, I and pretty much any architect or programmer would choose the cache.
And we'd have a 4-SPE CELL processor..
At the end of the day, an optimized application (as are a lot of single-platform games) would run much faster on the current CELL than on a CELL with fewer SPEs and cache memories instead of local stores.
I don't mind if most programmers don't know how to code on a SPE..that means my monthly income could increase in the near future (I love you Mr. Kutaragi :) )
 
nAo said:
aaronspink said:
Given the choice between a design with a LS and a design with a modern cache, I and pretty much any architect or programmer would choose the cache.
And we'd have a 4-SPE CELL processor..
At the end of the day, an optimized application (as are a lot of single-platform games) would run much faster on the current CELL than on a CELL with fewer SPEs and cache memories instead of local stores.

Are you that confident?

Let me ask a question. What was the most strained resource on the PS2, The VUs or the host processor (R59xx)?

Now Sony somehow figured out that the VUs should be 4x as plentiful in their next generation MPU, even when removing the biggest workload from them (vertex processing).

Does that make sense to you? Because it doesn't to me.

nAo said:
I don't mind if most programmers don't know how to code on a SPE..that means my monthly income could increase in the near future (I love you Mr. Kutaragi :) )

While I'm confident you, and many of the other devs here, will be able to get the most out of CELL, in general making your system disastrously difficult to program for the majority of developers is not going to help you in the long run.

All IMO.

Cheers
Gubbi
 
aaronspink said:
Deano's data is somewhat off...
It's not my data (the data is Sony's), but it's my interpretation. There is a limit to how much cache helps; eventually you can better spend your resources somewhere else.

aaronspink said:
Cache provides a facility to exploit both temporal and spatial locality in a dynamic way.
But that's only helpful if there is temporal or spatial locality. If there isn't, caches are an expensive waste...
 
Let me ask a question. What was the most strained resource on the PS2, The VUs or the host processor (R59xx)?
is it a rhetorical question? :devilish:
Does that make sense to you? Because it doesn't to me.
It does make sense to me, because there's plenty of work one would want to do in a game that SPEs can do.
Well.. I have my obsessions.. multires.. emh :)

While I'm confident you, and many of the other devs here, will be able to get the most out of CELL, in general making your system disastrously difficult to program for the majority of developers is not going to help you in the long run.
Is it difficult? Maybe.. but is it THAT difficult? I don't think so.
The average programmer is not stupid; he/she is just not used to working on these architectures.
Maybe many aren't even aware there are different architectural philosophies out there ;)
 
Gubbi said:
Are you that confident?

Let me ask a question. What was the most strained resource on the PS2, The VUs or the host processor (R59xx)?

Now Sony somehow figured out that the VUs should be 4x as plentiful in their next generation MPU, even when removing the biggest workload from them (vertex processing).

Does that make sense to you? Because it doesn't to me.

well, the cpu was the weakest resource (relative to its peers), but that does not automatically make it the bottleneck..

wait! *me's getting an 80%-vs-20% thread déjà vu..*

now somebody will come up with the vu0 utilization argument, to which i will answer, this time again, that it does not prove the majority of the titles were not flops-limited, and so on and so forth..

: )
 
Gubbi said:
nAo said:
aaronspink said:
Given the choice between a design with a LS and a design with a modern cache, I and pretty much any architect or programmer would choose the cache.
And we'd have a 4-SPE CELL processor..
At the end of the day, an optimized application (as are a lot of single-platform games) would run much faster on the current CELL than on a CELL with fewer SPEs and cache memories instead of local stores.

Are you that confident?

Let me ask a question. What was the most strained resource on the PS2, The VUs or the host processor (R59xx)?

Now Sony somehow figured out that the VUs should be 4x as plentiful in their next generation MPU, even when removing the biggest workload from them (vertex processing).

Does that make sense to you? Because it doesn't to me.

nAo said:
I don't mind if most programmers don't know how to code on a SPE..that means my monthly income could increase in the near future (I love you Mr. Kutaragi :) )

While I'm confident you, and many of the other devs here, will be able to get the most out of CELL, in general making your system disastrously difficult to program for the majority of developers is not going to help you in the long run.

All IMO.

Cheers
Gubbi

It is good for h/w to be a bit difficult; after all, it helps distinguish those who're blessed with grace and elegance from those who're sloppy and slouchy.

Those who're blessed with the divine grace should be ever more distinguished from the rest who're not so graceful. The top AAA killer apps and the unique games from highly skilled individuals will exceed the others by just that bit more. This is the way it should be.
 
heh. I think the average gamer would rather have 100% of games looking awesome than 1% of games looking awesome.
 
DeanoC said:
It's not my data (the data is Sony's), but it's my interpretation. There is a limit to how much cache helps; eventually you can better spend your resources somewhere else.

I agree, but you'd better spend your resources on things that help; if more cache isn't helping, then a LS won't help either.

aaronspink said:
Cache provides a facility to exploit both temporal and spatial locality in a dynamic way.
But that's only helpful if there is temporal or spatial locality. If there isn't, caches are an expensive waste...

If there isn't spatial or temporal locality, then a local store won't work either. The issue with a local store is that you have to explicitly load all the information you need. A cache provides the same functionality but works in a dynamic manner. The overheads of a cache compared to a local store are fairly small.

Aaron Spink
speaking for myself inc.
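The "explicitly load all the information you need" point can be sketched minimally; `dma_get` and its signature are stand-ins invented for illustration (the real SPE MFC DMA intrinsics differ), and only the 256 KB local store size comes from the actual hardware:

```python
LS_SIZE = 256 * 1024          # the SPE local store is 256 KB
local_store = bytearray(LS_SIZE)

def dma_get(ls_offset, main_memory, src, nbytes):
    # Stand-in for an explicit DMA transfer: the programmer decides
    # what to bring in, where to place it, and when. Nothing happens
    # automatically, and nothing else fits.
    assert ls_offset + nbytes <= LS_SIZE, "doesn't fit in local store"
    local_store[ls_offset:ls_offset + nbytes] = main_memory[src:src + nbytes]

def process(main_memory, src, nbytes):
    # LS model: every byte touched must be staged in explicitly...
    dma_get(0, main_memory, src, nbytes)
    # ...after which loads complete in a short, fixed latency.
    return sum(local_store[0:nbytes])
```

With a cache, `process` would simply read `main_memory` directly and the hardware would stage the lines in on demand; that is the dynamic-vs-explicit distinction being argued over.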
 
aaronspink said:
If there isn't spatial or temporal locality, then a local store won't work either. The issue with a local store is that you have to explicitly load all the information you need. A cache provides the same functionality but works in a dynamic manner. The overheads of a cache compared to a local store are fairly small.
I'm not a hardware guy; I always assumed that a LS is a lot simpler/faster than an equivalent amount of cache. If it is, then maybe it makes sense to have more manual LS rather than less automatic cache; however, if it's not, then I'm lost...

I always assumed it's like OOOE and in-order: automatic is better but costs more real estate...
 
A cache provides the same functionality but works in a dynamic manner. The overheads of a cache compared to a local store are fairly small.
A local store takes fewer transistors and reduces load/store latency, and even if it doesn't cost dramatically fewer transistors than a cache, it makes the design of an in-order CPU much simpler.
In the end it's a matter of hw design choices; there's nothing in the CELL architecture that works against having a SPE + local cache, but if STI had made that choice, we'd now have a CELL processor with fewer than 8 SPEs and with a higher load/store latency.
Unoptimized code would run faster than on the current SPE design, while optimized code would run slower, or would require even more work to hide the higher memory access latencies.
So I don't say having a cache instead of local mem is a bad thing(tm), but it isn't a clearly superior solution either.
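The latency hiding nAo alludes to is typically done by double buffering: kick off the transfer for chunk i+1, then compute on chunk i while it is in flight. A hedged sketch, where the list slicing stands in for an asynchronous DMA (Python has no real DMA, so the overlap here is conceptual only):

```python
def process_stream(data, chunk):
    # Two buffers: while we "compute" on one, the next chunk is
    # (conceptually) being DMA'd into the other, so transfer latency
    # overlaps with work instead of stalling it.
    buffers = [None, None]
    buffers[0] = data[0:chunk]              # prefetch the first chunk
    total = 0
    i = 0
    while i * chunk < len(data):
        nxt = (i + 1) * chunk
        if nxt < len(data):                  # start the "DMA" for chunk i+1
            buffers[(i + 1) % 2] = data[nxt:nxt + chunk]
        total += sum(buffers[i % 2])         # compute on chunk i meanwhile
        i += 1
    return total
```

On a real SPE the fill of the other buffer would be an asynchronous DMA that is only waited on just before the buffer is consumed; that wait is what the compute phase is meant to cover.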
 
nAo said:
A cache provides the same functionality but works in a dynamic manner. The overheads of a cache compared to a local store are fairly small.
A local store takes fewer transistors and reduces load/store latency, and even if it doesn't cost dramatically fewer transistors than a cache, it makes the design of an in-order CPU much simpler.
In the end it's a matter of hw design choices; there's nothing in the CELL architecture that works against having a SPE + local cache, but if STI had made that choice, we'd now have a CELL processor with fewer than 8 SPEs and with a higher load/store latency.
Unoptimized code would run faster than on the current SPE design, while optimized code would run slower, or would require even more work to hide the higher memory access latencies.
So I don't say having a cache instead of local mem is a bad thing(tm), but it isn't a clearly superior solution either.


But if the SPEs had general read/write access to all of the CPU's memory (rather than just DMA access), and a smaller cache, then it would have been possible to target almost any game code at them.

Yes, there would have been fewer SPEs on the die, but possibly better utilised.

It's an interesting tradeoff.

Don't know which one I would have picked.
 
aaronspink - would it be possible to have a combination of the two- i.e. a cache that still allows the programmer to explicitly submit DMA requests (possibly locking/unlocking cache-lines so that certain lines are not evicted unless you give the OK)?

When you say that a cache has small overhead compared to the local store, does this include tags + logic which stalls the execution pipeline on cache-miss? What about cache-coherency?

Thanks, Serge
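The hybrid Serge asks about can be sketched by adding a lock bit per line to a toy direct-mapped cache: software explicitly fills and pins chosen lines, and normal misses cannot evict them. All names and sizes here are illustrative, not any real CPU's mechanism:

```python
class LockableCache:
    def __init__(self, num_lines=4, line_size=64):
        self.line_size = line_size
        self.tags = [None] * num_lines       # tag per line, None = empty
        self.locked = [False] * num_lines    # pinned lines survive misses

    def fill(self, index, tag, lock=False):
        # Explicit, DMA-like fill of a chosen line, optionally pinning
        # it so demand misses can't evict it (LS-style behaviour).
        self.tags[index] = tag
        self.locked[index] = lock

    def access(self, addr):
        line_addr = addr // self.line_size
        index = line_addr % len(self.tags)
        tag = line_addr // len(self.tags)
        if self.tags[index] == tag:
            return "hit"
        if self.locked[index]:
            return "miss-uncached"           # can't evict a pinned line
        self.tags[index] = tag               # normal miss: refill the line
        return "miss"
```

This is roughly the middle ground: pinned lines behave like a tiny local store, unpinned lines behave like an ordinary cache. The cost is that the tags, comparators, and coherency machinery Serge mentions are still there for every line, locked or not.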
 
ERP said:
nAo said:
A cache provides the same functionality but works in a dynamic manner. The overheads of a cache compared to a local store are fairly small.
A local store takes fewer transistors and reduces load/store latency, and even if it doesn't cost dramatically fewer transistors than a cache, it makes the design of an in-order CPU much simpler.
In the end it's a matter of hw design choices; there's nothing in the CELL architecture that works against having a SPE + local cache, but if STI had made that choice, we'd now have a CELL processor with fewer than 8 SPEs and with a higher load/store latency.
Unoptimized code would run faster than on the current SPE design, while optimized code would run slower, or would require even more work to hide the higher memory access latencies.
So I don't say having a cache instead of local mem is a bad thing(tm), but it isn't a clearly superior solution either.


But if the SPEs had general read/write access to all of the CPU's memory (rather than just DMA access), and a smaller cache, then it would have been possible to target almost any game code at them.

Yes, there would have been fewer SPEs on the die, but possibly better utilised.

It's an interesting tradeoff.

Don't know which one I would have picked.

But wouldn't yield problems have a greater impact with fewer SPEs? At least for now, Sony loses 1 SPE out of 8 (instead of, say, 1 out of 6). However, all else being equal, I know many would prefer fewer but more powerful cores.

What would people call out in Rambus's Cell presentation (in particular slides 12, 15, 31 and 34)? It implied why STI went for 8 SPEs, and has some data to support the choice of LS instead of cache. The engineers seem to think that modern-day instruction + data hotspots are typically up to 128K (hence they sized the LS at 256K). Did I understand that correctly?
 
ERP said:
But if the SPEs had general read/write access to all of the CPU's memory (rather than just DMA access), and a smaller cache, then it would have been possible to target almost any game code at them.

with the present branching performance - hardly.

Yes, there would have been fewer SPEs on the die, but possibly better utilised.

they'd have had higher latencies per memory access too. A full-blown cache would have helped exactly zilch for the streaming purposes SPEs are designed for.

It's an interesting tradeoff.

Don't know which one I would have picked.

i believe somebody somewhere has done the math on that already.
 