Will the L2 cache on the X360's CPU be a major hurdle?

In what way?

Or do you think the D$ hit rate will plummet going from 64 KB to 32 KB?

No, just what I said earlier. You can compare, but it won't be a realistic estimate.
The A64 has great BP and is the best CPU today.

Do you really think there's been no sacrifice in this element of the XCPU?

The advantage IS that it's a closed system, of course, but I'm just saying that sizes and numbers can't tell you anything about a part we know shit about.
 
overclocked said:
In what way?

Or do you think the D$ hit rate will plummet going from 64 KB to 32 KB?

No, just what I said earlier. You can compare, but it won't be a realistic estimate.
The A64 has great BP and is the best CPU today.

BP has zero influence on D$ hit rates.

Cheers
Gubbi
 
Has anybody read about the associativity of the Xenon caches?

Reduced size and possibly lower associativity could significantly lower hit rates.
 
BP has zero influence on D$ hit rates

That sentence had to do with the A64 CPU, and why it's great at having shitty code thrown at it.
Sorry, I should have separated the lines.

Edit: as it does not relate to the above.
 
3dilettante said:
Has anybody read about the associativity of the Xenon caches?

Reduced size and possibly lower associativity could significantly lower hit rates.
The A64's D$ is only 2-way associative; I doubt the XeCPU has lower associativity.

Overclocked: Ok, misunderstanding then :)

Cheers
Gubbi
 
3dilettante said:
Has anybody read about the associativity of the Xenon caches?

Reduced size and possibly lower associativity could significantly lower hit rates.
I have reason to believe ( ;) ) the XeCPU has 32 KB instruction and 32 KB data caches on each core, with 2-way instruction and 4-way data associativity.

How would this much cache cope with hardware threading, being shared between two threads which each have their own integer, float and vector register spaces?
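The thread-sharing question can be explored with a toy simulator. This is only a sketch: the 32 KB/4-way size, the 64-byte line, strict LRU replacement, and the two interleaved streaming threads are all assumptions chosen for illustration, not confirmed XeCPU behaviour.

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache model with LRU replacement (tracks tags only)."""
    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line = line_bytes
        self.ways = ways
        self.nsets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.nsets)]
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        line = addr // self.line
        s = self.sets[line % self.nsets]
        tag = line // self.nsets
        if tag in s:
            self.hits += 1
            s.move_to_end(tag)         # refresh LRU position
        else:
            if len(s) == self.ways:
                s.popitem(last=False)  # evict the least recently used way
            s[tag] = True

    def hit_rate(self):
        return self.hits / self.accesses

# One thread streaming over a 24 KB working set fits in 32 KB...
single = SetAssocCache(32 * 1024, ways=4)
for _ in range(10):
    for a in range(0, 24 * 1024, 4):
        single.access(a)

# ...but two threads with disjoint 24 KB working sets, interleaved,
# overflow the shared cache and evict each other's lines every pass.
shared = SetAssocCache(32 * 1024, ways=4)
for _ in range(10):
    for a in range(0, 24 * 1024, 4):
        shared.access(a)              # "thread 0"
        shared.access(a + (1 << 20))  # "thread 1", working set 1 MB away
```

In this contrived run the lone thread settles at a ~99.4% hit rate while the two interleaved threads drop to ~93.8%: the combined working set exceeds the ways available per set, so each pass re-misses every line.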
 
I can't find anything on Xenon specifically, but I just came upon a reference to Cell's PPE having a 4-way associative data cache.

It's likely that Xenon will have something similar.

In that case, it should perform better than a size comparison with the A64's twice-as-large 2-way cache would indicate.

I'd expect it to be slightly worse overall, but not as bad as if it were less associative.

This assumes the working set is small enough to fit, which will probably be the case for the first generation of games.
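The associativity point can be shown with a single-set LRU model. `lru_set_hits` is a hypothetical helper, and the "three aliasing lines" pattern is a deliberately pathological assumption, but it captures why 4 ways can beat a larger 2-way cache on conflict-heavy access patterns.

```python
def lru_set_hits(tags, ways, passes):
    """Replay `tags` (lines that all map to one cache set) `passes` times
    through an LRU set with `ways` ways; return the hit rate."""
    resident = []          # LRU order: oldest first
    hits = total = 0
    for _ in range(passes):
        for t in tags:
            total += 1
            if t in resident:
                hits += 1
                resident.remove(t)     # will re-append as most recent
            elif len(resident) == ways:
                resident.pop(0)        # evict the oldest way
            resident.append(t)
    return hits / total

# Three lines that alias into the same set: a 2-way LRU set thrashes
# (every access misses forever), while a 4-way set holds all three.
two_way  = lru_set_hits(["A", "B", "C"], ways=2, passes=100)
four_way = lru_set_hits(["A", "B", "C"], ways=4, passes=100)
```

Here `two_way` ends at a 0% hit rate and `four_way` at 99%: size alone says nothing about this failure mode, which is the sense in which a smaller but more associative cache can punch above its weight.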
 
Here's a quick look at traces of data cache hit rates for three games (Doom 3, Far Cry and Half-Life 2).

D$, Doom 3

D$, Far Cry

D$, Half-Life 2

Explanation of the event columns.

Event 0x40 counts data cache accesses.
Event 0x41 counts data cache misses.
Event 0x43 counts data cache refills from system (i.e., requests that couldn't be satisfied from L2).
Event 0x4b counts dispatched PREFETCH hints.

If we look at Doom 3, we see 12498 misses on 763455 requests, or ~1.6% (a 98.4% hit rate). The level 2 cache serves 2/3 of the requests that miss the D$; the rest are loaded from main memory.

It's a similar story for FC and HL2: the D$ hit rate is in the 97-99% range, and the L2 serves 50-70% of D$ misses.

As the HL2 figure shows, no PREFETCH hints are used at all, and still the hit rate is very high, so the automatic prefetcher must be pretty good.
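The Doom 3 arithmetic can be re-checked in a few lines. The event counts come straight from the trace figures above; the "L2 serves 2/3 of D$ misses" fraction is the post's own figure, used here as a given.

```python
# Doom 3 trace counters (from the figures above)
d_accesses = 763455   # event 0x40: D$ accesses
d_misses   = 12498    # event 0x41: D$ misses

miss_rate = d_misses / d_accesses   # ~0.0164, i.e. ~1.6%
hit_rate  = 1.0 - miss_rate         # ~0.984, i.e. ~98.4%

# If L2 serves about 2/3 of the D$ misses, only ~1/3 of those ~1.6%
# of accesses (roughly half a percent) go all the way to main memory.
mem_fraction = miss_rate * (1.0 / 3.0)
```

So even a fairly leaky L2 only gets exercised on a percent or two of all data requests, which is why the overall numbers stay so flattering.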

Edit: Direct linking of images works in preview, but not in final submit :(

Cheers
Gubbi
 
3dilettante said:
I can't find anything on Xenon specifically, but I just came upon a reference to Cell's PPE having a 4-way associative data cache.

It's likely that Xenon will have something similar.
Yes ;) I agree ;) A 32 KB 4-way data cache per core is my 'best guess' ;) ;)
 
Gubbi said:
As the HL2 figure shows, no PREFETCH hints are used at all, and still the hit rate is very high, so the automatic prefetcher must be pretty good.
Gosh, caches are doing a good job! Is this something that won't attain the same performance on the XeCPU and/or Cell, given the lower priority placed on data management? Presumably the data cache will be just as functional, meaning even if the L2 fails 80% of the time, that's only gonna affect the 1-2% of data requests that miss the D$.
 
Shifty Geezer said:
3dilettante said:
I can't find anything on Xenon specifically, but I just came upon a reference to Cell's PPE having a 4-way associative data cache.

It's likely that Xenon will have something similar.
Yes ;) I agree ;) A 32 KB 4-way data cache per core is my 'best guess' ;) ;)

Hmm, I didn't see your post on cache associativity until after I posted mine. You win this time... ;)
 
overclocked said:
Gubbi, I think that gives a great example of unoptimized code being thrown at the A64 beast.... 8) :devilish:

To get those hit rates, some thought must have been spent on data layout etc. I don't think Doom 3 (or the others, for that matter) can be called unoptimized code :)

Cheers
Gubbi
 
Gubbi said:
overclocked said:
Gubbi, I think that gives a great example of unoptimized code being thrown at the A64 beast.... 8) :devilish:

To get those hit rates, some thought must have been spent on data layout etc. I don't think Doom 3 (or the others, for that matter) can be called unoptimized code :)

Cheers
Gubbi

You'd be surprised; short of being incredibly stupid or deliberately trying to break the cache, you will get very high data hit rates on any reasonably sized, reasonably architected cache.

From frame to frame there is just a massive amount of data coherency, and in general only a small subset of the total data is touched.

The issue on the X360/PPE is that the cost of a fetch from cached memory (even L1) is non-trivial, and there is no logic on the chip to hide that latency.
 
"....PS2....."

/\ Yeah, I know. I chose not to mention that to keep things civil; I do my best to avoid BASHING anyone, with the way things have been around here lately.

That was before the console industry figured out which direction to go with hardware, and there wasn't even a real GPU yet. They've smartened up quite a bit this gen; now, if they can catch up on developer-friendly tools, they'll be looking pretty good.

For the record, I am an M$ fan (only because they brought me back into console gaming), but I think this next gen they lost some of their ease of use by going with an in-order, multi-core CPU design (not that they had much of a choice, though, I guess).
 
ERP said:
Gubbi said:
overclocked said:
Gubbi, I think that gives a great example of unoptimized code being thrown at the A64 beast.... 8) :devilish:

To get those hit rates, some thought must have been spent on data layout etc. I don't think Doom 3 (or the others, for that matter) can be called unoptimized code :)

Cheers
Gubbi

You'd be surprised; short of being incredibly stupid or deliberately trying to break the cache, you will get very high data hit rates on any reasonably sized, reasonably architected cache.

From frame to frame there is just a massive amount of data coherency, and in general only a small subset of the total data is touched.

Actually, I am surprised. First, I thought hit rates were lower. I also thought data sets were bigger, that is, that they wouldn't fit in cache from frame to frame.

I could imagine using non-temporal loads and stores to stream data through without polluting the caches (say, for skinning and similar tasks). Also, for stuff like collision detection, the cache would hold the top levels of your spatial decomposition structures (k-d tree or octree), and hence see heavy reuse for those: sort queries by spatial coherence and you only see misses on the bottom nodes of the SDS.

ERP said:
The issue on the X360/PPE is that the cost of a fetch from cached memory (even L1) is non-trivial, and there is no logic on the chip to hide that latency.

Except, of course, oodles of registers. :)

I do think, though, that function inlining and loop unrolling are a step backwards.

Cheers
Gubbi
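The "sort queries by spatial coherence" idea can be sketched with Morton (Z-order) codes. `part1by1` and `morton2d` are illustrative helpers written for this post, and 2D coordinates in 16 bits are an assumption made for brevity; the point is that consecutive queries then walk nearby branches of the tree and reuse its cached upper levels.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit follows each one
    (the classic bit-interleaving trick)."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x, y):
    """Interleave the bits of x and y into one Z-order key."""
    return part1by1(x) | (part1by1(y) << 1)

# Sorting query points by their Morton key groups spatial neighbours
# together, so tree traversals mostly re-touch already-cached nodes.
queries = [(7, 2), (0, 0), (6, 3), (1, 1)]
queries.sort(key=lambda q: morton2d(*q))
```

After the sort, `queries` runs through the space in Z-order rather than in submission order, which is one cheap way to get the bottom-nodes-only miss behaviour described above.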
 
For the record, I am an M$ fan (only because they brought me back into console gaming), but I think this next gen they lost some of their ease of use by going with an in-order, multi-core CPU design (not that they had much of a choice, though, I guess).

Well, the good thing is their main competitor is going the same route, so...
To this day we still don't know what Nintendo is going with. I honestly, deep in my heart, think that most devs don't even know what Nintendo is going to have hardware-wise. :cry: So sad. Just kill your third-party support, why don't you.
 
Gubbi said:
Except, of course, oodles of registers. :)

I do think, though, that function inlining and loop unrolling are a step backwards.
Since memory latency is a bitch, you can hoist loads, and loop unrolling gives you a speed increase as well as something to do while waiting, so I don't think it'll be put out to pasture just yet. ;)

Algetc., a shared L2 is generally better than private ones; the only question really is whether 1 MB is enough.

Cho, it's Waternoose and while I haven't seen a definitive statement on it, the cache is on-die and unlikely to be running at a speed other than that of the processor. If you have a link which suggests otherwise...
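The hoist-and-unroll transformation being defended above looks like this in sketch form. Python itself gains nothing from it; the shape is what matters: unrolling by four with independent accumulators lets an in-order core overlap four load latencies per iteration instead of serialising them through one dependency chain.

```python
def dot_unrolled4(a, b):
    """Dot product, unrolled by four with independent accumulators.
    On an in-order core the four loads per iteration can be hoisted
    and their latencies overlapped; in Python this only shows the shape."""
    n = len(a) - len(a) % 4
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, n, 4):
        # four independent dependency chains
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    # handle the leftover elements when len(a) is not a multiple of 4
    tail = sum(a[i] * b[i] for i in range(n, len(a)))
    return s0 + s1 + s2 + s3 + tail
```

The extra accumulators are the "something to do while waiting": each chain only depends on itself, so a stalled load in one chain doesn't block progress on the other three.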
 