Wii broadway update

darkblu · Sep 11, 2006

PiNkY said:
With much higher code density and instruction decoding out of the critical path, much more sophisticated branch prediction, cache architecture and elaborate prefetching, a robust OOOE implementation, and integer ALUs at 4 and 4.8 GHz respectively, a single Xenon core would surpass it only on very, very biased workloads; but given a fair and representative set of real-world problems and equally elaborate implementations on both architectures, the P4 would pull far ahead on average.

ditto. i don't see a similarly-clocked (as in 'within 100%') in-order core beating an OOOe core in a general-purpose scenario. for specialized cases, maybe, depending on what specialization that in-order core has.
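
To put the "within 100%" threshold into numbers, the usual first-order model is throughput ≈ IPC × clock. Below is a minimal Python illustration under that assumption; the function names and the IPC/clock values are hypothetical placeholders, not measured figures for Xenon, the P4 or any other chip.

```python
def throughput(ipc: float, clock_ghz: float) -> float:
    """Instructions per nanosecond under the simple IPC x clock model."""
    return ipc * clock_ghz


def required_ipc_ratio(clock_ratio: float) -> float:
    """Fraction of the OoOE core's IPC an in-order core must sustain to
    break even when it is clocked clock_ratio times higher."""
    return 1.0 / clock_ratio


# Hypothetical numbers: an OoOE core at 2.0 GHz sustaining IPC 1.0,
# versus an in-order core clocked 100% higher (4.0 GHz).
ooo = throughput(ipc=1.0, clock_ghz=2.0)
for in_order_ipc in (0.3, 0.5, 0.7):
    in_order = throughput(ipc=in_order_ipc, clock_ghz=4.0)
    verdict = "faster" if in_order > ooo else "not faster"
    print(f"in-order IPC {in_order_ipc}: {in_order:.1f} vs {ooo:.1f} instr/ns ({verdict})")
```

Under this model an in-order core clocked twice as high only breaks even if it sustains half the OoOE core's IPC; the posts above argue that on general-purpose code it often doesn't.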

Guden Oden · Sep 11, 2006

theafu said:
while the L1 is 256K (1st report) or 128K (2nd report).

That's pretty much impossible. No mass-market CPU in history has had a 256K L1; heck, even the Athlons/Opterons "only" have L1s of 32K each AFAIR, and it's known that a larger L1 leads to worse latencies than a smaller one. Admittedly, it's not as bad a problem at lower clock speeds, but I really don't see the point of a 256K L1 and a half-meg L2; most of the L2 is just going to be a duplication of L1, which is very wasteful.

Fox5 said:
From what's been done on the Xbox 360 so far, I'd estimate each core to be about the performance of a 2 GHz to 2.4 GHz Pentium 4

Ye gods!
That's a huge overestimation I'd say, a complete masturbatory fantasy from la-la land.
Yes, on SOME workloads it might perform like that, and on a few even better, but not on average!

WHAT is currently available on the Xbox 360 that gives reason for such an optimistic interpretation of reality? Mind you, cross-platform games such as Quake 4 or Prey, which are at least somewhat comparable to what we run on PCs, do not run particularly smoothly on the 360.

though I'm also assuming current games are only single-threaded.

That is incorrect. Even such a simple, early release as Geometry Wars uses all three cores.

Most people would probably consider that estimation on the low side, though

Ehrm, they would?

Seems to me you underestimate the throughput of a modern wide OoOE core, or seriously overestimate the throughput of a narrow in-order (IOE) core like the 360's.

but it's way beyond what a sub-1 GHz G3 should be capable of.

Probably depends on the code being executed. The 360's CPU cores are quite deeply pipelined from what I recall, while the G3 is extremely shallow in comparison, so loopy code, for example, would close the gap between them considerably. I'm sure there are other examples where they might perform somewhat on par, while there are also likely to be plenty of examples where the 360 core would own.
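
One common reason a shallow pipeline narrows the gap is the branch-misprediction penalty: a deeper pipeline discards more in-flight work on every mispredict. A first-order Python sketch under that assumption follows; the penalties, clocks, base CPI and branch frequency are hypothetical values chosen only to show the trend, not specifications of the 360's cores or the G3.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """First-order model: base CPI plus the average pipeline-flush cycles
    per instruction caused by mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def ns_per_instruction(cpi: float, clock_ghz: float) -> float:
    """Average time per instruction in nanoseconds."""
    return cpi / clock_ghz


# Hypothetical cores: same base CPI, different flush penalty and clock.
DEEP = {"penalty_cycles": 20, "clock_ghz": 3.2}    # long pipeline, high clock
SHALLOW = {"penalty_cycles": 4, "clock_ghz": 0.9}  # short pipeline, sub-1 GHz


def slowdown_of_shallow(mispredict_rate: float, branch_fraction: float = 0.2,
                        base_cpi: float = 1.0) -> float:
    """How many times slower the shallow core is than the deep one."""
    times = {}
    for name, core in (("deep", DEEP), ("shallow", SHALLOW)):
        cpi = effective_cpi(base_cpi, branch_fraction, mispredict_rate,
                            core["penalty_cycles"])
        times[name] = ns_per_instruction(cpi, core["clock_ghz"])
    return times["shallow"] / times["deep"]


for rate in (0.0, 0.1, 0.3):
    print(f"mispredict rate {rate:.1f}: shallow core {slowdown_of_shallow(rate):.2f}x slower")
```

As the mispredict rate rises, the deep core's advantage shrinks, though with these parameters it does not disappear.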

N00b · Sep 11, 2006

Guden Oden said:
That's pretty much impossible. No mass-market CPU in history has had a 256K L1; heck, even the Athlons/Opterons "only" have L1s of 32K each AFAIR, and it's known that a larger L1 leads to worse latencies than a smaller one. Admittedly, it's not as bad a problem at lower clock speeds, but I really don't see the point of a 256K L1 and a half-meg L2; most of the L2 is just going to be a duplication of L1, which is very wasteful.

Modern Athlon 64 / X2 chips have 128 KB of L1 cache (64 KB instruction + 64 KB data). L1 and L2 work in exclusive mode, meaning data is either in L1 or in L2 but not in both.
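
The inclusive-versus-exclusive point comes down to simple arithmetic. This is a simplified Python model, assuming either a strictly inclusive or a strictly exclusive two-level hierarchy (real designs can also be non-inclusive); the function name is hypothetical.

```python
def unique_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Distinct data (in KB) a two-level hierarchy can hold.

    Strictly inclusive: every L1 line is also kept in L2, so L2 bounds the total.
    Exclusive: a line lives in L1 or L2 but never both, so the sizes add up.
    """
    return l1_kb + l2_kb if exclusive else max(l1_kb, l2_kb)


# Sizes from the thread: the rumoured 256K L1 with a half-meg L2.
l1, l2 = 256, 512
print("inclusive:", unique_capacity_kb(l1, l2, exclusive=False), "KB")
print("exclusive:", unique_capacity_kb(l1, l2, exclusive=True), "KB")
print(f"L2 share duplicating L1 if inclusive: {l1 / l2:.0%}")
```

With an inclusive design half of that L2 would mirror the L1; an exclusive design like the K8's avoids the duplication at the cost of swapping lines between the levels.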

Apart from that I have to agree with pretty much all you wrote.

theafu · Sep 11, 2006

My intention is not to confuse you, Guden Oden. To clarify: from what I was told, the 256K L1 figure is the data and instruction caches added together, not 256K each.

Shifty Geezer · Sep 11, 2006

Doesn't large L1 add considerably to the latency? I thought that was the reason it was kept small and a separate slower, larger L2 cache used.
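
The trade-off behind that question is usually expressed as average memory access time (AMAT): the L1 hit time is paid on every access, while misses pay for the slower levels. A minimal Python sketch of the standard two-level formula; every latency and miss rate below is a hypothetical illustration, not data for any real chip.

```python
def amat(l1_hit: float, l1_miss_rate: float,
         l2_hit: float, l2_miss_rate: float, mem_latency: float) -> float:
    """Average memory access time in cycles for a two-level cache.

    Every access pays the L1 hit time; L1 misses additionally pay the L2
    time, and the fraction that also misses L2 pays the memory latency.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)


# Hypothetical parameters: the larger L1 misses less often but costs
# extra cycles on every hit. L2 and memory figures are identical.
small_l1 = amat(l1_hit=2, l1_miss_rate=0.10, l2_hit=12, l2_miss_rate=0.20, mem_latency=200)
large_l1 = amat(l1_hit=5, l1_miss_rate=0.07, l2_hit=12, l2_miss_rate=0.20, mem_latency=200)
print(f"small, fast L1: {small_l1:.2f} cycles; large, slower L1: {large_l1:.2f} cycles")
```

With these made-up numbers the small L1 wins because its hit time is paid on every access; with a smaller hit-time penalty or a bigger miss-rate gain, the balance tips the other way.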

darkblu · Sep 11, 2006

Shifty Geezer said:
Doesn't large L1 add considerably to the latency? I thought that was the reason it was kept small and a separate slower, larger L2 cache used.

i'd assume it depends on the physical characteristics of the memory (like access latency) and on the mapping/associativity scheme employed.

N00b · Sep 11, 2006

Shifty Geezer said:
Doesn't large L1 add considerably to the latency? I thought that was the reason it was kept small and a separate slower, larger L2 cache used.

Not necessarily:

www.sandpile.org said:
P4 L1 Data Cache (65nm):
16 KB, 8-Way, 64 Byte/Line, MESI,
1 Line/Sector, Write-Through, Pseudo-LRU,
Non-blocking (up to 8 Load Misses),
Virtually Addressed, Physically Tagged,
Alias Conflicts at 4 MB: Bits 21...6,
Dual-ported (1 Load and 1 Store),
4/12 Cycle Latency (Integer/FP),
16 Byte Path to FP Unit for Loads

www.sandpile.org said:
K8 L1 Data Cache:
64 KB, 2-Way, 64 Byte/Line, MOESI, LRU,
Dual-ported, WB, WA, 8 Banks, ECC,
3 Cycle Latency
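
The sandpile.org figures can be turned into cache geometry: number of sets = size / (ways × line size), and an address splits into tag, set index and line offset. A minimal Python sketch, modelling only the size, associativity and line-size fields quoted above (sectoring, banking, virtual indexing and replacement policy are ignored); the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheGeometry:
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        return self.line_bytes.bit_length() - 1  # log2 of a power of two

    @property
    def index_bits(self) -> int:
        return self.sets.bit_length() - 1

    def split(self, addr: int) -> tuple[int, int, int]:
        """Split an address into (tag, set index, byte offset within the line)."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


P4_L1D = CacheGeometry(size_bytes=16 * 1024, ways=8, line_bytes=64)
K8_L1D = CacheGeometry(size_bytes=64 * 1024, ways=2, line_bytes=64)

for name, c in (("P4 L1D", P4_L1D), ("K8 L1D", K8_L1D)):
    hi_bit = c.offset_bits + c.index_bits - 1
    print(f"{name}: {c.sets} sets, index bits {hi_bit}..{c.offset_bits}, "
          f"{c.ways} tag compares per lookup")
```

The 2-way K8 array compares only two tags per lookup against the P4's eight, which is one factor (not the only one) in how a cache four times larger can still report a 3-cycle latency; clock speed and circuit design matter as well.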

Megadrive1988 · Sep 11, 2006

I'm thinking Broadway will have 128K of L1 cache: 64K data, 64K instruction.

1 MB of L2 would be nice, but even 512K L2 would still be a nice boost over Gekko.
