Xenon System Block Diagram

Right, I understand that.

I was trying to clarify that the Xbox 2 CPU configuration is 3 cores, not 6 cores, but you get six threads, or 6 virtual CPUs. Each CPU is not really a dual core.

Now *if* the Xbox 2 CPU were based on Power4 or Power5, you would have 2 actual cores and 4 threads per CPU.
 
Am I the only one who thinks it's weird that gamesindustry.biz doesn't even say who their so-called expert is?
 
No, it's 3 CPUs on a single die with 6 virtual cores (dual-threaded)... not 6 CPUs in total.
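To make the arithmetic everyone is circling around explicit, here is a trivial sketch (Python is just an illustration language here; the 3-core, 2-thread-per-core figures are the ones being confirmed above):

```python
# An SMT core exposes one logical CPU per hardware thread, so the
# "6 virtual CPUs" figure is simply cores times threads per core.
physical_cores = 3     # three PPC cores on a single die, per the thread
threads_per_core = 2   # dual-threaded (2-way SMT)

logical_cpus = physical_cores * threads_per_core
print(logical_cpus)  # 6 -- six hardware threads, but still only 3 real cores
```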

Thanks for the confirmation.

Am I the only one who thinks it's weird that gamesindustry.biz doesn't even say who their so-called expert is?

Well, I don't think they can.
 
3dcgi said:
Am I the only one who thinks it's weird that gamesindustry.biz doesn't even say who their so-called expert is?

maybe it was Deadmeat? :LOL:

Seriously, I think, like almost everyone, that it's 3 cores with 2 threads each, or they count the VMX (IBM's name for AltiVec) as a processor.
I don't think there's such a thing as a 6 (real) core CPU in the XboxNext a.k.a. Xenon; 3 PPC 970s at those high clocks is already a lot of processing power.
 
I don't think there's such a thing as a 6 (real) core CPU in the XboxNext a.k.a. Xenon; 3 PPC 970s at those high clocks is already a lot of processing power.

It doesn't have to be that way; they could have 2 of those 3-core chips, for example. But Qroach confirmed it for us, so 3 cores it is now.

Each Shader Unit can co-issue a Scalar and a Vector operation ( remember, this Shader unit should be able to do both Vertex Shading and Pixel Shading ).
I am thinking of MADD as the operation for the Vector and Scalar ALUs, which means 8 FP ops/cycle and 2 FP ops/cycle respectively.

If this is supposed to be a new architecture, those 48 ALU ops could all be vector ops. So there are 48 units capable of either scalar or vector ops.
 
Am I the only one who thinks it's weird that gamesindustry.biz doesn't even say who their so-called expert is?
Meh, as far as rebuttals go that one was rather weak at any rate.
The diagram states size, latency, and bandwidth for the memory, and the guy still complains about the memory "type" not being specified? I mean, geez...
 
Fafalada said:
The diagram states size, latency, and bandwidth for the memory, and the guy still complains about the memory "type" not being specified? I mean, geez...

That was my thought exactly. I was half expecting the next sentence to bitch about not knowing the RAM manufacturer...
 
Saem said:
Three G5 class CPUs running at 3.5 GHz, with 1 MB of shared L2, ultra fast FSB and very fast main RAM will smoke Desktop PCs that are out at the same time Xbox 2 launches ( mid 2005 ).

Tough to say; the CPU race on the desktop could speed up, but I think you're basically on the money here. I just feel like taking some potency out of this statement. Price/performance-wise, this will likely make many things its bitch.

With three possibly dual-threaded CPU cores sharing the same 1MB L2, won't there be an awful lot of cache thrashing going on?

Possibly.

That tends to happen a lot on the Northwood P4, after all, and it only offers 2 threads on 512k of cache... Maybe cache lines can be locked in L2, or there are ways to partition the cache so different threads don't bump each other out of it. Guess we'll learn eventually.
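The thrashing worry can be made concrete with set-index arithmetic. A rough sketch (the 1 MB size comes from the diagram; the 128-byte line and 8-way associativity are assumptions purely for illustration):

```python
# Two addresses land in the same L2 set whenever they are a multiple of
# (num_sets * line_size) apart -- that's when threads start evicting
# each other's lines.
cache_size = 1 * 1024 * 1024   # 1 MB shared L2 (from the diagram)
line_size = 128                # bytes per line (assumed)
ways = 8                       # associativity (assumed)

num_sets = cache_size // (line_size * ways)   # 1024 sets with these numbers

def set_index(addr):
    return (addr // line_size) % num_sets

stride = num_sets * line_size  # 128 KB: addresses this far apart collide
a, b = 0x10000, 0x10000 + stride
print(set_index(a) == set_index(b))  # True: same set, so they compete for ways
```

With six threads streaming at an unlucky stride, eight ways fill up quickly; locking or way-partitioning the L2, as speculated above, would cap how many ways any one thread can claim.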

The cache lines on the P4 are quite large, so eviction rates are ugly, which is one significant reason the L2 cache increase going from Willamette to Northwood was so dramatic.


High latency hampers CPUs, and hyperthreading is going to try to hide the latency. Connecting very low latency memory to any multi-core CPU should improve performance. Having a segmented memory layout could be the path MS goes. Have a pool of 128 MB of Reduced Latency DRAM (RLDRAM) for the tri-core IBM CPU. Then for the VPU, have a pool of GDDR3 (or GDDR4 for a 2006 launch) to provide a flood of bandwidth. Get rid of any eDRAM, concentrate the transistor budget on number crunching, and rely on external RAM for the bandwidth.

The CPU needs low latency and the VPU needs bandwidth. No silver bullet in RAM technology exists, so segmenting looks to me like the best solution for Microsoft.
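As a sketch only, the proposed split can be written down like this (the 128 MB RLDRAM figure is from the post; the VPU pool size is deliberately left unspecified because the post doesn't give one, and nothing here is confirmed hardware):

```python
# Hypothetical segmented memory layout, restating the proposal above as data.
memory_pools = {
    "cpu": {"tech": "RLDRAM", "size_mb": 128,  "optimized_for": "latency"},
    "vpu": {"tech": "GDDR3",  "size_mb": None, "optimized_for": "bandwidth"},  # size not stated
}

def pool_for(bottleneck):
    """Route each client to the pool matching its bottleneck -- the
    design rule the post is arguing for."""
    return "cpu" if bottleneck == "latency" else "vpu"

print(pool_for("latency"))    # cpu
print(pool_for("bandwidth"))  # vpu
```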
 
Panajev2001a said:
The Shader units are 24 and not 48...

48 Shader ops = 24 units * ( 1 scalar op + 1 vector op )

24 * ( 8 ops + 2 ops ) * 0.5 GHz = 120 GFLOPS.

No, the terminology is not right / is misleading. Although it says "48 ALU ops", a single ALU op could encompass more than a single FP operation. AFAIK it's 48 ALUs, but I don't know the exact op breakdowns.
 
I think that diagram of Xenon is real. Microsoft's legal department told TeamXbox to remove the picture, and when someone posted it in the forums yesterday, it was taken down today. Why would Microsoft's legal department care if it were a fake?

Qroach:

So the CPU is triple-core. I wonder what it's called. I frequent macosrumors.com and I haven't read about any future PPC for the Mac with three cores. Well, there is going to be a quad-core PPC called the PPC980.
 
High latency hampers CPUs, and hyperthreading is going to try to hide the latency.

It's a trade. What's it traded for, you ask? Bandwidth. You hide the latency, but you need more bandwidth to let the work happen and keep the execution units fed.
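The latency-for-bandwidth trade described here is essentially Little's law: sustained bandwidth equals bytes in flight divided by latency, so adding threads to hide latency directly raises the bandwidth demand. A rough sketch with purely illustrative numbers (none of these figures come from the diagram):

```python
# Little's law for a memory system: sustained bytes/s =
# (bytes in flight) / latency. SMT exists precisely to keep more
# requests in flight while each one waits on memory.
latency_s = 100e-9   # 100 ns main-memory latency (illustrative)
line_bytes = 128     # bytes fetched per outstanding miss (illustrative)

def bandwidth(outstanding_misses):
    """Sustained bytes/s with this many misses in flight."""
    return outstanding_misses * line_bytes / latency_s

# One blocked thread vs. six threads each holding a miss in flight:
print(bandwidth(1) / 1e9)  # ~1.28 GB/s
print(bandwidth(6) / 1e9)  # ~7.68 GB/s -- the hidden latency shows up as bandwidth demand
```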

Connecting very low latency memory to any multi-core CPU should improve performance.

Connecting very low latency memory to any CPU should improve performance, assuming the task is heavy on random memory access. The notion couched within the statement seems off.

Having a segmented memory layout could be the path MS goes. Have a pool of 128 meg of Reduced Latency Dram (RLDRAM) for the the tri-core IBM CPU. Then for the VPU have a pool of GDDR-3 (or GDDR-4 for a 2006 launch) to provide a flood of bandwidth for the VPU. Get rid of any eDRAM and concentrate the transistor budget on number crunching and reley on external ram for the bandwidth.

Maybe.

Really, I think you're off the mark. Since there can be 6 threads executing, you have more execution opportunities, which will require bandwidth; the schedulers will likely work to hide the latency of the various things that might stall execution: branches, cache misses, and so on. All of this gets traded off for bandwidth!

Also, the shared L2 cache, though small, might be somewhat hiding the scheduling issues by reducing the impact of going against CPU affinity.
 
DaveBaumann said:
Panajev2001a said:
The Shader units are 24 and not 48...

48 Shader ops = 24 units * ( 1 scalar op + 1 vector op )

24 * ( 8 ops + 2 ops ) * 0.5 GHz = 120 GFLOPS.

No, the terminology is not right / is misleading. Although it says "48 ALU ops", a single ALU op could encompass more than a single FP operation. AFAIK it's 48 ALUs, but I don't know the exact op breakdowns.

Dave, let's go step by step, shall we ?

"48 ALU ops" ( or let's call it 48 Shader ops or 48 Shader instructions )

Ok so far ?

You mentioned co-issue: I interpreted that as the Shader ALU being able to issue in parallel 1 Vector instruction ( or Vector Shader op ) and 1 Scalar instruction ( or Scalar Shader op ).

Ok so far ?

Well, we know now that a single Shader ALU is responsible for doing 2 Shader ops per cycle when it can co-issue.

Saying 48 Shader ops makes me think we have 24 Shader ALUs, as each doing 2 Shader ops per cycle would produce the expected total of 48 Shader ops per cycle.

Ok so far ?

I interpreted each op or Shader instruction to be either a 4-way parallel MADD ( 8 FP ops/cycle ) or a single MADD ( 2 FP ops/cycle ) in the case of Scalar Shader ops/instructions.

While co-issuing, a Shader ALU could then work on a total of 10 FP operations at once ( 8 + 2 ).

This would mean a peak of 240 FP ops/cycle done by the 24 Shading ALUs.

Multiply by 0.5 GHz and you get a nice total of 120 GFLOPS.

If it is 48 ALUs, and not 24 ALUs each doing a Vec4 instruction and a Scalar instruction in parallel ( co-issuing )... how can co-issuing, the way I am understanding it, produce only 48 Shader ops per cycle ?
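Spelling out both readings as arithmetic may help (assumptions: a MADD counts as 2 FP ops per component, and the 0.5 GHz clock used in the posts above; the 48-ALU count is the alternative reading already mentioned earlier in the thread):

```python
clock_hz = 0.5e9  # 500 MHz, as used in the derivation above

# Reading 1: 24 ALUs, each co-issuing a vec4 MADD (4 components * 2 FP ops)
# plus a scalar MADD (2 FP ops) -> 10 FP ops per ALU per cycle.
gflops_24_coissue = 24 * (4 * 2 + 2) * clock_hz / 1e9
print(gflops_24_coissue)  # 120.0

# Reading 2: 48 ALUs, each guaranteed one vec4 MADD per cycle
# (any co-issue on non-vec4 ops would be a bonus on top of this).
gflops_48_vec4 = 48 * (4 * 2) * clock_hz / 1e9
print(gflops_48_vec4)  # 192.0
```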
 
Pana - I know what you are saying; it's just that the slide is misleading in this respect. Are you sure that what they call an 'ALU operation' is actually equivalent to a floating-point operation?

[Note that co-issue implementations in fragment pipelines are not all a full vector op plus scalar - neither R300 nor NV40 is able to co-issue when a full vec4 operation is required; only when fewer than 4 components are used can co-issue occur. This is another reason why I would suggest that the 48 number refers to 48 vector operations per cycle, with co-issue potentially occurring on non-vec4 ops. (Another note: this is not the case with R300's and NV40's vertex pipelines, as those do both have the capability of co-issuing a full vec4 operation and a scalar.)]
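The fragment-pipeline co-issue constraint described in that note reduces to a one-line predicate. A minimal sketch of the rule as stated (the function name is mine, not any driver's API):

```python
# Co-issue rule for R300/NV40 fragment pipelines, per the note above:
# a scalar op can pair with the vector op only when the vector
# instruction uses fewer than 4 components. A full vec4 occupies the
# whole issue slot. (Vertex pipelines on those parts are not so limited.)
def fragment_coissue_ok(vector_components):
    return vector_components < 4

print(fragment_coissue_ok(3))  # True: vec3 + scalar can dual-issue
print(fragment_coissue_ok(4))  # False: vec4 leaves no room for the scalar
```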

Take it from another angle. R300 features 16 ALUs in its pixel pipelines (24 if you wanted to count the texture address processors), and NV40 already features 32, and these get bogged down by even today's shader instruction counts, which are probably not approaching an average of 10 instructions per pixel even in "heavy" shader titles. Given the likely targets of XB2 and its time to market, do you feel that 24 ALUs would be sufficient (true, they may be a little more featured than current ones, but probably not extremely so)? By the time XB2 is available I would guess that PC parts will be in the 48-64 ALU range, if not more, and they probably won't be facing targets as demanding as the XB2 is likely to have.
 
I concede defeat then, Dave; I can see your point: even with more features ( better branch handling, etc. ), this does not seem much of a jump over NV40 if this is to be a mid-2005 product.

I think these slides are not misleading if you think about them in the context of a late-2004 Xbox 2 CPU ( even though, to hit that date, the CPU would have had to be underclocked, so this diagram might be a transition from the 2004 specs to the 2005 specs, with the GPU being the item still to upgrade ).

In that context my analysis would make sense: even without looking at the extremely high CPU power, this kind of GPU would bring NV40 and R420 to school, as it is quite a bit faster IMHO.

My assumption would be, in this context, that the Shader ALUs have indeed been modified, since now we are talking about load balancing between VS operations and PS operations, and, as you said, the VS of the R300 can co-issue a Vec4 instruction and a Scalar instruction, so it would make sense for these new unified Shader ALUs to have capabilities from both the VS and PS kinds of ALUs as we see them in DirectX 9.0x GPUs.
 
When you consider the application in this case (and even in PC cases), I don't believe that would make sense, since in the majority of cases it'll be spending its time on fragment operations rather than vertex ops - best to optimise for the majority usage (are 5D ops that frequent in fragment processing?).

With the co-issue we have at the moment, you can liken it to the message I had about the memory - 1 vec4 operation per ALU is the guarantee; a co-issue is a bonus.

I'd also suggest that a 24 ALU part wouldn't be able to school NV40 or R420, as both of these have more fragment ALUs (although not all are fully featured), and they also have separate vertex processor ALUs, whereas the Xbox part is going to have to handle vertex work along with fragment ops within its ALU allocation. IMO 48 vec4 ops is the minimum required for the number of shader operations per pixel MS is targeting.
 
Unless "R600" represents a weird shifting of project names, or is a project running concurrently with rather than--as we expect--after R500, I don't see how Xbox2's chip would wear that moniker best. Is X2 indeed slipping well into 2006? Is R600 perhaps a project attempting to see how an eDRAM-laden card would do in the PC marketplace or in other sectors, and that's the more determining factor? Or an alternatively designed product that's not really "the generation ahead"?

Considering that it seems R500 will be released around the same time as the Xbox2 (at least by many assumptions right now), and quite possibly well after, since X2's GPU would have to be solidified and produced in much higher volumes to assure a good console launch... would that chip really be the next designation up?

It seems much like "R600" is meaningless at the moment, as everyone expects it to represent "the next-generation architecture above R500" aimed at the PC. (Which can, of course, have variations elsewhere.) If it's NOT... well, then we need to know what it is first.

Hopefully May 5th will indeed bring some clarity, because at the moment there's not much of it as far as this is concerned.
 
cthellis42 said:
Unless "R600" represents a weird shifting of project names, or is a project running concurrently with rather than--as we expect--after R500, I don't see how Xbox2's chip would wear that moniker best. Is X2 indeed slipping well into 2006? Is R600 perhaps a project attempting to see how an eDRAM-laden card would do in the PC marketplace or in other sectors, and that's the more determining factor? Or an alternatively designed product that's not really "the generation ahead"?

Considering that it seems R500 will be released around the same time as the Xbox2 (at least by many assumptions right now), and quite possibly well after, since X2's GPU would have to be solidified and produced in much higher volumes to assure a good console launch... would that chip really be the next designation up?

It seems much like "R600" is meaningless at the moment, as everyone expects it to represent "the next-generation architecture above R500" aimed at the PC. (Which can, of course, have variations elsewhere.) If it's NOT... well, then we need to know what it is first.

Hopefully May 5th will indeed bring some clarity, because at the moment there's not much of it as far as this is concerned.

Yes, cthellis, an inferior card launching alongside the Box, or perhaps even somewhat later, doesn't appear to make much sense in a technologically advancing PC graphics card market (especially from ATI). Perhaps a hybrid R500/600 is more feasible for Xenon given its estimated launch time frame, giving Xenon a small performance-advantage window over its PC contemporaries, much like the Xbox had with Nvidia? But doesn't this go against their decision to lengthen the development time between future card architectures to, what was it, 18 months?
 
DaveBaumann said:
Take it from another angle. R300 features 16 ALUs in its pixel pipelines (24 if you wanted to count the texture address processors), and NV40 already features 32, and these get bogged down by even today's shader instruction counts, which are probably not approaching an average of 10 instructions per pixel even in "heavy" shader titles. Given the likely targets of XB2 and its time to market, do you feel that 24 ALUs would be sufficient (true, they may be a little more featured than current ones, but probably not extremely so)? By the time XB2 is available I would guess that PC parts will be in the 48-64 ALU range, if not more, and they probably won't be facing targets as demanding as the XB2 is likely to have.

I wanted to re-address this part one more time: considering the additional features these Shader ALUs might have, and the fact that the target resolution for the XGPU2 is likely going to be between 640p and 720p while PC GPUs are usually optimized for higher resolutions, how fast do you think this GPU would be in Pixel Shading ( except Vertex Texturing; I expect the Vertex Shading to be done by those three VMX/AltiVec units over there ;) ) ? ( screen resolution = 480p-720p )
 