Lysander said:Jeffrey Brown's IBM article on the 360 CPU
All that tells us is the scalar DP ops have a 10 cycle latency. The given figures for Xenon's floating point performance (115, or 90 or 75.6) are SP figures.
Titanio said:I think you mean parallelism. And I think you'll find that Xenon is a parallel architecture also - without parallelism on a core level, it'd be sitting there with ~25.6-30 Gflops (assuming everything else about it stayed the same). Parallelism is being generally adopted as the route to more power now, that's just the way it is.
edit - actually, I'm not really sure what you're referring to now. SPEs run one thread at a time. In that sense, it compels you to ensure that your single thread is using as much power as possible and does not block. The SPEs, on their own, are single threaded (potential software solutions for multi-threading on one SPE aside).
dukmahsik said:i wonder which has the steeper learning curve, xenos or cell
Titanio said:I've heard the FPU - the other 4 flops - cannot execute unless the VMX is executing a load/store or logical operation? I don't know if the missing 4 flops can be accounted for in other ways also, though.
ROG27 said:I think where we are getting mixed up here, people, is the shader operations which take place on the main die vs. the daughter die. Approximately 216 programmable shader ops take place on the main die... the other 26 take place on the daughter die and have solely to do with post-rendering effects (not fully programmable).
Dave Baumann said:The leaks are not particularly accurate in all places - for instance the overview states the shader array is 24G Instructions/s, which is wrong since it's 48G Instructions/s.
Dave Baumann said:They also highlight no capability differences between the Vector and Scalar portions.
aaronspink said:The problem is we don't know what is actually correct, the "technical" leak doc or the various other MS/ATI documents. Nor do we know whether the technical leak doc is taking into account certain scheduling restrictions and, if so, whether the numbers quoted by IBM/Sony/Nvidia are taking into account similar scheduling restrictions.
For Xenon, I get 76.8 GFLOPS if the core can't dual issue VMX and FPU, 96 GFLOPS if it can (but it's kinda pointless really since you'll at least need to do some loads and stores).
aaronspink said:The 115 GFLOPs number seems to be counting 12 flops per cycle which I still don't understand how they get.
aaronspink said:Likewise, I only get 204 GFLOPs for CELL.
Jaws said:I get that too...
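For anyone who wants to check the arithmetic behind these peak figures, here is a rough sketch. It assumes a 3.2 GHz clock for both chips, 3 cores for Xenon, a 1+7 configuration for Cell, 8 flops per cycle for a 4-wide fused multiply-add unit and 2 flops per cycle for a scalar FPU doing a fused multiply-add. None of those per-unit numbers are spelled out in the posts above, so treat them as assumptions; the function and variable names are made up for illustration.

```python
# Peak SP GFLOPS = cores * flops per core per cycle * clock.
# All constants below are assumptions, not figures confirmed in the thread.
CLOCK_MHZ = 3200   # assumed 3.2 GHz for both Xenon and Cell
VMX_FLOPS = 8      # assumed: 4 SIMD lanes * 2 flops (multiply + add)
FPU_FLOPS = 2      # assumed: one scalar multiply-add per cycle


def peak_gflops(cores, flops_per_cycle, clock_mhz=CLOCK_MHZ):
    """Return theoretical peak GFLOPS for identical cores."""
    return cores * flops_per_cycle * clock_mhz / 1000


# Xenon, 3 cores: VMX only, VMX + FPU dual issue, and the 12 flops/cycle case.
xenon_vmx_only = peak_gflops(3, VMX_FLOPS)
xenon_dual_issue = peak_gflops(3, VMX_FLOPS + FPU_FLOPS)
xenon_12_per_cycle = peak_gflops(3, 12)

# Cell, 1 PPE + 7 SPEs, each counted as one 4-wide multiply-add unit.
cell_1_plus_7 = peak_gflops(8, VMX_FLOPS)

print(xenon_vmx_only, xenon_dual_issue, xenon_12_per_cycle, cell_1_plus_7)
```

Under these assumptions the 115 figure only falls out if each core is credited with 12 flops per cycle, which is the part that is hard to account for with one VMX unit plus one FPU.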
Jawed said:They didn't leave any room on 110nm - 7800GTX-512 isn't a viable product, just a few thousand boards for the purposes of marketing.
dukmahsik said:i wonder which has the steeper learning curve, xenos or cell
I'm not a games programmer, and thus I can't really say what one of those would find difficult. It would depend on their background and their personality, I guess.
Jaws said:Dave Baumann said:The leaks are not particularly accurate in all places - for instance the overview states the shader array is 24G Instructions /s, which is wrong since its 48G Instructions/s.
Yeah, I've got the overview doc now and it does state that. However, if you're suggesting that the 216 figure is a typo, then the odds are against it for the following reasons:
- a 2nd typo in the same technical doc is less likely
- the 24/48 figure can easily be a typo; errors by factors of 2, 4 or 10 usually are. The 216 number isn't such a factor of 240, nor are the digits near each other on the keyboard/keypad...
- you can cross reference the 216 number because it's involved in a calculation, i.e. 216 (Xenos) / 88 (R420) ~ 2.455 ~ 2.5
- the 216 number is non-obvious and its derivation was in the Watch article...
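A quick sketch of that cross-reference, for anyone who wants to redo it. The 216 and 88 figures are the ones quoted above, and 240 is the alternative the typo theory would imply; the variable names are just for illustration.

```python
# Cross-check: which Xenos figure is consistent with a "~2.5x R420" ratio?
r420_ops = 88          # R420 figure quoted in the post
xenos_ops_doc = 216    # figure stated in the leaked doc
xenos_ops_alt = 240    # alternative figure if 216 were a typo

ratio_doc = xenos_ops_doc / r420_ops
ratio_alt = xenos_ops_alt / r420_ops

# Show each ratio to three decimals and rounded to one decimal.
print(round(ratio_doc, 3), round(ratio_doc, 1))
print(round(ratio_alt, 3), round(ratio_alt, 1))
```

Only the 216 figure rounds to 2.5; 240 would give roughly 2.7.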
rendevous said:At UIUC, Dr. Peter Hofstee gave a presentation about Cell; the presentation can be found at the following URL.
http://www.acm.uiuc.edu/conference/webcast.php
(Cred goes to phed for finding it)
Between 40:18 and 40:31 in the video a block diagram of the PPE is shown. It seems like the PPE is able to issue 2 instructions to the VMX/FPU units per clock, which would raise the peak GFLOPs to above 204.8 for a 1+7 configuration.
Jaws said:Thanks for the link. I had a look, but unless I'm missing something, I can't see it in the same cycle...?
rendezvous said:I figured that the numbers next to the internal buses were the widths of the buses in number of instructions. The bus going from the VMX FPU Issue (Queue) stage has a width of 2, just like the general issue stage. There may, however, be limitations on which instructions you can pair in the FPU/VMX units, which would affect the performance numbers.