ATI - PS3 is Unrefined

london-boy said:
Not sure that made sense, but in short, everyone saw it!! Where have u been! ;)

Totally coincidental, but it turns out, I was in this really weird place, where I couldn't see it! :oops:

Not even those big eyes helped, I'm afraid. :???:

Edit: I most certainly will. :devilish:
 
Last edited by a moderator:
TurnDragoZeroV2G said:
216 doesn't seem so strange, since it can be done with a whole number. But the 90 doesn't make much sense. That one seems totally random.

I take it you certainly would mind passing that leak on...

For XeCPU, they quote 8 Flops/cycle per core, which should net you 8 x 3 cores x 3.2 GHz = 76.8 GF

But they state 84 GF @ 3 GHz, which I extrapolated to 90 GF @ 3.2 GHz.

Yep this bit doesn't make sense. However 8 Flops/cycle doesn't match 12 Flops/cycle (for 115 GF) and pretty much confirms that the FPU + VMX cannot concurrently sustain 2 threads (per core) to give you that peak number (115 GF)...
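
For anyone who wants to check the arithmetic in this post, here is a minimal sketch. It assumes only the figures quoted above (3 cores, 3.2 GHz, 8 or 12 flops per cycle per core, and the leak's 84 GF at 3 GHz); the function names are made up for illustration.

```python
def peak_gflops(flops_per_cycle, cores, clock_ghz):
    """Paper peak: flops per cycle per core x number of cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz


def scale_to_clock(gflops, from_ghz, to_ghz):
    """Linearly rescale a quoted figure from one clock speed to another."""
    return gflops * to_ghz / from_ghz


# 8 flops/cycle/core, as quoted in the leak for XeCPU
print(round(peak_gflops(8, 3, 3.2), 1))
# 12 flops/cycle/core, which is what the ~115 GF figure implies
print(round(peak_gflops(12, 3, 3.2), 1))
# The leak's 84 GF @ 3 GHz, extrapolated to 3.2 GHz (the "~90" above)
print(round(scale_to_clock(84, 3.0, 3.2), 1))
```

None of the three results coincide, which is the mismatch being pointed at: 8 flops/cycle gives 76.8, the extrapolated leak figure lands just under 90, and only 12 flops/cycle gets you to roughly 115.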
 
Jaws said:
Yep this bit doesn't make sense. However 8 Flops/cycle doesn't match 12 Flops/cycle (for 115 GF) and pretty much confirms that the FPU + VMX cannot concurrently sustain 2 threads (per core) to give you that peak number (115 GF)...

I've heard the FPU - the other 4 flops - cannot execute unless the VMX is executing a load/store or logical operation? I don't know if the missing 4 flops can be accounted for in other ways also, though.
 
Daughter Die operations

I think where we are getting mixed up here, people, is the shader operations which take place on the main die vs the daughter die. Approximately 216 programmable shader ops take place on the main die...the other 26 take place on the daughter die and have solely to do with post-rendering effects (not fully programmable).
 
Titanio said:
I've heard the FPU - the other 4 flops - cannot execute unless the VMX is executing a load/store or logical operation? I don't know if the missing 4 flops can be accounted for in other ways also, though.

The difference is that IBM/MS counted shuffling as "flops" for XeCPU, which makes it around 115 Gflops, and as you say the number in reality is around 75 Gflops peak.
 
Jaws said:
As mentioned to Dave earlier, it's derived in the text. And 216 is a non-obvious derivation. Of course 240 is reported elsewhere but the 'leak' still says 216..

The leaks are not particularly accurate in all places - for instance the overview states the shader array is 24G instructions/s, which is wrong since it's 48G instructions/s.

They also highlight no capability differences between the Vector and Scalar portions.
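
One way the 24G and 48G figures could both arise is sketched below. It rests on assumptions that are not stated in this thread: 48 shader ALUs running at 500 MHz, each able to issue one vector and one scalar instruction per clock. Under the same assumptions, the 216 and 240 figures would correspond to 9 and 10 flops per ALU per clock.

```python
ALUS = 48          # assumed shader ALU count (not given in the thread)
CLOCK_GHZ = 0.5    # assumed 500 MHz shader clock (not given in the thread)


def ginstr_per_s(issues_per_alu_per_clock):
    """Billions of shader instructions per second for the whole array."""
    return ALUS * CLOCK_GHZ * issues_per_alu_per_clock


def implied_flops_per_alu(gflops):
    """Flops per ALU per clock needed to reach a quoted GFLOPS figure."""
    return gflops / (ALUS * CLOCK_GHZ)


print(ginstr_per_s(1))             # counting one instruction per ALU per clock
print(ginstr_per_s(2))             # counting vector and scalar issue separately
print(implied_flops_per_alu(216))  # what the leak's 216 would imply
print(implied_flops_per_alu(240))  # what the 240 reported elsewhere would imply
```

If those assumptions hold, the overview's 24G looks like a count of one issue per ALU, while 48G counts the vector and scalar portions separately.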
 
Jaws said:
Those CPU and GPU numbers still do not agree with the technical leak doc...

E.g. the CPU ~ 90 Gflops @ 3.2 GHz. This would also likely apply to CELL's PPE...

The problem is we don't know what is actually correct, the "technical" leak doc or the various other MS/ATI documents. Nor do we know whether the technical leak doc is taking into account certain scheduling restrictions and, if so, whether the numbers quoted by IBM/Sony/Nvidia are taking into account similar scheduling restrictions.

For Xenon, I get 76.8 GFLOPS if the core can't dual issue VMX and FPU, 96 GFLOPS if it can (but it's kinda pointless really since you'll at least need to do some loads and stores). The 115 GFLOPS number seems to be counting 12 flops per cycle, and I still don't understand how they get that.

Likewise, I only get 204 GFLOPs for CELL.
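
Working backwards from the quoted totals shows what each one implies per core. This is only a sketch: it assumes 3 Xenon cores at 3.2 GHz as used elsewhere in the thread, treats 115 as 115.2, and borrows the 25.6 GFLOPS per-SPE figure that comes up later in the discussion. The helper names are hypothetical.

```python
CLOCK_GHZ = 3.2
XENON_CORES = 3


def flops_per_cycle_per_core(gflops, cores=XENON_CORES, clock_ghz=CLOCK_GHZ):
    """Flops each core must retire per cycle to reach a quoted peak."""
    return gflops / (cores * clock_ghz)


def implied_spe_count(gflops, per_spe_gflops=25.6):
    """How many SPEs a CELL total implies at the per-SPE figure quoted in the thread."""
    return gflops / per_spe_gflops


for quoted in (76.8, 96.0, 115.2):
    print(quoted, "GFLOPS ->", round(flops_per_cycle_per_core(quoted), 2), "flops/cycle/core")

print("204 GFLOPS ->", round(implied_spe_count(204), 2), "SPEs")
```

So the three Xenon figures correspond to 8, 10 and 12 flops per cycle per core, and the CELL figure is essentially eight SPEs at 25.6 each with nothing counted for the PPE.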

Aaron Spink
speaking for myself inc.
 
ROG27 said:
Do you think the president of EA would claim that PS3 has a bit more under the hood than 360 if he didn't know...or have far more knowledge than we have?

The president of EA isn't an engineer or a programmer. He stamps checks and makes decisions. He's one of the last people I'd go to for de facto PS3 information. That's like Trip Hawkins calling the PS2 "the next printing press". He was full of shit. Then his company went under.

You believe EA's President is right because you WANT to. Hell, Bill Gates said that Halo 3 would launch alongside the PS3. A lot of Halo fanbois ate that up like Bill Gates had a say in when Halo 3 would come out. Later, MS as a company said that it would come out when it's done.
 
aaronspink said:
For Xenon, I get 76.8 GFLOPS if the core can't dual issue VMX and FPU, 96 GFLOPS if it can (but it's kinda pointless really since you'll at least need to do some loads and stores).

Hmm. I'm kind of confused. I thought the FPU could only execute in parallel when the VMX unit was executing load/store or logical ops? Meaning that when the VMX unit was doing those ops, it could in fact execute, otherwise not? Maybe I'm reading your post wrong, but it seems to suggest the opposite.
 
Alpha_Spartan said:
The president of EA isn't an engineer or a programmer. He stamps checks and makes decisions. He's one of the last people I'd go to for de facto PS3 information. That's like Trip Hawkins calling the PS2 "the next printing press". He was full of shit. Then his company went under.

You believe EA's President is right because you WANT to. Hell, Bill Gates said that Halo 3 would launch alongside the PS3. A lot of Halo fanbois ate that up like Bill Gates had a say in when Halo 3 would come out. Later, MS as a company said that it would come out when it's done.

You still point to a particular instance to validate your point, but you overlook the general consensus amongst developers that the PS3 "has a bit more under the hood" and that that bit more is just that...a slight technical advantage...nothing majorly significant.

The truth is we are grasping at straws here until the general public has more concrete info. The "privileged" folk seem to have alluded to what I have just mentioned, though.

Getting back to more relevant technical discussion about the XGPU...

"I think where we are getting mixed up here, people, is the shader operations which take place on the main die vs the daughter die. Approximately 216 programmable shader ops take place on the main die...the other 26 take place on the daughter die and have solely to do with post-rendering effects (not fully programmable)."
 
  • Branch pipeline: handles all conditional and unconditional branches, including decrementing the count register. It also handles condition code manipulation instructions such as crand and mfcr. Updates to the condition register (CR) and count register (CTR) are available after stage one and stage two, respectively.
  • Integer pipeline: handles integer add, subtract, multiply, divide, shift, xor, cntlzw, and so on. It also handles mtcrf. The integer multiply and divide instructions aren’t pipelined, which is discussed later.
  • Address generation and int load/store pipeline: all loads and stores go through this pipeline in order to generate the data address. Floating-point and vector loads and stores are sent both through the address generation and int load/store pipeline and through the vector/scalar load or store pipeline.

  • Vector/scalar load pipeline: handles all vector and scalar loads.
  • Vector/scalar store pipeline: handles all vector and scalar stores.
  • Vector permute pipeline: handles vector permutes, merges, splats, vshifts, and the vector pack and unpack instructions.

  • Vector simple pipeline: handles all vector integer operations such as vaddsbs, vsel, and also handles vector floating-point compares such as vcmpeqfp.
  • Scalar float pipeline: handles all scalar floating-point operations, single and double precision, including fmadd, fcmpo, fdiv, and so on. The fdiv and fsqrt instructions aren’t pipelined, which is discussed later.
  • Vector float pipeline: handles vector math instructions such as vmaddfp, vrefp, and so on. It also handles instructions that convert to and from vector floating point, such as vcuxwfp and vrfim. The estimate instructions, such as vrefp, use two extra stages and therefore have fourteen-cycle latency.
  • Dot product pipeline: used only for the dot product instructions, vmsum3fp128 and vmsum4fp128.
Jawed
 
Jaws said:
E.g. the CPU ~ 90 Gflops @ 3.2 GHz. This would also likely apply to CELL's PPE...

Are you saying that a single-threaded SPE can run 25.6 Gflops and a dual-threaded complete PPE core (with VMX) can run only 30 Gflops?
 
Why so surprised? One is mainly there to crunch numbers and the other mainly for general purpose code, so it really shouldn't be surprising that they shine on things they are made for.

Fredi
 
Lysander said:
Are you saying that a single-threaded SPE can run 25.6 Gflops and a dual-threaded complete PPE core (with VMX) can run only 30 Gflops?

A single thread on one PPE core could "run" 25.6Gflops, using VMX. As far as the paper max is concerned, the multiplicity of threads doesn't really mean anything - one thread could use that power, if you designed it to do so.
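
A toy model of that point, under a deliberately simple assumption: the threads on a core share the same execution units, so whatever they ask for in total is capped at the core's paper peak. The function name and the demand figures are made up for illustration.

```python
def core_throughput(peak_gflops, thread_demands):
    """Achieved throughput when threads share one core's execution units.

    Toy assumption: demands simply add up until they hit the paper peak.
    """
    return min(peak_gflops, sum(thread_demands))


PPE_VMX_PEAK = 25.6  # per-core figure quoted in this post

print(core_throughput(PPE_VMX_PEAK, [25.6]))        # one thread saturating VMX
print(core_throughput(PPE_VMX_PEAK, [25.6, 25.6]))  # a second thread adds nothing
print(core_throughput(PPE_VMX_PEAK, [10.0, 10.0]))  # two stalling threads fill gaps
```

In this model a second thread only helps when the first one leaves the units idle; it never raises the ceiling.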
 
Titanio said:
A single thread on one PPE core could "run" 25.6Gflops, using VMX. As far as the paper max is concerned, the multiplicity of threads doesn't really mean anything - one thread could use that power, if you designed it to do so.

I think it's good that they got this power without having to resort to multi-threading, one aspect of the SPEs that often goes unnoticed.
 
Titanio said:
A single thread on one PPE core could "run" 25.6Gflops, using VMX. As far as the paper max is concerned, the multiplicity of threads doesn't really mean anything - one thread could use that power, if you designed it to do so.

And the other thread will lie idle at that time?
Also, what is so special about the SPE structure? It looks like a PPE unit without the complete integer and floating point units. The SPE's Gflops (25) are single precision floating point, while the 360 CPU core works only on DPFP.
 
Lysander said:
And the other thread will lie idle at that time?

Sure. The reason the threads are there is to increase utilisation of the core, but if you're maxing it out with one thread as it is, getting good utilisation, a second thread wouldn't get a look in.

Lysander said:
Also, what is so special about the SPE structure? It looks like a PPE unit without the complete integer and floating point units. The SPE's Gflops (25) are single precision floating point, while the 360 CPU core works only on DPFP.

Not sure where you get your info, but Xenon's given FP figures are for single precision. Games don't really need DP.

overclocked said:
I think it's good that they got this power without having to resort to multi-threading, one aspect of the SPEs that often goes unnoticed.

I think you mean parallelism. And I think you'll find that Xenon is a parallel architecture also - without parallelism on a core level, it'd be sitting there with ~25.6-30 Gflops (assuming everything else about it stayed the same). Parallelism is being generally adopted as the route to more power now, that's just the way it is.

edit - actually, I'm not really sure what you're referring to now. SPEs run one thread at a time. In that sense, it compels you to ensure that your single thread is using as much power as possible and does not block. The SPEs, on their own, are single threaded (potential software solutions for multi-threading on one SPE aside).
 
Titanio said:
Not sure where you get your info, but Xenon's given FP figures are for single precision. Games don't really need DP.

From Jeffrey Brown's IBM article on the 360 CPU:
Each stage in the FP/VMX is also 11 FO4. As a result the pipelines are quite deep and result in significant delay for instruction completion. Scalar double-precision floating point operations have 10-cycle latency. VMX operations have four or 14-cycle latency, depending on the operation.
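
To put those cycle counts in wall-clock terms, here is a small sketch. It assumes the 3.2 GHz clock used elsewhere in this thread; the article excerpt itself does not state a clock speed, and the helper name is made up.

```python
def cycles_to_ns(cycles, clock_ghz=3.2):
    """Convert a latency in cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


latencies = (
    ("scalar DP op", 10),
    ("VMX op (short)", 4),
    ("VMX op (long)", 14),
)
for label, cycles in latencies:
    print(label, cycles, "cycles =", round(cycles_to_ns(cycles), 3), "ns")
```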
 