PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

dobwal · Jan 29, 2013

What's the chance that 14 + 4 refers to a 4 CU APU paired with a 14 CU discrete gpu. Isn't the highest performance A10 a quad CPU paired with a 7760D a non GCN part at 246 mm. How much silicon is a 8 core Jaguar apu with 18 CUs going to eat up?

Couldn't balance refer how a developer chooses to use the 4 CUs in the apu either for computing or additional horsepower for graphics?

Brad Grenz · Jan 29, 2013

Concerning the 4 CU compute modules: My guess is that the APU isn't a full HSA design. Instead they've taken an 8 Jaguar core CPU with a 4 CU GPU that does have coherent memory access and most of the HSA benefits for GPGPU and compute, and basically put another 14 CU "discreet" GPU on the same die connected via an on chip PCI-E or Hypertransport link.

Some of the early spec leaks we had talked about the system having 192GBps of total memory bandwidth, but 12GBps of that was coherent memory to the CPU. That may point to a physical partition in the way memory is accessed, even if it is still a UMA with dynamically allocated memory space.

It could also explain a lot of the rumors about it being an APU plus GPU design and explain why dev kits were rumored to A10 APUs plus 7770 level GPUs which would make a lot of sense. It would also explain why using the 4 CU compute modules would only offer a "slight" benefit to graphics if it required an inefficient cross-fire technique, or was relegated to post processing or culling or the like.

In that case it's not that the 4 CUs are specifically modified for better compute performance, but rather that the other 14 CUs would be inefficient by comparison for anything that requires interaction with the main simulation running on the CPU. For example, you can run a post processing shader on the 14 CU GPU no problem, but you'd want to keep your physics on the 4 CU compute module.

Jugix · Jan 29, 2013

gongo said:
I can't be the only one disappointed with PS4 specs? What could be the reasons that Sony is low balling their new hardware? Coming 4 years after PS3, one can forgive...but it will be 8 years after PS3....and the new PS4 specs revealed are weaksauce.

Naturally one could be disappointed if he expected PS4 to have GTX680 SLI and 16GB of fast RAM. As mentioned by scott_arm, 4CUs alone have more than double the computing power than Cell had single-precision (SP) and close to ten times in double precision (DP) (if that even matters) [1,2].

Cell BE 1xSPE
- SP: 25.6 GFLOPs
- DP: 1.83 GFLOPs

AMD Tahiti:
- DP: 1/4 SP computing power [3].

[1] http://www.ibm.com/developerworks/power/library/pa-cellperf/
[2] http://www.amd.com/us/products/desktop/graphics/7000/7850/Pages/radeon-7850.aspx#3
[3] http://www.amd.com/us/products/desktop/graphics/7000/7970ghz/Pages/radeon-7970GHz.aspx

Hornet · Jan 29, 2013

APUs do not currently support preemption and priority scheduling, causing small, latency-sensitive compute tasks being delayed by graphic tasks. Until these features are available, my understanding is that it make sense to have two separate sets of Compute Units.

tunafish · Jan 29, 2013

The extra simd speculation for CUs doesn't make sense.

The idea behind the GCN CUs is that everything is 8-cycle barrel processed, and the operations start on a staggered clock. Work comes in as 64-element jobs, executed over 4 cycles on a 16-wide SIMD pipe. This allows the scheduler be pipelined and shared between the SIMD pipes efficiently.

In layman's terms, each cycle the scheduler has one SIMD pipe that it needs to assign work to, and will make that SIMD pipe happy for 4 cycles. The next cycle it will work on the second SIMD pipe. Because of this, you need only one scheduler, and it can be made quite slow, because it has 4 cycles to prepare until it has to touch the same SIMD pipe again, and 8 cycles until it has to touch the same thread again.

Adding another SIMD pipe in a CU would be a major architectural change that would require the redesign of basically everything. It would make much more sense just to have more CUs.

Personally, I think the 14+4 split is about the frontend. I agree with 3dilettante that having 2 different kinds of CUs would be horrendous from manufacturing point of view. Instead, the chip will likely have 20 CUs, 2 disabled for yields, and 18 in a single pool. Then, instead of having a single graphics command processor drive all the CUs, there will be two of them, one specialized for low-latency communication with the CPU, capable of giving work for up to 4 CUs.

For GPGPU, at least personally I think that lower latency between job issuing and completion matters a lot more than any extra twiddly bits in the CUs themselves. Some new kind of command processor that lets me finetune what jobs I have running on CUs would be pretty much ideal.

DuckThor Evil · Jan 29, 2013

bgassassin said:
I would say we're looking at these two scenarios as most likely.

1. 4 CUs have an extra (scalar or modified) ALU and these CUs are dedicated to compute, but can also be for rendering. "Minor boost" would be incorrect IMO since they would make up 22% of the ~1.8 TFLOPs. This would mean Sweetvar's info would relate to something else.

2. 4 CUS each have an extra SIMD and these SIMDs are dedicated to compute, but these SIMDs could also be used for rendering. "Extra ALU" would be wrong in this case.

3. I think you guys are on crack

I don't think there is anything extra in these 4 CUs. The CUs themselves are an "extra" ALUs for compute, thus bringing in an extra 410GF for compute. The minor boost for rendering is because the CPU or something else in the system can't really feed more than 14CUs, that's why the system is balanced at 14CUs. Not too dissimilar to how the 192 extra cuda cores only bring relatively minor rendering boost in GTX 670 vs 680 situation, when the core frequency is normalized. At that point you get much better utilization of these CUs by using them on non rendering tasks.

Hecatoncheires · Jan 29, 2013

According to these rumors, Orbis is basically an APU + GPU design, but integrated into a single SoC instead of a MCM or SiP, which allows for minimum latency and maximum bandwith at the same time.

On the APU side we have the eight Jaguars combined with 256 GCN shaders, together burning an incredible amount of 512 GFLOPS! Just for the record: An Intel Core i7 4770k desktop CPU (only) delivers 448 GFLOPS. Imagine what this could mean for AI, animations, physics, etc. On the GPU side we have a processor that is basically a small HD7850 Pitcairn. In my eyes this GPU is strong enough to deliver enjoyable graphics for a next gen gaming console, especially when keeping the perverse TDP of modern high end cards in mind. We're most likely talking about 3rd gen HSA for Orbis, which means a unified adress space for CPU and GPU(s), pageable system memory for the GPU(s) with CPU pointers, and a fully coherent memory between CPU and GPU(s). Simply put, no copy work between the CPU and the two iGPs. This will be a hell of a speedup compared to a modularly designed PC (take a look at the superb 28nm Temash SoC rendering Dirt: Showdown at 1920x1080 with 5W!!!)

The reasons for going with APU + GPU are better programmability and lower latencies. You can look it up in this slide from the 2012 Fusion Developer Summit. You don't want your GPU to be saturated because of both GPGPU algorithms and graphics tasks. A single (AMD) GPU can't handle either task at the same time, which would end in a lot of headache for the programmer. AMD names two solutions for this problem: You can wait for the 2014 20nm feature Graphics Pre-Emption, or you just use an APU dedicated to computing together with a second GPU dedicated to graphics rendering. Sony is doing the latter, obviously.

It seems as if the Orbis rumors are getting more and more specific. The difference between this leak (8 x Jaguar/ 256 GCN SPs APU + 896 GCN SPs GPU) and the last leak (8x Jaguar CPU + 1152 GCN SPs GPU) is a much better balance between the computing power and the graphics power. Eight Jaguars for a Pitcairn seemed a bit underpowered, anyway. But one thing is missing in this leak: There is no dedicated DRM hardware at all, neither ARM nor SPE. It would really surprise me if they are launching without a proper in-hardware DRM.

Arwin · Jan 29, 2013

It makes a tonne of sense, and I'm pretty sure both Orbis and Durango will have some CU dedicated to APU functions. I also like the suggestion made earlier that the CUs are on a different bus / part of the pipeline, so they can be used for lower-latency tasks much closer to the CPU, so that they can be used much more like the SPEs were for instance in PS3 and can work in close co-operation with the CPU cores.

Still really interested to see the memory pipeline though. If the system only has GDDR5, how is everything connected to that? Are there parts of memory that the CPU and these CUs can work on together that does not affect the bandwidth of the GPU, do they each have their own bus to that memory at half the maximum speed, what?

Love_In_Rio · Jan 29, 2013

bgassassin said:
I would say we're looking at these two scenarios as most likely.

1. 4 CUs have an extra (scalar or modified) ALU and these CUs are dedicated to compute, but can also be for rendering. "Minor boost" would be incorrect IMO since they would make up 22% of the ~1.8 TFLOPs. This would mean Sweetvar's info would relate to something else.

2. 4 CUS each have an extra SIMD and these SIMDs are dedicated to compute, but these SIMDs could also be used for rendering. "Extra ALU" would be wrong in this case.

Forget the 2. Extra SIMD would increase Tflops and however the 1,843 Tflop amount is obtained with the "standard" amount of SIMDS in 18 CUs.
Or is a extra escalar -VGleaks says "extra" ALU so i vote for this- or is like Tunafish says, or both.

ultragpu · Jan 29, 2013

Oh boy if only ERP wouild share his input regarding the 14+4 CUs

, anyway looks like PS4's spec just got more flexible and slightly more powerful than before from my understanding. I have a feeling it's gonna do well with that Luminous Engine's hair and cloth physics.

Titanio · Jan 29, 2013

Seems there's two competing theories about what '14+4' means

Let's say the 4 CUs are independent of the main rendering pipeline, of the 14 other CUs. Let's say they are maybe more closely connected to the CPU in some way. What would the relationship be between those CUs and the render backends? Would they write output through those still, or would they write out results in some other way? Are we talking about physically separate groups of CUs on the APU, and the group of 4 aligned with the CPU have one or two render backends of their own, or do they output in some other way?

Love_In_Rio · Jan 29, 2013

Titanio said:
Seems there's two competing theories about what '14+4' means

Let's say the 4 CUs are independent of the main rendering pipeline, of the 14 other CUs. Let's say they are maybe more closely connected to the CPU in some way. What would the relationship be between those CUs and the render backends? Would they write output through those still, or would they write out results in some other way? Are we talking about physically separate groups of CUs on the APU, and the group of 4 aligned with the CPU have one or two render backends of their own, or do they output in some other way?

Well, they still have a texture unit attached, so they will send data to the backends...

Xenio · Jan 29, 2013

Love_In_Rio said:
Well, they still have a texture unit attached, so they will send data to the backends...

but if the 4 CU are splitted and displaced away from the 14, used for GPGPU, which data will send?

Pugger · Jan 29, 2013

ultragpu said:
Oh boy if only ERP wouild share his input regarding the 14+4 CUs , anyway looks like PS4's spec just got more flexible and slightly more powerful than before from my understanding. I have a feeling it's gonna do well with that Luminous Engine's hair and cloth physics.

I think he's more comfortable speculating about Durango's setup and performance as he has no vested interest. I have a sneaky feeling he won't be able to say much about orbis

Love_In_Rio · Jan 29, 2013

Xenio said:
but if the 4 CU are splitted and displaced away from the 14, used for GPGPU, which data will send?

I would say they are not displaced. They will put those 2 DMAs in GCN architecture in good use at last.

mboeller · Jan 29, 2013

I think the 14+4 system is strange.

Why didn't they use 16 CU's at 900MHz with a 4 x (4CU) System and therefore be able to run 4 independent programs (rendering and/or GPGPU) instead only 2. Would have been the overhead too much (4 displacement mapping "engines" with 3600 Mio Polygons/sec would have been nice too IMHO)?

V3 · Jan 29, 2013

Maybe this thing uses logic on logic, so 8 cores + 4CUs stacked with 14CUs stacked with 4GB.

This thing seems a little bit big for SoC that's going into a console. Pitcairn is like what ? 212mm^2 ? How big is 8 cores Jaguar + 4MB L2 cache ? It seems like its going to be bigger than Cell or Emotion Engine during launch.

Maybe the fact that's it's only going to be a single chip, they're willing to take in the bigger die.

What are the chances that they use two of this SoC pair with 4GB of RAM each ? Can this Liverpool SoC be XFire even ?

Sony normally uses two big dies, now they're only using one. That's like half of what they normally would do. No new optical technology this round either to waste the budget on. PS2 had DVD and PS3 had Bluray that take significant amount of the consoles budget. This seems to be just faster speed Bluray.

Maybe both Sony and MS are really aiming for shorter console cycle for next gen.

Also how is Jaguar cores ? It's pretty slow at 1.6GHz. How much slower is it compare to something like Piledriver ?

fellix · Jan 29, 2013

V3 said:
Also how is Jaguar cores ? It's pretty slow at 1.6GHz. How much slower is it compare to something like Piledriver ?

Both have a narrow 2-way integer pipeline, but Jaguar's pipeline is probably much shorter and has potentially faster AVX throughput/latency, but lacks FMA extensions.

Scott_Arm · Jan 29, 2013

Dr Evil said:
3. I think you guys are on crack I don't think there is anything extra in these 4 CUs. The CUs themselves are an "extra" ALUs for compute, thus bringing in an extra 410GF for compute. The minor boost for rendering is because the CPU or something else in the system can't really feed more than 14CUs, that's why the system is balanced at 14CUs. Not too dissimilar to how the 192 extra cuda cores only bring relatively minor rendering boost in GTX 670 vs 680 situation, when the core frequency is normalized. At that point you get much better utilization of these CUs by using them on non rendering tasks.

Could be. That leak is very poorly written.

Octane · Jan 29, 2013

Dr Evil said:
3. I think you guys are on crack I don't think there is anything extra in these 4 CUs. The CUs themselves are an "extra" ALUs for compute, thus bringing in an extra 410GF for compute. The minor boost for rendering is because the CPU or something else in the system can't really feed more than 14CUs, that's why the system is balanced at 14CUs. Not too dissimilar to how the 192 extra cuda cores only bring relatively minor rendering boost in GTX 670 vs 680 situation, when the core frequency is normalized. At that point you get much better utilization of these CUs by using them on non rendering tasks.

If the CPU can't feed more than 14CU's wouldn't it become a bottleneck for games ? (Game logic, amount of players on screen )

PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

dobwal

Brad Grenz

Philosopher & Poet

Jugix

Hornet

tunafish

DuckThor Evil

Hecatoncheires

Arwin

Now Officially a Top 10 Poster

Love_In_Rio

ultragpu

Titanio

Love_In_Rio

Xenio

Pugger

Love_In_Rio

mboeller

V3

fellix

Scott_Arm

Octane

Similar threads