PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

When the GPU has exhausted/used all its resources, it's impossible to get more out of it, whether it's graphics or compute.
That said, I think Averagejoe's question is: if I can max out graphics using X resources, can I use the remaining resources for compute?

I would say yes, but the point is that to max out graphics you are unlikely to leave parts of the GPU unused, and no doubt compute on the GPU is something you plan for.

P.S.
I apologize if the terminology is imprecise, but maybe my "simplified" version is more understandable.
 
Wow, that chart is very hard to misinterpret; they really are talking about running the maximum graphics the GPU can do while running compute jobs at the same time...

Is this really, really possible?

The chart is very hard to misunderstand, but you managed it anyway :D

Graphics and compute are the same thing, with the first being simply a subset of the latter.
If you use an ALU for lighting a pixel, you can't use the same one for calculating physics or audio decompression. It's as simple as that.

The GPU was modified to achieve better utilization, and to do so more quickly, by removing idle time. You don't get higher throughput than the theoretical limits; you simply get better utilization of the resources you have. In some abstract way it's similar to a hyperthreading model, but done at a higher logical level, given the intrinsic multithreading capability of a GPU. You have 64 job queues, and each job can be made up of multiple threads.
 
And for the billionth time, it's not like current-gen high-end GCN discrete GPUs can't do this; they only have 16 queues instead of 64 (or something; it's not totally clear, but that's probably the case). It's not the difference between 0 and 1. 64 is a really big number. It remains to be seen whether this isn't overkill.
 
That said, I think Averagejoe's question is: if I can max out graphics using X resources, can I use the remaining resources for compute?

I would say yes
I would say no, because if you're maxed out then you're maxed out. You can't exceed the maximum capacity of the GPU no matter how you twist and turn.

If the GPU is stalling on a bunch of graphics threads, waiting for texture data from main memory perhaps, it's possible that there could be other threads, perhaps compute threads, ready and waiting that could run instead on those execution units (this is a theoretical example; real-world situations would vary depending on the software being run), thus helping to max out the hardware, but again - if you're maxed out you're maxed out and that's a hard limit. You can't go past max. :p
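A toy sketch of that idea (all numbers made up, purely for illustration; it only shows how you close the gap to peak utilization, never exceed it):

Code:
# Toy model: execution slots that graphics wavefronts leave idle while waiting
# on memory can be filled by independent, ready-to-run compute wavefronts.
# All numbers here are invented for illustration.
CYCLES = 1000
STALL_RATE = 0.3   # assumed fraction of cycles the graphics work spends waiting

graphics_busy  = int(CYCLES * (1 - STALL_RATE))  # cycles actually doing graphics
idle           = CYCLES - graphics_busy          # cycles the ALUs would sit idle
compute_filled = idle                            # assume enough compute to cover every gap

print(f"utilization, graphics only : {graphics_busy / CYCLES:.0%}")
print(f"utilization, with compute  : {(graphics_busy + compute_filled) / CYCLES:.0%}")
# The peak never moves; you only get closer to it.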
 
Maybe the above diagram has nothing to do with pipelines, but is describing the effect of onion+?
e.g. "onion+ allows compute tasks to run without interfering with buffers in use by the GFX pipeline"?
 
Wow, that chart is very hard to misinterpret; they really are talking about running the maximum graphics the GPU can do while running compute jobs at the same time...

Is this really, really possible?

"Balanced... with enough room for compute tasks" sounds like they have more CUs than needed. The graphics compute tasks may not need to consume all CUs.
 
Love_In_Rio said:
Yeah, by the way, the graphics side is still undecided between 16 or 32 ROPs.
Well, the picture shows 4 memory controllers, so would 16 ROPs (4x4) be plausible?

If the GPU is stalling on a bunch of graphics threads, waiting for texture data from main memory perhaps, it's possible that there could be other threads, perhaps compute threads, ready and waiting that could run instead on those execution units

I think that would boost efficiency greatly; if one graphics thread stalls, there may well be more graphics threads stalling too, because they are working on the same texture and triangle sets. Having some compute threads that stall less, because they are compute-centric, to fill those gaps is a good thing, I guess.
 
Wait, now that simultaneous instruction issue for graphics and compute makes sense. Hmm...

I'm not sure how you draw that conclusion. I think that whole situation has been hashed out enough.

I suspect the GPU will have 32 ROPs, as that is in line with what AMD offers in its 18 CU parts. VGLeaks and Digital Foundry seem to be convinced that this is the case. Those charts are interesting, but they seem to include plenty of guesswork alongside the factual data they gathered.
 
The chart is very hard to misunderstand, but you managed it anyway :D

Graphics and compute are the same thing, with the first being simply a subset of the latter.
If you use an ALU for lighting a pixel, you can't use the same one for calculating physics or audio decompression. It's as simple as that.

The GPU was modified to achieve better utilization, and to do so more quickly, by removing idle time. You don't get higher throughput than the theoretical limits; you simply get better utilization of the resources you have. In some abstract way it's similar to a hyperthreading model, but done at a higher logical level, given the intrinsic multithreading capability of a GPU. You have 64 job queues, and each job can be made up of multiple threads.


Oh, I know that part...

The second part of your post was something I actually theorized here too; I wasn't implying that you will get more than the 1.84 TF.

My theory, not in that post you quoted by the way, was about whether Sony could use the GPU to the max, when on PC that usually (if not always) isn't achievable due to the wide range of different GPU configurations.

In fact my theory was: if the 78XX only gets to use 70% of its power due to inefficiency and other constraints (as an example), and Sony manages to fine-tune the hardware so that the PS4 GPU is able to achieve 95% of its power, couldn't that 25% be used for compute and still leave the same resources for graphics as the 78XX PC counterpart?
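To put rough numbers on that (the 70% and 95% figures are just the example above, not measurements):

Code:
# Sketch of the utilization argument, using the example percentages from the post.
PEAK_TFLOPS = 1.84         # PS4 GPU peak

pc_utilization  = 0.70     # hypothetical 78XX utilization on PC
ps4_utilization = 0.95     # hypothetical fine-tuned PS4 utilization

pc_effective  = PEAK_TFLOPS * pc_utilization
ps4_effective = PEAK_TFLOPS * ps4_utilization

print(f"PC-like graphics throughput : {pc_effective:.2f} TFLOPS")
print(f"PS4 achievable throughput   : {ps4_effective:.2f} TFLOPS")
print(f"headroom usable for compute : {ps4_effective - pc_effective:.2f} TFLOPS")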
 
Oh, I know that part...

The second part of your post was something I actually theorized here too; I wasn't implying that you will get more than the 1.84 TF.

My theory, not in that post you quoted by the way, was about whether Sony could use the GPU to the max, when on PC that usually (if not always) isn't achievable due to the wide range of different GPU configurations.

In fact my theory was: if the 78XX only gets to use 70% of its power due to inefficiency and other constraints (as an example), and Sony manages to fine-tune the hardware so that the PS4 GPU is able to achieve 95% of its power, couldn't that 25% be used for compute and still leave the same resources for graphics as the 78XX PC counterpart?

That would be possible if the GPU had context switching, but that doesn't seem to be the case, as it is something AMD is cooking up for GCN 3(?). As of today you must have a CU running graphics threads or compute threads, but not both. So if you have a CU at 70% efficiency on rendering threads, you can't run compute threads to fill the stalls. What Sony is looking for with the 64 compute queues and onion+, as far as I understand, is to keep the compute-only CUs busy with many threads, reduce latency in data feeding (that's the onion+ part), and so get that 70% efficiency on the compute-running CUs as well.
 
There's little sense in multithreading the internals of a CU from what I can see. A CU is a bunch of ALUs that work on a batch of instructions in a wavefront. You send it a snippet of, say, five instructions that it performs across all its ALUs, then select another snippet of instructions from whatever outstanding work orders there are. You can't really fit a few instructions from one program in with a handful of instructions of the wavefront being processed. 18 CUs gives fine enough granularity. If AMD really wanted, they could break a CU up into smaller chunks, say 16 ALUs each instead of 64, and have four times as many, with, I presume, increased cost in scheduling hardware in exchange for finer granularity.

With 18 CUs as they are, it should be possible for the devs to supply enough compute work to occupy 4 CUs 100% of the time with the rest processing graphics simultaneously, which is the fancy new feature being extolled in interviews. Of course, the PS4 reveal suggested no such hard limit, so the devs could alternate between 18 CUs computing and 0 doing graphics, then switch, or run them half and half, or whatever. There'll be some jobs you'll want done up front, in the same way unified shaders could be turned to 100% vertex work and then 100% pixel work. I can see a reason for 100% compute on physics to calculate game state and object positions prior to 100% graphics work for the rest of the frame.
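As a rough sense of scale for that 4/14 split, assuming the standard GCN figures of 64 ALUs per CU, 2 FLOPs per ALU per clock (FMA) and the rumoured 800 MHz clock:

Code:
# Back-of-the-envelope FLOPS for the CU split discussed above.
CLOCK_GHZ     = 0.8    # rumoured PS4 GPU clock
ALUS_PER_CU   = 64     # standard GCN CU
FLOPS_PER_CLK = 2      # fused multiply-add

def tflops(cus):
    return cus * ALUS_PER_CU * FLOPS_PER_CLK * CLOCK_GHZ / 1000.0

print(f"18 CUs total    : {tflops(18):.2f} TFLOPS")   # ~1.84
print(f" 4 CUs compute  : {tflops(4):.2f} TFLOPS")    # ~0.41
print(f"14 CUs graphics : {tflops(14):.2f} TFLOPS")   # ~1.43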
 
Vgleaks says 8 render back-ends, which would be 32 ROPs.

Possible; I don't keep up with this stuff :)

Anyways, 800 MHz * 32 ROPs * (4 B color + 4 B depth) = 204.8 GB/s, while the reported rate is only 176 GB/s, right? So in that case it could have 32 ROPs that just write or read color without Z, and therefore wouldn't feature free alpha, depth checks etc. That only requires half the bandwidth and leaves some room for texture reads; correct me if I'm wrong.
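Spelling out that arithmetic (same assumptions: every ROP writing 4 B of color and 4 B of depth every clock, no compression or caching):

Code:
# Worst-case ROP bandwidth demand vs. the reported 176 GB/s.
CLOCK_HZ        = 800e6
ROPS            = 32
BYTES_PER_PIXEL = 4 + 4       # color + depth, no blending/MSAA

demand_gbs   = CLOCK_HZ * ROPS * BYTES_PER_PIXEL / 1e9
reported_gbs = 176.0

print(f"peak ROP demand : {demand_gbs:.1f} GB/s")     # 204.8
print(f"reported        : {reported_gbs:.1f} GB/s")
print(f"shortfall       : {demand_gbs - reported_gbs:.1f} GB/s")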
 
There's little sense in multithreading the internals of a CU from what I can see. A CU is a bunch of ALUs that work on a batch of instructions in a wavefront. You send it a snippet of, say, five instructions that it performs across all its ALUs, then select another snippet of instructions from whatever outstanding work orders there are. You can't really fit a few instructions from one program in with a handful of instructions of the wavefront being processed. 18 CUs gives fine enough granularity. If AMD really wanted, they could break a CU up into smaller chunks, say 16 ALUs each instead of 64, and have four times as many, with, I presume, increased cost in scheduling hardware in exchange for finer granularity.

With 18 CUs as they are, it should be possible for the devs to supply enough compute work to occupy 4 CUs 100% of the time with the rest processing graphics simultaneously, which is the fancy new feature being extolled in interviews. Of course, the PS4 reveal suggested no such hard limit, so the devs could alternate between 18 CUs computing and 0 doing graphics, then switch, or run them half and half, or whatever. There'll be some jobs you'll want done up front, in the same way unified shaders could be turned to 100% vertex work and then 100% pixel work. I can see a reason for 100% compute on physics to calculate game state and object positions prior to 100% graphics work for the rest of the frame.

Yes, GCN has more granularity.

Each CU has 4 SIMDs, each with its own program counter and instruction buffer for 10 wavefronts. So each CU can handle 40 wavefronts at any one time.

Each wavefront can be from a different kernel or workgroup (e.g., some random graphics or physics jobs). The CU just doesn't care. The graphics, physics and AI wavefronts all get scheduled and run together, 40 per CU when fully loaded; wavefront after wavefront.

Sony may have added facilities to prioritize/schedule these wavefronts differently. I think in SPURS the tasks (or rather the SPURS kernels) know their deadlines. Ideally, the CUs should be made to "obey" these real-time schedules.
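Just counting the slots those figures imply (18 CUs and 64 work-items per wavefront assumed):

Code:
# Wavefront capacity implied by the GCN figures above.
CUS                 = 18
SIMDS_PER_CU        = 4
WAVEFRONTS_PER_SIMD = 10
WORKITEMS_PER_WAVE  = 64

wf_per_cu = SIMDS_PER_CU * WAVEFRONTS_PER_SIMD       # 40
wf_total  = CUS * wf_per_cu                          # 720
items     = wf_total * WORKITEMS_PER_WAVE            # 46080

print(f"wavefronts in flight per CU : {wf_per_cu}")
print(f"wavefronts in flight, chip  : {wf_total}")
print(f"work-items in flight, chip  : {items}")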
 
Possible; I don't keep up with this stuff :)

Anyways, 800 MHz * 32 ROPs * (4 B color + 4 B depth) = 204.8 GB/s, while the reported rate is only 176 GB/s, right? So in that case it could have 32 ROPs that just write or read color without Z, and therefore wouldn't feature free alpha, depth checks etc. That only requires half the bandwidth and leaves some room for texture reads; correct me if I'm wrong.

It does have more bandwidth than the reference HD 7870 card. <<...>> A 7790 has the same GFLOPS rating as a 7850, the same GTexels/s (even a higher triangle setup rate), but is way behind in 1080p benchmarks; that's because of the ROPs (and bandwidth).
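A quick fill-rate comparison to illustrate (reference-card figures from memory, so treat the exact numbers as approximate):

Code:
# Why the 7790 trails the 7850 at 1080p despite similar GFLOPS:
# pixel fill rate and bandwidth, not ALU throughput.
cards = {
    #           ROPs, clock (GHz), bandwidth (GB/s) - approximate reference specs
    "HD 7790": (16, 1.00,  96.0),
    "HD 7850": (32, 0.86, 153.6),
}

for name, (rops, clock_ghz, bw_gbs) in cards.items():
    fill_gpix = rops * clock_ghz
    print(f"{name}: {fill_gpix:.1f} Gpix/s fill rate, {bw_gbs:.1f} GB/s bandwidth")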
 