2 The Motivation
We’ve seen that the asynchronous model is in some ways simpler than the threaded one because there is
a single instruction stream and tasks explicitly relinquish control instead of being suspended arbitrarily.
But the asynchronous model clearly introduces its own complexities. The programmer must organize each
task as a sequence of smaller steps that execute intermittently. If one task uses the output of another, the
dependent task must be written to accept its input as a series of bits and pieces instead of all together.
Since there is no actual parallelism, it appears from our diagrams that an asynchronous program will
take just as long to execute as a synchronous one. But there is a condition under which an asynchronous
system can outperform a synchronous one, sometimes dramatically so. This condition holds when tasks are
forced to wait, or block, as illustrated in Figure 4:
Figure 4: Blocking in a synchronous program
In the figure, the gray sections represent periods of time when a particular task is waiting (blocking) and
thus cannot make any progress. Why would a task be blocked? A frequent reason is that it is waiting to
perform I/O, to transfer data to or from an external device. A typical CPU can handle data transfer rates that
are orders of magnitude faster than a disk or a network link is capable of sustaining. Thus, a synchronous
program that is doing lots of I/O will spend much of its time blocked while a disk or network catches up.
Such a synchronous program is also called a blocking program for that reason.
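To make the idea of blocking concrete, here is a minimal Python sketch (not part of the original text) of a blocking network read; the request and target host are only illustrative. The recv() call simply does not return until data arrives, and the whole program waits along with it.

    import socket

    # Open a TCP connection to an illustrative web server.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")

    # recv() blocks: it does not return until data arrives (or the
    # connection closes), so the program can do nothing else meanwhile.
    data = sock.recv(4096)
    print(data)
    sock.close()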
Notice that Figure 4, which shows a blocking program, looks a bit like Figure 3, which showed an asynchronous program. This is
not a coincidence. The fundamental idea behind the asynchronous model is that an asynchronous program,
when faced with a task that would normally block in a synchronous program, will instead execute some other task that can still make progress. So an asynchronous program only blocks when no task can make
progress (which is why an asynchronous program is often called a non-blocking program). Each switch
from one task to another corresponds to the first task either finishing, or coming to a point where it would
have to block. With a large number of potentially blocking tasks, an asynchronous program can outperform
a synchronous one by spending less overall time waiting, while devoting a roughly equal amount of time to
real work on the individual tasks.
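One common way to realize this idea is sketched below, assuming a couple of placeholder host/port pairs and a hypothetical handle_chunk() callback: put each socket into non-blocking mode and use Python's select() to run whichever task has data ready, so the program as a whole blocks only when no task can make progress.

    import select
    import socket

    # Placeholder addresses of services this program wants to read from
    # "at the same time"; they are not real hosts.
    addresses = [("host-a.example", 10000), ("host-b.example", 10001)]

    sockets = []
    for host, port in addresses:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, port))
        s.setblocking(False)   # reads on this socket will no longer block
        sockets.append(s)

    # The loop blocks in select() only when *no* socket is readable, i.e.
    # only when no task could make progress anyway.
    while sockets:
        readable, _, _ = select.select(sockets, [], [])
        for s in readable:
            data = s.recv(4096)
            if data:
                handle_chunk(s, data)   # hypothetical per-task callback
            else:
                sockets.remove(s)       # this task is finished
                s.close()

Note how each pass of the loop hands a task its next piece of input, which is exactly why tasks must be written to accept their input in bits and pieces.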
Compared to the synchronous model, the asynchronous model performs best when:
• There are a large number of tasks so there is likely always at least one task that can make progress.
• The tasks perform lots of I/O, causing a synchronous program to waste lots of time blocking when other tasks could be running.
• The tasks are largely independent from one another so there is little need for inter-task communication (and thus for one task to wait upon another).
These conditions almost perfectly characterize a typical busy network server (like a web server) in a
client-server environment. Each task represents one client request with I/O in the form of receiving the
request and sending the reply. A network server implementation is a prime candidate for the asynchronous
model, which is why Twisted and Node.js, among other asynchronous server libraries, have grown so much
in popularity in recent years.
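As a rough sketch of that shape, here is a tiny echo-style server written with Python's standard asyncio module (chosen for brevity rather than Twisted or Node.js); the address and port are arbitrary. Each client connection becomes one task that awaits the request and then awaits sending the reply, and while one connection is waiting on the network, the loop services the others.

    import asyncio

    async def handle_client(reader, writer):
        # One task per client: wait for the request...
        request = await reader.readline()
        # ...then send the (trivial) reply. While this task awaits the
        # network, the event loop runs other clients' tasks.
        writer.write(b"echo: " + request)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        # The port number is an arbitrary choice for the example.
        server = await asyncio.start_server(handle_client, "127.0.0.1", 8000)
        async with server:
            await server.serve_forever()

    asyncio.run(main())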
You may be asking: Why not just use more threads? If one thread is blocking on an I/O operation,
another thread can make progress, right? However, as the number of threads increases, your server may start
to experience performance problems. With each new thread, there is some memory overhead associated
with the creation and maintenance of thread state. The asynchronous model also avoids the cost of context switching: every time the OS transfers control from one thread to another, it has to save all the relevant registers, the memory map, stack pointers, FPU context, and so on, so that the other thread can resume execution where it left off. The overhead of doing this can be quite significant.
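For contrast, a thread-per-connection sketch of the same kind of server (again with an arbitrary port) spends one OS thread, with its own stack and its share of context switches, on every client:

    import socket
    import threading

    def handle_client(conn):
        # Each thread blocks on its own recv(); the OS switches between
        # threads, saving and restoring their state on every switch.
        request = conn.recv(4096)
        conn.sendall(b"echo: " + request)
        conn.close()

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 8000))
    server.listen()
    while True:
        conn, _addr = server.accept()
        # One new OS thread (and thread stack) per client connection.
        threading.Thread(target=handle_client, args=(conn,)).start()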