PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

I don't know why some of you think there's this big pool of compute power that graphics tasks can't make use of but compute tasks can. They look the same to the shader cores.

No, it's not impossible to use all clock cycles for graphics, nor does it follow that wasted cycles could have been used on other things. No, there isn't some natural restriction in how graphics commands are submitted that prevents the shader cores from reaching full occupancy. Yes, having a larger number and variety of task types may help improve utilization, but that's not a given, and it's not right to assume you're guaranteed to be able to throw compute tasks on top of the graphics workload at zero expense.
 
which of these statements would still be accurate?
- GPU and GPGPU tasks can be handled simultaneously, without affecting graphics.
No. Jesus. How would you run compute and graphics tasks simultaneously on the same hardware? If you run compute tasks, you CANNOT run graphics stuff on those same execution units. They only do one thing at a time.
- GPU compute and graphics tasks can co-exist with no penalty to graphics.
No! If some execution units run compute tasks then there will be fewer units available to run graphics tasks. So there will be a penalty to performance.

Where is all this fairytale nonsense about unrestricted simultaneous execution coming from? Jesus! Since when has that ever been possible on any piece of hardware, much less an el cheapo APU?!!!

There's no magic involved here. Pretty much no complex processor ever made can come close to 100% utilization all the time. That's just the way it is.
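
To make that concrete, here is a toy model (illustrative numbers only, nothing measured; the 18 CUs at 800 MHz are the rumored figures discussed later in the thread): the CUs are a fixed pool of cycles per frame, and whatever slice compute takes is a slice graphics no longer has.

```c
/* Toy illustration: a GPU's CUs are a fixed budget of cycles per frame.
   Whatever compute consumes, graphics cannot also use.
   Numbers are illustrative, not measured. */
#include <stdio.h>

int main(void) {
    const double cu_count = 18.0;        /* CUs (per the rumored 14+4 = 18) */
    const double clock_hz = 800e6;       /* 800 MHz                         */
    const double frame_s  = 1.0 / 60.0;  /* 60 fps frame budget             */

    double cu_cycles_per_frame = cu_count * clock_hz * frame_s;

    /* Hypothetical split: compute jobs take 15% of the CU cycles. */
    double compute_share   = 0.15;
    double graphics_cycles = cu_cycles_per_frame * (1.0 - compute_share);
    double compute_cycles  = cu_cycles_per_frame * compute_share;

    printf("CU-cycles per frame: %.3e\n", cu_cycles_per_frame);
    printf("left for graphics:   %.3e\n", graphics_cycles);
    printf("taken by compute:    %.3e\n", compute_cycles);
    return 0;
}
```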

It all still means the same thing onQ was saying.
...And he's been talking out of his ass for weeks now, imagining this kind of nonsense which you are now repeating and perpetuating.

Question: If GPU commands are issued in waves
Batches, really. "Waves", or wavefronts (that's actually AMD's term; NVIDIA's equivalent is the "warp"), is just a label for the group of threads the shader cores execute together. Nothing magic about it.

wouldn't that mean some cycles are naturally not being used?
No. Quite the opposite really. Commands are batched up to utilize as much as possible of the hardware. Before the hardware is finished with one set of batches you would normally send it new batches to work on, if there's still anything left to process.
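
In rough pseudo-terms, the submission loop looks like this (a minimal sketch with stand-in stubs, not any real graphics API): the CPU builds and queues the next batch while the GPU is still chewing on the previous one, so the shader cores rarely sit around waiting for work.

```c
/* Conceptual sketch of "keep sending batches": the CPU prepares the batch
   for frame N+1 while the GPU is still working on frame N. All functions
   and types here are stand-in stubs, not a real graphics API. */
#include <stdio.h>

typedef struct { int frame; int command_count; } CommandBatch;

static CommandBatch build_commands(int frame) {      /* CPU-side work */
    CommandBatch b = { frame, 1000 + frame };
    return b;
}

static void submit_batch(const CommandBatch *b) {    /* hand off to the "GPU" */
    printf("submitted frame %d (%d commands)\n", b->frame, b->command_count);
}

int main(void) {
    for (int frame = 0; frame < 3; ++frame) {
        CommandBatch batch = build_commands(frame);
        submit_batch(&batch);   /* no wait here: the next batch queues up
                                   behind the one the GPU is still processing */
    }
    return 0;
}
```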

When FLOPs tests are performed, are they still using this "wave" technique?
Not sure what you're trying to say.
 
I remember a B3D discussion about how GPUs don't/can't use all of their clock cycles. It had something to do with all the extra power eSRAM could possibly unlock through "efficiency". :)

Knowing a GPU can't make full use of ALL of its clock cycles, why couldn't that be an opportunity to stick some compute tasks within those windows? Wouldn't that still leave its normal graphics tasks unaffected?

Didn't Cerny say they could bypass GPU L2 and L1 cache?

There's a difference between, say, 70% efficiency and 90% efficiency.

If you are using all 1.84 TFLOPs of the GPU for graphics rendering there is nothing left over for anything else because you're already at 100%. BTW - it's also highly unlikely you'll generally be using 100% of the GPU. It'll peak at that at times, yes, but you'll likely average far lower than that for graphics tasks.

Now when you consider that, there is an opportunity to use idle resources for compute. So rather than the GPU running at an overall general utilization of 70%, perhaps now with some compute jobs thrown in you can approach 90% average general utilization.
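
To attach rough numbers to that (illustrative only, assuming the 1.84 TFLOPS peak):

```c
/* Illustrative arithmetic only: filling idle ALU time with compute raises
   average utilization, but it never raises the 1.84 TFLOPS peak. */
#include <stdio.h>

int main(void) {
    const double peak_tflops = 1.84;
    double graphics_only = 0.70 * peak_tflops;  /* ~1.29 TFLOPS doing work */
    double with_compute  = 0.90 * peak_tflops;  /* ~1.66 TFLOPS doing work */
    printf("reclaimed for compute: %.2f TFLOPS\n", with_compute - graphics_only);
    return 0;
}
```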

That's where the things you put in to remove performance pitfalls and increase utilization (efficiency) make sense. Sony, just like Microsoft, has likely made some changes to boost utilization of resources.

But at the end of the day they still can't get more than 1.84 TFLOPs out of the GPU. Just like Microsoft with the rumored specs isn't going to get more than 1.2 TFLOPs out of the GPU no matter how much people may wish it so.

Regards,
SB
 
The problem with all these theories is that you're only considering flops. Sure, graphics may grossly underutilize the ALUs, but you have to look at why.
If the reason you're not utilizing flops is because you're saturating the bus, any additional operation that impacts that bus usage also impacts the running operation.
There is value in lower priority long running compute jobs running in parallel with rendering, but it's not magic.
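
A back-of-envelope way to see the bus point: if a job performs some number of flops per byte it moves, memory bandwidth caps it at bandwidth times that ratio, no matter how many ALUs sit idle. The sketch below assumes the widely reported 176 GB/s GDDR5 figure and a made-up arithmetic intensity of 2 flops per byte.

```c
/* Back-of-envelope "roofline" check: a bandwidth-bound job can't use the
   ALUs no matter how many are free. The 176 GB/s matches the widely
   reported GDDR5 figure; the 2 flops/byte intensity is simply assumed. */
#include <stdio.h>

int main(void) {
    const double peak_gflops    = 1843.2;  /* 18 CUs * 64 ALUs * 2 * 0.8 GHz */
    const double bandwidth_gbs  = 176.0;   /* reported GDDR5 bandwidth       */
    const double flops_per_byte = 2.0;     /* assumed arithmetic intensity   */

    double bw_limited = bandwidth_gbs * flops_per_byte;   /* 352 GFLOPS */
    double achievable = bw_limited < peak_gflops ? bw_limited : peak_gflops;

    printf("achievable: %.0f GFLOPS of a %.0f GFLOPS peak (%.0f%%)\n",
           achievable, peak_gflops, 100.0 * achievable / peak_gflops);
    return 0;
}
```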
 
I don't know why some of you think there's this big pool of compute power that graphics tasks can't make use of but compute tasks can. They look the same to the shader cores.
I would say it's because the CPU is an extra method of input and can bypass all cache. To me, it sounds like UNused clock cycles can be used for compute without the GPU's management systems getting involved. Why would that not be a good reason?

No, it's not impossible to use all clock cycles for graphics, nor does it follow that wasted cycles could have been used on other things. No, there isn't some natural restriction in how graphics commands are submitted that prevents the shader cores from reaching full occupancy. Yes, having a larger number and variety of task types may help improve utilization, but that's not a given, and it's not right to assume you're guaranteed to be able to throw compute tasks on top of the graphics workload at zero expense.

My apologies, then. Can you provide an example or two of this?
 
The problem with all these theories is that you're only considering flops. Sure, graphics may grossly underutilize the ALUs, but you have to look at why.
If the reason you're not utilizing flops is because you're saturating the bus, any additional operation that impacts that bus usage also impacts the running operation.
There is value in lower priority long running compute jobs running in parallel with rendering, but it's not magic.

Exactly! No magic required, just good engineering and know-how.
 
I have a question & I don't want to be attacked for my question:

in the VGLeaks paper it says that the PS4 GPU has "4 additional CUs (410 Gflops) 'extra' ALU as resource for compute"

& also


& the gameindustry article says that the PS4 has "a new asynchronous compute architecture"

& here is a PDF about Asynchronous ALUs http://www.academia.edu/3117590/Efficient_Asynchronous_ALU_Model_Design_and_Simulation

So I'm wondering if the extra ALUs added to the CUs are asynchronous ALUs with no clock, & if that's the reason why it's not showing up as more FLOPS on the GPU.

could that be a possibility?
 
The vgleaks article said 14+4 CUs. The total there is 18. 18 CUs at 800 MHz = 1.84 TF.
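
Spelled out, assuming standard GCN CUs (64 ALUs per CU, each able to issue one fused multiply-add per clock, counted as 2 flops):

\[
18\ \text{CUs} \times 64\ \tfrac{\text{ALUs}}{\text{CU}} \times 2\ \tfrac{\text{flops}}{\text{ALU·clock}} \times 0.8\ \text{GHz} = 1843.2\ \text{GFLOPS} \approx 1.84\ \text{TFLOPS}
\]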

They wouldn't be releasing pressers with 'almost 2TF' if they could have said more than 2TF.

There is no spoon. This line of discussion is well past absurd.
 
There's a difference between, say, 70% efficiency and 90% efficiency.

If you are using all 1.84 TFLOPs of the GPU for graphics rendering there is nothing left over for anything else because you're already at 100%. BTW - it's also highly unlikely you'll generally be using 100% of the GPU. It'll peak at that at times, yes, but you'll likely average far lower than that for graphics tasks.

Now when you consider that, there is an opportunity to use idle resources for compute. So rather than the GPU running at an overall general utilization of 70%, perhaps now with some compute jobs thrown in you can approach 90% average general utilization.

That's where the things you put in to remove performance pitfalls and increase utilization (efficiency) make sense. Sony, just like Microsoft, has likely made some changes to boost utilization of resources.

But at the end of the day they still can't get more than 1.84 TFLOPs out of the GPU. Just like Microsoft with the rumored specs isn't going to get more than 1.2 TFLOPs out of the GPU no matter how much people may wish it so.

Regards,
SB

Are ALL GPU resources counted in a FLOPs tally? I thought only graphics ability was counted in a GPU FLOPs tally.
 
That actually clears up quite a few things, and the upcoming Gamasutra article should provide further detail. However, that interview, and others, make it clear that there's a lot of discussion of "this is what we'd like to do" and not "this is day one." Vita Remote Play seems to be a lock thanks to the PS4 hardware video encoder and Vita hardware video decoder. However, on the feature, multitasking, and social integration side of things it's far more ambiguous. Some of the features they're talking about now were discussed 8 years ago during the PS3 reveal; the F1 reveal with the in-game video chat immediately comes to mind.

What I would love to see is a publicly available road-map of planned features and target dates released before launch. Normally, I would seriously doubt that would ever happen with SCE, but with the direction Cerny is taking things with the PS4, who knows. There's a community features suggestion section on the PS Blog for PS3, so maybe this time around they'll be even more open/transparent.

I'm also curious to know how robust the video encoder is. There are 4 key features that seem to rely on it.
  1. Remote Play
  2. Live Stream
  3. Last few minutes of video (for upload)
  4. PS4 Eye Video feed
Can each of those features be used simultaneously?

I highly doubt it. One example at least is that Remote Play will not be compatible with streaming or AR Move games, which they've already said. They all seem to need the same processor. I doubt much simultaneous use will exist, except possibly AR Singstar and streaming, for example. Streaming with Remote Play already seems out of the question, though.
 
The vgleaks article said 14+4 CUs. The total there is 18. 18 CUs at 800 MHz = 1.84 TF.

They wouldn't be releasing pressers with 'almost 2TF' if they could have said more than 2TF.

There is no spoon. This line of discussion is well past absurd.

I think you missed what I was asking.


I was asking: if the "extra" ALU as a resource for compute is an asynchronous ALU with no clock, could that be the reason why no extra FLOP numbers were given?
 
Are ALL GPU resources counted in a FLOPs tally? I thought only graphics ability was counted in a GPU FLOPs tally.

FLOPS are just FLoating point Operations Per Second. It's not a test or benchmark. It's the total theoretical floating point operations a given piece of hardware can perform per second. As it's a theoretical maximum, it is very rarely reached outside of very synthetic tests. Hence why I never assume either the PS4 or Durango will come close to using all of their respective FLOPS.

Also, FLOPS aren't always equal to FLOPS, depending on how a given piece of hardware is architected. In this case that's moot, however, when comparing Durango to Orbis, as they both basically use the same architecture for the GPU.

And as ERP pointed out, FLOPS isn't the only thing that is important with regards to graphics rendering. There's a lot of other stuff going on as well in order to make the pretty pixels appear on screen. So it makes sense to do things to make use of any potentially idle compute resources.

Hence, at no time will you ever exceed 1.84 TFLOPS on the PS4, but you may get relatively close. 80-90% utilization in general. If they can do better than that over a long gameplay session then I'll frankly be quite amazed.

Regards,
SB
 
I have a question & I don't want to be attacked for my question:

in the VGLeaks paper it says that the PS4 GPU has "4 additional CUs (410 Gflops) 'extra' ALU as resource for compute"

& also

& the gameindustry article says that the PS4 has "a new asynchronous compute architecture"

& here is a PDF about Asynchronous ALUs http://www.academia.edu/3117590/Efficient_Asynchronous_ALU_Model_Design_and_Simulation

So I'm wondering if the extra ALUs added to the CUs are asynchronous ALUs with no clock, & if that's the reason why it's not showing up as more FLOPS on the GPU.

could that be a possibility?

Please please stop googling terms and stringing them together to try to support an argument. You have to actually understand the topics first. I know you asked not to be attacked but I don't know how to better put this.

The paper on asynchronous ALUs couldn't have anything less to do with the ACEs in AMD's GCN design. Asynchronous there refers to clock-less logic. I assure you nothing about PS4's GPU is clock-less. It's called asynchronous because it has queues for compute tasks that aren't connected to the command processors. If you had the required background necessary to understand the things you linked you'd know that they had no connection at all...
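
A rough mental model of that (the struct and names below are invented for illustration, not AMD's or Sony's actual interface, and the 8-pipes-by-8-queues figure is the rumored one quoted later in the thread): compute work gets its own queues with their own front-ends, so it can be fed independently of the graphics command processor, but it still dispatches onto the same pool of CUs.

```c
/* Conceptual model only: invented names, not AMD's real interface.
   The point: compute queues have their own front-ends, independent of
   the graphics command processor, yet dispatch onto the same CUs. */
#include <stdio.h>

#define NUM_PIPES       8   /* rumored: 8 compute pipes...   */
#define QUEUES_PER_PIPE 8   /* ...with 8 queues each         */

typedef struct { int job_count; } Queue;

typedef struct {
    Queue graphics_queue;                              /* fed by the graphics command processor */
    Queue compute_queues[NUM_PIPES][QUEUES_PER_PIPE];  /* fed independently of it               */
    int   shared_cu_count;                             /* both kinds of work run on these CUs   */
} GpuModel;

int main(void) {
    GpuModel gpu = { .graphics_queue = { 0 }, .shared_cu_count = 18 };
    printf("1 graphics queue + %d compute queues, all sharing %d CUs\n",
           NUM_PIPES * QUEUES_PER_PIPE, gpu.shared_cu_count);
    return 0;
}
```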
 
Please please stop googling terms and stringing them together to try to support an argument. You have to actually understand the topics first. I know you asked not to be attacked but I don't know how to better put this.

The paper on asynchronous ALUs couldn't have anything less to do with the ACEs in AMD's GCN design. Asynchronous there refers to clock-less logic. I assure you nothing about PS4's GPU is clock-less. It's called asynchronous because it has queues for compute tasks that aren't connected to the command processors. If you had the required background necessary to understand the things you linked you'd know that they had no connection at all...

That was my point in asking.
 
I think you missed what I was asking.


I was asking: if the "extra" ALU as a resource for compute is an asynchronous ALU with no clock, could that be the reason why no extra FLOP numbers were given?

We see no evidence of extra ALUs so far. There is more to it than raw specs.

For contention between graphics and other compute jobs, or bottlenecks in other areas, it will boil down to the actual design, implementation and content used. No magic is needed to overcome them. The devs have finite time to tackle them through optimization, rewrites or content tweaks, or to scale the requirements down if all else fails. There is also a learning curve and work-in-progress tools to deal with in the meantime.

It's also futile to chase 100% utilization. The objective is to make a great game within budget. Your real gods are the programmers. In the software world, people find the cheapest (and wisest) solutions that do the job adequately. It is quite meaningless to maximize utilization unnecessarily. The best solution may use the least resources (so you can run graphics and compute jobs together where it makes sense).

It sounds like those Sony folks have balanced the h/w against some target specs. I would be surprised if they planned the h/w specs assuming 100% utilization. There should be some slack/buffer planned when designing the system.

From the interview, what they aim to provide is a flexible set of development tools that marry the CPU and GPU together "nicely". It's an extension of the familiar PS3 development concept, but this time they try to get out of the way as much as possible. There are also custom features for developers to make better use of the h/w compared to PC tools.

That's it! The great developers will figure out the best ways to fool your eyes.
 
FLOPS are just FLoating point Operations Per Second. It's not a test or benchmark. It's the total theoretical floating point operations a given piece of hardware can perform per second. As it's a theoretical maximum, it is very rarely reached outside of very synthetic tests. Hence why I never assume either the PS4 or Durango will come close to using all of their respective FLOPS.

Also, FLOPS aren't always equal to FLOPS, depending on how a given piece of hardware is architected. In this case that's moot, however, when comparing Durango to Orbis, as they both basically use the same architecture for the GPU.

And as ERP pointed out, FLOPS isn't the only thing that is important with regards to graphics rendering. There's a lot of other stuff going on as well in order to make the pretty pixels appear on screen. So it makes sense to do things to make use of any potentially idle compute resources.

Hence, at no time will you ever exceed 1.84 TFLOPS on the PS4, but you may get relatively close. 80-90% utilization in general. If they can do better than that over a long gameplay session then I'll frankly be quite amazed.

Regards,
SB

In general, FLOPS usually apply to programmable computation power only.

e.g., I read somewhere that people have problems stating the FLOP count for FPGA solutions because they are not programmable in the software sense.
 
I have a question & I don't want to be attacked for my question:

in the VGLeaks paper it says that the PS4 GPU has "4 additional CUs (410 Gflops) 'extra' ALU as resource for compute"

& also



& the gameindustry article says that the PS4 has "a new asynchronous compute architecture"

& here is a PDF about Asynchronous ALUs http://www.academia.edu/3117590/Efficient_Asynchronous_ALU_Model_Design_and_Simulation

So I'm wondering if the extra ALUs added to the CUs are asynchronous ALUs with no clock, & if that's the reason why it's not showing up as more FLOPS on the GPU.

could that be a possibility?

The leaks were from early dev kits (regarding the 14+4 CUs); now we know it's 18 unified CUs capable of doing both GP compute and graphics, eliminating the need to restrict 4 solely to GP compute. Devs are now free to do one or the other, or half-and-half of both.

The asynchronous architecture is most likely the compute rings and queues in the other leak from a later kit.

http://www.vgleaks.com/orbis-gpu-compute-queues-and-pipelines/

Mark Cerny said:
For the PS4 hardware, the GPU can also be used in a manner analogous to the x86-64 CPU, with resources allocated at various levels. The design has 8 pipes, and each pipe has 8 compute queues. Each queue can execute things such as physics computation middleware and other proprietarily designed workloads, all while the GPU simultaneously handles graphics processing.

This also ties in with assisting the CPU by sacrificing some GPU resources through the supposed Onion bus.

http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/

Mark Cerny said:
The GPGPU for us is a feature that is of utmost importance. For that purpose, we’ve customized the existing technologies in many ways.

Just as an example…when the CPU and GPU exchange information in a generic PC, the CPU inputs information, and the GPU needs to read the information and clear the cache, initially. When returning the results, the GPU needs to clear the cache, then return the result to the CPU. We’ve created a cache bypass. The GPU can return the result using this bypass directly. By using this design, we can send data directly from the main memory to the GPU shader core. Essentially, we can bypass the GPU L1 and L2 cache. Of course, this isn’t just for data read, but also for write. Because of this, we have an extremely high bandwidth of 10GB/sec.

All this is just an attempt to make optimizations and tweaks easier for devs, that's all it is.
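
As a conceptual sketch of what that bypass buys you (the enum, struct and allocator below are invented placeholders, not Sony's or AMD's real SDK): per buffer, you'd choose whether GPU accesses go through the GPU's L1/L2 or bypass them and stay coherent with the CPU, so small CPU/GPU handshakes don't force whole-cache flushes.

```c
/* Conceptual sketch only: invented names, not a real SDK.
   The idea from the quote: per allocation, choose whether GPU accesses go
   through the GPU L1/L2 (good for bulk graphics data) or bypass the GPU
   caches and stay coherent with the CPU (good for small, frequently
   exchanged compute results), avoiding cache flushes on every handoff. */
#include <stdlib.h>

typedef enum {
    MEM_GPU_CACHED,     /* bulk textures/vertices: GPU L1/L2 path            */
    MEM_CACHE_BYPASS    /* CPU<->GPU shared results: coherent, cache-bypass  */
} MemPath;

typedef struct { void *ptr; size_t size; MemPath path; } Buffer;

/* Hypothetical allocator standing in for whatever the real SDK provides. */
static Buffer alloc_buffer(size_t size, MemPath path) {
    Buffer b = { malloc(size), size, path };
    return b;
}

int main(void) {
    Buffer textures = alloc_buffer(64u << 20, MEM_GPU_CACHED);   /* 64 MB */
    Buffer results  = alloc_buffer(64u << 10, MEM_CACHE_BYPASS); /* 64 KB */
    free(textures.ptr);
    free(results.ptr);
    return 0;
}
```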

Note: Quotes are from translated Watch article:
http://forum.beyond3d.com/showpost.php?p=1723410&postcount=1105
 