The capabilities of the 4 special CUs in Orbis

No, I'm referring to 400 GFLOPS on a dedicated PhysX card, with another, much more powerful GPU as the graphics unit. That's literally a whole card with 400 GFLOPS doing nothing but physics, but it's not enough, even for 3-4 year old PhysX games.


Well, you did not say that.

But I have seen videos on YouTube of a slower card being used as a dedicated physics card while a more powerful card was used for the rest, and the slower card doing physics actually slowed down the faster GPU; in fact, the faster GPU was 66% faster on its own than with another card handling the physics.

That clearly shows that something is creating a bottleneck and slowing down the GPU. That did not happen with the Cell + RSX configuration; quite the contrary, anything Cell offloaded from the RSX actually gave it a boost in performance.

https://www.youtube.com/watch?v=cbww3dhzK0M
 
No, I'm referring to 400 GFLOPS on a dedicated PhysX card, with another, much more powerful GPU as the graphics unit. That's literally a whole card with 400 GFLOPS doing nothing but physics, but it's not enough, even for 3-4 year old PhysX games.

And that's totally irrelevant as it has no relation to how physics would work on Orbis.
 
No, I'm referring to 400 GFLOPS on a dedicated PhysX card, with another, much more powerful GPU as the graphics unit. That's literally a whole card with 400 GFLOPS doing nothing but physics, but it's not enough, even for 3-4 year old PhysX games.

Need to see how they did it.

On a PC, there may be more overhead and memory copies.
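To be clear about where that PC overhead comes from: on a discrete card, every simulation step has to stage data across PCIe before the GPU can touch it, which a single fast memory pool would avoid. A minimal CUDA sketch of the two paths, with a made-up particle count and a toy kernel, purely for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Toy "physics" kernel: integrate positions by one time step.
__global__ void integrate(float* pos, const float* vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += vel[i] * dt;
}

int main()
{
    const int n = 1 << 20;                    // made-up particle count
    const size_t bytes = n * sizeof(float);

    // Path 1: discrete GPU. Data lives in host RAM and must be copied over
    // PCIe before and after the step; this is the "memory copy" overhead.
    float* h_pos = (float*)malloc(bytes);
    float* h_vel = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_pos[i] = 0.0f; h_vel[i] = 1.0f; }

    float *d_pos, *d_vel;
    cudaMalloc((void**)&d_pos, bytes);
    cudaMalloc((void**)&d_vel, bytes);
    cudaMemcpy(d_pos, h_pos, bytes, cudaMemcpyHostToDevice);   // copy in
    cudaMemcpy(d_vel, h_vel, bytes, cudaMemcpyHostToDevice);   // copy in
    integrate<<<(n + 255) / 256, 256>>>(d_pos, d_vel, 0.016f, n);
    cudaMemcpy(h_pos, d_pos, bytes, cudaMemcpyDeviceToHost);   // copy out

    // Path 2: mapped (zero-copy) host memory. No explicit copies in the
    // code, but on a PC the GPU still reads it over PCIe; on a console
    // with one unified pool the same pattern would hit fast local memory.
    float *m_pos, *m_vel, *dm_pos, *dm_vel;
    cudaHostAlloc((void**)&m_pos, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&m_vel, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dm_pos, m_pos, 0);
    cudaHostGetDevicePointer((void**)&dm_vel, m_vel, 0);
    integrate<<<(n + 255) / 256, 256>>>(dm_pos, dm_vel, 0.016f, n);
    cudaDeviceSynchronize();

    printf("first position: %f\n", h_pos[0]);
    return 0;
}
```

Either way, on a PC the data crosses PCIe at some point; the console advantage is that the same code pattern would never leave the unified pool.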
 
Since you seem to know, please edify us with how physics will work on Orbis.

Same as it'll work on Durango.

To claim physics on a full or partial HSA implementation will suck based on the performance of an add-on PCIe card is disingenuous. It may also just mean that the add-on card sucks.
 
To claim physics on a full or partial HSA implementation will suck based on the performance of an add-on PCIe card is disingenuous. It may also just mean that the add-on card sucks.

ERP said:
GPGPU is anything but a magic bullet.
There are just certain classes of problem that are not well suited to it.
For it to save WiiU, the class of problems it is good for would have to dominate the CPU workload of modern games, and I don't think that's true.
I spoke to the Havok guys recently; despite the demo they did a few years ago, they don't think rigid body dynamics is a good fit on the GPU, and they are looking for other "physics" effects to use the GPU for instead.
http://forum.beyond3d.com/showpost.php?p=1681101&postcount=72

Game-level rigid body physics isn't a good fit for the GPU (CU). The demos and examples I've seen have focussed on lots of the same object colliding, getting on the order of a 2x speed-up over a CPU of the same silicon budget. GPUs are far better suited to fluid and particle physics with connected bodies, or other tasks entirely, such as some fancy light processing, a particle engine, procedural texture creation or something.
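For what it's worth, here's a rough CUDA sketch of why the fit differs; the kernels and numbers are mine (toy code, not from any real engine). The particle step is one uniform instruction stream over independent elements, while even a heavily simplified rigid-body contact pass branches per pair and serialises on shared bodies, which is where a wide-SIMD GPU loses its advantage:

```cuda
#include <cuda_runtime.h>

// Good fit: every particle runs the exact same arithmetic, no branching,
// coalesced memory access, so the GPU's wide SIMD stays fully occupied.
__global__ void step_particles(float3* pos, float3* vel, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    vel[i].y += -9.81f * dt;          // uniform work per element
    pos[i].x += vel[i].x * dt;
    pos[i].y += vel[i].y * dt;
    pos[i].z += vel[i].z * dt;
}

// Poor fit: a (toy) rigid-body contact pass. Each thread takes a different
// path depending on whether its pair actually touches, so threads in the
// same wavefront/warp sit idle while their neighbours branch; divergence
// eats the theoretical FLOPS long before you see a big win over a CPU.
struct Contact { int a, b; float depth; };

__global__ void resolve_contacts(Contact* contacts, float3* vel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Contact c = contacts[i];
    if (c.depth <= 0.0f) return;       // many threads exit early: divergence
    // Simplified impulse along Y only, serialised by atomics because two
    // contacts can touch the same body (another GPU-unfriendly dependency).
    float impulse = c.depth * 0.5f;
    atomicAdd(&vel[c.a].y,  impulse);
    atomicAdd(&vel[c.b].y, -impulse);
}
```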
 
What I never understood is why separate these 4 CUs for anything specific if they are no different from normal CUs? If they segmented these 4 CUs purely for compute tasks, wouldn't it be more beneficial to just leave them free to be used for either compute or rendering? Usually when it comes to hardware dedicated to a specific use, you trade flexibility for efficiency in that use. That doesn't look to be the case here if they are indeed normal CUs. Am I missing something here?
 
And that's totally irrelevant as it has no relation to how physics would work on Orbis.

Yeah, you're right. GPU physics on a dedicated platform and card, with dedicated code and software just for it, has nothing to do with GPU physics on a chip with a few shaders reserved that are supposedly shoehorned into doing physics work... because one of those two actually would work, whereas the other one is just a pipe dream. ;)
 
What I never understood is why separate these 4 CUs for anything specific if they are no different from normal CUs? If they segmented these 4 CUs purely for compute tasks, wouldn't it be more beneficial to just leave them free to be used for either compute or rendering? Usually when it comes to hardware dedicated to a specific use, you trade flexibility for efficiency in that use. That doesn't look to be the case here if they are indeed normal CUs. Am I missing something here?

The suggestion from leaks seemed to be that they were still free to work as regular CUs, but that they were optimised for doing compute tasks.

But I'll gladly wait a little while to find out for sure. ;)
 
Rumors claimed a minor mod (e.g., an additional scalar ALU), but really, details on both are a little sketchy. DF's earlier Orbis article even identified this extra compute resource as a standalone unit, which I thought was odd.

Within AMD's existing GCN framework, there are also some ways to set up these CUs for more flexible use (e.g., extra ACEs). It depends on how Sony glues the potentially modified parts in.


For CPU vs GPU physics, I think we will end up with both. The GPU physics on Orbis (and Durango) should be more efficient and hence more flexible than straight PC GPU implementations.

What I'd like to find out is where GCN's DMA units sit, and whether they have been removed in Orbis, since there is one unified and fast memory pool now.
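On the "extra ACE" angle: the point of additional compute queues is that compute jobs can be submitted and scheduled independently of the graphics pipeline and overlapped on the same CUs. There's no public Orbis API to quote, so here's the nearest off-the-shelf analogue, CUDA streams, as a hedged sketch (the kernels are stand-ins I made up):

```cuda
#include <cuda_runtime.h>

__global__ void render_pass(float* rt, int n)      // stand-in for graphics work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) rt[i] = rt[i] * 0.5f + 0.1f;
}

__global__ void physics_pass(float* bodies, int n) // stand-in for compute work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) bodies[i] += 0.016f;
}

int main()
{
    const int n = 1 << 18;
    float *rt, *bodies;
    cudaMalloc((void**)&rt, n * sizeof(float));
    cudaMalloc((void**)&bodies, n * sizeof(float));

    // Two independent queues: conceptually, "graphics" and "compute" work
    // submitted separately and left to the hardware scheduler to interleave,
    // which is roughly what extra ACEs buy you on GCN.
    cudaStream_t gfx_q, compute_q;
    cudaStreamCreate(&gfx_q);
    cudaStreamCreate(&compute_q);

    render_pass <<<(n + 255) / 256, 256, 0, gfx_q    >>>(rt, n);
    physics_pass<<<(n + 255) / 256, 256, 0, compute_q>>>(bodies, n);

    cudaDeviceSynchronize();   // both queues drained before results are read
    cudaStreamDestroy(gfx_q);
    cudaStreamDestroy(compute_q);
    cudaFree(rt);
    cudaFree(bodies);
    return 0;
}
```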
 
The suggestion from leaks seemed to be that they were still free to work as regular CUs, but that they were optimised for doing compute tasks.

But I'll gladly wait a little while to find out for sure. ;)

Oh ok, I thought they were reserved specifically for compute tasks, but what you say makes more sense assuming the optimized bit is true. Thanks :D
 
The VGleaks article mentioned that the extra compute units can be used for rendering jobs to see a minor boost.
 
Particles and so on will greatly benefit from those CUs (Crysis 2 at top settings on a PC is a great example of this).

I think the first image in the thread is not quite right though. I remember @Laa Yosh saying that the trailer that picture comes from used 128-bit HDR rendering, which I don't think next-gen consoles are going to pull off.
I think we can assume that next-gen consoles can at least support 64-bit HDR, since it's a new feature in the RedEngine for The Witcher 3, which is also coming to next-gen consoles. I doubt the difference between 64-bit and 128-bit is that big though, and even then, that's not the main factor influencing how the particles are rendered in the pic. I was thinking more of an advanced fluid-dynamic particle system for that smoke, like this.
Rendered in real time on Nvidia CUDA in 2009.
http://www.youtube.com/watch?v=RuZQpWo9Qhs

I'm thinking a 2012-based Pitcairn card with 4 dedicated compute cores should at least render something similar for the rocket and burning wreckage trail, no?
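For a sense of what a fluid-style smoke sim actually asks of the hardware, here's a minimal semi-Lagrangian advection step in CUDA; it's a toy 2D grid with invented parameters, nothing like a production solver, but it shows the essentially branch-free, uniform per-cell work that maps well onto spare CUs:

```cuda
#include <cuda_runtime.h>

// One semi-Lagrangian advection step: each cell traces backwards along the
// velocity field and bilinearly samples the old density there. Uniform work
// per cell, no data-dependent branching in the main body: a textbook
// GPU-friendly "physics" kernel.
__global__ void advect_density(const float* d_in, float* d_out,
                               const float* vel_x, const float* vel_y,
                               int w, int h, float dt)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int idx = y * w + x;

    // Trace back to where this cell's contents came from.
    float px = x - dt * vel_x[idx];
    float py = y - dt * vel_y[idx];
    px = fminf(fmaxf(px, 0.0f), (float)(w - 1));
    py = fminf(fmaxf(py, 0.0f), (float)(h - 1));

    // Bilinear interpolation of the old density field.
    int x0 = (int)px, y0 = (int)py;
    int x1 = min(x0 + 1, w - 1), y1 = min(y0 + 1, h - 1);
    float fx = px - x0, fy = py - y0;
    float top = d_in[y0 * w + x0] * (1.0f - fx) + d_in[y0 * w + x1] * fx;
    float bot = d_in[y1 * w + x0] * (1.0f - fx) + d_in[y1 * w + x1] * fx;
    d_out[idx] = top * (1.0f - fy) + bot * fy;
}
```

A real solver adds pressure projection and so on, but every stage looks like this: one pass of identical arithmetic over a big grid, which is exactly where a handful of CUs earns its keep.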
 
The 4 CUs could be very interesting if they could operate tightly with the CPU cores, i.e. read/write to each other's caches and in general have more fine-grained communication.
 
I believe the 4 CUs are there to support the CPU in balance with the GPU, like a poor man's Cell BE, mainly for familiarity for devs migrating from PS3 development. That's ONE way of making PS4 development easier...
 
Hmm... Do developers really need to familiarize themselves with GPGPU programming? The basic GPU helper concepts are similar. SPU coding seems different enough.

I do think that if the "helper" CUs are scheduled independently, have separate caches, and can DMA + share data with the GPU proper, they should simplify and accelerate GPU tasks.

In Orbis, the developers should have low-level access to build a flexible, "hybrid" pipeline.

EDIT: The relationship between the CPU and the "helper" CUs may be similar to Cell. The PPU loads the right SPU-lets into the SPUs and sends them the addresses of the data blocks; the SPUs schedule the work among themselves to process the data. The SPUs also manage their own cache.

Is there a need to tie the CPU cores to these "helper" CUs beyond DMA? Sharing cache between them would slow the regular accesses down, wouldn't it?


One difference is that Cell clocks at 3.2GHz, four times as fast as RSX, so the L1-level LocalStore latency helps the SPUs make up for the smaller number of cores (compared to the GPU). In Orbis, the CUs are all clocked the same? So the "helper" CUs will likely work on larger blocks (with more compute power) and keep in lockstep with the other CUs. Perhaps that would be easier to manage.

Cell's DMA list feature also helps to organize data for easy GPU consumption. It would be interesting to see whether AMD's GPU DMA is equally flexible, and whether Sony kept it.
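Just to make that "SPUs schedule the work among themselves" pattern concrete on a GPU: the closest compute-side analogue I can think of is a persistent work-queue kernel, where the CPU only publishes a list of job descriptors (addresses + sizes, much like a DMA list) and the workgroups pull them off a shared atomic counter themselves. A hedged sketch with an invented job format, nothing Orbis-specific:

```cuda
#include <cuda_runtime.h>

// A "job" is just an offset plus a count, much like the address + size
// pairs a PPU would push to the SPUs via DMA lists.
struct Job { int offset; int count; };

// Persistent-threads style: each block keeps grabbing jobs off a shared
// queue (an atomic counter) until the queue is empty, so the "helper"
// compute units effectively schedule the work among themselves.
__global__ void worker(const Job* jobs, int num_jobs, int* queue_head,
                       float* data)
{
    __shared__ int job_idx;
    while (true) {
        if (threadIdx.x == 0)
            job_idx = atomicAdd(queue_head, 1);   // claim the next job
        __syncthreads();
        if (job_idx >= num_jobs) return;          // queue drained

        Job j = jobs[job_idx];
        // Every thread in the block processes a slice of the job's data.
        for (int i = threadIdx.x; i < j.count; i += blockDim.x)
            data[j.offset + i] *= 2.0f;           // stand-in for real work
        __syncthreads();                          // finish before next claim
    }
}
```

The CPU's only role here is to fill in the job array and reset the counter, which is about as "Cell-like" a division of labour as current GPU compute gets.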
 
Hmm... Do developers really need to familiarize themselves with GPGPU programming? The basic GPU helper concepts are similar. SPU coding seems different enough.

I do think that if the "helper" CUs are scheduled independently, have separate caches, and can DMA + share data with the GPU proper, they should simplify and accelerate GPU tasks.

In Orbis, the developers should have low-level access to build a flexible, "hybrid" pipeline.

EDIT: The relationship between the CPU and the "helper" CUs may be similar to Cell. The PPU loads the right SPU-lets into the SPUs and sends them the addresses of the data blocks; the SPUs schedule the work among themselves to process the data. The SPUs also manage their own cache.

Is there a need to tie the CPU cores to these "helper" CUs beyond DMA? Sharing cache between them would slow the regular accesses down, wouldn't it?


One difference is that Cell clocks at 3.2GHz, four times as fast as RSX, so the L1-level LocalStore latency helps the SPUs make up for the smaller number of cores (compared to the GPU). In Orbis, the CUs are all clocked the same? So the "helper" CUs will likely work on larger blocks (with more compute power) and keep in lockstep with the other CUs. Perhaps that would be easier to manage.

Cell's DMA list feature also helps to organize data for easy GPU consumption. It would be interesting to see whether AMD's GPU DMA is equally flexible, and whether Sony kept it.
I get more and more iffy about all this talk about GPGPU / really wonder if it is worth the investment.
Intel's IGP can already share the same memory space as the CPU, though it is not exposed by the drivers.
The same is true for the latest Mali processors.

AMD talks big about HSA, but what I see is that nobody is rushing to pressure either ARM or Intel into properly exposing those features and the wins that are supposed to come along with them.

Then you have Nvidia (I mean the company that has been pushing CUDA), which said what it said about its latest mobile part/GPU and gave up quite a bit of compute performance in its latest line of desktop GPU products.

When I look at AMD's VLIW4 design and GCN, I see a 50% increase in transistor count that doesn't translate into that much of a win as far as graphics are concerned (it is better, that's it).
From Juniper to Cape Verde you add about 500 million transistors. At some point I think that manufacturers, AMD for example (Nvidia seems to have taken notice), may wonder whether it would not be better to invest those transistors in making their CPUs better, DSPs or whatnot, especially in AMD's case, after having tried to save transistors with their CMT experiment (IMO they are beating a dead horse; I hope they will survive that misstep :cry: ).

Looking at the next-generation systems, the comparison is valid too. Say one system embarked only a Cape Verde class GPU (1.5 billion transistors) for the sake of (quite massively) improved compute performance. If they were really that concerned with compute / GP performance then, hype aside, I wonder whether going with a 10-SIMD part (VLIW4 based), saving 500 million transistors in the process, and investing the saved transistors in more / better CPU cores would have been more rational.
I know that transistors are far from equal in density (logic/control flow vs ALUs vs storage), but still, it is quite a massive amount of transistors, and I am speaking of "only" a Cape Verde class of GPU; imagine with a hypothetical 18-SIMD VLIW4 part. 8O
 
I get more and more iffy about all this talk about GPGPU / really wonder if it is worth the investment.
Intel's IGP can already share the same memory space as the CPU, though it is not exposed by the drivers.
The same is true for the latest Mali processors.

AMD talks big about HSA, but what I see is that nobody is rushing to pressure either ARM or Intel into properly exposing those features and the wins that are supposed to come along with them.

Then you have Nvidia (I mean the company that has been pushing CUDA), which said what it said about its latest mobile part/GPU and gave up quite a bit of compute performance in its latest line of desktop GPU products.

When I look at AMD's VLIW4 design and GCN, I see a 50% increase in transistor count that doesn't translate into that much of a win as far as graphics are concerned (it is better, that's it).
From Juniper to Cape Verde you add about 500 million transistors. At some point I think that manufacturers, AMD for example (Nvidia seems to have taken notice), may wonder whether it would not be better to invest those transistors in making their CPUs better, DSPs or whatnot, especially in AMD's case, after having tried to save transistors with their CMT experiment (IMO they are beating a dead horse; I hope they will survive that misstep :cry: ).

Looking at the next-generation systems, the comparison is valid too. Say one system embarked only a Cape Verde class GPU (1.5 billion transistors) for the sake of (quite massively) improved compute performance. If they were really that concerned with compute / GP performance then, hype aside, I wonder whether going with a 10-SIMD part (VLIW4 based), saving 500 million transistors in the process, and investing the saved transistors in more / better CPU cores would have been more rational.
I know that transistors are far from equal in density (logic/control flow vs ALUs vs storage), but still, it is quite a massive amount of transistors, and I am speaking of "only" a Cape Verde class of GPU; imagine with a hypothetical 18-SIMD VLIW4 part. 8O

Haswell agrees with you ;)
 