I remember a while back someone mentioned that 32 ALUs were perhaps overkill and would never be used to their fullest. With this information, does it now make sense?
A bit of extra headroom so compute can always have some ALU resources to play with?
There are several possibilities. The amount of work could be too low to really fill all ALUs (that means getting full occupancy of 40 wavefronts [or whatever register usage allows] per CU). The throughput could be bottlenecked by another pipeline stage, leaving only a relatively low number of wavefronts in flight at a given time. Why doesn't that consume all the ALUs for a very short time then? I'd have thought, in my naivety, that vertex work saturates the ALUs for a brief moment to churn through those jobs, and then the ALUs sit on the pixel work for much of the scene, churning through that work, both pixels and then post-processing. In my understanding of unified-shader GPUs, there's not going to be a moment when a few ALU pipes are active and the rest are idle, or when the whole set of ALUs is idle waiting for something to do.
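To put rough numbers on the "40 wavefronts [or whatever register usage allows] per CU" remark, here is a minimal sketch of how VGPR usage caps occupancy on GCN. It assumes the commonly cited GCN limits (4 SIMDs per CU, 10 wavefront slots per SIMD, 256 VGPRs per SIMD, allocation in blocks of 4) and ignores SGPR and LDS limits, so treat it as an illustration rather than a precise tool:

```python
# Rough GCN occupancy estimate from vector register (VGPR) usage.
# Assumed limits: 4 SIMDs per CU, 10 wavefront slots per SIMD,
# 256 VGPRs per SIMD, VGPRs allocated in blocks of 4.
# SGPR and LDS limits are ignored for simplicity.

def wavefronts_per_cu(vgprs_per_wavefront: int) -> int:
    granularity = 4
    allocated = -(-vgprs_per_wavefront // granularity) * granularity  # round up to 4
    vgpr_limited = 256 // allocated
    return 4 * min(10, vgpr_limited)

for vgprs in (24, 48, 84, 128):
    print(f"{vgprs:3d} VGPRs -> {wavefronts_per_cu(vgprs)} wavefronts per CU")
# 24 VGPRs -> 40, 48 -> 20, 84 -> 12, 128 -> 8
```

A lightweight shader reaches the full 40 wavefronts per CU, while a register-hungry one can only keep a handful in flight, which is exactly the "too little work to hide latency" situation described above.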
What L2 cache tweak? The volatile tag exists in all GCN-based GPUs sold over the last 16 months. It's a GCN feature that finally gets used. Their modifications are mostly to prevent parallel jobs from stepping on each other's toes. The graphics compute jobs will probably be "consistent" w.r.t. other graphics jobs. OTOH, some non-graphics compute jobs may interfere with whatever is going on if not handled correctly (the article mentioned the L2 cache tweak).
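For anyone wondering what the volatile tag buys you, here is a toy model (purely illustrative, not the real GCN cache controller or any driver API): compute data can be marked volatile so that only those lines are dropped when a compute job needs coherent data, instead of flushing the whole L2 and evicting the graphics working set.

```python
# Toy model of volatile-tagged cache lines. The class and its behaviour are
# made up for the example; they are not the actual hardware mechanism.

class ToyL2:
    def __init__(self):
        self.lines = {}                      # address -> (data, volatile flag)

    def fill(self, addr, data, volatile=False):
        self.lines[addr] = (data, volatile)

    def invalidate_all(self):
        self.lines.clear()                   # blunt approach: graphics data gone too

    def invalidate_volatile(self):
        self.lines = {a: v for a, v in self.lines.items() if not v[1]}

l2 = ToyL2()
l2.fill(0x1000, "texture tile")                   # graphics working set
l2.fill(0x2000, "compute buffer", volatile=True)  # fine-grained compute data
l2.invalidate_volatile()                          # compute lines dropped...
print([hex(a) for a in l2.lines])                 # ...texture tile stays resident
```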
And the GCN hardware is capable of handling large amounts of different shaders/kernels with or without direct dependencies within the rendering pipeline (the latter would be asynchronous compute stuff). Wavefronts can be assigned different priorities for instance (for example, all wavefronts of a certain shader/kernel could be assigned a higher priority or some asynchronous background tasks can be assigned a lower priority). What Sony will probably add is a possibility for the devs to exert some influence on the work distribution and prioritization. Currently, there is no such possibility on PC GPUs (one can only change the priority in shader code once it got scheduled to a CU [and only when writing the shader in the native ISA, there is no possibility to do it through some higher level API], one can't set the base priority assigned to a wavefront upon creation), it is handled by game profiles in the driver. But that doesn't necessitate hardware changes, it's an API and firmware issue.
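As a purely conceptual sketch of the prioritization being described (hypothetical data structures, not Sony's or AMD's actual scheduler), the idea is simply that every wavefront carries a base priority assigned at creation, and the arbiter issues from the highest-priority ready wavefront, using age as a tie-breaker:

```python
# Conceptual model of priority-based wavefront arbitration. The dictionaries
# and the selection rule are illustrative assumptions, not real hardware.

ready_wavefronts = [
    {"name": "pixel shader wave",          "priority": 2, "age": 5},
    {"name": "async background compute",   "priority": 0, "age": 9},
    {"name": "high-priority compute wave", "priority": 3, "age": 1},
]

def pick_next(waves):
    # Highest base priority wins; among equal priorities, the oldest wave goes first.
    return max(waves, key=lambda w: (w["priority"], w["age"]))

print(pick_next(ready_wavefronts)["name"])   # -> high-priority compute wave
```

The point of the post is that today only the in-shader priority can be touched (and only in native ISA); exposing the base priority at wavefront creation is an API/firmware change, not a hardware one.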
Consider a part like the PS4 with 18 CUs, 72 SIMDs and 2 triangles/clock throughput. In this example there's one unique vertex per triangle. Each SIMD works on 16 vertices/clock, so to get work onto every SIMD it will take 16*72/2 = 576 clocks. Why doesn't that consume all the ALUs for a very short time then? I'd have thought, in my naivety, that vertex work saturates the ALUs for a brief moment to churn through those jobs, and then the ALUs sit on the pixel work for much of the scene, churning through that work, both pixels and then post-processing. In my understanding of unified-shader GPUs, there's not going to be a moment when a few ALU pipes are active and the rest are idle, or when the whole set of ALUs is idle waiting for something to do.
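Reproducing the 16*72/2 arithmetic as a quick calculation (the only added assumption is that the 72 SIMDs come from 18 CUs with 4 SIMDs each):

```python
# Clocks needed before every SIMD has vertex work, per the numbers above:
# 18 CUs x 4 SIMDs = 72 SIMDs, 2 triangles (= 2 unique vertices) per clock,
# and each SIMD being handed 16 vertices.

simds             = 18 * 4        # 72
vertices_per_simd = 16
vertices_per_clk  = 2

clocks_to_fill = simds * vertices_per_simd // vertices_per_clk
print(clocks_to_fill)             # 576
```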
First look at AMD G-series APUs [Jaguar]
According to the chart, the quad-core part running at 1.6 GHz with the onboard GPU disabled uses 15 W. The most power-hungry version draws 25 W [quad-core 2 GHz, GPU activated].
The highest performing SKU is a 25 watt TDP part running four cores at 2 GHz.
Jaguar Vanilla
* 1.8GHz LC Clocks (can be under-clocked for specific low-powered battery device needs - tablets, etc...).
* 2MB shared L2 cache per CU
* 1-4 CUs can be outfitted per chip. (i.e. 4-16 logical cores)
* 5-25 watts depending on the device/product. (45 watts is achievable under proper conditions)
PS4 Jaguar with chocolate syrup.
* 2GHz is correct as of now.
* 4MB of total L2 cache (2MB L2 x 2 CUs)
* 2 CUs (8 Logical cores).
* idles around 7 watts during non-gaming operations and around 12 watts during Blu-ray movie operations. Gaming is a mixed bag...
What would be nice is a fully loaded Jaguar chip.
Quit bringing up platforms beyond the scope of the thread.
Then why the f does Cerny sell it as one of the three compute customizations? In the end it all seems to be standard AMD GCN 1.1.
Okay. With this and 3dcgi's post, I think I understand. I was thinking of a wavefront occupying the ALUs for 100% of the time during its resolution, and then another wavefront following behind, so there was no idle time. I hadn't made the connection with delays in the other pipes. Although I'd want to hear how much of a frame can really be lost on a GPU by such stalls. Is there really a lot of spare GPU power going to waste that can be repurposed to compute? There are several possibilities. The amount of work could be too low to really fill all ALUs (that means getting full occupancy of 40 wavefronts [or whatever register usage allows] per CU). The throughput could be bottlenecked by another pipeline stage, leaving only a relatively low number of wavefronts in flight at a given time.
Thanks for the explanation. Consider a part like the PS4 with 18 CUs, 72 SIMDs and 2 triangles/clock throughput. In this example there's one unique vertex per triangle. Each SIMD works on 16 vertices/clock, so to get work onto every SIMD it will take 16*72/2 = 576 clocks.
If the oldest VS finishes in < 576 clocks, at least one SIMD is idle. A significant part of the VS's clocks are spent fetching data from the vertex buffer, constant memory, etc., so the ALUs aren't busy the entire time, leaving some idle time for compute.
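To make that concrete with some assumed numbers (the vertex-shader lifetime and the ALU fraction below are invented purely for illustration), the idle ALU time comes from two places: SIMDs that retire their vertex wavefronts before the 576-clock fill window closes, and fetch stalls inside the wavefront's lifetime:

```python
# Illustrative only: vs_lifetime and alu_fraction are assumed numbers, not
# measurements. The 576-clock fill window comes from the calculation above.

fill_window  = 576    # clocks to get vertex work onto every SIMD
vs_lifetime  = 400    # assumed clocks for a vertex-shader wavefront to retire
alu_fraction = 0.6    # assumed share of that lifetime spent doing ALU work

idle_after_finish = fill_window - vs_lifetime          # SIMD done, no new work yet
idle_during_vs    = vs_lifetime * (1 - alu_fraction)   # stalled on vertex/constant fetches

print(idle_after_finish, round(idle_during_vs))        # 176 and 160 clocks of slack
```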
I really like this guy.
The main reason I feel like he really knows what he is talking about is that not only is he designing the hardware, but he's also making the games to run on that hardware. Talk about a really close relationship with the hardware, literally and figuratively!
If anyone should know what direction PS4 is trying to go for anything, this is your guy!
That would really depend on the game and how well compute work complements graphics work. If compute needs whatever is bottlenecking the graphics shader, it will slow down graphics work. The more ALU-heavy the compute work is, the more effective this will be, as it's easier to become fetch bound than ALU bound. Although I'd want to hear how much of a frame can really be lost on a GPU by such stalls. Is there really a lot of spare GPU power going to waste that can be repurposed to compute?
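One way to reason about "ALU heavy" versus "fetch bound" is to compare a kernel's arithmetic intensity against the machine balance of a PS4-class part (roughly 1.84 TFLOPS against 176 GB/s). The kernel figure below is an assumption for illustration:

```python
# Machine balance of a PS4-class GPU vs an example kernel's arithmetic intensity.
# The kernel's flops-per-byte value is made up for the example.

peak_flops      = 18 * 64 * 2 * 0.8e9   # 18 CUs x 64 lanes x FMA x 800 MHz ~= 1.84 TFLOPS
peak_bandwidth  = 176e9                 # GDDR5 bytes per second
machine_balance = peak_flops / peak_bandwidth    # ~= 10.5 flops per byte

kernel_flops_per_byte = 4.0             # assumed: a fairly fetch-heavy compute job

if kernel_flops_per_byte < machine_balance:
    print("fetch bound: it will compete with graphics for bandwidth")
else:
    print("ALU bound: a good candidate for soaking up idle ALU cycles")
```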
Good that he picked UE4 as the platform for his game. It will translate better to third party work.