AMD: Speculation, Rumors, and Discussion (Archive)

Compilation problems are just a small corner of the problem for game performance. Ultimately ALU throughput isn't the bottleneck; it's pretty much everything else in the GPU that's slowing the ALUs down.
 
Ok wait, from what I understand of what you're saying: due to compilation, other parts of the GPU aren't used optimally, and that leads to the ALUs not being fully utilized, because they might be waiting on other parts of the GPU?
 
Ok wait, from what I understand of what you're saying: due to compilation, other parts of the GPU aren't used optimally, and that leads to the ALUs not being fully utilized, because they might be waiting on other parts of the GPU?
If I understood him correctly, he's saying the compiler leads to less-than-optimal ALU utilization. But he's also saying that other parts of the GPU are slowing the ALUs down with regard to game performance as well... and that has nothing to do with the compilation.
 
due to compilation, other parts of the GPU aren't used optimally, and that leads to the ALUs not being fully utilized, because they might be waiting on other parts of the GPU?
Like any other architecture, GCN has had tons of issues since release, and to get the best out of GCN those issues must be worked around by the programmers. There are all kinds of pipeline bubbles and inefficiencies: low wavefront occupancy during simple passes like the z-prepass and shadow pass, low command processor throughput in some cases, low primitive rate, low tessellation rate, low utilization with low triangle counts, etc., etc. Very broad architectures like GCN (broad wavefronts, plus several issue ports per SIMD for LD/ST, branch and scalar instructions) suffer the most from such inefficiencies. So he probably meant that there are other things in the graphics pipeline which can limit performance to a greater extent than the compiler's inefficiencies around loop unrolling, and those things could be the primary bottleneck for GCN performance.

Fortunately, there are good compute shaders with async support which can fix some of GCN's graphics pipeline inefficiencies to some extent. Unfortunately, additional programming effort is required to hand-tune async for tons of chips with different capabilities and FixedPipe/ALU/BW ratios, and for the tons of graphics options and resolutions available on PC (since bottlenecks can shift with different options, resolutions, shader modifications, RAM bandwidth, etc.), while the ~5% performance outcome in some cases is still highly questionable compared with more general and useful algorithmic-level optimizations like the ones Intel does in their research papers.
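
To make the async idea a bit more concrete, here's a rough analogue using CUDA streams rather than GCN/D3D12 async queues (the kernels, names and sizes below are made up for illustration): a bandwidth-bound kernel leaves the ALUs idle, and a compute-heavy kernel launched in a second stream can overlap with it and soak up that idle time, which is the same trick async compute plays on pipeline bubbles.

#include <cuda_runtime.h>

// Hypothetical bandwidth-bound kernel: mostly waits on memory.
__global__ void bandwidth_bound(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

// Hypothetical ALU-bound kernel: long dependent math chain, little memory traffic.
__global__ void alu_bound(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = fmaf(x, 1.0001f, 0.1f);   // dependent FMAs keep the ALUs busy
    data[i] = x;
}

int main() {
    const int n = 1 << 22;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launching the two kernels in separate streams lets the scheduler
    // overlap them, filling the ALU idle time of the bandwidth-bound work
    // with the compute-heavy kernel.
    bandwidth_bound<<<n / 256, 256, 0, s0>>>(a, b, n);
    alu_bound<<<n / 256, 256, 0, s1>>>(c, n, 512);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

As in the real async compute case, the gain depends entirely on the two workloads actually being limited by different resources.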

I'd prefer to see more optimisations at the algorithmic level, like Sebbbi does, rather than all this async stuff to fix pipeline inefficiencies of GCN. There are 30% gains from async in some corner cases, but they really are corner cases which have nothing to do with real life. I don't buy something like The Children of Tomorrow as a viable example of 25% gains applicable to normal games, because it's a demo with very simple geometry and slow geometry shaders (which are used for CR), so there must be very low utilization with such simple geometry plus geometry shaders on GCN, and a lot of resources are left available for something like asynchronous compute. Hopefully, developers will switch their attention back to all users of the PC market and do better algorithmic-level optimisations, rather than continue to waste time on GCN-specific optimisations only.
 
None that I know of. Crytek still pushes and makes their own things, but it's nothing earth-shattering like in the past.
 
Epic could have done something big using Unreal Engine 4, but they've sold their soul these days doing MOBA/tower defense shit on mobile platforms...


Not to go OT, but yeah, that's right: Epic has the resources to do much more than what they have done so far, to really think outside the box when it comes to the graphics they could produce. I think their fear of Unity made them push the engine into a similar marketing model as Unity... Personally, and I think many other devs here would probably agree, there was a reason why Unreal Engine was and still is one of the best AAA game engines, and that's because of its tools. In the past they were always able to show off cool graphics with great artwork, but with new modelling and sculpting software, even novice artists look much better lol.
 
What does wavefront occupancy have to do with it being a z-prepass or shadow pass?
It has (literally) nothing to do with the shadow pass, since the vertex/pixel shaders are either too simple or nonexistent and the CUs are mostly unused. With a z-prepass, all geometry inefficiencies are doubled, which increases the chance of pipeline stalls due to tessellation, triangle setup or other bottlenecks in the pipeline (= decreased CU occupancy). Objects can be batched in the z-prepass though, so small objects are less likely to be an issue here.
 
Many years writing shaders. Modern compilers are CPU-centric. They have no concept of the threads-in-flight versus register-allocation trade-off. AMD's compiler will allocate registers until only one hardware thread can be in flight.
I don't know all the tradeoffs it makes, but AMD's compiler is aware of the need for latency hiding and tries to reduce register usage. It won't always use more registers to save ALUs.
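
For what it's worth, on the CUDA side the same trade-off is exposed directly to the programmer. A rough sketch (the kernel and the numbers are invented) of capping register usage so that more warps can stay in flight:

#include <cuda_runtime.h>

// __launch_bounds__(256, 4) asks the compiler to limit register usage so
// that 256-thread blocks can run with at least 4 blocks resident per SM.
// Without the hint, the compiler may allocate more registers (saving ALU
// instructions) at the cost of fewer resident warps/wavefronts and
// therefore less latency hiding -- the trade-off described above.
__global__ void __launch_bounds__(256, 4)
register_limited(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    #pragma unroll 4                  // modest unroll: more ILP, more registers
    for (int k = 0; k < 64; ++k)
        acc += in[(i + k) % n] * 0.25f;
    out[i] = acc;
}

Building with nvcc -Xptxas -v prints the per-thread register count, so you can see how the bound changes the allocation.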
 
3dilettante, the compute block you're thinking of might be the MEC (MicroEngine Compute)? Each MEC block manages 4 "pipes", each supporting up to 8 "queues" (rings).
In this case, I had forgotten what fell under the DCE heading. The shared base number with Carrizo made me wonder whether some of the Carrizo-specific items in the GPU's memory handling could have carried over. Additional searching found clearer references to the display controller, and onQ's post noted that as well.

I think somewhere along the line AMD talked about instruction caching having an effect on IPC.

I don't buy it. There would need to be multiple massive shaders/kernels trying to run on a set of CUs simultaneously to get even close to exhausting instruction cache. I believe 4 CUs share an instruction cache of 32KB.
The slide listing the defining features for 4th Gen GCN has an entry for instruction prefetch that would seem to agree with you.
Conflict or capacity misses would, if anything, be worsened by prefetch, although what prefetch is going on and what form of it is actually new with the 4th gen is unclear.
I wouldn't think that the wavefront instruction buffers would be what AMD considered a 4th gen feature, although fetching more instructions beyond the one being decoded is a very basic form of prefetch.

It might be something like expanding the command processor's prefetch capability for reading in buffers, and also running ahead and initiating transactions for instruction pages, which wouldn't directly affect the CU instruction cache. It might work for initiating PCIe transactions versus counting on the limited lifetime of data on-chip.

Actually prefetching into the CU Icache would point to some other motivation, like a higher cost to compulsory misses from that cache, or a less flexible pipeline that doesn't support many outstanding misses before inter-wavefront stalls start happening.
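
As a rough illustration of how code footprint can blow past a small shared instruction cache (a CUDA sketch, not GCN ISA, and the kernel is invented): fully unrolling a long loop trades branches for a large stretch of straight-line code, and a few hot loops like that resident on CUs sharing ~32KB of I-cache make capacity misses plausible.

// Full unroll of the 1024-iteration loop emits on the order of a thousand
// FMA instructions into the binary, so the kernel's hot code alone occupies
// a significant slice of a small instruction cache.
__global__ void big_footprint(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    #pragma unroll                      // full unroll: large code size
    for (int k = 0; k < 1024; ++k)
        x = fmaf(x, 1.000001f, 1e-6f);
    data[i] = x;
}

On the NVIDIA side the generated code can be inspected with cuobjdump --dump-sass; AMD's shader analysis tooling gives a similar view of the GCN ISA.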
 
It has (literally) nothing to do with the shadow pass, since the vertex/pixel shaders are either too simple or nonexistent and the CUs are mostly unused. With a z-prepass, all geometry inefficiencies are doubled, which increases the chance of pipeline stalls due to tessellation, triangle setup or other bottlenecks in the pipeline (= decreased CU occupancy). Objects can be batched in the z-prepass though, so small objects are less likely to be an issue here.
Are you sure you don't mean utilization instead of occupancy?

Another wish-list improvement: a root buffer size increase. 64 bytes (16 DWORDs) is so little.
Yeah, I was wondering if they would implement the 64-DWORD limit as well.
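
Just to spell out the arithmetic behind those numbers (assuming the commonly quoted D3D12 costs of 1 DWORD per 32-bit root constant, 1 per descriptor table and 2 per root descriptor; the layout below is made up):

#include <cstdio>

int main() {
    const int kDwordBytes = 4;
    const int gcnUserData = 16;            // 16 DWORDs = 64 bytes of user data
    const int d3d12RootLimit = 64;         // 64 DWORDs = 256 bytes

    // Example layout: 2 descriptor tables + 1 root CBV + 8 root constants.
    int cost = 2 * 1 + 1 * 2 + 8 * 1;      // = 12 DWORDs
    printf("layout cost: %d DWORDs (%d bytes)\n", cost, cost * kDwordBytes);
    printf("fits in %d-DWORD hardware user data: %s\n",
           gcnUserData, cost <= gcnUserData ? "yes" : "no");
    printf("fits in %d-DWORD root signature limit: %s\n",
           d3d12RootLimit, cost <= d3d12RootLimit ? "yes" : "no");
    return 0;
}

So a 16-DWORD budget is exhausted by a fairly modest root layout, while anything beyond it has to spill into memory.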
 
Occupancy is the ability to have wavefront after wavefront available to issue, so that latency can be hidden. Utilization is all about whether the ALUs are doing work or sitting idle.
 
Occupancy in this context has been defined by Nvidia as the number of active warps divided by the maximum number of warps the hardware can have active. Warp, wavefront, or hardware thread can be used interchangeably here.
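
CUDA actually exposes exactly that ratio through its occupancy API, so as a concrete NVIDIA-side sketch (the kernel is a made-up placeholder):

#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel just to have something to query occupancy for.
__global__ void dummy_kernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;
}

int main() {
    const int blockSize = 256;

    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, dummy_kernel, blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy as defined above: active warps / maximum active warps per SM.
    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy = %d / %d = %.2f\n",
           activeWarps, maxWarps, (double)activeWarps / maxWarps);
    return 0;
}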

On the topic of GFX IP level, I am curious whether increased FP16 vs FP32 throughput can sneak under the same version number, or if that is different enough to be visible to the driver.
 