AMD: Speculation, Rumors, and Discussion (Archive)

Well, look at it this way: if there is an increase in throughput, the need for async gets less and less. If they are really going to tout async, then I don't expect much change in throughput, so it might actually come in handy. And if they do increase throughput, what's the need to extract the small leftover percentage of performance? *time vs. cost*

Also, all the async figures we have seen look great for certain GPUs within the GCN family, while others just don't get anywhere near the same benefits.

Now add to this every single console having two versions, with two different classes of GPU, but with all the consoles coexisting; that just creates major headaches for console devs and for porting. *We don't know about Xbox One yet, but I wouldn't be surprised if they take the same route as the PS. I can foresee a BM in the future.*
 
Do you think leaving ALUs idle is a good thing? Do you think the console devs using this in their games is an illusion?
Async compute is a very good idea, but ironically has more potential on GPU architectures that aren't very good at keeping all their cores busy with something to do (workload permitting..).

It's not a silver bullet though, as it might increase cache thrashing and actually decrease performance.
 
Do you think leaving ALUs idle is a good thing? Do you think the console devs using this in their games is an illusion?

Leaving them idle is definitely not a good thing. Even better would be minimizing the amount of idle time to begin with. Async is a useful scalpel to shape workloads to match a given architecture's capabilities.

The numbers coming out now aren't really demonstrating any tangible benefit, hence my question: is there meat to this or just hype?
 
Do you think leaving ALUs idle is a good thing? Do you think the console devs using this in their games is an illusion?
Ideally, your ALUs are kept busy without explicit hints. If you cannot do that, you would need Async. The less you can do that, the greater the profit from AC.
 
Leaving them idle is definitely not a good thing. Even better would be minimizing the amount of idle time to begin with. Async is a useful scalpel to shape workloads to match a given architecture's capabilities.

The numbers coming out now aren't really demonstrating any tangible benefit, hence my question: is there meat to this or just hype?
Isn't that what they just did? Leave compute tasks capable of running asynchronously in a separate queue so they can be scheduled optimally by the driver. A strategy that seems to require very little time from developers. There isn't even a requirement that it actually executes asynchronously.
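For what it's worth, here's roughly what that looks like on the API side, as a minimal D3D12 sketch (the device is assumed to exist already and the helper name is mine): you create a second queue of the compute type and submit to it exactly like you would to the direct queue. The API makes no promise of concurrency, it just exposes the work as independently schedulable.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: create a dedicated compute queue next to the usual
// direct (graphics) queue. 'device' is an already-initialized ID3D12Device.
ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute-only queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Submission is the same call as on the direct queue, e.g.:
// computeQueue->ExecuteCommandLists(1, &someComputeList);
// Whether it actually runs concurrently with graphics is up to driver/hardware.
```

That's about the whole developer-facing cost of "putting it in another queue"; the real effort is in synchronization and in picking workloads that don't fight over the same resources.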

Ideally, your ALUs are kept busy without explicit hints. If you cannot do that, you would need Async. The less you can do that, the greater the profit from AC.
Ideally, but it's a lot easier to scale ALUs than TMUs and ROPs and crafting shaders to use the extra ALUs all the time is another problem. Being able to balance bottlenecks across kernels seems a beneficial solution. Why disable/idle ALUs at the beginning of the frame just to be restricted by ALU throughput later?
 
Isn't that what they just did? Leave compute tasks capable of running asynchronously in a separate queue so they can be scheduled optimally by the driver.

I'm not quite sure how much say the driver has over scheduling of the various queues.

A strategy that seems to require very little time from developers. There isn't even a requirement that it actually executes asynchronously.

Everything I've read so far says quite the opposite. Devs need to be careful and deliberate when using async. Improper usage can hose performance (fighting for the same resources, cache thrashing etc)
 
232mm² works out to 3072 of AMD's stream processors (are they still called that?) and a 256-bit bus, though it might be 2560; there's room for error there. Whichever it is, a 40% clock increase and a 15% efficiency boost would work out to the rumored 980 Ti-like performance, perhaps just above a Titan X by 1% or so, to a bit below a Fury X. Which either way is devastatingly small and low power if even the 2x performance per watt claim shows up. It's no wonder Nvidia seems eager to paper launch their products, with GP100 "launched" without even mention of availability and GP104 paper launching sometime this month, even though availability isn't supposed to happen until computex. Well, if it all adds up then good for AMD.

AMD is pushing async real hard. Do they genuinely believe it's such a big deal or is it just marketing?

http://www.amd.com/en-us/innovations/software-technologies/radeon-polaris#

Async compute is quite nice from an efficiency standpoint. If you're clever and running the right sort of program (e.g. a game) you can get 10-20% more performance out of a GPU. Though half the reason they're PRing it so much is that NVIDIA doesn't have proper support.
 
Async compute is quite nice from an efficiency standpoint. If you're clever and running the right sort of program (e.g. a game) you can get 10-20% more performance out of a GPU. Though half the reason they're PRing it so much is that NVIDIA doesn't have proper support.

nVIDIA supposedly does not have proper support. Shall we remind ourselves of what happened with tessellation? Tessellation was touted as a clear ATI/AMD thing, since they invented it years before and the 5800 series was their fifth or so generation with it. I still remember the incredulity and Charlie's attempts to explain how nVIDIA nailed it with Fermi and sped far ahead of AMD, going as far as speculating that nVIDIA was cheating on the Unigine benchmark. We all know how that turned out.
 
Isn't that what they just did? Leave compute tasks capable of running asynchronously in a separate queue so they can be scheduled optimally by the driver. A strategy that seems to require very little time from developers. There isn't even a requirement that it actually executes asynchronously.
Sort of, yeah. But what I meant was a hardware solution that works automatically, so that every application that uses a lot of GPU power profits from it.

Ideally, but it's a lot easier to scale ALUs than TMUs and ROPs and crafting shaders to use the extra ALUs all the time is another problem. Being able to balance bottlenecks across kernels seems a beneficial solution. Why disable/idle ALUs at the beginning of the frame just to be restricted by ALU throughput later?
Yeah, why would you disable ALUs? There's no point in it, but you have to work extra hard to keep them busy at all times. All times, as in not only when specially written programs trigger certain hints.
 
As far as I remember, the async compute stuff has gotten more programmable over time with GCN; with GCN 1.2 you can now schedule compute/graphics tasks as you wish, using what they've called a "quick response queue". Which is to say, if you need an async compute task done at a certain time, you can preempt other resources to finish it even if it was running in "spare" resources before. But either way you need to be rather explicit in calling it; it's not something you'd generally just want to let the driver handle by itself.

Regardless, you're never going to get 100% utilization on a GPU today. Rewriting things in assembly to make sure GPU utilization is high and the actual shader is maximally efficient has gone by the wayside, as it's simply become too complex to do in any practical sense. And besides, there are far too many higher-level optimizations to be done today to even think about how to schedule everything for maximum utilization. By the time you've implemented sample distribution shadow maps with screen-space shadow occlusion culling, you look up and see something called sparse shadow maps that's faster, and now there's implicit clustered light culling which is faster than tiled light culling, but you've just spent a long time building your pipeline around explicit tiled light culling, so...

Async compute, while requiring effort, can definitely be worth the cost, as you are going to have bubbles of underutilization one way or another. A more abstract way to just fill that time with something useful is highly appreciated. E.g. during a shadow map pass, where your rasterizer is doing a lot of work but the ALUs aren't, you can potentially get a lot done.
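As a sketch of that shadow-map example (the queue, fence, and command-list names here are hypothetical, and whether anything truly overlaps is up to the hardware and driver): submit the rasterizer-heavy pass on the direct queue, the ALU-heavy kernel on the compute queue, and fence before the pass that consumes both results.

```cpp
#include <d3d12.h>

// Hypothetical frame fragment. All queues, lists and the fence are assumed to
// have been created elsewhere (fence via CreateFence(0, D3D12_FENCE_FLAG_NONE, ...)).
void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12Fence*        fence,
                 UINT64&             fenceValue,
                 ID3D12CommandList*  shadowPassList, // rasterizer/ROP heavy
                 ID3D12CommandList*  lightCullList,  // ALU heavy
                 ID3D12CommandList*  mainPassList)   // consumes both results
{
    // Shadow-map pass: the ALUs are largely idle while the rasterizer works.
    graphicsQueue->ExecuteCommandLists(1, &shadowPassList);

    // Independent compute work on the second queue; the GPU's scheduler gets
    // the chance to fill those idle ALUs, but nothing forces it to.
    computeQueue->ExecuteCommandLists(1, &lightCullList);

    // The main pass reads both results, so the direct queue waits on a fence
    // signalled by the compute queue.
    computeQueue->Signal(fence, ++fenceValue);
    graphicsQueue->Wait(fence, fenceValue);
    graphicsQueue->ExecuteCommandLists(1, &mainPassList);
}
```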
 
Which either way is devastatingly small and low power if even the 2x performance per watt claim shows up.
So AMD has claimed that the division in perf/W increase is 30% for design and 70% for process.

Based on the Fury X vs Titan X numbers on hardware.fr, the perf/W ratio between the 2 is almost exactly 1.30 (37.3 / 28.7). And that's despite Fury X benefitting from HBM power efficiency, so the numbers are a bit flattering for AMD in terms of architecture efficiency.

Nvidia is largely keeping the architecture of Maxwell, so let's apply 70% process benefit, for a perf/W ratio of 63.4. For AMD to get even with Nvidia, it needs to improve perf/W by 63.4/28.7 = 2.21. But taking into account the lack of HBM, it's probably a bit more.

I think that's very reasonable and achievable. I also think that's about the best we can expect from AMD. I don't see at all how that would be devastating. It's just AMD catching up to where Nvidia was 2 years ago in terms of architecture alone.
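To make the arithmetic above easy to reproduce (same hardware.fr-derived numbers as quoted in the post, nothing new measured), a trivial sketch:

```cpp
#include <cstdio>

int main()
{
    // perf/W indices quoted above from hardware.fr's Fury X review.
    const double furyX  = 28.7;
    const double titanX = 37.3;

    std::printf("Titan X vs Fury X perf/W ratio: %.2f\n", titanX / furyX);       // ~1.30

    // Apply only the assumed 70% process gain to Maxwell's figure.
    const double pascalEstimate = 1.70 * titanX;                                 // ~63.4
    std::printf("Estimated 16FF+ perf/W, architecture unchanged: %.1f\n", pascalEstimate);

    // Factor AMD needs over Fury X just to pull even (before the HBM caveat).
    std::printf("Required AMD perf/W improvement: %.2fx\n", pascalEstimate / furyX); // ~2.21
    return 0;
}
```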

It's no wonder Nvidia seems eager to paper launch their products, with GP100 "launched" without even mention of availability and GP104 paper launching sometime this month, even though availability isn't supposed to happen until computex. Well, if it all adds up then good for AMD.
AMD kicked off their paper launch of Polaris in December for reasons unknown, so I don't see the big deal.
 
I'm not quite sure how much say the driver has over scheduling of the various queues.
The developer wouldn't have control past the driver level, short of a convoluted use of barriers. Which part actually does the scheduling likely depends on the architecture, but beyond the driver it's out of the developer's hands anyway.

Everything I've read so far says quite the opposite. Devs need to be careful and deliberate when using async. Improper usage can hose performance (fighting for the same resources, cache thrashing etc)
Careful, sure, but that's always the case, since you could stall the pipeline. Oxide, for example, said they implemented it in a weekend just to see what it could do; that doesn't sound difficult to me. I'm not sure we've heard anything from console devs beyond significant performance gains. I think the issue is some devs overthinking the implementation: you just want to ensure the compute queues are filled while executing graphics, not pick two compute tasks and add barriers to try to force them to execute concurrently. Even with the compute queue filled, there is no reason to expect tasks not to execute serially. It shouldn't be that different from hyperthreading, where you have a second thread to schedule on available hardware.

Sort of, yeah. But what I meant was a hardware solution that works automatically, so that every application that uses a lot of GPU power profits from it.
At least on newer GCN versions I think it does work automatically. It has to do some sort of tuning, unless it's using a round robin dispatch of all the queues, or you'd just flood the hardware with a single available kernel. I really haven't seen any clarification from AMD on just how they select wavefronts for scheduling. If you follow the thought process that they wanted concurrent execution, having the scheduler target ratios of graphics:compute or fetch:alu doesn't seem unreasonable. I'd imagine it's not available because Nvidia is still working out the details for their implementation.

Yeah, why would you disable ALUs? There's no point in it, but you have to work extra hard to keep them busy at all times. All times, as in not only when specially written programs trigger certain hints.
Power efficiency going off that recent patent. Throughput would reduce to whatever was required by disabling or downclocking ALUs. You would basically guarantee the hardware was always close to full utilization.
 
So AMD has claimed that the division in perf/W increase is 30% for design and 70% for process.

Based on the Fury X vs Titan X numbers on hardware.fr, the perf/W ratio between the 2 is almost exactly 1.30 (37.3 / 28.7). And that's despite Fury X benefitting from HBM power efficiency, so the numbers are a bit flattering for AMD in terms of architecture efficiency.

Nvidia is largely keeping the architecture of Maxwell, so let's apply 70% process benefit, for a perf/W ratio of 63.4. For AMD to get even with Nvidia, it needs to improve perf/W by 63.4/28.7 = 2.21. But taking into account the lack of HBM, it's probably a bit more.

I think that's very reasonable and achievable. I also think that's about the best we can expect from AMD. I don't see at all how that would be devastating. It's just AMD catching up to where Nvidia was 2 years ago in terms of architecture alone.


AMD kicked off their paper launch of Polaris in December for reasons unknown, so I don't see the big deal.

Eh, it highly depends on what you're benchmarking. If you're looking at newer DX12 stuff like Hitman in DX12 mode or Ashes of the Singularity, AMD already has a performance, and even performance per watt, advantage over Nvidia, but who knows if that's just those games or whether it will continue to translate to other new titles? Either way, 980 Ti-like performance at 150 watts is around 20% more efficient than Nvidia's GP100 numbers, and that comparison is based solely on Nvidia's own figures, so there's no "this title vs. that title" caveat.

Which is obviously great for AMD, but that's assuming there's actually a 980ti like performance at 150 watts. If it's 175 watts then it's much closer to parity for the two. And the January thing was just a tease, so it definitely seems to me that Nvidia is eager for whatever reason to put out all the information they have on their new cards, while AMD is, for whatever reason, more content to wait/tease little by little.
 
Eh, it highly depends on what you're benchmarking. If you're looking at newer DX12 stuff like Hitman in DX12 mode or Ashes of the Singularity, AMD already has a performance, and even performance per watt, advantage over Nvidia, but who knows if that's just those games or whether it will continue to translate to other new titles? Either way, 980 Ti-like performance at 150 watts is around 20% more efficient than Nvidia's GP100 numbers, and that comparison is based solely on Nvidia's own figures, so there's no "this title vs. that title" caveat.

Which is obviously great for AMD, but that's assuming there's actually a 980ti like performance at 150 watts. If it's 175 watts then it's much closer to parity for the two. And the January thing was just a tease, so it definitely seems to me that Nvidia is eager for whatever reason to put out all the information they have on their new cards, while AMD is, for whatever reason, more content to wait/tease little by little.


Well, look at it this way: AMD would have used the best case to "market" their hardware. If the best case is 2.5 times perf per watt, they may have gotten close to Maxwell 2 in this department (they are definitely using Hawaii as their baseline, not Fury, since Fury has HBM; if they were using Fury it wouldn't be 2.5x, that's for sure, once we normalize by removing the additional gain from HBM). Let's not even go into whatever Pascal's architecture brings to the table. This is why I said the only way for AMD to catch up that quickly on this front is if nV somehow screws up.

Why would they be telling OEMs this number if it wasn't a best case, using games like Hitman and SW Battlefront, which favor AMD, and then putting on a frame rate lock, which they definitely have an advantage with?

Also, for P100 we don't know what the performance is; all we know is that it should be faster than a 980 Ti and will use more power, since it's got DP.

nV is eager to stop any market share gains by AMD, just as AMD is eager to gain OEM support and market share in mobile and midrange. This is typical marketing and getting the word out there.

"hey we have something to show you and they aren't the only ones"
 
Which is obviously great for AMD, but that's assuming there's actually a 980ti like performance at 150 watts. If it's 175 watts then it's much closer to parity for the two.
The hardware.fr numbers don't use canned 150W or 175W numbers. They use measured power. So let's ignore marketing slogans for a change?

Furthermore, if you compare the GTX 980 Ti standard against a 980 Ti super clocked in the same article (Fury X review), the perf/W numbers go up for the latter, not down. Now I don't want to pin anything down on the specifics here, there's a lot of noise between samples, but flat out claiming that higher clocked (and higher power) models skew the data in a bad way for a 980 Ti is not warranted.

The GP100 numbers are useless for comparison. All we have is a single number that's probably used as guidance for data center architects, so they're very likely to be worst case as opposed to real life gaming cases. We don't know yet what Pascal consumer chips will lack compared to GP100, so let's give that a rest. Similarly, I didn't add any architectural improvements for Pascal either.

Note also that TSMC lists a 70% power reduction for 16FF+. All things equal, power dropping to 30% at the same performance would be roughly a 230% increase in perf/W, not the 70% that I used. It remains to be seen what the real number is, but I don't think I've been overly aggressive there.

Finally, wrt which games to benchmark: the DX12 games perform better on AMD. That means they do more work, which means they likely consume more power as well. Without redoing the accurate measurements that were done by hardware.fr, you can't just change the perf numerator while keeping the power denominator constant. It's just data we don't have. (Mr. Triolet, that's a suggestion for an article right there!)

And the January thing was just a tease, so it definitely seems to me that Nvidia is eager for whatever reason to put out all the information they have on their new cards, while AMD is, for whatever reason, more content to wait/tease little by little.
It was a tease with power numbers, performance suggestions and comparisons with Nvidia GPUs. I call that the start of a drawn out paper launch. YMMV.
 
Well, look at it this way: AMD would have used the best case to "market" their hardware. If the best case is 2.5 times perf per watt, they may have gotten close to Maxwell 2 in this department (they are definitely using Hawaii as their baseline, not Fury, since Fury has HBM; if they were using Fury it wouldn't be 2.5x, that's for sure, once we normalize by removing the additional gain from HBM)

Why wouldn't it be Tonga?
 