AMD Vega Hardware Reviews

It's odd how IPC has seeped into GPU terminology; it makes little sense in this context. The IPC of a chip is just the number of SPs it has enabled; what we're actually interested in is how well the shader array can be kept saturated.

If the Vega FE benchmarks are taken to be final, then it is really shader efficiency that has decreased vs. Fiji. AotS is not a geometry-heavy title, and this was made apparent by Fiji's performance in it. Why does Vega not perform in line with its 13 TFLOPS throughput?
 
I don't follow?
The 2x FP16 or 4x INT8 throughput is unlikely to come from the ability to issue 2 or 4 times as many independent instructions as in the FP32 case. It's much more likely to come from issuing the same number of instructions, with each instruction specifying an operation on 2x or 4x the number of elements, so that IPC remains constant.
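A minimal sketch of what that means in practice, assuming the GCN-style arrangement where one 32-bit lane holds two FP16 values. The helper names below are illustrative, not lifted from any spec; the point is just that the issue rate stays flat while the number of elements touched per instruction doubles.

```python
import struct

def pack_fp16x2(lo, hi):
    """Pack two FP16 values into one 32-bit word (low half, high half)."""
    lo_bits, hi_bits = struct.unpack("<2H", struct.pack("<2e", lo, hi))
    return lo_bits | (hi_bits << 16)

def unpack_fp16x2(word):
    """Split a 32-bit word back into its two FP16 halves."""
    return struct.unpack("<2e", struct.pack("<2H", word & 0xFFFF, word >> 16))

def packed_add_f16(a, b):
    """One 'packed add': both FP16 halves are handled by a single operation."""
    a_lo, a_hi = unpack_fp16x2(a)
    b_lo, b_hi = unpack_fp16x2(b)
    return pack_fp16x2(a_lo + b_lo, a_hi + b_hi)

x = pack_fp16x2(1.5, 2.0)
y = pack_fp16x2(0.25, -1.0)
print(unpack_fp16x2(packed_add_f16(x, y)))  # (1.75, 1.0): two results, one issue slot
```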

I was assuming an IPC increase meant less stalling (from e.g. better latency hiding) or some sort of instruction co-issue, etc... Then again, in the absence of any other improvements, I would expect IPC to decrease as core clock speed increases. If IPC is really the same as Fury X, except now at 1600MHz, then it does seem that some improvements were made.
 
I don't follow?

If AMD's marketing for NCU's improved IPC is using packed math as the source of the increase, it is being misleading. The term IPC was not used that way prior, and shouldn't be used that way to inflate what work AMD did to improve its architecture's ability to issue more instructions or avoid stalls. Packed math changes what operations an instruction does, but it requires specifically targeting the feature and the individual operations are more constrained due to the overall architecture not reflecting the operations. Branching, registers, and general memory accesses are still handled at the natural width of the architecture, which limits what packed math applies to.

It's odd how IPC has seeped into GPU terminology; it makes little sense in this context. The IPC of a chip is just the number of SPs it has enabled; what we're actually interested in is how well the shader array can be kept saturated.

IPC started out in the context of determining single-threaded performance in terms of instructions a pipeline could execute per clock, which served as a general means of comparing chips within an architectural family that operated at different clocks. Multi-threading and multiple cores left IPC as a metric generally applied to straightline execution within a thread.
Counting instruction throughput per clock across multiple threads and cores usually happens when a marketing department needs help obfuscating a weak architecture, like what some tried with Bulldozer.
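For what it's worth, the arithmetic behind the term is just the usual instructions-retired-over-cycles ratio, which only becomes a performance figure once you multiply by clock. The numbers below are made up purely to show the comparison:

```python
def ipc(instructions_retired, cycles):
    return instructions_retired / cycles

def perf(ipc_value, clock_hz):
    # single-thread performance ~ IPC x clock
    return ipc_value * clock_hz

chip_a = perf(ipc(2.0e9, 1.0e9), 3.0e9)   # IPC 2.0 at 3.0 GHz
chip_b = perf(ipc(1.6e9, 1.0e9), 4.0e9)   # IPC 1.6 at 4.0 GHz
print(f"{chip_a:.2e} vs {chip_b:.2e} instructions per second")
```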

Comparing between ISAs or ISA generations can be difficult, because the semantic density of the instruction stream changes. Throughput or performance can change, which marketing can fairly depict, but not by conflating it with an architecture's general ability to issue instructions in parallel or in the face of hazards and latency. The kind of work that goes into that is more difficult and usually more generally applicable, and it's not so easily inflated to make an architecture look better than it is.
If that is what Vega's marketing relied upon, then NCU may not be that different, especially relative to a CU with stronger general issue capability.
 
I think a lot of developers, @sebbbi included, would argue that FP16 does have a good deal of value for shaders—I don't really know about INT8.

But I was thinking about "standard" FP32. Unless AMD was being misleading when they claimed IPC improvements, that was for FP32 code. Yet I don't see that improvement at this point.

I am all for using FP16 math, but my point was about current-generation games on PC and not sometime in the future. Desktop GPUs have only recently started supporting HW-accelerated FP16, so there's been no incentive yet. Games don't use integer math much, so INT8 is completely for compute. I take claims of "IPC improvements" from IHVs with a grain of salt, as it's always some corner case that gets improved, like transcendental math. It's hard to do FADD and FMAD operations faster than a single cycle already today.
 
Maybe tiling is broken in the FE silicon and that's why RX Vega is late; maybe there has been a chip fix/respin?
I would buy a respin story if the FE silicon date stamp was pretty old. But I believe I read somewhere that it had a May 2017 production date. While not impossible, it'd be strange to productize the same silicon twice in a month.

Starting from the assumptions that AMD engineers are not idiots and that they did a lot of performance simulations before taping out, my current theory is that some corner case results in a badly messed up outcome, and that they're using a work-around with a terrible performance impact.

Random imaginary example: a ROP write operation in combination with an L2 cache flush has a 0.001% chance of a full chip hang, and the only work-around is to use the L2 cache in write-through instead of write-back mode.

If that kind of bug is only found late in the process, and it can't be fixed with a metal ECO, what do you do as a company? Scrap the product? What if they're already sitting on a $10M inventory?

Some people have pointed out that some January-timeframe Vega demos showed much higher performance than what we see today. The scenario above would fit that kind of outcome very well: initial demos running great, but shit hitting the fan later on.
 
It's odd how IPC has seeped into GPU terminology; it makes little sense in this context. The IPC of a chip is just the number of SPs it has enabled; what we're actually interested in is how well the shader array can be kept saturated.
That wouldn't be an accurate definition of IPC for a GPU. In the case of a GPU, more vector instructions per clock would need to be issued without changing unit counts. Provided sufficient operands, two independent vector instructions could be issued, with the scheduler performing the packing. Packed math could legitimately be higher IPC in that case, as it's more along the lines of two FP16 instructions.

If AMD's marketing for NCU's improved IPC is using packed math as the source of the increase, it is being misleading. The term IPC was not used that way prior, and shouldn't be used that way to inflate what work AMD did to improve its architecture's ability to issue more instructions or avoid stalls. Packed math changes what operations an instruction does, but it requires specifically targeting the feature and the individual operations are more constrained due to the overall architecture not reflecting the operations. Branching, registers, and general memory accesses are still handled at the natural width of the architecture, which limits what packed math applies to.
As I mentioned above, packed could be justified depending on the arrangement. Independent mul and add, even with FP32, could be a possibility. That would be the SMT approach: two instructions per clock without packing. An RF cache providing operands would allow higher-IPC operation, as independent hardware could be used without building out the actual register file. I'm still liking that 40-70% fewer RF accesses with 36% less energy concept. Without that feature Vega would perform like Fiji or worse. I'd think it would be hardware-driven, but maybe the compiler needs to hint dependencies and only one wave can be active?
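To make the register-file-cache idea concrete, here's a toy model of the mechanism that paper describes: a handful of recently written registers absorb operand reads so the big, power-hungry main RF is touched less often. Everything here (cache size, the instruction stream, write-back-on-eviction behaviour) is invented for illustration; whether Vega has anything like it is, as above, pure speculation.

```python
from collections import OrderedDict

class RegisterFileCache:
    def __init__(self, entries=6):
        self.entries = entries
        self.cache = OrderedDict()          # reg -> present, kept in LRU order
        self.rf_reads = 0
        self.cached_reads = 0

    def write(self, reg):
        # Results land in the small cache first; writing back to the main RF
        # only on eviction is where the access savings would come from.
        self.cache[reg] = True
        self.cache.move_to_end(reg)
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)  # evict LRU entry to the main RF

    def read(self, reg):
        if reg in self.cache:
            self.cached_reads += 1
            self.cache.move_to_end(reg)
        else:
            self.rf_reads += 1

# Toy dependent instruction stream: (dest, src1, src2)
stream = [(1, 0, 2), (3, 1, 2), (4, 3, 1), (5, 4, 3), (6, 5, 0), (7, 6, 5)]
rfc = RegisterFileCache()
for dst, a, b in stream:
    rfc.read(a)
    rfc.read(b)
    rfc.write(dst)

total = rfc.rf_reads + rfc.cached_reads
print(f"{rfc.cached_reads / total:.0%} of operand reads served by the RF cache")
```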

A flexible scalar unit cleaning up divergent paths to free the SIMD might classify as IPC as well, or SIMD+scalar co-issue for dot products that don't block vector issue.

There's also the possibility that some long instruction sequences could execute more quickly. A series of FP16 operations could complete sooner, as the results stabilize more quickly. Matrix multiplications could be accelerated to shave a few clocks off larger operations, as they may not be bound by register file latency provided there are temporary registers, allowing a SIMD to run out of phase with the register file; local latches should clock higher while using less energy.

Games don't use integer math much so INT8 is completely for compute.
A8R8G8B8.
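That one-word reply is pointing at 8-bit-per-channel colour formats: graphics touches packed INT8 data constantly, even if shaders rarely do explicit 8-bit arithmetic. A toy example of the kind of per-channel work that 4x INT8 packed math could cover in one 32-bit register (purely illustrative, not any particular shader):

```python
def unpack_a8r8g8b8(p):
    return ((p >> 24) & 0xFF, (p >> 16) & 0xFF, (p >> 8) & 0xFF, p & 0xFF)

def pack_a8r8g8b8(a, r, g, b):
    return (a << 24) | (r << 16) | (g << 8) | b

def scale_rgb(pixel, factor):
    # Per-channel 8-bit multiply with saturation: the sort of thing packed
    # INT8 math could do on all four channels of one register at once.
    a, r, g, b = unpack_a8r8g8b8(pixel)
    sat = lambda x: min(255, int(x * factor))
    return pack_a8r8g8b8(a, sat(r), sat(g), sat(b))

pixel = pack_a8r8g8b8(0xFF, 0x80, 0x40, 0x20)
print(hex(scale_rgb(pixel, 1.5)))  # alpha untouched, RGB scaled and clamped
```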
 
An RF cache providing operands would allow higher-IPC operation, as independent hardware could be used without building out the actual register file. I'm still liking that 40-70% fewer RF accesses with 36% less energy concept. Without that feature Vega would perform like Fiji or worse.
When I googled "40-70% fewer RF accesses with 36% less energy concept", the first link that came up was the Nvidia paper. But you're mentioning Vega. Typo or are there reports that Vega has this kind of cache?
 
What really gets me about that paper is that the 36% energy savings seems to be of RF energy only, which is claimed to be something like 15-20% of dynamic energy for the GPU. So after all those machinations, you've saved say 5% vs a GPU without them? GPU design sure seems brutally hard.
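The back-of-the-envelope version of that, using the fractions as quoted in this thread (so treat them as rough):

```python
rf_share_of_dynamic_energy = 0.15   # to ~0.20 of total GPU dynamic energy
rf_energy_saved = 0.36              # savings within the register file alone

total_saving = rf_share_of_dynamic_energy * rf_energy_saved
print(f"~{total_saving:.1%} of total dynamic energy")  # roughly 5-7%
```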
 
If AMD's marketing for NCU's improved IPC is using packed math as the source of the increase, it is being misleading. The term IPC was not used that way prior, and shouldn't be used that way to inflate what work AMD did to improve its architecture's ability to issue more instructions or avoid stalls. Packed math changes what operations an instruction does, but it requires specifically targeting the feature and the individual operations are more constrained due to the overall architecture not reflecting the operations. Branching, registers, and general memory accesses are still handled at the natural width of the architecture, which limits what packed math applies to.
Thank you for explaining, it's pretty clear to me now - maybe I was a bit tired yesterday.

Anyway: "we" are already including SIMD units in IPC (whether or not that's correct in the strict sense of the word), right? If we accept that, we are already at non-independent instructions. It's only that the M in SIMD is now M×2 or M×4, depending on where you come from. I doubt that anyone ever counted only the schedulers and dispatchers per CU/SM as indicative of IPC.
 
Are you sure that they need 4 warps to keep the CU busy? I thought, with a 4-deep pipeline, a 64-wide warp, and a 16-wide SIMD, they could do it with just 2 warps?
Yes, 2 wavefronts on each GCN SIMD can run at >95% of theoretical peak. I have (or had - can't remember how it currently works) code that does that. It's really down to a high ALU:MEM ratio for a kernel with a runtime of thousands of cycles. Coherent branching in this situation barely makes a difference.

EDIT: not per CU, per SIMD
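A rough model of why two wavefronts can be enough: a 64-wide wavefront on a 16-wide SIMD occupies it for 4 cycles per VALU instruction, so with a high ALU:MEM ratio a single extra wave covers most of the exposed memory latency. The instruction count and latency numbers below are invented just to show the shape of the argument, not measured figures.

```python
WAVE_WIDTH, SIMD_WIDTH = 64, 16
CYCLES_PER_VALU = WAVE_WIDTH // SIMD_WIDTH          # 4 cycles of ALU per instruction

def utilization(waves, valu_per_mem=100, mem_latency=400):
    # ALU work available per memory access, summed across all resident waves:
    alu_cycles = waves * valu_per_mem * CYCLES_PER_VALU
    # Latency not covered by the other waves' ALU work shows up as idle cycles:
    exposed = max(0, mem_latency - (waves - 1) * valu_per_mem * CYCLES_PER_VALU)
    return alu_cycles / (alu_cycles + exposed)

for w in (1, 2, 4):
    print(w, f"{utilization(w):.1%}")
```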
 
When I googled "40-70% fewer RF accesses with 36% less energy concept", the first link that came up was the Nvidia paper. But you're mentioning Vega. Typo or are there reports that Vega has this kind of cache?
Speculation, but Zen has something similar, and rebuilding the memory hierarchy appears to have been a focus with Vega, so it makes sense. The 40-70% should translate directly, as it's a probability. The energy savings are more of an unknown, but GCN tends to have larger, slower caches, so it has the potential to benefit more.

What really gets me about that paper is that the 36% energy savings seems to be of RF energy only, which is claimed to be something like 15-20% of dynamic energy for the GPU. So after all those machinations, you've saved say 5% vs a GPU without them? GPU design sure seems brutally hard.
Not that hard, you just need to keep the designs in perspective. That 15-20% would be starting with less physical cache, and the ratios have been trending up since it was written. It would also have included GDDR, whose higher consumption pushes down the percentage. And it didn't take into account the possibility of simplifying the existing cache structure; they may have shaved off LDS and SGPRs in the process. They also didn't have GCN's cadence, which would constantly be writing out, as opposed to forwarding, for some savings.
 
Anyone know the scores of the Vega FE in 3DMark 11 to compare?
Judging by its Fire Strike and Time Spy numbers, it's probably lower. Vega FE is usually throttling and can't keep up its rated 1600MHz. This RX Vega 3DMark 11 score was achieved at 1630MHz, so this could very well be the water-cooled version.
 
FE is not RX, ... ?
What do you think the difference is? And what are you needing 16GB for?

AMD themselves are trying to tout the design of Vega as requiring less "VRAM" for gaming, and that even 8GB is overkill for 4K.
 
AMD RX Vega Leaked Benchmark Shows It Ahead Of GTX 1080 – Specs Confirmed, 1630MHz Clock Speed, 8GB HBM2 & 484GB/s Of Bandwidth

http://wccftech.com/amd-rx-vega-ben...nfirmed-1630mhz-8gb-hbm2-484gbs-of-bandwidth/

https://videocardz.com/70777/amd-radeon-rx-vega-3dmark11-performance

That's the result of one of the multiple overclocked runs above 1630MHz of the same card. The non-overclocked one is still slower than a stock 1080. So, still similar results to Vega FE.
 
What do you think the difference is? And what are you needing 16GB for?

AMD themselves are trying to tout the design of Vega as requiring less "VRAM" for gaming, and that even 8GB is overkill for 4K.

Some rare games are already pushing above 8GB... But anyway, I was more curious than anything else.
 