AMD: Speculation, Rumors, and Discussion (Archive)

CarstenS · Apr 22, 2016

SimBy said:
Wasn't it always up to 2.5x? I mean perf/W depends on how hard you push it above the sweet spot anyway. So it's not a fixed thing.

Also according to earnings call, Polaris is focusing on mainstream. So below $300 confirmed?

Even though it's not explicitly stated there, it still needs to be read as an "up to", because, you know, corner cases and financial statements.

silent_guy said:
They did say back in December that there was still a bunch of extra optimization needed. Maybe the driver guys misread the memo.

They did. But then they also emphasized how (IIRC!!) half a dozen power-saving features were not even enabled in the prototype they are basing this figure on.

SimBy · Apr 22, 2016

CarstenS said:
Even though it's not explicitly stated there, it still needs to be read as an "up to", because, you know, corner cases and financial statements.

Agreed. That's how I always look at perf/W. But again, perf/W is probably one of the most fancy metrics...that tells you almost nothing.

Compared to what exactly, which SKU, in which benchmark, etc.? I would imagine it's far lower than 2x compared to Nano.
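To illustrate why perf/W is not a fixed number, here is a minimal sketch of a toy model. It assumes performance scales with clock, dynamic power scales with clock times voltage squared, and voltage has to rise once the clock goes past a sweet spot. The function name and every constant are made up for illustration; none of them come from AMD or from any measured GPU.

```python
def perf_per_watt(clock_mhz, sweet_spot_mhz=850.0, base_voltage=0.9,
                  volts_per_mhz=0.0005, static_watts=10.0, k=0.15):
    """Toy perf/W model (hypothetical constants, not real GPU data).

    performance   ~ clock
    dynamic power ~ k * clock * voltage^2
    """
    # Up to the sweet spot the chip runs at its base voltage;
    # above it, voltage is assumed to rise linearly with clock.
    extra_mhz = max(0.0, clock_mhz - sweet_spot_mhz)
    voltage = base_voltage + volts_per_mhz * extra_mhz
    power_watts = static_watts + k * clock_mhz * voltage ** 2
    return clock_mhz / power_watts


if __name__ == "__main__":
    for mhz in (600, 850, 1050, 1250):
        print(mhz, "MHz ->", round(perf_per_watt(mhz), 2), "perf/W (arbitrary units)")
```

In this model the same chip gives noticeably different perf/W depending on where it is clocked, which is why a single "2.5x" figure only means something once the SKU and operating point are named.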

Grall · Apr 22, 2016

SimBy said:
So below $300 confirmed?

If AMD could do another Radeon HD 4890, that would be ridiculously welcome, IMO. That card was fuken awesome from a gaming perspective. Kickass performance, ridiculously price competitive!

CarstenS · Apr 22, 2016

Grall said:
If AMD could do another Radeon HD 4890, that would be ridiculously welcome, IMO. That card was fuken awesome from a gaming perspective. Kickass performance, ridiculously price competitive!

You really want AMD to die, don't you? Look at what that ridiculous price war did to them in the aftermath.

Razor1 · Apr 22, 2016

Well, I think it's always a comparative amount. Not that they will want to cut margins to gain an advantage; if they have the ability to do it and keep their margins healthy, it will be good.

Grall · Apr 22, 2016

CarstenS said:
You really want AMD to die, don't you? Look at what that ridiculous price war did to them in the aftermath.

It's not my job to set AMD's prices, that's AMD's job. GPU prices have climbed steadily, and are now at completely ridiculous levels. When the GTX 8800 Ultra hit $600 we thought we were at Peak Ridiculous, but we hadn't seen nothing yet. Nvidia hasn't launched a high-end gaming card at $600 for how long now?

A high-performance GPU for a good price, who wouldn't want that? Even if a $300 Polaris can't beat a $600+ NV GPU, for that price you could buy two AMD cards and run DX12 multiadapter mode in future games and get superior FPS as a result...!

3dilettante · Apr 22, 2016

Anarchist4000 said:
What if a per SIMD scalar doesn't have its own register file and just shares with the VALUs? If the compiler is switching between scalar and vector code paths as needed they shouldn't interfere with each other. Rearranging threads would be a per CU scalar unit which likely would be high performance. That unit could then have a vector sized RF and bulk dump data or update a bunch of indices very quickly. Regrouping an entire thread block would stall a lot of waves without the high performance part.

I was meaning to return to this since putting a scalar unit downstream from whatever buffer or front end stage that splits the CU scalar pipe from the SIMD pipes provides an opportunity for reducing delays, or places the logic more closely to where it can track dependences.

If the delays are reduced or eliminated due to the hardware being more integrated, or implementing a wavefront stall, then the software-exposed model for the architecture doesn't need to change. It might be more optimal if the padding were eliminated, but the compiler could lag behind hardware evolution without compromising correctness.
Short of a local scoreboard, possibly a more general set of flag registers for wavefronts that can stall if a specific instruction hazard might happen would be incremental in impact.

I'm not sure if it would need to share the vector file, since it might require more complex handling when instructions can source both a vector and scalar register. CPUs can handle the basic interlocks needed between an FP and integer-linked set of pipelines within far fewer cycles with incremental complexity increases, and the SIMD pipeline could handle things even more conservatively.

That would be extra hardware, but after two process nodes the area taken by scalar resources dedicated to a SIMD would be where it is now.

trinibwoy · Apr 22, 2016

CarstenS said:
You really want AMD to die, don't you? Look at what that ridiculous price war did to them in the aftermath.

Was it the price war or the previous thrashing of the 29xx/38xx line?

Granted, asking $300 for the 4870 was a bit too aggressive and unnecessary in hindsight.

Frenetic Pony · Apr 24, 2016

Samsung (whose process is licensed by GloFo) claims their defect density is < 0.2 (per cm^2, as usual) for 14nm FinFET. http://www.anandtech.com/show/10272/samsung-foundry-updates-7nm-euv-10lpp-and-14lpc

For reference, 28nm was at a 0.05 defect rate by the beginning of last year. I don't know what TSMC's current 16nm defect rate is, but last year it was at 0.18. So my guess is that TSMC currently has a lower defect rate than Samsung, though since Samsung has a denser process I'm not sure how much of a per-transistor comparison one could make even if there were less murky numbers to go on.

Still, it does show that there are yield problems on bigger chips, at least for AMD. Of course with HBM gen 2 seemingly in such low supply it's questionable how much that matters at the moment.
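To put those defect densities in perspective, here is a rough sketch using the simple Poisson yield model, Y = exp(-D * A), where D is defects per cm^2 and A is die area in cm^2. The model is a textbook approximation and real foundry yield models are more involved. The die areas below are hypothetical round numbers chosen for illustration, not the sizes of any announced chip.

```python
import math


def poisson_yield(defect_density_per_cm2, die_area_mm2):
    """Fraction of defect-free dies under the simple Poisson yield model."""
    die_area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 = 1 cm^2
    return math.exp(-defect_density_per_cm2 * die_area_cm2)


if __name__ == "__main__":
    # Defect densities quoted in the post; die areas are made-up examples.
    densities = {"28nm (0.05)": 0.05, "16nm (0.18)": 0.18, "14nm (0.2)": 0.2}
    for label, density in densities.items():
        for area_mm2 in (100, 250, 600):
            y = poisson_yield(density, area_mm2)
            print(f"{label:12s} {area_mm2:4d} mm^2 -> {y:.1%} defect-free")
```

Under this model the gap between 0.05 and 0.2 barely matters for a small die but becomes large for a big one, which fits the point about bigger chips being the ones that suffer.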

sebbbi · Apr 24, 2016

Anarchist4000 said:
This I didn't realize was getting added. That was one of the huge sticking points with Tonga/Fiji as I recall.

Some info about this:
http://gpuopen.com/dcc-overview/

mczak said:
Yes indeed, but are you sure about depth compression? I see no evidence the gpu can now do that (I'm just glancing at the open-source driver, which still seems to do an in-place decompress if a depth buffer gets read in the shader).

I don't see anything indicating you could skip decompression for either msaa-compressed surfaces or fast cleared ones. (That said, the allocation of textures, which will also determine the fmask, fastclear, dcc bits disappears somewhere into some address library which I didn't really look at, I don't know if those "old" bits actually still get used for color surfaces...)

This article didn't specify MSAA and depth decompression and direct read in detail, so I asked some extra questions in their Twitter post.

Me:
@TimothyLottes did I understand correctly: MSAA + custom resolve is a bad case for DCC? Can GCN 1.2 read MSAA color + Z without decompress?

Timothy Lottes:
"@SebAaltonen GCN 1.2 has DCC texture-read path without separate decompress pass. But that DCC mode doesn't compress as well as no-read case."

So it seems that it can directly read both depth and MSAA without decompression. However the readable format compresses slightly worse. A huge improvement over GCN 1.0/1.1.

Grall · Apr 24, 2016

Frenetic Pony said:
So my guess is that TSMC currently has a lower defect rate than Samsung, though since Samsung has a denser process I'm not sure how much of a per-transistor comparison one could make even if there were less murky numbers to go on.

Is it really denser though? 16/14 are marketing numbers that don't reflect reality fully.

Kaotik · Apr 24, 2016

Grall said:
Is it really denser though? 16/14 are marketing numbers that don't reflect reality fully.

It's slightly denser. Apple did chips on both, and the ones made by Samsung were a bit smaller.

itsmydamnation · Apr 24, 2016

I figure this man's posts are worth sharing:

http://forums.anandtech.com/search.php?searchid=2739829

There are some situations where Polaris is incredibly fast. Faster than anything on the market. The secret is probably that special culling mechanism in the hardware, which helps the GPU to effectively cull those false positive primitives that aren't visible on the screen. Today's hardware can't do this.
Single wavefront performance is also incredibly good. 10-100 times faster than anything on the market. This is good for VR.

CarstenS · Apr 24, 2016

Seems to be locked if you're not a member of the AT forums. Who is this poster?

itsmydamnation · Apr 24, 2016

CarstenS said:
Seems to be locked if you're not a member of the AT forums. Who is this poster?

zlatan. In my opinion, one of the few people worth listening to on that forum. He claims to be a game dev, and the quality of his posts seems to back it up.

Grall · Apr 24, 2016

CarstenS said:
Seems to be locked if you're not a member of the AT forums.

I'm a member, and there just doesn't seem to be a post there anymore. Deleted, or typo in the link perhaps?

itsmydamnation · Apr 24, 2016

Grall said:
I'm a member, and there just doesn't seem to be a post there anymore. Deleted, or typo in the link perhaps?

Odd, it loads for me just fine. It's the post history of zlatan.

Anarchist4000 · Apr 24, 2016

3dilettante said:
The paper posited a more explicit tracking within the program itself, in order to evaluate when utilization was below par.

There was a French paper linked to me that I read on this as well, covering APUs. The CPU was polling L3 cache hit rate to determine how far ahead to run with prefetching. The L3-on-APUs part really makes me wonder if GPU memory will act like an L3 cache in front of system memory going forward. AMD seems to have been making driver changes suggesting this, with the entire system memory pool showing up as available VRAM.

3dilettante said:
I was meaning to return to this since putting a scalar unit downstream from whatever buffer or front end stage that splits the CU scalar pipe from the SIMD pipes provides an opportunity for reducing delays, or places the logic more closely to where it can track dependences.

If the delays are reduced or eliminated due to the hardware being more integrated, or implementing a wavefront stall, then the software-exposed model for the architecture doesn't need to change. It might be more optimal if the padding were eliminated, but the compiler could lag behind hardware evolution without compromising correctness.
Short of a local scoreboard, possibly a more general set of flag registers for wavefronts that can stall if a specific instruction hazard might happen would be incremental in impact.

I'm not sure if it would need to share the vector file, since it might require more complex handling when instructions can source both a vector and scalar register. CPUs can handle the basic interlocks needed between an FP and integer-linked set of pipelines within far fewer cycles with incremental complexity increases, and the SIMD pipeline could handle things even more conservatively.

That would be extra hardware, but after two process nodes the area taken by scalar resources dedicated to a SIMD would be where it is now.

I have still been pondering over this as well. Still reading over that hybrid architectures to depth imaging paper. That paper seems to suggest a high performance CPU is required to do prefetching, but I'm not sure how it could keep up with a discrete GPU. Especially if constrained by PCIe bandwidth, even with Onion.

sebbbi said:
So it seems that it can directly read both depth and MSAA without decompression. However the readable format compresses slightly worse. A huge improvement over GCN 1.0/1.1.

I recall reading that. Still thinking there was another limitation or that "slightly worse" was a bit more than slight. It just seemed like some devs were tripping over it more than would be expected if the fix was a simple creation option.

CarstenS said:
Seems to be locked if you're not a member of the AT forums. Who is this poster?

It was a search for member posts so required a login.

This should be it.
http://forums.anandtech.com/showpost.php?p=38180442&postcount=46

CSI PC · Apr 24, 2016

As AMD has implemented this Primitive Discard Accelerator, would it also have made sense for them to support Conservative Raster with its Occlusion Culling in DX12?
Wonder how well the Primitive Discard Accelerator will work with engines such as UE4/Unity/CryEngine with their own internal occlusion culling designs and what level of integration is needed.
Cheers

sebbbi · Apr 24, 2016

CSI PC said:
As AMD has implemented this Primitive Discard Accelerator, would it also have made sense for them to support Conservative Raster with its Occlusion Culling in DX12?
Wonder how well the Primitive Discard Accelerator will work with engines such as UE4/Unity/CryEngine with their own internal occlusion culling designs and what level of integration is needed.
Cheers

I would assume that the "Primitive Discard Accelerator" is just a marketing term for some additional early out tests for backfacing & smaller than pixel & out of the screen triangles (and maybe an early out test for small triangles vs HTILE). Currently Nvidia beats AMD badly in triangle rate benchmarks, especially in cases where the triangles result in zero visible pixels. Nvidia certainly has more advanced triangle processing hardware, but the interesting question is whether they just win by brute force (Nvidia has distributed geometry processing to parallelize the work and have better load balancing), or whether they also have better (early out) culling for triangles that are not visible.
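As a rough sketch of what those early-out tests look like, here is a software version working on triangles that are already transformed to screen space (pixel coordinates). This only illustrates the three tests named above and is not a description of what the hardware actually does. The function names are hypothetical, and two conventions are assumed: front faces have positive signed area, and pixel centers sit at integer + 0.5.

```python
import math


def _spans_pixel_center(lo, hi):
    """True if the interval [lo, hi] contains some pixel center (integer + 0.5)."""
    return math.ceil(lo - 0.5) <= math.floor(hi - 0.5)


def discard_reason(tri, width, height):
    """Return why a screen-space triangle can be skipped, or None to keep it.

    tri is three (x, y) tuples in pixel coordinates.
    """
    (ax, ay), (bx, by), (cx, cy) = tri

    # Twice the signed area; its sign gives the winding.
    # Assumption: positive means front facing.
    area2 = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if area2 <= 0:
        return "backfacing_or_degenerate"

    xmin, xmax = min(ax, bx, cx), max(ax, bx, cx)
    ymin, ymax = min(ay, by, cy), max(ay, by, cy)

    # Bounding box entirely outside the viewport.
    if xmax < 0 or ymax < 0 or xmin >= width or ymin >= height:
        return "offscreen"

    # Bounding box slips between pixel centers, so no pixel can be covered.
    if not (_spans_pixel_center(xmin, xmax) and _spans_pixel_center(ymin, ymax)):
        return "too_small"

    return None
```

Each test needs only the three transformed positions, so it can run before rasterization and before any pixel work is scheduled.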

Programmable vertex shaders make it almost impossible to do robust automatic coarse grained culling (by driver & hardware). Not all engines use vertex buffers anymore, and even if vertex buffers are used, the vertex position data might be bit packed in a custom format. The transform matrix might be 4x4, 4x3, 3x4, there might be two or three of them (separate world * view * projection), there might be separate position transform and 3x3 rotation, or quaternion (or dual quaternion) rotation instead of a matrix. So I would assume that the GPU has to run the vertex shader. Of course the driver could split the SV_Position related vertex shader code to reduce some math and data loads. Still this would require loading position data for each vertex and transforming each vertex before the culling decision could be made. This greatly reduces the potential savings.

Coarse occlusion culling (object and/or sub-object granularity) is still highly beneficial, even if the GPU had hardware to occlusion cull at triangle granularity (and/or a really high triangle rate). Coarse culling doesn't need to fetch any per-vertex data (just bounding boxes/spheres), meaning that it accesses much less memory and does far fewer calculations per culled triangle. Coarse (software-based) occlusion culling will still be highly relevant in the future.
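A minimal sketch of the cost argument, with one caveat: it uses a bounding-sphere versus view-frustum test rather than real occlusion culling, because that is the simplest coarse test to show. The cost shape is the same, since one bounds test per object rejects all of its triangles without touching any vertex data. All names, the plane layout (inward-pointing normals, `n . p + d >= 0` means inside) and the example box-shaped "frustum" are assumptions made for illustration.

```python
# Hypothetical "frustum": an axis-aligned box from -10 to 10 on each axis,
# given as (nx, ny, nz, d) planes with inward-pointing normals.
EXAMPLE_PLANES = [
    (1, 0, 0, 10), (-1, 0, 0, 10),
    (0, 1, 0, 10), (0, -1, 0, 10),
    (0, 0, 1, 10), (0, 0, -1, 10),
]


def sphere_outside_frustum(center, radius, planes):
    """True if the sphere lies completely behind at least one plane."""
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        if nx * cx + ny * cy + nz * cz + d < -radius:
            return True
    return False


def cull_objects(objects, planes):
    """Return (names of surviving objects, number of triangles never touched)."""
    visible = []
    triangles_skipped = 0
    for obj in objects:
        if sphere_outside_frustum(obj["center"], obj["radius"], planes):
            # One test on the bounds rejects every triangle of the object.
            triangles_skipped += obj["triangle_count"]
        else:
            visible.append(obj["name"])
    return visible, triangles_skipped
```

For example, calling `cull_objects` on a scene where a 5000-triangle object sits fully outside the planes skips all 5000 triangles after six dot products, with no vertex fetch or vertex shader work at all.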
