AMD: Speculation, Rumors, and Discussion (Archive)

Looks like the L2 partition size has been doubled over GCN 1.2.
The architecture per se doesn't have a fixed cache size per partition (that's true for Nvidia as well, not just GCN). That said, GCN 1.0-1.2 chips had L2 sizes of 128 kB or 256 kB per 64-bit partition (albeit the number for Tonga is still unconfirmed), so 512 kB per 64-bit partition is definitely higher (Hawaii had just 1 MB in total). Fiji also has 2 MB, but it also has 8 memory partitions rather than just 4. I'd say, though, this was fully expected...
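As a quick back-of-the-envelope check, here's a sketch in C using the per-partition figures quoted above; the Fiji per-partition number is just its 2 MB total divided by the 8 partitions, not an official spec:

```c
#include <stdio.h>

int main(void) {
    /* Polaris: 256-bit bus = 4 x 64-bit partitions, 512 kB per partition (as quoted above) */
    const int polaris_partitions = 4;
    const int polaris_kb_per_partition = 512;

    /* Fiji: diagrammed with 8 memory partitions; 256 kB each gives the 2 MB total */
    const int fiji_partitions = 8;
    const int fiji_kb_per_partition = 256;

    printf("Polaris L2: %d kB total\n", polaris_partitions * polaris_kb_per_partition); /* 2048 kB */
    printf("Fiji L2:    %d kB total\n", fiji_partitions * fiji_kb_per_partition);       /* 2048 kB */
    return 0;
}
```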
And what's with the dedicated L/S units -- 16 per CU?
Same old afaict...
 
So, I was right: 32 ROPs.

The primitive discard accelerator is merely chucking out small triangles. The reference to an improvement with MSAA level might be a hint that historically queries against the hierarchical-Z buffer are sufficiently heavy-weight that a large number of them causes a stall. So, if they are culled earlier, then that increases the proportion of time given to triangles that will be rendered and eases the workload on the hierarchical-Z buffer. The hierarchical-Z buffer becomes larger with MSAA level (and resolution), which implies more data is required to satisfy each triangle-query.

Well, that's my guess.

I don't know much about the throughput of the hierarchical-Z buffer versus its size and the triangle rate. I might be imagining a problem that isn't there.
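To make that guess a bit more concrete, here's a rough sketch in C of how the hierarchical-Z footprint could grow with resolution and MSAA level; the tile size and per-tile storage are made-up placeholder values purely to show the scaling, not real hardware parameters:

```c
#include <stdio.h>

int main(void) {
    const long width = 1920, height = 1080;
    const long samples_per_tile = 8 * 8; /* assumed hi-Z tile covering 64 samples    */
    const long bytes_per_tile = 8;       /* assumed min/max depth pair per tile      */

    for (long msaa = 1; msaa <= 8; msaa *= 2) {
        long samples = width * height * msaa;       /* sample count grows with MSAA  */
        long tiles = samples / samples_per_tile;
        printf("%ldx MSAA: ~%ld hi-Z tiles, ~%ld kB\n",
               msaa, tiles, tiles * bytes_per_tile / 1024);
    }
    return 0;
}
```

More tiles means more hierarchical-Z data potentially touched per triangle query, which is the kind of pressure described above.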
 
Is hierarchical Z physically tiled in the same way the render back ends are? Could the inter-engine distribution links between the geometry processors and rasterizers have been broadcasting a zero-coverage triangle to multiple engines?

With 4 shader engines and geometry processors, what is left that isn't at least as capable as the 390?
Rasterizer and matching ROP throughput, L2 crossbar bandwidth due to smaller number of channels and matching slices?

Oddly enough, the number of memory controllers seems to be higher for this 256-bit bus than it was for Tonga, which AMD diagrammed as having 4 64-bit controllers.
There are 8 MC blocks in the Polaris diagram. Could that mean even the L2 is Hawaii level (edit: in slice count)?
And what does that mean since Fiji had HBM controllers, but was diagrammed with 8 MC blocks--and bandwidth tests showed very little scaling until the data stride got off-chip?
 
My theory is that when you enable MSAA, the coverage analysis will test each primitive against more samples. So zero-area primitives will take longer in coverage analysis, only to be discarded at the end because they won't hit a single sample. Now, if you culled these primitives before rasterization, you'd save all that useless work. Honestly, I thought culling zero-area primitives would already be implemented, as it only takes 6 MULs and 3 ADDs to determine whether the cross product of a triangle's edges is 0.
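A minimal sketch of that test in C (the vertex layout and the epsilon are illustrative assumptions, not the actual primitive discard accelerator logic):

```c
#include <stdbool.h>

typedef struct { float x, y, z; } vec3;

/* Zero-area test: cross product of two triangle edges.
 * The cross product itself is the 6 MULs and 3 ADDs (subtractions)
 * mentioned above; the squared-magnitude comparison at the end is an
 * extra step for a robust "is it (near) zero" check. */
static bool is_zero_area(vec3 a, vec3 b, vec3 c) {
    vec3 e1 = { b.x - a.x, b.y - a.y, b.z - a.z };
    vec3 e2 = { c.x - a.x, c.y - a.y, c.z - a.z };

    float cx = e1.y * e2.z - e1.z * e2.y;
    float cy = e1.z * e2.x - e1.x * e2.z;
    float cz = e1.x * e2.y - e1.y * e2.x;

    /* The length of the cross product is twice the triangle's area;
     * cull if its squared length is effectively zero. */
    return (cx * cx + cy * cy + cz * cz) < 1e-12f;
}
```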
 
32b/channel makes sense for the MC as the architecture will have to work alongside a CPU and could be addressing system memory.

No details on CU configurations. The diagram looks like the standard 4x16+1, though. Reserving CUs should be interesting. I'd like to see the sound engine/effects that need over 1 TF (22%) of processing power, if that diagram is realistic.

That ACE/HWS scheduling/self-tuning I theorized a while back appears to be in there. Curious if it's applicable to previous generations. It doesn't look like they've changed the schedulers much, although they may be a bit more robust. Keep in mind 2 ACE = 1 HWS = 1 programmable logic device. They are all interchangeable, so it could be 2 ACE + 3 HWS or 8 ACE if that ever made sense.
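For illustration, a toy enumeration in C of the configurations that the "2 ACE = 1 HWS = 1 block" equivalence would allow; the total of 4 blocks is inferred from the 2 ACE + 3 HWS and 8 ACE examples above, not an official figure:

```c
#include <stdio.h>

int main(void) {
    const int blocks = 4; /* assumed: each block can act as either 2 ACEs or 1 HWS */
    for (int hws = 0; hws <= blocks; ++hws) {
        int ace = 2 * (blocks - hws);
        printf("%d ACE + %d HWS\n", ace, hws);
    }
    return 0;
}
```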

Native FP16 and 32b controllers could be useful for deep learning?

The increase with MSAA is a relative gain, so it may just be marketing spin, since higher MSAA levels are inherently heavier operations. As MSAA levels increase, you'd be culling more subsamples.
 
Let this be the start of the AMD RX480 Review discussions. The official reviews should be hitting in a little under 12 hours, but I'm sure there may be some that are accidentally released early.
 
Are we really making:

A thread for the Polaris architecture,
a thread to talk about the 480,
a thread to talk about the reviews of the 480,
and another hundred threads for every architecture, every product of that architecture, and every review of every product of that architecture?
 
Yes. Or would you rather just pile everything into one thread and not be organized?
 
Yes. In a perfect world we could get 200 threads about every aspect of a product, but we don't live in a perfect world. We will end up discussing the architecture, the product, the market, etc. in the same thread regardless of the title of that thread. So in this thread for "the reviews" we will be talking about the TAM or about the architecture, etc. Having everything in one single thread would be better and more organized, since we could have a better chronological organization of the posts, and it would also make searching through the thread easier.

A single thread for "Polaris" and a single thread for "Vega" would be much better for everyone.
 
I'll do you one better and just make 1 thread for AMD and 1 thread for Nvidia.

The small number of threads that deal with the architecture, the product announcement, and the product reviews are there to help the forum's readers quickly determine what information is where. When you only have one mega-thread with over 20,000 posts, it's impossible to find the information you're looking for.

Take this thread, which was spawned from another one. It has over 1,000 posts in a short amount of time. Now imagine the thread had been alive for years. Such threads easily grow beyond 10K posts. It's impossible to find information in them. They do not work as a means of organized discussion.
 
Yes. Or would you rather just pile everything into one thread and not be organized?
Yes.

I'll do you one better and just make 1 thread for AMD and 1 thread for Nvidia.
There's no need to go to extremes.

If you really want to have multiple threads, at least lock some of the staler ones. There's no good reason to have 4 in parallel where the same thing could be argued.
 
No reason why the users couldn't moderate things themselves and choose to not post in stale threads.

Anyways, here it is, here is the one single thread for all things AMD. Enjoy.
 
That was a quick turnaround. I do like the mega-threads, but the multi-thread format made it easier to find things.
 
Increased per wave instruction buffer size
  • Improve single threaded performance
Just realized this is probably the flexible scalar from the papers. I skipped over it initially, thinking they had just made a cache larger, like the L2. A while back, wasn't this likely mistaken for higher single-threaded CPU performance for DX11?
 
One thread for each gen of cards, like before, was fine. Not sure why it needed more or fewer threads.
Exactly.
The old (until a few weeks ago?) system of one architecture thread and one review thread has worked very well for as long as I can remember.
Suddenly our dear moderator decides to split them into pointless subthreads, and when people voice their opinion about that, he throws a tantrum and goes to the other extreme.

Just leave things the way they were; they worked.

When interesting (off-topic) subthreads really develop, people usually ask for them to be split off.
 
Just realized this is probably the flexible scalar from the papers. I skipped over it initially, thinking they had just made a cache larger, like the L2. A while back, wasn't this likely mistaken for higher single-threaded CPU performance for DX11?
The improved single-threaded performance is a sub-item of the larger instruction buffer item. My interpretation is that the larger bullet point gives a feature, and the indented items give its effect or a detail about it.
 