7850 with what CPU...
> Tessmark is really special... I don't know why, but for a long time they didn't even test their software on AMD GPUs.

It's not just Tessmark. You see it with AMD's DirectX SDK example as well.
> The hardware tessellator only needs tessellation factors. It outputs UVs and the DS does the rest.

That's what I meant by patch data. I was giving a suggestion for even avoiding the UV calculation in the tessellator, but if AMD does do it, then that's great. The problem is that their data flow management for tessellation sucks if it needs off-chip storage.

> The DS needs access to the HS output, and if you only have a few wavefronts in flight, performance will suck. Hence you want DS waves from the same patch to execute on multiple CUs/SMXs in parallel.

I don't see how this is so different from triangles from vertex shaders generating pixels that run on multiple CUs/SMXs.
AMD chips don't push polygons into an off-chip buffer there. They run the VS on however many CUs are deemed appropriate, fill a FIFO on chip to temporarily store the output, and when it's full the CUs stop doing VS work while the rasterizers start emptying the FIFO by creating pixel wavefronts for the CUs to process.
This is how tessellation should work. Run the HS, fill a FIFO, and when it's full, stop HS work and let the tessellators empty the FIFO by creating DS wavefronts.
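As a toy illustration of that on-chip scheme (a sketch only; the FIFO depth, names, and the wave-count heuristic are all made up, not actual hardware behavior):

```python
from collections import deque

def run_pipeline(patch_factors, fifo_capacity=64):
    """Toy model: the HS fills a bounded on-chip FIFO; when it's full,
    HS work stops and the tessellator drains it into DS wavefronts."""
    fifo = deque()
    ds_wavefronts = 0
    for factor in patch_factors:
        if len(fifo) == fifo_capacity:
            # FIFO full: pause HS work and let the tessellator empty it
            while fifo:
                ds_wavefronts += max(1, fifo.popleft())  # more DS waves at high factors
        fifo.append(factor)  # HS output (t-factors + patch data) stays on chip
    while fifo:  # drain the tail
        ds_wavefronts += max(1, fifo.popleft())
    return ds_wavefronts
```

With this structure nothing ever spills to memory; the FIFO depth only bounds how far the HS can run ahead of the tessellator.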
> HS's can be very expensive so you need a lot of them in flight to perform well at low tessellation cases. The FIFO you suggest is the limiting factor for how many HS waves can execute in parallel.

First of all, that's a failure in optimization if AMD is sacrificing high-factor performance to improve low-factor performance. Their own devrel material suggests not using tessellation at low factors, and poor HS performance won't matter then anyway.
> Say you have 4 control points per patch, 64 bytes of output per control point, plus 6 32-bit tessellation factors per patch, and you need 1000 control points in flight to hide the latency of the HS.

The DS is where displacement mapping happens, so I don't see why the HS needs 1000 control points to hide latency. The vast majority of HSes will be pure arithmetic.
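Putting numbers on the quoted scenario (straight arithmetic from the figures above, nothing else assumed):

```python
# Buffer needed to keep 1000 control points in flight, using the
# quoted figures: 4 CPs/patch, 64 B per CP, six 32-bit t-factors/patch
cps_in_flight = 1000
cp_per_patch = 4
bytes_per_cp = 64
factor_bytes_per_patch = 6 * 4

patches_in_flight = cps_in_flight // cp_per_patch            # 250 patches
buffer_bytes = (cps_in_flight * bytes_per_cp
                + patches_in_flight * factor_bytes_per_patch)
print(buffer_bytes)  # 70000 -> roughly 70 kB of on-chip storage
```

That's the "tens of kB" of buffering the on-chip camp is arguing over further down.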
> Anyway, HS output going to memory doesn't explain AMD's performance in the least. The higher the tessellation factor, the more DS verts per patch, and the less the BW/latency of off-chip HS data will affect the triangle throughput of the tessellator.

Correct. This is why it doesn't make sense to have a large on-chip FIFO.
> In reality, AMD's tris per clock goes drastically down with tessellation factor. That must mean they are storing tessellator output off chip (maybe 5-6 bytes per DS vert), not HS output. There's no need to do this with properly designed hardware.

If the DS output went off chip you'd see performance scale with memory bandwidth.
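To see why off-chip HS data matters less and less at high factors, here are idealized counts for a quad domain tessellated at a uniform integer factor (real partitioning modes differ in detail; the 280-byte patch size just reuses the example figures quoted earlier):

```python
def quad_domain_counts(f):
    """Idealized quad-domain tessellation: (f+1)^2 vertices, 2*f^2 triangles."""
    return (f + 1) ** 2, 2 * f * f

hs_bytes_per_patch = 4 * 64 + 6 * 4  # 280 B of HS output, from the earlier example
for f in (1, 8, 32, 64):
    verts, tris = quad_domain_counts(f)
    # off-chip HS traffic amortized over the DS verts of one patch
    print(f, verts, tris, round(hs_bytes_per_patch / verts, 2))
```

The per-vert cost of fetching HS output falls roughly as 1/f², which is why a triangle rate that *drops* at high factors points at tessellator output handling rather than HS output handling.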
I still don't understand how "driver issues" can affect one SKU when others based on the same chip aren't similarly affected.
> Correct. This is why it doesn't make sense to have a large on-chip FIFO.

It doesn't need to be large. Your calculations were dependent on having 1000 HS wavefronts, which is at least 50x too many. It's extremely rare for a HS to need texture access.
> If the DS output went off chip you'd see performance scale with memory bandwidth.

BW wouldn't be an issue (4 verts/clk is < 24 GB/s), unless there is an internal restriction. Latency and/or extra clocks for export/import probably are.
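For reference, the "< 24 GB/s" figure is consistent with assuming ~6 bytes per DS vert and a ~1 GHz core clock (both are my assumptions, not stated specs):

```python
verts_per_clk = 4
bytes_per_vert = 6        # assumed: two 16-bit UVs plus indexing/overhead
core_clock_hz = 1.0e9     # assumed ~1 GHz core clock
write_bw = verts_per_clk * bytes_per_vert * core_clock_hz
print(write_bw / 1e9)  # 24.0 GB/s of streamout write traffic
```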
> It doesn't need to be large. Your calculations were dependent on having 1000 HS wavefronts, which is at least 50x too many. It's extremely rare for a HS to need texture access.

I said 1000 control points, not 1000 HS wavefronts. If you don't think bandwidth is an issue, then I'm not sure why you think that would be a stupid design.
> I said 1000 control points, not 1000 HS wavefronts. If you don't think bandwidth is an issue, then I'm not sure why you think that would be a stupid design.

My mistake. Ignore what I wrote in that paragraph.
> My mistake. Ignore what I wrote in that paragraph.

Don't underestimate how much latency there is in a graphics chip, nor how complicated some HS's are, especially once subdivision surfaces start being used more and there are a lot of control points per patch.
But sustaining 1000 control points is silly for a shader without texture loads in a load-balanced system (even if it does initially issue that many). And even if you're right that the tens of kB needed to buffer enough control points is too much of a cost, problems with HS streamout would produce a t-factor effect on performance that trends opposite to what we're seeing.
BW for tessellator streamout is not a bottleneck in theoretical tests, but it'll still sap significant BW. (BTW, I forgot to multiply those numbers by 2 for write and read, but it's still a lot less than total BW. I assumed two 16-bit UVs plus a couple other bytes for indexing/overhead.) So it's still stupid to stream UVs out, especially when they're so tiny and can be created on demand by the tessellator and packed directly into DS wavefronts.
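With that x2 correction applied, the round trip works out as follows (same assumed 4 verts/clk and ~1 GHz core clock as before; the per-vert size is the estimate from the paragraph above):

```python
bytes_per_vert = 2 * 2 + 2        # two 16-bit UVs + ~2 bytes indexing/overhead
verts_per_clk = 4
core_clock_hz = 1.0e9             # assumed ~1 GHz core clock
round_trip_bw = verts_per_clk * bytes_per_vert * core_clock_hz * 2  # write + read
print(round_trip_bw / 1e9)  # 48.0 GB/s -- significant, yet well below total board BW
```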
But why are you denying that there's a problem? I'm still waiting for your explanation, but AFAICS the only reason to have lower triangle throughput with high t-factor is UV streamout (high t-factor, in fact, reduces data flow in the pipeline), with stalls appearing either when writing out or reading back in.
It's a flawed design.
> 290 non-X was pushed back to Nov 5th last week. It's not yet completely AWOL.

OK, I must have missed that.
The blurred out card is the new card coming Tuesday 5am from AMD.
A card with which I've now managed scores even beating the R290X OC, thanks to newer drivers.
> Don't underestimate how much latency there is in a graphics chip nor how complicated some HS's are. Especially once sub-division surfaces start being used more and there are a lot of control points per patch.

Again, I'm accepting your claim that HS output may warrant going off chip. And latency is exactly why I think UV streamout can cause stalls.
> I'm not denying there is a problem and that the design is flawed. It's just not flawed due to off-chip buffering.

Well, improved off-chip buffering is really the only thing AMD has mentioned between generations, so that's what I assumed. I guess it could be related to DS wavefront generation and distribution, or bank conflicts, or whatever, but it's hard to think of anything that makes throughput go down with higher factors.