AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

3dilettante · Nov 13, 2017

Anarchist4000 said:
It wouldn't retroactively decide, but poll input from prior geometry as one possible test.

What is it polling? Something that resides in 300ns memory, something in the FIFO or in the post-transform path that feeds into the primitive assembly pipeline?
The specific reference in the presentation linked earlier was the performance penalty of an indirect draw that was empty. Just the initial read was a memory operation, and the single command processor's limited concurrency means the system at large can feel the effects of more than a few such requests.
There would be no primitive to evaluate in order to poll anything, and even in the conventional pipeline the system wouldn't get far enough to consider polling anything.
That scenario is not served well by a primitive shader, which runs in the middle of the vertex processing stages. The empty draw's primitive shader cannot poll anything since that initial step on the part of the command processor occurs before the primitive shader's instance comes into existence--and it won't because there's nothing to invoke it for. The ballot process that removes empty draws happens in a prior separate shader call, and while I've seen AMD discussing culling possibilities for triangles, I have not seen it promise overriding or replacing developer-coded draw call submission.
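As a rough illustration of the point about empty draws: the filtering has to happen in a separate pass that compacts the indirect argument list before the command processor reads it. The sketch below models that pass on the CPU. `DrawArgs` and `compact_draws` are hypothetical names for illustration only, not part of any graphics API or of the presentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrawArgs:
    """Hypothetical stand-in for one indirect draw record."""
    index_count: int
    instance_count: int
    first_index: int = 0


def compact_draws(draws: List[DrawArgs]) -> List[DrawArgs]:
    """Drop draws that would produce no work, preserving submission order.

    On a GPU this would be a compute pass writing a compacted argument
    buffer plus a draw count; here it is a plain list filter.
    """
    return [d for d in draws if d.index_count > 0 and d.instance_count > 0]


draws = [DrawArgs(36, 1), DrawArgs(0, 1), DrawArgs(12, 0), DrawArgs(6, 2, 36)]
surviving = compact_draws(draws)
print(len(draws), "submitted ->", len(surviving), "after compaction")
```

Nothing in this filter runs per primitive, which is why a shader stage that only exists once a draw has vertices to process cannot stand in for it.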

For an engine that already does much the same thing as promised by primitive shaders, this may be a case where primitive shaders do not help as much or could be detrimental if the duplication of work or contention with the ongoing compute shaders is significant.

Is the inspiration for your statement concerning 300 instructions per triangle from the Frostbite presentation, or was it pulled from something else?

Move Z culling into the primitive testing at a per bin resolution.

What exactly would that entail, for a general programmable shader in the GCN ISA? If this starts invoking reads to caches or buffers, the prospect of misses (latency), conservative hierarchies (inaccuracy), or timing issues comes into play.

There would also be some mechanism to analyze or reduce the bins.

My understanding of the claim is that this is all a big programmable shader--the mechanism is presumably "code something".

That's roughly what I'm suggesting. All of that running on a CU until satisfied with whatever bins were created. A single draw call possibly instantiating only one wavefront for establishing bins.

Is this what you're suggesting would be a good idea, or what you are suggesting AMD is doing?
I think a quick skim of the primitive shader section of the Vega whitepaper would serve to counter the latter.

Where I would have questions about the former is that this big shader would be tracking across at least three points where the architecture hands off between concurrent hardware units, with amplification/decimation possibilities and a FIFO involved.

Digidi said:
Interesting patents found:
https://www.google.de/patents/US20140362102?dq=ininventor:"Michael+Mantor"&hl=en&sa=X&ved=0ahUKEwiRw9np97jXAhUM9YMKHYs2CjkQ6AEITDAF
https://www.google.de/patents/US20160371873?dq=ininventor:"Michael+Mantor"&hl=en&sa=X&ved=0ahUKEwiRw9np97jXAhUM9YMKHYs2CjkQ6AEILjAB

Interesting:
https://www.google.de/patents/EP3008701A1?cl=en&dq=ininventor:"David+Simpson"&hl=en&sa=X&ved=0ahUKEwim0rv_-bjXAhVMkJAKHXnPCScQ6AEIvQIwJg

Maybe this is primitive shader?

The patents not covering the binning rasterizer cover using compute as a way to submit geometry to the front end, which in recent times has also shown up as a graphics API extension.
One discussed variation that includes culling and compute+vertex pairs linked by ring buffers is quite close to what Mark Cerny publicly described for the PS4's triangle sieve customization (note: that culling was not a guaranteed benefit).
What AMD is promising for primitive shaders is not something new or necessarily limited to AMD's architectures. That other patents from other companies have been cited makes it probable that a few other parties have similar patents or concepts based on them.
Primitive shaders currently are more of a tweak in where AMD intends to place the culling in the abstract graphics pipeline and whether formerly separate shaders are combined into one shader.

Perhaps some of the difficulty getting primitive shaders rolled out is baked into the age of some of this discussion. Some of Vega's features have a documented lineage that goes back to the early days before some of the more advanced techniques had the chance to be developed and deployed in various engines, and the apparent lag in Vega's IP features getting to market may have meant the workloads it was meant for have moved to a point that its gains are less impressive.

Anarchist4000 · Nov 13, 2017

3dilettante said:
What is it polling? Something that resides in 300ns memory, something in the FIFO or in the post-transform path that feeds into the primitive assembly pipeline?

In the case I suggested hierarchical Z values from fragment processing ideally aligned to bin dimensions. Although a separate mechanism would work. Occlusion culling geometry prior to hitting the FIFO across draw calls.

3dilettante said:
For an engine that already does much the same thing as promised by primitive shaders, this may be a case where primitive shaders do not help as much or could be detrimental if the duplication of work or contention with the ongoing compute shaders is significant.

That was my thinking as well. Primitive shaders should be a bit more efficient, but realistic gains minimal save a few extreme cases.

3dilettante said:
Is the inspiration for your statement concerning 300 instructions per triangle from the Frostbite presentation, or was it pulled from something else?

Yes, from the Frostbite presentation. Obviously subject to situation, but a solid starting point as it should take far fewer instructions. Even exceeding that number some techniques may still benefit if integrating pipeline stages.

3dilettante said:
What exactly would that entail, for a general programmable shader in the GCN ISA? If this starts invoking reads to caches or buffers, the prospect of misses (latency), conservative hierarchies (inaccuracy), or timing issues comes into play.

Specifically that 4SE instruction for binning and probably indexing HiZ results. The bins are likely sufficiently large that caching results for both primitive and fragment culling would be reasonably compact and ideally always cached. Accuracy less of a concern as the outcome would only result in not culling a primitive. Even a percentage point of accurately culled primitives should provide a net gain.

3dilettante said:
My understanding of the claim is that this is all a big programmable shader--the mechanism is presumably "code something".

Correct, with some default mechanisms filling in when not provided an alternative.

3dilettante said:
Is this what you're suggesting would be a good idea, or what you are suggesting AMD is doing?
I think a quick skim of the primitive shader section of the Vega whitepaper would serve to counter the latter.

Both. The Vega Whitepaper provides some basics of primitive shaders, but nothing on DSBR or interactions with other stages. I'm looking at them as an inline function of a larger program with the distinction yet to be determined by AMD.

3dilettante said:
Where I would have questions about the former is that this big shader would be tracking across at least three points where the architecture hands off between concurrent hardware units, with amplification/decimation possibilities and a FIFO involved.

One engineer did describe PS as being like assembly. So it may not be unreasonable to leverage FIFOs like streaming devices if batch processing geometry. Especially for binning or minimal sorting into SEs. FIFOs could be pulled forward if you could guarantee all geometry in a bin resided on the same SE.

3dilettante said:
One discussed variation that includes culling and compute+vertex pairs linked by ring buffers is quite close to what Mark Cerny publicly described for the PS4's triangle sieve customization (note: that culling was not a guaranteed benefit).

Cerny and Sony were listed on one of the patents linked above so it seems likely there was some input. That "assembly like" coding for primitive shaders is why I think there is more going on than just culling with a compute shader. My thinking is the entire geometry pipeline is inherently programmable and a form of primitive shader is how it is implemented internally. Same for Nvidia. Some specialized hardware in the front end would look rather strange to the ISA, however that 4SE instruction would suggest some interaction. The ISA wouldn't necessarily list internal instructions or more specifically registers that may relate to primitive shaders.

Digidi · Nov 13, 2017

Are there any patents out there that come close to primitive shaders, to help understand the behavior?

Infinisearch · Nov 13, 2017

Digidi said:
Are there any patents out there that come close to primitive shaders, to help understand the behavior?

Look over the filters section on this page: (except for cluster culling) https://github.com/GPUOpen-Effects/GeometryFX/

3dilettante · Nov 14, 2017

Anarchist4000 said:
In the case I suggested hierarchical Z values from fragment processing ideally aligned to bin dimensions.

The presentation included binding the Hi-Z data to a surface so that it could be read and decoded manually. That's one or more memory accesses, and does rely on a pre-pass or other invocation that built that data prior to the shader evaluating the culling.
Bins are screen-space partitions owned by rasterizers and their linked render back-ends, and depth is updated and maintained by the render back-ends, so some level of alignment seems likely.
The presentation notes that the more expensive culling tests, such as Hi-Z, justify branch divergence. I'm still not seeing how the 300 figure from the empty indirect draw scenario applies, except as a cautionary example of what happens if complex culling is applied too well without paying attention to the less-concurrent portions of the system.
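A minimal sketch of what a manual Hi-Z test amounts to, assuming a "smaller depth is closer" convention and a Hi-Z surface that stores the farthest depth per tile (both are assumptions for illustration, as is the name `hiz_may_be_visible`). Each table lookup stands in for one of the memory accesses mentioned above.

```python
from typing import List, Tuple


def hiz_may_be_visible(hiz_far: List[List[float]], tile_size: int,
                       bbox: Tuple[int, int, int, int],
                       nearest_z: float) -> bool:
    """Conservative occlusion test for a screen-space bounding box.

    hiz_far[ty][tx] is the farthest depth already written in that tile.
    bbox is (x0, y0, x1, y1) in inclusive pixel coordinates.
    Returns False only if the box is behind every tile it touches.
    """
    x0, y0, x1, y1 = bbox
    rows, cols = len(hiz_far), len(hiz_far[0])
    tx0, tx1 = max(x0 // tile_size, 0), min(x1 // tile_size, cols - 1)
    ty0, ty1 = max(y0 // tile_size, 0), min(y1 // tile_size, rows - 1)
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            # One lookup per tile: on hardware this is a memory read.
            if nearest_z <= hiz_far[ty][tx]:
                return True
    return False


hiz_far = [[0.5, 1.0],
           [0.5, 0.5]]
print(hiz_may_be_visible(hiz_far, 8, (0, 0, 7, 7), 0.6))
print(hiz_may_be_visible(hiz_far, 8, (0, 0, 9, 7), 0.6))
```

The test can only be as good as the Hi-Z data it reads, so it depends on an earlier pass having built that surface.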

Specifically that 4SE instruction for binning and probably indexing HiZ results.

That gives you a 4-bit mask about how many back-ends might be applicable to a conservative bounding box in a 4-shader engine GPU.
Once that value is derived, either by the instruction in Vega 10's case or in other implementations by standard math, what then?
Since it takes a bounding box, it is conservative and has a number of early checks that take things to full coverage. At the same time, how does 4 bits of data allow sufficient granularity for how many or which bins or HiZ tiles overall might be touched?
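To make the granularity question concrete, here is a sketch of the "standard math" version under a loudly hypothetical assumption: that screen tiles are assigned to the four shader engines in a repeating 2x2 checkerboard. The real mapping, the tile size, and the name `se_coverage_mask` are not taken from any AMD documentation.

```python
from typing import Tuple


def se_coverage_mask(bbox: Tuple[int, int, int, int], tile_size: int = 32) -> int:
    """Return a conservative 4-bit mask of shader engines a bbox may touch.

    Assumes (hypothetically) that tile (tx, ty) belongs to engine
    (tx & 1) | ((ty & 1) << 1). bbox is (x0, y0, x1, y1), inclusive pixels.
    """
    x0, y0, x1, y1 = bbox
    tx0, ty0 = x0 // tile_size, y0 // tile_size
    tx1, ty1 = x1 // tile_size, y1 // tile_size
    # Early out: spanning at least two tile columns and two tile rows
    # touches all four engines under the checkerboard assumption.
    if tx1 > tx0 and ty1 > ty0:
        return 0b1111
    mask = 0
    # The pattern repeats every two tiles, so two per axis is enough.
    for ty in range(ty0, min(ty1, ty0 + 1) + 1):
        for tx in range(tx0, min(tx1, tx0 + 1) + 1):
            mask |= 1 << ((tx & 1) | ((ty & 1) << 1))
    return mask


print(bin(se_coverage_mask((0, 0, 10, 10))))
print(bin(se_coverage_mask((64, 64, 70, 70))))
```

Two boxes in entirely different parts of the screen produce the same mask, so on its own the mask identifies engines and not individual bins or Hi-Z tiles.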

One possible use case for that information I'm wondering about is the Early Prim Deallocate message type, which among other things references the ability to deallocate primitives from up to 4 buffers. That seems like something a programmable shader might be able to use to remove primitives from consideration on a per-SE basis. There's a lack of specificity in the encoding as to where to start a stretch of deallocations, or how to make the 11 bits of encoding apply in the likely case that there are far fewer than that many primitives in a row to remove.
A possible implicit method is that there's state built up in terms of positions exported for primitives in a FIFO, which helps with knowing where to start and which primitives the deallocation can skip over.
Figuring out the optimal balance between that many concurrent processes and the timing of the exports/message is something that might be a little more difficult to get right.

The bins are likely sufficiently large that caching results for both primitive and fragment culling would be reasonably compact and ideally always cached. Accuracy less of a concern as the outcome would only result in not culling a primitive. Even a percentage point of accurately culled primitives should provide a net gain.

I cannot reconcile the "it's all one big programmable shader" with the "but it doesn't need to be accurate since something else will take care of what it misses".

As far as bin sizes go, and possibly batch sizes, the following may be related:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=c3ebac68900de5ad461a7b5a279621a435f5bcec

It's described as being tuned for the more modest Raven Ridge GPU, however.

Correct, with some default mechanisms filling in when not provided an alternative.

Primitive shaders are a conservative culling mechanism. Outside of some contrived scenario, the shader feeds into that same default mechanism--since the primitive shader's explicit target was saving cycles in the fixed-function path.
Is this all one big shader, or is there something else the shader cannot see directly and thus negating the assertion?

Both. The Vega Whitepaper provides some basics of primitive shaders, but nothing on DSBR or interactions with other stages. I'm looking at them as an inline function of a larger program with the distinction yet to be determined by AMD.

The whitepaper diagram has the primitive shader drawn in a box that excludes tessellation and primitive assembly, and it states that the primitive shader is explicitly not the same thing as the surface shader that is emitted when tessellation is on.

One engineer did describe PS as being like assembly. So it may not be unreasonable to leverage FIFOs like streaming devices if batch processing geometry.

If it's all one shader and it's using the generally programmable arrays, is there a reason why the same shader instruction stream is running both sides of a FIFO to itself (discounting the earlier Vega whitepaper doc saying that at times it is definitely not one shader)? Handing oneself one's own data in the exact order generated seems a little pedestrian.

Also, that could wind up constraining concurrency, since latency events on one endpoint affect the whole path in this scenario.

Anarchist4000 · Nov 14, 2017

3dilettante said:
That gives you a 4-bit mask about how many back-ends might be applicable to a conservative bounding box in a 4-shader engine GPU.
Once that value is derived, either by the instruction in Vega 10's case or in other implementations by standard math, what then?
Since it takes a bounding box, it is conservative and has a number of early checks that take things to full coverage. At the same time, how does 4 bits of data allow sufficient granularity for how many or which bins or HiZ tiles overall might be touched?

That was the original intent, but the function could be used recursively to further partition the screen space. Previously there were essentially 4 bins, one per SE. Each iteration providing a pow2 number of variably sized bins and a triangle could hit every sub-bin within a bin.
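A sketch of the recursive idea as described here, purely as a thought experiment: a 4-way split applied repeatedly, returning the leaf sub-bins a bounding box overlaps. Nothing suggests the hardware works this way, and `touched_sub_bins` is a hypothetical name.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open on x1/y1


def touched_sub_bins(bbox: Rect, bin_rect: Rect, depth: int) -> List[Rect]:
    """Return the leaf sub-bins overlapped by bbox after `depth` 4-way splits."""
    bx0, by0, bx1, by1 = bin_rect
    x0, y0, x1, y1 = bbox
    # Reject if the box does not overlap this bin at all.
    if x1 <= bx0 or x0 >= bx1 or y1 <= by0 or y0 >= by1:
        return []
    if depth == 0:
        return [bin_rect]
    mx, my = (bx0 + bx1) // 2, (by0 + by1) // 2
    quads = [(bx0, by0, mx, my), (mx, by0, bx1, my),
             (bx0, my, mx, by1), (mx, my, bx1, by1)]
    leaves: List[Rect] = []
    for quad in quads:
        leaves.extend(touched_sub_bins(bbox, quad, depth - 1))
    return leaves


root = (0, 0, 64, 64)
print(len(touched_sub_bins((0, 0, 8, 8), root, 2)))
print(len(touched_sub_bins((0, 0, 64, 64), root, 2)))
```

A small box lands in one leaf, while a box covering the whole bin lands in every leaf, so the number of iterations of work per triangle depends on its size relative to the bin.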

3dilettante said:
I cannot reconcile the "it's all one big programmable shader" with the "but it doesn't need to be accurate since something else will take care of what it misses".

Early Linux documentation mentioned merging all initial pipeline stages into a single monolithic shader with Vega. My understanding is one shader up until rasterization, which is what I'm basing my argument on.

The accuracy part is conservative, so a miss just results in rasterizing geometry that would be culled in subsequent depth tests within

3dilettante said:
As far as bin sizes go, and possibly batch sizes, the following may be related:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=c3ebac68900de5ad461a7b5a279621a435f5bcec

It's described as being tuned for the more modest Raven Ridge GPU, however.

Code:

+    /* This is ported from Vulkan, but it doesn't make much sense to me.
+    * Maybe it's for RE-Z? But Vulkan doesn't use RE-Z. TODO: Clarify this.
+    */
+    bool ps_can_reject_z_trivially =
+        !G_02880C_Z_EXPORT_ENABLE(db_shader_control) ||
+        G_02880C_CONSERVATIVE_Z_EXPORT(db_shader_control);
+
+    /* Disable binning if PS can kill trivially with DB writes.
+    * Ported from Vulkan. (heuristic?)
+    */
+    if (ps_can_kill &&
+        ps_can_reject_z_trivially &&
+        sctx->framebuffer.state.zsbuf &&
+        dsa->db_can_write) {
+        si_emit_dpbb_disable(sctx);
+        return;
+    }

I believe this code block above the bin sizes is referencing what I described. They are likely depth culling, but as the comment mentions, it is unclear why it's there.

I've seen that code before, however my thinking was the bins vary in screen space dimensions recursively to balance the primitive count. Encoding that may be tricky though and I haven't seen any direct evidence of that occurring.
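For readers who want to see how narrow that condition is, the predicate from the quoted Mesa excerpt can be restated and enumerated. This is only a restatement of the boolean logic in the snippet above with the register-field macros replaced by plain flags; the Python names are made up for illustration.

```python
from itertools import product


def ps_can_reject_z_trivially(z_export_enable: bool,
                              conservative_z_export: bool) -> bool:
    """Mirror of the quoted expression: no Z export, or a conservative one."""
    return (not z_export_enable) or conservative_z_export


def should_disable_binning(ps_can_kill: bool, z_export_enable: bool,
                           conservative_z_export: bool,
                           has_zs_buffer: bool, db_can_write: bool) -> bool:
    """Mirror of the quoted `if` that calls si_emit_dpbb_disable()."""
    return (ps_can_kill
            and ps_can_reject_z_trivially(z_export_enable, conservative_z_export)
            and has_zs_buffer
            and db_can_write)


# Enumerate all 32 flag combinations and count those that disable binning.
disabling = [flags for flags in product((False, True), repeat=5)
             if should_disable_binning(*flags)]
print(len(disabling), "of 32 combinations disable binning")
```

Every disabling combination requires a pixel shader that can kill pixels while depth writes are enabled on a bound depth buffer.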

3dilettante said:
Primitive shaders are a conservative culling mechanism. Outside of some contrived scenario, the shader feeds into that same default mechanism--since the primitive shader's explicit target was saving cycles in the fixed-function path.
Is this all one big shader, or is there something else the shader cannot see directly and thus negating the assertion?

Saving cycles was only one aspect, reducing bandwidth and other resource demands I'd consider a close second when tied in with binning. If a triangle at the end of a list occluded everything, a huge reduction in primitives could exist.

3dilettante said:
The whitepaper diagram has the primitive shader drawn in a box that excludes tessellation and primitive assembly, and it states that the primitive shader is explicitly not the same thing as the surface shader that is emitted when tessellation is on.

That is a matter of defining the scope of the shader. All geometry stages in the case of Vega should (though it is not required) run as one monolithic shader, including primitive shaders. Tying in variable configurations (different products and generations) of fixed function hardware I believe is the reason for limiting the scope. The tightly defined primitive shader you describe could be broadened in scope. The question is how much of the capability gets exposed or is compatible across IHVs. In theory a dev could write a shader for culling and binning that would work across vendors. The practicality of doing so is another matter with closely guarded secrets residing within. That area would be the basis of most of Nvidia's bandwidth and performance advantage.

3dilettante · Nov 15, 2017

Anarchist4000 said:
That was the original intent, but the function could be used recursively to further partition the screen space. Previously there were essentially 4 bins, one per SE.

What do you mean by previously?

As described in patents or other disclosures, a bin is an area of screen space that an SE is responsible for. As far as Vega and its DSBR are indicated by various descriptions and patents, there are far more than 4 of them. The implications of the code are that it is indicating possible intersections of a triangle with how the screen is already divided, so the checks and math are in terms of the bin size.

Each iteration providing a pow2 number of variably sized bins and a triangle could hit every sub-bin within a bin.

The hardware and storage would be sized and aligned to a few sizes. I don't quite see the point in drilling any further down than the hardware wants to go, or why it's beneficial in this instance to iteratively query how much more a triangle that already hits a bin recursively hits it further.

Early Linux documentation mentioned merging all initial pipeline stages into a single monolithic shader with Vega. My understanding is one shader up until rasterization, which is what I'm basing my argument on.

If this is what you are talking about, it looks to me like this is a misreading of how the shader compiler is structured to balance compilation time penalties, handling of API state, cross-section optimization, and packaging of binary output.
https://github.com/anholt/mesa/blob/master/src/gallium/drivers/radeonsi/si_shader.h

Per its text, the monolithic case is the less common and is generally built asynchronously from the base prolog, main, epilog subsections.
It goes on to provide a diagram of the more than one stage Vega has in its geometry pipeline, which does align with the Vega white paper's description of a separate surface shader.

~~I also get the impression from the following that the monolithic mode is not exclusive to gfx9:~~
(edit: misread a section, may not be the case.
Retaining reference, which might make this more broadly applicable, but I need to look at this further:
https://phabricator.pmoreau.org/file/data/adlsdrp6rwtuu6cogvft/PHID-FILE-xpfx6voxzexygkjyrrvp/file )

I believe this code block above the bin sizes is referencing what I described. They are likely depth culling, but as the comment mentions, it is unclear why it's there.

Could that be a case where binning is being disabled because the pixel shader is able to override the coverage information available earlier in the pipeline, potentially in a way that makes the early work redundant or incorrect?

Saving cycles was only one aspect, reducing bandwidth and other resource demands I'd consider a close second when tied in with binning. If a triangle at the end of a list occluded everything, a huge reduction in primitives could exist.

I'm not going that far into the details. I am saying that there is a claim "it is all one big programmable shader" and one of AMD's lead designers saying "there's fixed function hardware not in that shader".

Anarchist4000 · Nov 21, 2017

Sorry, was busy so just getting back to this.

3dilettante said:
What do you mean by previously?

As described in patents or other disclosures, a bin is an area of screen space that an SE is responsible for. As far as Vega and its DSBR are indicated by various descriptions and patents, there are far more than 4 of them. The implications of the code are that it is indicating possible intersections of a triangle with how the screen is already divided, so the checks and math are in terms of the bin size.

Prior to DSBR there would be one bin per SE. Low end hardware one bin, higher tier products with four aligning to number of SEs. Primary difference being lack of any deferred binning.

3dilettante said:
The hardware and storage would be sized and aligned to a few sizes. I don't quite see the point in drilling any further down than the hardware wants to go, or why it's beneficial in this instance to iteratively query how much more a triangle that already hits a bin recursively hits it further.

It would be app dependent, but fully TBDR would likely necessitate finer bins to accommodate the hardware. If just filling a bin and handing it off then I'd agree with your assessment.

3dilettante said:
If this is what you are talking about, it looks to me like this is a misreading of how the shader compiler is structured to balance compilation time penalties, handling of API state, cross-section optimization, and packaging of binary output.

I recall a commit with some additional comments and explanation, but yes that's roughly the code.

Code:

 * Monolithic shaders are shaders where the parts are combined before LLVM
 * compilation, and the whole thing is compiled and optimized as one unit with
 * one binary on the output. The result is the same as the non-monolithic
 * shader, but the final code can be better, because LLVM can optimize across
 * all shader parts. Monolithic shaders aren't usually used except for these
 * special cases:

The "optimize across parts" I've been interpreting as deferred attribute interpolation for example. Just described with generic wording.

3dilettante said:
Could that be a case where binning is being disabled because the pixel shader is able to override the coverage information available earlier in the pipeline, potentially in a way that makes the early work redundant or incorrect?

I've interpreted it as extremely simple shader passes and/or deferred rendering. So primarily redundant for an app with good culling and optimization. Possibly incorrect, but I'd think that requires a scenario where conservative culling somehow already breaks the rendering path.

3dilettante · Nov 21, 2017

Anarchist4000 said:
Prior to DSBR there would be one bin per SE. Low end hardware one bin, higher tier products with four aligning to number of SEs. Primary difference being lack of any deferred binning.

Prior to Vega, I have seen the statically assigned rectangles in screen space called tiles, and it did not seem like the whole set of tiles in the framebuffer linked to an RBE was treated as a single entity.

It would be app dependent, but fully TBDR would likely necessitate finer bins to accommodate the hardware. If just filling a bin and handing it off then I'd agree with your assessment.

Full TBDR goes against AMD's own description of the tech, and subdividing a bin into a smaller set of tiles does not fully control the amount of context the rasterizer needs to juggle. A tile can only be subdivided so finely, and even then it wouldn't stop a flood of primitives with the same bounding box coverage from exceeding on-chip storage. Iteratively working down to whatever sub-tile size is proposed to make things workable leaves an uncertain number of clocks before a primitive can be submitted.
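The storage point can be shown with a toy count: if many primitives share the same small bounding box, shrinking the bins does not reduce how many of them land in the busiest bin. The bin sizes and primitive count below are arbitrary placeholders, and `max_prims_per_bin` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixels


def max_prims_per_bin(bboxes: Iterable[Box], bin_size: int) -> int:
    """Count primitives per bin by bounding box and return the worst case."""
    counts: Counter = Counter()
    for x0, y0, x1, y1 in bboxes:
        for by in range(y0 // bin_size, y1 // bin_size + 1):
            for bx in range(x0 // bin_size, x1 // bin_size + 1):
                counts[(bx, by)] += 1
    return max(counts.values(), default=0)


# A flood of primitives that all cover the same few pixels.
flood = [(4, 4, 11, 11)] * 1000
for size in (64, 32, 16, 8):
    print(size, max_prims_per_bin(flood, size))
```

Every bin size reports the same worst-case load, so subdivision alone cannot bound the context the rasterizer has to hold.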

I recall a commit with some additional comments and explanation, but yes that's roughly the code.

The rationale is fully in the linked code:

/* The compiler middle-end architecture: Explaining (non-)monolithic shaders
* -------------------------------------------------------------------------
*
* Typically, there is one-to-one correspondence between API and HW shaders,
* that is, for every API shader, there is exactly one shader binary in
* the driver.
*
* The problem with that is that we also have to emulate some API states
* (e.g. alpha-test, and many others) in shaders too. The two obvious ways
* to deal with it are:
* - each shader has multiple variants for each combination of emulated states,
* and the variants are compiled on demand, possibly relying on a shader
* cache for good performance
* - patch shaders at the binary level
*
* This driver uses something completely different. The emulated states are
* usually implemented at the beginning or end of shaders. Therefore, we can
* split the shader into 3 parts:
* - prolog part (shader code dependent on states)
* - main part (the API shader)
* - epilog part (shader code dependent on states)
*
* Each part is compiled as a separate shader and the final binaries are
* concatenated. This type of shader is called non-monolithic, because it
* consists of multiple independent binaries. Creating a new shader variant
* is therefore only a concatenation of shader parts (binaries) and doesn't
* involve any compilation. The main shader parts are the only parts that are
* compiled when applications create shader objects. The prolog and epilog
* parts are compiled on the first use and saved, so that their binaries can
* be reused by many other shaders.

Primitive shaders, or the idea that the whole geometry pipeline of Vega is one large shader invocation, do not figure into this.
The optimization of a monolithic shader (usually undesirable) versus the common split shader is a reflection of the information available to the optimization pass. Code can be optimized across the full shader output in a monolithic shader, while the three-part method limits optimization to what is knowable within the prolog, main, or epilogue sections.
Despite the potential limits of the optimization pass, the desired case is to be able to mix and match sections if possible rather than take compilation delays or cache a large number of shader variations.
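A toy model of the non-monolithic scheme described in the quoted comment: prolog and epilog parts are compiled once per state and cached, and a new variant is just a concatenation. `PartCache` and `build_variant` are hypothetical names, and the "compile" step is a placeholder string, not how the driver actually produces binaries.

```python
from typing import Dict, Tuple


class PartCache:
    """Caches compiled prolog/epilog parts, keyed by kind and state."""

    def __init__(self) -> None:
        self._parts: Dict[Tuple[str, str], bytes] = {}
        self.compiles = 0

    def get(self, kind: str, state: str) -> bytes:
        key = (kind, state)
        if key not in self._parts:
            # Placeholder for a real compile; happens on first use only.
            self.compiles += 1
            self._parts[key] = f"<{kind}:{state}>".encode()
        return self._parts[key]


def build_variant(cache: PartCache, main_binary: bytes,
                  prolog_state: str, epilog_state: str) -> bytes:
    """Create a shader variant by concatenation, with no recompilation of main."""
    return (cache.get("prolog", prolog_state)
            + main_binary
            + cache.get("epilog", epilog_state))


cache = PartCache()
variant_a = build_variant(cache, b"<main:A>", "s0", "alpha_test")
variant_b = build_variant(cache, b"<main:B>", "s0", "alpha_test")
print(variant_a, variant_b, cache.compiles)
```

The second variant reuses both cached parts, which is the compile-time saving being traded against cross-part optimization.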

This is a trade-off being made within the driver's shader compiler code, not the hardware pipeline. The linked code goes further to outline the implications of Vega's merged stages, which reduce shader invocations without going to a single shader and create a more complex scenario for concatenating shader sections.

Anarchist4000 · Nov 23, 2017

3dilettante said:
I saw references to larger instruction buffers, which are not the same as the caches. I think some patches may have referenced an increase along the lines of moving from 12 entries to 16. Overall the only clear change I saw in this regard was that the front end would be shared with at most 3 CUs, rather than up to 4 in prior chips.

3dilettante said:
Full TBDR goes against AMD's own description of the tech, and subdividing a bin into a smaller set of tiles does not fully control the amount of context the rasterizer needs to juggle. A tile can only be subdivided so finely, and even then it wouldn't stop a flood of primitives with the same bounding box coverage from exceeding on-chip storage. Iteratively working down to whatever sub-tile size is proposed to make things workable leaves an uncertain number of clocks before a primitive can be submitted.

Approaching TBDR would be more accurate. Subdividing until cache usage is maximized or some ideal bin size reached. Smaller bins would facilitate more accurate occlusion culling of geometry. Possibly with an uneven distribution. Balancing becomes tricky in opposing quadrants. It may appear reminiscent of MPEG block encoding.

I've been working off the hypothesis aggregate bin size will exceed hardware capacity somewhat readily in an ideal case. Not simply filling a batch/tile before passing it on. More TBDR than streaming in application. A similar mechanism for tiling where much of Nvidia's gains originated with their driver. Possibly a next step from what was likely conservatively presented in current papers. Approaching TBDR efficiency overcoming the cost of binning into memory. Replacing ROP overdraw costs with a binning process.

3dilettante said:
The rationale is fully in the linked code:

https://lists.freedesktop.org/archives/mesa-dev/2017-April/152733.html

That's what I recalled specifically. Not quite fully merged, but reduced stages with data in LDS. At least in Vega's case.

3dilettante said:
This is a trade-off being made within the driver's shader compiler code, not the hardware pipeline. The linked code goes further to outline the implications of Vega's merged stages, which reduce shader invocations without going to a single shader and create a more complex scenario for concatenating shader sections.

Part of that hardware pipeline no longer exists though. So it is more than just a compiler vs hardware trade-off for optimization as referenced in the link above.

Rootax · Nov 23, 2017

It's maybe off topic, but about TBDR: is it possible for AMD / nVidia to make a "full/true" TBDR chip, or does Imagination Tech have too many patents on this?

Bondrewd · Nov 23, 2017

Rootax said:
It's maybe off topic, but about TBDR: is it possible for AMD / nVidia to make a "full/true" TBDR chip, or does Imagination Tech have too many patents on this?

Probably possible, but they will never talk about it having TBDR (because Imagination will scream murder then).
Qualcomm does exactly that.

manux · Nov 23, 2017

Rootax said:
It's maybe off topic, but about TBDR: is it possible for AMD / nVidia to make a "full/true" TBDR chip, or does Imagination Tech have too many patents on this?

Probably, imgtech is not the only one with patents. In the distant past there was a company called Gigapixel that was making a TBDR-based chip. Gigapixel was acquired by 3dfx, which was later bought by Nvidia. Nvidia very likely has a healthy amount of patents in the TBDR domain.

Rootax · Nov 23, 2017

Yeah, I remember Gigapixel, but I thought it was "just" a TBR chip. My bad.

CSI PC · Nov 23, 2017

manux said:
Probably, imgtech is not the only one with patents. In the distant past there was a company called Gigapixel that was making a TBDR-based chip. Gigapixel was acquired by 3dfx, which was later bought by Nvidia. Nvidia very likely has a healthy amount of patents in the TBDR domain.

That is a good point, considering the reported similarity between VideoLogic and Gigapixel going back.
Also there is ARM with its Mali GPU from IP it acquired with Falanx, that also goes back to a similar period of time; maybe the closest to the TBDR comparison which could be developed if some of the reports were correct *shrug*.
Still debatable if they could achieve the same level of tile-based deferred rendering as PowerVR (per pixel level) evolving those original IP technologies.

Esrever · Nov 23, 2017

Isn't Imagination up for sale? If someone wanted their IPs they could just buy them. The company is tiny compared to Nvidia/AMD.

manux · Nov 23, 2017

How long are patents valid? Will the late 90's patents become free for everyone soon?

Entropy · Nov 23, 2017

Esrever said:
Isn't Imagination up for sale? If someone wanted their IPs they could just buy them. The company is tiny compared to Nvidia/AMD.

It’s sold.

BRiT · Nov 23, 2017

manux said:
How long are patents valid? Will the late 90's patents become free for everyone soon?

However, there are tricks that can be played to extend the life of a patent, such as doing tweaks to it and filing it anew.

In the United States, for utility patents filed on or after June 8, 1995, the term of the patent is 20 years from the earliest filing date of the application on which the patent was granted and any prior U.S. or Patent Cooperation Treaty (PCT) applications from which the patent claims priority. For patents filed prior to June 8, 1995, the term of patent is either 20 years from the earliest filing date as above (excluding provisional applications) or 17 years from the issue date, whichever is longer. Extensions may be had for certain administrative delays.

The patent term will additionally be adjusted to compensate for delays in the issuance of a patent. The reasons for extensions include:

Delayed response to an application request for patent.
Exceeding 3 years to consider a patent application.
Delays due to a secrecy order or appeal.

Because of significant backlog of pending applications at the USPTO, the majority of newly issued patents receive some adjustment that extends the term for a period longer than 20 years.
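The basic term rule quoted above reduces to simple date arithmetic. The sketch below covers only the nominal 20-year and 17-year rules from the quoted text. It ignores term adjustments, extensions, priority claims, and anything else that affects a real patent, and `nominal_expiry` is a hypothetical helper, not legal advice.

```python
from datetime import date

# Cut-over date given in the quoted text.
CUTOVER = date(1995, 6, 8)


def add_years(d: date, years: int) -> date:
    """Add whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def nominal_expiry(filed: date, issued: date) -> date:
    """Nominal U.S. utility patent expiry, before any adjustments."""
    twenty_from_filing = add_years(filed, 20)
    if filed >= CUTOVER:
        return twenty_from_filing
    # Filed before the cut-over: the longer of the two terms applies.
    return max(twenty_from_filing, add_years(issued, 17))


print(nominal_expiry(date(1996, 3, 1), date(1999, 1, 5)))
print(nominal_expiry(date(1994, 1, 10), date(1998, 5, 5)))
```

By this rule alone, a patent filed in the late 1990s would nominally run out in the late 2010s, subject to the adjustments listed above.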

Mat3 · Nov 24, 2017

One of the most hotly contested subjects on these boards for TBDRs in the past is the storage and bandwidth cost of binning all the post transformed triangles for a modern desktop game.
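A back-of-the-envelope calculator for that cost, as a sketch only. Every input below (triangle count, bytes per vertex, bin references per triangle, frame rate) is a made-up placeholder and not a measurement of any game or GPU, and the model assumes no vertex sharing and one write plus one read of the binned data per frame.

```python
def binned_geometry_bytes(triangles: int, bytes_per_vertex: int,
                          verts_per_tri: float = 3.0,
                          bins_per_tri: float = 1.0,
                          bytes_per_bin_ref: int = 4) -> int:
    """Bytes needed to store post-transform vertices plus per-bin references."""
    vertex_bytes = triangles * verts_per_tri * bytes_per_vertex
    ref_bytes = triangles * bins_per_tri * bytes_per_bin_ref
    return int(vertex_bytes + ref_bytes)


def bandwidth_gb_per_s(frame_bytes: int, fps: int) -> float:
    """Traffic if the binned data is written once and read once per frame."""
    return 2 * frame_bytes * fps / 1e9


# Placeholder inputs, chosen only to exercise the arithmetic.
frame_bytes = binned_geometry_bytes(triangles=2_000_000, bytes_per_vertex=32)
print(frame_bytes / 1e6, "MB per frame")
print(bandwidth_gb_per_s(frame_bytes, fps=60), "GB/s")
```

The result scales linearly with triangle count and vertex size, which is why the argument tends to hinge on how much geometry a given workload actually pushes.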
