AMD: Navi Speculation, Rumours and Discussion [2017-2018]

[Image: fermipipeline_memoryflow.png - Fermi pipeline memory flow diagram from the article below]

http://pixeljetstream.blogspot.de/2015/02/life-of-triangle-nvidias-logical.html

Edit: @pixeljetstream, welcome and thanks for posting interesting info on your website. It seems you have contributed quite a bit to GTC as well.
http://on-demand-gtc.gputechconf.co...&sessionYear=&sessionFormat=&submit=&select=+

So which cache is the crossbar that distributes to the raster units in the other linked slides?


I think it's saying those are just placeholders for tests, not that the placeholders are wrong.
 
So which cache is the crossbar that distributes to the raster units in the other linked slides?
My working theory for a while has been: syncing via the L2 cache is the reason why, in certain synthetic tests like the B3D suite, geometry rates start to be limited by the individual L2's R/W rate, hence no scaling beyond 4 GPCs in those synthetics (it will be a while before we reach that limit in real-world scenarios). My best guess is that, after syncing, it is determined whether the processed geometry can stay inside the GPC or has to move out.
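
A back-of-envelope way to express that kind of ceiling (the numbers below are purely hypothetical placeholders, not measurements for any real GPU; only the relationship matters):

```cpp
#include <cstdio>

int main() {
    // Hypothetical placeholder values -- not measured figures for any GPU.
    const double l2_rw_bytes_per_clk     = 64.0; // assumed R/W rate of the syncing L2 path
    const double sync_bytes_per_triangle = 32.0; // assumed data exchanged per primitive sync
    const int    gpcs                    = 6;

    // If every cross-GPC primitive has to be synced through the same L2 path,
    // that path's R/W rate caps the geometry rate regardless of GPC count.
    const double tris_per_clk_ceiling = l2_rw_bytes_per_clk / sync_bytes_per_triangle;
    printf("Synthetic-test ceiling: ~%.1f tris/clk, independent of the %d GPCs\n",
           tris_per_clk_ceiling, gpcs);
    return 0;
}
```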
 
So which cache is the crossbar that distributes to the raster units in the other linked slides?

Was the Nvidia Distributed Tiled Cache patent ever posted?
http://www.freepatentsonline.com/y2014/0118361.html
Figures 2, 3 and 3A are of interest for the points below and worth checking out while reading.
In graphics and compute applications, GPC 208 may be configured such that each SM 310 is coupled to a texture unit 315 for performing texture mapping operations, such as determining texture sample positions, reading texture data, and filtering texture data.

In operation, each SM 310 transmits a processed task to work distribution crossbar 330 in order to provide the processed task to another GPC 208 for further processing or to store the processed task in an L2 cache (not shown), parallel processing memory 204, or system memory 104 via crossbar unit 210. In addition, a pre-raster operations (preROP) unit 325 is configured to receive data from SM 310, direct data to one or more raster operations (ROP) units within partition units 215, perform optimizations for color blending, organize pixel color data, and perform address translations.

It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Among other things, any number of processing units, such as SMs 310, texture units 315, or preROP units 325, may be included within GPC 208. Further, as described above in conjunction with FIG. 2, PPU 202 may include any number of GPCs 208 that are configured to be functionally similar to one another so that execution behavior does not depend on which GPC 208 receives a particular processing task. Further, each GPC 208 operates independently of the other GPCs 208 in PPU 202 to execute tasks for one or more application programs. In view of the foregoing, persons of ordinary skill in the art will appreciate that the architecture described in FIGS. 1-3A in no way limits the scope of the present invention.

Graphics Pipeline Architecture
FIG. 3B is a conceptual diagram of a graphics processing pipeline 350 that may be implemented within PPU 202 of FIG. 2, according to one embodiment of the present invention. As shown, the graphics processing pipeline 350 includes, without limitation, a primitive distributor (PD) 355; a vertex attribute fetch unit (VAF) 360; a vertex, tessellation, geometry processing unit (VTG) 365; a viewport scale, cull, and clip unit (VPC) 370; a tiling unit 375, a setup unit (setup) 380, a rasterizer (raster) 385; a fragment processing unit, also identified as a pixel shading unit (PS) 390, and a raster operations unit (ROP) 395.
.....
The geometry processing unit is a programmable execution unit that is configured to execute geometry shader programs, thereby transforming graphics primitives. Vertices are grouped to construct graphics primitives for processing, where graphics primitives include triangles, line segments, points, and the like. For example, the geometry processing unit may be programmed to subdivide the graphics primitives into one or more new graphics primitives and calculate parameters, such as plane equation coefficients, that are used to rasterize the new graphics primitives.

The geometry processing unit transmits the parameters and vertices specifying new graphics primitives to the VPC 370. The geometry processing unit may read data that is stored in shared memory for use in processing the geometry data. The VPC 370 performs clipping, culling, perspective correction, and viewport transform to determine which graphics primitives are potentially viewable in the final rendered image and which graphics primitives are not potentially viewable. The VPC 370 then transmits processed graphics primitives to the tiling unit 375.

The tiling unit 375 is a graphics primitive sorting engine that resides between a world space pipeline 352 and a screen space pipeline 354, as further described herein. Graphics primitives are processed in the world space pipeline 352 and then transmitted to the tiling unit 375. The screen space is divided into cache tiles, where each cache tile is associated with a portion of the screen space. For each graphics primitive, the tiling unit 375 identifies the set of cache tiles that intersect with the graphics primitive, a process referred to herein as “tiling.” After tiling a certain number of graphics primitives, the tiling unit 375 processes the graphics primitives on a cache tile basis, where graphics primitives associated with a particular cache tile are transmitted to the setup unit 380. The tiling unit 375 transmits graphics primitives to the setup unit 380 one cache tile at a time. Graphics primitives that intersect with multiple cache tiles are typically processed once in the world space pipeline 352, but are then transmitted multiple times to the screen space pipeline 354.

Such a technique improves cache memory locality during processing in the screen space pipeline 354, where multiple memory operations associated with a first cache tile access a region of the L2 caches, or any other technically feasible cache memory, that may stay resident during screen space processing of the first cache tile. Once the graphics primitives associated with the first cache tile are processed by the screen space pipeline 354, the portion of the L2 caches associated with the first cache tile may be flushed and the tiling unit may transmit graphics primitives associated with a second cache tile. Multiple memory operations associated with a second cache tile may then access the region of the L2 caches that may stay resident during screen space processing of the second cache tile. Accordingly, the overall memory traffic to the L2 caches and to the render targets may be reduced. In some embodiments, the world space computation is performed once for a given graphics primitive irrespective of the number of cache tiles in screen space that intersects with the graphics primitive.
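
To make the "tiling" step the patent describes a bit more concrete, here is a minimal sketch of binning primitives into cache tiles and replaying them tile by tile. The tile size, data structures and batching policy are my own assumptions for illustration; the patent does not specify them.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Prim { float xmin, ymin, xmax, ymax; };  // screen-space bounding box

struct Tiler {
    int tilePx;                               // assumed cache-tile edge length in pixels
    int tilesX, tilesY;
    std::vector<std::vector<uint32_t>> bins;  // bins[tileIndex] -> primitive indices

    Tiler(int w, int h, int tile) : tilePx(tile),
        tilesX((w + tile - 1) / tile), tilesY((h + tile - 1) / tile),
        bins(tilesX * tilesY) {}

    // "Tiling": record every cache tile the primitive's bounding box touches.
    void bin(uint32_t primId, const Prim& p) {
        int tx0 = std::max(0, int(p.xmin) / tilePx);
        int ty0 = std::max(0, int(p.ymin) / tilePx);
        int tx1 = std::min(tilesX - 1, int(p.xmax) / tilePx);
        int ty1 = std::min(tilesY - 1, int(p.ymax) / tilePx);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(primId);  // replayed once per tile
    }

    // After a batch, replay one cache tile at a time so that tile's
    // render-target footprint can stay resident in an L2-sized cache.
    template <typename F>
    void flush(F&& rasterizeTileBatch) {
        for (auto& bin : bins) {
            if (!bin.empty()) rasterizeTileBatch(bin);
            bin.clear();
        }
    }
};
```

Note how a primitive whose bounding box crosses a tile boundary ends up in several bins, matching the patent's point that world-space work is done once but the primitive is transmitted to the screen-space pipeline multiple times.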
 
Those who've seen what the industry does if there's even a remote amount of fragility, or who have software written prior to the introduction of said product.
Game production managers would prefer the former. Tim Sweeney has on a number of occasions cited productivity as the limiter, going so far as to state a willingness to trade performance for productivity in development.
Getting even to a decent level of multi-core coding has taken decades at this point, and that is with all the elements that make it as transparent as possible, in architectures vastly more robust than what GPUs have.

This is really no problem, as the hardware access is behind so many layers that you can make it transparent at the API level. DX11 co-exists with DX12 for that reason. DX12 already has all the API needed to cover chiplets, but well, if you don't even want to see that there are chiplets, then drop in an alternative kernel implementation which exposes exactly one graphics queue and one compute queue, for compatibility's sake. Or lift it up to DX13.
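
At the application level, that "one graphics queue, one compute queue" view is just ordinary D3D12 usage; any chiplet-hiding would live entirely in the driver/kernel below it, which is exactly the part being speculated about and is not shown here. A minimal sketch:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Compatibility-layer view: the application sees one device with exactly one
// graphics (direct) queue and one compute queue, no matter how many chips the
// driver/kernel might hide behind that device.
bool CreateSingleQueueDevice(ComPtr<ID3D12Device>& device,
                             ComPtr<ID3D12CommandQueue>& gfxQueue,
                             ComPtr<ID3D12CommandQueue>& computeQueue) {
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device))))
        return false;

    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // the one graphics queue
    if (FAILED(device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue))))
        return false;

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;     // the one compute queue
    return SUCCEEDED(device->CreateCommandQueue(&compDesc,
                                                IID_PPV_ARGS(&computeQueue)));
}
```

Whether that single device then scales across N dies or performs like one of them is purely a driver/kernel question, which is what the reply further down asks about.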

From experience I can tell you, bringing an engine to support multi-core CPUs was a way, way more prolonged and painful path than bringing one to support multi-GPU. The number of bugs occurring for the former is still very high, by the nature of it; the number of bugs occurring for the latter is ... fairly low, because it's all neatly protectable [in DX12] and bakeable into management code.

I wish I could address individual CUs already, so I could have my proto-chiplet algorithms stabilize before hitting the main target.

In any case, for GPUs, I don't believe you have to drop all this behaviour into the circuits; you have the luxury of thick software layers which you can bend. A super-cluster has no hardware to make the cluster itself appear coherent, but it has software to do so to some degree. The scheduling is below a thick layer of software; it has to be, there are way too many sub-systems involved to do this yourself.

I wonder about the impact of the x86 memory model on people's mind-set. I had to target ARM multi-core lately, and uggg, no coherence sucks if all your sync and atomic code depends on it. But then the C++11 threading API was a great relief, because I only need to state what I need and then it will be done one way or another. I then started believing that incorporating these specifics into your code base without a wrapper in the first place is counter-productive. Lesson: don't hack away; anticipate variance, do good software design, be verbose, really really verbose, very semantic (it will vanish in the compiler) - then you won't have that many problems when your target changes "radically".
The language/API/ISA is only half the solution regardless, because you have batch cache invalidations, for example; in general you have to design the data-sync points more carefully, more semantically. :)
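
To illustrate the "state what you need" point with the standard API: a minimal release/acquire publish pattern in C++11, where the compiler picks whatever fences the target (x86, ARM, ...) actually requires.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// State the intent; let the compiler/runtime pick the fences for the target.
std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                   // plain write
    ready.store(true, std::memory_order_release);   // "publish": everything before becomes visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // "subscribe": sees everything published
        std::this_thread::yield();
    printf("payload = %d\n", payload);              // guaranteed to print 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```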

But hey, all this is comfort-land in comparison with PS3-to-PS4 transitions (for all the mentioned reasons).
 
This is really no problem, as the hardware access is behind so many layers that you can make it transparent at the API level. DX11 co-exists with DX12 for that reason. DX12 already has all the API needed to cover chiplets, but well, if you don't even want to see that there are chiplets, then drop in an alternative kernel implementation which exposes exactly one graphics queue and one compute queue, for compatibility's sake. Or lift it up to DX13.
Is the alternative kernel with one queue for each type expected to scale with an MCM GPU, or is it expected to perform at 1/N the headline performance on the box? Is it expected to run with 1/N the amount of memory?

One interpretation of the statements of the then-head of RTG was that AMD's plan was to have developers manage the N-chip case explicitly, not to extend the abstraction.
I think this would reduce the desirability of such a product for a majority of the gaming market, absent some changes and a period of time where AMD proves that this time it's different. There's been evidence in the past that SLI-on-a-stick cards lost ground against the largest single-GPU solutions, and multi-GPU support from the IHVs has regressed in recent products.

If there's certainty that competing products will make the same jump to multi-GPU with AMD's claimed level of low-level exposure, then AMD might be able to make the case that there is no alternative. Even then, legacy code (e.g. everything at hardware launch time) might hinder adoption versus the known-working prior generations.
If the possibility exists that a larger and more consistent single-GPU competitor might launch, or gains might not be consistent for existing code, that's a riskier bet for AMD to take.

Further, if the handling of the multi-adapter case is equally applicable to an MCM GPU as it is to two separate cards, how does an MCM product distinguish itself?

From experience I can tell you, bringing an engine to support multi-core CPUs was a way, way more prolonged and painful path than bringing one to support multi-GPU. The number of bugs occurring for the former is still very high, by the nature of it; the number of bugs occurring for the latter is ... fairly low, because it's all neatly protectable [in DX12] and bakeable into management code.
At this point, however, nobody can buy a one-core CPU, and software can be expected to use at least 2-4 cores. There's been limited adoption and potentially a negative trend for SLI and Crossfire. The negative effect those implementations have had on user experience over the years isn't easily forgotten, and many games aren't implemented to support multi-GPU at all while the vendors are lowering the maximum device count they'll support.

Fair or not, the less-forgiving nature of CPU concurrency has been priced in as an unavoidable reality, and the vendors have made the quality of that infrastructure paramount despite how difficult it is to use.

I wish I could address individual CUs already, so I could have my proto-chiplet algorithms stabilize before hitting the main target.
This gives me the impression of wanting things both ways. The CUs are currently held beneath one or two layers of hardware abstraction and management within the ASIC itself, and those would be below the layers of software abstraction touted earlier. There are specific knobs that console and driver devs might have and high-level constructs that give some hints for the low level systems, but there are architectural elements that would run counter to exposing the internals further.

In any case, for GPUs, I don't believe you have to drop all this behaviour into the circuits; you have the luxury of thick software layers which you can bend.
Not knowing the specifics of the implementation, there are potentially the game, engine, API, driver (user space/kernel), front-end hardware, and back-end hardware levels of abstraction, with some probable omissions/variations.
The lack of confidence in many of those levels is where the desire for a transparent MCM solution comes from.

A super-cluster has no hardware to make the cluster itself appear coherent, but it has software to do so to some degree. The scheduling is below a thick layer of software; it has to be, there are way too many sub-systems involved to do this yourself.
I think there are a number of implementation details that can change the math on this, and if a cluster uses message passing it could skip the illusion of coherence in general. The established protocols are heavily optimized throughout the stack, however. I'm not sure how comparable the numbers are for some of the cluster latency figures versus some of those given for GPU synchronization.
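
For what it's worth, a minimal sketch of that message-passing style, where nothing pretends to be coherent and data only moves when it is explicitly sent (standard MPI calls from C++; the neighbour-exchange framing is just an example I picked):

```cpp
#include <mpi.h>
#include <vector>

// No shared, coherent memory between nodes: each rank owns its slice of the
// data, and state only moves when it is explicitly sent and received.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<float> local(1024, float(rank));  // this rank's private slice
    float haloFromLeft = 0.0f;

    // Explicit exchange with the left/right neighbour instead of relying on
    // any coherence protocol.
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    MPI_Sendrecv(&local.back(), 1, MPI_FLOAT, right, 0,
                 &haloFromLeft, 1, MPI_FLOAT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```

Whether the per-hop latencies of such an exchange are anywhere near the figures quoted for GPU synchronization is, as said, the open question.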

I wonder about the impact of the x86 memory model on people's mind-set. I had to target ARM multi-core lately, and uggg, no coherence sucks if all your sync and atomic code depends on it. But then the C++11 threading API was a great relief, because I only need to state what I need and then it will be done one way or another.
Relying on the C++11 standard doesn't remove the dependence on the hardware's memory model. It maps the desired higher-level behaviors to the fences or barriers provided by a given architecture. For the more weakly-ordered accesses that aren't considered synchronized, x86's regular accesses are considered too strong, but for synchronization points x86 is considered only somewhat stronger than strictly necessary. Its non-temporal instructions are considered too weak.

More weakly-ordered architectures have more explicitly synchronized accesses or heavyweight barriers, and the architectural trend for CPUs from ARM to Power has been to implement a more strongly-ordered subset closer to x86 for the load-acquire/store-release semantics.
The standard's unresolved issues frequently cover cases where parts of its model are violated, often when weaker hardware models turn out to be unable to meet the standard's assumptions, or unable to do so without impractically heavy barriers.
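
As a concrete illustration of that mapping, the same C++11 orderings annotated with the instruction sequences they typically lower to (the lowerings in the comments are to the best of my knowledge for common compilers; check the generated code for your exact toolchain):

```cpp
#include <atomic>

std::atomic<int> flag{0};

void release_store() { flag.store(1, std::memory_order_release); }
int  acquire_load()  { return flag.load(std::memory_order_acquire); }
void seq_cst_store() { flag.store(1, std::memory_order_seq_cst); }

// Typical lowering:
//   x86-64: release store -> plain MOV (ordinary stores are already strong enough)
//           acquire load  -> plain MOV
//           seq_cst store -> XCHG (or MOV + MFENCE)
//   ARMv8:  release store -> STLR, acquire load -> LDAR
//           (the "strongly-ordered subset closer to x86" mentioned above)
//   ARMv7:  no acquire/release instructions -> DMB barriers around plain accesses
```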

That aside, the question isn't so much whether an architecture is as strongly ordered as x86, but whether the architecture has defined and validated its semantics and implementations, or has the necessary elements to do so. The software standard that sits at the higher level of abstraction assumes the foundation the hardware supplies is sound, and some of its thornier problems arise when it is not. The shader-level hierarchy would be nominally compliant with the standard, but the GPU elements not exposed to the shader ISA are not held to it and have paths that go around it.
 
Further, if the handling of the multi-adapter case is equally applicable to an MCM GPU as it is to two separate cards, how does an MCM product distinguish itself?

Wouldn't there be a much faster interconnect between the individual chips and their directly attached resources and wouldn't that yield a substantial performance benefit over a standard multi-adapter setup?
 
Wouldn't there be a much faster interconnect between the individual chips and their directly attached resources and wouldn't that yield a substantial performance benefit over a standard multi-adapter setup?
It's a good item to leverage if the bandwidth is there.
In a more explicitly developer-managed system, it seems like this would be a different class of multi-adapter where certain combinations of independent devices have non-uniform performance effects.
Copy operations would be faster if they were finding themselves limited by bandwidth before.

If the game's multi-adapter path was coded with traditional multi-slot bandwidth in mind, the MCM GPU could find itself in a similar situation to how Vega's HBCC shows more arguable benefits when most games remain coded for the constraints of regular cards.
HBCC's handling is already more hardware-managed than the scenario I was addressing.

Application-level queues or device commands used to initiate transfers or sync the devices wouldn't necessarily see this interconnect, since that leaves each chip engaged in its own back-and-forth with the host side. That direction is less about bandwidth than it is about scheduling and device latencies, although creating fast paths in the architecture that let the chips collaborate on their own could leverage the interconnect.
If this does not replace traditional multi-card setups, it's an additional niche to target.
 
Well, I was thinking about this some more and have some thoughts. I think the key to AMD's strategy with a multi-chip Navi centers around the DSBR and its deferred work feature. If we consider opaque triangles with no writes to UAVs in the pixel shader (and no memory accesses other than the initial geometry in the geometry stages), you can guess at an interchip bandwidth/performance optimization strategy.

If the chips are set up similarly to a NUMA system, with each chip having one or more memory channels, and we assume the channels are logically striped at a per-chiplet granularity (in case I'm not being clear, I mean all memory channels hooked up to chiplet 1 form the first stripe, chiplet 2 the second stripe, and so on), my guess is that they localize all work up to and including rasterization to that chiplet. They perform binning and defer the work until visibility is worked out. Then the temporary tiles created in each chiplet are compared against each other in the chiplet whose local memory contains the backing store of the frame buffer tile in question; final visibility is determined and pixel shading proceeds in that chiplet.

This doesn't solve the interchip bandwidth problem for texture accesses, but it does solve it for geometry and framebuffer. Since the DSBR with deferred work reduces the number of pixels to be shaded, it reduces texture access bandwidth somewhat as well. This may require a different memory layout for the tiles of the frame buffer, but that shouldn't be a problem...
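
A rough sketch of the ownership rule described above, as I understand it. Everything here (the striping function, tile size, chiplet count, the PartialTile contents) is my own guess for illustration, not anything AMD has described:

```cpp
#include <cstdint>

// Hypothetical layout: frame buffer cache tiles striped across chiplets, so
// each tile has exactly one "home" chiplet whose local memory backs it.
constexpr int kChiplets = 4;   // assumed chiplet count
constexpr int kTilePx   = 32;  // assumed cache-tile edge in pixels

// Which chiplet owns the backing store for the tile containing pixel (x, y)?
int HomeChiplet(int x, int y, int tilesPerRow) {
    int tileX = x / kTilePx;
    int tileY = y / kTilePx;
    int tileIndex = tileY * tilesPerRow + tileX;
    return tileIndex % kChiplets;  // simple striping at tile granularity
}

// Per the theory above: every chiplet bins and rasterizes its share of the
// geometry locally, producing a partial per-tile visibility result. Those
// partial tiles are shipped only to the tile's home chiplet, which merges
// final visibility and runs pixel shading against its local frame buffer --
// so compact tile summaries, rather than raw geometry or frame-buffer
// traffic, are what cross the inter-chip links.
struct PartialTile {
    uint32_t tileIndex;
    uint64_t coverageMask;  // stand-in for whatever visibility data is kept
};
```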

I'll think some more about other types of work later; thanks for listening.
 
What kind of additional latency do you guys expect from inter-chip(let) connections anyway? It's not like with 3D rendering in Cinema 4D or Blender, where you have a nice sorting up front and then lots of rendering happening in tiny tiles.
 
What kind of additional latency do you guys expect from inter-chip(let) connections anyway? It's not like with 3D rendering in Cinema 4D or Blender, where you have a nice sorting up front and then lots of rendering happening in tiny tiles.
Isn't that made even more complex by such a design looking to hide/reduce latency, but then being impacted, and not in a linear fashion, by certain workloads while others are insensitive (low parallelism), plus other variables such as caches and writebacks?
There doesn't seem to be an easy answer to your question unless you limit the scope of real-world use.
 
In the AMD roadmap, the keyword describing Navi is "Scalability". I believe that is where the concept was born?

I agree this is probably where it started but it makes no sense to me. We know that for whatever reason AMD has not "scaled" past 64 compute units and we also know they can't sit at 64 compute units forever. In my mind the most likely interpretation of "Scalability" for Navi is just that it will be the first AMD GPU core to go past this number.
 
I think one other argument is "it's their only chance to compete": they don't have the resources to make a big fat monolithic die that can compete with Nvidia (Fiji, Vega, ...) anymore. I hope I'm wrong, but the difference in R&D budget is so huge... I'm not sure they can pull off an R300 or, in another scenario, an RV770 again...
 