Next Generation Hardware Speculation with a Technical Spin [2018]

It largely depends on the quality of HRT titles. If it's massive and it generates demand, then everyone should be offering it at all price points. But then again, how RT acceleration is handled is the big question.

Having only seen Nvidia's implementation, we are anchoring our opinion of what RT hardware looks like.
 
SMT is the term for the general concept of running multiple threads through the same core by allowing different threads to take up different resources in parallel; vendors may or may not give their own marketing names to their individual implementations.

As far as special hardware, I am unsure what counts as special. At its most basic, the core needs to keep track of the context that belongs solely to each thread, like the next instruction pointer and control settings, and it has to track instructions well enough that their data and results never mix with those of other threads. That can be a few context registers and a register tracking table that keeps two independent lists. Out-of-order hardware can often be adjusted to handle SMT with little increase in hardware, since out-of-order execution already keeps track of specific instructions so that they do not accidentally interact with the wrong data or results.
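To make that concrete, here is a minimal, hypothetical sketch (not any vendor's actual design) of roughly what the SMT-specific state amounts to in a 2-way SMT core: a couple of small per-thread context blocks plus thread tags in the shared tracking structures, with everything else shared.

```cpp
// Hypothetical sketch only: the per-thread state an SMT core must duplicate is
// tiny compared with the shared out-of-order machinery around it.
#include <array>
#include <cstdint>

struct ThreadContext {          // duplicated once per hardware thread
    uint64_t next_instruction;  // per-thread instruction pointer
    uint64_t control_flags;     // per-thread control/status settings
};

struct TrackingEntry {          // shared instruction/register tracking table
    uint8_t  thread_id;         // tag keeps each thread's data and results separate
    uint16_t arch_reg;          // architectural register being tracked
    uint16_t phys_reg;          // physical register actually holding the value
};

struct SmtCoreModel {
    std::array<ThreadContext, 2>   contexts;  // 2-way SMT: two small context blocks
    std::array<TrackingEntry, 192> tracking;  // one shared pool, partitioned by thread_id
    // Execution units, caches and schedulers are shared and largely thread-agnostic:
    // they only see tagged operations, so they need little per-thread duplication.
};
```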

In the most straightforward cases, not much is going to really stand out. Most units do not need to know what thread is using them, so long as their sources of data and outputs are directed appropriately. A small number of generic-looking registers here and there aren't going to stand out.
Performance enhancements like duplicating hardware per thread may be noticed by someone knowledgeable enough about the hardware to know where to look, although there can be other reasons why a given block appears to have more copies than expected.

That's a bummer. I hope someone could investigate whether, for example, Ryzen 7 is the exact same chip as Ryzen 5, except the 5 has the SMT hardware disabled inside it through other means like firmware/software/motherboards.

Is it true that SMT affects clockspeeds? Does this apply to Ryzen?
 
It's not affecting clockspeeds so much as it's affecting thermals: do more work and you produce more heat or consume more power. This indirectly impacts clockspeed because today's CPUs are throttled based on thermal targets or power consumption.
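A greatly simplified sketch of that feedback loop (real CPUs do this in firmware/power-management logic, and the thresholds and step size here are made up): the clock only comes down when measured power or temperature exceeds the configured limits, regardless of whether SMT is the reason more work is getting done.

```cpp
// Toy model of TDP/thermal-driven clock control; all numbers are arbitrary.
#include <algorithm>

double next_clock_ghz(double current_ghz, double watts, double temp_c,
                      double tdp_watts, double temp_limit_c,
                      double min_ghz, double max_ghz) {
    if (watts > tdp_watts || temp_c > temp_limit_c)
        return std::max(min_ghz, current_ghz - 0.025);  // step down while over budget
    return std::min(max_ghz, current_ghz + 0.025);      // otherwise ramp back toward max
}
```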
 
AMD has stated there's just one Zen chip for the 1x00 Ryzen products and a different revision for the 2x00 products, though if you want to see Zen without SMT, that's Ryzen 3. There's no motivation to make a different chip, as removing the elements specific to SMT provides negligible savings (and would cost millions of dollars in engineering and compromised binning to create two almost identical products). Outside of a small set of context elements, the rest of the hardware is no different, and the impact of the SMT-specific elements is unlikely to show up over a wide range of more pressing bottlenecks. SMT is readily disabled by software for the chips that come with SMT available, and is likely disabled by some kind of blown fuse for the products sold without it. The chip just leaves the SMT-specific elements unused or ignored.

The extra tracking and logic in SMT by itself is dwarfed by the rest of a massive OoO engine, many deep pipelines, and wide execution and memory resources. It's not likely that those small elements in the simplest 2-thread case would become the critical path when there are so many more large elements with more pressing scaling challenges. As noted by BRiT, the higher utilization that SMT can allow by filling in stall cycles in parts of the chip can increase power consumption, which can make a core hit limits more readily. If there are enough cores with sufficient resource utilization gaps, this may allow a chip to meet the TDP specifications of a slightly higher-clocked bin when it is tested during manufacturing.
 
Would you happen to have a reference for that statement about there being one chip? It makes sense to me, though.

Btw, is it true that Intel also uses the same chips for desktops and laptops?

Basically just one or a few designs, and then they "artificially nip and tuck" for their desired price/performance segmentation?

Very interesting, thanks. So things can cost the same, just needing a better cooler. Then again, there's binning, right? Even if the chips cost nearly the same to build from high to low, say Ryzen 3 and 7, AMD will have to charge more for Sony to use Ryzen 7 features.

Btw, sorry guys, I'm fairly new to these kinds of details, and I appreciate the discussion staying here as I'm more interested in the context of consoles.
 
Would you happen to have a reference for that statement about there being one chip?
The specifics were announced as far back as the initial launch of the various products.
Epyc's launch discussed how it used the same chip as Ryzen, and various product reviews of the Ryzen families show how the lower products have progressively more cores and L3 cache capacity disabled.
https://www.pcper.com/reviews/Proce...cessor-Launch-Gunning-Xeon/Architectural-Outl

https://www.anandtech.com/show/1124...0x-vs-core-i5-review-twelve-threads-vs-four/2


The APU is a different Raven Ridge die, and the Ryzen 2 non-APU chips are a port of the original Ryzen's Summit Ridge chip to a slightly more refined process.

Btw, is it true that Intel also uses the same chips for desktops and laptops?
The same chip can go into multiple product lines, though the highest end desktop/enthusiast ones are derived from higher-end server chips that generally aren't portable outside of a niche like workstation laptops that are mobile only in the sense that they can be carried from wall plug to wall plug with little battery life.

Basically just one or a few designs, and then they "artificially nip and tuck" for their desired price/performance segmentation?
Generally, this is the case. Silicon chips are individually cheap due to mass production, but the up-front costs of engineering a new implementation are massive. Tweaking the same chip saves those costs and allows less-than-perfect chips to be salvaged. AMD's current line is somewhat unusual in the number of markets covered by virtually the same chip, but even though Intel has a fair number of different dies, they go into a vast and confusing number of different product lines.
 
It may be in AMD's interest to pretty much just bin different chips, but that doesn't mean MS or Sony buy a single design and can bin in the same way.

So from a manufacturing standpoint there may be little reason to produce chips that don't have SMT; just disable it at the hardware level for binning purposes or to meet a quota.
But AMD would be able to sell SMT as a feature of the design, so it would cost MS and Sony more to use it.

So for MS and Sony there would be a cost to SMT, above and beyond thermals etc.
That's how I would expect AMD to sell design features, anyway.
Not replying to anyone in particular, just adding my 2 cents.
 
For example: the engine can use much simpler geometry for reflection rays off rough surfaces (blurry reflections), and that performance gain may be used for more rays on that surface (which is more necessary on rough surfaces to reduce noise). Is a generic BVH optimal for that? Are there other clever ways to organize your data that could favor such tricks, and that would ultimately improve quality and be more performant?
DXR 1.0 does not include geometry LOD or geometry shaders (though these features are being considered for the future), probably because raytracing was intended as an addition to the rasterization and compute pipelines.

You can build/modify the command list exclusively for the separate reflection pass (but AFAIK that would require multi-GPU support to run in parallel to AO and GI).

There is the question of maybe reusing the same BVH structure you use for ray tracing to accelerate other things, like collision detection (not ray based), which I'm not sure can be done under Nvidia's scheme.
You can't. You have access to command lists and bundles, but you cannot access acceleration structures after the video driver builds them from your geometry (AFAIK, you can physically access the memory region, but the data format is proprietary and not disclosed).
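For reference, this is roughly what a DXR 1.0 bottom-level build looks like from the application side: you hand the driver plain triangle geometry plus destination/scratch buffers, and it writes an opaque, vendor-specific blob you can bind for tracing but never parse. A minimal sketch assuming a device, command list and GPU buffers already exist (resource creation, sizing queries and barriers omitted):

```cpp
// Sketch only: error handling, buffer allocation and UAV barriers are omitted.
#include <d3d12.h>

void build_blas(ID3D12Device5* device, ID3D12GraphicsCommandList4* cmd_list,
                D3D12_GPU_VIRTUAL_ADDRESS vertex_buffer, UINT vertex_count,
                D3D12_GPU_VIRTUAL_ADDRESS scratch_buffer,
                D3D12_GPU_VIRTUAL_ADDRESS dest_buffer)
{
    // Describe the input geometry in a documented, app-visible format...
    D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
    geom.Type = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
    geom.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
    geom.Triangles.VertexBuffer.StartAddress  = vertex_buffer;
    geom.Triangles.VertexBuffer.StrideInBytes = 3 * sizeof(float);
    geom.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
    geom.Triangles.VertexCount  = vertex_count;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type        = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    inputs.Flags       = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE;
    inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.NumDescs    = 1;
    inputs.pGeometryDescs = &geom;

    // ...and ask the driver to build its acceleration structure into dest_buffer.
    // What ends up in that buffer is vendor-specific and undocumented; the app can
    // only use it by binding it for DispatchRays, never by reading it back.
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC build = {};
    build.Inputs = inputs;
    build.ScratchAccelerationStructureData = scratch_buffer;
    build.DestAccelerationStructureData    = dest_buffer;
    cmd_list->BuildRaytracingAccelerationStructure(&build, 0, nullptr);
}
```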

I can also foresee a world in which devs preemptively adjust their design decisions to fit the kind of workload that the consoles' specific, black-box way of constructing their BVH happens to favor.
It is not certain that next-gen consoles will get RTRT in the first place - this could only happen if it fits in the mid-range ~$200 GPUs typical for consoles. For all we know, Nvidia's implementation is not coming to consoles (not at the GeForce RTX 20xx price point, anyway), and AMD is seemingly taking a different, software-based approach with 'Radeon Rays'.

By the time hardware RTRT trickles down to mid-range, there will be DXR tier 2.0 with more shader types and processing functions.

Why would a dev use a physics system that relies on a BVH but not ray casting when they have RT acceleration hardware at hand? Are there even any physics systems that work like that? That's like asking: "what if a game makes use of a rendering system based on quads or some primitive other than triangles? Rasterization hardware is obviously a waste of silicon!" Game tech is built mostly around the hardware it's intended to run on.
Exactly. DXR 1.0 lets hardware developers choose a subdivision algorithm which is best for available memory/cache bandwidth and their specific software or fixed-function implementation of ray-intersection search.

For your game engine needs, you can ship pre-computed BSP/BVH structures with your game assets and use whatever algorithm would be most efficient or full-featured for your particular task.
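If it helps picture it, this is the kind of engine-owned structure being described: a small, hypothetical BVH node layout you precompute offline and traverse yourself (for collision queries, culling, and so on), entirely separate from whatever opaque structure the driver builds for DXR.

```cpp
// Generic engine-side sketch (names hypothetical): a precomputed BVH you own and
// traverse yourself, independent of any driver-built acceleration structure.
#include <algorithm>
#include <cstdint>

struct AABB { float min[3], max[3]; };

struct BvhNode {
    AABB    bounds;
    int32_t left  = -1;                       // index of first child, or -1 if leaf
    int32_t right = -1;                       // index of second child
    int32_t first_prim = 0, prim_count = 0;   // leaf payload
};

// Slab test: does the ray segment [t_min, t_max] hit the box?
bool ray_hits_aabb(const float origin[3], const float inv_dir[3],
                   const AABB& box, float t_min, float t_max) {
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (box.min[axis] - origin[axis]) * inv_dir[axis];
        float t1 = (box.max[axis] - origin[axis]) * inv_dir[axis];
        if (inv_dir[axis] < 0.0f) std::swap(t0, t1);
        t_min = std::max(t_min, t0);
        t_max = std::min(t_max, t1);
        if (t_max < t_min) return false;
    }
    return true;
}
```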
 
Does SMT require special hardware / a dedicated part of the chip?
It requires extensions to the instruction decoder and micro-op scheduler.

Superscalar ALUs are built to have several blocks operating in parallel - such as memory access, register file access, integer operations, floating-point operations, whatever. Each complex 'macro' command - i.e. x86/x64 instruction - is implemented with a sequence of proprietary VLIW microcode which runs very simple operations for each block, the micro-ops.

Thus it is possible to implement several front-ends in each ALU (i.e. several instruction decoders and schedulers) which share execution blocks (the back-end), and present these front-ends as separate cores. But this only pays off when the threads have orthogonal workloads which consume different blocks, so that re-ordering and parallelization of micro-ops is possible. Thermal restrictions would also apply, as modern CPUs very actively restrict their workloads to remain within their specified TDP level.
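A toy illustration of that orthogonality point, with completely made-up structures (nothing like real issue logic): two hardware threads sharing execution ports only both make progress every cycle when they want different ports; if they compete for the same port they simply take turns and SMT gains little.

```cpp
#include <cstdio>
#include <queue>

enum Port { INT_PORT, FP_PORT, NUM_PORTS };

int main() {
    // Thread 0 is integer-heavy, thread 1 is FP-heavy: orthogonal workloads.
    std::queue<Port> thread_uops[2];
    for (int i = 0; i < 8; ++i) { thread_uops[0].push(INT_PORT); thread_uops[1].push(FP_PORT); }

    int cycles = 0;
    while (!thread_uops[0].empty() || !thread_uops[1].empty()) {
        bool port_busy[NUM_PORTS] = {false, false};
        for (int t = 0; t < 2; ++t) {              // each thread may issue one micro-op per cycle
            if (thread_uops[t].empty()) continue;
            Port wanted = thread_uops[t].front();
            if (!port_busy[wanted]) { port_busy[wanted] = true; thread_uops[t].pop(); }
        }
        ++cycles;
    }
    // Prints 8: both ports stay busy. Had both threads wanted the same port, it would be 16.
    std::printf("Finished 16 micro-ops in %d cycles\n", cycles);
}
```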

Would it be correct to assume 4 FP32 operations per shader core? For example, a 3640 SP device @1.172 GHz would give you 17TF?
I don't think so. The "simple" ALU only supports a subset of the "full" instruction set - but the register width is fixed and FP32 most likely works by chaining two FP16 blocks in the ALU.
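On the arithmetic in the question: peak throughput is normally quoted as SPs × 2 FLOPs (one FMA) × clock, so 4 ops per SP per clock would require something like dual-issued or packed FP32, which is an assumption rather than a given. A quick check with the numbers from the post:

```cpp
#include <cstdio>

int main() {
    const double sps = 3640, clock_ghz = 1.172;
    // Conventional counting: one fused multiply-add = 2 FLOPs per SP per clock.
    std::printf("FMA only:  %.2f TFLOPS\n", sps * 2 * clock_ghz / 1000.0);  // ~8.53
    // The 17 TF figure only appears if each SP somehow retired 4 FP32 ops per clock.
    std::printf("4 ops/clk: %.2f TFLOPS\n", sps * 4 * clock_ghz / 1000.0);  // ~17.06
}
```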
 
Navi 12 will have 40 CUs, no GCN.
Super SIMD then?
These could have 65/70 or 130/140 shader blocks per CU, for a total of 2600/2800 or 5200/5600 shader processors - if these are indeed 'simple' cores, you can put more of them on the same area...

As mentioned in the article, AMD would have "freely" developed its next-gen microarch in exchange for not allowing MS to use it.
I'm not sure how you would derive the above from the actual citation:

"The Navi 12 is not going to be the GPU that gets featured in PS5, its a derivative of the actual Navi die and has been created specifically so AMD can get it to market for the PC audience primarily."

Given the generation, TDP, size, and clock of the ARM CPU in the Switch, and the fact that it's not just running 40-year-old platformers, it's enough to show that a console part wouldn't be that far away.
Yes, in comparison to the NES Classic/SNES Classic, the Switch is a full-featured console... a lot of innovation here, almost like the Sony PlayStation® of 1995 :devilish:
 
It largely depends on the quality of HRT titles. If it's massive and it generates demand, then everyone should be offering it at all price points. But then again, how RT acceleration is handled is the big question.

Having only seen Nvidia's implementation, we are anchoring our opinion of what RT hardware looks like.

Demand is only one part of the equation.

The other, and arguably far more important, part of the equation is how much it costs in terms of silicon real estate to enable performant RT that will blow people away.

Right now, with NV's RT hardware the lower bound is a 2070 that offers performance-impacted RT that some people find impressive and others less so. It's certainly doing RT faster than non-RT hardware, but is it doing it fast enough for games, at a quality level that is an overall improvement to the game rather than just an improvement in specific areas at the expense of others?

In 2 years, will acceptable quality and performance for games be available in a hardware component (not even a console, just a component of a console) for under 500 USD? Again, that's just for the hardware component, not even talking about a full-blown console at this point.

And additionally will it be flexible enough that developers can adapt and use it in their games, regardless of the types of games they are making?

I'd argue that it isn't very likely. At some point in the future we all hope RT will be viable in a console, but next generation is not likely to be it for reasons of cost versus performance versus quality versus flexibility.

More than happy to be proven wrong, of course. :) But I just don't see it, at this point in time.

Regards,
SB
 
It does all sound like PS5's custom Navi is based on KUMA, the new uArch, so there's no more limitation to the 64 CU count of GCN, which is good. Does this not change the expectation of the reasonable teraflops we might end up getting? 14-15 TF used to be the higher end, but with KUMA this could be well within reach, I hope.
 
I don’t think it would change things. If the goal is to increase compute power by going wide, everything else in the pipeline needs to be equally increased or you’ll hit bottlenecks elsewhere.

If the goal is to increase compute power using clockspeed, the entire pipeline moves together but the amount of power the heat goes up.

There will be an optimum ratio of CU to clockspeed to generate maximum power for the least amount of silicon and power but a 64CU limitation was not likely to be a factor there.
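As a back-of-envelope illustration of that trade-off, with entirely made-up voltages and a GCN-style 64-lane, 2-FLOP-per-lane CU assumed: raising the clock also raises the required voltage, so dynamic power grows roughly with f * V^2 per CU, and the wider, slower configuration tends to win on performance per watt at similar throughput.

```cpp
#include <cstdio>

int main() {
    const double flops_per_cu_per_clk = 128.0;  // 64 lanes * 2 (FMA) - an assumption, not a spec
    struct Config { int cus; double ghz; double volts; } configs[] = {
        {64, 1.4, 0.95},   // wider and slower (illustrative voltage)
        {40, 2.0, 1.15},   // narrower and faster (assumed to need more voltage)
    };
    for (const auto& c : configs) {
        double tflops    = c.cus * flops_per_cu_per_clk * c.ghz / 1000.0;
        double rel_power = c.cus * c.ghz * c.volts * c.volts;  // ~ C * f * V^2, arbitrary units
        std::printf("%2d CUs @ %.1f GHz: %5.2f TF, perf/W (arbitrary) = %.3f\n",
                    c.cus, c.ghz, tflops, tflops / rel_power);
    }
}
```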
 
I don’t think it would change things. If the goal is to increase compute power by going wide, everything else in the pipeline needs to be equally increased or you’ll hit bottlenecks elsewhere.

If the goal is to increase compute power using clockspeed, the entire pipeline moves together but the amount of power the heat goes up.

There will be an optimum ratio of CU to clockspeed to generate maximum power for the least amount of silicon and power but a 64CU limitation was not likely to be a factor there.
But isn't this where 7nm comes in, presumably? I'm hoping the efficiency of the new process node would negate most of that, to a reasonable degree of course. Also, we really don't know what's in the pipeline exactly, do we? Would 24 GB of GDDR6 be ample already?
 
It does all sound like PS5's custom Navi is based on KUMA, the new uArch, so there's no more limitation to the 64 CU count of GCN, which is good. Does this not change the expectation of the reasonable teraflops we might end up getting? 14-15 TF used to be the higher end, but with KUMA this could be well within reach, I hope.

It doesn't sound anything like that... We have no idea (and especially wcctech has no idea) what Navi will be, nor whether the 64 CU limitation will still be there.
 
Well, it sure as hell ain't gonna be Navi 12, which is only 40 CUs, so what else is out there? Custom Navi 20 it is.
 
For a dreamer it may sound like that. In truth we don't know anything; if history repeats itself, there will be some midrange hardware for the time in there.
 