AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

TF numbers alone are not enough to extrapolate performance between NVIDIA and AMD. NVIDIA delivers higher performance than their TF number would suggest, because they compensate with better polygon throughput, pixel fillrate, texturing and higher effective memory bandwidth. They also have more advanced tiled rendering than AMD.
Agreed.

Though for consoles, this is an interesting take on AMD: instead of having specialized hardware like tensor cores to accelerate tensor operations, they're beefing up their compute units to support a variety of tasks. There could be a variety of reasons not to add dedicated tensor hardware; unfortunately I wouldn't know. But if this is the direction AMD wants to take to tackle deep learning, we could be getting some insight into what to expect from our next-generation consoles.
 
1.25x performance for the same power. This is on TSMC, not AMD. Interconnect resistivity likely rearing its ugly head. Rumors were that Apple weren’t pleased either.



Wouldn’t this give atrocious latency?

Does this likely mean that the next GPU will be about 20-25% better for the same watts?
 
ToTTenTranz said:
Am I right in thinking that some BVH accelerating hardware would take up little of the MI60's 331mm2 die, but would put it in the same realm as the RTX2070 and 2080?

Yes, though RTRT performance would still be terrible.

Do you mean that in the sense that RTRT performance, in general, even on the RTX2080, is terrible?
 
Does this likely mean that the next GPU will be about 20-25% better for the same watts?
Not necessarily because Vega 20 has other power-consuming features compared to Vega 10.

Though I suggest you follow up on those questions in this thread.


Do you mean that in the sense that RTRT performance, in general, even on the RTX2080, is terrible?
Yes.
All examples we've seen from nvidia themselves put the hybrid raster+RT demos running at 1080p 60FPS on the RTX 2080 Ti.
RTX 2080 and RTX 2070 will perform below that.
 
Agreed.

Though for consoles, this is an interesting take on AMD: instead of having specialized hardware like tensor cores to accelerate tensor operations, they're beefing up their compute units to support a variety of tasks. There could be a variety of reasons not to add dedicated tensor hardware; unfortunately I wouldn't know. But if this is the direction AMD wants to take to tackle deep learning, we could be getting some insight into what to expect from our next-generation consoles.

Oh is that what's going on? If so I guess they're doing what I mentioned a few pages back

https://forum.beyond3d.com/threads/...chnical-spin-2018.60604/page-169#post-2048236

Would it be possible for, say, AMD to have a new type of shader core with improved ML capability in each of them (but individually weak), instead of Nvidia's three-part design of separate raster, tensor and RT cores?

I hope that's what it means, because I'd be kinda happy about it, even though I don't know much about these things.

Not necessarily because Vega 20 has other power-consuming features compared to Vega 10.

Though I suggest you follow up on those questions in this thread.



Yes.
All examples we've seen from nvidia themselves put the hybrid raster+RT demos running at 1080p 60FPS on the RTX 2080 Ti.
RTX 2080 and RTX 2070 will perform below that.

Thanks.
 
Yes.
All examples we've seen from nvidia themselves put the hybrid raster+RT demos running at 1080p 60FPS on the RTX 2080 Ti.
RTX 2080 and RTX 2070 will perform below that.

Fair enough, but if the hardware's capable of RTRT in even the relatively limited capacity of the RTX2070, then at least it's in developers' hands for a generation.

If the MI60's any indication of AMD's answer to RTX, we could be looking at a fairly versatile approach, letting developers slide quite easily around on the hybrid rendering gradient.

Want to use all 14.8TF for rasterisation? Go for it. Want to use 7 for rasterisation and the rest for RT? Go for it.

Hopefully that's the case, anyway.
 
Fair enough, but if the hardware's capable of RTRT in even the relatively limited capacity of the RTX2070, then at least it's in developers' hands for a generation.

If the MI60's any indication of AMD's answer to RTX, we could be looking at a fairly versatile approach, letting developers slide quite easily around on the hybrid rendering gradient.

Want to use all 14.8TF for rasterisation? Go for it. Want to use 7 for rasterisation and the rest for RT? Go for it.

Hopefully that's the case, anyway.

Wanna use some of that for more ML/AI tasks? Go for it.

Pricing is still probably the major factor.

I guess it still comes back to the question of whether a bunch of general compute can match Nvidia's three-class system.
 
I'm not sure if pricing is inevitably an issue. Probably to begin with, but once 7nm matures, a ~330mm2 die shouldn't be too expensive.

As for whose approach is better, time will tell, but I'd put money on the traditional split of performance for Nvidia and price for AMD.
 
If the MI60's any indication of AMD's answer to RTX, we could be looking at a fairly versatile approach, letting developers slide quite easily around on the hybrid rendering gradient.
It's their answer for a flexible GPU with deep learning performance, not to be confused with ray tracing hardware.
 
Found a pic of the 2* chiplet Consumer version
[image: Slot-A Athlon cartridge]

:LOL:

Semi-seriously though, if packages are gonna be that huge, at some point doesn't it make sense to bring back CPU slots instead of giant sockets?
 
Another hypothesis is power. You want the main power and ground lines to go straight through to the chip, and the socket was designed around Zeppelin dies sitting where those chiplets are now placed.

That could be a limit on placement, as some elements, like where the IO and (to a lesser extent) the DDR lines leave the package, seem to have stayed similar within some margin of error. That margin gets larger near the IO die, which concentrates the PHY for DRAM and IO nearer the middle.
If the concern is how direct this is from the pins for power, there's a fair amount of distribution and metal used for Zen's voltage regulation that muddles things. Slide 19 from https://www.slideshare.net/AMD/isscc-2018-zeppelin-an-soc-for-multichip-architectures seems to show some power pins align with the CPUs, but a very large chunk of it would now be under the IO die, and another large chunk is concentrated on one side of the MCM.

Slides 9 and 10 show a possibly undesirable amount of latency could be involved, if AMD hasn't adjusted more of its fabric's features. Possibly, the data fabric on the CPU chiplets will be simplified by having a number of its clients moved elsewhere. I'm curious if there's potentially some special case handling possible since the coherent CPUs are known to be on one side of an IF package link, and the home agents and routing hardware are on the other.

Could someone give me more insight?
I've seen many negative comments in Anand saying the new product is 50% more dense but only 20% better in performance?
I'm not sure if it's necessarily negative, but that seems to be accurate. It's been a very long time since density scaling was linked strongly to performance or power scaling, with 90nm or 65nm being the threshold where a number of vendors got hit by significant problems.
Density governs how many transistors there can be in a chip, whereas performance in this context is more about the straight-line speed of individual circuits or pipelines. Only a subset of the overall set of transistors can take part in any local block or circuit, and in general more is not better due to each contributing some amount of delay or each increasing the length of wires required to connect them.

There are other physical conditions and architectural choices that can influence how much power, transistors, or wires come into play within a given time window. The faster a desired clock period, the more costly tradeoffs will become.
Trying to focus on parallelism can utilize the 50% increase in density without travelling as far up the steep upward curve of pushing clock circuit performance, if that option is available.

1 - Where's the L3 cache? One would think it would belong in the chiplets to reduce latency, and the new IF links don't seem to be wide enough to keep up with the bandwidth. OTOH the chiplets are tiny (way less than half of a Zeppelin) and that I/O die is huge. There's no way the I/O die has only DDR4 PHYs and IF glue; it has to have some huge cache for coherency (L4?).
It seems probable there's cache or SRAM arrays on the IO die, at least because unless AMD has changed its system topology, the home agents that maintain memory ordering and handle coherence would be there (unless AMD changed this for Infinity Fabric 2.0). How fast some of these processes can be, if they are always a link or more away and on a die not optimized for speed, may be a question mark (perhaps more for client-oriented products?).
A local L3 absorbing traffic before it traverses the IF links and contends for the centralized resources also seems worthwhile.
There's also PCIe, the PSP(s?), encryption hardware, USB and disk controllers, potentially. AMD promised expanded enterprise features, which may belong closer to the memory controllers and IO complexes. One question I have is whether this changes what happens for some of the error handling, where it used to be that errors pertaining to off-chip links would have a nearby CPU and its local memory to handle them. If there's a link problem with a CPU chiplet, is there a resource on the IO die that can step in?
 
Since amd will need a new i/o die for ryzen 3000 with 2 memory channels, wouldn't it be economically wiser to include a gpu in it (making all ryzen 3000 apus) using the same 7nm chiplets, instead of designing both a new i/o die and a new apu chiplet (either on 7nm or 12/14nm)?
 
The full quote is:
Estimated increase in instructions per cycle (IPC) is based on AMD internal testing for “Zen 2” across microbenchmarks, measured at 4.53 IPC for DKERN +RSA compared to prior “Zen 1” generation CPU (measured at 3.5 IPC for DKERN + RSA) using combined floating point and integer benchmarks.
That's quite specific as far as benchmarking goes :) Have any other CPU benchmarks been revealed?
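For reference, the two quoted IPC figures can be turned into the percentage uplift AMD is implying. A quick sanity check (the 3.5 and 4.53 values are straight from the footnote above):

```python
# IPC figures quoted from AMD's Zen 2 footnote (DKERN + RSA microbenchmarks)
zen1_ipc = 3.5
zen2_ipc = 4.53

uplift_pct = (zen2_ipc / zen1_ipc - 1) * 100
print(f"Estimated IPC uplift: {uplift_pct:.1f}%")  # ~29.4%
```

So the footnote amounts to a claimed ~29% IPC gain on that one workload mix, which is why it matters that it's such a specific benchmark.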
 
So what does this imply for consumer level chips?
1 chiplet & a smaller IO chip? 2 chiplets & same IO chip? APUs only?
 
Wouldn’t this give atrocious latency?

DRAM latency would become worse than it currently is, because of both an extra hop to the IO die and having to traverse the L4 tags before sending the request. However, the increase of L3 size, and having the L4 reply from SRAM, would reduce latencies so long as you hit the caches. I think it would be a win for overall latency, but not nearly as big a one as the size of the caches makes it seem.
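That tradeoff can be sketched with a simple expected-latency model. All hit rates and latencies below are illustrative assumptions for the sketch, not measured Zen 2 numbers:

```python
# Illustrative average-latency model for the chiplet + IO-die layout.
# Every number here is an assumption chosen for the sketch, not a measured value.
def avg_latency(l3_hit, l4_hit, l3_ns, l4_ns, dram_ns):
    """Expected memory latency given hierarchical hit rates (ns)."""
    l3_miss = 1 - l3_hit
    l4_miss = 1 - l4_hit
    return (l3_hit * l3_ns
            + l3_miss * l4_hit * l4_ns
            + l3_miss * l4_miss * dram_ns)

# Monolithic-style baseline: no L4, shorter DRAM path
base = avg_latency(l3_hit=0.5, l4_hit=0.0, l3_ns=10, l4_ns=0, dram_ns=75)
# Chiplet layout: bigger L3 (higher hit rate), L4 on the IO die, longer DRAM path
chiplet = avg_latency(l3_hit=0.6, l4_hit=0.4, l3_ns=12, l4_ns=40, dram_ns=95)
print(base, chiplet)  # 42.5 36.4
```

With these assumed numbers the chiplet layout wins on average latency, but only modestly, which matches the point above: the caches help, just not as much as their sheer size suggests.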

Semi-seriously though, if packages are gonna be that huge, at some point doesn't it make sense to bring back CPU slots instead of giant sockets?

There is no reasonable way to put enough pins on a slot/card system to connect to DRAM. Should memory get integrated on package, as some sort of future HBM derivative, a slotted CPU would become feasible.
 
So what does this imply for consumer level chips?
1 chiplet & a smaller IO chip? 2 chiplets & same IO chip? APUs only?

Maybe they would just leave the Ryzen topology unchanged, just plugging in the newer cores, newer IO, newer fabric?
Likely takes more effort, but is it substantially more?
 
Maybe they would just leave the Ryzen topology unchanged, just plugging in the newer cores, newer IO, newer fabric?
Likely takes more effort, but is it substantially more?

That would imply 3 dies on 7nm.

I personally strongly feel that a large part driving the shift to chiplets is that AMD wants to minimize the amount of different dies manufactured at the top-end fabs. I think that consumer Ryzens will be built using those same 7nm chiplets, just with different companion dies.

Threadripper will use the same IO die as EPYCs, allowing harvesting of dies with faults that would make them unusable in any EPYC product (mostly, memory channel faults).

Ryzen APUs will use a separate GPU/memory controller/IO die, with a single infinity fabric link connecting to the cpu, 2 DRAM channels. This same IO die can do double duty as a low-end discrete gpu -- infinity fabric links can be reconfigured as PCI-E.

Ryzen desktop will either use the APU die (if it has a spare infinity fabric link), or its own with just the DRAM controllers and other io, with 2 chiplets.
 
Are there any specifics on infinity fabric link bandwidth ?

Ryzen dies had 3 IF links, each 2x32 bits wide (running at 4x DRAM command speed). Given the topology of EPYC 2, each chiplet only needs one IF link, so I'd expect it to be at least twice as wide as a single link. I'd also expect the operating frequency to be decoupled from DRAM command rate (because that was never really a good idea). I.e. a 2x64-lane link running at 2GHz (8GT/s rate) would have 64GB/s bandwidth in each direction.
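The arithmetic behind that estimate, with the link width and transfer rate being the guesses above rather than anything AMD has confirmed:

```python
# Hypothetical IF link from the estimate above: 64 data lanes per direction,
# 2 GHz clock with 4 transfers per cycle = 8 GT/s. These are guesses, not specs.
lanes_per_direction = 64      # bits transferred per cycle, per direction
transfer_rate_gt = 8          # gigatransfers per second (2 GHz x 4)

# 64 bits/transfer * 8 GT/s = 512 Gbit/s = 64 GB/s per direction
bandwidth_gb_s = lanes_per_direction * transfer_rate_gt / 8
print(f"{bandwidth_gb_s:.0f} GB/s per direction")  # 64 GB/s per direction
```

For comparison, that would roughly match the bandwidth of a dual-channel DDR4-3200 setup (~51 GB/s), which is why one fat link per chiplet seems plausible.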

Cheers
 