AMD Ryzen CPU Architecture for 2017

Would your assessment of potential usage in games be different if AVX-512 were universally supported and unified into a single version? Or do you think it's just not useful in games?
AVX-512 is awesome. 16 wide (32b) float/int vectors. Scatter + gather. Execution masks. AVX-512 BW adds 8bit / 16bit operations (32 wide, 64 wide). Very useful for example for string processing. AVX-512 CDI has lots of nice instructions that help in many algorithms and make autovectorization easier.

AVX-512 is a perfect fit for SPMD-style languages (ISPC). AVX had only full-width float ops; AVX2 added full-width integer ops (needed for address calculation, etc.) and gather, but still lacked scatter and execution masks. AVX-512 adds both and widens everything by 2x.

AVX-512 is also much better for autovectorization. Full-width integers, gather, and scatter allow each loop iteration to read/write data from independent locations (not limited to linear loops with no pointers/indirections). AVX-512 CDI and execution masks further help to solve autovectorization problems. AVX-512 is the first time it is actually possible to autovectorize generic loops; AVX/AVX2 was only good for trivial cases.
 
I want to bitch that an AM4 socket with an Excavator APU/CPU is still not available. This is while the Pentium G4560 is readily available and everyone recommends it even for small-time gaming and grandma use. The A8-7650K or 7670K is no less expensive and has you buying into a dead socket and DDR3.
I have lost track of when Zen/Ryzen is supposed to launch, but it seems they're delaying the launch of AM4 until the launch of Ryzen, and I'm questioning that.
 
AVX-512 is awesome. 16 wide (32b) float/int vectors. Scatter + gather. Execution masks. [...]

Is the vector width useful, or just the features?

So which would you prefer of these three hypothetical "targets", where the core and cache are designed to handle 2 loads and 1 store at the nominal vector width?

128-bit vectors with AVX-512 "features" @ 3.0 GHz
256-bit vectors with AVX-512 "features" @ 2.0 GHz
512-bit vectors with AVX-512 "features" @ 1.5 GHz

cheers
 
The features, because you get a "GPU light" with less latency (to hide). If you need wide width, you pick a full-fledged GPU.
 
AVX-512 is awesome. 16 wide (32b) float/int vectors. Scatter + gather. Execution masks. [...]


Indeed AVX-512 is great. Not as good as ARM's latest SIMD supercomputer extension (which abstracts away the vector length), but for Intel it's one of their better efforts.
 
From Bits 'n Chips: "Some thought on official Global Foundries PDF on the 14nm FinFET process."

Low-power APUs: a 60% power reduction means that we can have a 15W APU with double the cores and SPs of 28nm. We could have a 4m/8c Excavator and 1024 SPs at the same clock. But newer architectures are more power efficient (Ryzen and Vega), so a 15W APU could have 4 Ryzen cores in the 3GHz+ range and 1024 SPs in the 1GHz range, according to this graph.

Talking of high-core-count servers, we can expect that 8 Zen cores at 3GHz can draw just 30-40W. This means that the 32c 180W Naples could have a base clock above 3GHz. Finally, this graph also lets us estimate the final Ryzen clocks: 4 Ryzen cores at 2.7GHz should draw at most 15W, so 8 Ryzen cores at 2.7GHz should draw at most 30W. Assuming cubic scaling of power with frequency, at 95W we can have (95/30)^(1/3) = 1.47 times 2.7GHz, which is about 4GHz.

These estimates are only ballpark, but we are confident they are quite realistic.
Do these numbers seem reasonable? > 3 GHz base for 32 cores is much higher than both the 1.4 GHz of the "AMD Corporation Diesel" Geekbench result (although I expect this to be lower than the final product's clock anyway) and the clock speeds of Broadwell-EP chips (which have fewer cores).
 
Do these numbers seem reasonable? > 3 GHz base for 32 cores is much higher than both the 1.4 GHz of the "AMD Corporation Diesel" Geekbench result (although I expect this to be lower than the final product's clock anyway) and the clock speeds of Broadwell-EP chips (which have fewer cores).
Depends on TDP. @ 250W, sure; maybe not so much @ 180W.

Edit: you could get 2.8GHz-base, 16-core, 140W Piledriver Opterons, so maybe a 32-core 3GHz base in 180W is possible?
 
Indeed AVX-512 is great. Not as good as ARM's latest SIMD supercomputer extension (which abstracts away the vector length), but for Intel it's one of their better efforts.
Hard to say yet, because full SVE specs are not available. AVX-512 has some really nice instructions.

This is the lowest-level presentation I found:
https://community.arm.com/cfs-file/...ARMv8_2D00_A-SVE-technology-Hot-Chips-v12.pdf

Really good stuff. They have focused on similar things as Intel has with AVX-512. Both autovectorization and SPMD-style execution have clearly been top priorities: full-width float + int, gather, scatter, predicates (execution masks), and loop/branch based on masks. The variable-width (128 to 2048 bits) instruction set will make it forward compatible. Both hardware and software are fully scalable.

Intel, on the other hand, can't make AVX-512 a universal instruction set. AVX-512 units take so much die space. Xeon D (dense server) will likely not get it. Same for 4/6 core consumer CPUs (die space needed for the GPU). It would be better to have a scalable instruction set and simply equip some CPU models with narrow (128/256 bit) hardware units. All software would work on all CPU models.
 
Maybe an opportunity for AMD to pull another x64 moment: make a variable-execution-length AVX-512 and call it AVX3. Get it into the consoles (PS5, Xbox 2). I wonder if they would legally be able to just copy SVE and put it on x86?
 
AVX-512 units take so much die space. Xeon D (dense server) will likely not get it. Same for 4/6 core consumer CPUs (die space needed for the GPU). It would be better to have a scalable instruction set and simply equip some CPU models with narrow (128/256 bit) hardware units. All software would work on all CPU models.

In various scenarios, Intel's AVX2 chips already operate at half-width in the common case. It takes a rather coarse mode switch to ungate the upper half of the units upon detecting their being used, and there's a warm-up period where the chip operates on the 256-bit instructions without the units being fully active.
That's not the desired steady-state, but that's Intel's implementation choice rather than a requirement that the units match the instructions being fetched.

Maybe an opportunity for AMD to pull another x64 moment: make a variable-execution-length AVX-512 and call it AVX3. Get it into the consoles (PS5, Xbox 2). I wonder if they would legally be able to just copy SVE and put it on x86?
The first question I have is whether AMD's agreement needs to be refreshed with Intel. The most recent version's coverage sunset a few years ago, prior to AVX-512.

Another is that copying SVE into AVX, past the question of whether AMD can safely work with the most recent extensions by Intel, may run into whether ARM would appreciate a licensee copying its architecture into a competing ISA. It's possible that ARM may not like that. There may be various patents or IP inherent to being a licensee that would not extend to a non-ARM implementation.

SVE was given a large dedicated carve-out of the ISA encoding space by ARM, which is more straightforward in a new architecture that hasn't accumulated so many old extensions or hardware optimized for non-SVE modes. Something SVE-like could be added, though what it might cost in code-density (yet another prefix?) or odd implementation choices isn't clear. Worse would be if AMD does something like this and Intel's rumored spring-cleaning of its vector extensions portends another incompatible extension set in the freed space.
 
If you remember when the i3-6100 was "accidentally" overclockable: it was stuck in a low-performance AVX2 mode, even losing out to SSE code if I remember correctly (or SSE winning over AVX1, which was borked anyway).
It was less than half the full performance. But certainly decent if it lasts only a very short time, lost in the noise of your clock-speed transitions.
 
In various scenarios, Intel's AVX2 chips already operate at half-width in the common case. It takes a rather coarse mode switch to ungate the upper half of the units upon detecting their being used, and there's a warm-up period where the chip operates on the 256-bit instructions without the units being fully active.
That's not the desired steady-state, but that's Intel's implementation choice rather than a requirement that the units match the instructions being fetched.
Yes, the upper half of the SIMD hardware is powered down when processing SSE code (or floats). However, the full-width SIMD still takes transistor space, making the cores larger and increasing the production cost. The wider SIMD hardware also has implications for L1 cache bandwidth and the register files, among other things. Hardware designed to execute 256-bit vectors isn't perfectly optimal for 128-bit vectors.
 
Hardware designed to execute 256-bit vectors isn't perfectly optimal for 128-bit vectors.

That is true, but there isn't a reason other than an implementation choice that the back-end is as wide as the width of the instruction. AMD's cores have a history of cracking ops, and Intel at various times has used half-width hardware for operations. The early Atom cores had half-throughput for non-integer SIMD. The ISA doesn't mandate the physical width of the hardware.
 
Recent rumors of Ryzen being tied to 4- or 8-core options only are false. We at io-tech can now confirm that cores can be individually disabled, and the L3 of each CCX can be full (8MB), half (4MB), or disabled. The only limitation is that each CCX must have an identical configuration, meaning if you have 2 CCXs and you disable 1 core from one of them, you have to disable one from the other CCX too.

So in theory it could do 2-, 4-, 6- and 8-core models with 2 CCXs, and 1-, 2-, 3- and 4-core models with 1 CCX.


https://www.io-tech.fi/uutinen/6-ytiminen-ryzen-prosessori-teknisesti-mahdollista-toteuttaa/
 
Yes, people either:
1. think AMD is stupid, or
2. don't understand the market Zen sells into.

AMD can't go to the server market with only 32- and 16-core options, but 32/28/24/20/16/14/12/10/8 (aggregate of both SP3 & SP4) is perfectly fine (Intel goes down by 2s in v4).
 
AMD Ryzen CPUs to support Windows 7 with drivers
AMD will make Ryzen fully compatible with Windows 7, which Microsoft still supports for another three years. The info was shared by AMD at a partner event. Basically, this opens up all processor instruction sets for Windows 7 (as long as they are compatible with the OS, of course), similar to what Kaby Lake had with HEVC / 10-bit decoding driver support. Microsoft is not supporting Intel's latest Kaby Lake range of CPUs on any OS below Windows 10, kind of weird, right? We're not sure how that will play out in relation to Ryzen, as Microsoft clearly stated it would not support new processors on older operating systems.
http://www.guru3d.com/news-story/amd-ryzen-cpus-to-support-windows-7-with-drivers.html

Source article:
https://www.computerbase.de/2017-02/amd-ryzen-am4-treiber-windows-7/
 