Next Generation Hardware Speculation with a Technical Spin [2018]

Status
Not open for further replies.
What is CIVI?

Someone's joke of CM? Once upon a time at my work, someone responsible for setting up new accounts misread "ADAM" as "A D A I V I". Since then we've always forced Adam to use ADAIVI.
 
A webpage that has the flops/clock of various CPUs - worth a look if you are interested:
https://stackoverflow.com/questions...le-for-sandy-bridge-and-haswell-sse2-avx-avx2

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

  • 8 DP FLOPs/cycle: 4-wide FMA
  • 16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen:

  • 8 DP FLOPs/cycle: 4-wide FMA
  • 16 SP FLOPs/cycle: 8-wide FMA
AMD Jaguar:

  • 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
  • 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
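
For anyone wondering how the odd-looking Jaguar figures fall out of the issue rates in that list, here is a minimal sketch of the arithmetic. The function name and the "cycles between issues" framing are my own illustration, not something taken from the linked page.

```python
from fractions import Fraction

def flops_per_cycle(lanes, add_interval, mul_interval):
    """Peak FLOPs/cycle for a core with separate add and multiply pipes.

    lanes        -- elements per AVX instruction (4 for DP, 8 for SP)
    add_interval -- cycles between successive AVX additions
    mul_interval -- cycles between successive AVX multiplications
    """
    return Fraction(lanes, add_interval) + Fraction(lanes, mul_interval)

# Jaguar DP: 4-wide add every other cycle + 4-wide multiply every four cycles
jaguar_dp_per_cycle = flops_per_cycle(4, 2, 4)
# Jaguar SP: 8-wide add every other cycle + 8-wide multiply every other cycle
jaguar_sp_per_cycle = flops_per_cycle(8, 2, 2)

print(jaguar_dp_per_cycle, jaguar_sp_per_cycle)
```

The DP case is 2 FLOPs/cycle from additions plus 1 from multiplications, which is where the 3 comes from.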

So where does that put an 8-core Jaguar? An 8-core Jaguar must be very efficient if it can outperform an FX-8350 at 4 GHz while running at 2 GHz?
 
So those numbers for Ryzen and Jaguar are per core, while the Bulldozer numbers are per module. Does Jaguar really have a performance penalty for DP? Don't Jaguar and Bulldozer both have the same FMAC, except that Bulldozer has one per module and Jaguar has one per core (or four per four-core module)?

So where does that put an 8-core Jaguar? An 8-core Jaguar must be very efficient if it can outperform an FX-8350 at 4 GHz while running at 2 GHz?
According to those numbers, an 8-core Jaguar would do 24 DP FLOPs per cycle or 64 SP FLOPs per cycle, while an 8-core Bulldozer would be at 32 DP FLOPs or 64 SP FLOPs per cycle. I was under the impression that each Jaguar core used the same FPU as each Bulldozer module, but apparently not. Still, Bulldozer has a pretty long pipeline, which can hold back IPC.
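
A quick sketch of how those chip-level totals are reached, in case anyone wants to plug in other core counts. The per-unit figures are the ones quoted from the linked page; the dictionary layout and function name are just illustrative.

```python
# Per-unit peak FLOPs/cycle as quoted in the thread.
# Jaguar is counted per core, the Bulldozer family per two-core module.
PEAK = {
    "jaguar_core": {"dp": 3, "sp": 8},
    "bulldozer_module": {"dp": 8, "sp": 16},
}

def chip_flops_per_cycle(unit, count):
    """Scale a per-unit peak to a whole chip with `count` of those units."""
    return {precision: rate * count for precision, rate in PEAK[unit].items()}

jaguar_8c = chip_flops_per_cycle("jaguar_core", 8)
bulldozer_8c = chip_flops_per_cycle("bulldozer_module", 4)  # 8 cores = 4 modules

print("8-core Jaguar:   ", jaguar_8c)
print("8-core Bulldozer:", bulldozer_8c)
```

These are per-cycle peaks only, so they say nothing yet about the clock speed difference between the two chips.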
 
What is CIVI?

It's likely a combination of CI and VI. Sea Islands and Volcanic Islands. Since there was already a Southern Islands, SI was taken.

So those numbers for Ryzen and Jaguar are per core, while the Bulldozer numbers are per module. Does Jaguar really have a performance penalty for DP?
For multiplication, double precision takes an extra iteration on Jaguar that blocks additional multiplications. The extra gap in cycles reduces throughput further than the half rate expected from going to double precision.

Don't Jaguar and Bulldozer both have the same FMAC, except that Bulldozer has one per module and Jaguar has one per core (or four per four-core module)?
The architectures are different. For one, Jaguar doesn't have an FMAC and it lacks a fair number of extensions supported by the Bulldozer line. Bulldozer also has a higher priority for double-precision, while Jaguar saved hardware by reducing throughput for that data type.

According to those numbers, an 8-core Jaguar would do 24 DP FLOPs per cycle or 64 SP FLOPs per cycle, while an 8-core Bulldozer would be at 32 DP FLOPs or 64 SP FLOPs per cycle. I was under the impression that each Jaguar core used the same FPU as each Bulldozer module, but apparently not. Still, Bulldozer has a pretty long pipeline, which can hold back IPC.
The Bulldozer module would presumably not be running at the same clock as the Jaguar one, and likely would target something close to twice the clock speed while only having half as many cores.
Per leaks about the early PS4 architecture, Sony almost decided on a 2-module Steamroller APU. Throughput would have been generally equivalent, but single-threaded performance would have favored the Steamroller one. By that revision of Bulldozer, some notable shortcomings relative to a Jaguar implementation would have been improved (wider decoders, better branch prediction, etc).
Whether power consumption, Steamroller's dependence on GlobalFoundries, or some other factor put Sony off is unclear.
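
To put rough numbers on "generally equivalent", here is a back-of-the-envelope sketch. It uses the per-module and per-core FLOPs/cycle figures quoted earlier in the thread and the 1.6 GHz console Jaguar clock. The Steamroller clock is an assumption (exactly twice the Jaguar clock), not a leaked figure, and these are theoretical peaks rather than measured performance.

```python
def peak_gflops(units, flops_per_cycle, clock_mhz):
    """Theoretical peak in GFLOPS: units x FLOPs/cycle x clock."""
    return units * flops_per_cycle * clock_mhz / 1000

JAGUAR_MHZ = 1600                  # console Jaguar clock mentioned in the thread
STEAMROLLER_MHZ = 2 * JAGUAR_MHZ   # ASSUMPTION: "twice the clock", not a leaked spec

# 8 Jaguar cores: 8 SP / 3 DP FLOPs per cycle per core
jaguar_sp = peak_gflops(8, 8, JAGUAR_MHZ)
jaguar_dp = peak_gflops(8, 3, JAGUAR_MHZ)

# 2 Steamroller modules (4 cores): 16 SP / 8 DP FLOPs per cycle per module
steamroller_sp = peak_gflops(2, 16, STEAMROLLER_MHZ)
steamroller_dp = peak_gflops(2, 8, STEAMROLLER_MHZ)

print(f"SP: Jaguar {jaguar_sp:.1f} vs Steamroller {steamroller_sp:.1f} GFLOPS")
print(f"DP: Jaguar {jaguar_dp:.1f} vs Steamroller {steamroller_dp:.1f} GFLOPS")
```

Under that assumption the SP peaks come out identical, and the hypothetical Steamroller part only pulls ahead on DP, which matters less for a console.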
 
The architectures are different. For one, Jaguar doesn't have an FMAC and it lacks a fair number of extensions supported by the Bulldozer line. Bulldozer also has a higher priority for double-precision, while Jaguar saved hardware by reducing throughput for that data type.
Ahhh... I'd read that Jaguar's FPUs were double or two-way 128-bit FPUs, which I equated with the FMAC in Bulldozer. So is Jaguar 128-bit per pipe with a performance penalty to combine 2 pipes into DP, and Bulldozer is 2x128-bit per module (2 cores) with no performance penalty for DP?
 
@3dilettante
So an FX-8350 at its stock speed is faster than the 8-core Jaguar found in consoles?
How do they compare clock for clock?

The FX8350 is a 220W (edit: correction--that's the 9590, 125W for Vishera) processor running from 4 to 4.2 GHz versus the 1.6 GHz Jaguar cores.
For non FPU workloads, Vishera had as many cores as the PS4 running over twice as fast. That's vastly better multithreaded throughput and single-threaded performance.
The shared FPUs made it so that there were 4 FPUs, but they were individually much more heavyweight than Jaguar. The exact mixes each design favored do not align all the time (FMA can be better generally, but in a few spots worse versus separate MUL and ADD), but Vishera's FPU could hit the same peak numbers as two Jaguar FPUs and supported a more robust set of shuffle and vector integer operations, even without considering it was running over twice as fast.

Then there's the data paths and cache subsystem, which were wider and faster versus the power-conscious Jaguar.

However, Vishera was a vastly bigger investment in terms of power and silicon, so being much faster doesn't mean it was necessarily an order of magnitude faster than the more modest Jaguar.

Overall, the per-clock and other efficiency measures would have left the first two Bulldozer core variants out. However, by the time Steamroller came about, process changes and architectural changes fixed bugs and improved per-clock performance measurably. Steamroller at half the cores but twice the clock speed of Jaguar was what Sony may have nearly gone with for Orbis.

Ahhh... I'd read that Jaguar's FPUs were double or two-way 128-bit FPUs, which I equated with the FMAC in Bulldozer. So is Jaguar 128-bit per pipe with a performance penalty to combine 2 pipes into DP, and Bulldozer is 2x128-bit per module (2 cores) with no performance penalty for DP?

Jaguar has one 128-bit pipe for addition, and one 128-bit pipe for multiplication. Miscellaneous operations and vector integer ops are distributed among those two ports.
Bulldozer has 4 FPU pipes, two for FMA. Permutes, moves, and integer ops were spread among all four.
Steamroller and later had 3 FPU pipes, where some of the integer operations and miscellaneous instructions were moved onto the remaining three.
There's a strong mixture of extensions and operations supported by Bulldozer versus Jaguar, so a lot of the other less glamorous elements of floating point workloads could be better handled by Bulldozer without fighting for cycles on Jaguar's smaller number of ports.

For DP, the Bulldozer line could get half-rate. The FMA pipe is counted as performing 2 operations and provides additional flexibility in the pure addition or pure multiplication case. It's not as strong if there's an equal mix of additions and multiplications that cannot be chained together into an FMA.
For Jaguar, the addition pipe could perform a DP addition at half rate. For DP multiplication, the unit could produce the expected half-rate per instruction, but it then had to loop back through the smaller multiplier for an extra cycle--cutting performance further.
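
As a rough illustration of the multiplier behaviour described above, the sketch below models a 256-bit AVX multiply as being split into 128-bit slices, with each DP slice needing a second pass through the multiplier. This is a deliberately simplified model with made-up parameter names, not a description of the actual pipeline, but it reproduces the per-cycle multiply rates quoted earlier in the thread.

```python
def cycles_per_avx_mul(vector_bits, pipe_bits, passes_per_slice):
    """Cycles the multiply pipe is occupied by one AVX multiply.

    The vector is split into pipe-width slices, and each slice occupies
    the multiplier for `passes_per_slice` cycles (simplified model).
    """
    slices = vector_bits // pipe_bits
    return slices * passes_per_slice

def mul_flops_per_cycle(lanes, passes_per_slice, vector_bits=256, pipe_bits=128):
    """Peak multiply FLOPs/cycle for `lanes` elements per AVX instruction."""
    return lanes / cycles_per_avx_mul(vector_bits, pipe_bits, passes_per_slice)

sp_mul = mul_flops_per_cycle(lanes=8, passes_per_slice=1)  # SP: one pass per slice
dp_mul = mul_flops_per_cycle(lanes=4, passes_per_slice=2)  # DP: extra pass per slice

print(f"SP multiply: {sp_mul} FLOPs/cycle, DP multiply: {dp_mul} FLOPs/cycle")
```

Going from SP to DP halves the lane count and then the extra pass halves the rate again, so DP multiply ends up at a quarter of the SP rate instead of the usual half.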

For the consoles, the DP case isn't all that important, however.
 
It's a shame that Microsoft at least didn't put a 'dozer core in the Xbox One X.

The size and power consumption of the last of the Bulldozer line, Excavator, had been reduced significantly, so much so that on 28nm two Excavator modules would have been only slightly bigger than a four-core Jaguar complex. They could theoretically have fit 8 Excavator cores in a silicon budget similar to what they have now. The FX 9800P also manages to run 2 modules/4 cores at a 2.7 GHz base clock in just a 15 W power budget.

It would have made the possibility of repurposing the One X as a low-end next-gen machine a bit more palatable.
 
The transistor budget would likely have been larger for Excavator. It was a later microarchitecture that took advantage of a tweaked process and higher-density libraries to cram more transistors into a similar area to Jaguar. Whether Jaguar had some level of similar compaction, or could have been redesigned in the same way is unclear, although at that point in time AMD was in no position to continue revamping the Jaguar line as it did Bulldozer.

Whether Excavator's density gains versus a more stagnant Jaguar line would have carried over to 16nm is not clear; the node jump would have reset the architectures to a similar starting point.
 
I thought the high-density libraries were more a trade-off of clock speed for a smaller size, as they switched to a different metal stack to enable greater density. You see this on the power curve: at the top end of the clock range, Steamroller is actually more efficient than Excavator, at least according to AMD's slides.

As to why it was not chosen for the consoles, most likely it was the fact that Excavator was built on GF's process and not TSMC's, as you've mentioned.
 
For the consoles, the DP case isn't all that important, however.
Is this the case for games in general? I had an old A10 laptop and a Phenom II 940, and for most games the A10 was GPU bound, but older games like Quake 1 at lower resolutions ran better on it IIRC.
 