Qualcomm SoC & ARMv8 custom core discussions

It fares better than the A9 on the memory score.

Looking at those numbers and assuming a frequency of 2.2 GHz, it has approximately the same integer IPC as the A57 but 50% more floating point IPC. Memory performance is about 2x better.
The multi-core performance is strangely low though. You'd think it would manage ~7000 if the smaller cores had just as good IPC and scaling was linear.
 
The multi-core performance is strangely low though. You'd think it would manage ~7000 if the smaller cores had just as good IPC and scaling was linear.
How did you reach 7000? Wouldn't the theoretical peak be ~6500? (1872 * 2) + (1873 * (16.5 / 22) * 2) = 3744 + 2809.5 = 6553.5
84% of purely theoretical 4-core scaling is reasonable, considering there will always be bottlenecks coming into play.
 
Assuming 2.2 GHz on the big cores and 1.6 GHz on the little:
(2162 * 2) + (2162 * (16 / 22) * 2) = 4324 + 3144 = 7468
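For what it's worth, that back-of-the-envelope arithmetic can be written out as a tiny C program; the scores and clock speeds below are just the assumed values from these posts, not measurements:

```c
/* Rough sketch of the scaling estimate above (not a benchmark, just the
 * arithmetic): two "gold" cores at the measured single-core score, plus
 * two "silver" cores scaled by their assumed clock ratio. */
#include <stdio.h>

int main(void)
{
    double single_core = 2162.0; /* assumed single-core score */
    double big_ghz     = 2.2;    /* assumed gold-cluster clock */
    double little_ghz  = 1.6;    /* assumed silver-cluster clock */

    double big_pair    = single_core * 2.0;
    double little_pair = single_core * (little_ghz / big_ghz) * 2.0;

    printf("theoretical 4-core peak: %.0f\n", big_pair + little_pair); /* ~7469 */
    return 0;
}
```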

I was working with the overall score whereas you seem to be working with just the integer score. Makes sense because you can't expect the memory score to scale linearly with core count.
 
What I really was impressed with is
Looks like the usual ES2.0 forward shading path with planar reflections and simplified lighting/materials, with little to no draw calls in this demo.
I compiled several UE4 examples with ES3.1 AEP deferred shading rendering for the Shield Tablet in March, which are much more complicated from both shading and CPU points of view.
Average framerate was about 20 FPS and almost all examples were CPU bound; the 3000 draw calls per frame in the video above are still quite heavy for a mobile CPU with ES3.1.
 
It's pretty amazing what they've achieved with such a narrow architecture. It has very good IPEU (instructions per execution unit :p)
 
Sounds like the CPU execution block might be too thin given the massive memory bandwidth available to it, but perhaps it makes sense on the whole SoC level with the much hungrier GPU and new DSP features.
 
There is an L3 physically (at least 4MB, not sure of the exact size), but it's disabled just now. Hardware bug if I had to bet, and I believe it's going back for a fourth revision (not sure if it's to fix that issue though). There's nothing different about the four Kryo cores either, physically. I think they overvolt for the performance ("gold") pair and nominal/undervolt for the power ("silver") pair. Not sure if it's the same pair every time that's setup that way, in every chip.
 
There is an L3 physically (at least 4MB, not sure of the exact size), but it's disabled just now. Hardware bug if I had to bet, and I believe it's going back for a fourth revision (not sure if it's to fix that issue though). There's nothing different about the four Kryo cores either, physically. I think they overvolt for the performance ("gold") pair and nominal/undervolt for the power ("silver") pair. Not sure if it's the same pair every time that's setup that way, in every chip.

The disclosure indicates that the slow pair has a different amount of cache available. If the cores are physically identical, then it sounds like Qualcomm is trying to spin a defect-recovery mechanism as a product feature. It would make sense in that scenario to decide which pair is gold on a case-by-case basis, based on which ones may have the worse variation or defects.
 
There is an L3 physically (at least 4MB, not sure of the exact size), but it's disabled just now. Hardware bug if I had to bet, and I believe it's going back for a fourth revision (not sure if it's to fix that issue though). There's nothing different about the four Kryo cores either, physically. I think they overvolt for the performance ("gold") pair and nominal/undervolt for the power ("silver") pair. Not sure if it's the same pair every time that's setup that way, in every chip.

According to Tom's Hardware:

"Unlike Apple’s A9, Snapdragon 820 does not use an L3 cache. Qualcomm says it considered using an L3 cache, but ultimately decided the benefits did not outweigh the additional cost in power and die space for its design."

If it's in the actual SoC but disabled due to a bug then Qualcomm isn't being very honest (or Tom's isn't properly quoting them)

Very surprised to hear that the Kryo cores don't have any implementation differences either, since it's generally being reported that the two clusters have different design optimizations. If they're the same implementation, the volt/frequency curves and leakage should be roughly the same, and there wouldn't be a good reason why one can't clock as high as the other, unless they're using different-style VRMs for the two or something.

The disclosure indicates that the slow pair has a different amount of cache available. If the cores are physically identical, then it sounds like Qualcomm is trying to spin a defect-recovery mechanism as a product feature. It would make sense in that scenario to decide which pair is gold on a case-by-case basis, based on which ones may have the worse variation or defects.

That would explain the difference in cache, but what about the difference in clock speed? I know there's going to be some natural variation in the freq/voltage curves but this much across the same chip?
 
The cores are big, so I think they're run differently for power reasons most of all, at least in the current spin. I'm half expecting a different final v4 (or an 825 or something), just not 100% sure what that might mean yet other than probably all 4 cores running equally, not heterogeneously.

There's no "if" about the L3 or the Kryo complex being homogeneous. I can see it on the chip floorplan, and the L3 is configured in the MSM8996 kernel source for all 3 versions so far.
 
If it's in the actual SoC but disabled due to a bug then Qualcomm isn't being very honest (or Tom's isn't properly quoting them)
There is overwhelming evidence of the L3; I put it into this graphic a month ago for good reason (it's been known about for more than half a year): http://images.anandtech.com/doci/9778/S820.png

Of course Qualcomm tells us there's no L3.

The current situation might be one of:
a) The L3 is buggy and things are being covered up.
b) The L3 works fine but the performance/power benefits are not there so they just decided to disable it.

In case of:
a) We'll see in the future a new SKU and they'll miraculously have a "new" working L3.
b) The new revision or production silicon won't have it on the die, thus saving space & cost.
 
If it's in the actual SoC but disabled due to a bug then Qualcomm isn't being very honest (or Tom's isn't properly quoting them)
The level of transparency for core design, even for cores like x86, is much worse than it once was. The mobile providers have been even more opaque. The fragments of reverse-engineered details for Cyclone and Kryo and the limited marketing slides wouldn't add up to one Realworldtech page, a sliver from Agner Fog's optimization guide, or the reports done for the console CPUs in the previous gen.

That would explain the difference in cache, but what about the difference in clock speed? I know there's going to be some natural variation in the freq/voltage curves but this much across the same chip?
Intra-die variability has been getting worse, although another possibility is that all four cores at max performance have serious issues with parametric yields, but hobbling half of them brings the whole chip in line.
It should be possible to run a power virus on a per-cluster basis to figure out which one edges out the other. The lack of binning opportunities would lead to lumping the defect and variation compensation into one less-than-perfect standard.
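Something along these lines is easy to sketch from userspace: pin a synthetic load to one pair of cores at a time and watch power/thermals externally. The cluster-to-CPU mapping below (0-1 vs. 2-3) is an assumption and would have to be checked against the device's actual topology:

```c
/* Minimal per-cluster load sketch for Linux: one busy process per core of
 * the selected pair, pinned with sched_setaffinity. Which logical CPUs
 * form which Kryo cluster is assumed, not known. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(int argc, char **argv)
{
    /* argv[1] picks the cluster: "0" -> CPUs 0-1, "1" -> CPUs 2-3 (assumed). */
    int cluster = (argc > 1) ? atoi(argv[1]) : 0;
    int first   = cluster * 2;

    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {
            pin_to_cpu(first + i);
            volatile unsigned long x = 0;
            for (;;)              /* spin forever; measure power/temperature outside */
                x++;
        }
    }
    pause();                      /* parent waits; kill the process group to stop */
    return 0;
}
```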

Maybe a "new" Kryo2 with all 4 cores unleashed and a functioning L3 could be released later.

What Qualcomm has done does seem rather hackish if there's a dead L3 and cores that should be equals.
 
The cores are big, so I think they're run differently for power reasons most of all, at least in the current spin.

If true, that's not a great way to approach it; it'd be better to simply cap the maximum sum of clock speeds across the two clusters at the kernel level or something (nVidia, for example, has done things like this). Fixing different maximum clock speeds for the two clusters is going to force more migrations than necessary, and removes the opportunity of running the cores at full clock speed in the most permissive thermal environments.
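As a rough userspace illustration of what I mean (the real thing would live in the kernel's cpufreq/thermal code, and this needs root), something like this could cap one cluster based on what the other is allowed. The frequency "budget", the numbers, and the CPU-to-cluster mapping are all made up for the sake of the example; only the sysfs paths are standard Linux cpufreq:

```c
#include <stdio.h>

/* Write a new maximum frequency (in kHz) for one CPU's cpufreq policy. */
static void set_max_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fprintf(f, "%ld\n", khz);
    fclose(f);
}

int main(void)
{
    /* Hypothetical shared budget: whatever one cluster takes, the other gives up. */
    const long budget_khz   = 3800000;
    const long cluster0_khz = 2150000;                   /* cluster 0 near flat out */
    const long cluster1_khz = budget_khz - cluster0_khz; /* cluster 1 gets the remainder */

    for (int cpu = 0; cpu <= 1; cpu++) set_max_khz(cpu, cluster0_khz); /* assumed cluster 0 */
    for (int cpu = 2; cpu <= 3; cpu++) set_max_khz(cpu, cluster1_khz); /* assumed cluster 1 */
    return 0;
}
```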

Maybe this is really all just a marketing trick to make people think they're doing more of a big.LITTLE-esque design than they are. That'd be pretty funny (and sad)

The level of transparency for core design, even for cores like x86, is much worse than it once was. The mobile providers have been even more opaque. The fragments of reverse-engineered details for Cyclone and Kryo and the limited marketing slides wouldn't add up to one Realworldtech page, a sliver from Agner Fog's optimization guide, or the reports done for the console CPUs in the previous gen.

Very poor availability of CPU design information is one thing, and unfortunately it's been terrible for a while. But saying that they decided against L3 cache to save die area would be less withholding information and more just straight up lying. Not that I'm very surprised given their creative die shots. They could have just said nothing on the topic; of course I don't expect them to admit that it's broken, but this just makes me think more that it is.

Intra-die variability has been getting worse, although another possibility is that all four cores at max performance have serious issues with parametric yields, but hobbling half of them brings the whole chip in line.
It should be possible to run a power virus on a per-cluster basis to figure out which one edges out the other. The lack of binning opportunities would lead to lumping the defect and variation compensation into one less-than-perfect standard.

Indeed, they'd better hope that the cluster they're binning for smaller L2 cache because of defects also happens to be the cluster that they're binning for lower clock speed.
 
If true, that's not a great way to approach it; it'd be better to simply cap the maximum sum of clock speeds across the two clusters at the kernel level or something (nVidia, for example, has done things like this). Fixing different maximum clock speeds for the two clusters is going to force more migrations than necessary, and removes the opportunity of running the cores at full clock speed in the most permissive thermal environments.
I suppose the possibility exists that they simply can't, especially if this decision wasn't settled until late in development. Actually having the L3 physically there would make the scenario of a late kludge more likely.
Maybe their DVFS isn't smart enough, or could not be validated/bug-fixed well enough for a fully dynamic setup.
Having an L3 and not being able to use it has certain other possibilities, such as some kind of synchronization issue or consistency problem akin to AMD's TLB issue.

At any rate, I am curious about how the Anandtech latency numbers are derived. Perhaps it is an artifact of mobile memory standards, but the wall-clock times are bad. Qualcomm without an (active) L3 is significantly worse than Apple's A9 with an L3, which is worse than the A9X by an amount that seems consistent with the latency of an exclusive L3 access.
The derived values for the AMD-powered consoles are very poor, but still better than the mobile ARM chips. AMD's higher-power APUs are bad but faster than the consoles, then there are AMD's non-APU CPUs, which were decent, then pretty much any modern Intel design at the forefront.
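If they're deriving latency the usual way, it's probably some variant of a dependent pointer chase through a buffer much larger than the caches, so every load stalls on the previous one. A generic sketch of that technique (not AnandTech's actual tool) would be something like:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (16 * 1024 * 1024)   /* 16M indices: far larger than any L2/L3 */
#define ITERS   (32 * 1024 * 1024)

int main(void)
{
    size_t *buf = malloc(ENTRIES * sizeof(size_t));
    if (!buf) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so
       prefetchers can't follow the access pattern. */
    for (size_t i = 0; i < ENTRIES; i++) buf[i] = i;
    for (size_t i = ENTRIES - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < ITERS; i++)
        p = buf[p];                   /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load-to-use latency: %.1f ns (sink=%zu)\n", ns / ITERS, p);
    free(buf);
    return 0;
}
```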

Indeed, they'd better hope that the cluster they're binning for smaller L2 cache because of defects also happens to be the cluster that they're binning for lower clock speed.
I've given this some additional thought, and I think that slashing half the L2 and dropping the max clock somewhere below the inflection point of the power curve provides a lot of wiggle room, perhaps enough that intra-cluster variation only matters as a secondary consideration. If for some reason a chip's power is so borderline that one cluster has defects and the other runs hot, it might fall into a corner case that Qualcomm would decide to discard.
 
Maybe their DVFS isn't smart enough, or could not be validated/bug-fixed well enough for a fully dynamic setup.
I doubt it. I could write a driver for such logic in like a day.
At any rate, I am curious about how the Anandtech latency numbers are derived.
It's fairly accurate.
I've given this some additional thought, and I think that slashing half the L2 and dropping the max clock somewhere below the inflection point for the power curve provides a lot of wiggle room
It would be interesting to see if the clusters are somehow binned so that one or the other ends up as the "fast" cluster. What I'm wondering is why bother to slash the L2 at all; it's not like it'll give some amazing leakage advantage when you compare it to the cores themselves. We still don't really know if the slow cluster's L2 is really 512KB other than Tom's saying it is, and we won't be able to verify it until I get a device and shut down the fast cores to get a latency run on the slow cluster.
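The shutdown part at least is straightforward on a rooted device via CPU hotplug in sysfs. A minimal sketch, assuming CPUs 2 and 3 are the fast pair (which would need to be verified against the device tree before trusting any numbers):

```c
/* Offline the (assumed) fast-cluster CPUs so a latency run lands on the
 * slow cluster. Needs root; the CPU numbering here is an assumption. */
#include <stdio.h>

static void set_cpu_online(int cpu, int online)
{
    char path[96];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fprintf(f, "%d\n", online);
    fclose(f);
}

int main(void)
{
    set_cpu_online(2, 0);   /* assumed fast ("gold") pair */
    set_cpu_online(3, 0);
    /* ...run the latency benchmark here, then bring the cores back up. */
    return 0;
}
```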
 