These are not improvements enabled by changing the power management control loop on generally equivalent silicon; this is a physical design overhaul.
I did not posit a change in the DVFS design, since it was announced as being there when the Jaguar architecture was launched.
What looks not to have been planned at design time was the foundry jump to TSMC, which AMD needed to pay for. (Edit: much of the team left as well, but how much that matters versus the way it killed AMD's ability to design more than one new core is unclear to me.)
I look askance at that move as a potential reason why the full range of Jaguar's announced DVFS capability was not realized, outside of an oddly inflexible exception. It took an additional cycle, something that has happened to a lesser extent with the desktop APU refreshes as processes improved and more time elapsed for characterizing the hardware.
The guard-banding for AMD's initial offerings has a history of being conservative; see the 7970 to 7970 GHz Edition, the 2xx to 3xx series, Trinity/Richland, Kaveri/Godavari, Carrizo/Bristol Ridge, etc.
I think we have to be clear exactly what we're talking about when we talk about hardware frequency control.
I am discussing this in terms of closed-loop voltage control being something that is implemented first, which then allows more advanced control of clock speed to be implemented. Guard-banding is reduced by voltage control, and it can be reduced further with frequency control.
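As a rough illustration of why closed-loop voltage control trims guard band (every number and name below is invented for the sketch, not any vendor's figure): an open-loop design must supply worst-case margin at all times, while a closed loop pays only for the droop it actually observes plus a small sensing margin.

```python
# Toy sketch of open-loop vs closed-loop voltage margining.
# All values are hypothetical; real designs do this in hardware/firmware.

NOMINAL_V = 1.00        # voltage needed at the target frequency, in volts
OPEN_LOOP_GUARD = 0.10  # worst-case droop margin an open-loop design must always add

def closed_loop_supply(measured_droop: float, margin: float = 0.02) -> float:
    """Closed loop: supply the measured droop plus a small sensor/response margin,
    instead of the worst case at all times."""
    return NOMINAL_V + measured_droop + margin

# Open loop always pays the full guard band:
open_loop_v = NOMINAL_V + OPEN_LOOP_GUARD   # 1.10 V, unconditionally

# Closed loop pays for what it sees:
light_load_v = closed_loop_supply(0.01)     # ~1.03 V under light load
heavy_load_v = closed_loop_supply(0.08)     # ~1.10 V only under worst-case load
```

Lower average voltage at the same frequency is where the power win comes from, and the same measured signal is what a frequency-control loop can later build on.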
I wasn't sure what Qualcomm had already implemented, and so I was not sure where it would have been in the process. Given where it hopes to go, I think Qualcomm will have a use for the additional functionality.
These two features have been implemented in processors I know of:
- The core dynamically and temporarily drops the frequency to compensate for sudden current transients. This is less of a problem on lower-power devices because the transients aren't as large, and there are other solutions that involve dynamically changing the voltage instead.
This goes to my discussion about Qualcomm's higher-end goals for a server chip, and if that is being leveraged in the mobile silicon. The learning curve from that might be indicated in something odd like a physically present L3, and some other design quirks.
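The transient-compensation idea above can be sketched in a few lines (thresholds and the proportional policy are my own illustrative choices; real implementations react in hardware within nanoseconds):

```python
# Toy sketch of reactive clock reduction on a current transient.
# Numbers and the proportional policy are hypothetical.

def next_frequency(f_target_mhz: float, current_a: float, i_limit_a: float) -> float:
    """If measured current overshoots the limit, scale frequency down in
    proportion to the overshoot rather than provisioning worst-case voltage."""
    if current_a <= i_limit_a:
        return f_target_mhz
    return f_target_mhz * (i_limit_a / current_a)

print(next_frequency(3000.0, 40.0, 50.0))  # within limit: stays at 3000 MHz
print(next_frequency(3000.0, 60.0, 50.0))  # overshoot: drops to ~2500 MHz
```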
- The core estimates/measures power consumption in order to determine what frequencies it can currently support, and exposes that to the OS, e.g. as p-states.
But in these cases the OS is still setting a nominal frequency that the core will generally run at. The only exception I know of here is Skylake, where the CPU can take over scheduling the frequency (presumably under the same guidelines an OS tends to use, which is roughly speaking about minimizing idle time).
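A toy version of that second feature might look like the following (the frequency/power table and budgets are invented, not any real product's):

```python
# Hypothetical sketch: a core gating which p-states it advertises to the OS
# based on its current internal power-headroom estimate.

# (frequency in MHz, estimated power in watts at that frequency)
P_STATES = [(800, 3.0), (1600, 7.0), (2400, 14.0), (3200, 24.0)]

def supported_pstates(power_budget_w: float) -> list:
    """Return the frequencies the core would currently expose to the OS,
    given the headroom its power estimator reports."""
    return [f for f, w in P_STATES if w <= power_budget_w]

print(supported_pstates(30.0))  # full headroom: all four states exposed
print(supported_pstates(10.0))  # constrained: only 800 and 1600 MHz
```

The OS still picks a nominal frequency from whatever list it is shown; the hardware has only curated the menu.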
My interpretation is that prior to Skylake's Speed Shift, the states below the turbo range were handled by the OS power states, with the hardware sneaking in turbo bins opportunistically.
Intel added the S0ix active idle functionality prior to Skylake, which is where the hardware opportunistically takes the core down to lower power states than the OS is aware of. Of the two, the active idle path seems like it would be the larger gain, due to Intel's cores being so over-engineered for the space. That is more pressing than it would be for an architecture content with the purely mobile space, but I question whether that is the case for Qualcomm this time around.
One thing with SoCs like Puma is that they're usually run on OSes like Windows, and it's difficult to get MS to update the kernel to include power management code that's heavily tuned for any specific CPU, especially in time for the product's release day. So they may have no option but to put power modeling in the CPU even if it doesn't strictly require fast response times.
For Windows at least, the response time is measured in tens of milliseconds, whereas at the high power ranges critical events operate on timescales an order of magnitude shorter.
That leads to pessimization on the part of the software about what it thinks it can risk, and to thicker guard-banding on the part of the hardware. If Android can poll that much faster, and Kryo doesn't need to target a higher-power device class, then I agree the need isn't the same.
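The timescale mismatch above is just arithmetic, but it is worth making concrete (both intervals are illustrative assumptions, not measured values):

```python
# Back-of-envelope: how many critical power events can come and go
# inside a single OS power-management decision window.
# Both numbers below are assumptions for illustration.

os_poll_interval_s = 30e-3   # OS governor tick: tens of milliseconds
transient_event_s = 1e-3     # critical event at high power: ~an order of magnitude shorter

events_per_window = os_poll_interval_s / transient_event_s
print(events_per_window)     # dozens of events per OS decision
```

Anything the OS cannot see inside that window has to be absorbed by hardware margin, which is exactly the guard band a faster on-die loop can reclaim.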