Qualcomm SoC & ARMv8 custom core discussions

Discussion in 'Mobile Devices and SoCs' started by Nebuchadnezzar, Jan 20, 2015.

  1. Turbotab

    Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    214
    Likes Received:
    3
Do you know about the L3 cache because you've seen die shots, or through other insider knowledge?
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I was operating on a subconscious assumption that the baseline for DVFS was something like the closed loop Samsung recently implemented, rather than the common case of a software solution.
    A more involved scheme would rely on better physical characterization and more complex clock/power management logic. That can take more time to get right.
    Does Kryo not have anything at the hardware level?
    It would make me lean towards the possibility that software can't deliver the upper clock ranges dynamically across four cores.

The latencies are fairly bad then, at least from the perspective of non-mobile architectures. The latency numbers I've seen for those would have come from different tests, so I am unsure they can be compared. Qualcomm's are higher than an architecture that officially has an L3, which can point to some amount of undisclosed complexity. There are some things like the number of CPU domains and independent graphics and DSP resources hooked into the uncore that look like contributors.

    Defects come to mind, but it could be motivated by a desire to keep the performance cluster's L2 as much of a superset of the slow one as possible in the case of thread migration, especially if there was supposed to be an L3 that could have backed this up.
    Last minute bugs or show-stoppers can sometimes lead to weird workarounds when dealing with a deadline.
     
  3. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
Closed-loop voltage systems don't include frequency; they just dynamically control voltage depending on characterization and temperature/load/current. Nvidia since TK1 and Qualcomm since the 810 have quite advanced closed-loop voltage systems, and the 820 has one as well. Samsung's is a special case as they use both binning and the micro-controller, so it's not directly comparable.
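To make the closed-loop voltage idea concrete, here is a minimal sketch in C of one adaptive-voltage control step: a binned characterization table gives a guard-banded nominal voltage per operating point, and live timing-slack feedback trims the margin. The struct, table values, thresholds, and step sizes are all invented for illustration and don't reflect any vendor's actual scheme.

```c
/* Minimal adaptive-voltage sketch of a closed voltage loop:
 * a binned characterization table gives a guard-banded nominal voltage,
 * and on-die timing-slack feedback trims the margin at runtime.
 * All tables, thresholds, and step sizes here are invented. */

typedef struct {
    int freq_mhz;
    int vnom_mv;   /* binned nominal voltage, guard band included */
} opp_t;

static const opp_t opp_table[] = {
    {  800,  800 },
    { 1400,  900 },
    { 2000, 1050 },
};

/* One control-loop step: plenty of timing slack -> step voltage down
 * (bounded by a safety floor); slack nearly gone -> step back up. */
int avs_target_mv(int freq_mhz, int slack_ps, int cur_mv)
{
    int vnom = opp_table[0].vnom_mv;
    for (unsigned i = 0; i < sizeof opp_table / sizeof *opp_table; i++)
        if (opp_table[i].freq_mhz == freq_mhz)
            vnom = opp_table[i].vnom_mv;

    if (slack_ps > 50 && cur_mv > vnom - 50)
        return cur_mv - 5;   /* trim the guard band */
    if (slack_ps < 10 && cur_mv < vnom)
        return cur_mv + 5;   /* restore margin */
    return cur_mv;
}
```

The point of the sketch is that frequency never appears as an output: the loop only adjusts voltage under the frequency the software has already chosen, which is the distinction being drawn above.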

As much as Intel likes to talk about their hardware frequency control, I see no advantage over what the mobile SoC space does in software.
It depends what you're measuring. The numbers we publish in mobile are worst-case full random latency, which goes to about 256ns. Random within an access window flatlines at about 125ns once clearly past the L2. And again, of course, we're not seeing the L3 because it's supposedly not active / not there.
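For readers unfamiliar with how these figures are produced, the "full random" number typically comes from a dependent-load pointer chase. A minimal sketch, with invented helper names and the timing harness omitted: the buffer is shuffled into a single random cycle (Sattolo's algorithm) so each load depends on the previous result and latency can't be hidden.

```c
#include <stddef.h>
#include <stdlib.h>

/* Turn buf[0..n-1] into one random cycle (Sattolo's algorithm), so
 * idx = buf[idx] visits every slot exactly once before returning. */
void build_chain(size_t *buf, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        buf[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i: forces a single cycle */
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
}

/* Dependent-load walk: every access waits on the previous one, so
 * (elapsed time / hops) approximates average load-to-use latency
 * once the buffer is much larger than the caches being measured. */
size_t walk(const size_t *buf, size_t start, size_t hops)
{
    size_t idx = start;
    while (hops--)
        idx = buf[idx];
    return idx;
}
```

Restricting the shuffle to a window of the buffer gives the "random within an access window" variant mentioned above, which limits TLB and DRAM page-miss effects.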
     
  4. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,156
    Likes Received:
    1,433
    Location:
    Beyond3D HQ
    Die shot.
     
    Turbotab likes this.
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
I approach it from where x86 cores would be if they didn't have it. The example I was thinking of was actually AMD's Jaguar, since its architecture launched with a fully featured on-chip DVFS in the design that failed to be fully realized until generally equivalent silicon launched as Puma.
There was poorer perf/W, and certain hacky clock management choices, like Jaguar not reaching its upper clock band in certain products except when the device was plugged into a dock.

    Qualcomm's aspirations for taking its custom cores beyond mobile also threaten to take it into a place where it cannot be so readily discounted.
The server chip would have a definite use for a working L3, and more responsive clock and power management given the competition and higher core counts.
    I suppose Qualcomm could have two different architectures, with the server version getting an L3 and more integrated power and clock management.
On the other hand, the phone chip wouldn't need a dead L3 and weird clock settings for the same cores, and the server core would lose the ability to leverage the R&D from a high-volume product.

    The post L2 number does bring the latencies more in line with some of the other architectures' fuzzy numbers. I find that an interesting number for comparative purposes, and it might show some of the differences between mobile and non-mobile memory standards and page policies.
     
  6. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    With Puma:
    http://anandtech.com/show/7974/amd-beema-mullins-architecture-a10-micro-6700t-performance-preview

These are not improvements enabled by changing the power management control loop on generally equivalent silicon; this is a physical design overhaul. And it's not that surprising that these gains could be realized: Kabini used 0.77W when idle (http://www.hotchips.org/wp-content/.../HC25.26.111-Kabini-APU-Bouvier-AMD-Final.pdf), which is a lot for an SoC in this segment.

    One of the big power management features of Puma is chassis temperature based power budgeting:

    http://anandtech.com/show/7974/amd-beema-mullins-architecture-a10-micro-6700t-performance-preview/2

    This too is not something that you need closed loop hardware for: chassis temperature changes relatively slowly and you can see that their control loop worked over periods of hundreds of seconds.
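The slow, software-friendly nature of that loop can be sketched in a few lines of C. This is an invented illustration of chassis-temperature power budgeting, not AMD's actual algorithm: the skin temperature drifts over seconds, so a proportional step run every few seconds suffices; the struct, units, and gain are made up.

```c
/* Sketch of a slow, software skin-temperature budget loop: chassis
 * temperature moves over seconds, so an update every few seconds with
 * a proportional step is enough.  The struct, gain, and units are
 * invented for illustration. */
typedef struct {
    int budget_mw;        /* current package power budget */
    int min_mw, max_mw;   /* clamp range */
} budget_t;

void budget_update(budget_t *b, int skin_mc, int limit_mc)
{
    int err_mc = limit_mc - skin_mc;   /* + = headroom, - = over limit */
    b->budget_mw += err_mc / 10;       /* proportional gain, arbitrary */
    if (b->budget_mw > b->max_mw) b->budget_mw = b->max_mw;
    if (b->budget_mw < b->min_mw) b->budget_mw = b->min_mw;
}
```

Nothing here needs a fast hardware loop; a kernel thread waking up every few seconds and reading a chassis thermistor would implement exactly this.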

    I think we have to be clear exactly what we're talking about when we talk about hardware frequency control. These two features have been implemented in processors I know of:

    - The core dynamically temporarily drops the frequency to compensate for sudden current transient events. This is less of a problem on lower power devices because the transients aren't as large. And there are other solutions that involve dynamically changing the voltage instead.
    - The core estimates/measures power consumption in order to determine what frequencies it can currently support, and will expose that to the OS eg as p-states.

    But in these cases the OS is still setting a nominal frequency that the core will generally run at. The only exception I know of here is Skylake, where the CPU can take over scheduling the frequency (presumably under the same guidelines an OS tends to use, which is roughly speaking about minimizing idle time). This allows a faster response time, but really only a small number of benchmarks show a meaningful performance improvement and power consumption doesn't really change. In general I would say that most programs don't need rapid frequency changing to work efficiently.
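To make concrete what "the OS sets a nominal frequency" means in practice, here is a hypothetical ondemand-style governor step: sample the busy fraction over the last window and pick the lowest P-state that keeps projected utilization under a target. The P-state table, 80% target, and function name are all invented for illustration.

```c
/* Hypothetical ondemand-style governor step: pick the lowest P-state
 * that keeps projected utilization under a target, given the busy
 * fraction observed at the current frequency.  Table and target are
 * invented, not any real SoC's operating points. */
static const int pstate_mhz[] = { 600, 1000, 1400, 1800, 2200 };

int governor_pick_mhz(int busy_pct, int cur_mhz)
{
    int nstates = sizeof pstate_mhz / sizeof *pstate_mhz;
    int demand = cur_mhz * busy_pct / 100;   /* "MHz of work" demanded */

    for (int i = 0; i < nstates; i++)
        if (demand * 100 <= pstate_mhz[i] * 80)   /* keep util <= 80% */
            return pstate_mhz[i];
    return pstate_mhz[nstates - 1];
}
```

The response time of a loop like this is bounded by how often the OS samples load, which is the crux of the hardware-vs-software argument in this thread.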

One thing with SoCs like Puma is that they're usually run on OSes like Windows, and it's difficult to get MS to update the kernel to include power management code that's heavily tuned for any specific CPU, especially in time for release day of the product. So they may have no option but to put power modeling in the CPU even if it doesn't strictly require fast response times. The situation is very different for SoCs that are used almost exclusively in devices like Android phones, where the hardware vendor has control over what goes into the OS and where there's a lot of device-specific code.
     
    #186 Exophase, Dec 14, 2015
    Last edited: Dec 14, 2015
  7. wishiknew

    Regular

    Joined:
    May 19, 2004
    Messages:
    332
    Likes Received:
    6
    Did the 810 hit a new low at Anandtech's OnePlus 2 review?
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I did not posit a change in the DVFS design, since it was announced as being there when the Jaguar architecture was launched.
What looks not to have been planned at design time was the foundry jump to TSMC, which AMD needed to pay for. (edit: Much of the team left as well, but how much that matters versus it killing AMD's ability to design more than one new core is unclear to me.)
    I look askance at that move as a potential reason why the full range of Jaguar's announced DVFS capability was not realized, outside of an oddly inflexible exception. It took an additional cycle, something that to a lesser extent has happened with the desktop APU refreshes as processes improved and more time has elapsed for characterizing the hardware.
    The guard-banding for AMD's initial offerings has a history of being conservative, per the history of AMD's 7970 to 7970 GHz, its 2xx to 3xx series, Trinity/Richland, Kaveri/Godavari, Carrizo/Bristol Ridge, etc.

I am discussing this in terms of closed-loop voltage control being something that is implemented first, and that allows more advanced clock-speed control to be implemented later. Guard banding is reduced by voltage control, and it can be reduced further with frequency control.
    I wasn't sure what Qualcomm had already implemented, and so I was not sure where it would have been in the process. Given where it hopes to go, I think Qualcomm will have a use for the additional functionality.

    This goes to my discussion about Qualcomm's higher-end goals for a server chip, and if that is being leveraged in the mobile silicon. The learning curve from that might be indicated in something odd like a physically present L3, and some other design quirks.

My interpretation is that prior to Skylake's Speed Shift, the states below the turbo range were handled with OS power states, with the hardware sneaking in turbo bins opportunistically.
Intel added the S0ix active idle functionality prior to Skylake, which is where the hardware opportunistically takes the core down to lower power states than the OS is aware of. The latter seems like it would be a larger gain due to Intel's cores being so over-engineered for the space. This is more pressing than it would be for an architecture content with the purely mobile space, but I am questioning if that is the case for Qualcomm this time around.

For Windows at least, its response time is measured in tens of milliseconds, whereas at the high power ranges critical events operate on timescales an order of magnitude shorter.
That leads to pessimization on the part of the software about what it thinks it can risk, and it leads to thicker guard-banding on the part of the hardware. If Android can poll that much faster, and Kryo doesn't need to target a higher-power device class, then I agree the need isn't the same.
     
    #188 3dilettante, Dec 14, 2015
    Last edited: Dec 14, 2015
  9. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
Exophase is correct about the Windows vs Android thing. Mobile devices have absolute control over DVFS and scheduler stuff, and can pretty much implement whatever behaviour you want just by updating some drivers that the vendor can push through. Hardware control makes sense for Windows due to the sheer effort of trying to change anything in the OS' drivers, and due to how comparatively limited the logic in those drivers is.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Does that indicate that Android can respond in the millisecond to sub-millisecond range, compared to tens of milliseconds for Windows?
    Is this representative of the Linux-based systems x86 server chips run in, and which Qualcomm is hoping to get into?
     
  11. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,982
    Likes Received:
    4,569
    No, it hit a new low on OnePlus' specific implementation.
    To disable the big cores for Javascript-intensive scenarios (which is mostly why the big cores exist in the first place) is just terrible engineering and shows that the people responsible for OxygenOS are completely clueless about anything other than making cute-looking skins for Android (i.e. they're modders, not developers). They definitely didn't do any performance profiling whatsoever (I wonder if they even know how to do it).
My guess is the OnePlus 2 is probably losing battery life on real-world web browsing scenarios due to the Cortex A57 cores being cut off.

    It seems to me that OnePlus was doomed the moment they lost their partnership with CyanogenMod.
     
  12. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
Yes, you can set it up to respond in sub-millisecond time, and there are many QoS mechanisms that instantaneously trigger up-scaling in DVFS based on certain events, all of which are extremely customized to the platform and SoC.
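The shape of such an event-triggered mechanism can be sketched simply. This is an invented illustration of a DVFS boost floor, not any vendor's actual interface: an input event immediately raises a minimum-frequency request that lapses after a short hold time; the names, floor value, and hold time are hypothetical.

```c
/* Sketch of an event-triggered DVFS floor: an input event immediately
 * raises a minimum-frequency request that lapses after a short hold
 * time.  Names, floor value, and hold time are hypothetical, not any
 * vendor's actual interface. */
typedef struct {
    int  floor_khz;    /* boosted minimum frequency */
    long expires_ms;   /* monotonic time when the boost lapses */
} qos_boost_t;

void qos_boost_kick(qos_boost_t *q, long now_ms)
{
    q->floor_khz  = 1400000;       /* jump straight to a high floor */
    q->expires_ms = now_ms + 80;   /* hold for ~80 ms */
}

int qos_effective_min_khz(const qos_boost_t *q, long now_ms, int base_khz)
{
    if (now_ms < q->expires_ms && q->floor_khz > base_khz)
        return q->floor_khz;
    return base_khz;
}
```

Because the kick happens in the event path rather than on a governor's polling tick, the ramp-up latency is effectively zero from the workload's point of view.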

    Server systems usually run more vanilla software which doesn't include most of these non-upstream drivers/patches, so I wouldn't know how it behaves.
     
  13. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
  14. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
The 810 does incredibly badly compared to the 7420. Is it both a TSMC process issue and a bad implementation by Qualcomm?
     
  15. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,373
    Likes Received:
    242
    Location:
    NY
    They aren't comparable. 7420 was finfet ("14nm") and 810 was not (20nm). I suspect all A57 implementations are poor on 20nm.
     
  16. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
    Nvidia seems to do just fine.
     
  17. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,373
    Likes Received:
    242
    Location:
    NY
They are near Samsung's 14nm? IIRC you even noted the Note 4 international version (20nm) was also poor on power.
     
  18. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
No, the 7420 is still ahead.
I didn't want to mention the 5433 because it's on a different process, but both X1 and 5433 seem to be at about the same level. The 810/808 use about 60-70% more power, which kinda redefined what's considered "poor". Implementation is wildly different.
     
    #198 Nebuchadnezzar, Dec 16, 2015
    Last edited: Dec 16, 2015
  19. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,373
    Likes Received:
    242
    Location:
    NY
    Ah okay, I didn't mean to imply all 20nm implementations were as poor as the 810, but in general they are all poor (for phones at least). While the 810 was ultimately the worst offender, I don't think you can blame qualcomm or tsmc completely for its failure. Even "good" implementations of the A57 on 20nm can't compare to the 7420. Qualcomm can be blamed for making things worse though. :razz:
     
  20. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Poor in what way? By not being able to match the efficiency of an SoC on a finfet process? That's hardly behind expectations. The only expectations are that the efficiency is somewhat superior to what a similar design could have gotten on a previous process. If X1 is about 5433 level in efficiency that should at least be roughly the case.

    In their 808 and 810 SoCs Qualcomm has worse efficiency in the much weaker Cortex-A53 core than their old Krait 400, pretty much across the board (this is assuming there's nothing wrong with the analysis performed; if anything Krait is not a strong performer in SPEC so it might even be disadvantaged here). That is a very, very poor showing.

    I'm really struggling to even process these results. Why is the A53 perf/W on 810 even worse than on 808? Why is it only marginally better than the A57 perf/W? How did Qualcomm screw this up so badly? Does the A53 perf/W on their 28nm SoCs look significantly better than this? Or does that also use a terrible implementation?
     