Qualcomm SoC & ARMv8 custom core discussions

Discussion in 'Mobile Devices and SoCs' started by Nebuchadnezzar, Jan 20, 2015.

  1. juicytuna

    Newcomer

    Joined:
    Jul 27, 2005
    Messages:
    71
    Likes Received:
    0
    It fares better than the A9 on the memory score.

    Looking at those numbers and assuming a frequency of 2.2 GHz, it has approximately the same integer IPC as the A57 but 50% more floating-point IPC. Memory performance is about 2x better.
    The multi-core performance is strangely low though. You'd think it would manage ~7000 if the smaller cores had just as good IPC and scaling was linear.
     
  2. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    555
    Likes Received:
    93
    How did you reach 7000? Wouldn't the theoretical peak be ~6500? (1872 * 2) + (1873 * (16.5 / 22) * 2) = 3744 + 2809.5 = 6553.5
    84% of purely theoretical 4-core scaling is reasonable, considering there will always be bottlenecks coming into play.
     
  3. juicytuna

    Newcomer

    Joined:
    Jul 27, 2005
    Messages:
    71
    Likes Received:
    0
    Assuming 2.2 GHz on the big cores and 1.6 GHz on the little:
    (2162 * 2) + (2162 * (16 / 22) * 2) = 4324 + 3144 = 7468

    I was working with the overall score whereas you seem to be working with just the integer score. Makes sense because you can't expect the memory score to scale linearly with core count.
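
    The two estimates above can be reproduced with a short Python sketch. It assumes, as both posts do, that every core has the same IPC and that scaling is perfectly linear, so the little cores contribute in proportion to their clock. The function name and its parameters are made up for illustration. Using 1872 for both terms of the integer estimate gives 6552 rather than the 6553.5 quoted earlier, which used 1873 in the second term.

```python
def linear_scaling_estimate(single_core_score, big_clock, little_clock,
                            big_cores=2, little_cores=2):
    """Ideal multi-core score under equal IPC and linear scaling.

    Little cores are assumed to score in proportion to their clock
    relative to the big cores. This is an upper bound, not a prediction.
    """
    big = single_core_score * big_cores
    little = single_core_score * (little_clock / big_clock) * little_cores
    return big + little


# Overall single-core score with 2.2 GHz big and 1.6 GHz little cores.
overall = linear_scaling_estimate(2162, 2.2, 1.6)

# Integer single-core score with the 16.5 / 22 clock ratio.
integer = linear_scaling_estimate(1872, 22, 16.5)

print(f"{overall:.1f}")  # 7468.7
print(integer)           # 6552.0
```

    Dividing a measured multi-core score by one of these estimates gives the scaling efficiency discussed above.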
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
  5. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    975
    Likes Received:
    145
    Location:
    Luxembourg
  6. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    365
    Likes Received:
    257
    Looks like the usual ES2.0 forward shading path with planar reflections and simplified lighting/materials, with few to no draw calls in this demo.
    I compiled several UE4 examples with the ES3.1 AEP deferred shading renderer for the Shield Tablet in March, and they are much more complicated from both the shading and the CPU points of view.
    The average framerate was about 20 FPS and almost all of the examples were CPU bound; 3000 draw calls per frame in the video above are still quite heavy for a mobile CPU with ES3.1.
     
  7. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    I said I'm impressed by the artistic work.
     
  8. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    975
    Likes Received:
    145
    Location:
    Luxembourg
  9. juicytuna

    Newcomer

    Joined:
    Jul 27, 2005
    Messages:
    71
    Likes Received:
    0
    It's pretty amazing what they've achieved with such a narrow architecture. It has very good IPEU (instructions per execution unit :razz:)
     
  10. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Sounds like the CPU execution block might be too thin given the massive memory bandwidth available to it, but perhaps it makes sense on the whole SoC level with the much hungrier GPU and new DSP features.
     
  11. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
  12. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,158
    Likes Received:
    1,439
    Location:
    Beyond3D HQ
    There is an L3 physically (at least 4MB, not sure of the exact size), but it's disabled just now. Hardware bug if I had to bet, and I believe it's going back for a fourth revision (not sure if it's to fix that issue though). There's nothing different about the four Kryo cores either, physically. I think they overvolt for the performance ("gold") pair and nominal/undervolt for the power ("silver") pair. Not sure if it's the same pair every time that's set up that way, in every chip.
     
    Kaarlisk likes this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    The disclosure indicates that the slow pair has a different amount of cache available. If the cores are physically identical, then it sounds like Qualcomm is trying to spin a defect-recovery mechanism as a product feature. It would make sense in that scenario to decide which pair is gold on a case-by-case basis, based on which ones may have the worse variation or defects.
     
    Rys likes this.
  14. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    According to Tom's Hardware:

    "Unlike Apple's A9, Snapdragon 820 does not use an L3 cache. Qualcomm says it considered using an L3 cache, but ultimately decided the benefits did not outweigh the additional cost in power and die space for its design."

    If it's in the actual SoC but disabled due to a bug then Qualcomm isn't being very honest (or Tom's isn't properly quoting them).

    Very surprised to hear that the Kryo cores don't have any implementation differences either, since it's generally being reported that the two clusters have different design optimizations. If they're the same implementation, the volt/frequency curves and leakage should be roughly the same and there wouldn't be a good reason why one can't clock as high as the other, unless they're using different style VRMs for the two or something.

    That would explain the difference in cache, but what about the difference in clock speed? I know there's going to be some natural variation in the freq/voltage curves but this much across the same chip?
     
  15. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,158
    Likes Received:
    1,439
    Location:
    Beyond3D HQ
    The cores are big, so I think they're run differently for power reasons most of all, at least in the current spin. I'm half expecting a different final v4 (or an 825 or something), just not 100% sure what that might mean yet other than probably all 4 cores running equally, not heterogeneously.

    There's no if with the L3 or the Kryo complex being homogenous. I can see it on the chip floorplan and the L3 is configured in the MSM8996 kernel source for all 3 versions so far.
     
  16. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    975
    Likes Received:
    145
    Location:
    Luxembourg
    There is overwhelming evidence of the L3; I put it into this graphic a month ago for good reason (it has been known about for more than half a year): http://images.anandtech.com/doci/9778/S820.png

    Of course Qualcomm tells us there's no L3.

    The current situations might be:
    a) The L3 is buggy and things are being covered up.
    b) The L3 works fine but the performance/power benefits are not there so they just decided to disable it.

    In case of:
    a) We'll see in the future a new SKU and they'll miraculously have a "new" working L3.
    b) The new revision or production silicon won't have it on the die, thus saving space & cost.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    The level of transparency for design for cores like x86 is much worse than it once was. The mobile providers have been even more opaque. The fragments of reverse-engineered details for Cyclone and Kryo and the limited marketing slides wouldn't add up to one Realworldtech page, a sliver from Agner Fog's optimization guide, or the reports done for the console CPUs in the previous gen.

    Intra-die variability has been getting worse, although another possibility is that four cores at max performance have serious issues with parametric yields, but hobbling half of them brings the whole chip in line.
    It should be possible to run a power virus on a per-cluster basis to figure out which one edges out the other. The lack of binning opportunities would lead to lumping the defect and variation compensation into one less than perfect standard.

    Maybe a "new" Kryo2 with all 4 cores unleashed and a functioning L3 could be released later.

    What Qualcomm has done does seem rather hackish if there's a dead L3 and cores that should be equals.
     
  18. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    If true, that's not a great way to approach it. It'd be better to simply cap the maximum sum of clock speeds of the two clusters at the kernel level or something (nVidia, for example, has done things like this). Fixing different maximum clock speeds for the two clusters is going to force more migrations than necessary, and it removes the opportunity of running the cores at full clock speed in the most permissive thermal environments.
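
    A minimal sketch of that kind of shared cap, written in Python rather than as kernel code. Everything here is hypothetical: the function name, the 3800 MHz budget (2200 + 1600, borrowing the clocks mentioned earlier in the thread) and the 2200 MHz per-cluster ceiling. A real driver would also have to snap to the discrete operating points the hardware supports, which this ignores.

```python
def cap_cluster_clocks(requested_mhz, budget_mhz, max_mhz):
    """Limit the SUM of cluster clocks instead of fixing a per-cluster maximum.

    Each request is first clamped to the per-cluster ceiling. If the clamped
    requests still exceed the shared budget, all clusters are scaled down
    proportionally (integer math, rounding down).
    """
    clamped = [min(r, max_mhz) for r in requested_mhz]
    total = sum(clamped)
    if total <= budget_mhz:
        return clamped
    return [c * budget_mhz // total for c in clamped]


# Both clusters busy: the shared budget is split evenly.
print(cap_cluster_clocks([2200, 2200], 3800, 2200))  # [1900, 1900]

# One cluster lightly loaded: the other can run at its full clock.
print(cap_cluster_clocks([2200, 800], 3800, 2200))   # [2200, 800]
```

    The second call shows the point being made: with a shared budget, either cluster can reach the full clock when the other is not asking for much, so no task has to migrate just to get the higher frequency.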

    Maybe this is really all just a marketing trick to make people think they're doing more of a big.LITTLE-esque design than they are. That'd be pretty funny (and sad)

    Very poor availability of CPU design information is one thing, and unfortunately it's been terrible for a while. But saying that they decided against L3 cache to save die area would be less withholding information and more just straight up lying. Not that I'm very surprised given their creative die shots. They could have just said nothing on the topic; of course I don't expect them to admit that it's broken, but this just makes me think more that it is.

    Indeed, they'd better hope that the cluster they're binning for smaller L2 cache because of defects also happens to be the cluster that they're binning for lower clock speed.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    I suppose the possibility exists that they simply can't, especially if this decision wasn't settled until late in development. Actually having the L3 physically there would make the scenario of a late kludge more likely.
    Maybe their DVFS isn't smart enough, or could not be validated/bug-fixed well enough for a fully dynamic setup.
    Having an L3 and not being able to use it has certain other possibilities, such as some kind of synchronization issue or consistency problem akin to AMD's TLB issue.

    At any rate, I am curious about how the Anandtech latency numbers are derived. Perhaps it is an artifact of mobile memory standards, but the wall-clock times are bad. Qualcomm without an (active) L3 is significantly worse than Apple's A9 with an L3, which is worse than the A9X by an amount that seems consistent with the latency of an exclusive L3 access.
    The derived values for the AMD-powered consoles are very poor, but still better than the mobile ARM chips. AMD's higher-power APUs are bad but faster than the consoles, and then there's AMD's non-APU CPUs that were decent, then pretty much any modern Intel design at the forefront.

    I've given this some additional thought, and I think that slashing half the L2 and dropping the max clock somewhere below the inflection point for the power curve provides a lot of wiggle room, perhaps more than needing to worry about intra-cluster variation except maybe as a secondary consideration. If for some reason the chip's power is that borderline where one cluster has defects and the other runs hot, it might be in the corner case that Qualcomm would decide to discard.
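
    To illustrate why dropping below the knee of the power curve buys so much, here is a rough Python sketch using the usual approximation that dynamic power scales with C * V^2 * f. The voltage/frequency curve is entirely made up (flat at 0.8 V up to 1.6 GHz, then rising 0.5 V per GHz); none of these numbers are measurements of Kryo or any other core, and leakage is ignored.

```python
def voltage_for(freq_ghz, v_min=0.8, knee_ghz=1.6, slope_v_per_ghz=0.5):
    """Hypothetical V/f curve: flat up to the knee, linear above it."""
    return v_min + max(0.0, freq_ghz - knee_ghz) * slope_v_per_ghz


def relative_dynamic_power(freq_ghz, ref_ghz=2.2):
    """Dynamic power (~ C * V^2 * f) relative to the reference clock.

    The switched capacitance C cancels out of the ratio.
    """
    def power(f):
        return voltage_for(f) ** 2 * f

    return power(freq_ghz) / power(ref_ghz)


for f in (1.2, 1.6, 1.9, 2.2):
    print(f"{f:.1f} GHz -> {relative_dynamic_power(f):.2f}x of 2.2 GHz power")
```

    With these assumed parameters, a core at the knee draws well under half the dynamic power of a core at the top clock, even though the clock is only about 27% lower.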
     
  20. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    975
    Likes Received:
    145
    Location:
    Luxembourg
    I doubt it. I could write a driver for such a logic in like a day.
    It's fairly accurate.
    It would be interesting to see if somehow the clusters are binned for one or the other to be the "fast" cluster and the other to be the slow cluster. What I'm wondering is why bother to slash the L2 again, it's not like it'll give some amazing leakage advantage when you compare it to the cores themselves. We still don't really know if the slow cluster L2 is really 512KB other than Tom's saying it is so and we won't be able to verify it until I get a device and shut down the fast cores to be able to get a latency run on the slow cluster.
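
    For reference, a latency run of this kind is typically a pointer chase: a chain of dependent loads through a buffer, repeated at growing buffer sizes until the time per load jumps at each cache boundary. Below is a rough Python sketch of the access pattern only. The function names are made up, and in Python the interpreter overhead swamps the actual memory latency, so a real measurement would use a C or assembly loop and restrict the run to the cluster of interest.

```python
import random
import time


def build_chase(n, seed=0):
    """Build a random single-cycle permutation (Sattolo's algorithm).

    Following next_index from any slot visits every slot exactly once
    before returning, which defeats simple stride prefetchers.
    """
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees one single cycle
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps):
    """Walk the chain; each lookup depends on the result of the previous one."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = next_index[idx]
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps


# Sweep the working-set size; on real hardware with a native loop the
# per-step time would step up as the buffer outgrows each cache level.
for n in (1 << 10, 1 << 14, 1 << 18):
    chain = build_chase(n)
    _, seconds_per_step = chase(chain, n)
    print(n, f"{seconds_per_step * 1e9:.1f} ns/step")
```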
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.