NVIDIA Tegra Architecture

Discussion in 'Mobile Graphics Architectures and IP' started by french toast, Jan 17, 2012.

  1. sebbbi

    It has equal fill rate. But only 25.6 GB/s of BW (will scale up to ~34 GB/s when LPDDR4 matures). That's not even half of the BW of the console's main memory and there's no alternative for the 32 MB ESRAM. The raw FP32 rate is also under half of the console's (FP16 rate is closer to the FP32 rate of Xbox One, but not there yet). A two SMX version could be competitive against the consoles if (and only if) they solved the bandwidth issue.

    They need to roughly 4x the memory controller width and wait for 2133 MHz LPDDR4. That would get them to 136.5 GB/s. Maxwell is more BW friendly than GCN (as it has delta color compression and bigger caches) so this should be enough. I don't actually even know the cache size of the Tegra Maxwell designs (2 MB L2 sounds a little bit high for such a small GPU, but not if you compare it to Intel designs).
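
The bandwidth figures in this post can be reproduced with a quick sketch. It assumes the shipping X1 uses a 64-bit LPDDR4 interface at 1600 MHz (double data rate), which is what gives 25.6 GB/s; the function name and parameters are illustrative only.

```python
def peak_bandwidth_gbs(bus_width_bits: int, clock_mhz: float,
                       transfers_per_clock: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Assumed baseline: 64-bit bus at 1600 MHz (the 25.6 GB/s case).
print(f"{peak_bandwidth_gbs(64, 1600):.1f} GB/s")
# Same 64-bit bus once 2133 MHz LPDDR4 is available (the ~34 GB/s case).
print(f"{peak_bandwidth_gbs(64, 2133):.1f} GB/s")
# 4x the controller width (256-bit) at 2133 MHz (the 136.5 GB/s case).
print(f"{peak_bandwidth_gbs(256, 2133):.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth in real workloads would be lower.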
     
  2. Ailuros

    You are both missing two vital things: it's not only that the double fp16 rate is under conditionals, most rates including fillrate are subject to the frequency in final devices. The X1 marketing material was again using 950MHz for GK20A/K1, even though the peak frequency ended up at 850MHz in real devices.
    Yes at 1 GHz, with identical ops for fp16 & much higher bandwidth they'd be there, but that's 3 conditionals already.
     
  3. liquidboy

    + GPU cores
    + x86 cores
    + x64 cores

    :) when I see an architecture like that I immediately think of multi-kernel research OSes :) ... Maybe it's the perfect time for these research OSes to surface

     
  4. 3dilettante

    But what does hosting disparate host CPU ISAs actually give you in this instance?
    It's combining functionally equivalent cores with deeply similar architectural goals whose primary source of synergy is implementing them in dangerously incompatible ways.
    If you want backwards compatibility, you don't want your code wandering onto a CPU that it wasn't developed for.
    The apparent cross-node messaging system sounds like it negates a fair portion of the benefits of physically integrating them, on top of the massive validation space for that many architectures.
     
  5. liquidboy

    One argument is that certain apps can run in a lower power mode and instantly switch to another, more power hungry mode, e.g. PowerPoint in slideshow mode vs. edit mode, or Word in tablet mode versus keyboard-connected edit mode.

    Another argument is that it's a way to fill the "app gap": some believe that MS will soon launch a solution where Android apps can run on Windows devices (either streamed from a server or virtualized, who knows).

    Just as the saying goes, "the right tools for the job"; I guess you can extend that thinking to "the right architecture for the job".

    P.S. Consider the whole container-based apps trend, e.g. Docker tech, or MS soon launching its own Windows Containers in Windows 10. We already package apps together with the OS and the services that the app needs. The next logical step is to also include the architecture that the container-based app works best on. It's a virtualized, container-based future.
     
  6. sebbbi

    I wonder how difficult it would be to switch the 4xA53 cluster to another 4xA57 cluster. According to the marketing material, the two modules are already fully cache coherent and able to run simultaneously. It wouldn't be a big stretch to update the other cluster to beefier cores. With a bigger TDP ceiling and active cooling (a home console) this setup would be able to compete with 8 Jaguar cores. Shouldn't take that much extra die space either. Obviously this setup would be completely stupid for mobile devices, but it could be useful for laptops (Chromebooks, etc.) and consoles.
     
  7. Erinyes

    Why not 4 x Denver and 4 x A53? (But using HMP). Aside from the fact that a 3 cluster design may be overly complicated..this seems like a far better solution to both increase peak performance and reduce idle/non-peak power consumption.
    Ahh..makes a bit more sense now. If it's just one loaded core then yes..that would basically be down to the process and libraries as you said. Still quite a large difference though (Assuming we take Nvidia's claim of "40% performance increase at the same power" at face value).

    Voltages don't tell us the full story but yes from all accounts TSMC does seem to have the better process. We also have to take into account process maturity though..Samsung's SoC was in a shipping product in September/October 2014 and TX1 will not ship until Q2 so that's a gap of at least six months. At this early stage of the process..it could mean a big difference. The difference between their FinFET processes should also be interesting to see. Anyway I await your article as that should give us a good comparison between the SoCs.
    I'd argue that Qualcomm's success is also down to the quality of the integrated modems, aside from the fact that they do make quality SoCs overall. Qualcomm usually has the best modems in the industry, especially when it comes to LTE and you can see that even Apple switched to Qualcomm modems from AFAIK the iPhone 5 onwards. Another factor is that Qualcomm also offers the complete RF/analog solution and bundles it with their SoCs/modems. I've also read..though I cannot verify this..that certification and validation tends to be easier with Qualcomm. Anyway..I have no doubt that the Snapdragon 810 will be as successful as 800 was.
    Maybe they do..but they haven't announced anything apart from Octa core A53 so far..one with Mali 760 MP2 and one with PowerVR G6200 so very mediocre graphics.
    This I agree with..similar to what we saw in the PC/laptop space..we've reached an era where even low cost devices (with Cortex A7 onwards at least) can do pretty much all that a high end device can..and they do it well enough. With Cortex A53..the low end performance becomes even better.
    I doubt the CPU is even close to as good. I think 8 Jaguar cores would outpace 4 A57 + 4 A53 quite handily. And of course graphics wise..the X1 achieves 512 FP32 GFlops only at 1 GHz whereas the Xbox GPU has 1.35 TFlops. Pixel fill rate is similar but the Xbox has much higher Texture fill and Memory bandwidth, not even counting the ESRAM. Still..for an SoC designed for the mobile segment..and which would be out ~1.5 years after the Xbox..the performance is remarkable.
    I don't think it would be too difficult..but what would be the point? You barely have any use cases where more than 2 cores are used..so what's the point of having 8? And the whole selling point of big.LITTLE is lower power consumption and a second A57 cluster..even if optimised for lower clocks..wouldn't be as efficient as an A53 cluster. Regarding die space..AFAIK one A53 takes about 0.8 mm2 and one A57 a little less than 3 mm2 so switching 4 A53's for A57's would mean an increase of about 8 mm2, which is significant.
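
The die-area estimate above is simple enough to check in a few lines. The 0.8 mm2 and "a little less than 3 mm2" figures are the ones quoted in the post; the 2.8 mm2 lower bound is an assumption picked here only to bracket "a little less than 3".

```python
A53_AREA_MM2 = 0.8            # per-core figure quoted in the post
A57_AREA_UPPER_MM2 = 3.0      # "a little less than 3 mm2", used as upper bound
A57_AREA_LOWER_MM2 = 2.8      # assumed lower bound, not from the post


def extra_area_mm2(a57_area: float, a53_area: float = A53_AREA_MM2,
                   cores: int = 4) -> float:
    """Added die area when `cores` A53 cores are swapped for A57 cores."""
    return cores * (a57_area - a53_area)


low = extra_area_mm2(A57_AREA_LOWER_MM2)
high = extra_area_mm2(A57_AREA_UPPER_MM2)
print(f"extra area: {low:.1f} to {high:.1f} mm2")
```

This only counts the cores themselves; any change in L2 size or cluster interconnect is ignored.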
     
    #3387 Erinyes, Jan 8, 2015
    Last edited: Jan 8, 2015
  8. Nebuchadnezzar

    Yea I know. I actually devised a method to reverse engineer capacitance coefficient values for the chips that I hope I will be able to apply to more upcoming SoCs. That should bring a new perspective onto things.

    There's been a lot of fanfare going on about that in the last few weeks and it doesn't seem to be unsubstantiated from what I've heard either... The Exynos 7420 is really the chip to watch out for in 1H2015 in my opinion.
    http://blogs.barrons.com/asiastocks...lcomms-snapdragon-delay/?mod=google_news_blog
     
  9. sebbbi

    There's no point at all on mobile devices. However an 8-core A57 clocked high enough with a bigger (2-4 SMX) Maxwell GPU would be a good starting point for an ARM-based gaming console. It would offer comparable performance to current generation consoles (with a slightly lower power draw). Obviously this would not be enough for true next generation (gen 9) consoles.
     
  10. ToTTenTranz

    4 SMM would be 512 cores doing 1 TFLOP FP32 at 1GHz. It's basically the same as a GTX 750 desktop card.
    Since it lags against a Bonaire in the desktop world, I don't think it'd compete with the xbone, much less the PS4.
    Unless that 2x FP16 performance trick did wonders and the GPU was clocked closer to 1.5GHz.
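
For reference, the peak-FLOPS numbers thrown around in this thread all come from the same formula: cores x FLOPs per core per clock x clock. A minimal sketch, assuming 128 CUDA cores per Maxwell SMM, 192 per Kepler SMX, and one fused multiply-add (2 FLOPs) per core per clock; the function name is made up for illustration.

```python
CORES_PER_SMM = 128  # Maxwell SMM
CORES_PER_SMX = 192  # Kepler SMX


def peak_gflops(cores: int, clock_ghz: float,
                flops_per_core_per_clock: float = 2.0) -> float:
    """Theoretical peak GFLOPS; an FMA counts as 2 FLOPs per clock."""
    return cores * flops_per_core_per_clock * clock_ghz


# Hypothetical 4 SMM part at 1 GHz, FP32.
print(f"{peak_gflops(4 * CORES_PER_SMM, 1.0):.0f} GFLOPS")
# Tegra X1 (2 SMM) at 1 GHz, FP32, then FP16 at the doubled rate.
print(f"{peak_gflops(2 * CORES_PER_SMM, 1.0):.0f} GFLOPS")
print(f"{peak_gflops(2 * CORES_PER_SMM, 1.0, 4.0):.0f} GFLOPS")
# Tegra K1 (1 SMX) at ~800 MHz, FP32.
print(f"{peak_gflops(1 * CORES_PER_SMX, 0.8):.1f} GFLOPS")
```

The doubled FP16 rate only applies when two FP16 operations can actually be paired, so the 4.0 figure is a best case.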

    Then again, who would sell such a console with that custom chip and more importantly who would write games for it?
     
  11. Ailuros

    How can the 2xFP16 do any wonders?

    Let's do a theoretical exercise:

    Apple A8X GPU @ >=500MHz
    ~256 GFLOPs FP32 or ~512 GFLOPs FP16
    Manhattan offscreen 33.0 fps

    NVIDIA GK20A GPU K1@~800MHz
    ~307 GFLOPs FP32
    Manhattan offscreen 32.0 fps

    1. Manhattan is quite ALU bound.
    2. A8X GPU has other higher resources too apart from just FP16 ALUs.
     
  12. ToTTenTranz

    That's not a theoretical exercise, it's a practical one ;)

    Regardless, the Manhattan test may be ALU bound in most cases, but you don't know what kind of variables they're using in how many operations.
    The Tegra X1 @1GHz gets 65FPS (twice the performance) in Manhattan while having over 3X the theoretical FP16 performance of Tegra K1 (1TFLOPs).

    If you remember nVidia's first statements about the first Maxwell GM107, they said that 1 SMM would have around the same practical ALU performance as 90% of 1 SMX at the same clockspeeds.
    Therefore, the performance jump between TK1 and TX1 in an ALU-bound test should go like TK1 (32 FPS) x 2 SMM x 0.9 = 57.6 FPS. And then the TX1 is clocked 25% higher, so 57.6 x 1.25 = 72 FPS.
    So even if we were looking at pure FP32 performance, the TX1 should have a higher performance jump compared to TK1, if offscreen Manhattan was ALU bound.
    Conclusion: for this newest batch of high-end mobile GPUs, it doesn't seem that Manhattan is ALU bound anymore, and the added FP16 performance isn't doing much for GPUs doing over 256GFLOPs.
    Perhaps the test is now limited by memory bandwidth and fillrate. The TX1 did in fact double the memory bandwidth compared to TK1.
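
The scaling estimate in this post can be written out as a small sketch. The inputs are the figures quoted in the thread (32 FPS for TK1, 2 SMM, the 0.9 SMM-vs-SMX factor, a 25% clock increase, 65 FPS measured); the function name is hypothetical.

```python
def alu_bound_estimate(base_fps: float, unit_ratio: float,
                       per_unit_efficiency: float, clock_ratio: float) -> float:
    """Expected FPS if the benchmark scaled purely with ALU throughput."""
    return base_fps * unit_ratio * per_unit_efficiency * clock_ratio


TK1_FPS = 32.0          # Manhattan offscreen, Tegra K1 (1 SMX)
TX1_MEASURED_FPS = 65.0  # figure quoted for Tegra X1 at 1 GHz

expected = alu_bound_estimate(TK1_FPS, unit_ratio=2,
                              per_unit_efficiency=0.9, clock_ratio=1.25)
shortfall = 1 - TX1_MEASURED_FPS / expected

print(f"expected if ALU bound: {expected:.1f} FPS")
print(f"measured: {TX1_MEASURED_FPS:.1f} FPS ({shortfall:.1%} below estimate)")
```

A gap of roughly 10% is small enough that driver maturity could plausibly account for it, so this arithmetic alone does not settle whether the test is ALU bound.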
     
  13. sebbbi

    Assuming you had a 50/50 mix of FP16/FP32 code, a 4 SMM part would be quite comparable in raw flops (equivalent to 1.5 TFLOP at 1 GHz). A 50/50 split would be a realistic assumption if we are talking about console games tailor made for the hardware (obviously not for generic PC software). Fill rate is also already comparable in the 2 SMM model.
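
How a "50/50 mix" translates into an effective rate depends on what is being split. A rough sketch of two readings, assuming a nominal 1 TFLOP FP32 peak for the 4 SMM part and FP16 running at exactly twice that; both function names are illustrative.

```python
def time_weighted_tflops(fp32_tflops: float, fp16_time_fraction: float) -> float:
    """Effective rate when the given fraction of execution TIME is FP16."""
    fp16_tflops = 2.0 * fp32_tflops
    return (1 - fp16_time_fraction) * fp32_tflops + fp16_time_fraction * fp16_tflops


def op_weighted_tflops(fp32_tflops: float, fp16_op_fraction: float) -> float:
    """Effective rate when the given fraction of OPERATIONS is FP16."""
    fp16_tflops = 2.0 * fp32_tflops
    time_per_op = (1 - fp16_op_fraction) / fp32_tflops + fp16_op_fraction / fp16_tflops
    return 1.0 / time_per_op


print(f"time split 50/50: {time_weighted_tflops(1.0, 0.5):.2f} TFLOP")
print(f"op split 50/50:   {op_weighted_tflops(1.0, 0.5):.2f} TFLOP")
```

The 1.5 TFLOP figure matches the time-weighted reading; if half of the operations (rather than half of the time) are FP16, the effective rate comes out closer to 1.33 TFLOP.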

    But the biggest problem would be the bandwidth. 25.6 GB/s is less than both previous generation consoles had (PS3 had GDDR3 + XDR and Xbox 360 had EDRAM + GDDR3). Obviously new improved tech such as depth compression, delta color compression, early depth rejection, improved hiZ, big L2 cache, improved cache logic, and loading compressed data to caches (instead of uncompressing it, wasting 4x cache space) all bring relatively big effective BW gains for Maxwell. I would expect the current (2 SMM) version to actually slightly beat both last gen consoles in BW heavy scenarios. In all the other cases the X1 would be quite a bit faster than the last generation consoles, but the limited BW would likely make it perform generally much closer to last gen than current gen. And I am mainly talking about tablets and laptops (Chromebooks, etc.) here. Phones wouldn't have enough TDP to sustain maximum GPU and CPU clocks for a long time.

    But this is still good news. Last gen console performance is finally available in your pocket. Too bad we don't have The Last of Us and GTA 5 available on Android :(
     
  14. Ailuros

    Nope look above; it's 1 TFLOP FP16 for the X1 GPU vs. 512 GFLOPs FP16 for the A8X GPU; twice the rate and almost twice the performance.

    FP16 isn't doing much in terms of performance on Rogues either. It's a power saving initiative mostly.

    * This was an early presentation for the paperlaunch.
    * As time goes by they have time to further fine tune drivers and squeeze out even more performance; they're now at 65, so why on God's green earth should 72 or even more fps with better optimized drivers be a problem? It's just a 10% difference.
    * With the desired performance level reached they don't need to clock at 1GHz if it should pose problems with power consumption or else history repeats itself as with GK20A.
     
    #3394 Ailuros, Jan 8, 2015
    Last edited: Jan 8, 2015
  15. xpea

    That was fast: Audi to use Tegra X1

    Two days after NVIDIA Tegra X1’s official launch, Audi confirmed today that it will use the new mobile superchip in developing its future automotive self-piloting capabilities.
    source: http://blogs.nvidia.com/blog/2015/01/06/audi-tegra-x1/
     
  16. Brodda Thep

    I am not sure how much everyone knows about what is happening in the AI world, but I strongly feel that the predominant reason for FP16 is to run convolutional neural networks very very fast. Once you train a convNet you don't need FP32. In fact you could probably train a convNet with FP16.

    ConvNets will be everywhere (and in fact already are); you will see them being used more and more, and FP16 is a huge leg up in performance.
     
  17. Erinyes

    Great! Looking forward to seeing that.
    You are right..digging deeper it seems like I will have to retract my statement that the S810 will do as well. I've also heard from a friend at Samsung that even the S810 dev platform is overheating. And apparently the Galaxy S6 will be Exynos only..no Snapdragon version at all.

    PS: After reading that link of yours..I had a question. The analysts say that TSMC will have a revenue shortfall in Q1 due to lower 20nm utilization as a result of Qualcomm's issues. But presumably these fab contracts are locked up months if not years in advance. So if Qualcomm does not utilize the booked capacity, wouldn't they be liable to pay a sizeable penalty?
    True..it would have been a decent choice for a console..and the lower cost would have certainly been welcome. Maybe if Microsoft and Sony had waited a year or two we would have seen an ARM based console. But that ship has sailed and for someone else to make one now and especially for devs to develop for it would be a challenge.
    I think Audi had confirmed it at the launch of X1 itself. Also..Audi has been collaborating with NV since the Tegra 2 days so it's hardly a surprise.
     
    #3397 Erinyes, Jan 11, 2015
    Last edited: Jan 11, 2015
  18. mboeller

  19. Nebuchadnezzar

  20. Alexko

    What limitations?
     