Trinity vs Ivy Bridge

Discussion in 'Architecture and Products' started by rpg.314, Jun 29, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    There is already a thread going on, but since SB is already out and IB will be the one competing with Trinity in the market, I thought it made more sense to compare those two.

    Known stuff

    IB has 33% more EUs.
    Trinity has BD cores.
    IB will have FinFETs.
    Trinity will have NI cores.
    IB will be DX11, ~3 years late.

    Expected stuff
    Trinity will have 10 VLIW4 SIMDs.
    Trinity will have better, more integrated turbo.

    Wanted stuff
    Trinity should have better cache-level integration.
    IB should integrate the GPU deeply into its coherency protocol.
    Trinity should have Quick Sync-like hardware.

    DK's speculations are here.
     
  2. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Can't see why it would be more than 8.
    Seems like a safe bet indeed.
    If it's going to get L3 cache, totally agreed. But I wouldn't be surprised if it skipped L3 again either, making this impossible.
     
  3. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    6,115
    How much would the NI architecture benefit from using L3 cache anyway?
    It's not like the GPU was designed to take advantage of it, afaik.

    IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, whether through special sideport channels for the GPU, more memory channels, or faster memory. It's Llano's main bottleneck for the moment.
     
  4. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    IIRC, they said Trinity would be a >50% bump.
    My guess is that there will be ~4MB of L3 cache. But even if they skip L3, I hope they improve the coherency protocol between the CPU and the GPU.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Llano's bigger problem is its memory architecture, not bandwidth per se.

    As per TR's benches, Llano with dual memory channels performs just like a single-channel i5.
     
  6. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Those are CPU benches! The GPU could still use up all of the bandwidth. They could try some SiSoftware Sandra benches; it has a video memory bandwidth test.
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Using L3 is one way to reduce memory bandwidth requirements. Sure, it might need some changes, but you could use it to store hierarchical Z, for instance.
    I think it's a much cheaper way to increase "bandwidth" than your other suggestions (well, if you factor in that it's useful for the CPU too).

    50%? 8 SIMDs with a slightly higher clock is enough, if that figure even referred to peak FLOPS (for all we know they could have been talking about texture filtering rate...).

    If the GPU can't use any L3 cache, I don't think there would be much to gain there; it looks "good enough" to me.
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Come on, as we all know, flops are everything :)

    A quad-core BD will have 4MB of L2 as it is. I'm expecting them to use 1MB of L2 for the Fusion part and 4MB of L3.
     
  9. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    6,115

    Getting a big chunk of cache inside the APU (which usually takes a sizeable amount of die area) is "much cheaper" than creating, e.g., a 64-bit sideport GDDR5 channel for the IGP?!
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Yes, considering all the problems sideport has. First, you had better find a way to switch that GDDR5 sideport off completely, otherwise the power draw is probably not acceptable for the mobile parts. Also, there are already two internal memory buses to worry about; do you really want a third (which adds a significant amount of I/O too)? You'd also need to find some way to partition the memory, and the plan is probably to unify the address spaces, not segregate them further. It also obviously adds cost for the memory chips (for a 64-bit sideport you need two), which might already be as high as the cost of the L3 cache (which really isn't all that big: 25mm² or so for 4MB in Zambezi, and Intel fits 8MB in ~40mm²), and it needs PCB real estate.
    Granted, it's a bit of a theoretical view without knowing how much performance you could get from using L3 cache.
    Faster memory, OTOH, is a good option; it's just not really available (apart from some minimal incremental increase). More memory channels aren't viable either, I think.
    Now if the L3 would only help the GPU, it would probably be too expensive, but considering it helps the CPU too, it looks quite cheap to me.
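    For reference, the die-area figures quoted above put the two L3 implementations in a roughly similar area-per-MB ballpark (a rough sketch only, using those approximate numbers):

    Code:
    # Approximate L3 area cost per MB, from the figures quoted above (rough numbers).
    zambezi_mm2_per_mb = 25 / 4    # ~6.25 mm^2/MB (4MB in ~25mm^2, 32nm SOI)
    intel_mm2_per_mb   = 40 / 8    # ~5.0  mm^2/MB (8MB in ~40mm^2, 32nm)
    print(f"Zambezi: ~{zambezi_mm2_per_mb:.1f} mm^2/MB, Intel: ~{intel_mm2_per_mb:.1f} mm^2/MB")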
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    Doesn't the GPU already have a specialized buffer/cache integrated for that?
    But a large L3 could serve as a kind of enlarged ROP cache, holding far more framebuffer tiles than the color/Z caches within the ROPs themselves, a bit like the eDRAM in some consoles. Or it could serve as a third-level texture cache (Sandy Bridge bypasses the L3 for texture reads, if DK's article is correct; obviously Intel decided it's not worth it).
    I would also say 7 to 8 SIMDs are enough. AMD/GF appear to still be on the learning curve for the 32nm process and the GPU implementation targeting it. I would think that a high-performance 32nm SOI process with HKMG should enable at least the same frequencies as TSMC's 40nm bulk process at lower power (needed for integration in an APU). But you can get a 40nm HD5650M with the same 400 VLIW5 units as Llano running at 650MHz (thus faster than Llano), which consumes only 19W including 1GB of DDR3 @800MHz. And Llano on the desktop with a 100W TDP can't clock it higher than 600MHz? I would expect that clock on mobile parts :roll:
     
  12. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    6,115
    Already done. 780G and later motherboards with sideport memory give you the option in the BIOS to switch between sideport only, UMA only, and sideport+UMA.
    I don't think it would be much harder to implement a driver-enabled "high performance mode" with the sideport enabled, and just use the UMA the rest of the time.


    Is it that much more?
    Bloomfield (3-channel) has 200 more "pins" than Lynnfield (2-channel), and Lynnfield actually has 40M more transistors because of the integrated PCI-Express and DMA.


    I don't really understand what you mean by that. AMD has been shipping motherboards using sideport+UMA combinations for several years to increase the IGP's performance. What's so different here?


    That's the thing. How much performance would the GPU gain from using the L3 cache, if any? Isn't there a good reason why there haven't been any mid-to-high-end GPUs using eDRAM, for example?

    Increased memory bandwidth has been shown to drastically change Llano's results (25% more gaming performance with 33% higher bandwidth).
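    Rough arithmetic on that scaling, just to make the bandwidth sensitivity explicit (a sketch only, using the figures quoted above):

    Code:
    # How much of the extra memory bandwidth Llano turns into gaming performance
    # (figures as quoted above; illustrative only).
    bw_gain   = 0.33   # +33% memory bandwidth
    perf_gain = 0.25   # +25% gaming performance
    print(f"~{perf_gain / bw_gain:.0%} of the extra bandwidth shows up as performance")
    # ~76% -> the IGP is still largely bandwidth-bound at stock DDR3 speeds.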




    Of course, UMA is the future. Given Llano's results, I think a high-performance sideport could be a good temporary option until DDR4 is ready for market.
     
  13. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    587
    Yeah, I've mentioned this before on the Llano thread as well; the clocks were quite disappointing for what was supposed to be a leading-edge process. I was expecting HKMG to bring significant gains. But even in the case of 45nm, it took them a while to sort the process out. AFAIK the Phenom II X4 launched at 3.2 or 3.4 GHz back in Nov 2008 with a 125W TDP. By the time they launched the hex-core chips (AFAIK March 2010), they were offering six cores at the same clocks while maintaining the same TDP.

    In the case of the 780G, the sideport was 32-bit, and the reason for it was more to do with power than with performance. Using the sideport meant that the IGP (which was on the northbridge) did not have to make a trip to the CPU (where the memory controller was) and back whenever it needed to access some video memory (or something to that effect; maybe I haven't got it totally right).

    And essentially you're proposing that all motherboards should come with GDDR5 built in (say 512MB, if you're proposing a 64-bit channel for the GPU). That's not cheap, and I would imagine it isn't power efficient either.
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    You need seamless switching: off at idle or light load, on otherwise. (Also, I'm not sure those boards actually powered the memory down; given it was DDR2/3 it probably didn't draw much power, and on the desktop no one cared anyway, while it saved power for notebooks since otherwise you had constant HT/MC fetches even just for display scanout.)

    It might not be that much more, but it's still a budget CPU, after all. There is significantly more room for such things on the high end.

    That was rather primitive, and it didn't really help performance all that much (because sideport had very low bandwidth). But if both main memory and the sideport had similar bandwidth (as they would with 64-bit GDDR5), I'm not sure that scheme would be sufficient. You could think about putting framebuffers in GDDR5 sideport and textures in main memory, or something, but the split might also be dictated by which parts of memory you still want the CPU to be able to access (with reasonable performance). Not saying it's impossible, just that it probably gets a bit complex.


    No doubt. I think if you're only looking at discrete GPUs, it's probably just not worth it, because increasing overall bandwidth doesn't really add much complexity - it's still one interface, just faster (of course this still increases I/O and such). I just think the balance shifts quite a bit when you have an APU.
    I don't know how much performance you can really gain with L3, but I find the Sandy Bridge results with one memory channel (also in that techreport article) quite amazing on that front: it only loses about 20% of the performance for half the memory bandwidth. Sure, part of that is because the GPU isn't all that fast compared to Llano (hence it needs less memory bandwidth), but I still think part of it is the GPU's use of the L3 cache. I don't have any proof of that though (some comparisons with Arrandale could be interesting maybe; unfortunately you can't switch off the L3 cache, AFAIK...).
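    To make that comparison explicit, here's a rough sketch of the bandwidth sensitivity implied by the figures quoted above and earlier in the thread (illustrative numbers only):

    Code:
    # Rough bandwidth sensitivity: relative performance change per relative
    # bandwidth change (figures as quoted in this thread; illustrative only).
    snb_sensitivity   = 0.20 / 0.50   # -20% performance for -50% bandwidth -> 0.40
    llano_sensitivity = 0.25 / 0.33   # +25% performance for +33% bandwidth -> ~0.76
    print(f"SNB GPU: {snb_sensitivity:.2f}, Llano IGP: {llano_sensitivity:.2f}")
    # SNB's GPU reacts far less to bandwidth changes - consistent with a slower GPU
    # plus the shared L3 absorbing part of the traffic.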

    That would be quite a long-standing temporary solution, since DDR4 isn't predicted before 2014 (and really 2015 for volume) according to the latest reports. I don't think it would help all that much anyway, since by then the GPUs will surely be a lot faster too (assuming DDR4 is twice as fast, GPUs will certainly have gotten faster by more than that by 2015).
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    eDRAM in sufficient quantities will be too expensive, and low-end GPUs won't be able to afford it, so they would need a different architecture.
     
  16. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Actually, I'm quite sure it was only 16-bit, supporting DDR2/3, on all the 7xx chipsets.
    Not sure about the RS690; it might have been 16- or 32-bit (but it didn't support DDR3, for sure).
     
  17. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    6,115
    Desktop versions actually have decently-clocked DDR3 chips.

    I've also heard that in some cases it's only a 16-bit bus, but I'm pretty sure the 780G in my Ferrari One is using a 32-bit sideport with 384MB. Access to the UMA is blocked through the BIOS, though :(
     
    #17 ToTTenTranz, Jun 29, 2011
    Last edited by a moderator: Jun 29, 2011
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    6,529
    How are 8 VLIW4 SIMDs a 50% increase over 5 VLIW5 SIMDs, even if you bump the clocks slightly?
     
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    You "only" need a 17% clock increase to achieve that 50% increase for 8 VLIW4 SIMDs over 5 VLIW5 ones. I don't know if that's realistic or not (though compared to discrete parts the clocks certainly wouldn't be extraordinarily high, and overclocking attempts also suggest it's doable). But in any case it would be a very substantial increase in graphics power (more than those 50%!).
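    For anyone who wants to check that arithmetic, here's a quick back-of-the-envelope sketch (assuming the usual 16-wide SIMD organisation, i.e. 80 ALUs per VLIW5 SIMD and 64 per VLIW4 SIMD):

    Code:
    # Clock increase needed for 8 VLIW4 SIMDs to deliver 1.5x the peak ALU
    # throughput of 5 VLIW5 SIMDs (assumes 16-wide SIMDs; illustrative only).
    llano_alus   = 5 * 16 * 5               # 400 ALUs (5 VLIW5 SIMDs)
    trinity_alus = 8 * 16 * 4               # 512 ALUs (8 VLIW4 SIMDs)
    alu_ratio   = trinity_alus / llano_alus # 1.28x from the extra units alone
    clock_ratio = 1.5 / alu_ratio           # ~1.17 -> ~17% higher clock needed
    print(f"ALU ratio {alu_ratio:.2f}x, clock bump needed ~{clock_ratio - 1:.0%}")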
     
  20. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    772
    Location:
    'Zona
    What happened to Trinity having a VLIW4 GPU based on the 6850?
     
