Trinity vs Ivy Bridge

rpg.314 · Jun 29, 2011

There is a thread going on, but since SB is already out, and IB will be competing with Trinity in the market, I thought it makes more sense to compare those.

Known stuff

IB has 33% more EUs than SB.
Trinity has BD cores.
IB will have FinFETs.
Trinity will have NI cores.
IB will be DX11, ~3 years late.

Expected stuff
Trinity will have 10 VLIW4 SIMDs.
Trinity will have better, more integrated turbo.

Wanted stuff
Trinity should have better cache level integration.
IB should integrate the GPU deeply into its coherency protocol.
Trinity should have quick sync hw.

DK's speculations are here.

mczak · Jun 29, 2011

rpg.314 said:
Trinity will have 10 VLIW4 SIMDs.

Can't see why it would be more than 8.

Trinity will have better, more integrated turbo.

Seems like a safe bet indeed.

Trinity should have better cache level integration.

If it's going to get L3 cache, totally agreed. But I wouldn't be surprised if it skipped L3 again either, making this impossible.

Deleted member 13524 · Jun 29, 2011

How much would the NI architecture benefit from using L3 cache anyways?
It's not like the GPU was designed to take advantage of it, afaik.

IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck for the moment.

rpg.314 · Jun 29, 2011

mczak said:
Can't see why it would be more than 8.

IIRC, they said trinity would be >50% bump.

If it's going to get L3 cache, totally agreed. But I wouldn't be surprised if it skipped L3 again either, making this impossible.

My guess is that there would be ~4M L3 cache. But even if they skip L3, I hope they improve upon the coherency protocol between cpu and gpu.

rpg.314 · Jun 29, 2011

IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck for the moment.

Llano's bigger problem is its memory architecture, not bw per se.

As per TR's benches, Llano with dual mem channels performs just like a single channel i5.

GZ007 · Jun 29, 2011

rpg.314 said:
Llano's bigger problem is its memory architecture, not bw per se.

As per TR's benches, Llano with dual mem channels performs just like a single channel i5.

Those are CPU benches!

The GPU could still use up all of the bandwidth. They could try some SiSoftware Sandra benches; it has a video memory bandwidth test.

mczak · Jun 29, 2011

ToTTenTranz said:
How much would the NI architecture benefit from using L3 cache anyways?
It's not like the GPU was designed to take advantage of it, afaik.

IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck for the moment.

Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.
I think it's a much cheaper way to increase "bandwidth" than your other suggestions (well if you factor in that it's useful for the cpu too).

rpg.314 said:
IIRC, they said trinity would be >50% bump.

50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).

My guess is that there would be ~4M L3 cache. But even if they skip L3, I hope they improve upon the coherency protocol between cpu and gpu.

If the gpu can't use any L3 cache, I don't think there would be much to gain there, looks "good enough" to me.

rpg.314 · Jun 29, 2011

mczak said:
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).

Come on, as we all know, flops are everything

If the gpu can't use any L3 cache, I don't think there would be much to gain there, looks "good enough" to me.

For a quad core BD, it will have 4 MB L2 as it is. I am expecting them to use 1MB L2 for fusion and 4MB L3.

Deleted member 13524 · Jun 29, 2011

mczak said:
Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.
I think it's a much cheaper way to increase "bandwidth" than your other suggestions (well if you factor in that it's useful for the cpu too).

Getting a big chunk of cache inside the APU (that usually takes a sizeable amount of die area) is "much cheaper" than creating e.g. a 64-bit sideport GDDR5 channel for the IGP?!

mczak · Jun 29, 2011

ToTTenTranz said:
Getting a big chunk of cache inside the APU (that usually takes a sizeable amount of die area) is "much cheaper" than creating e.g. a 64-bit sideport GDDR5 channel for the IGP?!

Yes, considering all the problems sideport has. First, you'd better find a way to switch that GDDR5 sideport off completely, otherwise power draw is probably not acceptable for the mobile parts. Also, there are already 2 internal memory buses to worry about; do you really want a third (which adds a significant amount of I/O too)? You'd also need to find some way to partition the memory, and the plan is probably to unify the address spaces, not further segregate them. It also obviously adds cost for the memory chips (for a 64-bit sideport you need 2), which might already be as high as the cost of the L3 cache (which really isn't all that big: 25mm² or so for 4MB in Zambezi, and Intel fits 8MB in ~40mm²), and it needs PCB real estate.
Granted, it's a bit of a theoretical view without knowing how much performance you could get from using L3 cache.
Faster memory, OTOH, is a good option; it's just not really available (apart from some minimal incremental increase). More memory channels aren't viable either, I think.
Now if the L3 would only help the GPU it would probably be too expensive, but considering it helps the CPU too, it looks quite cheap to me.
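
As a rough sanity check on the die-area figures quoted above, here is a small Python sketch. It uses only the ballpark numbers from the post (about 25mm² per 4MB in Zambezi, about 40mm² for 8MB on the Intel side). The helper name `mm2_per_mb` is made up for illustration, and the calculation ignores anything beyond the quoted totals, so treat it as back-of-the-envelope arithmetic rather than real die data.

```python
# Rough L3 area density from the figures quoted in the post.
# These are ballpark numbers from the thread, not measured die data.

def mm2_per_mb(area_mm2: float, size_mb: float) -> float:
    """Die area spent per MB of cache."""
    return area_mm2 / size_mb

zambezi_density = mm2_per_mb(25.0, 4.0)  # AMD figure quoted in the post
intel_density = mm2_per_mb(40.0, 8.0)    # Intel figure quoted in the post

# Area a hypothetical 4MB L3 would take at each density.
l3_mb = 4.0
print(f"Zambezi-style: {zambezi_density:.2f} mm2/MB, "
      f"{l3_mb * zambezi_density:.0f} mm2 for 4MB")
print(f"Intel-style:   {intel_density:.2f} mm2/MB, "
      f"{l3_mb * intel_density:.0f} mm2 for 4MB")
```

Either way a 4MB L3 lands in the 20 to 25mm² range under these figures, which is the number the cost comparison against two GDDR5 chips plus PCB area hinges on.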

Gipsel · Jun 29, 2011

mczak said:
Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.

Doesn't the GPU already have a specialized buffer/cache integrated for that?
But a large L3 could serve as some kind of enlarged ROP cache holding far more framebuffer tiles than the color/Z caches within the ROPs themselves, a bit like the eDRAM in some consoles. Or it could serve as a 3rd level texture cache (Sandy Bridge bypasses the L3 for texture reads, if DK's article is correct; obviously Intel decided it's not worth it).

mczak said:
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).

I would also say 7 to 8 SIMDs are enough. AMD/GF still appear to be on the learning curve for the 32nm process and the GPU implementation targeting it. I would think that a high-performance 32nm SOI process with HKMG should enable at least the same frequencies as TSMC's 40nm bulk process at lower power (needed for the integration in an APU). But you can get a 40nm HD5650M with the same 400 VLIW5 ALUs as Llano but running at 650MHz (thus faster than on Llano), which consumes only 19W including 1GB DDR3 @800MHz. And Llano on desktops with 100W TDP can't get it faster than 600 MHz? I would expect that clock on mobile parts.
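
To put the clock comparison above into numbers, here is a minimal Python sketch of peak shader throughput. The ALU counts and clocks are the ones from the post; the factor of 2 flops per ALU per clock (one multiply-add) is an assumption added here, and `peak_gflops` is a hypothetical helper name.

```python
# Peak single-precision throughput from ALU count and clock.
# Assumption (not from the thread): each ALU does one multiply-add
# per clock, counted as 2 flops.

def peak_gflops(alus: int, clock_mhz: float, flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS; real workloads will achieve less."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1000.0

hd5650m = peak_gflops(400, 650)        # 40nm mobile part from the post
llano_desktop = peak_gflops(400, 600)  # 32nm desktop Llano from the post

print(f"HD5650M:       {hd5650m:.0f} GFLOPS")
print(f"Llano desktop: {llano_desktop:.0f} GFLOPS")
print(f"Ratio:         {hd5650m / llano_desktop:.3f}")
```

Under that assumption the 19W mobile part has roughly 8% more peak throughput than the 100W desktop APU's GPU, which is the gap being complained about.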

Deleted member 13524 · Jun 29, 2011

mczak said:
Yes, considering all the problems sideport has. First, you'd better find a way to switch that GDDR5 sideport off completely, otherwise power draw is probably not acceptable for the mobile parts.

Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA.

mczak said:
Also, there are already 2 internal memory buses to worry about; do you really want a third (which adds a significant amount of I/O too)?

Is it that much more?
Bloomfield (3-channel) has 200 more "pins" than Lynnfield (2-channel), and Lynnfield actually has 40M transistors more because of integrated PCI-Express and DMA.

mczak said:
you'd also need to find some way to partition the memory

I don't really understand what you mean by that. AMD has had motherboards with Sideport+UMA combinations for several years, increasing the IGP's performance. What's so different here?

mczak said:
It also obviously adds cost for the memory chips (for a 64-bit sideport you need 2), which might already be as high as the cost of the L3 cache (which really isn't all that big: 25mm² or so for 4MB in Zambezi, and Intel fits 8MB in ~40mm²), and it needs PCB real estate.
Granted, it's a bit of a theoretical view without knowing how much performance you could get from using L3 cache.
(...)

Now if the L3 would only help the GPU it would probably be too expensive, but considering it helps the CPU too, it looks quite cheap to me.

That's the thing. How much performance would the GPU get for using L3 cache, if at all? Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?

Increased memory bandwidth has been shown to drastically change Llano's results (25% more gaming performance with 33% higher bandwidth).

mczak said:
and the plan is probably to unify the address spaces, not further segregate them.
(...)
Faster memory, OTOH, is a good option; it's just not really available (apart from some minimal incremental increase).

Of course, UMA is the future. Given Llano's results, I think a high-performance Sideport could be a good temporary option until DDR4 is ready for market.

Erinyes · Jun 29, 2011

Gipsel said:
I would also say 7 to 8 SIMDs are enough. AMD/GF still appear to be on the learning curve for the 32nm process and the GPU implementation targeting it. I would think that a high-performance 32nm SOI process with HKMG should enable at least the same frequencies as TSMC's 40nm bulk process at lower power (needed for the integration in an APU). But you can get a 40nm HD5650M with the same 400 VLIW5 ALUs as Llano but running at 650MHz (thus faster than on Llano), which consumes only 19W including 1GB DDR3 @800MHz. And Llano on desktops with 100W TDP can't get it faster than 600 MHz? I would expect that clock on mobile parts.

Yeah, I've mentioned this before in the Llano thread as well: the clocks were quite disappointing for what was supposed to be a leading-edge process. I was expecting HKMG to bring significant gains. But even in the case of 45nm, it took them a while to sort the process out. Afaik the Phenom II X4 launched at 3.2 GHz or 3.4 GHz back in Nov 2008 and the TDP was 125W. By the time they launched the hex-core chips (afaik March 2010), they were offering six cores at the same clocks while maintaining the same TDP.

ToTTenTranz said:
Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA

In the case of the 780G, the sideport was 32-bit. And the reason had more to do with power than with performance. The use of the sideport meant that the IGP (which was on the northbridge) did not have to make a trip to the CPU (where the memory controller was) and back when it needed to access some video memory (or something to that effect, maybe I haven't got it totally right).

And essentially you're proposing that all motherboards should come with GDDR5 built in (say 512 MB if you're proposing a 64-bit channel for the GPU). That's not cheap, and I would imagine it isn't power efficient either.

mczak · Jun 29, 2011

ToTTenTranz said:
Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA.

You need seamless switching: off at idle or light load, on otherwise. (Also, I'm not sure those boards actually powered the memory down. Given it was DDR2/3, it probably didn't draw much power and on the desktop no one cared anyway, while it saved power for the notebooks, as otherwise you had constant HT/MC fetches even for display scanout.)

Is it that much more?
Bloomfield (3-channel) has 200 more "pins" than Lynnfield (2-channel), and Lynnfield actually has 40M transistors more because of integrated PCI-Express and DMA.

It might not be that much more but it's still a budget cpu, after all. There is significantly more room for such things on the high end.

I don't really understand what you mean by that. AMD has had motherboards with Sideport+UMA combinations for several years, increasing the IGP's performance. What's so different here?

That was rather primitive and it didn't really help performance all that much (because sideport was very low bandwidth). But if both main memory and sideport have similar memory bandwidth (as they would with 64-bit GDDR5), I'm not sure that scheme would be sufficient. You could think about framebuffers in GDDR5 sideport, textures in main memory or something, but the needs might also be dictated by which parts of the memory you still want to be able to access with the CPU (with reasonable performance). Not saying it's impossible, just that it probably gets a bit complex.
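
For a feel of the "similar memory bandwidth" point, here is a short Python sketch of peak bandwidth from bus width and data rate. The bus widths come from the thread; the data rates (DDR3-1600 for main memory, 4 Gbps per pin for GDDR5, DDR3-1333 for an old 16-bit sideport) are assumptions picked for illustration, and `peak_bandwidth_gbs` is a hypothetical helper.

```python
# Peak memory bandwidth = bytes per transfer * transfers per second.
# Data rates below are illustrative assumptions, not figures from the thread.

def peak_bandwidth_gbs(bus_width_bits: int, transfers_mt_s: float) -> float:
    """Theoretical peak in GB/s for a bus of the given width and data rate."""
    return bus_width_bits / 8 * transfers_mt_s / 1000.0

main_memory = peak_bandwidth_gbs(128, 1600)    # dual-channel DDR3-1600 (assumed)
gddr5_sideport = peak_bandwidth_gbs(64, 4000)  # 64-bit GDDR5 at 4 Gbps/pin (assumed)
old_sideport = peak_bandwidth_gbs(16, 1333)    # 16-bit DDR3-1333 sideport (assumed)

print(f"Main memory:    {main_memory:.1f} GB/s")
print(f"GDDR5 sideport: {gddr5_sideport:.1f} GB/s")
print(f"Old sideport:   {old_sideport:.1f} GB/s")
```

With these assumed rates the GDDR5 sideport is in the same class as main memory, while the old 16-bit sideport is roughly a tenth of it, which is why the simple chipset-era scheme would not carry over directly.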

That's the thing. How much performance would the GPU get for using L3 cache, if at all? Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?

Increased memory bandwidth has been shown to drastically change Llano's results (25% more gaming performance with 33% higher bandwidth).

No doubt. I think if you're only looking at discrete GPUs, it's probably just not worth it because increasing overall bandwidth doesn't really add much complexity: it's still one interface, just faster (of course this still increases I/O and stuff). I just think the balance shifts quite a bit when you have an APU.
I don't know how much performance you can really gain with L3, but I find the Sandy Bridge results with 1 memory channel (also in that Tech Report article) quite amazing on that front: it only loses about 20% of the performance for half the memory bandwidth. Sure, part of that is because the GPU isn't all that fast compared to Llano (hence it needs less memory bandwidth), but I still think part of that is the usage of L3 cache for the GPU. I don't have any proof for that though (some comparisons with Arrandale could be interesting maybe; unfortunately you can't switch off the L3 cache AFAIK...).
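
One way to compare the two data points from this exchange (Llano: +25% performance for +33% bandwidth; Sandy Bridge: about -20% performance for half the bandwidth) is a log-log sensitivity, sketched below in Python. The metric itself is not from the thread; it is just one illustrative way to put both results on the same scale, where 1.0 would mean performance tracks bandwidth exactly and 0.0 would mean bandwidth doesn't matter.

```python
import math

# Bandwidth sensitivity as log(perf ratio) / log(bandwidth ratio).
# This metric is an illustrative choice, not something used in the thread.

def bandwidth_elasticity(bw_ratio: float, perf_ratio: float) -> float:
    """1.0 = performance scales linearly with bandwidth, 0.0 = no effect."""
    return math.log(perf_ratio) / math.log(bw_ratio)

llano = bandwidth_elasticity(1.33, 1.25)        # +33% bandwidth, +25% performance
sandy_bridge = bandwidth_elasticity(0.5, 0.80)  # half bandwidth, about -20% performance

print(f"Llano:        {llano:.2f}")
print(f"Sandy Bridge: {sandy_bridge:.2f}")
```

By this rough measure Llano's GPU is far more bandwidth-bound than Sandy Bridge's, though it cannot separate the effect of the L3 from the effect of the slower GPU.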

Of course, UMA is the future. Given Llano's results, I think a high-performance Sideport could be a good temporary option until DDR4 is ready for market.

That would be quite a long-standing temporary solution, since DDR4 isn't predicted before 2014 (and really 2015 for volume) according to the latest report. I don't think it would help all that much anyway, since by then the GPUs will surely be a lot faster too (assuming DDR4 is twice as fast, GPUs will certainly be faster by more than that in 2015).

rpg.314 · Jun 29, 2011

Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?

eDRAM in sufficient quantities would be too expensive, and low-end GPUs wouldn't be able to afford it, needing a different architecture.

mczak · Jun 29, 2011

Erinyes said:
In the case of the 780G, the sideport was 32 bit

Actually I'm quite sure it was only 16-bit, supporting DDR2/3 on all the 7xx chipsets.
Not sure about RS690; it might have been 16 or 32-bit (but it didn't support DDR3 for sure).

Deleted member 13524 · Jun 29, 2011

mczak said:
Actually I'm quite sure it was only 16-bit, supporting DDR2/3 on all the 7xx chipsets.
Not sure about RS690; it might have been 16 or 32-bit (but it didn't support DDR3 for sure).

Desktop versions actually have decently-clocked DDR3 chips.

I've also heard that in some cases it's only a 16-bit bus, but I'm pretty sure the 780G in my Ferrari One is using a 32-bit Sideport with 384MB. The access to UMA is blocked through the BIOS, though.

Kaotik · Jul 2, 2011

mczak said:
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).

How is 8 VLIW4 SIMDs 50% increase over 5 VLIW5 SIMDs, even if you bump the clocks slightly?

mczak · Jul 2, 2011

Kaotik said:
How is 8 VLIW4 SIMDs 50% increase over 5 VLIW5 SIMDs, even if you bump the clocks slightly?

You "only" need a 17% clock increase to achieve that 50% increase for 8 VLIW4 SIMDs over 5 VLIW5 ones. I don't know if that's realistic or not (though compared to discrete parts the clocks certainly wouldn't be extraordinarily high, and overclocking attempts also suggest it's doable). But in any case it would be a very substantial increase in graphics power (more than those 50%!).
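
The 17% figure can be checked with a few lines of Python. This is a sketch that assumes 16 VLIW units per SIMD (consistent with Llano's 400 ALUs across 5 VLIW5 SIMDs) and that peak throughput scales with ALU count times clock; the function names are made up for illustration.

```python
# How much clock increase 8 VLIW4 SIMDs need for a 50% peak-throughput
# gain over 5 VLIW5 SIMDs.
# Assumption: 16 VLIW units per SIMD, so 5 VLIW5 SIMDs give 400 ALUs.

LANES_PER_SIMD = 16

def alu_count(simds: int, vliw_width: int, lanes: int = LANES_PER_SIMD) -> int:
    """Total ALUs for a given SIMD count and VLIW width."""
    return simds * lanes * vliw_width

def clock_scale_needed(target_gain: float, old_alus: int, new_alus: int) -> float:
    """Clock multiplier required so that new_alus * clock hits the target gain."""
    return target_gain * old_alus / new_alus

llano = alu_count(5, 5)    # 5 VLIW5 SIMDs
trinity = alu_count(8, 4)  # 8 VLIW4 SIMDs (speculated in the thread)
scale = clock_scale_needed(1.5, llano, trinity)

print(f"{llano} -> {trinity} ALUs, clock must rise by {(scale - 1) * 100:.1f}%")
```

The ALU count alone gives a 28% increase, so the remaining factor of about 1.17 has to come from clock.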

LordEC911 · Jul 2, 2011

What happened to Trinity having a vliw4 GPU based on 6850?
