Old 29-Jun-2011, 09:51   #1
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default Trinity vs Ivy Bridge

There is a thread going on, but since SB is already out, and IB will be competing with Trinity in the market, I thought it makes more sense to compare those.

Known stuff

IB has 33% more EU's.
Trinity has BD cores
IB will have finfets
Trinity will have NI cores
IB will be DX11, ~3 years late.

Expected stuff
Trinity will have 10 vliw4 simd's.
Trinity will have better, more integrated turbo.

Wanted stuff
Trinity should have better cache level integration.
IB should integrate the gpu deeply into its coherency protocol.
Trinity should have quick sync hw.

DK's speculations are here.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 11:49   #2
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by rpg.314 View Post
Trinity will have 10 vliw4 simd's.
Can't see why it would be more than 8.
Quote:
Trinity will have better, more integrated turbo.
Seems like a safe bet indeed.
Quote:
Trinity should have better cache level integration.
If it's going to get L3 cache, totally agreed. But I wouldn't be surprised if it skipped L3 again either, making this impossible.
mczak is offline   Reply With Quote
Old 29-Jun-2011, 12:02   #3
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,146
Default

How much would the NI architecture benefit from using L3 cache anyways?
It's not like the GPU was designed to take advantage of it, afaik.

IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck, for the moment.
ToTTenTranz is offline   Reply With Quote
Old 29-Jun-2011, 12:02   #4
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by mczak View Post
Can't see why it would be more than 8.
IIRC, they said trinity would be >50% bump.
Quote:
If it's going to get L3 cache, totally agreed. But I wouldn't be surprised if it skipped L3 again either, making this impossible.
My guess is that there would be ~4M L3 cache. But even if they skip L3, I hope they improve upon the coherency protocol between cpu and gpu.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 12:13   #5
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck, for the moment.
Llano's bigger problem is its memory architecture, not bw per se.

As per TR's benches, Llano with dual mem channels performs just like a single channel i5.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 12:51   #6
GZ007
Member
 
Join Date: Jan 2010
Posts: 416
Default

Quote:
Originally Posted by rpg.314 View Post
Llano's bigger problem is its memory architecture, not bw per se.

As per TR's benches, Llano with dual mem channels performs just like a single channel i5.
Those are CPU benches. The GPU could still use up all of the bandwidth. They could try some SiSoftware Sandra benches; it has a video memory bandwidth test.
GZ007 is offline   Reply With Quote
Old 29-Jun-2011, 14:49   #7
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by ToTTenTranz View Post
How much would the NI architecture benefit from using L3 cache anyways?
It's not like the GPU was designed to take advantage of it, afaik.

IMO, what AMD needs to worry about in Trinity is increasing the APU's memory bandwidth, either through special sideport channels for the GPU, more memory channels or faster memory. It's Llano's main bottleneck, for the moment.
Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.
I think it's a much cheaper way to increase "bandwidth" than your other suggestions (well if you factor in that it's useful for the cpu too).

Quote:
Originally Posted by rpg.314 View Post
IIRC, they said trinity would be >50% bump.
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).

Quote:
My guess is that there would be ~4M L3 cache. But even if they skip L3, I hope they improve upon the coherency protocol between cpu and gpu.
If the gpu can't use any L3 cache, I don't think there would be much to gain there, looks "good enough" to me.
mczak is offline   Reply With Quote
Old 29-Jun-2011, 14:58   #8
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by mczak View Post
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).
Come on, as we all know, flops are everything

Quote:
If the gpu can't use any L3 cache, I don't think there would be much to gain there, looks "good enough" to me.
For a quad core BD, it will have 4 MB L2 as it is. I am expecting them to use 1MB L2 for fusion and 4MB L3.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 15:24   #9
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,146
Default

Quote:
Originally Posted by mczak View Post
Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.
I think it's a much cheaper way to increase "bandwidth" than your other suggestions (well if you factor in that it's useful for the cpu too).

Getting a big chunk of cache inside the APU (that usually takes a sizeable amount of die area) is "much cheaper" than creating, e.g., a 64-bit sideport GDDR5 channel for the IGP?!
ToTTenTranz is offline   Reply With Quote
Old 29-Jun-2011, 15:50   #10
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by ToTTenTranz View Post
Getting a big chunk of cache inside the APU (that usually takes a sizeable amount of die area) is "much cheaper" than creating, e.g., a 64-bit sideport GDDR5 channel for the IGP?!
Yes, considering all the problems sideport has. First, you'd better find a way to switch that gddr5 sideport off completely, otherwise power draw is probably not acceptable for the mobile parts. Also, there are already 2 internal memory buses to worry about; do you really want a third (which adds a significant amount of i/o too)? You'd also need to find some way to partition the memory, and the plan is probably to unify the address spaces, not further segregate them. It also obviously adds cost for the memory chips (for 64bit sideport you need 2), which might already be as high as the cost of the L3 cache (which really isn't all that big, 25mm² or so for 4MB in Zambezi, and intel fits 8MB in ~40mm²), and it needs PCB real estate.
Granted, it's a bit of a theoretical view without knowing how much performance you could get from using L3 cache.
Faster memory OTOH is a good option, it's just not really available (apart from some minimal incremental increase). More memory channels aren't viable either, I think.
Now if the L3 would only help the GPU it would probably be too expensive but considering it helps the cpu too it looks quite cheap to me.
mczak is offline   Reply With Quote
Old 29-Jun-2011, 16:59   #11
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 985
Default

Quote:
Originally Posted by mczak View Post
Using L3 is one way to reduce memory bandwidth requirements. Sure it might need some changes, but you could use it to store hierarchical z for instance.
Doesn't the GPU already have a specialized buffer/cache integrated for that?
But a large L3 could serve as some kind of enlarged ROP cache holding far more framebuffer tiles than the color/Z caches within the ROPs themselves, a bit like the eDRAM in some consoles. Or it could serve as a 3rd level texture cache (Sandy Bridge bypasses the L3 for texture reads, if DK's article is correct; obviously Intel decided it's not worth it).
Quote:
Originally Posted by mczak View Post
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).
I would also say 7 to 8 SIMDs are enough. AMD/GF appear to be still on the learning curve for the 32nm process and the GPU implementation targeting it. I would think that a high performance 32nm SOI process with HKMG should enable at least the same frequencies as TSMC's 40nm bulk process at a lower power (needed for the integration in an APU). But you can get a 40nm HD5650M with the same 400 VLIW5 units as Llano but running at 650MHz (thus faster than on Llano), which consumes only 19W including 1GB DDR3 @800MHz. And Llano on desktops with 100W TDP can't get it faster than 600 MHz? I would expect that clock on mobile parts.
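To put the clock comparison above in numbers, here is a rough sketch. It assumes peak throughput is simply ALUs x 2 FLOPs (one multiply-add per clock) x clock speed; the function name is made up for illustration, and the ALU counts and clocks are the ones quoted in this post.

```python
def peak_gflops(alus: int, clock_mhz: float) -> float:
    """Peak single-precision GFLOPS, assuming one multiply-add (2 FLOPs) per ALU per clock."""
    return alus * 2 * clock_mhz / 1000.0


# Figures from the post: both parts have 400 VLIW5 ALUs.
hd5650m = peak_gflops(400, 650)  # 40nm discrete mobile part at 650 MHz
llano = peak_gflops(400, 600)    # 32nm desktop APU at 600 MHz

print(f"HD5650M: {hd5650m:.0f} GFLOPS, Llano desktop: {llano:.0f} GFLOPS")
```

Peak FLOPS says nothing about memory bandwidth or power, so this only illustrates the clock gap, not real-world performance.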
Gipsel is offline   Reply With Quote
Old 29-Jun-2011, 17:25   #12
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,146
Default

Quote:
Originally Posted by mczak View Post
Yes, considering all the problems sideport has. First, you better find a way to switch that gddr5 sideport off completely, otherwise power draw is probably not acceptable for the mobile parts.
Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA.


Quote:
Originally Posted by mczak View Post
Also, there's already 2 internal memory buses to worry about, you really want a third (which adds a significant amount of i/o too),
Is it that much more?
Bloomfield (3-channel) has 200 more "pins" than Lynnfield (2-channel), and Lynnfield actually has 40M transistors more because of integrated PCI-Express and DMA.


Quote:
Originally Posted by mczak View Post
you'd also need to find some way to partition the memory
I don't really understand what you mean by that. AMD has been using motherboards using Sideport+UMA combinations for several years, increasing the IGP's performance. What's so different here?


Quote:
Originally Posted by mczak View Post
It also obviously adds cost for the memory chips (for 64bit sideport you need 2) which might already be as high as the cost of the L3 cache (which really isn't all that big, 25mm² or so for 4MB in Zambezi and intel fits 8MB in ~40mm²) and it needs PCB real estate.
Granted it's a bit a theoretical view without knowing how much performance you could get from using L3 cache.
(...)

Now if the L3 would only help the GPU it would probably be too expensive but considering it helps the cpu too it looks quite cheap to me
That's the thing. How much performance would the GPU get for using L3 cache, if at all? Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?

Increased memory bandwidth has shown to drastically change Llano's results (25% more gaming performance with 33% higher bandwidth).




Quote:
Originally Posted by mczak View Post
and the plan is probably to unify the address spaces not further segregate them.
(...)
Faster memory OTOH is a good option it's just not really available (apart from some minimal incremental increase).
Of course, UMA is the future. Given Llano's results, I think a high-performance Sideport could be a good temporary option, until DDR4 is ready for market.
ToTTenTranz is offline   Reply With Quote
Old 29-Jun-2011, 17:50   #13
Erinyes
Member
 
Join Date: Mar 2010
Posts: 331
Default

Quote:
Originally Posted by Gipsel View Post
I would also say 7 to 8 SIMDs are enough. AMD/GF appear to be still on the learning curve for the 32nm process and the GPU implementation targeting it. I would think that a high performance 32nm SOI process with HKMG should enable at least the same frequencies as TSMC's 40nm bulk process at a lower power (needed for the integration in an APU). But you can get a 40nm HD5650M with the same 400 VLIW5 units as Llano but running at 650MHz (thus faster than on Llano), which consumes only 19W including 1GB DDR3 @800MHz. And Llano on desktops with 100W TDP can't get it faster than 600 MHz? I would expect that clock on mobile parts.
Yeah, I've mentioned this before on the Llano thread as well: the clocks were quite disappointing for what was supposed to be a leading edge process. I was expecting HKMG to bring significant gains. But even in the case of 45nm, it took them a while to sort the process out. Afaik the Phenom II X4 launched at 3.2 GHz or 3.4 GHz back in Nov 2008 and the TDP was 125W. By the time they launched the hex core chips (afaik March 2010), they were offering six cores at the same clocks while maintaining the same TDP.

Quote:
Originally Posted by ToTTenTranz View Post
Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA
In the case of the 780G, the sideport was 32 bit. And the reason had more to do with power than with performance. The use of the sideport meant that the IGP (which was on the northbridge) did not have to make a trip to the CPU (where the mem controller was) and back when it needed to access some video memory (or something to that effect, maybe I haven't got it totally right).

And essentially you're proposing that all motherboards should come with GDDR5 built in (say 512 MB if you're proposing a 64 bit channel for the GPU). That's not cheap and I would imagine it isn't power efficient either.
Erinyes is offline   Reply With Quote
Old 29-Jun-2011, 17:53   #14
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by ToTTenTranz View Post
Already done. 780G and later motherboards with sideport memory give you the option in the bios to switch between Sideport only, UMA only and Sideport+UMA.
I don't think it would be much harder to implement a driver-enabled "high performance mode" with the Sideport enabled and the rest of the time just use the UMA.
You need seamless switching. Off at idle or light load, on otherwise (also I'm not sure those boards actually powered the memory down; given it was ddr2/3 it probably didn't draw much power and on the desktop no one cared anyway, while it saved power for the notebooks as otherwise you had constant HT/MC fetches even for display scanout).

Quote:
Is it that much more?
Bloomfield (3-channel) has 200 more "pins" than Lynnfield (2-channel), and Lynnfield actually has 40M transistors more because of integrated PCI-Express and DMA.
It might not be that much more but it's still a budget cpu, after all. There is significantly more room for such things on the high end.

Quote:
I don't really understand what you mean by that. AMD has been using motherboards using Sideport+UMA combinations for several years, increasing the IGP's performance. What's so different here?
That was rather primitive and it didn't really help performance all that much (cause sideport was very low bandwidth). But if both main memory and side port have similar memory bandwidth (as it would be with 64bit gddr5) I'm not sure that scheme would be sufficient. You could think about framebuffers in gddr5 sideport, textures in main memory or something, but the needs might also be dictated by which parts of the memory you still want the cpu to be able to access (with reasonable performance). Not saying it's impossible, just that it probably gets a bit complex.
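As a rough check on the "similar memory bandwidth" point, peak bandwidth is just bus width times data rate. The sketch below is illustrative only: the data rates (DDR3-1866 for main memory, 4 Gbps per pin for GDDR5, a 16-bit DDR3-1333 sideport for the older chipsets) are assumptions picked for the example, not figures from this thread, and the function name is made up.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_mt_s: float) -> float:
    """Peak bandwidth in GB/s: bus width in bytes times mega-transfers per second."""
    return bus_width_bits / 8 * data_rate_mt_s / 1000.0


# Assumed configurations, for illustration only:
main_memory = bandwidth_gb_s(128, 1866)    # dual-channel (2 x 64-bit) DDR3-1866
gddr5_sideport = bandwidth_gb_s(64, 4000)  # 64-bit GDDR5 at 4 Gbps per pin
old_sideport = bandwidth_gb_s(16, 1333)    # narrow DDR3 sideport, 7xx-chipset style

print(f"main memory:     {main_memory:.1f} GB/s")
print(f"GDDR5 sideport:  {gddr5_sideport:.1f} GB/s")
print(f"old sideport:    {old_sideport:.1f} GB/s")
```

Under these assumptions the GDDR5 sideport lands in the same range as main memory, while the old narrow sideport is roughly an order of magnitude slower, which is why the old Sideport+UMA split would not carry over directly.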


Quote:
That's the thing. How much performance would the GPU get for using L3 cache, if at all? Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?

Increased memory bandwidth has shown to drastically change Llano's results (25% more gaming performance with 33% higher bandwidth).
No doubt. I think if you're only looking at discrete gpus, it's probably just not worth it because increasing overall bandwidth doesn't really add much complexity - it's still one interface, just faster (of course this still increases i/o and stuff). I just think the balance shifts quite a bit when you have an APU.
I don't know how much performance you can really gain with L3, but I find the Sandy Bridge results with 1 memory channel (also in that techreport article) quite amazing on that front: it only loses about 20% of the performance for half the memory bandwidth. Sure, part of that is because the GPU isn't all that fast compared to Llano (hence it needs less memory bandwidth), but still I think part of that is the usage of L3 cache for the GPU. I don't have any proof for that though (some comparisons with Arrandale could be interesting maybe, unfortunately you can't switch off the L3 cache AFAIK...).
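One way to compare the two data points above (Llano gaining 25% from 33% more bandwidth, Sandy Bridge losing about 20% at half the bandwidth) is a simple scaling exponent: the value e for which the performance ratio equals the bandwidth ratio raised to e. This is only a back-of-the-envelope sketch; the power-law model and the function name are assumptions for illustration, and the two results come from different tests.

```python
import math


def bandwidth_elasticity(bw_ratio: float, perf_ratio: float) -> float:
    """Exponent e such that perf_ratio == bw_ratio ** e (1.0 = fully bandwidth bound)."""
    return math.log(perf_ratio) / math.log(bw_ratio)


# Figures quoted in the thread:
llano = bandwidth_elasticity(1.33, 1.25)       # +33% bandwidth -> +25% performance
sandy_bridge = bandwidth_elasticity(0.5, 0.8)  # half bandwidth -> about -20% performance

print(f"Llano:        {llano:.2f}")
print(f"Sandy Bridge: {sandy_bridge:.2f}")
```

A higher exponent means performance tracks bandwidth more closely, so on these numbers Llano looks considerably more bandwidth-sensitive than Sandy Bridge, whatever the cause.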

Quote:
Of course, UMA is the future. Given Llano's results, I think a high-performance Sideport could be a good temporary option, until DDR4 is ready for market.
That would be quite a long-standing temporary solution, since ddr4 isn't predicted before 2014 (and really 2015 for volume) according to the latest reports. I don't think it would help all that much anyway since by then surely the gpus will be a lot faster too (assuming ddr4 is twice as fast, certainly gpus will be faster by more than that in 2015).
mczak is offline   Reply With Quote
Old 29-Jun-2011, 17:59   #15
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Isn't there a good reason why there haven't been any mid-to-high end GPUs using eDRAM, for example?
eDRAM in sufficient quantities will be too expensive and low end GPU's won't be able to afford it, needing a different architecture.
rpg.314 is offline   Reply With Quote
Old 29-Jun-2011, 17:59   #16
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by Erinyes View Post
In the case of the 780G, the sideport was 32 bit
Actually I'm quite sure it was only 16bit, supporting ddr2/3 on all the 7xx chipsets.
Not sure about rs690 might have been 16 or 32bit (but didn't support ddr3 for sure).
mczak is offline   Reply With Quote
Old 29-Jun-2011, 18:23   #17
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,146
Default

Quote:
Originally Posted by mczak View Post
Actually I'm quite sure it was only 16bit, supporting ddr2/3 on all the 7xx chipsets.
Not sure about rs690 might have been 16 or 32bit (but didn't support ddr3 for sure).
Desktop versions actually have decently-clocked DDR3 chips.

I've also heard that in some cases it's only a 16-bit bus, but I'm pretty sure the 780G in my Ferrari One is using a 32bit Sideport with 384MB. The access to UMA is blocked through the bios, though

Last edited by ToTTenTranz; 29-Jun-2011 at 18:31.
ToTTenTranz is offline   Reply With Quote
Old 02-Jul-2011, 01:26   #18
Kaotik
yes, i'm drunk
 
Join Date: Apr 2003
Posts: 4,801
Send a message via ICQ to Kaotik
Default

Quote:
Originally Posted by mczak View Post
50%. 8 simds with a slightly higher clock is enough, if that figure was even peak flops (for all we know they could have been talking texture filtering rate...).
How is 8 VLIW4 SIMDs 50% increase over 5 VLIW5 SIMDs, even if you bump the clocks slightly?
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Kaotik is online now   Reply With Quote
Old 02-Jul-2011, 02:25   #19
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by Kaotik View Post
How is 8 VLIW4 SIMDs 50% increase over 5 VLIW5 SIMDs, even if you bump the clocks slightly?
You "only" need a 17% clock increase to achieve that 50% increase for 8 vliw4 simds over 5 vliw5 ones. I don't know if that's realistic or not (though compared to discrete parts the clocks certainly wouldn't be extraordinarily high, and overclocking attempts also suggest it's doable). But in any case it would be a very substantial increase in graphics power (more than those 50%!).
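The arithmetic behind that 17% can be sketched as follows, assuming 16 VLIW lanes per SIMD in both designs (which matches the 400 ALUs quoted for Llano's 5 VLIW5 SIMDs) and that peak throughput scales with ALU count times clock. The helper names are made up for illustration.

```python
LANES_PER_SIMD = 16  # assumption: 16 VLIW lanes per SIMD in both designs


def alu_count(simds: int, vliw_width: int) -> int:
    """Total ALUs = SIMDs x lanes per SIMD x ALUs per VLIW lane."""
    return simds * LANES_PER_SIMD * vliw_width


def required_clock_ratio(target_speedup: float, old_alus: int, new_alus: int) -> float:
    """Clock ratio needed so that new_alus * new_clock == target_speedup * old_alus * old_clock."""
    return target_speedup * old_alus / new_alus


llano_alus = alu_count(5, 5)    # 5 VLIW5 SIMDs -> 400 ALUs
trinity_alus = alu_count(8, 4)  # 8 VLIW4 SIMDs -> 512 ALUs

ratio = required_clock_ratio(1.5, llano_alus, trinity_alus)
print(f"{(ratio - 1) * 100:.1f}% clock increase needed")  # prints "17.2% clock increase needed"
```

For comparison, under the same assumptions 10 VLIW4 SIMDs would give 640 ALUs, i.e. 60% more than Llano at unchanged clocks.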
mczak is offline   Reply With Quote
Old 02-Jul-2011, 03:02   #20
LordEC911
Member
 
Join Date: Nov 2007
Location: 'Zona
Posts: 514
Default

What happened to Trinity having a vliw4 GPU based on 6850?
LordEC911 is offline   Reply With Quote
Old 02-Jul-2011, 05:20   #21
swaaye
Entirely Suboptimal
 
Join Date: Mar 2003
Location: WI, USA
Posts: 6,845
Default

Only Cayman (6950) is VLIW4. Trinity is indeed VLIW4 as well according to reports/rumors. Maybe one day we'll actually have some use for VLIW4 (GPGPU). But then AMD did just show us how they want to leave it behind too.
swaaye is offline   Reply With Quote
Old 02-Jul-2011, 09:13   #22
LordEC911
Member
 
Join Date: Nov 2007
Location: 'Zona
Posts: 514
Default

Quote:
Originally Posted by swaaye View Post
Only Cayman (6950) is VLIW4. Trinity is indeed VLIW4 as well according to reports/rumors. Maybe one day we'll actually have some use for VLIW4 (GPGPU). But then AMD did just show us how they want to leave it behind too.
I was talking about the supposed Freudian slip from Mr. Houston...

Quote:
Originally Posted by PCPer
7:55 Trinity has a "6850" kind of thing....interesting....
7:55 I think that slipped!
7:56 But then he stated Trinity would be "VLIW4" so Cayman-based... interesting.
http://www.pcper.com/news/Graphics-C...2011-Live-Blog
LordEC911 is offline   Reply With Quote
Old 02-Jul-2011, 21:03   #23
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,946
Default

That was a confusion. He thought that the 6800 was based on VLIW4. He meant to say that the architecture is based on the 6900 series.
__________________
Expand. Accelerate. Dominate.
Tweet Tweet!
Dave Baumann is offline   Reply With Quote
Old 02-Jul-2011, 22:37   #24
LordEC911
Member
 
Join Date: Nov 2007
Location: 'Zona
Posts: 514
Default

Quote:
Originally Posted by Dave Baumann View Post
That was a confusion. He thought that the 6800 was based on VLIW4. He meant to say that the architecture is based on the 6900 series.
Awww... well that's no fun for silly season.
So back to 8-10SIMDs.
LordEC911 is offline   Reply With Quote
Old 03-Jul-2011, 01:14   #25
Kaotik
yes, i'm drunk
 
Join Date: Apr 2003
Posts: 4,801
Default

Quote:
Originally Posted by Dave Baumann View Post
That was a confusion. He thought that the 6800 was based on VLIW4. He meant to say that the architecture is based on the 6900 series.
Since it's now "the past", was 6800-series meant to be VLIW4, but 32-40nm case forced it to be VLIW5?
Kaotik is online now   Reply With Quote

Tags
amd, fusion, intel, ivy bridge, trinity
