Old 14-Jan-2014, 20:49   #1126
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,854
Default

A few OpenCL/HSA benchmarks: http://www.extremetech.com/computing...geneous-chip/5
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Old 14-Jan-2014, 21:03   #1127
yuri
Junior Member
 
Join Date: Jun 2010
Posts: 22
Default

The BKDG doc for Kaveri has been released recently... Apparently Kaveri was really supposed to be equipped with GDDR5 memory as a complement to the standard DDR3 one.

That would solve the memory bandwidth problems for 'high-end' SKUs at the expense of a few more watts.

http://support.amd.com/TechDocs/4912...h-3Fh_BKDG.pdf Search "GDDR5" string.

Regarding the exotic memory configurations: the HBM/HMC solutions are surely pretty expensive ATM, and it will take a few years/generations for them to get cheap enough for AMD's still-budget APUs.
Old 14-Jan-2014, 21:10   #1128
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,854
Default

If Kaveri was originally supposed to have 4 memory channels, with GDDR5 compatibility on at least some of those channels, it might explain why latency has increased: it's just due to higher complexity (which doesn't actually achieve anything since it's all disabled).

That's rather unfortunate.
Old 14-Jan-2014, 21:54   #1129
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,691
Default

Quote:
Originally Posted by Turbotab View Post
"And speaking of memory bandwidth, Kaveri has two 64-bit, fully independent memory channels. "We do stripe across them," Macri told us, "especially for the memory that's allocated for high-bandwidth needs like graphics."
He is referring to the die shot (and the kernel patches disabling 2 of 4 logical memory channels) which seem to imply there's more there than just 2 64bit ddr3 controllers. If you look at anandtech's pictures, http://www.anandtech.com/show/7677/a...00-a10-7850k/4 the memory controllers grew a lot in size (on the bottom edge for kaveri, upper edge on llano/trinity).
Someone should ask amd why the memory controllers are so big .
They could definitely need more bandwidth, gpu clock scaling is hilariously bad (e.g. here, http://www.computerbase.de/artikel/p...eri-im-test/6/ - for 40% higher gpu clock you don't even get a 10% increase in performance...).
If you thought Kabini with its single 64bit memory channel was memory bandwidth limited, think again: Kaveri has just half the bandwidth/flop. On the upside though, you get nearly the same performance with the much cheaper, 6 GCN cores 65W a8-7600 as with the 8 GCN cores a10-7850k (in games)...
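A back-of-the-envelope sketch of the bandwidth/flop comparison. The shader counts, clocks and memory speeds below are assumptions for illustration (a 128-shader Kabini at 600 MHz on single-channel DDR3-1600, and a 512-shader A10-7850K at 720 MHz on dual-channel DDR3-2133), not figures taken from this thread:

```python
def peak_gflops(shaders, clock_mhz):
    # Assumes each shader lane retires one FMA (2 flops) per clock.
    return shaders * 2 * clock_mhz / 1000.0

def ddr3_bandwidth_gbs(channels, mega_transfers_per_s, bus_bits=64):
    # channels x transfers per second x bytes per transfer
    return channels * mega_transfers_per_s * (bus_bits // 8) / 1000.0

def bytes_per_flop(bandwidth_gbs, gflops):
    return bandwidth_gbs / gflops

# Assumed configurations (see the note above).
kabini = bytes_per_flop(ddr3_bandwidth_gbs(1, 1600), peak_gflops(128, 600))
kaveri = bytes_per_flop(ddr3_bandwidth_gbs(2, 2133), peak_gflops(512, 720))

print(f"Kabini: {kabini:.3f} B/flop")
print(f"Kaveri: {kaveri:.3f} B/flop")
print(f"Kaveri/Kabini ratio: {kaveri / kabini:.2f}")
```

With these assumed inputs the ratio lands a little above one half; slower DIMMs on the Kaveri side would push it lower.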
Old 14-Jan-2014, 22:47   #1130
DSC
Naughty Boy!
 
Join Date: Jul 2003
Posts: 689
Default

http://vr-zone.com/articles/amd-push...ses/17088.html

Quote:
Originally Posted by Some AMD employee
"Steamroller is not Bulldozer Enhanced. F*** no. The layout might look the same but our LEGO blocks are completely different. When all is said and done we should get 45% improvement and this goes to show how the Bulldozer was f***** design. This is all what Bulldozer was supposed to be."
Old 14-Jan-2014, 22:47   #1131
Turbotab
Member
 
Join Date: Feb 2013
Posts: 147
Default

Quote:
Originally Posted by mczak View Post
He is referring to the die shot (and the kernel patches disabling 2 of 4 logical memory channels) which seem to imply there's more there than just 2 64bit ddr3 controllers. If you look at anandtech's pictures, http://www.anandtech.com/show/7677/a...00-a10-7850k/4 the memory controllers grew a lot in size (on the bottom edge for kaveri, upper edge on llano/trinity).
Someone should ask amd why the memory controllers are so big .
They could definitely need more bandwidth, gpu clock scaling is hilariously bad (e.g. here, http://www.computerbase.de/artikel/p...eri-im-test/6/ - for 40% higher gpu clock you don't even get a 10% increase in performance...).
If you thought Kabini with its single 64bit memory channel was memory bandwidth limited, think again: Kaveri has just half the bandwidth/flop. On the upside though, you get nearly the same performance with the much cheaper, 6 GCN cores 65W a8-7600 as with the 8 GCN cores a10-7850k (in games)...
If AMD wanted to enable quad-channel memory, wouldn't they need quad-channel compatible motherboards as well? Anyway if AMD wants to waste die space, then why not, they've got money to burn
Old 14-Jan-2014, 22:54   #1132
Nemo
Junior Member
 
Join Date: Sep 2012
Posts: 71
Default

facepalm

Old 14-Jan-2014, 23:18   #1133
Turbotab
Member
 
Join Date: Feb 2013
Posts: 147
Default

Quote:
Originally Posted by Nemo View Post
facepalm
Just going through Techreport's review, and on BF4 & Tomb Raider at least frame pacing is pretty good.

Now for some power consumption figures, during x264 encoding, the 65w Kaveri uses more power than an 84w 4770k, which is only a few watts higher than the 45w Kaveri.

http://techreport.com/review/25908/a...or-reviewed/12
Old 14-Jan-2014, 23:35   #1134
Psycho
Member
 
Join Date: Jun 2008
Location: Copenhagen
Posts: 673
Default

That probably says more about Theo's sources... you forgot the bit before that "source" quote:
Quote:
According to Mark Papermaster, the improvements should yield up to 30% performance increase, but our sources inside the company beg to differ.
Up to 30% ipc from bd sounds about right..
Old 14-Jan-2014, 23:46   #1135
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,691
Default

Quote:
Originally Posted by Nemo View Post
facepalm

If you see graphs like this when the rendering time differs from frame to frame that much (that is, a very high frame time followed by a very low one) without using some AFR solution, this is usually a good indication that for some reason the measurement does not represent reality. That can happen pretty easily if you rely on dx to acquire this information.
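One rough way to flag that kind of suspicious sawtooth in a frame-time log is to count how often consecutive frame-to-frame deltas flip sign. This is a hypothetical heuristic with made-up sample data, not something any review site is known to use:

```python
def alternation_ratio(frame_times_ms):
    """Fraction of consecutive frame-time deltas that flip sign.

    Values near 1.0 mean a strict high/low/high/low pattern.
    """
    deltas = [b - a for a, b in zip(frame_times_ms, frame_times_ms[1:])]
    deltas = [d for d in deltas if d != 0]  # ignore flat steps
    if len(deltas) < 2:
        return 0.0
    flips = sum(1 for d0, d1 in zip(deltas, deltas[1:]) if (d0 > 0) != (d1 > 0))
    return flips / (len(deltas) - 1)

# Made-up traces: a long frame followed by a very short one, repeatedly,
# versus a trace that just drifts slowly.
sawtooth = [30, 5, 31, 4, 29, 6, 30, 5]
smooth = [16, 17, 18, 19, 20, 21, 22, 23]

print(alternation_ratio(sawtooth), alternation_ratio(smooth))
```

A ratio close to 1.0 on a single-GPU system would be a hint to check where the timestamps were taken before trusting the graph.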
Old 15-Jan-2014, 00:16   #1136
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 3,428
Default

Quote:
Originally Posted by yuri View Post
The BKDG doc for Kaveri has been released recently... Apparently Kaveri was really supposed to be equipped with GDDR5 memory as a complement to the standard DDR3 one.
All I see about GDDR5 is a checklist in the memory section saying "GDDR5 isn't supported".
What they say is that only DCT0 and DCT3 can be used even though DCT1 and DCT2 are present.


I don't think the APU has GDDR5 support. First because I think a 2*64bit GDDR5 memory controller wouldn't look exactly like a 2*64bit DDR3 controller, which is what we see in the pictures.
Second, I also don't think they would mix the GDDR5 address space (DCT1+DCT2) between the DDR3 controllers (DCT0+DCT3).




Isn't this the manual for the A88X motherboards?
What are the chances for AMD to be releasing embedded solutions or a new family of motherboards (A89X?) with all four banks activated in the future?

It's just that the second pair of 64bit DDR3 controllers looks like a terrible waste of transistors and area and, worst of all, like such a wasted opportunity to grab the iGPU market leadership..

The way things are, Kaveri is probably just going to be squashed by Broadwell..
Maybe the desktop motherboards/laptops with the 256bit memory are scheduled to release when Broadwell releases?

And just a question:
How would a 4-module SteamrollerB using 32nm SOI at current Vishera speeds and with 8MB L3 cache perform?
Maybe quite closer to Intel's solutions?

Last edited by ToTTenTranz; 15-Jan-2014 at 00:21.
Old 15-Jan-2014, 02:08   #1137
Kaotik
Drunk Member
 
Join Date: Apr 2003
Posts: 5,387
Default

Quote:
Originally Posted by ToTTenTranz View Post
don't think the APU has GDDR5 support. First because I think a 2*64bit GDDR5 memory controller wouldn't look exactly like a 2*64bit DDR3 controller, which is what we see in the pictures.
We're seeing what looks like 4x64bit DDR3 there, not 2x64bit
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Old 15-Jan-2014, 03:00   #1138
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 3,428
Default

Quote:
Originally Posted by Kaotik View Post
We're seeing what looks like 4x64bit DDR3 there, not 2x64bit
That's exactly what I wrote.
Old 15-Jan-2014, 05:49   #1139
moozoo
Junior Member
 
Join Date: Jul 2010
Posts: 94
Default

The only way I can see Kaveri having any real market is if HSA/hUMA is extended to amd discrete cards and that there is some real performance/cost advantage in doing this.
I'm guessing only FM2+ motherboards with Kaveri CPU's will have the hardware capable of doing this.

Are cheap no memory discrete graphics cards feasible?
Would it have enough bandwidth if it plugged into more than one PCIe16 slot? i.e. a motherboard with two PCIe 16 slots next to each other.

Is it possible to reverse the problem and map the entire graphics card memory into the system address space and implement shared virtual memory for it?
Perhaps only pages marked as nonexecutable would be assigned to this memory.
Of course this would mean all CPU memory data accesses are going across the PCIe bus....

It would be cool if they could implement hUMA for graphics cards with dual gpus. i.e. share the card memory between the gpus on a dual gpu card. But again, would this give a performance advantage?
With existing dual gpu cards, when it uploads textures etc to both gpu's, does it use broadcast pcie packets or does the driver upload to each in turn?

At a 1:16 fp64 rate and DP Gflops below that of an Intel CPU, Kaveri has no real value to me.
Old 15-Jan-2014, 06:14   #1140
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,405
Default

Quote:
Originally Posted by moozoo View Post
Are cheap no memory discrete graphics cards feasible?
Would it have enough bandwidth if it plugged into more than one PCIe16 slot? i.e. a motherboard with two PCIe 16 slots next to each other.
The idea that a DRAM-free board hanging off of PCIe can be cheap presupposes that a highly non-standard and standard-violating board with a dubious business case and non-standard GPU can be cheap.

If someone is so cost-conscious that even inexpensive DRAM is too much, you might be getting down to the most stripped-down and non-expandable motherboards you can find.
A graphics unit without access to local memory hasn't been practical since early in the last decade, and I doubt even the vaunted latency-hiding capabilities of a GPU can hide the impact of having no local framebuffer. The ROPs would probably be one of the first elements to falter, with the necessary batch sizes and local caching necessary becoming too large to be practical.
The following is more speculative, but pure PCIe accesses may also subject the GPU to more stringent ordering constraints than its aggressive memory pipeline can tolerate, negating the GPU's ability to utilize it well.

I would argue that AMD's APUs, or just dispensing with graphics hardware altogether, have a higher upside.

Quote:
Of course this would mean all CPU memory data accesses are going across the PCIe bus....
You'd probably save money and gain performance by just not bothering with the discrete board.

Quote:
With existing dual gpu cards, when it uploads textures etc to both gpu's, does it use broadcast pcie packets or does the driver upload to each in turn?
There are more complex transactions with modern PCIe, including things like broadcasting or endpoint to endpoint transfers. My limited understanding of it is that some kind of software process needs to perform it.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 15-Jan-2014, 09:11   #1141
no-X
Senior Member
 
Join Date: May 2005
Posts: 2,094
Default

I believe these numbers could be possible, but not with this underclocked and bandwidth-starved silicon with increased latencies due to the iGPU. A full-speed, L3-equipped model would perform better.
__________________
Sorry for my English. But I hope it's better than your Czech
Old 15-Jan-2014, 15:19   #1142
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,015
Default



So, apparently the i-cache size scales with the associativity, although in an "odd" manner, which means the bank structure is unchanged.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 15-Jan-2014, 15:30   #1143
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,691
Default

Quote:
Originally Posted by fellix View Post


So, apparently the i-cache size scales with the associativity, although in an "odd" manner, which means the bank structure is unchanged.
I believe the bigger problem with the BD family l1i cache is associativity (or rather the lack thereof...), not size, so I wonder why they didn't change the structure and go with a 64kB cache with 4 times the associativity instead. Maybe just increasing the size was simpler, but still that looks like a more expensive solution to me.
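To make the size/associativity trade-off concrete, here is a small sketch of the set count and per-way size for each option. The specific configurations are assumptions on my part (64 KB 2-way for the older Bulldozer-family L1I, 96 KB 3-way for Steamroller, 64-byte lines, and "4 times the associativity" read as 8-way); the image the quoted post referred to did not survive in this thread:

```python
def cache_sets(size_kib, ways, line_bytes=64):
    # Number of sets = capacity / (ways x line size)
    return size_kib * 1024 // (ways * line_bytes)

def way_size_kib(size_kib, ways):
    # Capacity covered by a single way
    return size_kib // ways

# Assumed configurations (labels are illustrative only).
configs = {
    "BD-style 64K 2-way": (64, 2),
    "SR-style 96K 3-way": (96, 3),
    "alt 64K 8-way": (64, 8),
}

for name, (size, ways) in configs.items():
    print(f"{name}: {cache_sets(size, ways)} sets, {way_size_kib(size, ways)} KiB per way")
```

Under these assumptions the 96 KB 3-way option keeps the same set count and way size as the 64 KB 2-way cache (it just adds a third way), whereas a 64 KB 8-way cache would change both, which fits the "simpler to just grow it" guess.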
Old 15-Jan-2014, 17:01   #1144
pjbliverpool
B3D Scallywag
 
Join Date: May 2005
Location: Guess...
Posts: 5,814
Default

Quote:
Originally Posted by 3dilettante View Post
The idea that a DRAM-free board hanging off of PCIe can be cheap presupposes that a highly non-standard and standard-violating board with a dubious business case and non-standard GPU can be cheap.

If someone is so cost-conscious that even inexpensive DRAM is too much, you might be getting down to the most stripped-down and non-expandable motherboards you can find.
A graphics unit without access to local memory hasn't been practical since early in the last decade, and I doubt even the vaunted latency-hiding capabilities of a GPU can hide the impact of having no local framebuffer. The ROPs would probably be one of the first elements to falter, with the necessary batch sizes and local caching necessary becoming too large to be practical.
The following is more speculative, but pure PCIe accesses may also subject the GPU to more stringent ordering constraints than its aggressive memory pipeline can tolerate, negating the GPU's ability to utilize it well.
I assume at some point (soon hopefully) we'll see HSA like functionality in discrete GPU's in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0 at 64GB/s total bandwidth it's nothing to sneeze at) but at least it would do away with having to copy data back and forth.

Does that sound feasible?
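For reference, a quick sketch of where a figure like 64GB/s comes from. It uses the per-lane signalling rates and line encodings of PCIe 2.0/3.0 and the 16 GT/s target announced for 4.0, and gives raw link bandwidth before any packet or protocol overhead:

```python
PCIE_GENERATIONS = {
    # generation: (GT/s per lane, line-encoding efficiency)
    2: (5.0, 8 / 10),      # 8b/10b
    3: (8.0, 128 / 130),   # 128b/130b
    4: (16.0, 128 / 130),  # 128b/130b
}

def pcie_gbs_per_direction(generation, lanes=16):
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 for bytes.
    return rate_gt * efficiency / 8 * lanes

for gen in sorted(PCIE_GENERATIONS):
    one_way = pcie_gbs_per_direction(gen)
    print(f"PCIe {gen}.0 x16: {one_way:.1f} GB/s per direction, {2 * one_way:.1f} GB/s both ways")
```

So the "64GB/s total" only appears when both directions of an x16 link are added together; a one-way transfer sees about half of that at best.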
__________________
PowerVR PCX1 -> Voodoo Banshee -> GeForce2 MX200 -> GeForce2 Ti -> GeForce4 Ti 4200 -> 9800Pro -> 8800GTS -> Radeon HD 4890 -> GeForce GTX 670 DCUII TOP

8086 8Mhz -> Pentium 90 -> K6-2 233Mhz -> Athlon 'Thunderbird' 1Ghz -> AthlonXP 2400+ 2Ghz -> Core2 Duo E6600 2.4 Ghz -> Core i5 2500K 3.3Ghz
Old 15-Jan-2014, 17:05   #1145
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by pjbliverpool View Post
I assume at some point (soon hopefully) we'll see HSA like functionality in discrete GPU's in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0 at 64GB/s total bandwidth it's nothing to sneeze at) but at least it would do away with having to copy data back and forth.

Does that sound feasible?
Thanks to 32-bit OSes, only a portion of the graphics card's memory is exposed to the CPU. For AMD parts, this is around 256MB of the total video memory. This means that the CPU cannot directly access arbitrary regions of video memory.
__________________
I speak only for myself.
Old 15-Jan-2014, 17:18   #1146
pMax
Member
 
Join Date: May 2013
Location: deep in your "PC"
Posts: 195
Default

Quote:
Originally Posted by OpenGL guy View Post
Thanks to 32-bit OSes
with XB1&PS4, 32 bit will become less relevant for big games in 1 year or less... even in PC space. At least for all titles that matter, I believe (EA, UBI, TAKE2).

Quote:
Originally Posted by OpenGL guy View Post
only a portion of the graphics card's memory is exposed to the CPU
I bet you can remap it easily, or use a GART for that. It would lead to the hassle of addressing memory in 256MB chunks, maybe - but possible, no?
__________________
...securing your world for fun and profit.
Old 15-Jan-2014, 17:45   #1147
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by pMax View Post
with XB1&PS4, 32 bit will become less relevant for big games in 1 year or less... even in PC space. At least for all titles that matter, I believe (EA, UBI, TAKE2).
We have to support our customers and that means people upgrading machines with older OSes.
Quote:
Originally Posted by pMax
I bet you can remap it easily, or use a GART for that. It would lead to the hassle of addressing memory in 256MB chunks, maybe - but possible, no?
No, not possible. What would you do if multiple applications were trying to access different 256MB windows? (This is supposed to be seamless, right?) Adding in new APIs to change the mappings isn't seamless (or simple or bulletproof), so you may as well stick with existing technology.

Once everyone is convinced that 64-bit is the de facto standard then we can improve the situation.

Also, keep in mind that discrete GPU memory is tagged as uncached since the GPU can't/doesn't probe the CPU's caches for data. This means that CPU reads from discrete GPU memory will be very slow. CPU writes are fine as long as you fill write-combine buffers.
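A toy model of the window problem, purely illustrative: it assumes a single 256MB CPU-visible aperture that has to be re-pointed whenever an access falls outside the current window. The function name, the access list and the "remap" notion are all hypothetical and do not describe how any real driver works:

```python
APERTURE_MB = 256  # size of the assumed CPU-visible window

def count_remaps(accesses, aperture_mb=APERTURE_MB):
    """Count window switches for a sequence of (app, vram_offset_mb) accesses."""
    current_window = None
    remaps = 0
    for _app, offset_mb in accesses:
        window = offset_mb // aperture_mb
        if window != current_window:
            remaps += 1  # the aperture must be re-pointed
            current_window = window
    return remaps

# Two hypothetical applications touching different 256MB regions in turn.
interleaved = [("A", 10), ("B", 700), ("A", 20), ("B", 710), ("A", 30), ("B", 720)]
# The same accesses, but grouped per application.
batched = sorted(interleaved, key=lambda access: access[0])

print(count_remaps(interleaved), count_remaps(batched))
```

With interleaved accesses every single access forces a remap, and the applications would have to coordinate to avoid it, which is the opposite of seamless.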
Old 15-Jan-2014, 18:44   #1148
pMax
Member
 
Join Date: May 2013
Location: deep in your "PC"
Posts: 195
Default

Quote:
Originally Posted by OpenGL guy View Post
We have to support our customers and that means people upgrading machines with older OSes.
[..] Once everyone is convinced that 64-bit is the de facto standard then we can improve the situation.
...of course, you have to support legacy 32 bit customers.
But you have rewritten the driver for 64 bit support, and you could easily add the feature for the 64 bit family. 64 bit code and 32 bit code can't mix anyway.
Let the devs deal with 32-bit porting, they will then choose what to do. Most of today's game customers should have been on a 64 bit OS for years - at least for top selling games that would love to use such features, I think.

Quote:
Originally Posted by OpenGL guy View Post
No, not possible. What would you do if multiple applications were trying to access different 256MB windows?
Ouch, yeah - you are right, I tend to forget... anyway, it can just be a "singleton" resource, on 32bit desktop space (or maybe 2 with half space). And full access on 64 bit space. But probably not worth the effort, I suppose.

Quote:
Originally Posted by OpenGL guy View Post
CPU writes are fine as long as you fill write-combine buffers.
...you say WB is ok ...but you have to do a read-modify no?
Aaah, I see. You mean final write is delayed in CPU until it needs to be flushed. Interesting, thanks.

Last edited by pMax; 15-Jan-2014 at 19:05.
Old 15-Jan-2014, 19:33   #1149
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,405
Default

Quote:
Originally Posted by pjbliverpool View Post
I assume at some point (soon hopefully) we'll see HSA like functionality in discrete GPU's in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0 at 64GB/s total bandwidth it's nothing to sneeze at) but at least it would do away with having to copy data back and forth.

Does that sound feasible?
A GCN GPU's primary cache subsystem operates primarily on 64 byte transactions, and a GDDR5 channel controller would mostly be working in terms of two or more 32B bursts.
The memory subsystem spends several hundred cycles trying to coalesce as many accesses as it can, so total latency hiding for that memory traffic is several hundred cycles which translates into as many nanoseconds. There are physical protocols and heuristics handled by the hardware in an autonomous fashion, and their actions are solely the concern of the GPU's low-level hardware.

PCIe transaction latencies for GPGPU seem to be all over the place, but link latency seems to start in the microsecond range, thanks to the traversal of hardware and software protocol layers and system management of the device. There seem to be issues with different GPGPU tests and driver/software differences with very different latency numbers, where the number of microseconds can vary by orders of magnitude.

Going from those, there may be an order of magnitude in latency difference, with a bus whose best utilization is transfer sizes in the hundreds or thousands of bytes.
Small transactions would lose a significant fraction to packet overhead, and I wouldn't trust this arrangement to not be throttled by uncore and IO subsystem of the CPU, coupled with the driver stack and OS management.
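A rough sketch of the packet-overhead point. The 24 bytes of per-packet overhead (header plus framing, sequence number and link CRC) is an approximate, assumed figure; the real value depends on the PCIe generation and header format:

```python
def tlp_efficiency(payload_bytes, overhead_bytes=24):
    """Fraction of link bytes that carry payload for one packet.

    overhead_bytes is an assumed, approximate per-packet cost.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)

# Payload sizes chosen for illustration; real links also cap the
# maximum payload size per packet.
for payload in (4, 64, 256, 4096):
    print(f"{payload:5d} B payload -> {tlp_efficiency(payload):.0%} efficient")
```

Under that assumption a cache-line-sized 64-byte transfer already loses more than a quarter of the link to overhead, before any latency effects are considered.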

The necessary expansion in the size of the GPU, and the even larger batches seem counterproductive to me. The GPU texturing and CU path may have good latency hiding, but they have their limits with the current wavefront count and storage size, and the caches are sized for a certain amount of reuse that will not happen when ideal transaction sizes exceed them.
ROP cache tiling doesn't seem to have the level of scalability needed, and the command front end and queueing latency even with local memory is atrocious in terms of compute.
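The "larger batches" argument follows from Little's law: sustained bandwidth times latency equals the amount of data that must be in flight. A minimal sketch, with all of the bandwidth and latency values below being assumed round numbers rather than measurements:

```python
import math

def bytes_in_flight(bandwidth_gbs, latency_ns):
    # GB/s x ns = bytes, since 1e9 B/s x 1e-9 s cancels out.
    return bandwidth_gbs * latency_ns

def requests_in_flight(bandwidth_gbs, latency_ns, request_bytes=64):
    return math.ceil(bytes_in_flight(bandwidth_gbs, latency_ns) / request_bytes)

# Assumed figures: the same 16 GB/s target at ~300 ns (local-memory-like)
# versus ~3000 ns (a PCIe round trip in the low microseconds).
local = requests_in_flight(16, 300)
over_pcie = requests_in_flight(16, 3000)

print(local, over_pcie)
```

An order of magnitude more latency means an order of magnitude more outstanding 64-byte requests to hold the same bandwidth, which is where the extra buffering and wavefront state would have to come from.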
Old 15-Jan-2014, 21:16   #1150
homerdog
hardly a Senior Member
 
Join Date: Jul 2008
Location: still camping with a mauler
Posts: 4,387
Default

Quote:
Originally Posted by OpenGL guy View Post
We have to support our customers and that means people upgrading machines with older OSes.
Steam survey shows 75% of users on 64bit versions of Windows. 30% of the 32bit users are still on XP, which won't be supported anyway.

I wouldn't bother supporting 32bit any more, just like I wouldn't bother with DX9 (and wouldn't have for years now).
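Working those survey shares through, using only the two percentages quoted above (the variable names are just illustrative):

```python
share_64bit = 0.75        # users on 64bit Windows, as quoted above
xp_share_of_32bit = 0.30  # portion of the 32bit users still on XP

share_32bit = 1 - share_64bit
xp_32bit = share_32bit * xp_share_of_32bit  # 32bit users on XP
other_32bit = share_32bit - xp_32bit        # 32bit users on a newer OS

print(f"32bit XP: {xp_32bit:.1%} of all users")
print(f"32bit, non-XP: {other_32bit:.1%} of all users")
```

So the group that would actually lose out from dropping 32bit support is the non-XP 32bit remainder, a bit under a fifth of the surveyed users.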
__________________
Releasing a game in 2010 without AA is a completely foreign concept to me. If the technique you're using makes it impossible to use AA then you're using the wrong techniques. As simple as that. Releasing a PC game without AA options is OK only if that means you can only have it enabled
-Humus
