Old 02-Jul-2012, 02:18   #26
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by sebbbi View Post
Please don't tell me you think a 70-100 GB/s unified memory architecture is considered "slow" by today's standards.
It is only reasonably aggressive but it is where we will be within a year with DDR4.

Quote:
Not even Intel's highest end 12 thread Sandy Bridge E and the fully enabled 16 thread Xeon server CPU versions are equipped with a memory system that fast. Quad channel DDR3-1600 is the fastest officially supported, and it provides a 51 GB/s theoretical bandwidth (37 GB/s in benchmarks, not far from AMDs utilization percentages: http://www.anandtech.com/show/5091/i...gh-end-alive/4). These chips cost 1000$+ and the motherboards supporting quad channel memory aren't cheap either.
Xeons play in a very different envelope of the design space than something like a gaming focused console. They make a lot of trade-offs to support larger memory capacities. Large being defined as approaching the TB range, which requires multiple DIMMs per channel and advanced ECC capabilities. The advanced ECC capabilities in turn push the memory interfaces toward wide 128b channels and burst-chop mode, lowering efficiency.
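
For reference, the 51 GB/s theoretical figure in the quote follows from simple channel arithmetic. A rough sketch in Python, assuming standard 64-bit (8-byte) DDR3 channels; the function name is only for illustration:

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_sec, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s (10^9 bytes/s) of a DDR-style bus."""
    return channels * mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9

# Quad channel DDR3-1600, as in the quoted post
quad_ddr3_1600 = peak_bandwidth_gb_s(channels=4, mega_transfers_per_sec=1600)
measured = 37.0  # GB/s, the benchmark figure quoted above

print(f"theoretical: {quad_ddr3_1600:.1f} GB/s")
print(f"utilization: {measured / quad_ddr3_1600:.0%}")
```

So the benchmark result lands a bit above 70% of the theoretical peak.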

Also it is important to recognize that there can be little correlation between cost and price.


Quote:
Relatively large manual high speed "caches" such as the Xbox 360 EDRAM are very good for reducing redundant bandwidth usage (especially for GPU rendering). EDRAM removes all the memory bandwidth waste you get from blending, overdraw, MSAA and z-buffering. Basically you get all these for free. The bandwidth free overdraw of course also helps with shadowmaps as well, but since Xbox 360 cannot sample from EDRAM, you have to eventually copy the shadowmap to main memory (consumes memory bandwidth) and sample it from there (consumes memory bandwidth just like any static texture). Same is true for g-buffer rendering and sampling (must be copied eventually to main memory and sampled from there consuming memory bandwidth).
A larger edram without integrated ROPs would allow sampling from edram.
__________________
Aaron Spink
speaking for myself inc.
Old 02-Jul-2012, 03:31   #27
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by aaronspink View Post
The embedded GPUs really aren't aimed at gamers. They are primarily designed for cost/packaging/board/thermal reasons. If you want to game, you will always be better off with a discrete GPU until such a time as moderate sized on package memories become viable/standard. Though we are getting closer: a single wide IO DRAM will be able to provide in the range of 100-200 GB/s of bandwidth and between 512-1024 MB of capacity. Combine that with a main memory in the range of 50 GB/s and the integrated GPUs will finally be able to stretch their legs. Realistically, that is all about 3-5 years out for mainstream computers at the front edge. Lots of other markets, though, would prefer it if PCs got there sooner so they could leverage off of them.
Is that a considered estimate or a generic "3-5 years away" statement that people often use when they don't know how long it is going to take?
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 02-Jul-2012, 04:46   #28
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by rpg.314 View Post
Is that a considered estimate or a generic "3-5 years away" statement that people often use when they don't know how long it is going to take?
That's my realistic estimate of when things like wide I/O on package and stacked memory will start to hit the mainstream.
Old 02-Jul-2012, 05:12   #29
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by aaronspink View Post
That's my realistic estimate of when things like wide I/O on package and stacked memory will start to hit the mainstream.
Do you consider interposer based solutions to be inadequate/immature/limited-by-something/too-narrow/too-unstacked or are you defining mainstream, as say >50% marketshare of all PCs?
Old 02-Jul-2012, 07:05   #30
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by rpg.314 View Post
Do you consider interposer based solutions to be inadequate/immature/limited-by-something/too-narrow/too-unstacked or are you defining mainstream, as say >50% marketshare of all PCs?
My definition of mainstream is on the order of 100-300m parts per year.
Old 02-Jul-2012, 10:19   #31
Rodéric
a.k.a. Ingenu
 
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,727
Default

Quote:
Originally Posted by ERP View Post
As an aside one of the things that irritates me about new college grads is the lack of understanding of basic memory architecture, and behavior. None of this stuff is rocket science.
They are probably too busy learning some massively verbose programming language, and don't have time for computer architecture or algorithms...
__________________
So many things to do, and yet so little time to spend...
Old 02-Jul-2012, 20:39   #32
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by aaronspink View Post
A larger edram without integrated ROPs would allow sampling from edram.
That's true, but none of the current PC or console GPUs work like that.

GPU accessible read/write EDRAM would practically nullify all the bandwidth costs of deferred g-buffer generation/sampling and post process rendering (and other full screen effects that are consumed later in the pipeline). And it would be great for GPU compute. However it wouldn't nullify the bandwidth cost of shadow maps, unless you had a huge amount of EDRAM. A single 4096x4096 shadow map atlas takes 64 MB, and even that isn't enough if you want to render above 720p with good shadow map quality.
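
The 64 MB figure works out if you assume 4 bytes per texel (for example a 32-bit depth format). That format is an assumption here; a 16-bit format would halve the size. A quick check in Python, with the Xbox 360's 10 MB of EDRAM for scale:

```python
def texture_size_mib(width, height, bytes_per_texel):
    """Uncompressed size of a 2D texture in MiB."""
    return width * height * bytes_per_texel / (1024 * 1024)

# Assumed: 4 bytes per texel (e.g. 32-bit depth)
shadow_atlas_mib = texture_size_mib(4096, 4096, bytes_per_texel=4)
xbox360_edram_mib = 10  # EDRAM size on Xbox 360

print(f"shadow atlas: {shadow_atlas_mib:.0f} MiB")
print(f"that is {shadow_atlas_mib / xbox360_edram_mib:.1f}x the Xbox 360 EDRAM")
```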

---

Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer in the TOP500 list. It doubles the performance of the previous champ, and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four way this time).

16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual channel DDR3-1333 memory bus and 32 MB of EDRAM. This is an interesting design if we analyze its memory performance. A large chunk of EDRAM gives it very fast local work memory. Compared to Cell SPU local stores (256 KB) the EDRAM is 128x larger. That's a huge deal, and allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four way SMT provides the chip with good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth).
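
The ratios above are easy to sanity check. A small Python sketch; the main memory figure assumes standard 64-bit (8-byte) DDR3 channels, which is an assumption rather than something stated in the article:

```python
cores_per_cpu = 16
threads_per_core = 4
edram_kib = 32 * 1024        # 32 MB of EDRAM per CPU
spu_local_store_kib = 256    # Cell SPU local store

threads_per_cpu = cores_per_cpu * threads_per_core
edram_vs_local_store = edram_kib // spu_local_store_kib

# Dual channel DDR3-1333, assuming 8 bytes per transfer per channel
main_memory_gb_s = 2 * 1333e6 * 8 / 1e9
mb_s_per_thread = main_memory_gb_s * 1e3 / threads_per_cpu

print(f"{threads_per_cpu} threads per CPU")
print(f"EDRAM is {edram_vs_local_store}x a Cell SPU local store")
print(f"main memory: {main_memory_gb_s:.1f} GB/s, about {mb_s_per_thread:.0f} MB/s per thread")
```

With that little main memory bandwidth per hardware thread, it is easy to see why the large EDRAM matters so much for this design.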
Old 02-Jul-2012, 21:50   #33
Billy Idol
Senior Member
 
Join Date: Mar 2009
Location: Europe
Posts: 2,601
Default

Quote:
Originally Posted by sebbbi
Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer in the TOP500 list. It doubles the performance of the previous champ, and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four way this time).

16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual channel DDR3-1333 memory bus and 32 MB of EDRAM. This is an interesting design if we analyze its memory performance. A large chunk of EDRAM gives it very fast local work memory. Compared to Cell SPU local stores (256 KB) the EDRAM is 128x larger. That's a huge deal, and allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four way SMT provides the chip with good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth).
Now I just hope that they let me use this pretty lady for my computations
Damn, one can dream...right!
Fortunately Jülich gets a similar machine...not that big, but still nice

Gimme gimme gimme!!!!!
__________________
I bid farewell with a rebel yell...
Old 03-Jul-2012, 03:53   #34
Xenus
Senior Member
 
Join Date: Nov 2004
Location: Ohio
Posts: 1,205
Default

Quote:
Originally Posted by Rodéric View Post
They are probably too busy learning some massively verbose programming language, and don't have time for computer architecture neither algorithms...
Being a recent college grad myself, I'd say it's more that they teach you to code by reinventing the wheel, with very little large project coding. In most of my coding classes the projects could be done in a day or 2. Only 2-3 classes really had any large scale coding that took tons of time and planning, and even then they weren't performance driven but output correctness driven.
Old 03-Jul-2012, 10:12   #35
Rodéric
a.k.a. Ingenu
 
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,727
Default

Quote:
Originally Posted by Xenus View Post
Being a recent college grad myself. I'd say it's more them teaching you to code by reinventing the wheel and very little large project coding. Most of my coding classes the projects could be done in a day or 2. Only 2-3 classes really had any large scale coding that took tons of time and planning and even then they weren't performance driven but output correctness driven.
To be fair it makes sense to teach how to do it right before teaching how to do it fast.
(But I'd expect at least a couple classes about performance/bottlenecks on modern computers.)
We are off-topic ^^
Old 03-Jul-2012, 10:38   #36
upnorthsox
Senior Member
 
Join Date: May 2008
Posts: 1,130
Default

Quote:
Originally Posted by Rodéric View Post
To be fair it makes sense to teach how to do it right before teaching how to do it fast.
(But I'd expect at least a couple classes about performance/bottlenecks on modern computers.)
We are off-topic ^^
Only slightly. To bring it back around: which offers the better optimization/performance gains, and which is easier for young programmers in particular to optimize for, more memory or more bandwidth?
Old 03-Jul-2012, 12:48   #37
dragonelite
Senior Member
 
Join Date: Dec 2009
Location: netherlands
Posts: 1,443
Default

Quote:
Originally Posted by Xenus View Post
Being a recent college grad myself. I'd say it's more them teaching you to code by reinventing the wheel and very little large project coding. Most of my coding classes the projects could be done in a day or 2. Only 2-3 classes really had any large scale coding that took tons of time and planning and even then they weren't performance driven but output correctness driven.
I can second this, hell, my C++ classes were a complete drama imo.
It was like: here, you guys know C# and Java, here is how you make an array in C++, here are some assignments, go make them and totally don't make use of pointers.
It was like getting programming 101 all over again.
Old 03-Jul-2012, 20:31   #38
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by ERP View Post
That's certainly a valid point, but IME outside of a few small blocks of code that transform streams of data, modern CPUs rarely suffer L2 cache misses, and almost never miss the ICache.
Most of the win in the streaming case is not polluting the cache with data you will never read.
(Finally got time to write an answer for this one)

Instruction cache is not a concern, I can agree with that one, but modern CPUs miss L2 very frequently.

As I said earlier in my post, a 4 MB L2 gets fully evicted around 50 times every frame (if no memory bandwidth is wasted). You can't count on data staying in L2 for a long time. If you access the same cache line at the start of the frame, the middle of the frame and the end of the frame, you will pay for 3 memory fetches (and likely also 3x L2 misses if your structures are not cache optimized).
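
To show where a number like "around 50 times per frame" can come from, here is a back-of-the-envelope sketch in Python. The bandwidth (12.8 GB/s sustained) and frame rate (60 fps) are hypothetical values picked for illustration, not figures taken from the earlier post:

```python
def l2_refills_per_frame(bandwidth_bytes_per_sec, fps, l2_bytes):
    """How many times the memory bus can replace the whole L2 contents in one frame."""
    return bandwidth_bytes_per_sec / fps / l2_bytes

# Hypothetical inputs: 12.8 GB/s sustained bandwidth, 60 fps, 4 MiB L2
refills = l2_refills_per_frame(12.8e9, fps=60, l2_bytes=4 * 1024 * 1024)
frame_ms = 1000 / 60

print(f"about {refills:.0f} full L2 refills per frame")
print(f"one refill roughly every {frame_ms / refills:.2f} ms")
```

Under those assumptions, anything you touched more than a fraction of a millisecond ago may already be gone from L2.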

Some time ago we did extensive benchmarking for different styles of data structures and access patterns (on different processors ranging from mobile chips to Sandy Bridge). Even a Sandy Bridge benefits hugely from predictable access patterns. A pointer list for example is up to 4x slower in our benchmarks compared to a cache line aligned bucketed list (on modern PC CPUs). Bucketed list generates more instructions (both more ALU and memory instructions), but the predictable access pattern (combined with manual cache prefetching) makes it considerably faster. All ALU instructions get masked out (by L2 stalls) and thus are practically free.

Similar performance can be seen if you compare a (balanced) tree based search structure with (open address) hashing. Search operation has similar instruction counts for both of the structures (with moderately large data sets), but hashing is usually 5x-10x faster, because it has (often) a single memory access (and a single L2 miss). Trees have very slow pointer->pointer->...->pointer style traversal (lots of L2 cache misses).
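
A minimal sketch of the access pattern difference, in Python. It only counts dependent memory accesses (probes versus pointer hops), since Python itself won't show the cache effect. The linear probing scheme, table size and key set are assumptions chosen for illustration, not the benchmark setup described above:

```python
import math
import random

def build_table(keys, capacity):
    """Open addressing with linear probing; capacity must exceed len(keys)."""
    table = [None] * capacity
    for key in keys:
        slot = hash(key) % capacity
        while table[slot] is not None:
            slot = (slot + 1) % capacity   # next slot is adjacent in memory
        table[slot] = key
    return table

def probes_to_find(table, key):
    """Number of slots inspected until key (or an empty slot) is reached."""
    slot = hash(key) % len(table)
    probes = 1
    while table[slot] is not None and table[slot] != key:
        slot = (slot + 1) % len(table)
        probes += 1
    return probes

rng = random.Random(1)
keys = rng.sample(range(10**9), 50_000)
table = build_table(keys, capacity=2 * len(keys))   # load factor 0.5

avg_probes = sum(probes_to_find(table, k) for k in keys) / len(keys)

# A perfectly balanced binary tree over the same keys has this many levels,
# and a lookup may follow one dependent pointer per level.
tree_levels = math.ceil(math.log2(len(keys) + 1))

print(f"hash: {avg_probes:.2f} probes on average, tree: up to {tree_levels} pointer hops")
```

The probes of the hash lookup are neighbours in memory and usually share a cache line, while each tree hop can land on a different line, which is where the L2 misses pile up.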

The most important thing (when performance is considered) about a component based architecture (in comparison to inheritance based large objects) is that it slices objects into smaller pieces, so that memory access becomes more efficient.

Example:

You have an inheritance based object structure and, for example, you want an object to have both physics and graphics behavior, and you want it to have a transform (and children hierarchy) as well. Let's say all this data makes the object 250 bytes long (matrices alone take 64 bytes each, so this is a realistic estimate).

Now you want to determine the visibility of all your potentially visible objects. Say you have 25000 potentially visible objects in total and 5000 of them will be visible (20%). Now you iterate through them, and the visibility determination algorithm reads the position and bounding radius from each of the objects. Position is a 3d vector and thus takes 12 bytes (3 x float32). Bounding radius is 4 bytes (single float32). Sandy Bridge has 64 byte cache lines. Each object takes 4 cache lines, but the code only accesses a single cache line of an object, and only 16 bytes of it. A good modern automatic prefetcher detects the correct stride quickly, so it only reads one cache line per object from memory. So 64 bytes per object get read. However the code only utilizes 16 bytes per object (position + radius = 16 bytes), so 75% of the memory bandwidth gets wasted.

If you have a component based architecture, the transform components (containing position and radius = 16 bytes) are stored separately in a linear array. Now a single cache line contains only data you need (four transform components), and 100% of the bandwidth gets utilized. As a nice bonus this kind of linear batch processing is very well suited for SOA vector processing (8 wide AVX is very efficient at processing it). Around 70-80% of the raw ALU performance of modern CPUs comes from the vector execution units. Performance critical parts of a game engine should be designed to exploit vector execution as much as possible.
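
The numbers from the example, worked through in Python. This is just the arithmetic from the two paragraphs above; the variable names are made up for illustration:

```python
CACHE_LINE = 64            # bytes, Sandy Bridge
OBJECT_COUNT = 25_000      # potentially visible objects
USED_PER_OBJECT = 12 + 4   # position (3 x float32) + bounding radius (float32)

useful_bytes = OBJECT_COUNT * USED_PER_OBJECT

# Large 250-byte objects: one 64-byte cache line gets fetched per object.
big_object_fetched = OBJECT_COUNT * CACHE_LINE
# Component layout: 16-byte records packed back to back in a linear array.
component_fetched = OBJECT_COUNT * USED_PER_OBJECT

big_object_waste = 1 - useful_bytes / big_object_fetched
component_waste = 1 - useful_bytes / component_fetched

print(f"big objects: {big_object_fetched / 1e6:.1f} MB read, {big_object_waste:.0%} wasted")
print(f"components : {component_fetched / 1e6:.1f} MB read, {component_waste:.0%} wasted")
print(f"components per cache line: {CACHE_LINE // USED_PER_OBJECT}")
```

Same culling pass, a quarter of the memory traffic, and the data ends up in a layout that vector units can consume directly.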

Link to how Battlefield 3 does their vectorized culling (data driven / component based architecture):
http://publications.dice.se/attachme...attlefield.pdf

They even use 16 bit floats (halfs) to optimize the memory accesses (even if that means some extra ALU usage for decompression). Xbox 360 has vector instructions for float16 <-> float32 conversion, and so do Ivy Bridge, Bulldozer and Piledriver (and soon also Haswell). Packing data as small as possible is now more important than ever.
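
A small illustration of the size versus precision trade-off of halfs, using Python's standard struct module (the "e" format code is IEEE 754 half precision). The position and radius values are made up, and this says nothing about how the Battlefield 3 code actually packs its data:

```python
import struct

# Made-up bounding sphere: x, y, z, radius
sphere = (1234.567, -47.5, 8.125, 2.75)

as_float32 = struct.pack("<4f", *sphere)
as_float16 = struct.pack("<4e", *sphere)   # "e" = IEEE 754 half precision

unpacked = struct.unpack("<4e", as_float16)
max_error = max(abs(a - b) for a, b in zip(sphere, unpacked))

print(f"float32: {len(as_float32)} bytes, float16: {len(as_float16)} bytes")
print(f"spheres per 64-byte cache line: {64 // len(as_float32)} vs {64 // len(as_float16)}")
print(f"largest round-trip error: {max_error:.3f}")
```

Twice as many spheres fit in each cache line, at the cost of precision that gets coarser as the values get larger, so in practice the data would need to be stored relative to some local origin or otherwise range limited.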

Last edited by sebbbi; 03-Jul-2012 at 20:51.
Old 03-Jul-2012, 20:44   #39
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by Brimstone View Post
Wasn't the Jon Olick demo of the ID Tech6 stuff over 1 gig with just a single model on screen?
That's true. However... SVO renderers are highly memory bandwidth bound. GPU based SVO renderers are significantly faster because of GDDR5 and because of GPUs' excellent memory latency hiding (thread slackness).

We need both memory bandwidth and memory capacity to make SVO renderers viable. Data compression is one of the main areas of research in voxel rendering. We have lots of extra ALU to burn to decompress the data (as voxel renderers are memory bound).

SVO streaming isn't much harder than virtual texture streaming. It's basically pretty much the same. Viewport changes aren't actually much more critical than they are for virtual texturing. The data gets gradually sharper if streaming bandwidth hits a cap. Fortunately the human brain takes a lot of time to process completely new scenes, and details can be kept blurry for tens of frames without any problems (we cannot "see" it). Look around a corner (into an unknown scene), and you will notice that you cannot instantly focus your eyes on small details.

Last edited by sebbbi; 03-Jul-2012 at 20:50.
Old 03-Jul-2012, 21:54   #40
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,570
Default

Quote:
Originally Posted by sebbbi View Post
(Finally got time to write an answer for this one)

Instruction cache is not a concern, I can agree with that one, but modern CPUs do L2 misses very frequently.
Yep, it really has to do with access pattern. Pointer chains are simply evil. Also, data layout has a huge impact, as you said. You want to organize your data structures based on what will be accessed together as blocks, not necessarily on what is easier for you as a programmer.
Old 03-Nov-2012, 03:39   #41
Squilliam
Beyond3d isn't defined yet
 
Join Date: Jan 2008
Location: New Zealand
Posts: 3,037
Default

If a console were to use an interposer for memory attachment/stacking, would it be worthwhile to include embedded RAM in the interposer itself?
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense.
Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years!
Old 28-Jan-2013, 21:59   #42
liquidboy
Junior Member
 
Join Date: Jan 2013
Posts: 49
Default

Quote:
Originally Posted by sebbbi View Post
That's true, but none of the current PC or console GPUs work like that.

GPU accessible read/write EDRAM would practically nullify all the bandwidth costs of the deferred g-buffer generation/sampling and post process rendering (etc full screen effects that are consumed later in the pipeline). And it would be great for GPU compute. However it wouldn't nullify the bandwidth cost of shadow maps, unless you had huge amount of EDRAM. A single 4096x4096 shadow map atlas takes 64 MB, and even that isn't enough if you want to have above 720p rendering with good shadow map quality.

---

Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer in the TOP500 list. It doubles the performance of the previous champ, and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four way this time).

16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual channel DDR3-1333 memory bus and 32 MB of EDRAM. This is an interesting design if we analyze its memory performance. A large chunk of EDRAM gives it very fast local work memory. Compared to Cell SPU local stores (256 KB) the EDRAM is 128x larger. That's a huge deal, and allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four way SMT provides the chip with good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth).

Do you have a link to that article?

Those 16 cores, 4 threads per core and vector units sound like VTEs?! Very interested to read that article.
Old 23-Feb-2013, 04:33   #43
Kb-Smoker
Member
 
Join Date: Aug 2005
Posts: 309
Default

Any thoughts on the PS4 setup?