#26 | |||
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
Quote:
Quote:
Also it is important to recognize that there can be little correlation between cost and price. Quote:
__________________
Aaron Spink speaking for myself inc. |
|||
|
|
|
|
|
#27 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#28 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
That's my realistic estimate of when things like wide I/O on package and stacked memory will start to hit the mainstream.
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#29 |
|
Senior Member
|
Do you consider interposer-based solutions to be inadequate/immature/limited-by-something/too-narrow/too-unstacked, or are you defining mainstream as, say, >50% market share of all PCs?
|
|
|
|
|
|
#30 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
My definition of mainstream is on the order of 100-300 million parts per year.
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#31 | |
|
a.k.a. Ingenu
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,727
|
Quote:
__________________
So many things to do, and yet so little time to spend... |
|
|
|
|
|
|
#32 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
GPU-accessible read/write EDRAM would practically nullify all the bandwidth costs of deferred g-buffer generation/sampling and post-process rendering (and other full-screen effects that are consumed later in the pipeline). And it would be great for GPU compute. However, it wouldn't nullify the bandwidth cost of shadow maps unless you had a huge amount of EDRAM. A single 4096x4096 shadow map atlas takes 64 MB, and even that isn't enough if you want above-720p rendering with good shadow map quality.
---
Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer in the TOP500 list. It doubles the performance of the previous champ and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four-way this time). 16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual-channel DDR3-1333 memory bus and 32 MB of EDRAM.
This is an interesting design if we analyze its memory performance. The large chunk of EDRAM gives it very fast local work memory. Compared to the Cell SPU local stores (256 KB), the EDRAM is 128x larger. That's a huge deal, and it allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four-way SMT gives the chip good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth).
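The sizes quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4 bytes per shadow map texel is an assumption (the post does not state the texel format), and the helper name `texture_bytes` is made up for illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the post.
# Assumption: 4 bytes (32 bits) per shadow map texel; the post does not state the format.
KIB = 1024
MIB = 1024 * KIB


def texture_bytes(width, height, bytes_per_texel):
    """Size of an uncompressed 2D texture without mipmaps (hypothetical helper)."""
    return width * height * bytes_per_texel


# 4096 x 4096 shadow map atlas at 4 bytes per texel.
shadow_atlas_mib = texture_bytes(4096, 4096, 4) / MIB  # 64.0

# 32 MB of EDRAM per CPU versus a 256 KB Cell SPU local store.
edram_vs_local_store = (32 * MIB) // (256 * KIB)  # 128

# 16 cores with four-way SMT.
threads_per_cpu = 16 * 4  # 64

print(shadow_atlas_mib, edram_vs_local_store, threads_per_cpu)
```

With a 2-byte depth format the same atlas would need half the space, so the 64 MB figure depends on the texel size assumed here.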
|
|
|
|
|
|
#33 | |
|
Senior Member
Join Date: Mar 2009
Location: Europe
Posts: 2,601
|
Quote:
Damn, one can dream... right! Fortunately Jülich gets a similar machine... not that big, but still nice. Gimme gimme gimme!!!!!
__________________
I bid farewell with a rebel yell... |
|
|
|
|
|
|
#34 |
|
Senior Member
Join Date: Nov 2004
Location: Ohio
Posts: 1,205
|
Being a recent college grad myself, I'd say it's more that they teach you to code by reinventing the wheel, with very little large-project coding. In most of my coding classes the projects could be done in a day or two. Only 2-3 classes really had any large-scale coding that took tons of time and planning, and even then they weren't performance driven but output-correctness driven.
|
|
|
|
|
|
#35 | |
|
a.k.a. Ingenu
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,727
|
Quote:
(But I'd expect at least a couple classes about performance/bottlenecks on modern computers.) We are off-topic ^^
__________________
So many things to do, and yet so little time to spend... |
|
|
|
|
|
|
#36 |
|
Senior Member
Join Date: May 2008
Posts: 1,130
|
Only slightly. To bring it back around: what offers the better optimization/performance gains, and which is easier to optimize for, especially for young programmers: more memory or more bandwidth?
|
|
|
|
|
|
#37 | |
|
Senior Member
Join Date: Dec 2009
Location: netherlands
Posts: 1,443
|
Quote:
It was like: here, you guys know C# and Java, here is how you make an array in C++, here are some assignments, go make them and totally don't make use of pointers. It was like getting programming 101 all over again.
|
|
|
|
|
|
#38 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
Instruction cache is not a concern, I can agree with that one, but modern CPUs take L2 misses very frequently. As I said earlier in my post, a 4 MB L2 gets fully evicted around 50 times every frame (if no memory bandwidth is wasted). You can't count on having data in L2 for a long time. If you access the same cache line at the start of the frame, the middle of the frame and the end of the frame, you will pay for 3 memory fetches (and likely also 3x L2 misses if your structures are not cache optimized).
Some time ago we did extensive benchmarking of different styles of data structures and access patterns (on different processors ranging from mobile chips to Sandy Bridge). Even a Sandy Bridge benefits hugely from predictable access patterns. A pointer list, for example, is up to 4x slower in our benchmarks compared to a cache-line-aligned bucketed list (on modern PC CPUs). The bucketed list generates more instructions (both more ALU and memory instructions), but the predictable access pattern (combined with manual cache prefetching) makes it considerably faster. All ALU instructions get masked out (by L2 stalls) and thus are practically free.
Similar performance can be seen if you compare a (balanced) tree-based search structure with (open address) hashing. The search operation has similar instruction counts for both structures (with moderately large data sets), but hashing is usually 5x-10x faster, because it (often) has a single memory access (and a single L2 miss). Trees have very slow pointer->pointer->...->pointer style traversal (lots of L2 cache misses).
The most important thing (when performance is considered) about a component-based architecture (in comparison to inheritance-based large objects) is to slice objects into smaller slices so that memory access becomes more efficient.
Example: You have an inheritance-based object structure and, for example, you want an object to have both physics and graphics behavior, and you want it to have a transform (and children hierarchy) as well.
Let's say all this data makes the object 250 bytes long (matrices alone take 64 bytes each, so this is a realistic estimate). Now you want to determine the visibility of all your potentially visible objects. Say you have 25000 potentially visible objects in total and 5000 of them will be visible (20%). Now you iterate through them, and the visibility determination algorithm reads the position and bounding radius from each of the objects. Position is a 3D vector and thus takes 12 bytes (3 x float32). Bounding radius is 4 bytes (a single float32). Sandy Bridge has 64-byte cache lines. Each object takes 4 cache lines, but the code only accesses a single cache line of an object, and only 16 bytes of it. A good modern automatic prefetcher detects the correct stride quickly, so it only reads one cache line per object from memory. So 64 bytes per object get read. However, the code only utilizes 16 bytes per object (position + radius = 16 bytes), so 75% of the memory bandwidth gets wasted.
If you have a component-based architecture, the transform components (containing position and radius = 16 bytes) are stored separately in a linear array. Now a single cache line contains only data you need (four transform components), and 100% of the bandwidth gets utilized. As a nice bonus, this kind of linear batch processing is very well suited for SoA vector processing (8-wide AVX is very efficient at processing it). Around 70-80% of the raw ALU performance of modern CPUs comes from the vector execution units. Performance-critical parts of a game engine should be designed to exploit vector execution as much as possible.
Here is a link showing how Battlefield 3 does its vectorized culling (data driven / component based architecture): http://publications.dice.se/attachme...attlefield.pdf They even use 16-bit floats (halfs) to optimize the memory accesses (even if that means some extra ALU usage for decompression).
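The cache line arithmetic from the visibility example, plus a rough sketch of a structure-of-arrays layout, could look like the following. Everything named here (`TransformComponents`, `visible_indices`, the sphere-versus-sphere test) is a hypothetical stand-in for illustration: a real engine would test against frustum planes and process the arrays with SIMD, which plain Python cannot show.

```python
import math
from array import array

CACHE_LINE = 64      # bytes per cache line (Sandy Bridge, from the post)
OBJECT_SIZE = 250    # bytes per inheritance-style object (from the post)
USED_BYTES = 12 + 4  # position (3 x float32) + bounding radius (1 x float32)

# Figures from the post, recomputed.
lines_per_object = math.ceil(OBJECT_SIZE / CACHE_LINE)  # 4
wasted_fraction = 1 - USED_BYTES / CACHE_LINE           # 0.75
components_per_line = CACHE_LINE // USED_BYTES          # 4


class TransformComponents:
    """Hypothetical SoA store: one contiguous float32 array per field."""

    def __init__(self):
        self.x = array("f")
        self.y = array("f")
        self.z = array("f")
        self.radius = array("f")

    def add(self, x, y, z, radius):
        self.x.append(x)
        self.y.append(y)
        self.z.append(z)
        self.radius.append(radius)


def visible_indices(tc, cx, cy, cz, view_radius):
    """Stand-in visibility test: keep spheres that overlap a viewer sphere.

    Only the four linear arrays are read; no other per-object data is touched.
    """
    result = []
    for i in range(len(tc.x)):
        dx, dy, dz = tc.x[i] - cx, tc.y[i] - cy, tc.z[i] - cz
        reach = view_radius + tc.radius[i]
        if dx * dx + dy * dy + dz * dz <= reach * reach:
            result.append(i)
    return result


tc = TransformComponents()
tc.add(0.0, 0.0, 0.0, 1.0)
tc.add(10.0, 0.0, 0.0, 1.0)
tc.add(100.0, 0.0, 0.0, 1.0)
print(visible_indices(tc, 0.0, 0.0, 0.0, 10.0))  # [0, 1]
```

The point of the layout is that the culling loop walks four dense arrays front to back, which is the predictable access pattern the post argues for.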
As for half floats, Xbox 360 has vector instructions for float16 <-> float32 conversion, and so do Ivy Bridge, Bulldozer and Piledriver (and soon also Haswell). Packing data as small as possible is now more important than ever.
Last edited by sebbbi; 03-Jul-2012 at 20:51.
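To make the half-float packing idea concrete, here is a small sketch using the half-precision format code `"e"` of Python's standard `struct` module. The record layout (position plus radius as four values) and the function names are assumptions for illustration; this is not the format Battlefield 3 actually uses.

```python
import struct


def pack_sphere_f32(x, y, z, radius):
    """Position + radius as four float32 values: 16 bytes."""
    return struct.pack("<4f", x, y, z, radius)


def pack_sphere_f16(x, y, z, radius):
    """Same data as four float16 values: 8 bytes (hypothetical layout)."""
    return struct.pack("<4e", x, y, z, radius)


def unpack_sphere_f16(data):
    """Decompress back to Python floats; this is the extra ALU work."""
    return struct.unpack("<4e", data)


full = pack_sphere_f32(1.5, -2.25, 1024.0, 0.5)
half = pack_sphere_f16(1.5, -2.25, 1024.0, 0.5)
print(len(full), len(half))  # 16 8

# Values with short binary expansions survive the round trip exactly,
# while something like 0.1 only comes back approximately.
roundtrip = unpack_sphere_f16(half)
approx_tenth = struct.unpack("<e", struct.pack("<e", 0.1))[0]
print(roundtrip, approx_tenth)
```

Halving the record size doubles how many bounding spheres fit in one 64-byte cache line, at the cost of reduced range and precision.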
|
|
|
|
|
|
#39 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
We need both memory bandwidth and memory amount to make SVO renderers viable. Data compression is one of the main areas of research in voxel rendering. We have lots of extra ALU to burn to decompress the data (as voxel renderers are memory bound).
SVO streaming isn't much harder than virtual texture streaming. It's basically pretty much the same. Viewport changes aren't actually much more critical than they are for virtual texturing. The data gets gradually sharper if streaming bandwidth hits a cap. Fortunately human brains take a lot of time to process completely new scenes, and details can be kept blurry for tens of frames without any problems (we cannot "see" it). Look around a corner (into an unknown scene), and you will see that you cannot instantly focus your eyes on small details.
Last edited by sebbbi; 03-Jul-2012 at 20:50.
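One way to picture "the data gets gradually sharper if streaming bandwidth hits a cap" is a coarse-first loader with a per-frame byte budget. The sketch below is purely illustrative: `stream_frames`, the request tuples and the budget are invented for this example, and a real virtual texture or SVO streamer would split large pages across frames rather than load an oversized item in one go.

```python
import heapq


def stream_frames(requests, budget_per_frame):
    """Schedule page loads coarse-first under a per-frame byte budget.

    requests: iterable of (level, size_bytes, name), where a higher level
    means coarser data. Returns one list of loaded names per frame.
    """
    # Negate the level so the coarsest request sits at the top of the heap;
    # idx keeps ordering stable for equal levels.
    heap = [(-level, idx, size, name)
            for idx, (level, size, name) in enumerate(requests)]
    heapq.heapify(heap)

    frames = []
    while heap:
        remaining = budget_per_frame
        loaded = []
        while heap and heap[0][2] <= remaining:
            _, _, size, name = heapq.heappop(heap)
            remaining -= size
            loaded.append(name)
        if not loaded:
            # Simplification: an item bigger than the whole budget gets a
            # frame to itself so the loop always makes progress.
            _, _, _, name = heapq.heappop(heap)
            loaded.append(name)
        frames.append(loaded)
    return frames


schedule = stream_frames(
    [(0, 400, "fine"), (2, 25, "coarse"), (1, 100, "medium")],
    budget_per_frame=150,
)
print(schedule)  # [['coarse', 'medium'], ['fine']]
```

Under a tight budget the coarse and medium levels arrive first, so the image is blurry for a frame or more before the fine level lands.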
|
|
|
|
|
|
#40 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
Yep, it really has to do with access pattern. Pointer chains are simply evil. Also, data layout has a huge impact, as you said. You want to organize your data structures based on what will be accessed as blocks, not on what is necessarily easiest for the programmer.
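A rough way to see why pointer chains hurt is to count dependent memory touches per lookup in a balanced tree versus an open-address hash table. The sketch below counts hops and probes only; it does not measure time, and it does not reproduce the 5x-10x benchmark figure quoted earlier in the thread. All class and function names are made up for illustration, and the multiplicative key scramble is just a convenient way to generate spread-out integer keys.

```python
class Node:
    """Binary tree node; every child link is a pointer to chase."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def build_balanced(sorted_keys, lo=0, hi=None):
    """Build a balanced binary search tree from sorted keys."""
    if hi is None:
        hi = len(sorted_keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(sorted_keys[mid],
                build_balanced(sorted_keys, lo, mid),
                build_balanced(sorted_keys, mid + 1, hi))


def tree_lookup_hops(root, key):
    """Number of nodes visited; each visit depends on the previous one."""
    hops = 0
    node = root
    while node is not None:
        hops += 1
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return hops


class OpenAddressSet:
    """Linear-probing set; capacity must be a power of two and never full."""

    def __init__(self, capacity):
        self.mask = capacity - 1
        self.slots = [None] * capacity

    def insert(self, key):
        i = key & self.mask
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) & self.mask
        self.slots[i] = key

    def lookup_probes(self, key):
        """Number of slots inspected; neighbours share cache lines."""
        i = key & self.mask
        probes = 1
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) & self.mask
            probes += 1
        return probes


keys = [(i * 2654435761) & 0xFFFFFFFF for i in range(1, 1001)]
root = build_balanced(sorted(keys))
table = OpenAddressSet(2048)
for k in keys:
    table.insert(k)

avg_tree_hops = sum(tree_lookup_hops(root, k) for k in keys) / len(keys)
avg_hash_probes = sum(table.lookup_probes(k) for k in keys) / len(keys)
print(avg_tree_hops, avg_hash_probes)
```

Each tree hop is a dependent load that can miss the cache, while the hash lookup usually touches one slot and its immediate neighbours.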
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#41 |
|
Beyond3d isn't defined yet
Join Date: Jan 2008
Location: New Zealand
Posts: 3,037
|
If a console were to use an interposer for memory attachment/stacking, would it be worthwhile to include embedded RAM in the interposer itself?
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense. Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years! |
|
|
|
|
|
#42 | |
|
Junior Member
Join Date: Jan 2013
Posts: 49
|
Quote:
Do you have the link to that article? Those 16 cores, 4 threads per core and vector units sound like VTEs?! Very interested to read that article.
|
|
|
|
|
|
#43 |
|
Member
Join Date: Aug 2005
Posts: 309
|
Any thoughts on the PS4 setup?
|
|
|
|