*spin off* RAM & Cache Streaming implications

That just goes to show you how utterly inefficient traditional hardware texturing is at this time. Sebbi's game could theoretically use far more detailed textures in that 50MB of memory than a game using 1-2GB of RAM. All that's required for this is large background storage and lots of artist time...
And magic one-button procedural content creation. :p
I am all for an approach like sebbi's work, but creating a new console featuring 50 MB of RAM sounds like a first class pain in the ass.

Although don't worry, we will use it to play Tetris and Pong, and we will be able to launch missiles, like the PS2.

Come on... A vast amount of memory - and we are talking here about 4GB - is not a status symbol. It's the minimum we could expect from a console launched in 2014-2015.

You might not realize it yet, but you have a lot more freedom than most developers ever had if your console features 4GB of RAM.

As fine, stunning and amazing as id's, sebbi's, and others' work sounds, it isn't suitable for every game out there.

Battlefield will run fine with those features - yet it's not memory starved on PC, and it shows - but what about a game like Supreme Commander, an RTS that gobbles up huge amounts of RAM? There aren't any other examples I can think of, but I hope I've made myself understood.

Heck, it makes no sense whatsoever to put such insufficient RAM in a console anymore, regardless of the new techniques being developed, but that's how the hype machine works.
 
I am all for an approach like sebbi's work, but creating a new console featuring 50 MB of RAM sounds like a first class pain in the ass.
Of course a 50MB console isn't going to be launched, and nobody is asking for one.

Come on... A vast amount of memory - and we are talking here about 4GB - is not a status symbol. It's the minimum we could expect from a console launched in 2014-2015.
Ummm, I've advocated 4GB. I've also just had the sense to know that there's no point in taking on added cost if you can't justify it.

Battlefield will run fine with those features - yet it's not memory starved on PC, and it shows - but what about a game like Supreme Commander, an RTS that gobbles up huge amounts of RAM? There aren't any other examples I can think of, but I hope I've made myself understood.
Right. But what if it's eating up huge gobs of RAM because it's using old-fashioned, low-efficiency methods? Same as massive textures in RAM where, theoretically, 50MB is all that's really needed. If every game can be implemented effectively in 2GB, what's the point in adding 4GB?

That's the hypothetical scenario and the correct approach for engineers. I'd also say (and have said before) that it looks like 4GB will provide the best compromise between cost and options for an uncertain future where we don't know what algorithms can and can't, will and won't, be invented. The key point here is that More Power isn't the only solution, and isn't always the best solution, but is the most expensive solution. Virtual Texturing is one example that shows better results can be achieved at less cost (although it trades some extra processing work for a big saving in RAM).
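To make that trade-off concrete, here's a minimal sketch of the indirection step at the heart of virtual texturing - my own illustration with made-up names, not id's or sebbbi's actual code: a page table maps virtual tile coordinates to slots in a small physical tile atlas, so only the tiles visible this frame need to be resident in RAM.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the virtual-texture indirection step: a page table
// maps virtual tile coordinates to slots in a small physical tile atlas, so
// only currently visible tiles need to be resident in memory.
struct PageTableEntry {
    uint16_t atlasX = 0xFFFF;   // physical tile position in the cache atlas
    uint16_t atlasY = 0xFFFF;
    bool     resident = false;  // false -> queue a stream-in request
};

struct VirtualTexture {
    static constexpr int kTileSize = 128;   // texels per tile side (assumption)
    int widthInTiles, heightInTiles;
    std::vector<PageTableEntry> pageTable;  // one entry per virtual tile

    VirtualTexture(int wTiles, int hTiles)
        : widthInTiles(wTiles), heightInTiles(hTiles),
          pageTable(size_t(wTiles) * hTiles) {}

    // Translate a virtual texel coordinate into a physical atlas coordinate.
    // Returns false if the tile is not resident (the caller queues a load).
    bool translate(int vx, int vy, int& px, int& py) const {
        const int tx = vx / kTileSize, ty = vy / kTileSize;
        const PageTableEntry& e = pageTable[size_t(ty) * widthInTiles + tx];
        if (!e.resident) return false;
        px = e.atlasX * kTileSize + (vx % kTileSize);
        py = e.atlasY * kTileSize + (vy % kTileSize);
        return true;
    }
};
```

The resident set then scales with screen resolution and the chosen tile cache size rather than with the total amount of texture content on disk, which is where the small RAM figures in this thread come from.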
 
Hm, is RAM only useful for graphics? What about physics? What if next gen increases physics in games, like destruction or fluid dynamics or whatever... that needs some RAM as well, right?

Or gameplay, or AI... everything game tech related should benefit from higher amounts of RAM! Is it clear that those aspects of game tech will scale their RAM usage next gen the same way graphics does?
 
Hm, is RAM only useful for graphics? What about physics? What if next gen increases physics in games, like destruction or fluid dynamics or whatever... that needs some RAM as well, right?

Or gameplay, or AI... everything game tech related should benefit from higher amounts of RAM! Is it clear that those aspects of game tech will scale their RAM usage next gen the same way graphics does?
I haven't written any serious games professionally, but I don't think I'm too far off if I say that in a first/third person shooter/RPG the AI and physics related data can fit into just a few MB (<25MB is probably a safe bet), with AI taking several times less RAM than physics. Obviously for physics there are also the collision structures, but I don't think they take too much RAM. Though I'd love to hear from an industry expert how far off I am here :)
 
Remember that sebbbi's analysis ignores filtering for the most part, so while the theory is sound you're probably talking a few multiples of the numbers he quoted to get a robust virtual cache that handles trilinear and anisotropic filtering.
 
Remember that sebbbi's analysis ignores filtering for the most part, so while the theory is sound you're probably talking a few multiples of the numbers he quoted to get a robust virtual cache that handles trilinear and anisotropic filtering.
We have custom filtering that fetches mip levels based on the GPU's anisotropic filtering mip calculation. So our system fetches pretty much the same amount of data into our cache as real anisotropic filtering would. In many places our filtering actually fetches slightly too detailed data, causing slight oversampling. Our artists really hate blurry textures... and sadly that exceeds my hate of GPU texture cache stalls :(

Trilinear filtering is easy, you just need to add a single mip level to your tile cache (25% additional memory = 10 MB = not that much). Hardware trilinear can also be used, as long as the tiles have at least one pixel of extra border (aniso requires a bit more, depending on the aniso level). Trilinear doesn't add any streaming cost, since you downsample the tile during loading time instead of loading it (a simple 64x64 quad with a single bilinear tfetch between pixels = really fast).
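As a rough CPU-side illustration of that load-time downsample (assuming a simple 2x2 box filter, which is what one bilinear fetch centred between four texels gives you; the tile size and single-channel format are my assumptions, and an engine would presumably do this as a GPU quad as described above):

```cpp
#include <cstdint>
#include <vector>

// Build the next-lower mip of a freshly loaded tile by averaging 2x2 texel
// blocks - equivalent to one bilinear fetch placed between four texels.
std::vector<uint8_t> downsampleTile(const std::vector<uint8_t>& tile, int size)
{
    const int half = size / 2;
    std::vector<uint8_t> mip(size_t(half) * half);
    for (int y = 0; y < half; ++y) {
        for (int x = 0; x < half; ++x) {
            const int sx = x * 2, sy = y * 2;
            const int sum = tile[sy * size + sx]       + tile[sy * size + sx + 1] +
                            tile[(sy + 1) * size + sx] + tile[(sy + 1) * size + sx + 1];
            mip[y * half + x] = uint8_t((sum + 2) / 4); // rounded average
        }
    }
    return mip;
}
```

Because the lower mip is derived from tile data that is already in memory, it costs a little ALU at load time but no extra streaming bandwidth.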

My analysis is of course only true for HDD-based streaming (or from flash memory) and at 720p resolution. If you are streaming from DVD/BD or over the network, a two times larger cache would be really beneficial, and you would likely want to have physical macrotiles as well (needing another smaller macrotile cache). So in total you would need a bit more than 100 MB of memory.
One "+1" for more RAM is that often you can trade computing resources for extra memory usage.
Yes, however with current computers cache misses are the most common way to kill your performance. A single cache miss can be up to 500 cycles, so often it's much faster just to calculate the result than fetch it from a (random access) lookup table. Small and tightly packed cache-efficient structures are really important right now, and since memory speeds rise much more slowly than ALU throughput, I firmly believe memory optimizations will become even more important in the future. As odd as it sounds, we mainly optimize our memory usage to improve our performance :)
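A trivial illustration of the "just calculate it" point (my example, not engine code): a randomly indexed lookup table risks a cache miss per access, while recomputing from values already in registers only costs a few ALU cycles and touches no extra cache lines.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Option A: precalculated lookup table. Each randomly indexed access can cost
// a cache miss (potentially hundreds of cycles) if the table isn't resident.
float lengthFromTable(const std::vector<float>& sqrtTable, std::size_t squaredLen)
{
    return sqrtTable[squaredLen];
}

// Option B: just compute it. A sqrt is a handful of cycles on modern hardware,
// usually far cheaper than a miss, and it reads nothing beyond its arguments.
float lengthComputed(float x, float y, float z)
{
    return std::sqrt(x * x + y * y + z * z);
}
```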

Of course you need some really good search structures for many algorithms (collision detection, view cone culling, etc), but often it seems that the structures that use less memory also perform better. Many of our old structures contained lots of precalculated data, since calculation was slow on older platforms (our engine has roots in our Warhammer 40K game released on PSP), but now we just calculate the stuff again. A single cache miss can cost as much as 10-500 calculations, so memory accesses should always be minimized.

I personally believe we could manage with 4GB in the next generation... 64 bit pointers take 8 bytes each, what a waste :(
 
Yes, however with current computers cache misses are the most common way to kill your performance. A single cache miss can be up to 500 cycles, so often it's much faster just to calculate the result than fetch it from a (random access) lookup table
True, but didn't you earlier say how you are adding some extra texture detail by combining the base texture with a detail texture during load time? That's pretty much the definition of using extra memory to save on computing resources :)
I personally believe we could manage with 4GB in the next generation... 64 bit pointers take 8 bytes each, what a waste :(
I wonder how bad memory segmenting with 4GB pages would be for performance/ease of coding. It was pretty awful back when I used to code for 16-bit machines, but with, say, 6-8GB of (unified) RAM and two pages it shouldn't be that bad: you can just default one page to the more heavily used stuff and put the less used stuff in the other.

Though I think that 4GB will likely be the sweet spot for next-gen, at least assuming that other HW won't eat into that address space like it does on PC.
 
Of course you need some really good search structures for many algorithms (collision detection, view cone culling, etc), but often it seems that the structures that use less memory also perform better. Many of our old structures contained lots of precalculated data, since calculation was slow on older platforms (our engine has roots in our Warhammer 40K game released on PSP), but now we just calculate the stuff again. A single cache miss can cost as much as 10-500 calculations, so memory accesses should always be minimized.

But even in search structures, it really comes down to cache misses. Unless a large portion of the structure is cache resident for a lot of the frame, what you should be minimizing isn't the number of compares to get to a result, but the number of unique cache lines touched.

Outside a very small percentage of code where there really is very high calculation density, you can pretty much assume everything is free except a cache miss, and be damn close to actual performance.

You can still get wins out of lookup tables if they are used frequently enough in the time they are cache resident to offset the initial misses.

Old-school game practices like zero-runtime-allocation policies (which are getting much less common) can really help performance, because they force people to think about memory access, and you don't run into death-by-a-thousand-cuts performance issues in the same way. Practically, a lot of teams are not in a position to do this anymore, and in general these techniques tend to reduce the amount of usable memory by trading internal fragmentation for external fragmentation.

Having said that, when you get to architectures like recent Intel CPUs with large L2 caches, they are hard to predict across frames, and a lot of the time you just have to measure.

Last time I made dramatic increases in the speed of a piece of code, I trimmed the structure to be cache-line aligned and moved all of the data used for an inclusion test to sit in the first cache-line of the structure.
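Something along these lines, as a hypothetical reconstruction of that kind of layout (not the actual code): the structure is cache-line aligned and everything the inclusion test reads lives in the first 64 bytes, so the common reject path touches exactly one cache line.

```cpp
#include <cstdint>

// Hypothetical layout: cache-line aligned, with all data read by the
// inclusion test packed into the first 64 bytes. The cold data only gets
// touched after an object passes the test.
struct alignas(64) CullableObject {
    // --- hot: read by the bounding-sphere inclusion test -------------------
    float centerX, centerY, centerZ;
    float radius;
    uint32_t flags;
    uint32_t firstChild;

    // --- cold: pushed to the next cache line by the alignas below ----------
    alignas(64) float transform[16];
    uint32_t meshId;
    uint32_t materialId;
};

// Inclusion test against a sphere; reads only the first cache line.
inline bool intersectsSphere(const CullableObject& o,
                             float cx, float cy, float cz, float cr)
{
    const float dx = o.centerX - cx, dy = o.centerY - cy, dz = o.centerZ - cz;
    const float r = o.radius + cr;
    return dx * dx + dy * dy + dz * dz <= r * r;
}
```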

Interestingly, in the DB world they have to optimize the same way, but to minimize disk hits for data.
 
Trilinear doesn't add any streaming cost, since you downsample the tile during loading time instead of loading it (a simple 64x64 quad with a single bilinear tfetch between pixels = really fast).
Sure, although you sacrifice using better filters for the mip generation as the cost. Probably worth it in this case.

Aniso is a bigger deal though, and this is mainly what I was referring to. It changes the analysis of "ideal 1 texel/pixel" since the hardware is integrating through up to 16 different fetches. More importantly, compared to trilinear LOD it bumps the required texture resolution up by a few factors to do this integration along the major axis. So compared to a system that handles trilinear, you need more data to do anisotropic. This data isn't "wasted" in the same way that tile data that is never sampled is - it's actually required.

So indeed 4k by 4k might be enough for 720p scenes (I haven't tested any real ones myself), but I'm just saying that the delta between that and the 1 megapixel framebuffer does not all represent waste, assuming you are doing trilinear and aniso. I imagine a lot more of that actually gets touched by the texture filter kernel than just a million or so elements (have you run those numbers by chance?).

Outside a very small percentage of code where there really is very high calculation density, you can pretty much assume everything is free except a cache miss, and be damn close to actual performance.
Yeah, agreed. Particularly on GPUs, as you mention, we've pretty much moved into database-like performance analysis where you only count data movement. There are actually a lot of parallels between databases and clusters and what is now happening on-die in high performance chips. Those guys just had to deal with these things earlier than us, back when we were all pretending that memory access was uniform :)

I personally believe we could manage with 4GB in the next generation... 64 bit pointers take 8 bytes each, what a waste :(
It's really not a big deal... use index-based structures in the cases where it is (and in a lot of those cases, 32-bit pointers are a "waste" too!). Index-based structures are nicer for parallelism/aliasing analysis, for moving stuff to the GPU, and for reorganizing data structures as well.
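A small sketch of what an index-based structure looks like in practice (my illustration, with hypothetical names): nodes link to each other through 32-bit indices into a flat array instead of 64-bit pointers, which halves the link size and makes the whole structure trivially relocatable or copyable to the GPU without pointer fix-ups.

```cpp
#include <cstdint>
#include <vector>

// Index-based binary tree: children are 32-bit indices into `nodes` rather
// than 64-bit pointers.
struct Node {
    static constexpr uint32_t kInvalid = 0xFFFFFFFFu;
    float    splitValue = 0.0f;
    uint32_t left  = kInvalid;   // index of left child, or kInvalid
    uint32_t right = kInvalid;   // index of right child, or kInvalid
};

struct Tree {
    std::vector<Node> nodes;     // node 0 is the root by convention

    uint32_t addNode(float split) {
        Node n;
        n.splitValue = split;
        nodes.push_back(n);
        return uint32_t(nodes.size() - 1);
    }

    // Walk down the tree; all "pointer chasing" is plain array indexing.
    uint32_t findLeaf(float value) const {
        uint32_t i = 0;
        while (true) {
            const Node& n = nodes[i];
            const uint32_t next = (value < n.splitValue) ? n.left : n.right;
            if (next == Node::kInvalid) return i;
            i = next;
        }
    }
};
```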
 
You do need some address space to access the hardware, don't you? 4GB of RAM while working in 32-bit mode would be funny; you could always decide that 256MB or 512MB of it always goes to waste, but that's not that nice :)
Unless you do banking or something I'm not aware of.

What about mapping file systems or files into the address space? Does that add significant benefits when streaming data from HDD or flash? Or is file access rare enough that you would rather deal with the overhead and have smaller pointers the rest of the time?

That would be one main use of going 64-bit.
 
You do need some address space to access the hardware, don't you? 4GB of RAM while working in 32-bit mode would be funny; you could always decide that 256MB or 512MB of it always goes to waste, but that's not that nice :)
Unless you do banking or something I'm not aware of.

What about mapping file systems or files into the address space? Does that add significant benefits when streaming data from HDD or flash? Or is file access rare enough that you would rather deal with the overhead and have smaller pointers the rest of the time?

The sanest way to handle 4GB of memory is to use a 64-bit processor, but store your pointers as 4 bytes. You get universal, easy access to the first 4GB of the address space, which is mapped to the RAM, and code that needs it can use 8-byte pointers to do memory-mapped files or GPU transfers.
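A sketch of that idea - hypothetical, and it assumes the allocator guarantees that all game allocations land in the low 4GB of the address space: structures store 4-byte addresses and only widen them to real pointers at dereference time.

```cpp
#include <cstdint>

// Hypothetical "small pointer" for a platform where game RAM is mapped into
// the low 4GB of a 64-bit address space: structures store 4 bytes, and the
// widening back to a native pointer happens only when dereferencing.
template <typename T>
struct Ptr32 {
    uint32_t addr = 0;   // low 32 bits of the address; assumes high bits are zero

    static Ptr32 from(T* p) {
        Ptr32 r;
        r.addr = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(p));
        return r;
    }
    T* get() const {
        return reinterpret_cast<T*>(static_cast<uintptr_t>(addr));
    }
    T* operator->() const { return get(); }
    T& operator*()  const { return *get(); }
};
```

On a general-purpose PC this only works if the allocator is constrained to low addresses; the point of the sketch is just the halved in-structure pointer size.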
 
A dream scenario.

What would be the best choice for a next-gen console: a split system with 4GB of DDR4* on a 128-bit, 3.2Gbps interface (51.2GB/s) plus 2GB of GDDR5 at 4GHz (128GB/s) for the GPU, or a unified 4GB of XDR2 at 256/512GB/s (8 DRAM modules)?

* Or 4+ GB of DDR3-1600, like the GameCube's "A-memory" approach... thinking more of constantly streaming textures from HDD/disc media/SSD storage than of applications (Skype, internet browser, interactive friends list, advanced virtual keyboard, etc.)...
 
A dream scenario.
What would be the best choice for a next-gen console: a split system with 4GB of DDR4* on a 128-bit, 3.2Gbps interface (51.2GB/s) plus 2GB of GDDR5 at 4GHz (128GB/s) for the GPU, or
Why 4GB of RAM *and* a 2GB 256-bit GDDR5 interface? Are you completely ignoring the costs?
a unified 4GB of XDR2 at 256/512GB/s (8 DRAM modules)?
Apparently yes.
XDR uses differential signaling, so every bit of width needs a pair of lanes. In other words, that would need 512 signal lines on the motherboard, which would be utterly, ridiculously beyond any sense from a manufacturing standpoint.

When people talk about XDR2, they usually mean a 128-bit bus where you'd normally see a 256-bit bus, or in the case of consoles, a 64-bit bus where you'd usually see a 128-bit bus.

Both the options you gave would be frankly insane, and the only way you can put them in a console would be to ask so much money for it that most people who want to buy one would just pass it by.
 
Unified memory will make developers' lives easier. Whatever performance you may gain with two memory banks will be offset by the extra effort required to manage them properly... unless you can also double the memory bandwidth, but that would increase costs significantly (more complex motherboard, etc).
 
Why 4GB of RAM *and* a 2GB 256-bit GDDR5 interface? Are you completely ignoring the costs?

Apparently yes.
XDR uses differential signaling, so every bit of width needs a pair of lanes. In other words, that would need 512 signal lines on the motherboard, which would be utterly, ridiculously beyond any sense from a manufacturing standpoint.

When people talk about XDR2, they usually mean a 128-bit bus where you'd normally see a 256-bit bus, or in the case of consoles, a 64-bit bus where you'd usually see a 128-bit bus.

Both the options you gave would be frankly insane, and the only way you can put them in a console would be to ask so much money for it that most people who want to buy one would just pass it by.


I partially agree with you, but we have to imagine 2013/2014 rather than staying fixed on 2011 thinking...

DDR3 and DDR4... I imagined a scenario in which DDR3 or a future DDR4 becomes much cheaper and lets you combine a low-cost pool - for a paradigm of constant, steady texture streaming plus general graphics, physics, sound and other tasks (Skype, cross-chat, interactive friends list, streaming music and videos, etc.) - with high-bandwidth GDDR5. But 2GB really would be too much; in that case 1GB would be enough for 1080p and next-generation engines (id Tech 6, Frostbite 2+, Epic UE4/the Samaritan real-time video, etc.).


About XDR2... if Sony was able to use XDR (the PS2 already used Rambus) when nobody else bought it, and now pays much less than US$10 for 256MB (per the iSuppli BOM in 2009; much less today), why not entertain the hypothesis that wider acceptance of XDR2 - by MS, and even by AMD with its GCN 7900 - could bring costs down to something reasonably acceptable for 2013/2014 consoles?


I think it's not as insane as putting a US$250 first-generation Blu-ray drive (while fighting against HD-DVD) into the 60GB PS3, with a BOM of US$838 according to iSuppli at the end of 2006. Surely Sony or MS probably won't commit that error again (Sony), but I don't think they will come in with a bill of materials lower than the Xbox 360's US$525-565 in 2005 for a 2013/2014 console at a US$399 launch price, while still offering hardware reasonably powerful by standards 2 to 3 years ahead of 2011 (I imagine the next-gen console GPU in 2013/14 will at least match a Radeon HD 6970).


As a reflection or thought... as far as possible we have to use creativity and common sense to imagine what the scene will be in 2013 and 2014, with hardware ready for production/taped out at least 6 months before the launch date (the Xbox's NV2A and the Xbox 360's R500/C1 had their final specs, clocks, etc. completed only six months before release...). Clearly beta SDKs will come long before that, but the final specs are certainly subject to last-minute (6 months out) changes.
 
You do need some address space to access the hardware, don't you? 4GB of RAM while working in 32-bit mode would be funny; you could always decide that 256MB or 512MB of it always goes to waste, but that's not that nice :)
Unless you do banking or something I'm not aware of.
I was thinking about a unified memory architecture. Just one 4GB memory area, fully accessible by both GPU and CPU, with shared coherent caches. Intel already has the L3 shared between GPU & CPU in Sandy Bridge, and AMD's slides show that they are aiming for even more tightly integrated operation. The next-gen Bulldozer (Trinity) would be ready in time, but AMD hasn't yet spilled the beans about it (how much tighter the GPU & CPU integration on it will be). I would assume a unified address space and shared L3 at least. Mixed CPU & GPU operation will be a key to getting the best out of the next-gen consoles, and shared memory/caches would be really beneficial for this.

So indeed 4k by 4k might be enough for 720p scenes (I haven't tested any real ones myself), but I'm just saying that the delta between that and the 1 megapixel framebuffer does not all represent waste, assuming you are doing trilinear and aniso. I imagine a lot more of that actually gets touched by the texture filter kernel than just a million or so elements (have you run those numbers by chance?).
The anisotropic mip calculation hits around 50% more tiles than the bilinear mip calculation, based on my analysis of our scenes. While normal anisotropic would sample a wider area of the more detailed mip, we just use simple bilinear to get four pixels from the mip calculated by the aniso hardware. But this doesn't mean we need to fetch less data into the cache, as all the extra pixels for real anisotropic would be neighbors of the pixels we access and in the same mip level, so they always lie in the same tile (if the tile borders are of course wide enough).

Actually now that I think about it, I have to test how much perf hit the real thing would cause... our sampling is texture cache bound, so it would be likely that we get the real thing pretty much for free (extra samples accessed are already in the texture cache) :)

Outside a very small percentage of code where there really is very high calculation density, you can pretty much assume everything is free except a cache miss, and be damn close to actual performance.
Agreed. Most of our optimizations have focused on removing cache misses, or other related data loading stalls (load-hit-store stalls in particular). Extra calculation to improve memory/cache efficiency will be the key to getting the most performance out of future systems. The core/thread count rises all the time, and memory is basically the only shared resource. It will become relatively slower and slower, since more and more cores/threads will share it. It's easy to add more calculation units, but the shared memory and fast communication between them are hard problems to improve.
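For readers unfamiliar with load-hit-store: on the in-order PowerPC cores of this console generation, a load that reads an address which was just written has to wait for the store to drain, which can stall for dozens of cycles. A generic illustration of the pattern and the usual fix (my example, not actual engine code):

```cpp
#include <cstddef>

// Accumulating straight into memory: because `counter` may alias `values`,
// the compiler must store the sum and reload it every iteration, which is
// exactly the store-then-load-same-address pattern behind load-hit-store.
void accumulate(int* counter, const int* values, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        *counter += values[i];
}

// The usual fix: keep the running sum in a local (a register) and write it
// back once at the end, so no load has to wait on a just-issued store.
void accumulateFixed(int* counter, const int* values, std::size_t n)
{
    int local = *counter;
    for (std::size_t i = 0; i < n; ++i)
        local += values[i];
    *counter = local;
}
```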
 
About XDR2... if Sony was able to use XDR (the PS2 already used Rambus) when nobody else bought it, and now pays much less than US$10 for 256MB (per the iSuppli BOM in 2009; much less today), why not entertain the hypothesis that wider acceptance of XDR2 - by MS, and even by AMD with its GCN 7900 - could bring costs down to something reasonably acceptable for 2013/2014 consoles?

My reply would be that XDR2 may be more suitable for solving the problem outlined above by Sebbbi:

sebbbi said:
It's easy to add more calculation units, but the shared memory and fast communication between them are hard problems to improve.

The Cell/EIB/XDR setup is actually a very good approach, and to my knowledge few chip designers have so far been able to come up with something better - most that tried to solve the same problem came up with something that resembles it.

I personally have a hard time imagining that we'll see something completely different from it in future designs. It will be interesting to see if this setup can be fully integrated into a GPU architecture, and if that would be worth it.

But that said all my knowledge about this subject is from reading these forums and some manuals and articles on the web, so my view is very limited and not informed by actual programming / chip-designing expertise. ;)
 
The Cell/EIB/XDR setup is actually a very good approach, and to my knowledge few chip designers have so far been able to come up with something better - most that tried to solve the same problem came up with something that resembles it.
When Cell was released, I thought so too. But now I am no longer sure. Cell is basically a multicore CPU with manual caches. A standard multicore CPU with equally big/fast caches for each core would perform pretty much the same if properly memory-optimized structures are used on both platforms (and proper data prefetching / cache hints are used for the standard multicore CPU).
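As an example of what "proper data prefetching / cache hints" can look like on a standard multicore CPU - a generic sketch using the GCC/Clang builtin (MSVC and console compilers have equivalent intrinsics), with the prefetch distance being a tunable guess rather than a magic number:

```cpp
#include <cstddef>
#include <cstdint>

// Software-prefetch pattern for irregular (indexed) access, where the
// hardware prefetcher can't predict the next address: hint the cache a few
// iterations ahead so the miss latency overlaps with useful work.
void scaleByIndex(float* data, const uint32_t* indices, std::size_t count, float s)
{
    constexpr std::size_t kPrefetchDistance = 8; // tunable assumption
    for (std::size_t i = 0; i < count; ++i) {
        if (i + kPrefetchDistance < count)
            __builtin_prefetch(&data[indices[i + kPrefetchDistance]],
                               1 /*for write*/, 1 /*low temporal locality*/);
        data[indices[i]] *= s;
    }
}
```

Hardware prefetchers already handle plain sequential streams well; explicit hints mostly pay off on irregular, indexed or pointer-chasing access patterns like this one.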

Cell is too hard to program (for general purpose code). While you also need to care about your caches and data movement on standard multicore CPUs, you at least have the chance to do random accesses in places where performance is not critical. I have read white papers describing custom (software-based) caches for the Cell SPUs. My question really boils down to this: why use a software-based cache and waste resources and performance on it, when you can have really fast, optimized fixed-function hardware for that purpose alone? Memory accesses and caching are such commonly used features in software that spending some dedicated transistors on the job sounds like a really good choice. And that's pretty much what you get when you move from Cell to a standard eight-core CPU.

And if we can't get data out of the memory fast enough, we can always add more threads to a single computation unit. This way stall cycles can be used to execute instructions on threads that are not waiting for memory. New IBM processors execute up to four threads per core, and Sun/Oracle SPARCs run up to eight. GPUs execute even more. Latency hiding is often as good a solution as removing the latency (if the code can be easily parallelized). The goal is just to keep all the execution units filled at all times.
 