Predict: The Next Generation Console Tech

I've heard other developers mention many cores.
Maybe IBM has promised some anemic many-core CPU monster for the future. Imagine a Bobcat-type CPU with full-speed L2 cache running at a modest 3 GHz, but with 16 cores on die. Sound nice?

Blue Gene/Q processor nicely fits that bill, if, as sebbbi says, devs don't mind going for massively multithreaded engines. ~205 GFLOPS at 55 watts on 45 nm, it almost fits into a console CPU profile now, other than the die size (360 mm²). Being a supercomputer-aimed chip though, I would imagine they could trim it down a lot for a custom console chip. Dropping the two spare cores and most of the external communication and other supercomputer design aspects could probably get it down to around 300 mm² at 45 nm, and with the rumored 2014 launches they would most certainly have shrunk it to 32 nm, possibly even 22 nm. At 32 nm I think it would be down near Xenon's die size at launch or smaller, with lower power consumption, or possibly higher clocks than the Blue Gene/Q version.
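As a rough sanity check on that shrink estimate (ideal area scaling only, which real processes never quite hit, and starting from the assumed ~300 mm² trimmed design):

300 mm² × (32/45)² ≈ 152 mm² at 32 nm
300 mm² × (22/45)² ≈ 72 mm² at 22 nm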

Die shot/info: http://www.hpcwire.com/hpcwire/2011-08-22/ibm_specs_out_blue_gene_q_chip.html
 
So, I think it's fair to assume we won't have a Bulldozer as a CPU...
Bulldozer doesn't seem to be that bad in eight-core workloads, and console developers would always be using all its cores. You should not look at the single-core benchmarks when deciding what CPU is suited for a console.

Bulldozer seems to be currently suffering from two things: the Windows thread scheduler allocates threads to modules inefficiently (5%-10% improvements in Win 8 already measured), and the L1 data cache (a simple 2-way cache, shared between the two cores in a module) also seems to be having aliasing issues in multithreaded workloads (discussed on the Linux kernel mailing list). A simple fix yielded a 3% performance boost. These are not problems in console development. Libraries, OS and compiler are all fully optimized for the hardware, and console developers allocate their threads manually (to maximize shared cache usage and minimize shared FPU contention). We could see 15% better performance when these issues are taken out of the equation.
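To make the "allocate their threads manually" point concrete, here's a minimal sketch of what that placement might look like on a 4-module/8-core part. The Win32 affinity call is used purely for illustration (a console SDK would expose its own equivalent), and the core-to-module numbering (cores 0-1 on module 0, cores 2-3 on module 1, and so on) is an assumption:

```cpp
#include <windows.h>
#include <thread>

// Pin the calling thread to a single hardware core (illustrative only).
static void PinCurrentThreadToCore(int core)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
}

int main()
{
    // A producer/consumer pair that shares a working set goes on the same
    // module (cores 0 and 1) so both threads benefit from the shared L2.
    std::thread producer([] { PinCurrentThreadToCore(0); /* fill job queue */ });
    std::thread consumer([] { PinCurrentThreadToCore(1); /* drain job queue */ });

    // Two FP/SIMD-heavy jobs go on different modules (cores 2 and 4) so they
    // don't contend for a module's shared FPU.
    std::thread physics([] { PinCurrentThreadToCore(2); /* SIMD-heavy work */ });
    std::thread audio([]   { PinCurrentThreadToCore(4); /* SIMD-heavy work */ });

    producer.join(); consumer.join(); physics.join(); audio.join();
    return 0;
}
```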

Add in the promised 10%-15% IPC improvements from the revision 2 core (Piledriver), and we have a total performance improvement in the 25% range. If you only look at the 8-thread benchmarks and factor in a 25% improvement for Bulldozer, things do not look that bad anymore.

And then we have AVX, XOP and FMA4. Console developers would be using these extensively in all games, and all libraries would be optimized for them. Theoretically FMA4 doubles the vector throughput (in reality the gain is of course lower). These are really important additions, but they are not visible in current benchmarks (except x264 HD 3.03, where Bulldozer really shines).
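A minimal sketch of why FMA matters for throughput: in a dot-product style loop, a separate multiply and add collapse into one fused instruction. _mm256_macc_ps is the AMD FMA4 intrinsic (a*b + c); header and compiler flags vary (e.g. -mfma4 on GCC), and the loop below assumes the element count is a multiple of 8:

```cpp
#include <x86intrin.h>

float dot(const float* a, const float* b, int n)  // n assumed to be a multiple of 8
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        // Plain AVX would need two instructions:
        //   acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        // FMA4 fuses them into one:
        acc = _mm256_macc_ps(va, vb, acc);
    }
    // Horizontal sum of the eight lanes.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    return sum;
}
```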

As for the excessive 125 W power consumption and the huge caches: the low-power Opteron 4256 EE has 8 cores, a 2.5 GHz base clock, and consumes only 32 watts. Add in all the Piledriver improvements (a core designed for laptops as well), and we would likely have a much more power-efficient CPU. Huge caches are not needed in games either; they are mainly there to have reasonable performance in server workloads and in general workloads (much more cache thrashing and less efficient data structures than in highly optimized game engines). Cut the caches and the transistor count drops dramatically.

The question really becomes: could AMD/GF produce that thing in a timely manner, with good enough yields, manufacturing costs, etc.? They have been consistently late, and that doesn't inspire confidence. Microsoft and Sony cannot be late this time, they need to be absolutely sure everything goes as planned, and IBM has a very, very solid track record in that. And both manufacturers have existing experience with IBM CPUs from their last consoles.
 
Bulldozer doesn't seem to be that bad in eight-core workloads, and console developers would always be using all its cores.
Then again, with a similar transistor budget you could get at least 10-12 cores' worth of i7 with HT on top, for a total of 20-24 parallel threads.
 
That's actually not nearly as interesting as it sounds. The JEDEC Wide I/O is meant for mobile and embedded devices, and they are using the high width to compensate for not running the interface faster than the RAM chips themselves run, as most present memory standards do. Presently the fastest Wide I/O spec is 512 bits × 200 MHz, or 12.8 GB/s. This makes sense, because when the RAM chips are integrated, wide interfaces are cheap, and wide-and-slow takes only a fraction of the power of narrow-and-fast.
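For reference, the headline figure works out as follows, with a rough (assumed) GDDR5 comparison for context:

512 bits × 200 MHz = 102.4 Gbit/s ≈ 12.8 GB/s per interface

By comparison, a single 32-bit GDDR5 device at 4 Gbps/pin already delivers about 16 GB/s, and a modest 128-bit GDDR5 bus around 64 GB/s.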

But I'm pretty sure they'd have real trouble fitting eight of those on a single shrunk-once next-gen console CPU, so they can't really match the bandwidth of GDDR5 (or DDR4).

Did you actually read the article? This technology is evolving rapidly, and who knows what the outcome of the IBM/3M joint venture will be? They are certainly looking into high-performance computing.
 
Then again, with a similar transistor budget you could get at least 10-12 cores' worth of i7 with HT on top, for a total of 20-24 parallel threads.
Yes, if you assume that the gaming console part would need those huge caches, which are pretty much there for enterprise purposes and for running generic unoptimized code. If you take out the excess caches, the Bulldozer isn't that large. Revision 2 Bulldozer (Piledriver) will be inside the new Trinity APU; it cannot be that large, since they also have to cram in a GPU more powerful than what the current Llanos have. Trinity of course might be a total failure, but hopefully we see the results early next year.
 
If you take out the excess caches, the Bulldozer isn't that large.
IIRC the 6-core i7 is around 1.17B transistors with 13.5 MB of cache (6 × 0.25 MB L2 + 12 MB L3). BD has 16 MB (4 × 2 MB L2 + 8 MB L3), so only about 2.5 MB more than the 6-core i7. How many transistors would BD roughly have if you cut the L2 from 2 MB per module to 0.5 MB and completely removed the L3? What would be left of its performance?

[edit]
Fixed i7 L2 amount
 
IIRC the 6-core i7 is around 1.17B transistors with 13.5 MB of cache (6 × 0.25 MB L2 + 12 MB L3). BD has 16 MB (4 × 2 MB L2 + 8 MB L3), so only about 2.5 MB more than the 6-core i7. How many transistors would BD roughly have if you cut the L2 from 2 MB per module to 0.5 MB and completely removed the L3? What would be left of its performance?


Sometimes Intel doesn't count the transistors of the cache memory; I don't know about AMD. It may be hard to draw any conclusions based on those numbers.
 
Sometimes Intel doesn't count the transistors of the cache memory; I don't know about AMD. It may be hard to draw any conclusions based on those numbers.
If Intel doesn't count their caches in that 1.17B then they must be packaging them at significantly higher density compared to what BD uses. I know Intel is good at that but I don't think they are that good
 
If Intel doesn't count their caches in that 1.17B then they must be packaging them at significantly higher density compared to what BD uses. I know Intel is good at that but I don't think they are that good

13.5 MB with 6T cells translates to 6 × 8 × 13.5M ≈ 650 million transistors, and that number doesn't include the associative memory and lookup logic. Feel free to draw your own conclusions.
 
Intel counts their caches as part of the transistor counts. Some of the confusion with SB was around the methodology used to count transistors: one count was of the number of gates in the higher-level circuit plans, and the higher number corresponded to the number of physical gates that implemented the design. Implementing the design on the wafer sometimes involved placing multiple gates to produce the desired result of a single abstract gate.

I will reserve judgement on Piledriver. I've learned not to count on AMD performance improvement projections until they've hatched. The slide that promised the improvement for Piledriver had an asterisk for that claim.

One small correction to a claim above: BD has a problem with aliasing in its L1 instruction cache.
The problem with the data caches is their puny size and possible problems with write bandwidth and streamout.
 
Imho the most relevant comparison between the AMD and Intel offerings is pretty straightforward: just compare the performance of a 4-core SnB with SMT enabled (8 logical cores) to that of a 4-module Bulldozer, measured across a broad range of applications (from desktop to server, games included). One may also consider that the SnB part comes with an IGP.
There are applications where the extra cache in Bulldozer will serve its purpose,
but one should also take into account that, while it's always terrible to do a round trip to main memory, the Intel arch takes significantly less time to reach main memory and achieves higher bandwidth, so it's questionable to what extent the extra cache in BD is there to make up for that inefficiency.
Still, I believe that when all is said and done it will look really, really ugly for AMD's offering.
I would not be surprised if AMD's perf per transistor is half that of Intel's.
Take into account the IGP, and the power consumption of BD plus a discrete GPU, and AMD may also be at a 2x disadvantage in power efficiency. It's terrible.

I would say that we are being a bit soft on Bulldozer; it's really a dog. It has issues, but there were also bad decisions like a lengthier pipeline and aiming at higher clock speeds. To some extent I wonder if it would be possible for AMD to copy-paste the execution pipeline from their "Stars" cores into BD.
 
The counterargument I've seen to a shrink for Stars is that Llano is a power-optimized Stars shrink to 32nm, and with 4 cores it is not performance competitive and also subject to power and yield concerns.

The TDP for that chip swings wildly with modest changes in CPU frequency, but it is pretty insensitive to GPU changes.
 
...one should also take into account that, while it's always terrible to do a round trip to main memory, the Intel arch takes significantly less time to reach main memory...
That really shouldn't happen on a console, and with proper cache management instructions that shouldn't really happen on any processor going forward with well-optimised code. Game developers know what data they need when, and can fetch it ahead of time manually in cases where the cache would miss.
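A minimal sketch of that "fetch it ahead of time" idea, assuming an x86-style prefetch hint (consoles expose their own equivalents, e.g. dcbt on PowerPC); the Entity layout and the prefetch distance K are made up for illustration:

```cpp
#include <xmmintrin.h>
#include <vector>

struct Entity { float pos[3]; float vel[3]; };

void integrate(std::vector<Entity>& entities, float dt)
{
    const int K = 8;  // prefetch distance, would be tuned per platform (assumption)
    const int n = static_cast<int>(entities.size());
    for (int i = 0; i < n; ++i)
    {
        // While working on element i, pull element i+K towards the cache so
        // the line is (hopefully) resident by the time the loop reaches it.
        if (i + K < n)
            _mm_prefetch(reinterpret_cast<const char*>(&entities[i + K]), _MM_HINT_T0);

        for (int c = 0; c < 3; ++c)
            entities[i].pos[c] += entities[i].vel[c] * dt;
    }
}
```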
 
If we're assuming a transistor budget comparable to a Bulldozer, I'd just as soon have 8 Cell processors stacked together. 8 PPEs (16 threads...) and 64 SPEs.
 
I doubt you'll find much, 6+ cores are what, 1% of the gaming market? Maybe next year we'll see some, Battlefield 3 is looking like it might scale better with more than 4 cores.
Right, but that's irrelevant to the underlying scaling of the engine. If indeed the engines are using a granular task system with work stealing as claimed, then it should continue scaling much further than 4 cores.

My suspicion is that the sizes of the "tasks" that are currently being used are too large, i.e. it's easy to make "rendering" or "physics" a task or something, but it takes more work to, say, do parallel scene graph traversal and update. To some extent people can dodge this with double-buffering, but even then you need to make sure that your tree-based algorithms are reasonably scalable. That's easy with nested parallelism (Cilk, TBB, etc), but the home-made task systems that I've seen so far in games are not particularly sophisticated yet.
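As a minimal sketch of the nested-parallelism approach (TBB's task_group used for illustration; the Node layout is invented), each node spawns its children as tasks for the work-stealing scheduler, so parallelism grows with the width of the scene graph instead of being capped at one task per subsystem:

```cpp
#include <tbb/task_group.h>
#include <vector>

struct Node
{
    std::vector<Node*> children;
    void updateLocalTransform() { /* combine local with parent transform, etc. */ }
};

void updateSubtree(Node* node)
{
    node->updateLocalTransform();

    // Spawn one task per child; idle worker threads steal them, so a wide,
    // irregular tree still keeps many cores busy.
    tbb::task_group tg;
    for (Node* child : node->children)
        tg.run([child] { updateSubtree(child); });
    tg.wait();
}
```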

The big problem is there's still lots of serial legacy code around, even in modern games. I expect this will gradually go away, but again I'm sort of surprised that I have yet to see a game that really scales beyond 4 cores, let alone beyond 16, which is where it really starts to get interesting. I imagine I will soon, but had you asked me 5 years ago I would have predicted that we'd see more ubiquitous scaling/multicore usage by now.


That really shouldn't happen on a console, and with proper cache management instructions that shouldn't really happen on any processor going forward with well-optimised code. Game developers know what data they need when, and can fetch it ahead of time manually in cases where the cache would miss.
Most of the time yes, but you do have data-dependent lookups in some algorithms. Histogram for a sufficiently large number of bins (such that they don't fit in cache) is the simple example where you cannot predict the memory access pattern. To complicate matters, sometimes the most computationally efficient algorithm has a non-trivial memory access pattern (due to using acceleration structures, etc) so if the hardware can't provide fairly low-latency access to random memory then you're simply in trouble.
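A minimal sketch of that histogram case: the bin index comes straight from the data, so once the bin array is larger than the cache, the increments turn into effectively random accesses that no amount of manual prefetching can anticipate:

```cpp
#include <cstdint>
#include <vector>

std::vector<uint32_t> histogram(const std::vector<uint32_t>& keys, size_t numBins)
{
    // numBins chosen large enough that 'bins' does not fit in cache (assumption).
    std::vector<uint32_t> bins(numBins, 0);
    for (uint32_t k : keys)
        ++bins[k % numBins];  // data-dependent address: unknown until k is loaded
    return bins;
}
```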

What is most troubling about BD to me is that even with massively multi-threaded integer workloads (which should be the best case for it), it still at best achieves relative parity with a 4-core/8-thread i7, and uses more power doing it. I want to see a case where it gets nearly double the performance of an i7... if that's not possible then I question the logic of the design. Remember that for a fixed target performance, hardware that requires less parallelism to reach that performance is preferred. Parallel algorithms involve overhead and sometimes they are vastly less efficient than the best known serial algorithms, so while I love parallel programming and all, the less parallelism that a processor *requires* to reach peak throughput the better.
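To put a rough (made-up) number on that overhead argument: if a parallel algorithm does (1 + o) times the work of the best serial one, the effective speedup on N cores is about N / (1 + o). So 8 cores each running at 60% of a fast core's speed, with 20% algorithmic overhead, give roughly

8 × 0.6 / 1.2 = 4.0

i.e. parity with 4 fast cores that needed none of that extra parallelism.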

So 8 integer cores would be great if it meant I could achieve 2x the performance or at least significantly more than a typical quad core. It's less interesting if I need to go even more parallel (including the associated overhead) just to get parity with a 4-core part.
 
had you asked me 5 years ago I would have predicted that we'd see more ubiquitous scaling/multicore usage by now.
I wouldn't have; people were way too optimistic about how game developers would drive parallel solutions. Fact of the matter is, 10 years ago I might have predicted faster multi-core adoption, but game teams of 5 years ago were already too big to be exclusively composed of stars. And the challenge of game development had already shifted from technology to engineering.
I don't think the legacy code will go away, because there is a lot of code being produced today with a legacy mindset.

On CPU design, I still think the real way to go with highly parallel architectures is lots of fine-grained hardware threads with shared computational resources, making it possible to mask memory latencies. Even ignoring main memory hits, L2 cache hits are enormously expensive today, and the latencies aren't likely to go down. Exploiting an architecture like this in a game, though, is a long way off.
 
My suspicion is that the sizes of the "tasks" that are currently being used are too large, i.e. it's easy to make "rendering" or "physics" a task or something, but it takes more work to, say, do parallel scene graph traversal and update...
What sort of improvements are you expecting/hoping for? I thought modern PC games were GPU-bound on the whole, so more CPU performance won't net much gain in measurable performance in terms of FPS. The whole schism between CPU and GPU makes fine granularity less workable. A single unified processing architecture like LRB would have shown the best scalability. Within the current consoles you can only look at those articles on job scheduling to see how well/badly multithreading is being implemented, and those devs who have shared have shown good results on Cell.

Most of the time yes, but you do have data-dependent lookups in some algorithms. Histogram for a sufficiently large number of bins (such that they don't fit in cache) is the simple example where you cannot predict the memory access pattern. To complicate matters, sometimes the most computationally efficient algorithm has a non-trivial memory access pattern (due to using acceleration structures, etc) so if the hardware can't provide fairly low-latency access to random memory then you're simply in trouble.
Get BIG L3 cache. ;)

What's the largest likely sort of dataset a job in a game is going to work on? Is it feasible to pick an eDRAM cache amount to serve that? Although TBH I'd say the cost isn't worth it. If one of your jobs thrashes main memory, just take the hit and keep the rest of the system busy. Most problems can be well controlled and don't need exotic solutions. No point putting BD with eDRAM just to overcome its main memory shortcomings, especially when there are better performing rivals!
 