Predict: The Next Generation Console Tech

marketing = win.
Times are changing
http://www.vnunet.com/vnunet/news/2202706/seagate-settles-class-action
02 Nov 2007

Seagate has agreed to settle a class action lawsuit by offering customers from the past six years a cash refund or free back-up and recovery software.

The suit was filed by Michael Lazar and Sarah Cho in March 2005 following claims that the actual storage capacity of Seagate's hard drives was seven per cent less than promised.

In particular, the plaintiffs alleged that Seagate misled consumers by using the decimal definition of 'gigabyte', in which one gigabyte equals one billion bytes.

Computer operating systems calculate hard drive capacity using the binary definition of gigabyte, in which one gigabyte equates to 1,073,741,824 bytes, creating the seven per cent discrepancy.
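As a quick check of that figure (my own sketch, not from the article), the ratio between the decimal and binary definitions works out to roughly 93%, i.e. about a 7 per cent shortfall:

```c
#include <stdio.h>

int main(void)
{
    double decimal_gb = 1e9;                      /* 'gigabyte' as marketed: 10^9 bytes */
    double binary_gb  = 1024.0 * 1024.0 * 1024.0; /* 'gigabyte' as the OS counts it: 2^30 bytes */

    printf("ratio:     %.4f\n", decimal_gb / binary_gb);                   /* ~0.9313 */
    printf("shortfall: %.1f%%\n", (1.0 - decimal_gb / binary_gb) * 100.0); /* ~6.9    */
    return 0;
}
```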
 
My hope is that whatever tech they go with, they do it right and don't overshoot. I'd rather have a console with EVERY game at 720p with 4xAA and 8xAF, locked at 60fps with great textures and advanced lighting, than what we have this gen. Stuff all over the place.

I want to be able to pick up a game and know that I'll get consistent great performance from it.
 
Does it matter? The fact is that for efficient utilization you still need lots of threads with customized thread scheduling. If you fetch too many texels based on a guess from the first fetch, you get poor bandwidth usage.
Yes it does. Just look at your code when you do some SIMD optimization: you won't get a performance win just by switching to SIMD, because you add a lot of latency. E.g. transforming vertices by Mad(V,Mx,Mad(V,My,Mad(V,Mz... won't give you much performance (especially on in-order cores). You need to hide the latency by e.g. processing 4 or more vertices in parallel (it's not just loop unrolling, it's also statically scheduling instructions to hide latency).
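To illustrate the point (my own hedged sketch, not code from the post): a structure-of-arrays layout lets one SSE instruction work on the same component of four vertices at once, so the multiply/add work for those four vertices is independent and the core has other instructions to issue while each result is still in flight. Names and the 3x4 matrix layout are illustrative only.

```c
/* Sketch only: transform 4 vertices per iteration using a structure-of-arrays
 * layout, so the dependent madd chain of a single vertex no longer stalls an
 * in-order core. Assumes 16-byte aligned arrays and count a multiple of 4. */
#include <xmmintrin.h>

void transform_x4(const float *vx, const float *vy, const float *vz,
                  const float m[12],     /* 3x4 row-major matrix (assumed layout) */
                  float *ox, int count)  /* output x component only, for brevity  */
{
    for (int i = 0; i < count; i += 4) {
        __m128 x = _mm_load_ps(vx + i);  /* x of 4 different vertices */
        __m128 y = _mm_load_ps(vy + i);
        __m128 z = _mm_load_ps(vz + i);

        /* out.x = m0*x + m1*y + m2*z + m3; the three products are independent,
         * so they can be issued back to back instead of waiting on each other. */
        __m128 p0 = _mm_mul_ps(x, _mm_set1_ps(m[0]));
        __m128 p1 = _mm_mul_ps(y, _mm_set1_ps(m[1]));
        __m128 p2 = _mm_mul_ps(z, _mm_set1_ps(m[2]));
        __m128 r  = _mm_add_ps(_mm_add_ps(p0, p1),
                               _mm_add_ps(p2, _mm_set1_ps(m[3])));

        _mm_store_ps(ox + i, r);
    }
}
```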

This is not realistic. Texture filtering is low precision (at full speed) and fits in a very specific data path, so it has a fraction of the cost of a shader unit. Having it run in parallel also allows variable fetch cost (e.g. aniso, wide formats, etc.) to be hidden by shader instructions.
I thought the R600 always filters in high quality?

It's a business decision because they can't achieve the high efficiency of ATI and NVidia. S3, XGI, Trident etc. failed to break through in recent years because of this, as you could definitely see dependent texturing killing them in performance.
Texturing is just a small part of the equation, and it's a very well known part. As I said, even onboard GPUs have enough of them alongside all the other transistors; even Sony and Nintendo could add them to their mobile devices.
If you say it's hard to hide texture-fetching latency, I'd agree with you. But if you say it's hard to build a TMU, I disagree.


It'll never merge completely due to fundamental differences in the workloads. If your units get much larger trying to accommodate non-GPU loads, then there's an opportunity for the competition to crush you in perf/mm2 by going back to basics. This is what happened in the G7x vs. R5xx generation. NVidia had record profits and margins, whereas ATI barely broke even. RV530 vs. G73 was particularly devastating.
Your units won't get large if you manage to hide latencies. Look at Sun's Niagara: they aren't using as good a process as Intel or IBM, but they manage to get 8 cores on a die with 64 threads. Those 64 threads aren't just a marketing feature, they're essential to keep those 8 cores small, because transistor count doesn't scale linearly with speed; for twice the radix-division performance, you need 3 to 4 times more transistors (correct me if I'm wrong). So, conversely, accepting twice the latency _can_ cut transistor cost to 1/4, allowing you to run 4 cores, but you need to hide the latency somehow.
That's how 'the basics' of GPUs work. It's not magic built on highly optimized float units, highly optimized scheduling, or highly optimized caches; it's just a really long pipeline with very coherent memory access patterns.
And that's why switching states/textures/etc. is so expensive: you flush that pipe, and if you do it frequently it can't even be filled with enough work to hide the latency. (If you look at the recent NV GPU, what they mostly added is stuff to hide latency, like doubling the temporary register count...)
That's also the big problem with GPGPU and why they'll move towards CPUs: you're very restricted in what you can do if you want performance. Running a CUDA raytracer that randomly reads from memory to traverse a BSP will completely kill your performance, ending up far below what simple CPUs can render.

And I think those unified processors will appear first in consoles, just because they have fewer backward-compatibility restrictions, like we saw with the triple-core SMT design in the X360.
 
Yes it does. Just look at your code when you do some SIMD optimization: you won't get a performance win just by switching to SIMD, because you add a lot of latency. E.g. transforming vertices by Mad(V,Mx,Mad(V,My,Mad(V,Mz... won't give you much performance (especially on in-order cores). You need to hide the latency by e.g. processing 4 or more vertices in parallel (it's not just loop unrolling, it's also statically scheduling instructions to hide latency).
This has nothing to do with our discussion. I said we need hundreds of cycles of latency hiding for texturing, then you said we only need it for the first fetch, which I say is irrelevant because it's still needed.

Why are you explaining instruction latency to me now? Of course I understand that.

I thought the R600 always filters in high quality?
R600 did FP16 at full speed, but it was useless and no other GPU does it. Not G80, not GT200, not RV770...

If you say it's hard to hide texture-fetching latency, I'd agree with you. But if you say it's hard to build a TMU, I disagree.
Yes, I'm talking about the former. If you want to compete, you have to be efficient, and you must hide latency to do that.

Your units won't get large if you manage to hide latencies. Look at Sun's Niagara...
Niagara's cores aren't remotely close to the density of a GPU's SPs, nor can they handle enough threads to hide texture latency.

The problem for ATI/NV is not about hiding latency, as they can already do that well. We were talking about granularity, and GPUs can't reduce granularity without increasing SP size and thus reducing perf/mm2.

That's also the big problem with GPGPU and why they'll move towards CPUs: you're very restricted in what you can do if you want performance. Running a CUDA raytracer that randomly reads from memory to traverse a BSP will completely kill your performance, ending up far below what simple CPUs can render.
Latency isn't the problem here. GPGPU can handle random dependent reads quite well - much better than a CPU - if your program doesn't use too many registers.

The problem is granularity and coherence. If the data access is random, you will be very inefficient with bandwidth. Branch coherence is needed, too. If you try to solve these problems, you will have to increase die size dramatically, and lose the advantage that you had over CPUs in the first place. You also lose your competitive edge in your core market of graphics chips, opening the door for a competitor to blow you out of the water.

And I think those unified processors will appear first in consoles, just because they have fewer backward-compatibility restrictions, like we saw with the triple-core SMT design in the X360.
I don't know. I really doubt it will happen this coming generation, and AMD/Intel will probably get there before two generations from now.
 
This has nothing to do with our discussion. I said we need hundreds of cycles of latency hiding for texturing, then you said we only need it for the first fetch, which I say is irrelevant because it's still needed.

Why are you explaining instruction latency to me now? Of course I understand that.
I've just explained why you don't need any special 'thread scheduling' if you can process enough pixels at the same time.

Niagara's cores aren't remotely close to the density of a GPU's SPs, nor can they handle enough threads to hide texture latency.
It has enough threads to hide most of the arithmetic/branch latency. That's how GPUs do it as well, while normal CPUs barely hide latency, even out-of-order cores; not because they couldn't, but because of the dependencies in most code. To hide latency, you need either parallelization on the code side or a lot of 'hyper-threads'. Niagara is the proof of concept, although it might not be a high-end floating-point CPU. (And it's even worse with latencies from cache misses, but I was just talking about arithmetic latencies, which are already an issue and could be hidden by several threads per core, which will move CPUs towards GPUs.)

The problem for ATI/NV is not about hiding latency, as they can already do that well. We were talking about granularity, and GPUs can't reduce granularity without increasing SP size and thus reducing perf/mm2.
Yes, and that's what I've been saying all along: they'll reduce granularity to fit GPGPU tasks better.

Latency isn't the problem here. GPGPU can handle random dependent reads quite well - much better than a CPU - if your program doesn't use too many registers.
No, they can't hide random reads; they can just hide latency for very coherent accesses, because the first memory access of a pixel batch takes the hit of about 300 cycles (which is hidden by processing the next pixels of the batch), while all successive reads need to hit cached nearby texture tiles. If you do random reads you will kill performance, because the GPU can't hide 300 cycles per pixel, just per batch. You can easily test that by disabling mipmapping. To hide memory latency at finer granularity, you need big caches. A lot of threads can only hide a few (although very expensive) cache misses and mostly arithmetic latencies.
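A rough way to put numbers on "enough pixels in flight" (my own back-of-the-envelope sketch, not from the post): by Little's law, the number of outstanding work items you need is roughly the latency times the rate at which long-latency requests are issued. The 300-cycle figure comes from the post; the one-fetch-per-4-ALU-cycles ratio is an illustrative assumption.

```c
#include <stdio.h>

/* How many independent pixels/threads must be in flight so a ~300-cycle
 * fetch latency never leaves the ALUs idle?
 * Little's law: in_flight >= latency_cycles * fetches_issued_per_cycle. */
int main(void)
{
    double latency_cycles   = 300.0;  /* round-trip latency of a fetch (from the post) */
    double fetch_issue_rate = 0.25;   /* assumed: one fetch every 4 cycles per pipeline */

    double threads_needed = latency_cycles * fetch_issue_rate;
    printf("threads needed per pipeline: %.0f\n", threads_needed); /* 75 */
    return 0;
}
```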

The problem is granularity and coherence. If the data access is random, you will be very inefficient with bandwidth. Branch coherence is needed, too. If you try to solve these problems, you will have to increase die size dramatically, and lose the advantage that you had over CPUs in the first place. You also lose your competitive edge in your core market of graphics chips, opening the door for a competitor to blow you out of the water.
You could say the same about adding float and/or double precision to the cores. Speed is no longer the only force driving design decisions. More flexibility, from more capable cores with finer granularity, is also needed to be more attractive to other markets. Of course they will have to offer more speed than the last generation, but they want a bigger market for their products, so they need to offer more than just a fast rasterizer.

I don't know. I really doubt it will happen this coming generation, and AMD/Intel will probably get there before two generations from now.
The next generation is still about 4 years away; I think by then we'll have the first unified architectures. The X360 had more than two cores plus SMT well before other native 3+ core parts hit the market. It's primitive compared to some Phenom cores, but it fits the needs at a very affordable price. I expect the same for the unified PUs: not the best performer, but a good unified solution for a console. I can't say it's going to be that way, it's just what I expect ;)
 
I've just explained why you don't need any special 'thread scheduling' if you can process enough pixels at the same time.
And that's the tough part - scaling to enough threads economically. GT200 can handle 30K threads, and AFAIK R580 and R6xx are in the same ballpark (but given their low texture throughput, that's rather silly).

No, they can't hide random reads; they can just hide latency for very coherent accesses, because the first memory access of a pixel batch takes the hit of about 300 cycles (which is hidden by processing the next pixels of the batch), while all successive reads need to hit cached nearby texture tiles.
Why do you think latency hiding ability just disappears? If you have 300 cycles for the first batch, you have 300 cycles for every other one, too.

If you do random reads you will kill performance, because the GPU can't hide 300 cycles per pixel, just per batch. You can easily test that by disabling mipmapping.
It absolutely can hide it. The performance drop from disabling mipmapping is due to bandwidth and data access granularity. If you have a minimum read burst of, say, 64 bytes due to your cache line size, you can only randomly fetch 2.2 GSamples/s at most on GT200 even with perfect latency hiding.
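For reference, here's how that 2.2 GSamples/s ballpark falls out (a sketch under the post's assumption of a 64-byte minimum burst; the roughly 141.7 GB/s peak bandwidth figure for GTX 280 is my addition):

```c
#include <stdio.h>

/* If every random sample costs a full minimum-size burst, the peak fetch rate
 * is simply DRAM bandwidth divided by burst size. */
int main(void)
{
    double bandwidth_bytes_per_s = 141.7e9; /* assumed GTX 280 peak bandwidth */
    double burst_bytes           = 64.0;    /* minimum burst, from the post   */

    printf("max random fetches: %.2f G/s\n",
           bandwidth_bytes_per_s / burst_bytes / 1e9);  /* ~2.21 */
    return 0;
}
```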

CPU memory controllers have similar granularity, so even a theoretical super-threaded CPU wouldn't do any better unless, of course, the "random" fetches are coherent enough to fit in the cache.

To hide memory latency at finer granularity, you need big caches.
I wouldn't call big caches a method of hiding latencies. It's more of a method to avoid the problem.

You could say the same about adding float and/or double precision to the cores. Speed is no longer the only force driving design decisions. More flexibility, from more capable cores with finer granularity, is also needed to be more attractive to other markets.
Maybe you should quantify those markets before making such assumptions. You're not going to jeopardize a multibillion dollar revenue stream to try attacking a currently niche market. By focusing too much on GPGPU, NVidia could be in serious trouble with GT200 now that RV770 and its derivatives are here. Fortunately G92 is still within 25% in perf/mm2, so they won't get crushed.

The next generation is still about 4 years away; I think by then we'll have the first unified architectures. The X360 had more than two cores plus SMT well before other native 3+ core parts hit the market. It's primitive compared to some Phenom cores, but it fits the needs at a very affordable price. I expect the same for the unified PUs: not the best performer, but a good unified solution for a console. I can't say it's going to be that way, it's just what I expect ;)
We'll see. Putting multiple dies together is one thing. Sharing resources as complex as shader engines is an entirely different task.
 
Don't know if GDDR5 has been discussed in this thread already; thought it may be of interest.

Macri said advanced signaling technologies from Rambus will not be competitive, in part because they use a differential (two-wire) approach rather than the single-wire technique in GDDR5. The extra wire typically requires more pins and power. "We don't think a differential solution makes sense until you get to speeds of 8 to 10 Gbit/s," he said.

The Rambus technology is used as an interconnect for main memory in the Sony Playstation3, but all three major videogame platforms today use GDDR3 as their graphics memory interconnect, Macri said. "XDR doesn't have a footprint in any console for graphics today," he said. "I believe GDDR5 will be a nice fit for the console space."

Ultimately, the industry will have to switch to differential technology, said Michael Ching, director of product marketing at Rambus. A Qimonda white paper shows today's techniques running out of steam at 5 to 6 GHz, he said.

"That's pretty much the end of the line for the single-ended approach," Ching said, adding that the Rambus XDR approach can consume less power than single-ended techniques. "Our analysis shows differential technology results in lower power even at 4 GHz or so, and the difference between the two grows as you go faster," he said.

The Rambus XDR technology is available at data rates from 3.2 to 4.8 GHz and will scale to 6.4 GHz. Within weeks, Rambus will disclose its XDR-2 technology, which will start at 8 GHz and has been demonstrated at 16 GHz, Ching said.
 
NEC has done 12 Gbps using pre-emphasis filters; there is plenty of life left in single-ended signalling yet... and once we get far past 10 Gbps, things will go optical anyway. For non-board-level integration, Rambus doesn't have a chance in hell of getting into the game before optical takes over either (I don't think Intel is ready for another venture with them, which is the only way it would happen).

Apart from patent milking, Rambus has become irrelevant.
 
The question on optical is a big one though, as a significant volume of business (and roadmapping) is always at play when the console generations roll around. Taking the case of PS4 specifically, between IBM's efforts into optical interconnects and STI's shared interests in the successor to Cell, it would seem synergistic to have said architecture be one of the first to use an optical link. But will the technologies be ready in time? And if not, then I could see Rambus making its way again into the fray, as GDDR5 likely won't be something IBM feels comfortable building around for its own enterprise products.

Now on the add-in front, I don't see Rambus gaining share. Not that I don't like them either, because I do think XDR has a lot of merit, but... well yeah.
 
I think IBM will feel far more comfortable using a commodity technology than a Rambus one in their enterprise products ...
 
I don't really get the advantages of Rambus. What's the point in higher data rate for differential signalling when you need twice as many lines? You might as well use a double width single-ended system.

Am I missing something? Does Rambus need less than twice as many pins for the same data width? 64 pairs of wires connected to 8GHz XDR-2 can't transfer any more data than 128 (+ a few ground) lines connected to 4GHz GDDR5.
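For what it's worth, the raw numbers in that comparison do come out even (my own arithmetic, assuming "GHz" here is shorthand for Gbit/s per data line and ignoring ground/strobe pins and protocol overhead):

```c
#include <stdio.h>

/* Aggregate data rate: 64 differential pairs (2 signal pins each) at 8 Gbit/s
 * vs. 128 single-ended lines at 4 Gbit/s.  Figures follow the post above,
 * not any particular datasheet. */
int main(void)
{
    double xdr2_gbit  = 64.0  * 8.0;   /* 512 Gbit/s over 128 signal pins */
    double gddr5_gbit = 128.0 * 4.0;   /* 512 Gbit/s over 128 signal pins */

    printf("XDR-2: %.0f Gbit/s (%.0f GB/s)\n", xdr2_gbit,  xdr2_gbit  / 8.0);
    printf("GDDR5: %.0f Gbit/s (%.0f GB/s)\n", gddr5_gbit, gddr5_gbit / 8.0);
    return 0;
}
```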
 
I think IBM will feel far more comfortable using a commodity technology than a Rambus one in their enterprise products ...

Well honestly, it'll be interesting to see what they do. They clearly want to leverage the console's economies of scale when they go to develop what will probably be their contemporary anti-Larrabee chip, but as the PowerXCell 8i shows, there was still a bridge that had to be crossed to make the joint-product truly well suited to the particular market. And indeed SpursEngine shows the same on the opposite side.

I have to imagine that they would want to reach a universal 'ideal' as best they can next go-around, but although I agree commodity wouldn't be shunned, I just don't think it'll be any of the GDDR variants. So that leaves a tug-of-war between Sony's pin-count/power/bandwidth needs and IBM's bandwidth/addressability needs.

I don't really get the advantages of Rambus. What's the point in higher data rate for differential signalling when you need twice as many lines? You might as well use a double width single-ended system.

Am I missing something? Does Rambus need less than twice as many pins for the same data width? 64 pairs of wires connected to 8GHz XDR-2 can't transfer any more data than 128 (+ a few ground) lines connected to 4GHz GDDR5.

It had a pin advantage back in the day, when it was competing against DDR2 on a bandwidth-per-pin level. :)

GDDR5 is awesome, but I just don't see it being the targeted memory for the chip. Now the RSX successor, sure.
 
My hope is that whatever tech they go with, they do it right and don't overshoot. I'd rather have a console with EVERY game at 720p with 4xAA and 8xAF, locked at 60fps with great textures and advanced lighting, than what we have this gen. Stuff all over the place.

I want to be able to pick up a game and know that I'll get consistent great performance from it.

I think 1080p with 4x AA and decent texture filtering will be an easy minimum to achieve. I mean, the RV770 is already killing 1920x1200 with 4x AA in a lot of games.

I know every gen we all say this will be the gen of 60fps for all games, but I really expect that by the time the next gen is out it will be smashing 1080p @ 60fps for most things. I literally cannot see why this won't happen.

It will have a chip 4x the power of the RV770 by then.

I mean, 32nm will get you 3x the transistors and at least 50% more clock than the RV770.
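That 3x figure is consistent with plain area scaling (a rough sketch of my own, assuming RV770's 55nm process and an ideal shrink):

```c
#include <stdio.h>

/* Rough density scaling from a 55nm process (RV770) to 32nm: transistor
 * density goes roughly with the inverse square of the feature size.
 * Ideal-shrink assumption only; real processes scale somewhat worse. */
int main(void)
{
    double from_nm = 55.0, to_nm = 32.0;
    double density_gain = (from_nm / to_nm) * (from_nm / to_nm);

    printf("density gain 55nm -> 32nm: %.2fx\n", density_gain);  /* ~2.95x */
    return 0;
}
```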
 
I don't really get the advantages of Rambus. What's the point in higher data rate for differential signalling when you need twice as many lines? You might as well use a double width single-ended system.

Am I missing something? Does Rambus need less than twice as many pins for the same data width? 64 pairs of wires connected to 8GHz XDR-2 can't transfer any more data than 128 (+ a few ground) lines connected to 4GHz GDDR5.

A major benefit of Rambus is granularity. In order to achieve the same bandwidth with non-serialized memory, you have to go wide. Not only does that cost in "pins", bonding and traces, but, importantly for consoles, it requires a certain number of physical devices.
The PS3 has 256MB of main memory, and the original design has 4 XDR DRAM chips of 512Mbit each. As lithography improves, these four could be reduced to a single 2Gbit device soldered to the mainboard, with all communication with the rest of the system unchanged. I'm convinced that at the end of the PS3 life cycle, we'll see this done. To do a similar consolidation with the 512Mbit GDDR3 chips, however, you would have to move away from commodity chips.
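The arithmetic behind that consolidation (my own check of the figures in the post): four 512 Mbit devices add up to exactly one 2 Gbit device, which is the 256 MB the PS3 ships with.

```c
#include <stdio.h>

/* Check the PS3 XDR example: 4 x 512 Mbit chips vs. 1 x 2 Gbit chip. */
int main(void)
{
    double mbit_per_chip = 512.0;
    double total_mbit    = 4.0 * mbit_per_chip;   /* 2048 Mbit = 2 Gbit */
    double total_mbyte   = total_mbit / 8.0;      /* 256 MB             */

    printf("total: %.0f Mbit = %.0f MB\n", total_mbit, total_mbyte);
    return 0;
}
```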
 
I think 1080p with 4x AA and decent texture filtering will be an easy minimum to achieve. I mean, the RV770 is already killing 1920x1200 with 4x AA in a lot of games.

I know every gen we all say this will be the gen of 60fps for all games, but I really expect that by the time the next gen is out it will be smashing 1080p @ 60fps for most things. I literally cannot see why this won't happen.

It will have a chip 4x the power of the RV770 by then.

I mean, 32nm will get you 3x the transistors and at least 50% more clock than the RV770.

Looking at how close to reality GT5P looks sometimes, and considering it's essentially running on a 7900GT, I think in the next generation, with the power of the GPUs by then (they will have what, 5-10x the power of an RSX?), close-to-photorealistic graphics can easily be achieved at 1080p at 60fps, and after that GPU power shouldn't really matter that much.

In the next-next gen, the GPUs will be what, 40x more powerful than current tech? GPU power will eventually become an extremely rare bottleneck (unless they start processing other things as well).
 
With two (CrossFired) HD 4870X2 cards (4 GPUs) in a high-end PC, fall 2008. That's just short of 4 billion transistors (3824M) and almost 5 TeraFLOPs of computing power (at least 4.8 TeraFLOPs, depending on clock).
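Those totals check out from per-GPU figures (a sketch using RV770's commonly quoted specs, which are my addition, not from the post): roughly 956M transistors and 800 ALUs at 750 MHz per GPU, times four GPUs.

```c
#include <stdio.h>

/* Sanity-check the quad-RV770 totals.  Per-GPU figures (956M transistors,
 * 800 ALUs, 750 MHz, 2 flops per ALU per clock for a multiply-add) are the
 * commonly quoted HD 4870 specs, added here for illustration. */
int main(void)
{
    int    gpus        = 4;
    double transistors = 956e6;
    double gflops      = 800 * 2 * 0.750;   /* 1200 GFLOPs per GPU */

    printf("transistors: %.0f M\n", gpus * transistors / 1e6);     /* 3824 M */
    printf("compute:     %.1f TFLOPs\n", gpus * gflops / 1000.0);  /* 4.8    */
    return 0;
}
```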

I'm sure AMD's R8xx and R9xx will have greater performance per W, per mm2 and per transistor. Hopefully the next-gen Xbox uses some derivative of R9xx on 32nm or even 22nm, or some half node between 32nm and 22nm.

R8xx will be DX11 / D3D11 / Shader Model 5.


speculation:

I assume R9xx will probably be a major overhaul/upgrade of R8xx, a la R7xx over R6xx. The next Xbox GPU should be a further customized variant of R9xx, without any of the "fat" that PC GPUs need.

1080p will be the max resolution. 60fps should be no problem even with a significant increase in graphical detail/complexity and image quality. However, many developers will want to push the detail up so much in some games that they'll target 30fps again. Still, 60fps should be at least somewhat more commonplace than it is this gen.

Hey, and I didn't even mention the rumor that AMD has R1000 / R1xxx on the drawing board, if Inq is to be believed. They DO get some things right, some of the time.
 