Predict: The Next Generation Console Tech

Right, but we have no idea how Larrabee will compare to chips from ATI and NVidia in sustained performance for graphical workloads, particularly immediate mode rendering.

I'm not sure Intel is even going to try achieving the amazing latency hiding and thread interleaving that ATI and NVidia do with their shader engines. The way the register file, texture units, thread scheduler, etc. are organized just seems too specialized for graphical workloads.
Latency hiding for graphics work might not be as hard as for general-purpose computing. For arithmetic instructions you "just" need a lot of wide registers. SIMD with 16 float elements like on G80 is already nice; if you could have 32, 64 or 128 of them, you could do quite some work before you need the results. For memory, if you had a 'prefetch' instruction, you could get away with simulating your 'threading' that way.

Of course, it's essential that all pixels follow the same instruction path, but that was/is kind of the same issue with GPUs nowadays.
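
To make the "prefetch instead of hardware threads" idea concrete, here's a minimal host-side C sketch (the function and the placeholder math are made up; __builtin_prefetch is the GCC/Clang intrinsic):

Code:
// Software-pipelined loop: start the load for the next pixel, then spend the
// memory latency on independent arithmetic for the current pixel.
void shade_pixels(const float* depth, float* out, int n)
{
    for (int i = 0; i < n; ++i)
    {
        if (i + 1 < n)
            __builtin_prefetch(&depth[i + 1]);   // pull the next pixel into cache now

        float d = depth[i];
        out[i] = d * d * 0.5f + 0.25f;           // placeholder ALU work on the current pixel
    }
}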
 
Larrabee with the new Intel 2T eDRAM would be a monster. Interestingly enough, the projected clock speed of Larrabee is the same as the max clock speed for Intel 2T eDRAM @ 2GHz.


Larrabee GPU with 2T eDRAM @ 128GB/s of bandwidth.
 
Granted, I'm sure processing such as AI, physics & animation could probably benefit from a speed boost over traditional CPUs. However, even the CPUs of today are hardly "traditional" (CELL in particular), and if, going forward, we're going to see the next generation of CPUs advancing in terms of parallel processing capacity (8+ cores, tens of threads, wider SIMD & greater floating point perf etc.), then I STILL don't see why any console developer would have any reason to move all these non-graphics-centric tasks onto (what is, in the console space, essentially) an already overloaded GPU.
The reason is simple: because it can be faster, simpler to implement, or just because it's not possible on GPUs with the default API.
Some algorithms don't fit the pure GPU way of working, so they're either still done on the GPU, but less efficiently, or they're done on the CPU, which means less load on the GPU, but probably not with the quality you'd get on the GPU.

e.g. you want to add fog as a postprocess. With a pixel shader you read the depth of each pixel and calculate some alpha-blend value. With GPGPU you can run one 'kernel' per 4x4 pixel area, calculating the fog for the first pixel and checking for each following pixel whether its depth value is within a range of e.g. +-5%; if so, blend the fog with the same intensity, otherwise recalculate the fog intensity.

In the worst case you'll get the same performance as with the pixel shader, recalculating fog for every pixel, but more likely nearby areas will get a similar intensity, which makes the fog 1/16th of the cost.
You can do a lot more optimizations, because a lot of the input (e.g. textures) is the same for groups of pixels, like sampling through volume textures for clouds: the first few iterations can be shared by all 16 pixels, and once you're above some threshold you can add samples for every single pixel.
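
Roughly like this as a CUDA-style kernel, one thread per 4x4 tile (the fog curve and all names are made up; a real version would read the depth buffer through a texture and handle tile borders properly):

Code:
__global__ void fogTiles(const float* depth, float* fogOut, int width, int height)
{
    // one thread per 4x4 tile; assumes width/height are multiples of 4 for brevity
    int tileX = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    int tileY = (blockIdx.y * blockDim.y + threadIdx.y) * 4;
    if (tileX >= width || tileY >= height) return;

    float refDepth = depth[tileY * width + tileX];
    float refFog   = 1.0f - expf(-refDepth * 0.01f);        // placeholder fog curve

    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
        {
            int   idx = (tileY + y) * width + (tileX + x);
            float d   = depth[idx];
            // within ~5% of the first pixel's depth: reuse its intensity,
            // otherwise recompute the fog for this pixel
            fogOut[idx] = (fabsf(d - refDepth) < 0.05f * refDepth)
                        ? refFog
                        : 1.0f - expf(-d * 0.01f);
        }
}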

On pixel shaders you often do the same work on nearby pixels, just because you can barely exchange data between them.
Additionally you get more freedom to implement ideas that weren't possible on the GPU without big headaches, like compressing DXT textures, e.g. if you render some impostor billboards for the meteoroids of a space game at runtime. Or you can skin several objects at the same time, e.g. a group of soldiers may have a very similar pose, just offset by position, or you can skin them directly for several viewports, e.g. the player's camera and some shadow-buffer views. (Yes, you could do that already with geometry shaders, but not on current consoles, and also not with all the freedom of GPGPU.)
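
For the "group of soldiers sharing a pose" case, a rough CUDA sketch (one bone per vertex to keep it short; all names and the data layout are assumptions):

Code:
__global__ void skinSharedPose(const float3* bindPos,       // base mesh vertices
                               const float*  boneMatrices,  // 3x4 matrices, one shared pose
                               const int*    boneIndex,     // one bone per vertex
                               const float3* instanceOffset,
                               float3*       outPos,
                               int numVerts, int numInstances)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;

    // skin the vertex once against the shared pose...
    const float* m = &boneMatrices[boneIndex[v] * 12];
    float3 p = bindPos[v];
    float3 s = make_float3(m[0]*p.x + m[1]*p.y + m[2]*p.z  + m[3],
                           m[4]*p.x + m[5]*p.y + m[6]*p.z  + m[7],
                           m[8]*p.x + m[9]*p.y + m[10]*p.z + m[11]);

    // ...and reuse the result for every instance, which only differs by an offset
    for (int i = 0; i < numInstances; ++i)
    {
        float3 o = instanceOffset[i];
        outPos[i * numVerts + v] = make_float3(s.x + o.x, s.y + o.y, s.z + o.z);
    }
}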

Besides that, do modern GPGPU solutions provide the means to execute both core graphics & GP code at the same time..?
You can run CUDA and normal OpenGL/D3D stuff at the same time, but the driver will probably interleave your work.
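
Something along these lines with CUDA's OpenGL interop (sketch only; the kernel, buffer names and GL setup are assumptions, and the driver still serializes the kernel and the draw calls behind the scenes):

Code:
#include <cuda_gl_interop.h>   // assumes a GL context and the GL headers are already set up

// hypothetical kernel that animates vertex data in place
__global__ void updateVertices(float* pos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += 0.001f;
}

void frame(GLuint vbo, cudaGraphicsResource*& cudaVbo, int numFloats)
{
    if (!cudaVbo)   // register the GL vertex buffer with CUDA once
        cudaGraphicsGLRegisterBuffer(&cudaVbo, vbo, cudaGraphicsRegisterFlagsNone);

    // hand the buffer to CUDA and run the "GPGPU" pass...
    cudaGraphicsMapResources(1, &cudaVbo, 0);
    float* devPtr = 0; size_t bytes = 0;
    cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &bytes, cudaVbo);
    updateVertices<<<(numFloats + 255) / 256, 256>>>(devPtr, numFloats);
    cudaGraphicsUnmapResources(1, &cudaVbo, 0);

    // ...then give it back to GL and issue the normal draw calls
    // (glBindBuffer / glDrawArrays etc. go here).
}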


Another big benefit for consoles with GPGPU would be (if they merged CPU+GPU) that you have a better chance at load balancing. Some games are CPU bound, some are GPU bound; if you had the freedom to decide yourself how much of what work you put on the ..hmm.. unified-PU core, you'd make better use of it. Will you do skinning on the SPU or the GPU? Will you do postprocessing on the SPU or the GPU? Maybe dynamically decide by the amount of load they have? How will you schedule all the work between SPU/GPU, and how will you manage all the buffers?
With just one unit doing all the work, you have a bit less headache about that :)
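
The "decide dynamically by load" bit could start out as dumb as this on the host side (purely illustrative; the timer inputs and both code paths are assumptions):

Code:
struct Scene;                    // whatever the engine's scene data is
void skinOnCpu(const Scene&);    // hypothetical CPU/SPU path
void skinOnGpu(const Scene&);    // hypothetical GPU/GPGPU path

// pick where skinning runs this frame based on who was busier last frame
void submitSkinning(float cpuFrameMs, float gpuFrameMs, const Scene& scene)
{
    if (gpuFrameMs > cpuFrameMs * 1.2f)
        skinOnCpu(scene);        // the GPU was the bottleneck last frame
    else
        skinOnGpu(scene);        // otherwise let the GPU/GPGPU side take it
}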
 
Larrabee with the new Intel 2T eDRAM would be a monster. Interestingly enough, the projected clock speed of Larrabee is the same as the max clock speed for Intel 2T eDRAM @ 2GHz.


Larrabee GPU with 2T eDRAM @ 128GB/s of bandwidth.
Don't they already have a huge cache planned for it?
eDRAM is still a DRAM architecture, which has suboptimal random access times. I'd prefer some SRAM if you need to feed a lot of cores doing random stuff.

Or are you talking about eDRAM as a replacement for GDDR RAM? Then I'd agree, it'd be a monster :), but also price-wise :(
 
Rapso, in case you missed the announcement:
http://www.tomshardware.com/news/Intel-DRAM-CPU,5697.html

And I'm happy that with my little knowledge I came to the same conclusion as you in regard to this:
Another big benefit for consoles with GPGPU would be (if they merged CPU+GPU) that you have a better chance at load balancing. Some games are CPU bound, some are GPU bound; if you had the freedom to decide yourself how much of what work you put on the ..hmm.. unified-PU core, you'd make better use of it. Will you do skinning on the SPU or the GPU? Will you do postprocessing on the SPU or the GPU? Maybe dynamically decide by the amount of load they have? How will you schedule all the work between SPU/GPU, and how will you manage all the buffers?
With just one unit doing all the work, you have a bit less headache about that
By the way, one could be picky and consider the GPU as being made of two types of units:
Texture units
SIMD arrays
...
:LOL:
 
Don't they already have a huge cache planned for it?
eDRAM is still a DRAM architecture, which has suboptimal random access times. I'd prefer some SRAM if you need to feed a lot of cores doing random stuff.

Or are you talking about eDRAM as a replacement for GDDR RAM? Then I'd agree, it'd be a monster :), but also price-wise :(


Intel developed 2T eDRAM. SRAM takes six transistors per cell; this only takes two. It can't be clocked as high as SRAM, so it couldn't be used on something like the CELL SPUs, which are clocked at 3.2 GHz. But since it's a lot smaller, you can put more on the die.



A Larrabee core with some Intel 2T eDRAM and a 128-bit bus connected to a 2GB memory pool of GDDR5.
 
Latency hiding for graphics work might not be as hard as for general-purpose computing. For arithmetic instructions you "just" need a lot of wide registers. SIMD with 16 float elements like on G80 is already nice; if you could have 32, 64 or 128 of them, you could do quite some work before you need the results. For memory, if you had a 'prefetch' instruction, you could get away with simulating your 'threading' that way.
If it was that easy there would be a lot more players in the graphics field today. Arithmetic instructions are never the issue. It's hiding the hundreds of cycles of latency with texturing that separates the men from the boys.

You need lots of registers, you need lots of bandwidth to/from it to get your operands, you need thread scheduling to figure out which batches are ready, etc.

If you look at the kind of densities that GPU makers are achieving, IMO it doesn't really make sense to modify CPU cores to be good at those workloads. The converse might be useful, but it's a bit of a challenge to keep them small and efficient while also being shared with the CPU.
 
If it was that easy there would be a lot more players in the graphics field today. Arithmetic instructions are never the issue. It's hiding the hundreds of cycles of latency with texturing that separates the men from the boys.
That hiding is done by using that amount of threads. Usually just the first texture fetch is a problem, as you need to set a new address at the memory controller; all nearby pixels will then just read from the cache. That's why you need a lot of threads to hide this first read. The filtering and perspective-division latency is moving more and more towards the shading units; the TMUs will degenerate more and more into some kind of Fetch4 units, I guess.

and "more players in the graphics field "? i think that's more a business decision, not a decision of capability to build texturunits, every on-board gpu has them.


You need lots of registers, you need lots of bandwidth to/from it to get your operands, you need thread scheduling to figure out which batches are ready, etc.
Thread scheduling might cost you more than doing it the brute-force way and just executing one instruction for, let's say, 64 pixels. 64 pixels, each using 2 temporary registers, results in 128 needed registers; that many are already on the SPUs. But that might explain why the NV GPUs since the GeForce 5800 are so heavily speed-limited with shaders that use a lot of those temp registers.

If you look at the CUDA doc, you'll see that a memory read that does not match a 'preferred' pattern across one kernel will result in stalls of around 300 cycles. NVidia and ATI/AMD suggest having an arithmetic:texture-read ratio of 6:1. That would result in about 50 pixels shaded at the same time.
I guess scheduling is done on a higher level to balance the load between different tasks like vertex/geometry/pixel shaders.
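
Rough back-of-the-envelope for where that ~50 comes from, assuming the numbers above: 300 stall cycles / 6 arithmetic instructions issued per texture read ≈ 50 pixels that have to be in flight to keep the ALUs busy across one uncached fetch.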

If you look at the kind of densities that GPU makers are achieving, IMO it doesn't really make sense to modify CPU cores to be good at those workloads. The converse might be useful, but it's a bit of a challenge to keep them small and efficient while also being shared with the CPU.
But the problem is that the GPU makers are moving more and more towards the tasks that are usually suitable for CPUs. So while the CPUs move towards wider registers (e.g. 8*float on some upcoming Intel chips), the GPUs move to tasks with finer granularity, and they'll have to add caches to not be limited by stacks and non-coherent jumps across those batches that they currently use.

That's why I think CPU and GPU will get very similar and merge one day, not because they want it so hard, but because they adapt their hardware to the tasks.
and on consoles, due to
 
If you had a 'prefetch' instruction, you could get away with simulating your 'threading' that way.
The problem with software pipelining is that it's not very flexible (probably going to have to resort to runtime compilation to mix "threads"). Register windows could help, but that would be getting rather far from x86.
 
With PC graphics using 256-bit and 512-bit buses as of 2007-2008, and projecting the possibility of a larger-than-512-bit bus for PC by 2011/2012 (and I am talking about a normal external bus, not a ring bus or internal eDRAM bus), and seeing how bandwidth-constrained the 360/PS3 are with 128-bit buses, I do think consoles need to move beyond 128-bit.

Higher-clocked DRAMs are only gonna go so far. Maybe the answer will be if Rambus can really make 1 TB/sec RAM. Even 500 GB/sec would be enough.

I also believe eDRAM is still important. I suppose the next-gen Xbox could, like the Xbox 360, get away with low main memory bandwidth if it has ultra-high-bandwidth eDRAM in the several-TB/sec ballpark.
 
That hiding is done by using that amount of threads. Usually just the first texture fetch is a problem, as you need to set a new address at the memory controller; all nearby pixels will then just read from the cache. That's why you need a lot of threads to hide this first read.
Does it matter? The fact is that for efficient utilization you still need lots of threads with customized thread scheduling. If you fetch too many texels based on a guess from the first fetch, then you get poor bandwidth usage.

The filtering and perspective-division latency is moving more and more towards the shading units; the TMUs will degenerate more and more into some kind of Fetch4 units, I guess.
This is not realistic. Texture filtering is low precision (at full speed) and fits in a very specific data path so it has a fraction of the cost of a shader unit. Having it parallel also allows variable fetch cost (e.g. aniso, wide formats, etc) to be hidden by shader instructions.

and "more players in the graphics field "? i think that's more a business decision, not a decision of capability to build texturunits, every on-board gpu has them.
It's a business decision because they can't achieve the high efficiency of ATI and NVidia. S3, XGI, Trident etc. failed to break through in recent years because of this, as you could definitely see dependent texturing killing them in performance.

You need high silicon and bandwidth utilization to compete. CPU standards of efficiency are way too low.

But the problem is that the GPU makers are moving more and more towards the tasks that are usually suitable for CPUs. So while the CPUs move towards wider registers (e.g. 8*float on some upcoming Intel chips), the GPUs move to tasks with finer granularity, and they'll have to add caches to not be limited by stacks and non-coherent jumps across those batches that they currently use.

That's why I think CPU and GPU will get very similar and merge one day, not because they want it so hard, but because they adapt their hardware to the tasks.
and on consoles, due to
It'll never merge completely due to fundamental differences in the workloads. If your units get too much larger trying to accommodate non-GPU loads, then there's an opportunity for the competition to crush you in perf/mm2 by going back to basics. This is what happened in the G7x vs. R5xx generation: NVidia had record profits and margins, whereas ATI barely broke even. RV530 vs. G73 was particularly devastating.
 
I think for next gen the biggest focus on the minds of devs and manufacturers will be on development and distribution.
Reducing dev costs and finding ways to more quickly distribute large files. That's the kind of background stuff that gamers don't pay as much attention to and isn't as sexy as flashier graphics.
 
Also, if Nintendo are hoping to provide hardware more in line with what current-gen performance systems are offering in terms of HD graphics, I fail to see how even 8GB of game storage will ever be sufficient going forward..

The X360 has an 8.5GB drive, actually.
Also, if you want to stream content directly from the media: with solid state there's no need for data redundancy to minimise seeks.

Mfa put it better: cost is the problem.
Though games are expensive anyway. If I'm to buy and collect them I should be entitled to nice, bulky cartridges (Japanese Mega Drive carts look great :)). I like the convenience and reliability (dealing with CD cases, or cases at all, is boring, and an all-solid-state console is better). I even feel I should get a gold-plated connector for a 50€ game, which for many is a bit of a luxury item.
They can make a few million dollars less profit and offer something I'll like better; this should be about the customer's interests sometimes.
 
Sorry, this was the thread of conversation as I saw it:

Also, if Nintendo are hoping to provide hardware more in line with what current-gen performance systems are offering in terms of HD graphics, I fail to see how even 8GB of game storage will ever be sufficient going forward..

The X360 has an 8.5GB drive, actually.

So I pointed out that the X360 was providing current-gen HD graphics with only 6.8GB available.
 