What custom hardware features could benefit a console?

For voxels, how much of a speed-up would they get using custom silicon, versus simply using the GPU?
Wouldn't it be better to modify the GPU cores to make them more efficient at ray casting, and to add a lot of local cache on the GPU?
 
A raytracing chip wouldn't just be for graphics. However, there's a whole discussion here that's bigger than the intention of this thread. I point you here for further research and discourse.

Right, I understand it would have generic compute performance. But whether that performance would be fully utilized in all cases is my question.

For voxels, how much of a speed-up would they get using custom silicon, versus simply using the GPU?
Wouldn't it be better to modify the GPU cores to make them more efficient at ray casting, and to add a lot of local cache on the GPU?

That was my thought too. Carmack sees SVO as an extension of megatexturing. GCN already has PRT hardware built in, so why not add some functionality to speed up SVO?
 
A raytracing chip wouldn't just be for graphics. However, there's a whole discussion here that's bigger than the intention of this thread. I point you here for further research and discourse.

I could see value in a "ray-tracing chip", but I have to wonder how exactly it would work: you'd have to have some sort of scene graph that can be walked to cast rays.
I have to wonder if a hardware solution to that wouldn't end up being inflexible, leading to a lot of duplicate data. You'd have the game's scene graph, the physics database, and the one you use for raytracing. Then you have to wonder whether the HW data structure for efficient raytracing is amenable to dynamic update...

Would be interesting though, with potential uses in graphics, audio and gameplay.
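
To make the "walkable structure" worry concrete, here's a minimal sketch (in C++, with an invented node layout, not anything a vendor has proposed) of the kind of bounding volume hierarchy a ray-casting unit would have to traverse. Keeping something like this in sync with the game's scene graph and the physics database is exactly the duplicate-data problem, and refitting it every frame is the dynamic-update problem.

```cpp
// Minimal BVH traversal sketch (hypothetical layout, not any real hardware
// format). A fixed-function ray unit would need *some* structure like this
// resident in memory, rebuilt or refitted whenever the scene animates.
#include <cstdint>
#include <vector>

struct AABB { float lo[3], hi[3]; };
struct BVHNode {
    AABB     bounds;
    uint32_t leftChild;   // index of first child; right child is leftChild + 1
    uint32_t firstTri;    // valid only for leaves
    uint16_t triCount;    // 0 => interior node
    uint16_t pad;
};
struct Ray { float org[3], invDir[3], tMax; };

// Standard slab test: does the ray hit this box before tMax?
static bool hitAABB(const Ray& r, const AABB& b) {
    float tmin = 0.0f, tmax = r.tMax;
    for (int a = 0; a < 3; ++a) {
        float t0 = (b.lo[a] - r.org[a]) * r.invDir[a];
        float t1 = (b.hi[a] - r.org[a]) * r.invDir[a];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;
    }
    return true;
}

// Iterative traversal with an explicit stack -- the part a "ray-tracing chip"
// would hard-wire. Leaf handling (triangle tests) is left to a callback.
template <typename LeafFn>
void traverse(const std::vector<BVHNode>& nodes, const Ray& ray, LeafFn&& onLeaf) {
    uint32_t stack[64];
    int sp = 0;
    stack[sp++] = 0;                        // root
    while (sp > 0) {
        const BVHNode& n = nodes[stack[--sp]];
        if (!hitAABB(ray, n.bounds)) continue;
        if (n.triCount > 0) {
            onLeaf(n.firstTri, n.triCount); // intersect triangles here
        } else {
            stack[sp++] = n.leftChild;
            stack[sp++] = n.leftChild + 1;
        }
    }
}
```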

Also not a fan of the blitter idea. If the idea is to move things to fast memory for the GPU, then I'd do it with the GPU, because you don't have to synchronize anything explicitly.
If the idea is to be able to move memory around for the CPU, how big do the copies have to be to offset the latency?
If the idea is format conversion for GPU formats, I still don't get it: you preswizzle textures in your asset pipeline, and GPUs can read and write swizzled/unswizzled and tiled/untiled formats with varying performance penalties, so they already have all of the functionality of the proposed "blitter", and you don't have the synchronization issue you would if it were a separate unit.
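
For reference, this is roughly what "preswizzle textures in your asset pipeline" means; Morton/Z-order is one common flavour of swizzling, though the exact layout on any particular GPU differs and this sketch is only illustrative.

```cpp
// Sketch of Morton (Z-order) texel swizzling -- one common flavour of the
// "swizzled/tiled" layouts GPUs prefer. Real hardware layouts differ per
// vendor; this just illustrates what an offline asset pipeline does so the
// GPU never needs a separate convert pass at runtime.
#include <cstdint>

// Spread the low 16 bits of x apart with zero bits in between.
static uint32_t part1by1(uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

// Linear (x, y) -> Morton-swizzled texel index.
static uint32_t mortonIndex(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}

// Offline reorder of a square power-of-two RGBA8 texture into Morton order.
void swizzleTexture(const uint32_t* linear, uint32_t* swizzled, uint32_t size) {
    for (uint32_t y = 0; y < size; ++y)
        for (uint32_t x = 0; x < size; ++x)
            swizzled[mortonIndex(x, y)] = linear[y * size + x];
}
```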

For something to become fixed function, it really needs to have established functionality, i.e. it needs to help the way everyone does something, which is why I think fixed-function acceleration for things like SVOs is unlikely.

I could see having a separate set of CUs tacked onto the CPU for GPGPU work IF significant changes to them could dramatically increase their performance on less suitable workloads. I'm not sure I know what those changes would look like.

I'd guess most of the units are incredibly mundane:
Audio DSP
Compression/Decompression Engine
etc.
 
You can do something quite similar with mwait/monitor in a unified CPU-GPU memory space. On a console you could provide an API instead of allowing it to be executed in user mode.
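
For anyone unfamiliar with it, here's a rough sketch of the idea (names and the shared flag are my own; note MONITOR/MWAIT are privileged on stock x86, which is presumably why a console would wrap them in an API as suggested above):

```cpp
// Sketch of the monitor/mwait idea: a CPU thread arms a monitor on a cache
// line in CPU-GPU shared memory and sleeps until the GPU writes to it.
// NOTE: MONITOR/MWAIT are privileged on stock x86, so on a PC this would
// live in a driver; a console could expose it through an API instead.
#include <pmmintrin.h>   // _mm_monitor / _mm_mwait (SSE3)
#include <cstdint>

// 'doneFlag' is assumed to live in memory the GPU can write (unified memory).
void waitForGpu(volatile uint32_t* doneFlag) {
    while (*doneFlag == 0) {
        _mm_monitor((const void*)doneFlag, 0, 0); // arm the monitor on this line
        if (*doneFlag != 0) break;                // re-check to avoid a race
        _mm_mwait(0, 0);                          // sleep until the line is written
    }
}
// The GPU side simply writes 1 to *doneFlag (e.g. at the end of a compute
// kernel) to wake the CPU without polling or an interrupt round-trip.
```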

Thanks, I wasn't aware of that. It's pretty much what I was thinking of.

I'm well out of my depth here, but I was also thinking of something whereby the CPU would look at the upcoming code for the GPU and pull data and instructions either into L2 (if it was shared) or shunt them into a GPU cache, so that the GPU was ready to go as soon as the CPU handed over and the latency of getting a result back to the CPU was reduced.
 
Eric Mejdrich (now at Microsoft) has a patent on ray tracing. I know it doesn't necessarily mean anything, I just remembered it.
 
The use of raytracing in current/past games is limited to raymarching, but it is used to create a lot of the fine detail. SSAO uses (or should, if done right) raymarching, POM uses raymarching, etc. Voxel tracing could also be done via raymarching.
All of those techniques point to doing "raytracing" against a discrete raster instead of geometry: 2D textures, 3D textures. These textures mostly have mipmaps already, so it's not really a problem to extend the raymarching query to use the multi-resolution information in there; you just have to provide min-max mipmaps instead. A GPU with a min-max z-buffer and a general ability to query vector-raster intersection points would be quite something.
It's possibly not worth it, but maybe more worthwhile than geometry raytracing. It's also quite open for abuse (complex LUT mechanisms could be implemented with it).
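
As a rough illustration of the min-max mipmap idea (a simplified 1D height-field case with invented names, not any shipped technique): each mip level stores the max of the texels below it, so the march can skip whole spans the ray passes safely above.

```cpp
// Sketch: hierarchical ray march against a 1D height field using a "max
// mipmap". Each level stores the max of 2^level base texels, so spans the
// ray clears can be skipped in one step -- the same idea behind hierarchical
// POM / Hi-Z style tracing. A hardware "vector vs. raster intersection
// query" would essentially bake this loop.
#include <vector>

// maxMip[0] is the full-resolution height field (size is a power of two);
// maxMip[L][i] = max of maxMip[L-1][2i] and maxMip[L-1][2i+1].
using MaxMip = std::vector<std::vector<float>>;

// Ray: height(x) = startH + slope * x, marching in +x, assumed descending
// (slope <= 0). Returns the first base texel hit, or -1 for a miss.
int traceHeightfield(const MaxMip& maxMip, float startH, float slope) {
    const int top = (int)maxMip.size() - 1;
    const int n   = (int)maxMip[0].size();
    int   level = top;
    float x     = 0.0f;
    while (x < (float)n) {
        const int   cellSize = 1 << level;
        const int   cell     = (int)x / cellSize;
        const float cellEnd  = (float)((cell + 1) * cellSize);
        const float zEnd     = startH + slope * cellEnd; // lowest ray point in this cell
        if (zEnd > maxMip[level][cell]) {
            x = cellEnd;                 // ray clears everything here: skip the whole cell
            if (level < top) ++level;    // pop back up and try a coarser step
        } else if (level == 0) {
            return cell;                 // ray dips below the surface in this texel
        } else {
            --level;                     // possible hit: refine
        }
    }
    return -1;
}
```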
 
Haha, I am curious about geometry raytracing and other "pseudo-GPU" geometry work because the problem area seems to lie between the CPU and GPU. All the raster operations the GPU should be able to tackle adequately with no or only minor modifications.

If I were Microsoft or Sony, I would explore the following:
Create an independent 3D environment model using the camera(s) and player sensor data, then let the GPU merge the in-game content with the "real life" model. :runaway:

If it's pure graphics work, it may be "easier" to enhance the GPU.
 
Also not a fan of the blitter idea. If the idea is to move things to fast memory for the GPU, then I'd do it with the GPU, because you don't have to synchronize anything explicitly.
Aww you're no fun :p but yeah wasn't meant as a very serious idea.


I could see value in a "ray-tracing chip", but I have to wonder how exactly it would work: you'd have to have some sort of scene graph that can be walked to cast rays.
I can't see them enforcing a scene graph structure; it goes against the current vogue for programmability and personal choice. In which case you'd also need quite complex DMA/structured-walk hardware just to get the data near the ray caster...
Now something like that could be useful not just for ray casting/tracing but for other compute/GPU things. Effectively a programmable complex prefetcher, though I don't know if it would be better than the automatic prefetch logic in most out-of-order CPUs, in which case just add something like that...
 
What about

Some kind of fast, high-quality scaler that can be used at will on textures/buffers in memory? That way you could do a fast resize of, say, an alpha buffer or reflection map, and you could dynamically change resolution without having to rely on a fast but shitty bilinear upscale on the GPU in order to match output resolution.

Maybe you could set it to work on textures or buffers copied out of the edram, as part of the ... whatever it is that copies stuff out of the edram. Then you wouldn't need an intermediate buffer and you wouldn't need to keep the buffer resident in edram (wasting GPU time) while you completed the upscale.
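
As a sketch of what such a scaler would buy you over bilinear (filter choice and names are my own, not a claim about any actual hardware), here's a separable Catmull-Rom resample in 1D; a 2D scaler would run it horizontally and then vertically.

```cpp
// Sketch of the kind of filtering a fixed-function "quality scaler" could
// offer over plain bilinear: a separable Catmull-Rom (bicubic) resample,
// shown in 1D for brevity. All names are illustrative.
#include <algorithm>
#include <cmath>
#include <vector>

// Catmull-Rom weight for a sample at signed distance t (|t| < 2) from the output tap.
static float catmullRom(float t) {
    t = std::fabs(t);
    if (t < 1.0f) return 1.5f * t * t * t - 2.5f * t * t + 1.0f;
    if (t < 2.0f) return -0.5f * t * t * t + 2.5f * t * t - 4.0f * t + 2.0f;
    return 0.0f;
}

// Resample a row of samples to 'dstW' samples (e.g. an upscale to output resolution).
std::vector<float> resampleRow(const std::vector<float>& src, int dstW) {
    const int srcW = (int)src.size();
    std::vector<float> dst(dstW);
    for (int x = 0; x < dstW; ++x) {
        const float srcX = (x + 0.5f) * srcW / dstW - 0.5f; // output pixel centre in source space
        const int   base = (int)std::floor(srcX);
        float sum = 0.0f, wsum = 0.0f;
        for (int k = -1; k <= 2; ++k) {                     // 4-tap kernel
            const int   i = std::clamp(base + k, 0, srcW - 1);
            const float w = catmullRom(srcX - (float)(base + k));
            sum  += w * src[i];
            wsum += w;
        }
        dst[x] = sum / wsum;                                // normalise (edge clamping skews weights)
    }
    return dst;
}
```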
 
I can't see them enforcing a scene graph structure; it goes against the current vogue for programmability and personal choice. In which case you'd also need quite complex DMA/structured-walk hardware just to get the data near the ray caster...
Now something like that could be useful not just for ray casting/tracing but for other compute/GPU things. Effectively a programmable complex prefetcher, though I don't know if it would be better than the automatic prefetch logic in most out-of-order CPUs, in which case just add something like that...

There have been GPU front ends that allowed hierarchical walks of display lists, little more than embedded predicates. The issue is always having to wait for the test before proceeding to read the display list, resulting in huge GPU stalls, which generally means it's never really been practical to use features like this.
I think any workable prefetcher for something like raytracing would end up like the front end of a modern GPU: basically you'd have thousands of threads in flight to hide the memory latency, plus a cache. Where it would differ is that every thread could be at a different point in the "scene description" or predicated "DMA description"; I can't personally see something like that being made very generic.
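
As a CPU-side caricature of the "thousands of threads in flight" point (purely illustrative, with an invented node format): keep many traversal states live and round-robin between them, so each state's memory fetch overlaps the others' work.

```cpp
// CPU-side analogy only: a real GPU front end does this in hardware with
// warps/wavefronts. Many walks are kept in flight; each pass advances every
// live walk one node, and the fetch for its *next* node is hinted early so
// it is hopefully in cache by the following pass.
#include <cstdint>
#include <vector>

struct Node { uint32_t next[2]; bool leaf; };   // stand-in "scene description"

struct Walk {
    uint32_t node = 0;      // where this ray currently is in the structure
    bool     done = false;
};

void stepAll(const std::vector<Node>& nodes, std::vector<Walk>& walks) {
    for (Walk& w : walks) {
        if (w.done) continue;
        const Node& n = nodes[w.node];
        if (n.leaf) { w.done = true; continue; }
        const uint32_t nextIdx = n.next[0];     // which child depends on the ray; simplified here
        __builtin_prefetch(&nodes[nextIdx]);    // GCC/Clang builtin: fetch the node we read next pass
        w.node = nextIdx;
    }
}
```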
 
A custom wavefront coalescing device for cases of divergent wavefront execution would be interesting. It would probably require a very custom GPU to be able to sport it. A break from quad-based pixel pipelines would be another way to really get the most out of the hardware, but nothing's indicated that.

If the system really is tuned to have as many disparate computing elements as some speculate: a hardware heterogeneous-system thread arbiter/controller to accelerate thread movement/data passing. :runaway:
 
how about "blitter"= dma engine with swizzles/permutes , able to copy compressed data from main memory to/from either CPU or GPU caches.
i'm thinking of sometihing like the PS2's VIF, but more versatile
imagine compressed VBOs. (greater levels of indexing, storage schemes which can't be homogenously randomly-acessed like traditional GPU data) (again , do i remember correctly that the VIF could decompress vertices stored as deltas.. )
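
As a toy illustration of the delta idea (the format here is invented, and the VIF detail above is just a recollection): a base position followed by quantized 16-bit deltas, expanded into ordinary randomly-accessible vertices on the way in.

```cpp
// Toy sketch of a delta-compressed vertex stream, the sort of thing a
// VIF-style unpack unit could expand on the way into GPU-visible memory.
// The format is invented for illustration: one float base position, then
// 16-bit quantized deltas per component per vertex.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

std::vector<Float3> unpackDeltaVerts(const Float3& base,
                                     const int16_t* deltas,  // 3 per vertex
                                     size_t count,
                                     float scale)            // quantization step
{
    std::vector<Float3> out(count);
    Float3 cur = base;
    for (size_t i = 0; i < count; ++i) {
        cur.x += deltas[i * 3 + 0] * scale;
        cur.y += deltas[i * 3 + 1] * scale;
        cur.z += deltas[i * 3 + 2] * scale;
        out[i] = cur;   // expanded, randomly-accessible vertex the GPU can index
    }
    return out;
}
```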

But all that sounds quite far off the "easy to program" philosophy and more like the PS2/PS3, and I think more of "image data" when I hear the word blitter.

Something for post-processing sounds most likely to me: 'smart resolve', hardware MLAA support perhaps, or maybe assists specific to rendering techniques with many intermediate buffers?
 
Would a straight memory-copy unit have some use?
I don't believe either of these strongly, but throwing the ideas out there...

[1] On an Amiga-related forum I remember someone claiming IBM were looking into a separate DMA-copy unit for their POWER-architecture machines.

[2] What about a copy unit as acceleration for garbage collection?
Functional languages are popular for multithreading. (Again, "nurseries" = caches; imagine if you could blit from them when a generation 'ages'.)
This all sounds crazy though, because graphics-intensive games are still C++ (and I'm really skeptical about GC runtimes). And again, that all sounds more Cell-like and way off the "traditional architecture" rumours.

(Hmm.. but Microsoft are keen on C#, aren't they.)

Again, this sounds unlikely because it would all be a CPU assist, not a GPU assist - and if it were possible/feasible it might have appeared on general-purpose architectures already.
 
Out of curiosity, what is the primary consumer of bandwidth on a GPU?

Vertex data? Which I assume are arrays of single precision floats?

Or texture data? What format are textures typically stored in? Some sort of uncompressed RGB format?

Also, what is written back to memory (other than the frame buffer)?

I'm assuming that if anything can be compressed and decompressed on the fly, you can save bandwidth (and thus power); also, if you limit the amount of data written back to main memory and don't have to reread it, you're saving bandwidth as well.

Would something like this make any sense?

Main memory => main bus => decompressor for texture/vertex data => small (or large enough, depending on how you look at it) embedded memory pool for decompressed data => GPU processing <=> embedded memory large enough for a 1080p frame buffer
 
If there are Russian readers who are reposting the blitter idea as fact on other forums, please refer to the chip as "болезненно тучный ягненок" (roughly, "morbidly obese lamb"). It's the internal code name. You heard it here first.
 
LOL

A blitter is utterly useless on its own as a copy machine; you do not want to just copy pixels without processing them, and the general-purpose "DSP" you've got for that is called the GPU. It even has the blitter back-end operations (AND, OR, etc.) in its ROPs, so GPU ~= blitter. The other part of the blitter, the analog converter, is in the separate RAMDAC now (well, since the 1990s), and the last part (the Bresenham line stuff) is in some other chip which I don't even know, because nobody gives a sh*t about 2D operations in hardware nowadays. The blitter in Agnus has never been used as a copy machine; it was there to allow graphics manipulation, a 2D accelerator chip, to e.g. paint the lines of a rectangle in a texture pattern without ever making a "copy" and stippling it together with the CPU. You can think of the chip RAM as one huge volatile drawing/scratch board on which the blitter operated. Most of the time the memory of the previous scanline wasn't valid anymore one line later. It was for composition, not for memory copying.
If you don't do anything with the data while copying it, don't copy it. If everyone shouts for UMA (unified memory) it's because nobody wants to copy, it sucks!

Nicer would be to have a Copper; that was a mini-coprocessor inside Agnus which was able to program the blitter and manipulate the display scan independently of the main CPU. It could do that even mid-scanline, a fantastic demo processor, but it's totally impossible to do that in a console/graphics card now, as the RAMDAC and GPU are not merged silicon, the ROPs don't have wires to the scanline converter, the display buffer is not in some dedicated RAM setup, and whatnot; I don't want to imagine the bus protocol required for an Agnus-alike setup.

Translating the Copper concept to shader programs, though, would be interesting. A co-processor which can (re)program the GPU instruction stream independently of the CPU would be ... I guess lots of developers would think: awesome.
Imagine if you could shut down the CPU, keep just some electricity on the PCIe port (you can shut down the southbridge as well), and the demo continues playing ...
 
Out of curiosity, what is the primary consumer of bandwidth on a GPU?

Vertex data? Which I assume are arrays of single precision floats?

Or texture data? What format are textures typically stored in? Some sort of uncompressed RGB format?

Also, what is written back to memory (other than the frame buffer)?

I'm assuming that if anything can be compressed and decompressed on the fly, you can save bandwidth (and thus power) also if you limit the amount of things written back to main memory and don't have to reread them, you're saving bandwidth as well.

Would something like this make any sense?

Main memory => main bus => decompressor for texture/vertex data => small (or large enough depending on how you look at it) embedded memory pool for decompressed data => gpu processing <=> embedded memory large enough for a 1090p frame buffer
Texture and frame buffer traffic are typically the largest users of bandwidth, though which is larger and by how much varies from game to game.
 
Thanks. So then moving the frame buffer to high-speed embedded RAM would cut down on the bandwidth requirements in a measurable way.
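
As a rough back-of-the-envelope (all the numbers below are my assumptions, not measurements): at 1080p with 32-bit colour and depth, 3x average overdraw and read-modify-write blending/depth testing, frame buffer traffic alone is on the order of 100 MB per frame, i.e. several GB/s at 60 fps, which is roughly the slice an eDRAM would absorb.

```cpp
// Back-of-the-envelope frame buffer traffic. All assumptions are mine:
// 1080p, RGBA8 colour + 32-bit depth, 3x average overdraw, read+write for
// blend/depth test, 60 fps. Real numbers vary enormously per game.
#include <cstdio>

int main() {
    const double pixels    = 1920.0 * 1080.0;
    const double bppColor  = 4.0, bppDepth = 4.0;  // bytes per pixel
    const double overdraw  = 3.0;                  // average shaded samples per pixel
    const double rmw       = 2.0;                  // read + write per sample
    const double perFrame  = pixels * overdraw * rmw * (bppColor + bppDepth);
    const double perSecond = perFrame * 60.0;
    std::printf("~%.1f MB per frame, ~%.1f GB/s at 60 fps\n",
                perFrame / 1e6, perSecond / 1e9);  // prints ~99.5 MB, ~6.0 GB/s
    return 0;
}
```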
 
LOL
A blitter is utterly useless on its own as a copy machine; you do not want to just copy pixels without processing them, and the general-purpose "DSP" you've got for that is called the GPU,

I didn't ask for a blitter.. I'm just trying to speculate what some special hardware might be, and responding to the use of that word.
Many systems have had scratchpad memories to which you could DMA. Being able to copy *and decompress* is something that's appeared, as well as copying to and from scratchpads.

But maybe "blitter" has been thrown out as a red herring to distract us technical coders who grew up on them :)

Also, it WAS sometimes used for straight copies - saving/restoring backgrounds under sprites and windows :)

What about the ability to simply permute data for cache coherency?

If everyone shouts for UMA (unified memory) it's because nobody wants to copy, it sucks!
I think this is more to do with versatility, i.e. texture vs. vertex budgets and ease of porting.

Nicer would be to have a Copper; that was a mini-coprocessor inside Agnus which was able to program the blitter and manipulate the display scan independently of the main CPU,

The Copper again is just like a command list, which GPUs already have. In the PS2 there were codes in the DMA chain for writing to registers, and I'm sure GPUs have something similar; that's what the Copper did, it just happened to be tied to video frames.

On the "Natami" forums (it was an FPGA next-gen Amiga project :) ) they talked of adding a "BOPPER", a second Copper NOT tied to the display, which would function exactly like a modern display list.


Would a second cut-down, post-process-oriented GPU make sense?
Hardware fixed-function support for the most popular bloom and the newer types of antialiasing (MLAA)?
It might save silicon in various ways compared to a real GPU... simpler samplers? No triangle rasterizing, simpler state management?

I think there's talk of a separate system dedicated to the OS.. perhaps they will have dual-plane video output (like old parallax scrolling :) ) to avoid needing to composite.
 