What custom hardware features could benefit a console?

The blitter in Agnus has never been used as a copy-machine...
If you don't do anything with the data while copying it, don't copy it.
So when you have a sprite in RAM and want it in the framebuffer (in a different part of RAM), you...do...what exactly, if not copy the data across?

Nicer would be to have the Copper. That was a mini-coprocessor inside Agnus which was able to program the blitter and manipulate the display scan independently of the main CPU; it could do that even mid-scanline. A fantastic demo processor, but it's totally impossible to do that sort of thing in a console/graphics card now.
TVs don't use scanlines any more, so how is the Copper relevant? I didn't bother including the Copper in my list of possibly valuable features because it doesn't fit anywhere in the 3D pipeline, or even the 2D pipeline, now. We can easily generate rainbow backgrounds without loading a new value on every scanline (which don't exist any more).
 
Hey! Do you realize the Amiga was the best custom hardware in history? And it also had the coolest names for its chips! Where are those engineers?
 
Would a second cut-down post-process oriented GPU make sense?

I think it's very difficult to design a perfectly balanced chip. It would help, and you can make things faster, but there remains the question of whether the effort is worth it. If you take the DirectX pipeline as a rough model, then this would basically cut the chip in two, between the traditional shader pipeline and compute shaders. I don't believe many developers would be happy with a more restricted compute-only chip plus a general-purpose chip without compute. The fact that both functionalities are so near to each other is what makes keeping them together attractive: both share the same resources, ALUs, caches, etc.
I don't think the paradigm to use one chip for all is really a problem.
I implemented a software sphere-tracer a while ago, and its execution time was completely dominated by the z-buffer clear (30%). If I were in Microsoft's shoes I wouldn't look at general and broad performance issues; those will go away with new generations, or with brute-force designs that scale everything up until it's enough. I'd look more for hot-spots like the z-buffer clear, or things that don't scale well. Not beauty things like AA bandwidth or slow anisotropic filtering; you can live without those. But not without re-normalization of filtered normals, for example; that's a requirement. I don't have statistics of hot-spots over all game engines to make a good guess at what they may be, but AMD surely has the data, and Microsoft can pick the hot-spots it doesn't want and have them put in special hardware to solve them.
I also expect MS not to go with anything which strays away from the DirectX schemes. A ray-tracing chip would surprise me a lot; a chip related to shadow-mapping, not so much.
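Going back to the z-buffer-clear example, here's a minimal C sketch of why a software depth clear hurts; the resolution and the 600-frame loop are made up for illustration and aren't taken from the sphere-tracer above:

```c
#include <stddef.h>

#define WIDTH  1280   /* hypothetical resolution, just for illustration */
#define HEIGHT 720

static float zbuffer[WIDTH * HEIGHT];

/* Touches every depth value every frame, whether or not anything is drawn
 * there: about 3.5 MB of writes per frame at this resolution. */
static void clear_zbuffer(void)
{
    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; ++i)
        zbuffer[i] = 1.0f;   /* "far" in a 0..1 depth range */
}

static void render_frame(void)
{
    clear_zbuffer();         /* fixed cost: scales with resolution, not with the scene */
    /* ... trace or rasterize the scene, testing and writing zbuffer[] ... */
}

int main(void)
{
    for (int frame = 0; frame < 600; ++frame)
        render_frame();
    return 0;
}
```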

So when you have a sprite in RAM and want it in the framebuffer (in a different part of RAM), you...do...what exactly, if not copy the data across?

You'd replace a pointer if you could.

TVs don't use scanlines any more, so how is the Copper relevant?

The Copper was an autonomous co-processor able to program another co-processor on the fly. On the Xbox 360 you can prepare execution "lists" and then let the GPU go over them, but you had to use the CPU for that, and the lists were in GPU code. A Copper equivalent would have its own dialect, dedicated to producing those lists itself and to initiating their dispatch ... with a completely dead CPU.
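For anyone who never touched the Amiga, here's a rough C model of what such a list looks like. The WAIT/MOVE pair mirrors the real Copper's instruction set (it also had a SKIP), but the register numbers, values and the frame loop below are invented purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Two-instruction "Copper list": WAIT for a beam position, MOVE a value into
 * a chip register. */
enum { OP_MOVE, OP_WAIT, OP_END };

typedef struct {
    int      op;
    uint16_t a;      /* MOVE: register index   WAIT: scanline to wait for */
    uint16_t b;      /* MOVE: value to write */
} copper_op;

static uint16_t         chip_regs[256];   /* stand-in for custom-chip registers */
static const copper_op *pc;               /* the Copper's own program counter */

static void copper_start_frame(const copper_op *list)
{
    pc = list;                            /* list pointer reloaded each frame */
}

/* Called as the (simulated) beam advances: execute MOVEs until we reach a
 * WAIT for a scanline the beam hasn't hit yet, or the end of the list.
 * No CPU appears anywhere in this loop. */
static void copper_step(int scanline)
{
    while (pc->op != OP_END) {
        if (pc->op == OP_WAIT && scanline < pc->a)
            return;                       /* stay parked on this WAIT */
        if (pc->op == OP_MOVE)
            chip_regs[pc->a] = pc->b;
        ++pc;
    }
}

int main(void)
{
    /* Change a (hypothetical) background-colour register mid-frame, twice. */
    const copper_op list[] = {
        { OP_WAIT, 100, 0 },
        { OP_MOVE, 0x10, 0x0F00 },
        { OP_WAIT, 200, 0 },
        { OP_MOVE, 0x10, 0x000F },
        { OP_END,  0, 0 },
    };

    copper_start_frame(list);
    for (int line = 0; line < 262; ++line)
        copper_step(line);

    printf("colour register ends the frame as 0x%04X\n", chip_regs[0x10]);
    return 0;
}
```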
 
TVs don't use scanlines any more, so how is the Copper relevant?
(as ethatron says)
The Copper was a *command list* processor. It just happened to be reset every video frame, hence the scanline effects. The Copper could be used to drive the blitter asynchronously.
Of course, GPUs all have this already.

If I were in Microsoft's shoes ... I'd look more for hot-spots like the z-buffer clear,
Heh, a straight clear is something the Blitter could do :)
Maybe they just mean they have the ability to transfer buffers around asynchronously (for better efficiency in clears, resolves, etc.).


Didn't the Nintendo 3DS have a custom GPU which was designed by looking at today's popular shaders and implementing them in hardware? I'd imagine an AA post-process unit doing just that: specific acceleration for MLAA, to encourage this bandwidth-saving technique.
I know Microsoft originally wanted 4xMSAA to be mandatory ("this is how it's wired up, this is how to use the EDRAM to best effect...")... so they were not averse to wiring up their hardware to suit a specific use case.
 
You'd replace a pointer if you could.
The framebuffer's a 2D bitmap, not a list of pointers to objects. The only way to get a bitmap in there is to write the image data, and if that image data is based on a preloaded graphic, that constitutes a copy.
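To spell that out, a minimal C sketch (sizes and pixel format invented): short of display hardware that fetches the sprite from its original address at scan-out time, every destination pixel has to be written.

```c
#include <stdint.h>
#include <string.h>

#define FB_W  320
#define FB_H  240
#define SPR_W 32
#define SPR_H 32

static uint32_t framebuffer[FB_W * FB_H];   /* the 2D bitmap being scanned out */
static uint32_t sprite[SPR_W * SPR_H];      /* preloaded graphic elsewhere in RAM */

/* "Putting the sprite in the framebuffer" means writing SPR_W*SPR_H pixels
 * into the destination bitmap; there is no pointer to swap. */
static void blit_sprite(int dst_x, int dst_y)
{
    for (int y = 0; y < SPR_H; ++y)
        memcpy(&framebuffer[(dst_y + y) * FB_W + dst_x],
               &sprite[y * SPR_W],
               SPR_W * sizeof(uint32_t));
}

int main(void)
{
    blit_sprite(100, 80);
    return 0;
}
```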

The Copper was an autonomous co-processor able to program another co-processor on the fly. On the Xbox 360 you can prepare execution "lists" and then let the GPU go over them, but you had to use the CPU for that, and the lists were in GPU code. A Copper equivalent would have its own dialect, dedicated to producing those lists itself and to initiating their dispatch ... with a completely dead CPU.
Yeah, it was, but I still don't see what it brings to the table for a new console. It was mostly used for syncing with the scanline. I don't see the value in poking GPU instructions from outside the CPU, especially when GPU work consists of long shader programs and cached data, and sudden changes aren't good for them. We need preemptive GPU architectures for that sort of thing.
 
What I liked about the Copper was that there wasn't any intermediate buffer; it would be like a GPU that calculates and renders each line on the fly and sends it immediately over HDMI without any frame buffer. The frame lag was basically 0 ms, which is impossible to do even today. You can feel it when comparing an Amiga emulator to the Real Thing on a CRT. There's something weirdly snappy about it which cannot be reproduced even on the latest PC, a million times faster.

I don't know if they could implement at least the post-processing effects in such an "in-line" way, having GPU cores that can do the 2D effects at "wire-speed" while sending the data through HDMI. We'd save one frame of lag.
 
I don't know enough about how HDMI works to know if partial data could be streamed, but most (all?) games are using full-screen effects anyway (blur/bloom) that need a complete framebuffer to work on. There's no obvious purpose to a direct video injection into the video out.
 
I vote 3 DSP/audio processor
dsp demo

Me too, nice videos btw, thanks for sharing.

I was pretty disappointed that the PS3 and the X360 didn't have something like a DSP. Prior to them, ALL the consoles had a dedicated chip just for sound!!

I always loved the capabilities of specialized chips for audio. As bkillian already pointed out, audio is a serious overhead for a CPU which could spend those cycles on more useful things.

All I am saying is that I think it is unnecessary to use the CPU for such important tasks when you can use dedicated hardware, thus relieving the CPU and decreasing processing times.
 
All I am saying is that I think it is unnecessary to use the CPU for such important tasks when you can use dedicated hardware, thus relieving the CPU and decreasing processing times.
Especially since CPUs aren't really getting any faster lately, just more efficient. Good audio requires heavy use of FP math, so it can basically tie up the entire SSE unit, especially if you're doing anything interesting. One car game maker wanted a hundred voices per vehicle, plus DSP effects, compression, 3D positioning, etc. It would basically have taken an entire 360 to pull off their wishlist for audio alone.
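As a rough illustration of where those cycles go, here's a bare-bones software mixer in C. The voice count, block size and gains are invented (not the car game's actual numbers), and it's doing nothing clever at all:

```c
#include <math.h>
#include <string.h>

#define VOICES    100      /* made-up count, echoing the per-vehicle wishlist above */
#define BLOCK     256      /* samples mixed per callback */
#define SAMPLE_HZ 48000

typedef struct {
    const float *data;     /* decoded source samples */
    int          length;
    int          pos;
    float        gain_l, gain_r;    /* from 3D positioning */
} voice;

/* Roughly 4 FP operations per voice per sample, i.e. on the order of
 * 100 * 256 * 4 = ~100k FP ops per 5.3 ms block at 48 kHz, before any
 * resampling, filtering, reverb, Doppler or compression is added. */
static void mix_block(voice *v, int nvoices, float *out_l, float *out_r)
{
    memset(out_l, 0, BLOCK * sizeof(float));
    memset(out_r, 0, BLOCK * sizeof(float));

    for (int i = 0; i < nvoices; ++i)
        for (int s = 0; s < BLOCK && v[i].pos < v[i].length; ++s, ++v[i].pos) {
            float sample = v[i].data[v[i].pos];
            out_l[s] += sample * v[i].gain_l;
            out_r[s] += sample * v[i].gain_r;
        }
}

int main(void)
{
    static float sine[SAMPLE_HZ];                 /* 1 second of a 440 Hz tone */
    for (int i = 0; i < SAMPLE_HZ; ++i)
        sine[i] = sinf(2.0f * 3.14159265f * 440.0f * i / SAMPLE_HZ);

    static voice voices[VOICES];
    for (int i = 0; i < VOICES; ++i)
        voices[i] = (voice){ sine, SAMPLE_HZ, 0, 0.5f, 0.5f };

    float out_l[BLOCK], out_r[BLOCK];
    for (int block = 0; block < SAMPLE_HZ / BLOCK; ++block)
        mix_block(voices, VOICES, out_l, out_r);
    return 0;
}
```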
 
And audio needs very good latency and synchronization (from my old memories), so it's not really easy with a multi-threaded engine, I'm thinking? So a specific chip for audio solves those problems.
 
Of course, GPUs all have this already.

Well, you have half of it. The Copper could wait on external events to initiate something or progress in its program. That's directly equivalent to the earlier mwait/monitor.
The problem is that GPUs can (apparently) not dispatch code to themselves. That's why we have the strange constructs of compute-shader chains passing results of earlier stages to later stages via "passthrough" compute shaders: not only because a shader can't be automatically started on an event, but also because the communication channel isn't changeable by the GPU itself (the GPU can't rewrite a shader to pass a variable in a buffer instead of a constant by itself).
I say apparently, because I've not yet seen/read anything which indicates a GPU can control itself, (re)write programs for itself, etc. Even though I could write a shader which writes out GPU ISA into a texture, I can't feed it as a program to the GPU from within that same shader.

A "Copper" could rewrite camera-matrices (manipulation of a shader's constant buffer) based on listening to a USB-port, without the CPU. I suspect a simple "Copper" would already be so capable that it probably could compile HLSL-assembler to GPU-ISA, or re-optimize GPU-ISA when a extern variable becomes a constant. This is just relative, today 100k transistors isn't very much, and cheap.
It doesn't really need to have caches or a real complex memory-controller, as we are talking about possibly a 500kB working set. It only needs the appropriate connectivity to i/o, to the GPU and the event-producers, maybe to the L2/L3 cache of the CPU.
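Purely as a thought experiment, here's that behaviour written down as C pseudo-hardware; every name in it (the tracker event, the dispatch call, the constant buffer) is hypothetical and doesn't model any real API:

```c
#include <stdbool.h>

/* A conceptual sketch of a tiny "Copper"-like agent: it waits on an event
 * source (here, a fake USB head tracker), rewrites a camera matrix in a
 * shader's constant buffer, and kicks a GPU dispatch. The host CPU never
 * appears in this loop. */

typedef struct { float m[16]; } mat4;

static mat4 constant_buffer_camera;              /* lives in GPU-visible memory */

static bool poll_tracker_event(mat4 *out_view)   /* stand-in for a USB event */
{
    static int n = 0;
    if (++n > 3)
        return false;
    for (int i = 0; i < 16; ++i)                 /* start from identity */
        out_view->m[i] = (i % 5 == 0) ? 1.0f : 0.0f;
    out_view->m[12] = 0.01f * n;                 /* pretend the head moved a little */
    return true;
}

static void gpu_dispatch_frame(void)             /* stand-in for "go render with
                                                    the current constant buffers" */
{
}

/* The "Copper" program: block on events, patch constants, dispatch. */
static void copper_main(void)
{
    mat4 view;
    while (poll_tracker_event(&view)) {
        constant_buffer_camera = view;           /* rewrite the shader's constants */
        gpu_dispatch_frame();                    /* initiate the work itself */
    }
}

int main(void)
{
    copper_main();
    return 0;
}
```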

Heh, a straight clear is something the Blitter could do :)
Maybe they just mean they have the ability to transfer buffers around asynchronously (for better efficiency in clears, resolves, etc.).

Well, no. The solution to this (it was solved in hardware long ago) was: not to clear at all. The z-buffer is often hierarchical, or at least has a minimum tile resolution, and each tile is represented by a bit in the GPU-internal z-buffer map (that map also holds the compressed z-buffer information). That bit indicates whether a memory region is cleared or not. The memory isn't even touched. :)
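Here's roughly that scheme as a C sketch; the tile size and layout are invented, and the point is just that the "clear" touches a few kilobytes of flags instead of megabytes of depth values:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define W       1280
#define H       720
#define TILE    8                       /* invented tile size */
#define TILES_X (W / TILE)
#define TILES_Y (H / TILE)

static float   zbuffer[W * H];
static uint8_t tile_cleared[TILES_X * TILES_Y];   /* 1 = "treat this tile as all far" */

/* "Clearing" the z-buffer touches ~14 KB of flags instead of ~3.5 MB of depth. */
static void fast_clear(void)
{
    memset(tile_cleared, 1, sizeof tile_cleared);
}

/* On the first write to a cleared tile, materialise the clear value, then proceed. */
static bool depth_test_and_write(int x, int y, float z)
{
    int t = (y / TILE) * TILES_X + (x / TILE);
    if (tile_cleared[t]) {
        for (int ty = 0; ty < TILE; ++ty)
            for (int tx = 0; tx < TILE; ++tx)
                zbuffer[((y / TILE) * TILE + ty) * W + (x / TILE) * TILE + tx] = 1.0f;
        tile_cleared[t] = 0;
    }
    if (z < zbuffer[y * W + x]) {
        zbuffer[y * W + x] = z;
        return true;
    }
    return false;
}

int main(void)
{
    fast_clear();
    depth_test_and_write(640, 360, 0.5f);
    return 0;
}
```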

The framebuffer's a 2D bitmap, not a list of pointers to objects. The only way to get a bitmap in there is to write the image data, and if that image data is based on a preloaded graphic, that constitutes a copy.

You agree that the fastest copy is: not to copy. Right? We also agree that in composing the Windows display surface we actually never copy (as in duplicate) anything, no fonts, no rectangles, no fills, etc., but that a source pixel (as a pointer) or an abstract description of a display element (most elements are procedural now) enters a transforming function and is then written, slightly changed, to the display surface. Correct?
That's the thing I wanted to remind people of. A data-duplicating blitter IMHO is really useless; we have no UIs anymore which consist only of identical repeated elements. A non-programmable blitter is also useless, because display composition is so complex now that you cannot gain anything by accelerating just a tiny fraction of the composition methods in use.
A special data-transforming programmable "blitter" is unnecessary if a GPU is present.
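A tiny C sketch of the distinction (the alpha blend is just an example transform): modern composition is "read source, push it through some function, write destination", and a fixed-function copy engine only covers the degenerate case where that function is the identity.

```c
#include <stdint.h>

/* Compose one source pixel onto the destination surface through a transform
 * (here, ordinary alpha blending as an example). A pure data-duplicating
 * blitter only handles the case where this function is the identity. */
static uint32_t compose(uint32_t src, uint32_t dst)
{
    uint32_t sa  = (src >> 24) & 0xFF;            /* source alpha */
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) { /* blend R, G, B channels */
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        out |= (((s * sa + d * (255 - sa)) / 255) & 0xFF) << shift;
    }
    return out | (0xFFu << 24);                   /* opaque result */
}

int main(void)
{
    uint32_t dst = 0xFF202020;                    /* dark grey surface pixel */
    dst = compose(0x80FF0000, dst);               /* half-transparent red element */
    return (int)(dst >> 24) == 0xFF ? 0 : 1;
}
```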
 
You agree that the fastest copy is: not to copy. Right? We also agree that in composing the Windows display surface we actually never copy (as in duplicate) anything, no fonts, no rectangles, no fills, etc., but that a source pixel (as a pointer) or an abstract description of a display element (most elements are procedural now) enters a transforming function and is then written, slightly changed, to the display surface. Correct?
I don't know. That's why I was asking. You still need to copy 2D bitmaps from RAM to the framebuffer in 2D games, but that's a job I've already discounted as being eminently doable on CPU and GPU. On a split RAM pool, too, you'll need to copy data from one to t'other. AFAIK there's no other general moving of memory around, but a couple of the devs here have suggested otherwise. Without an understanding of the low-level functions within a game engine, I'm pretty clueless on this one! :D
 