"Digital Media Professionals"

DudeMiester said:
http://www.saarcor.de/

That's what I'm waiting for. They actually have real hardware that really works! One day....

But I think much of this hardware implementation stuff is not a good idea. I think perhaps having a library of hardware implemented functions you can use within the context of a programmable pipeline could be a good idea. However, I don't think that's what these guys have in mind.

Nah. I highly doubt that will ever scale to the point where you can have a modern-looking game on it.
 
Uttar said:
- One MASSIVELY KICKASS anti-aliasing algorithm. It's years ahead of NVIDIA's and ATI's algorithms. I do have my doubts about the "performance does not decrease" bit though... It seems possible the performance hit would be minimal with such a technique, but it'd have to be highly parallel, and in that particular implementation you'd always have a very slight performance hit because of the video memory used, and the corresponding bandwidth. The only apparent problem with it is the lack of gamma correction.
Did you even look at the screenshots? That is one UGLY-ASS algorithm. Look at what happens when two similarly angled lines get close to intersecting. Look at what happens in areas composed of three colours. Low-angle performance is pretty crap as well. Imagine what would happen in more complicated scenes in motion, with pixels popping on and off or changing colour, and artifacts sliding along edges.

ATI did its homework on FSAA (with all due respect to NVIDIA, who introduced MSAA to the gamer). With a shift to longer shaders, there won't be a need for a speedier algorithm, and I seriously doubt there is enough market pressure to produce something better than ATI's 6x.
 
*grumble* it only has any effect on edges, so what is so "full" about it anyway *grumble* it is all multisampling, I tell ya, stop calling it something else *grumble*
 
MfA said:
Oh so now the cost isn't worth the gain in flexibility :)

Anyway, the shader need be no less parallelizable than the present quad-based approach ... it would be up to the developer.
Well, the moment you start allowing pixel shaders to access the results from neighboring pixels in anything resembling a general way, you break the parallelism of the code.

You could, obviously, just allow access to other samples within a quad, which is what DDX/DDY do, but since you can also use DDX/DDY to get pretty good estimates of those samples, I don't see a reason to.
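
To spell out what I mean by "pretty good estimates" (a toy Python sketch with made-up values, not real shader code): within a quad, DDX/DDY are just the finite differences between the quad's samples, so a first-order extrapolation already gives you roughly what a genuine neighbor read would.

Code:
def ddx(quad, y):
    # difference along x; it's the same for both pixels in row y of the quad
    return quad[y][1] - quad[y][0]

def ddy(quad, x):
    # difference along y; the same for both pixels in column x of the quad
    return quad[1][x] - quad[0][x]

def estimate_right_neighbor(quad, x, y):
    # first-order estimate of the value one pixel to the right
    return quad[y][x] + ddx(quad, y)

# a 2x2 quad of some shaded value, indexed quad[y][x]
quad = [[0.10, 0.14],
        [0.12, 0.16]]
print(estimate_right_neighbor(quad, 0, 0))   # ~0.14, i.e. quad[0][1]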
 
Chalnoth said:
Well, the moment you start allowing pixel shaders to access the results from neighboring pixels in anything resembling a general way, you break the parallelism of the code.
That depends on what the developer wants to do ... if rasterization is iterative along one axis (horizontal or vertical) and independent along the other, there is plenty of parallelism left on large tris. It would open up a huge amount of extra algorithms which could be implemented on a GPU (both for rendering, and for general purpose computing).

I'm proposing something which would make the hardware a lot more flexible (and no, it would not remove all potential for parallelism at the triangle level ... unless the developer wanted to do that). Yet you are arguing against it, and in favour of more hardwired circuitry ... so as I said before, it is all a trade-off.
 
Well, what I'm saying here is that if you want to do something like this in general, you'll destroy most of what makes a GPU a GPU.

The only sort of thing you could possibly hope for in this avenue is a set of limited, pre-set neighboring-pixel accesses, but since current hardware acts on quads, you can pretty much only hope to operate on one quad at a time. Regardless, I have rather strong doubts that you could find an algorithm that wants data from nearby pixels but doesn't just depend upon the partial derivatives of the data at the on-site pixel.

Note that there is always a workaround to this:
You could calculate within the current pixel what the value of a specific register in a neighboring pixel should be. On the whole, this would probably be more efficient, even though it wastes quite a bit of processing, since it doesn't destroy parallelism.
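
A toy illustration of that workaround (plain Python; the shade() function here is a made-up stand-in, not anything from a real API): the neighboring pixel's value is simply recomputed from that pixel's inputs instead of being read, so no pixel ever waits on another.

Code:
def shade(u, v):
    # stand-in for whatever the shader computes per pixel
    return 0.5 * u + 0.25 * v

def shade_with_neighbor(u, v, du):
    here = shade(u, v)
    # recompute what the pixel at (u + du, v) would have produced,
    # rather than waiting for that pixel to finish and reading its register
    right = shade(u + du, v)
    return here, right

print(shade_with_neighbor(0.25, 0.75, 1.0 / 640))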

And lastly I'd like to comment directly on your idea of limiting the destruction of parallelism by only, say, reading pixels in the x direction: this sort of limitation doesn't help unless the hardware developers know in advance that you would have wanted to do such a thing. In other words, you'd need this to be one of a library of accepted neighboring-pixel reads, but, as I said, since hardware works on quads, I just don't think it's feasible.
 
Chalnoth said:
Note that there is always a workaround to this: You could calculate within the current pixel what the value of a specific register in a neighboring pixel should be.
That is the same kind of workaround as emulating branching with predicated code: a poor excuse for one.

In other words, you'd need this to be one of a library of accepted neighboring pixel reads
TOP/BOTTOM and LEFT/RIGHT ... small library.

since hardware works on quads, well, I just don't think it's feasible.
My definition of a GPU isn't tied to quad-based rendering.

As I said though, it is all a trade-off ... if you think the only workable way for the GPU to operate is such a hardwired one, that is your prerogative.
 
MfA said:
That depends on what the developer wants to do ... if rasterization is iterative along one axis (horizontal or vertical) and independent along the other, there is plenty of parallelism left on large tris.
Have you tried to actually think through the coherency issues that would result from this kind of generalized neighbor access? If you have any back-and-forth communication between two neighboring pixels, you need to run them in lockstep (in fact the entire polygon, if every pixel communicates back and forth with its neighbors). And if you want to propagate a datum from one pixel to the next in a chain, you get serial dependencies that could easily cost tens or hundreds of cycles per chain element.
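
A rough sketch of the contrast (toy Python, made-up numbers): a pass where each pixel only needs its own inputs parallelizes trivially, while a pass where each pixel needs its left neighbor's result is a serial chain along the row.

Code:
row_in = [1.0, 2.0, 3.0, 4.0]

# independent: each pixel only needs its own inputs, any order works
independent = [x * x for x in row_in]

# chained: pixel i cannot start until pixel i-1 has produced its result
chained = []
carry = 0.0
for x in row_in:
    carry = 0.9 * carry + x   # depends on the previous pixel's output
    chained.append(carry)

print(independent)
print(chained)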
It would open up a huge amount of extra algorithms which could be implemented on a GPU (both for rendering, and for general purpose computing).
Any algorithms you have in mind that cannot be done efficiently today with Render-to-texture tricks?
 
I'm not suggesting using it for back-and-forth communication, just forth.

Dependent texturing already introduces dependency chains with that kind of latency. The tri obviously has to be larger if parallelism from columns or rows alone has to hide that latency in this case ... but it isn't an insurmountable problem.

As for applications, some texture synthesis methods can become a lot more efficient with forward differencing, as can tessellation. Image-based rendering techniques also tend to be iterative in nature; I've always loved the "voxel" approach to heightmap raycasting (it can be generalized to 6 DOF, and you can do displacement mapping as an iterative 1D search ... a little like offset mapping, but without the problems).
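
Roughly how I picture that iterative 1D search (a toy Python sketch, not shader code; the height profile and step sizes are invented): march the ray in fixed steps through the height field until it drops below the surface.

Code:
def raycast_heightfield(height, x0, h0, dx, dh, max_steps=256):
    """Step along a 1D height profile until the ray goes under it."""
    x, h = x0, h0
    for _ in range(max_steps):
        xi = int(x)
        if xi >= len(height):
            return None            # ray left the height field
        if h <= height[xi]:
            return xi              # hit: ray is at or below the surface
        x += dx
        h += dh
    return None

profile = [0.1, 0.2, 0.5, 0.9, 0.4, 0.2]
print(raycast_heightfield(profile, 0.0, 1.0, 0.5, -0.08))   # index of the hit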
 
MfA said:
Dependent texturing already introduces dependency chains with that kind of latency. The tri obviously has to be larger if parallelism from columns or rows alone has to hide that latency in this case ... but it isn't an insurmountable problem.

As for applications, some texture synthesis methods can become a lot more efficient with forward differencing, as can tessellation. Image-based rendering techniques also tend to be iterative in nature; I've always loved the "voxel" approach to heightmap raycasting (it can be generalized to 6 DOF, and you can do displacement mapping as an iterative 1D search ... a little like offset mapping, but without the problems).
Forward differencing for texture synthesis sounds like the kind of thing you could do with render-to-texture. If you would like to generate a running sum or product, or even run simple IIR filters, from one pixel to the next, you can do that with render-to-texture in O(log N) passes.
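
To sketch the O(log N) idea (toy Python standing in for the ping-ponged render-to-texture passes; the operator and data are arbitrary): each pass combines every pixel with the pixel 2^k to its left, so N pixels need about log2(N) full-screen passes.

Code:
def scan_passes(row, op=lambda a, b: a + b):
    n = len(row)
    cur = list(row)
    for k in range((n - 1).bit_length()):        # ceil(log2 n) passes
        offset = 1 << k
        # each "pass" reads the previous texture at a fixed offset
        cur = [op(cur[i - offset], cur[i]) if i >= offset else cur[i]
               for i in range(n)]
    return cur

print(scan_passes([1, 2, 3, 4, 5, 6, 7, 8]))     # running sums: 1, 3, 6, ..., 36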
 
MfA said:
Which is a shitload of passes.
About 8 to 11, but with a bit of trickery (Brent-Kung algorithm) the number of pixels you actually need to render need not be more than ~2-3 times the number of pixels in the final image. I suspect that a GPU with serial chain dependencies between pixels in a row will have major trouble doing any better.

PS. running sum okay, but arbitrary IIR filters?
Probably not arbitrary, but at least a simple first-order high-pass or low-pass filter should be doable.
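
For the first-order case the point (sketched below in toy Python, with invented numbers) is that the per-pixel update y -> a*y + x is an affine map, and affine maps compose associatively, so the same log-pass scan machinery applies to the filter as to a running sum.

Code:
def iir_as_scan(xs, a):
    # each input becomes the affine map y -> a*y + x, represented as (a, x)
    maps = [(a, x) for x in xs]
    out, acc = [], (1.0, 0.0)          # identity map y -> 1*y + 0
    for (a2, b2) in maps:
        a1, b1 = acc
        acc = (a1 * a2, a2 * b1 + b2)  # associative composition of affine maps
        out.append(acc[1])             # filter output so far (y starts at 0)
    return out

def iir_direct(xs, a):
    y, out = 0.0, []
    for x in xs:
        y = a * y + x                  # classic first-order recurrence
        out.append(y)
    return out

xs = [1.0, 0.0, 0.0, 2.0]
print(iir_as_scan(xs, 0.5))            # matches iir_direct(xs, 0.5)
print(iir_direct(xs, 0.5))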
 
MfA said:
Im not suggesting using it for back and forth communication, just forth.
Except the moment you introduce communication one way, you get it the other way automatically, unless you want to restrict the hardware to just one such read.

Anyway, quad-based approaches work fine with a GPU because they don't break the inherent parallelism: they just expand it from one pixel to a 2x2 pixel block.

Dependent texturing already introduces dependency chains with that kind of latency. The tri obviously has to be larger if parallelism from columns or rows alone has to hide that latency in this case ... but it isn't an insurmountable problem.
No, because it's still within a single pixel. Once you start adding dependencies between pixels, you'll have to either render pixels in a specific order, or render some of the pixels in a scene before rendering the rest, etc. It's a huge mess, and a bad idea for hardware implementation on a GPU.

As for applications, some texture synthesis methods can become a lot more efficient with forward differencing, as can tessellation. Image-based rendering techniques also tend to be iterative in nature; I've always loved the "voxel" approach to heightmap raycasting (it can be generalized to 6 DOF, and you can do displacement mapping as an iterative 1D search ... a little like offset mapping, but without the problems).
Well, since we'd most likely be talking about an entirely different functional unit for tessellation anyway, I see no problem in that regard. The same goes for any sort of raycasting/raytracing technique that is potentially implemented in future hardware. As for the other things, well, you'd just have to settle for render-to-texture methods.
 
Chalnoth said:
Well, since we'd most likely be talking about an entirely different functional unit for tessellation anyway, I see no problem in that regard. The same goes for any sort of raycasting/raytracing technique that is potentially implemented in future hardware.
So now you are suggesting separate specialized circuitry :p Which just goes to show, it is a trade-off.
 
Well, what I'm suggesting is that when a large number of different algorithms share similar characteristics, they should be handled by the same programmable unit. This was my problem with the presentation: it mentioned hardware support for a number of things that can be done quite efficiently through shaders in today's hardware.

It's not a trade-off so much as an efficiency question. The more specialized hardware becomes, the faster it is, but the more of it sits unused. If you make things programmable, you can keep more units busy at once, but you may have problems with performance. So you do your research and make your processor as general as possible without losing significant performance, then add specialized hardware for those algorithms that matter to your target audience but just aren't amenable to a programmable implementation on your hardware (or, even better, add a different programmable unit whose performance characteristics suit the sort of job your current programmable unit struggles with).
 
I'll link these again since they do contain an architecture which allows serial dependencies between loop iterations (i.e. pixels on a GPU).

http://www.cag.lcs.mit.edu/scale/papers/vta-isca2004-slides.pdf
http://www.cag.lcs.mit.edu/scale/papers/vta-isca2004.pdf

For general-purpose algorithms, you could use the control (vector) processor to take over rasterization and dispatch "pixels" in an order designed to minimize stalls caused by missing data. The architecture provides simple inter-thread communication explicitly (not through memory-based synchronization) and schedules it dynamically.
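
Very loosely, the flavor of it (a toy Python sketch using threads and queues; this is my own illustration of cross-iteration hand-off, not the SCALE ISA): each "virtual processor" handles one element and passes a single value explicitly to the next one, rather than synchronizing through memory.

Code:
from queue import Queue
from threading import Thread

def vp(index, x, recv, send):
    carry = recv.get() if recv else 0.0    # explicit cross-VP data transfer
    y = 0.9 * carry + x                    # loop-carried work per "pixel"
    if send:
        send.put(y)                        # hand the value to the next VP
    results[index] = y

xs = [1.0, 2.0, 3.0, 4.0]
results = [0.0] * len(xs)
links = [Queue(maxsize=1) for _ in range(len(xs) - 1)]
threads = [Thread(target=vp,
                  args=(i, x,
                        links[i - 1] if i > 0 else None,
                        links[i] if i < len(xs) - 1 else None))
           for i, x in enumerate(xs)]
for t in threads: t.start()
for t in threads: t.join()
print(results)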
 
Well, I don't have the time to go through the paper right now (class and stuff), but from a cursory glance, this looks like sort of a more efficient way of doing multipass rendering than really allowing dependencies in the shaders themselves. Might be useful, but it'd get a lot more interesting if one could give a concrete example of an algorithm that would run well on such an architecture, but very poorly on a current one.
 