Hardware MSAA

This would be an example of where the most generic solution, a 1-D memory space, does penalize a usage case where the more natural mapping would be 2-D.
 

:D

I think you hit a nail on the head, a very specific nail in a broad wooden construction. But it's such a big one that it has been the topic of countless research efforts, all of them (the serious ones) with excellent results. My take, and it is very likely with graphics converging with compute, is that hardware (physical memory mapping) solutions will be adapted to 2D mappings in the future.
 
Supposedly, according to what citable source?
64b per cycle fits within the released data for LRB1 if all loads hit the same cache line. That number does not seem feasible if the scatter/gather crosses cache line boundaries.
http://forum.beyond3d.com/showthread.php?p=1533681#post1533681 There's no hard data but LRB1 likely always required multiple cycles to complete a gather operation, while it is suggested that LRB3 can do it in one cycle in the best case.
Architectures with fast caches and multiple ports already exist. Non-specialized architectures with fast caches and scatter/gather do not.
Current many-port caches use multi-banking, which becomes slow when you have bank conflicts and requires a lot of duplicate addressing logic. Making use of wider cache lines and gathering multiple elements in parallel can drastically reduce the number of ports you need.
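To make that cost model concrete, here's a minimal sketch (my own illustration, not taken from any vendor documentation): if the cache delivers one line per cycle and can extract every lane that falls within that line in parallel, a 16-wide gather costs as many cycles as the number of distinct cache lines it touches.

[code]
#include <cstdint>
#include <cstdio>
#include <set>

// Hypothetical cost model: if the cache delivers one line per cycle and can
// extract every lane that falls within that line in parallel, a 16-wide
// gather costs as many cycles as the number of distinct lines it touches.
int gatherLineAccesses(const uint64_t addr[16], uint64_t lineBytes = 64) {
    std::set<uint64_t> lines;
    for (int lane = 0; lane < 16; ++lane)
        lines.insert(addr[lane] / lineBytes);   // which cache line this lane hits
    return static_cast<int>(lines.size());      // best case 1, worst case 16
}

int main() {
    uint64_t coherent[16], scattered[16];
    for (int i = 0; i < 16; ++i) {
        coherent[i]  = 0x1000 + 4 * i;          // adjacent 4-byte elements: 1 line
        scattered[i] = 0x1000 + 256 * i;        // one line per lane: 16 lines
    }
    std::printf("coherent: %d line(s), scattered: %d line(s)\n",
                gatherLineAccesses(coherent), gatherLineAccesses(scattered));
}
[/code]

With spatially coherent texel addresses the whole gather collapses to one or a few line accesses; only a pathologically scattered access pattern degrades to one access per lane.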

And just because such gather/scatter hardware is not available to the public yet doesn't make it an argument against it. Innovation would be at a standstill if we only looked at existing solutions.
Can you clarify this? If there were already spatial coherence, why the need to gather/scatter?
Just use wide loads and shuffle.
Texel lookups are typically very localized (with lots of reuse between adjacent pixels). But gathering the texels can't be implemented with just wide loads and shuffles.
It isn't the case that any vector instruction is quite as generic as a scalar one, nor is a gather/scatter as generic as a load/store. They can potentially draw close, but there are corner cases that mean the scalar and vector ops cannot be fully interchangeable, and scalar ops are the baseline for being generic (within the defined data types and operand widths they support).
Gather/scatter is no less generic than any other vector operation. Yes, there are worst-case scenarios where you need a cycle per scalar, but that's also true for arithmetic operations. But just like arithmetic vector operations are very valuable because in the typical case they provide a major parallel speedup, so does gather/scatter help a great deal, with a low hardware cost. So there's no need to point out the corner cases.
I'm not certain why we should waste silicon on a gather/scatter. It is readily implemented by generic loads and stores, and imagine the higher silicon utilization. Much better than having the locality checks and multiplexing hardware that would just sit there idle with scalar loads.
Generic load/store caches as implemented in Fermi and Cayman both only offer half the bandwidth compared to ALU throughput. And because they're multi-banked they're not without area compromises either. And considering that they're shared by hundreds of threads, they're really tiny and must have a poor hit ratio.

And yes, while the gather/scatter-specific logic remains idle with regular wide loads, note that current GPUs have lots of different specialized caches which are frequently either a bottleneck or idle. As the workloads diversify, unifying them into one generic cache starts to make sense. By combining several techniques it can offer high performance for a wide range of uses, while not necessarily taking lots of area. Plus you get a higher total amount of storage for the data you care most about, it simplifies the hardware and software design, and it enables further unification (texture units, ROP units, etc.).
Settings (res,AF,AA,etc)? FPS, in-game, benchmark, etc?
Tell me the settings you want me to test (just one combination please).
 
Texture accesses have good spatial coherence, but they will not generally fall on the same cache-line.
What makes you think that? Mipmapped texture accesses have a very high spatial locality because on average only one extra texel is accessed per pixel. The sampling footprints of adjacent pixels largely overlap.
A straightforward large-triangle rasterizer using scanlines, for instance, will access the textures on a pretty much random angled axis in UV space. Add tiled storage and the chances of all work items hitting the same cache line rapidly go to 0 without some rather complex sorting of work items by texture accesses.
Tiling improves locality and saves bandwidth. Take for example a 4x4 pixel tile. Without texture tiling, you need to access 5-7 cache lines. With 4x4 texel tiles, the pixel tile typically hits 4 texel tiles. Also with texture tiling the next pixel tile will typically reuse half of the texture tiles, while without tiling it depends on the orientation whether or not any of the cached tiles will get reused before being evicted.

Combining true multi-porting and gather/scatter seems close to ideal to me to achieve high performance at a reasonable hardware cost.
PS: just for reference, would it be possible to determine what percentage of texture accesses stall during normal operation? You obviously do a lot more work in between texture accesses, which distorts the picture a bit.
VTune can count cache misses, but because there are many dynamically generated functions it's not really feasible to isolate the ones related to texturing. It might be possible to use serializing instructions and count the actual number of clock cycles per access but then we'd still have to isolate the effect of out-of-order execution, prefetching and SMT. Frankly that's a lot of work and I seriously doubt it will change the overall conclusion.

I'm confident that even the simplest out-of-order execution implementation would allow GPUs to reduce the number of threads needed to achieve good utilization. It also reduces the required register set size, which in turn further reduces latency. Of course the average memory access latency also has to be brought down, but with less threads the thrashing decreases and the smaller register set frees up space for a larger cache, and simple prefetching can work miracles.

Mark my words, in the long run there is no other option. The number of ALUs keeps increasing exponentially, but soon no one will be able to develop a practical consumer application which can keep them all busy if the latencies remain this high. NVIDIA has already resorted to parallel kernel execution and superscalar execution, but that won't suffice for long. Eventually it will have to evolve into true SMT and out-of-order execution.
 
This would be an example of where the most generic solution, a 1-D memory space, does penalize a usage case where the more natural mapping would be 2-D.
Tiling merely requires swapping some of the address bits around. You could just have regular and tiled load/store instructions. No need to change the actual cache.
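As a rough sketch of the "swap some address bits around" idea (my own illustration, not any particular GPU's layout): for 4x4 tiles of 4-byte texels, the tiled address is just a different arrangement of the same x/y bits that feed the linear address.

[code]
#include <cstdint>

// Hypothetical sketch: byte address of a 32-bit texel at (x, y) in a texture
// of width 'width' (assumed a multiple of 4), linear vs. 4x4-tiled layout.
uint32_t linearAddress(uint32_t x, uint32_t y, uint32_t width) {
    return (y * width + x) * 4;
}

uint32_t tiledAddress(uint32_t x, uint32_t y, uint32_t width) {
    uint32_t tileX  = x >> 2, tileY = y >> 2;      // which 4x4 tile
    uint32_t inTile = ((y & 3) << 2) | (x & 3);    // texel offset inside the tile
    uint32_t tilesPerRow = width >> 2;
    return ((tileY * tilesPerRow + tileX) * 16 + inTile) * 4;
}
[/code]

For power-of-two dimensions the two computations differ only in which address bits end up where, so a separate tiled load/store instruction could reuse the exact same cache and data path and change nothing but the address calculation.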
 
VTune can count cache misses, but because there are many dynamically generated functions it's not really feasible to isolate the ones related to texturing.
If you can compile your code under Linux, you could try "kcachegrind" which might give you some useful (albeit approximate/simulated) figures.
 
Multi-core,
Good luck selling >4 cores to 90% of the consumer population. The data center market, for some reason, wants LOTS of tiny x86/ARM cores per die. I wonder if utilization is the reason ... :)
No predication, no scatter/gather. IOW, useful only for tiny niches.

You can already dual-issue mul and add. FMA doesn't increase compute density by itself. You'd need to roughly double the area of the SIMD unit for that.

and gather/scatter are making the CPU significantly more powerful for high throughput tasks than their ancestors.
Where are these CPUs with gather/scatter?

GPUs on the other hand struggle to become more efficient at tasks other than graphics, because they hang on to a significant amount of fixed-function hardware, and hiding all latency through threading. This is why things like GPU physics have so far only had mediocre success at best.

There are many problems with GPUs today, but excessive ff hw is not one of them. What they need is a more flexible memory hierarchy, the ability to submit jobs, and integration with the CPU's MMU. Ditching ff hw for the sake of some ideological goal, even when it makes sense, is pointless.
 
http://forum.beyond3d.com/showthread.php?p=1533681#post1533681 There's no hard data but LRB1 likely always required multiple cycles to complete a gather operation, while it is suggested that LRB3 can do it in one cycle in the best case.
There is no hard data for that claim, true.
My understanding from earlier Larrabee presentations was that it could sustain generally 1 L1 cache access per cycle when gathering.

Current many-port caches use multi-banking, which becomes slow when you have bank conflicts and requires a lot of duplicate addressing logic.
It does slow down in conflict cases. I am not clear on the latter claim. Both banking (pseudo-dual porting in the case of an AMD CPU) and multiporting involve performing addressing on more than one access per cycle. What is all the additional logic?

And just because such gather/scatter hardware is not available to the public yet doesn't make it an argument against it. Innovation would be at a standstill if we only looked at existing solutions.
There must be reasons why it is not available. Innovation is usually made as people work around problems and constraints. It is difficult to predict innovations by ignoring those constraints.

Texel lookups are typically very localized (with lots of reuse between adjacent pixels). But gathering the texels can't be implemented with just wide loads and shuffles.
There is locality in the problem space and locality with respect to the linear address space. If there is locality within the address space, why can't a few wide loads and shuffling the values around suffice?
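For the case where all eight indices do land in one cache line, that works fine; here's a minimal AVX2 sketch of the idea (using today's intrinsics purely as an illustration). The catch is that as soon as the indices straddle several lines, you need one load/permute pair per line touched plus blending of the results, which is exactly the bookkeeping a hardware gather would hide.

[code]
#include <immintrin.h>

// Hypothetical: emulate an 8-wide gather whose indices all fall within one
// 32-byte cache line, using a single wide load plus a lane permute (AVX2).
__m256 gatherWithinOneLine(const float* lineBase, __m256i indices) {
    __m256 line = _mm256_loadu_ps(lineBase);          // one wide load covering the line
    return _mm256_permutevar8x32_ps(line, indices);   // route each lane to its index
}
[/code]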

Gather/scatter is no less generic than any other vector operation.
That is consistent with what I said. I went on to say that vector operations are not as generic as scalar ones.

But just like arithmetic vector operations are very valuable because in the typical case they provide a major parallel speedup,
We're on a utilization kick here; why are we now pulling peak performance (at the expense of utilization) into the argument?

so does gather/scatter help a great deal, with a low hardware cost. So there's no need to point out the corner cases.
This has been asserted, not substantiated.

Generic load/store caches as implemented in Fermi and Cayman both only offer half the bandwidth compared to ALU throughput. And because they're multi-banked they're not without area compromises either.
My statement was a shot at implementing a specialized load on a generic architecture, not a specialized design.
We can get much better utilization of generic hardware with a stream of scalar loads.

And yes, while the gather/scatter-specific logic remains idle with regular wide loads, note that current GPUs have lots of different specialized caches which are frequently either a bottleneck or idle.
When exactly is idle hardware good or bad?

Tell me the settings you want me to test (just one combination please).
I was curious about what settings you used to arrive at your numbers.

I think a run at whatever is closest to 1080p in the benchmark at High settings could be a good test; that should have a decent level of AF and whatever AA is used. Extreme may have the full feature set, but I haven't kept up on what it enables.
I haven't kept up with the discussion on hacking the ini to activate the various modes to see what else can be done.

I'd prefer actual gameplay, but that throws comparability out the window.

Combining true multi-porting and gather/scatter seems close to ideal to me to achieve high performance at a reasonable hardware cost.
True multi-porting is more expensive than banking, since it increases the size of the storage cells and adds word lines.

I'm confident that even the simplest out-of-order execution implementation would allow GPUs to reduce the number of threads needed to achieve good utilization.
In the absence of consideration for power, area, and overall performance within those constraints, probably true.

Tiling merely requires swapping some of the address bits around. You could just have regular and tiled load/store instructions. No need to change the actual cache.

How would this be implemented? It sounds like it would need some kind of stateful load-store unit to know the proper mapping over varying formats and tiling schemes.
 
The TEX:ALU ratio is still going down. At the same time, more generic memory accesses are needed. So instead of having multiple specialized caches with varying utilization, you could have one cache hierarchy which doesn't penalize any usage.
Fermi is already there.

And yet 2-way SMT works like a charm for SwiftShader. Combined with out-of-order execution and prefetching, memory access latency is not an issue.
That is partly because on a cpu you do a lot more alu operations than you would with some ff hw (filtering, texture address generation and the like), so it's easier to keep an otherwise alu-poor (relatively speaking, of course) architecture fed.
The number of threads is becoming a significant problem for GPUs. A lot of GPGPU applications achieve only a fraction of the core utilization because they just don't have enough data to process in parallel. And it's not getting any better as the core counts increase! Also, with ever longer code the context for each thread requires lots of on-die storage.
True, but for that a more flexible memory hierarchy is much more important as memory latency is 30x more than alu latency and rising.

The only solution is reducing latencies. And because it allows a smaller register set it doesn't have to cost a lot, or even anything at all. Plus it makes the architecture capable of running a wider variety of workloads.
Will a smaller alu latency vanquish the need for hiding mem latency?
 
I'm not aware how Fermi does its screen tiling. Would you be so kind as to update me on this technology? I'm quite interested to know how this would allow to forget about triangle sizes.
The Fermi architecture analysis on B3D explained how Fermi does it.

As long as you can rasterize triangles faster than you can shade them, it doesn't matter what the size of the original triangle was.
 
Keep in mind that for every hardware designer, there are probably hundreds of software developers. So if you dedicate transistors to feature A, and applications implement feature X, Y and Z in software, your GPU will either be slower than the more programmable chip of the competition, or it will have a higher power consumption to reach the same performance.
Where are these "more programmable chips of the competition...."?

That's for a micropolygon rasterizer only. What I said is that "a dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large". And as I also noted before, their area and power consumption numbers are for the isolated rasterizer. An actual integrated rasterizer unit with all of its supporting circuitry will be larger and hotter.
A full custom rasterizer which can do large and small triangles with high throughput will take an area much closer to 4 mm² than to the ~20 mm² of Fermi's cores. Besides, large triangles are becoming less and less common, so you can probably cut corners there without having to worry too much.

Their work also doesn't include any hardware to make the wide pixel pipeline process fragments from multiple micropolygons.
Meh, a 32- or 64-deep FIFO should be enough. I doubt it will take even 0.1 mm² of area.
So I sincerely doubt that it would take only 4 mm² to make GPUs capable of efficiently processing micropolygons.
Performance/Watt assumes 100% utilization. That makes all fixed-function hardware look great, even stuff that's practically useless!
No, overall perf/W factors in whatever utilization actually took place. Besides, if the ff hw is cleanly isolated, it is easier to power down than a bit of hw deeply embedded inside the data path.
Not necessarily. Effective total performance depends on utilization. If you have long shaders, the rasterizer will be idle most of the time.
Let it be; it's hardly something to bother about when the rest of the 250 mm² worth of silicon is right next door.
So you're wasting space, which could have been used for more shader units instead. To achieve the same effective performance you'd have to overclock the available cores (frequency + voltage), which consumes more power than just having more shader units available.
How many shader units does 4 mm² buy these days?

At a certain point it simply complicates things more to fit dedicated hardware into your architecture than to just do it in software.
In 10 years, sure, why not?
 
Take for example a 4x4 pixel tile. Without texture tiling, you need to access 5-7 cache lines. With 4x4 texel tiles, the pixel tile typically hits 4 texel tiles.
Which, if every one of those tiles is a cacheline, means about 75% of redundant data read.

Also small triangles are the future, so you'd need to defer the texturing a bit to actually get 4x4 tiles.
 
That's hardly a relevant question. You have all of the programmable cores available for any task, exactly when you need them and for as long as you need them. So the area, power consumption and performance of dedicated versus programmable rasterization is a complex function of the utilization and other factors. Simply comparing the area for dedicated hardware versus programmable cores tells you nearly nothing.
What you and a lot of people don't understand is that the majority of space taken up by shader units is data flow. NVidia has also talked about using lots of distributed cache to reduce power consumption, because data flow is the big problem there, too.

Programmable shader units need a lot of flexibility in moving data around, but certain fixed function tasks do not. This is why you will never see fixed function texture filtering go away in a GPU. The cost of just getting all the data from the texture cache to the shader units at the rate needed to maintain speed is more than that of the logic eliminated. It just makes sense to decompress and filter eight RGBA values and send one to the shader.
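To put a number on that data-flow argument, here's a minimal sketch of plain trilinear filtering (generic math, not any specific GPU's filter unit): eight texels go in and one color comes out, so doing this next to the texture cache means only an eighth of the raw texel data ever has to travel to the shader core.

[code]
struct RGBA { float r, g, b, a; };

static RGBA lerp(const RGBA& a, const RGBA& b, float t) {
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
             a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
}

// Plain trilinear filter: bilinear blend within each of two mip levels, then
// blend between the levels. Eight texels in, one filtered value out.
RGBA trilinear(const RGBA mip0[4], const RGBA mip1[4],
               float fu, float fv, float fmip) {
    RGBA lo = lerp(lerp(mip0[0], mip0[1], fu), lerp(mip0[2], mip0[3], fu), fv);
    RGBA hi = lerp(lerp(mip1[0], mip1[1], fu), lerp(mip1[2], mip1[3], fu), fv);
    return lerp(lo, hi, fmip);
}
[/code]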

Here's a presentation which suggests that efficient software rasterization is within reach: https://attila.ac.upc.edu/wiki/images/9/95/CGI10_microtriangles_presentation.pdf
Again it's about data. Look at steps 1 and 3 on page 12. How do you parallelize step 1 in shaders? It's easy to pack bounding boxes into your wavefront one per clock, but not beyond that unless you want to make your wavefront an incoherent mess. What about early z-reject? You need access to the HiZ cache and then have to do Z decompression. Step 3 is not trivial in shaders either. You either have to keep those parts fixed function or immensely complicate your shader units.
But in the long term fixed-function just isn't interesting. Features like bump mapping, which once wowed us all and demanded a hardware upgrade, are now ridiculously simple to 'emulate' in the shaders, and actually we now want high quality tessellation instead. There's no doubt that one day dedicating any hardware to tessellation will be as ridiculous as fixed-function bump mapping, alpha testing, fog tables, T&L, etc.
Bump mapping is not a good example, as pixel shaders evolved from DOT3 and EMBM (i.e. PS is just bump mapping extended). Same thing for T&L; in fact, ATI's first T&L capable chip - R100 - had a vertex shader just short of DX8 specs to implement all the fixed-function DX7 vertex stuff. There's nothing ridiculous about alpha testing, as it's still fixed function. You just think that it's "general" because they gave it an instruction name.

That's for a micropolygon rasterizer only. What I said is that "a dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large".
First of all, any algorithm that puts the burden of motion and defocus blur on the rasterizer is very slow compared to other techniques with nearly as good results. Secondly, nobody is planning to implement those features, and rpg.314 wasn't suggesting it either.

So I sincerely doubt that it would take only 4 mm² to make GPUs capable of efficiently processing micropolygons.
The data routing and pixel ordering challenges in a real GPU are even harder to address with ALU rasterization, so you're not helping your case.

Performance/Watt assumes 100% utilization. That makes all fixed-function hardware look great, even stuff that's practically useless!
Not really. What matters is idle consumption. If you can stop your FF unit from using power when not in use (and AMD/NVidia are pretty good at that today), then FF will always clobber general purpose.

Not necessarily. Effective total performance depends on utilization. If you have long shaders, the rasterizer will be idle most of the time. So you're wasting space, which could have been used for more shader units instead.
There's no need to state the obvious. What you're glossing over is that once you put reasonable estimates of figures into the argument, we're not even close to seeing a win from eliminating rasterizers.
 
Could you point me to a few of those? Thanks.

There has been a lot of research in linear algebra algorithms with regards to special orders for matrix-matrix or matrix-vector operations. The idea is to map the 2D structures of matrices to the linear structure found in caches.

Do a search on Google for "peano order" or "morton order", or in general "space-filling curves".
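For reference, a minimal sketch of the Morton (Z-order) mapping those papers build on (my own illustration): interleaving the bits of the row and column index makes elements that are close in 2D also land close together in the 1D address space, which is what makes the blocked matrix kernels cache-friendly.

[code]
#include <cstdint>

// Spread the lower 16 bits of v so they occupy the even bit positions.
static uint32_t spreadBits(uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton (Z-order) index of matrix element (row, col): bits interleaved.
uint32_t mortonIndex(uint32_t row, uint32_t col) {
    return (spreadBits(row) << 1) | spreadBits(col);
}
[/code]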
 
It all depends on the utilization. Clearly, when a dedicated component is never used, achieving the same performance with the rest of the chip costs more power than if the entire chip had been available for the task. Note that reaching higher clock frequencies also requires increasing voltage, so the power consumption increase can be substantial. So dedicated hardware has to reach a certain utilization level before it becomes worthwhile. More software diversity means the average utilization of non-generic components is decreasing, though.
If you look at the area associated with a rasteriser, you couldn't put sufficient ALUs down in that area to give equivalent performance, so you end up with ALUs that are bigger consumers of power being active for longer; the net result is higher power consumption.

I've actually been round this loop recently with fixed function interpolation, and although the overall area and even utilisation of a programmable solution looks better, the extra power consumption wasn't even close to being an acceptable compromise.

Also note that making things programmable enables clever software optimizations. Today's applications take a lot of detours just to fit the graphics pipeline. It's utterly ridiculous that on modern GPUs you often get no speedup from using LOD techniques. I'd rather use that headroom for something more useful, and I'm not alone.

Even simply using dynamic code generation with specialization can avoid a lot of useless work. "The fastest instruction is the one that never gets executed."
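As a trivial compile-time analogue of that kind of specialization (my own sketch, not SwiftShader's actual code generator, which emits code at run time): when the pipeline state says alpha test is disabled, the specialized routine simply contains no test at all.

[code]
// Hypothetical pixel routine specialized on whether alpha test is enabled;
// a run-time code generator would emit the same two variants dynamically.
template<bool AlphaTestEnabled>
bool shadePixel(float alpha, float alphaRef) {
    if constexpr (AlphaTestEnabled) {   // C++17: the test vanishes entirely when false
        if (alpha < alphaRef)
            return false;               // pixel rejected by the alpha test
    }
    // ... remaining per-pixel work would go here ...
    return true;
}
[/code]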

I disagree on the degree of "detours" that are really needed. Further, just about all modern programmable GPU architectures do not handle arbitrary code well; this is a byproduct of making them small enough to put large numbers of units down in order to give you the huge amount of horsepower that you are seeing today. To do what you appear to want will require a significant increase in complexity, which means higher power and less performance. We only have to look to Intel for examples of this.

Despite both being programmable, vertex and pixel processing were quite different (at the time of the unification). Vertices need high precision, they need streaming memory access, and they use lots of matrix transforms. Pixels can generally use lower precision, need lots of texture accesses, use exp/log for gamma correction, use interpolator data, etc.
That depends on how you design your pipeline. For example, the SGX unified shader handles narrower data types at a higher rate within the same data path, which means there is no power hit for processing pixel data down the same path as vertex data. The streaming of vertex data into or out of the shader pipeline is not unified and is handled by a separate (tiny) piece of hardware, as I believe it is in all other GPU designs; the same is true for texture units, i.e. they are not part of the unified design, for good reason.

Also from a power consumption point of view, unifying them certainly wasn't logical. The GeForce 7900 and GeForce 8800 had competitive performance for contemporary games, but the latter had a significantly higher power consumption.

Despite these things, they did unify vertex and pixel processing, and they never looked back. So the real reason for the unification was to enable new capabilities and new workloads. And nowadays developers are still pushing the hardware to the limits, by using it for things it was hardly designed for. So while it might be hard to imagine what exactly they'll do with it, more programmability is always welcomed. It's an endless cycle between hardware designers saying that dedicated hardware is more power efficient, and software developers saying they care more about the flexibility to allow more creativity.

The power consumption increase had nothing to do with unification; it was the move from vector to scalar processing units and the move from DX9 to DX10 precision & functionality, resulting in less efficiency in terms of flops/mm² and more power for similar performance.

The important benefit NVidia (and AMD before them) achieved when they moved to unified shaders was an increase in both pixels and vertices processed per flop; this is the key benefit of unification. Applicability to more generalised workloads is secondary, irrespective of what NVidia marketing says!

Performance scales at a dazzling rate anyhow. So you might as well make sure that it can be used for something more interesting than what was done with the last generation of hardware. Obviously fixed-function hardware can't be replaced with programmable hardware overnight, but surely by the end of this decade hundreds of TFLOPS will be cheap and power efficient and there will be no point making it more expensive with fixed-function hardware, or making it even cheaper by crippling certain uses.
For performance scaling to continue it requires continued scaling of power consumption, and even desktops are now approaching a practical power cap (at least in terms of the true mass market) which will restrict that scaling. In mobile space that power cap is around three orders of magnitude lower (1-2 W TDP vs ~1 kW TDP); there is simply never going to be the power budget to make everything programmable in mobile space.

I disagree. Rasterization is evolving, and even the very idea of polygon rasterization is crumbling. There's a lot of research on micropolygons, ray-tracing, volumetric rendering, etc., but IHVs better think twice before they dedicate considerable die space on it. Some games may love the ability to have fine-grained displacement mapping and advanced blur effects, others prefer accurate interreflections and advanced refraction effects, and others don't care about these things at all (and I'm not necessarily talking about GPGPU / HPC applications).

Rasterisation isn't evolving, but alternative rendering techniques are continually being developed. I think there is space for a paradigm shift, but I don't think you need hardware that caters for everything, as many techniques have little or no practical value outside of research. From my perspective the next step is the ability to efficiently cast rays within a scene; to do this well in terms of both power efficiency and manageable memory BW is going to require no small amount of fixed function HW. However, going forward I believe that HW needs to be accessible without the restrictions of the current GPU pipeline, and I think rasterisation can be refactored in the same way.

You see, every application has some task that would be executed more efficiently with dedicated hardware, but you simply can't cater for all of them. It's overall more interesting to have a fully programmable chip, than something which prefers certain usage.

Every application doesn't have to do things in completely different ways, in fact everyone going off and constantly re-inventing the wheel is of little benefit to anyone.

Ultimately, practicalities dictate what you get, not what is interesting.

When John Carmack first proposed floating-point pixel processing, certain people (including from this very forum) practically called him insane. But these people (understandably) never envisioned that by now we'd have Shader Model 5.0 and OpenCL, and it's still evolving toward allowing longer and more complex code. So with all due respect, I think it would be really shortsighted to think that ten years from now dedicated hardware will be as important as it is today.

10 years is a long time, but I would be prepared to bet you a few quid that fixed function will still be around for things like video encode/decode, and that graphics hardware will have evolved but will still include significant fixed function units.

The existence of cards with very high power consumption doesn't mean those are the norm. It also doesn't mean the programmability is excessive. Nobody cares if Unreal Tournament 2004 could run faster and more efficiently with fixed-function hardware. People care about Crysis 2 and every other contemporary application, which depend heavily on the shading performance. The next generation of consoles will also spur the creation of more diverse games which demand generic computing cores.
Applications such as Crysis 2 still make extensive use of fixed function hardware, texturing units being the most obvious; I'm pretty sure rasterisation is heavily utilised as well...

The simple reality is that GPUs won't increase in theoretical performance as fast as they did before. They've only been able to exceed Moore's Law because they played catch-up in process technology, aggressively increased the die size, and made the GPU the hottest part of your system. The only way they're now able to significantly increase performance is by waiting for another process shrink to allow more transistors for a given power envelope. But do you want that to benefit particular applications, or do you want it to benefit all applications? So far the programmability has always increased, and I don't see this ending any time soon...

I'd agree that programmability will increase; however, I'm certain that doesn't mean the end of fixed function hardware. I think it just means a refactoring of how it's expressed within the pipeline.

John.
 
What you and a lot of people don't understand is that the majority of space taken up by shader units is data flow. NVidia has also talked about using lots of distributed cache to reduce power consumption, because data flow is the big problem there, too.
Nitpicking here, but to be clear, you mean ...
majority of space inside shader units is taken up by data flow.

Right?
 