Hardware MSAA

It does slow down in conflict cases. I am not clear on the latter claim. Both banking (pseudo-dual porting, as in an AMD CPU) and multiporting involve performing addressing on more than one access per cycle. What is the "lots of additional logic"?
The addressing logic for an entire vector. With gather/scatter as supposedly implemented by Larrabee, you only need to compute one full address per cycle. The rest only requires comparing the upper bits of the offsets to check whether they translate to the same cache line, and using the lower bits for addressing within the cache line.
There must be reasons why it is not available. Innovation is usually made as people work around problems and constraints. It is difficult to predict innovations by ignoring those constraints.
Yes, there absolutely has to be a reason, but design constraints are just one of many possible reasons. So again, by itself, something not being available yet is absolutely never an argument against it.
If there is locality within the address space, why can't a few wide loads and shuffling the values around suffice?
Because at the software level you don't know in advance which vectors to load and how to shuffle them. When sampling 16 texels they could all be located in one vector so you just need one vector load and one shuffle operation, or they could all be further apart and require 16 loads and shuffles.

Of course gather/scatter can be implemented in software too, but you absolutely won't achieve a peak throughput of 1 vector per cycle.
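For concreteness, a software-emulated gather is essentially one scalar load (plus an insert into the result vector) per lane. A minimal sketch in plain C++, not tied to any particular ISA, with invented names:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-wide vector of 32-bit values, as in a Larrabee-style SIMD.
using Vec16 = std::array<uint32_t, 16>;

// Software-emulated gather: one scalar load per lane. A hardware gather with
// cache-line coalescing could satisfy many lanes per cycle; done in software,
// each lane costs at least one load issue, so a peak throughput of one full
// vector per cycle is out of reach.
Vec16 gather_emulated(const uint32_t* base, const std::array<uint32_t, 16>& indices) {
    Vec16 result;
    for (std::size_t lane = 0; lane < 16; ++lane) {
        result[lane] = base[indices[lane]];  // scalar load + insert, per lane
    }
    return result;
}
```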
We're on a utilization kick here, why are we now pulling peak performance (at the expense of utilization) into the argument?
Because they can't be separated. Low utilization for a dedicated rasterizer is fine as long as it is tiny and offers high peak performance. Likewise, not using all lanes in arithmetic vector operations is an acceptable compromise. And finally, requiring extra cycles to access multiple cache lines in a gather or scatter operation is fine as long as the hardware cost is reasonable and peak performance makes up for it.
This has been asserted, not substantiated.
What did you expect, a netlist?

I've indicated that it requires only relatively simple functionality. So instead of counter-asserting it with nothing at all, please get me some real counter-arguments why it wouldn't be feasible.
My statement was a shot at implementing a specialized load on a generic architecture, not a specialized design.
I know, but it raises the question of whether doubling the number of ports would have been too expensive. Note that these architectures can perform 2 FMA operations for every load/store. That's 256 bits of ALU input/output data for every 32 bits of load/store bandwidth. Correct me if I'm wrong, but that seems like it could be a severe bottleneck. Larrabee has twice the L1 bandwidth.
We can get much better utilization of generic hardware with a stream of scalar loads.
I was curious about what settings you used to arrive at your numbers.
To prove/disprove what point?

Anyway, I'm short on time (too busy working on ANGLE), but feel free to test it yourself with the public evaluation demo. It runs Crysis 2 fine, and by running the benchmark with different RAM latencies you can evaluate how much out-of-order execution, prefetch and Hyper-Threading compensate for it. I'm curious about the exact results myself, although I'm quite confident that the effect of increasing RAM latency will be small.
True multi-porting is more expensive than banking, since it increases the size of the storage cells and adds word lines.
Sure, but it's free of banking conflicts. And for gather/scatter it might be a necessary compromise to keep the throughput close to 1. You need to weigh the cost against the gains.

In the case of SNB, it already has a dual-ported L1 cache, so two 128-bit gather/scatter units (instead of a bigger 256-bit unit) would allow the operations to complete in 4 cycles worst case when all data is in L1. And with a best case of 1 cycle, the average throughput for accesses with high locality (like texture sampling) would be excellent. In my opinion it would make the IGP redundant.
In the absence of consideration for power, area, and overall performance within those constraints, probably true.
Fast forward ten years. What good would a power efficient GPU with 100,000 ALUs be if you can't realistically reach good utilization? It seems well worth it to me to spend part of that area on techniques that speed up single-threaded performance.
How would this be implemented? It sounds like it would need some kind of stateful load-store unit to know the proper mapping over varying formats and tiling schemes.
Why? Just let the software keep track of what version of load/store instruction has to be used.

One generic implementation is to interleave the lower address bits:
... b11 b10 b9 b8 b7 b5 b6 b3 b4 b1 b2 b0
 
True, but for that a more flexible memory hierarchy is much more important, as memory latency is 30x higher than ALU latency and rising.
Yes, that's why I suggested prefetching to bring down the average memory access latency.
Will a smaller ALU latency vanquish the need for hiding memory latency?
I didn't say ALU latency. I said latency, including both ALU and memory latency.
 
The addressing logic for an entire vector. With gather/scatter as supposedly implemented by Larrabee, you only need to compute one full address per cycle.
I must have misunderstood. I thought you were comparing a multi-port design with a multi-banked design, not a single port versus banked.

Because at the software level you don't know in advance which vectors to load and how to shuffle them. When sampling 16 texels they could all be located in one vector so you just need one vector load and one shuffle operation, or they could all be further apart and require 16 loads and shuffles.
I can see the first case being characterized as having good locality, but it would seem that the second case with many loads would not have good locality.

I've indicated that it requires only relatively simple functionality. So instead of counter-asserting it with nothing at all, please get me some real counter-arguments why it wouldn't be feasible.

Okay, let's get down to my interpretation of the scheme in question. Let's say it's Larrabee-like.
We want a scatter/gather implementation capable of 1 vector's worth of memory throughput per cycle.

In the case of gather:
We have a 16-element vector composed of 32-bit addresses. I am assuming cache line width equals vector width, as Larrabee I was disclosed as having.
I am going to assume that the goal is 1 cycle throughput, and not 1 cycle to execute (cache load penalty ignored) in the best case.

The best address checking I can think of without going to the TLB is a comparison amongst the 16 values at cache-line-base granularity (each value with its 6 least significant bits ignored, I suppose).
This should catch virtual addresses that fall within 64 bytes of each other.

This does not catch pointers that are further apart but hit pages that are mapped to the same physical address. I think we are fine when it comes to page boundaries, because given what has already been assumed a page cannot end in the middle of a cache line. I'm not up enough on x86 segmentation to know whether the same can be said there, so I'm going to ignore it. (If a line can split, then only equality of the whole value can be checked.)

Does 120 comparisons in a cycle sound correct to you?
At the end, we have a list or array of 1-16 addresses with some bits for the destination vector position.
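(For reference, an all-pairs check over the 16 line base addresses works out to 16 × 15 / 2 = 120 comparisons, matching the figure above.)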

Now we can hit the TLB, get a translation, and start loading.
Worst case, we have 16 separate loads. Or is the worst case 16 separate loads whose pages alias?

Best case, we only need one load of 64 bytes.
Best best case, the values are in vector order in memory.
If not, we need to take 1-16 values and route/broadcast them to 16 locations.
Each offset of the cache line can go to 1-16 different locations in the vector.

This repeats for each load.

To issue a gather, we need 120 comparisons between the address vector elements.
We queue up to 16 loads, which can require the loading of up to 1 KiB of data for 64B of result, worst case.
Once the data is in the core, it needs to be routed/broadcast to the correct locations within the vector.
The hardware would support the case of 16 values needing 16 different locations, and the case where values are broadcast to multiple locations.

This does sound hefty to me.
The comparison count may be scale-wise comparable to the dependency checking done for register IDs, but with significantly larger values.
As much as 960 bytes can be thrown away in the load.
I am trying to find an analogous structure on known cores that would have similar activity to the route/broadcast portion. The number of sources and destinations is larger than most load/store queues or forwarding networks on heavy cores, and those do take a good amount of space and power.
A lot of this would be on a memory pipeline, which is usually quite sensitive to having extra work thrown into the process.
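A rough software model of the coalescing step described above may help make the cost concrete: group the 16 gather addresses by 64-byte cache line, count the distinct lines that must be fetched (1 best case, 16 worst case), and record which destination lanes each line feeds. Plain C++ with invented names; a sketch of the bookkeeping only, not of the hardware:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LineRequest {
    uint32_t line_base;                   // address with the low 6 bits cleared (64-byte line)
    std::vector<std::size_t> dest_lanes;  // destination positions in the result vector
};

// Group 16 gather addresses by cache line. Comparing line bases across all pairs,
// as described above, is 16*15/2 = 120 comparisons; here it is done sequentially.
std::vector<LineRequest> coalesce(const std::array<uint32_t, 16>& addrs) {
    std::vector<LineRequest> lines;
    for (std::size_t lane = 0; lane < 16; ++lane) {
        const uint32_t base = addrs[lane] & ~uint32_t{63};  // drop the 6 offset bits
        bool found = false;
        for (auto& req : lines) {
            if (req.line_base == base) {   // upper-bit comparison: same cache line
                req.dest_lanes.push_back(lane);
                found = true;
                break;
            }
        }
        if (!found) lines.push_back({base, {lane}});
    }
    // lines.size() is the number of loads: 1 in the best case (all 16 texels in
    // one 64-byte line), 16 in the worst case (1 KiB fetched for 64 B of result).
    return lines;
}
```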

To prove/disprove what point?
I wanted some context to the numbers, and to compare them to other benchmarking runs of the game at those settings. Settings can influence what parts of the process become bottlenecks, and in games there can be a wide variation depending on when and where the measurement is taken.

Sure, but it's free of banking conflicts. And for gather/scatter it might be a necessary compromise to keep the throughput close to 1. You need to weigh the cost against the gains.

In the case of SNB, it already has a dual-ported L1 cache, so two 128-bit gather/scatter units (instead of a bigger 256-bit unit) would allow the operations to complete in 4 cycles worst case when all data is in L1.
Interesting note, I checked Agner Fog's optimization guide, and the SB L1 is broken into 4 banks. Bulldozer's is documented as having 16.

Fast forward ten years. What good would a power efficient GPU with 100,000 ALUs be if you can't realistically reach good utilization? It seems well worth it to me to spend part of that area on techniques that speed up single-threaded performance.

If at that time no other alternative presents itself, it would be plausible.
OoO would reduce the number of threads that could be supported, so that part would be true by default.
It does not help with incoherent branching, which seems to be a significant portion of the lack of utilization on SIMD machines, and can potentially make it worse if speculation is increased.


Why? Just let the software keep track of what version of load/store instruction has to be used.

One generic implementation is to interleave the lower address bits:
... b11 b10 b9 b8 b7 b5 b6 b3 b4 b1 b2 b0

I was trying to envision a scheme supporting a 4x4 tile and how to make a single instruction apply to a different tile size, or a change in the binary size of a pixel if the format changes.
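To sketch how that could work with the software simply picking the address computation (per the "let the software keep track" suggestion above), here is a hypothetical tiled-offset helper where the tile dimensions and bytes per texel are compile-time parameters, so a different tile size or format just means a different specialization rather than new load-store-unit state. All names are invented for illustration:

```cpp
#include <cstddef>

// Byte offset of texel (x, y) in a surface stored as a grid of TileW x TileH tiles,
// texels laid out row-major inside each tile. Assumes surface_width is a multiple
// of TileW.
template <unsigned TileW, unsigned TileH, unsigned BytesPerTexel>
constexpr std::size_t tiled_offset(unsigned x, unsigned y, unsigned surface_width) {
    const unsigned tiles_per_row = surface_width / TileW;
    const unsigned tile_x = x / TileW, in_x = x % TileW;
    const unsigned tile_y = y / TileH, in_y = y % TileH;
    const std::size_t tile_index = std::size_t{tile_y} * tiles_per_row + tile_x;
    const std::size_t in_tile    = std::size_t{in_y} * TileW + in_x;
    return (tile_index * (TileW * TileH) + in_tile) * BytesPerTexel;
}

// Example: 4x4 tiles of 32-bit texels in a 1024-texel-wide surface:
//   std::size_t off = tiled_offset<4, 4, 4>(x, y, 1024);
```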
 
What you and a lot of people don't understand...
It's kind of arrogant to think that I and a lot of people don't understand something. It's quite possible we're ignorant about something, but that doesn't mean we wouldn't understand it if it were properly explained.
...is that the majority of space taken up by shader units is data flow.
How is that solved by having distant fixed-function units that need to collect/distribute data from/to many programmable units (and growing)?

With software rasterization you can keep data movement to a minimum because it can be done more locally.
NVidia has also talked about using lots of distributed cache to reduce power consumption, because data flow is the big problem there, too.
How is that in disagreement with anything I said?
Programmable shader units need a lot of flexibility in moving data around, but certain fixed function tasks do not. This is why you will never see fixed function texture filtering go away in a GPU.
Never say never. Fixed-function alpha testing and fog also didn't need any flexibility in moving data around. Yet they are gone now.

And texture sampling has many different filtering modes, addressing modes, mipmap modes, texture types, texture formats, perspective correction, gamma correction, etc. So even though we still call it fixed-function it's becoming ever more versatile. Filtering is also evolving from 8-bit to full FP32 precision. Lots of applications already regularly resort to performing custom filtering / depth testing / LOD adjusting in the shaders.

Note that with software you can prevent any data movement at all for inactive features. So the greater the variety in operations you're expecting, the more it makes sense to do it in software. Heck I'd be surprised if certain texturing stages weren't already performed in the programmable cores, or if the samplers themselves didn't contain some simple form of program.

Remember that pixel processing has evolved from just a single configurable blend operation per texture lookup, into something fully generically programmable. Texture sampling is also slowly but surely evolving away from just a few configurable fixed-function states, into a more generic form of memory access.
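As a rough illustration of the "custom filtering in the shaders" mentioned a couple of paragraphs up, here is a scalar bilinear filter over a single-channel FP32 texture, written as plain C++ rather than shader code; the function name and the clamp-to-edge addressing are assumptions made for the sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinear filter of a single-channel FP32 texture with clamp-to-edge addressing.
// u, v are normalized texture coordinates in [0, 1].
float sample_bilinear(const std::vector<float>& texels, int width, int height,
                      float u, float v) {
    // Map to texel space, aligning sample points with texel centers.
    const float x = u * width - 0.5f;
    const float y = v * height - 0.5f;
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - x0;  // horizontal weight
    const float fy = y - y0;  // vertical weight

    auto fetch = [&](int tx, int ty) {
        tx = std::clamp(tx, 0, width - 1);   // clamp-to-edge addressing
        ty = std::clamp(ty, 0, height - 1);
        return texels[static_cast<std::size_t>(ty) * width + tx];
    };

    // Four fetches and three lerps: the work a fixed-function filter performs.
    const float top    = fetch(x0, y0)     * (1.0f - fx) + fetch(x0 + 1, y0)     * fx;
    const float bottom = fetch(x0, y0 + 1) * (1.0f - fx) + fetch(x0 + 1, y0 + 1) * fx;
    return top * (1.0f - fy) + bottom * fy;
}
```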
The cost of just getting all the data from the texture cache to the shader units at the rate needed to maintain speed is more than that of the logic eliminated. It just makes sense to decompress and filter eight RGBA values and send one to the shader.
Making things programmable certainly isn't more power efficient, nor does it offer the same peak performance. But despite that, the pixel pipeline became programmable and even unified with vertex processing. So clearly you have to take other factors into account as well.

No matter how inefficient something may seem to hardware designers, there's no other choice but to follow the demands of the software developers. If they want more generic memory accesses, texture samplers will eventually disappear. We can disagree on whether this is what they want, but I don't think fixed-function hardware can turn it around (in the long run).
Bump mapping is not a good example, as pixel shaders evolved from DOT3 and EMBM (i.e. PS is just bump mapping extended).
That was the whole point! It's a perfect example. Fixed-function hardware disappears because the usage evolves in favor of something more programmable.
There's nothing ridiculous about alpha testing, as it's still fixed function. You just think that it's "general" because they gave it an instruction name.
It's generic because the test has become arbitrary. The actual killing of the 'pixel' is now also used for other functionality.
First of all, any algorithm that puts the burden of motion and defocus blur on the rasterizer is very slow compared to other techniques with nearly as good results.
Other techniques, that use programmable shaders?
Secondly, nobody is planning to implement those features, and rpg.314 wasn't suggesting it either.
Now ask yourself why nobody is planning to implement those features in hardware...

It's perfectly feasible, but there's a severe risk that developers will use other techniques and it becomes an utter waste of silicon. Hence fixed-function loses, despite theoretically being the most efficient solution.
The data routing and pixel ordering challenges in a real GPU are even harder to address with ALU rasterization, so you're not helping your case.
There's definitely a huge software challenge ahead of us. But that doesn't make it a bad idea. There are lots of things which used to be fixed-function and now require serious programming effort to compute efficiently (not just for application developers but for firmware and driver developers as well). This effort is worth it because it allows more exciting applications. Like I've said before, nobody's interested in just how efficiently you could render UT2004 with fixed-function hardware. And that game is hardly 7 years old. More programmability is the only way to keep things interesting in the long run.
 
There has been a lot of research in linear algebra algorithms with regards to special orders for matrix-matrix or matrix-vector operations. The idea is to map the 2D structures of matrices to the linear structure found in caches.
Ah, but that doesn't require any significant hardware change then. I thought that was what you meant before: "My take is, and it is very likely with graphics converging with compute, is that hardware (physical memory mapping) solutions will be adapted to 2D mappings in the future."
Do a search on Google for "peano order" or "morton order", or in general "space filling curves".
Yes, I know those. Morton order is exactly the pattern created by interleaving the address bits as I suggested before.

The relevant thing is that this doesn't require a specialized cache at all. It simply requires load and store variants that perform this trivial mapping.
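A minimal sketch of what such a load/store variant could compute, assuming a square power-of-two surface and 2D coordinates: the Morton (Z-order) index is just the bit interleave of x and y, the same kind of low-bit shuffle as the pattern shown earlier. All names here are illustrative only.

```cpp
#include <cstdint>

// Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) index:
// the result bits are ... y1 x1 y0 x0. This is the "trivial mapping" a tiled
// load/store variant would apply to the coordinates before the linear access.
uint32_t morton2d(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {        // space the 16 input bits apart
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}

// A tiled load is then just: value = base[morton2d(x, y)];
```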
 
Never say never. Fixed-function alpha testing and fog also didn't need any flexibility in moving data around. Yet they are gone now.
Single instruction operations with braindead data flow. Poor examples when considering something as irregular as rasterization and/or something as specialized as texture filtering/decompression.

And texture sampling has many different filtering modes, addressing modes, mipmap modes, texture types, texture formats, perspective correction, gamma correction, etc. So even though we still call it fixed-function it's becoming ever more versatile. Filtering is also evolving from 8-bit to full FP32 precision. Lots of applications already regularly resort to performing custom filtering / depth testing / LOD adjusting in the shaders.
FP32 filtering might be practical in ALUs, but as the texture formats and compression schemes get weirder, it makes more and not less sense to keep the FF TMU around. And texturing is latency tolerant anyway, so it makes even less sense to emulate it with low-latency ALUs.

Texture sampling is also slowly but surely evolving away from just a few configurable fixed-function states, into a more generic form of memory access.
With modern scatter/gather and now caches, global memory accesses are arguably quite flexible already. So what are the incremental benefits of deleting TMUs? It is ironic that the GPU that was supposed to have maximum flexibility grew TMUs midway through its design cycle, and not the other way around.
No matter how inefficient something may seem to hardware designers, there's no other choice but to follow the demands of the software developers. If they want more generic memory accesses, texture samplers will eventually disappear. We can disagree on whether this is what they want, but I don't think fixed-function hardware can turn it around (in the long run).
Developers already have generic and efficient global memory access. What will they get by deleting TMUs, apart from an OoO penalty in texturing performance?

More programmability is the only way to keep things interesting in the long run.
Nobody is denying that. Everybody except you seems to be saying the future is programmable HW + FF HW. You are saying the future is programmable HW only, without putting up numbers to back your claims.
 
More programmability is the only way to keep things interesting in the long run.
Honestly. Why would people interested in playing games care about and pay for the intellectual satisfaction of a few tech geeks?
Computer graphics is a means to an end. And in terms of financing, that end is entertainment. If your intellectual satisfaction doesn't provide me with better entertainment, I'm not interested in paying for it; and if FF hardware makes pretty pictures cheaper and with lower power draw, that's all a consumer should care about.
 
Nobody is denying that. Everybody except you seems to be saying the future is programmable HW + FF HW. You are saying the future is programmable HW only, without putting up numbers to back your claims.
I'm not sure how you expect Nick to provide numbers for a belief that fixed function hardware will go away.

I personally think we'll have fixed function hardware for a while, but I can't foresee what will happen 10 years from now. The good thing is no one needs to foresee what things will look like 10 years from now as it doesn't take that long to design a chip.
 
Exercise for the readers: how many types of FF units have been removed from and added to GPUs in the last 5 years?
 
More programmability is the only way to keep things interesting in the long run.
"May you live in interesting times". ;)

I think I'll just have to disagree with you on the programmability / fixed-function balance. Some operations are just so common that it doesn't make sense to do them in anything other than dedicated hardware.
 
Off the top of my head:
FF hardware TnL added and then superseded
Alpha blending was hardware, now superseded (edit: testing, not blending)
Interpolation has moved to the shader cores

As these fixed function units came and went:
Media coprocessors like UVD have taken up residence and look to maintain their presence.
The number of TMUs, rasterizers, and setup pipelines has gone up.
Tessellation has been around in some ways on ATI hardware for a few hardware generations and is now a part of the pipeline.

There are elements of the architecture that probably don't get explicit mention as units, perhaps other units on the low-demand ring bus used on AMD chips, for example. New formats, additional compression schemes, and special instructions may imply specialized silicon inserted in various places.

In the case of atomics, there are units that are either dedicated to managing them or certain units like the ROPs are repurposed for the use.

Shared memory was added, and it is either a separate pool or a specially managed cache that does double-duty.

edit: And there is the topic of this thread.
 
Off the top of my head:
FF hardware TnL added and then superseded
Alpha blending was hardware, now superseded
Interpolation has moved to the shader cores

As these fixed function units came and went:
Media coprocessors like UVD have taken up residence and look to maintain their presence.
The number of TMUs, rasterizers, and setup pipelines has gone up.
Tessellation has been around in some ways on ATI hardware for a few hardware generations and is now a part of the pipeline.

There are elements of the architecture that probably don't get explicit mention as units, perhaps other units on the low-demand ring bus used on AMD chips, for example. New formats, additional compression schemes, and special instructions may imply specialized silicon inserted in various places.

In the case of atomics, there are units that are either dedicated to managing them or certain units like the ROPs are repurposed for the use.

Shared memory was added, and it is either a separate pool or a specially managed cache that does double-duty.

edit: And there is the topic of this thread.
What desktop GPUs implement alpha blending in software?
Anyway, the point I was trying to make is that in the last 5 years we have observed a stabilization, if not a steady but slow increase, in the amount of FF hardware on (desktop) GPUs.
 
I'm honestly not sure how anyone can claim to make blanket statements about "fixed function" and "programmable" hardware without a specific definition and a specific workload. What even is "fixed function"? 32-bit float ALUs are kind of fixed function, right? Do they count? What about ROPs? Atomic units? Register renaming? Caches? It's all a bit silly to try and make sweeping statements without a specific context of "I want more functionality in X" or "Y is too slow in software".

Similarly, the latter analysis can only be made with respect to some given workload... texture units are a huge waste of space for an application that just does fp32 MADs and so on.

And Mintmaster had it right earlier in the thread: the units themselves are largely uninteresting... the data paths are what matter. Thus, instead of talking about a "fixed function" vs "programmable" rasterizer for instance, let's talk about the amount of flexibility required in and out of a conceptual rasterization stage, and whether it's a good fit for a conventional programmable memory architecture or not, and so on. That sort of discussion is far more interesting and relevant IMHO.
 
I'm honestly not sure how anyone can claim to make blanket statements about "fixed function" and "programmable" hardware without a specific definition and a specific workload. What even is "fixed function"? 32-bit float ALUs are kind of fixed function, right? Do they count? What about ROPs? Atomic units? Register renaming? Caches? It's all a bit silly to try and make sweeping statements without a specific context of "I want more functionality in X" or "Y is too slow in software".
Where a particular block lies on the continuum from being purely fixed to specialized to fully general-purpose would seem to me to rely on an analysis of the hardware and its accessibility to the programmer. Its generality may also depend on the implicit and explicit assumptions made about the data and behavior of its expected workload.

For example, a 32-bit pipe on a standard CPU would appear to be general-purpose and programmable. In the case of running an ADD, the hardware itself supplies at most a single algorithmic step relative to the software: the add itself. Simple binary or unary operations and data operations that can be strung together arbitrarily to synthesize multiple algorithmic steps would seem to be the basis of being programmable. They use common data paths that software instructions drive almost directly.
Convenience and practicality indicate that certain operations are worth some hardware assist: more complex ops that may count as a single algorithmic step in a program (multiply, divide) could be synthesized from smaller operations, but can instead be implemented with dedicated units.

On the other hand, a unit that can do something like produce a bilinearly filtered value implements multiple algorithmic steps, in the fetching and in multiple math ops, using internal data paths that mostly reside within a black box the software does not see. It cannot arbitrarily string together the various low-precision ALU ops or add random things into the process, and the routing is either static or handled by internal sequencers that the program does not access.
Further along the continuum might be something like the old fixed-function TnL, which implemented almost a whole algorithm through various data paths and units not really visible to a software stream.

Perhaps there is a holographic principle of sorts for the black box of hardware, rather than the black hole of a singularity: the surface would be its "fixed-function-ness", some relation of the aggregate number of units and unobservable data paths within. Perhaps some kind of base unit, like the basic functionality of a transputer element, could be its measure.


Other contextual clues such as the base assumptions of the architecture can inform the debate.
What datum is the primary unit it works with, and how far is it from commonly used forms?
What kind of workload is assumed and what advantages or pitfalls does it introduce over something closer to a vanilla von Neumann device?
What kind of parallelism could it exploit, if any?
What sort of restrictions exist that prevent arbitrary organizations of instructions?
 