This would be an example of where the most generic solution, a 1-D memory space, penalizes a use case where the more natural mapping would be 2-D.
http://forum.beyond3d.com/showthread.php?p=1533681#post1533681 There's no hard data, but LRB1 likely always required multiple cycles to complete a gather operation, while it is suggested that LRB3 can do it in one cycle in the best case.

> Supposedly, according to what citable source?
64b per cycle fits within the released data for LRB1 if all loads hit the same cache line. That number does not seem feasible if the scatter/gather crosses cache line boundaries.
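To make that cost model concrete, here is a minimal sketch (not based on any disclosed LRB implementation) that counts how many distinct cache lines a 16-element gather touches; if the hardware can service one line per cycle, that count is a rough lower bound on the gather's latency. The 64-byte line size and 16-wide gather are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Rough cost model: a gather that can read one cache line per cycle needs at
// least as many cycles as the number of distinct lines its addresses touch.
// 64-byte lines and a 16-wide gather are illustrative assumptions.
constexpr std::size_t kLineBytes = 64;

int distinctCacheLines(const uint64_t* addresses, int count) {
    std::set<uint64_t> lines;
    for (int i = 0; i < count; ++i)
        lines.insert(addresses[i] / kLineBytes);
    return static_cast<int>(lines.size());
}

int main() {
    uint64_t best[16], worst[16];
    for (int i = 0; i < 16; ++i) {
        best[i]  = 0x1000 + 4 * i;          // all 16 elements in one line
        worst[i] = 0x1000 + kLineBytes * i; // every element in its own line
    }
    std::printf("best case:  %d line(s)\n", distinctCacheLines(best, 16));
    std::printf("worst case: %d line(s)\n", distinctCacheLines(worst, 16));
}
```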
Current many-port caches use multi-banking, which becomes slow when you have bank conflicts and requires a lot of duplicate addressing logic. Making use of wider cache lines and gathering multiple elements in parallel can drastically reduce the number of ports you need.

> Architectures with fast caches and multiple ports already exist. Non-specialized architectures with fast caches and scatter/gather do not.
Texel lookups are typically very localized (with lots of reuse between adjacent pixels). But it can't be implemented with wide loads and shuffle.

> Can you clarify this? If there were already spatial coherence, why the need to gather/scatter?
> Just use wide loads and shuffle.
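For readers who haven't written this kind of code, here is a minimal SSE sketch of what "wide loads and shuffle" amounts to: load whole aligned vectors, then permute lanes. The catch, which is the crux of the disagreement above, is that the shuffle control is a compile-time immediate, so this only works when the access pattern is fixed and local; the row layout and indices below are purely illustrative.

```cpp
#include <xmmintrin.h>  // SSE
#include <cstdio>

// "Wide loads and shuffle": fetch whole aligned vectors, then permute lanes,
// instead of issuing one scalar load per element. Works only when the wanted
// elements are known at compile time and sit in a couple of loaded vectors.
int main() {
    alignas(16) float row[8] = {10, 11, 12, 13, 14, 15, 16, 17};

    __m128 lo = _mm_load_ps(row);      // row[0..3]
    __m128 hi = _mm_load_ps(row + 4);  // row[4..7]

    // Pick row[1], row[3] from 'lo' and row[4], row[6] from 'hi'.
    __m128 picked = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 3, 1));

    alignas(16) float out[4];
    _mm_store_ps(out, picked);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // 11 13 14 16
}
```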
Gather/scatter is no less generic than any other vector operation. Yes, there are worst-case scenarios where you need a cycle per scalar, but that's also true for arithmetic operations. But just like arithmetic vector operations are very valuable because in the typical case they provide a major parallel speedup, so does gather/scatter help a great deal, with a low hardware cost. So there's no need to point out the corner cases.

> It isn't the case that any vector instruction is quite as generic as a scalar, nor is a gather/scatter as generic as a load/store. They can potentially draw close, but there are corner cases that would mean that the scalar/vector ops cannot be fully interchangeable, and scalar ops are the baseline for being generic (within the defined data types and operand widths they support).
Generic load/store caches as implemented in Fermi and Cayman both only offer half the bandwidth compared to ALU throughput. And because they're multi-banked they're not without area compromises either. And considering that they're shared by hundreds of threads, they're really tiny and must have a poor hit ratio.

> I'm not certain why we should waste silicon on a gather/scatter. It is readily implemented by generic loads and stores, and imagine the higher silicon utilization. Much better than having the locality checks and multiplexing hardware that would just sit there idle with scalar loads.
Tell me the settings you want me to test (just one combination please).

> Settings (res, AF, AA, etc.)? FPS, in-game, benchmark, etc.?
What makes you think that? Mipmapped texture accesses have a very high spatial locality because on average only one extra texel is accessed per pixel. The sampling footprints of adjacent pixels largely overlap.

> Texture accesses have good spatial coherence, but they will not generally fall on the same cache line.
Tiling improves locality and saves bandwidth. Take for example a 4x4 pixel tile. Without texture tiling, you need to access 5-7 cache lines. With 4x4 texel tiles, the pixel tile typically hits 4 texel tiles. Also, with texture tiling the next pixel tile will typically reuse half of the texture tiles, while without tiling it depends on the orientation whether or not any of the cached tiles will get reused before being evicted.

> A straightforward large-triangle rasterizer using scanlines, for instance, will access the textures along a pretty much random angled axis in UV space. Add tiled storage and the chances of all work-items hitting the same cache line rapidly go to 0 without some rather complex sorting of work-items by texture access.
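To put rough numbers behind the cache-line counts claimed above, here is a small sketch that counts the distinct 64-byte lines touched by a 5x5-texel footprint (roughly what a 4x4 pixel block needs with bilinear filtering) under a linear layout versus a 4x4-tiled layout. The 4-byte texels, 64-byte lines, 1024-texel-wide texture and the two footprint positions are illustrative assumptions, not a description of any particular GPU.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Count distinct 64-byte cache lines touched by a 5x5-texel footprint under
// (a) a linear row-major layout and (b) a 4x4-tiled layout.
// Illustrative assumptions: 4-byte texels, 1024-texel-wide texture.
constexpr int kWidth = 1024, kTexelBytes = 4, kLineBytes = 64, kTile = 4, kFootprint = 5;

uint64_t linearAddr(int x, int y) {
    return (uint64_t(y) * kWidth + x) * kTexelBytes;
}

uint64_t tiledAddr(int x, int y) {
    // 4x4 tiles of 4-byte texels: each tile is exactly one 64-byte cache line.
    uint64_t tileIndex = uint64_t(y / kTile) * (kWidth / kTile) + x / kTile;
    uint64_t inTile = uint64_t(y % kTile) * kTile + x % kTile;
    return tileIndex * (kTile * kTile * kTexelBytes) + inTile * kTexelBytes;
}

int linesTouched(uint64_t (*addr)(int, int), int x0, int y0) {
    std::set<uint64_t> lines;
    for (int y = y0; y < y0 + kFootprint; ++y)
        for (int x = x0; x < x0 + kFootprint; ++x)
            lines.insert(addr(x, y) / kLineBytes);
    return int(lines.size());
}

int main() {
    for (int x0 : {2, 14})  // the second footprint straddles a line boundary horizontally
        std::printf("footprint at (%d,2): linear %d lines, tiled %d lines\n",
                    x0, linesTouched(linearAddr, x0, 2), linesTouched(tiledAddr, x0, 2));
}
```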
VTune can count cache misses, but because there are many dynamically generated functions it's not really feasible to isolate the ones related to texturing. It might be possible to use serializing instructions and count the actual number of clock cycles per access, but then we'd still have to isolate the effect of out-of-order execution, prefetching and SMT. Frankly that's a lot of work and I seriously doubt it will change the overall conclusion.

> PS. Just for reference, would it be possible to determine what percentage of texture accesses during normal operation stall? You obviously do a lot more work in between texture accesses, which distorts the picture a bit.
Tiling merely requires swapping some of the address bits around. You could just have regular and tiled load/store instructions. No need to change the actual cache.

> This would be an example of where the most generic solution, a 1-D memory space, penalizes a use case where the more natural mapping would be 2-D.
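A minimal sketch of what "swapping some of the address bits around" can look like, under the same illustrative assumptions as the earlier sketch (1024-texel-wide texture, 4-byte texels, 4x4 tiles); this is not any particular GPU's swizzle, just the generic idea of relocating the two low bits of y so that each tile becomes one contiguous cache line.

```cpp
#include <cstdint>
#include <cstdio>

// "Regular" vs "tiled" addressing as pure bit manipulation.
// Illustrative assumptions: 1024-texel-wide texture, 4-byte texels, 4x4 tiles.
constexpr uint32_t kWidthLog2 = 10;   // 1024 texels per row
constexpr uint32_t kTexelShift = 2;   // 4 bytes per texel

uint32_t linearAddress(uint32_t x, uint32_t y) {
    return ((y << kWidthLog2) | x) << kTexelShift;
}

uint32_t tiledAddress(uint32_t x, uint32_t y) {
    // Move the two low bits of y in between the low and high bits of x,
    // so that each 4x4 tile (64 bytes) occupies one contiguous cache line.
    uint32_t tileX = x >> 2, tileY = y >> 2;
    uint32_t inTile = ((y & 3) << 2) | (x & 3);
    return ((((tileY << (kWidthLog2 - 2)) | tileX) << 4) | inTile) << kTexelShift;
}

int main() {
    std::printf("linear(5,3) = 0x%x, tiled(5,3) = 0x%x\n",
                linearAddress(5, 3), tiledAddress(5, 3));
}
```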
Could you point me to a few of those? Thanks.

> ...the topic of countless research efforts. All of them (the serious ones) with excellent results.
If you can compile your code under Linux, you could try "kcachegrind", which might give you some useful (albeit approximate/simulated) figures.

> VTune can count cache misses, but because there are many dynamically generated functions it's not really feasible to isolate the ones related to texturing...
Good luck selling >4 cores to 90% of the consumer population. The data center market, for some reason, wants LOTS of tiny x86/ARM cores per die. I wonder if utilization is the reason...

> Multi-core,
No predication, no scatter/gather. IOW, useful only for tiny niches.

> AVX,
You can already dual-issue mul and add. FMA doesn't increase compute density by itself. You'll need to ~double the area of the SIMD unit for that.
Where are these CPUs with gather/scatter?

> ...and gather/scatter are making the CPU significantly more powerful for high throughput tasks than their ancestors.
GPUs, on the other hand, struggle to become more efficient at tasks other than graphics, because they hang on to a significant amount of fixed-function hardware and to hiding all latency through threading. This is why things like GPU physics have so far had only mediocre success at best.
There is no hard data for that claim, true.

> http://forum.beyond3d.com/showthread.php?p=1533681#post1533681 There's no hard data, but LRB1 likely always required multiple cycles to complete a gather operation, while it is suggested that LRB3 can do it in one cycle in the best case.
It does slow down in conflict cases. I am not clear on the latter claim. Both banking (pseudo-dual porting if an AMD CPU) and multiporting involve performing addressing on more than one access per cycle. What is the "lot of duplicate addressing logic"?

> Current many-port caches use multi-banking, which becomes slow when you have bank conflicts and requires a lot of duplicate addressing logic.
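For readers following the bank-conflict point, here is a toy model of a multi-banked data cache servicing a 16-wide gather: the word address modulo the bank count selects the bank, and with one port per bank the worst-case bank occupancy bounds the number of cycles the access takes. The 32 banks, 4-byte words and 16-wide gather are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>

// Toy model of a multi-banked cache: word address modulo the bank count
// selects the bank. With one port per bank, a 16-wide gather needs at least
// as many cycles as the most heavily loaded bank has requests.
// 32 banks and 4-byte words are illustrative assumptions.
constexpr int kBanks = 32;

int conflictCycles(const uint32_t* addresses, int count) {
    std::map<int, int> perBank;
    int worst = 0;
    for (int i = 0; i < count; ++i) {
        int bank = (addresses[i] / 4) % kBanks;
        worst = std::max(worst, ++perBank[bank]);
    }
    return worst;
}

int main() {
    uint32_t strided[16], conflicting[16];
    for (int i = 0; i < 16; ++i) {
        strided[i]     = 4 * i;            // 16 different banks: 1 cycle
        conflicting[i] = 4 * kBanks * i;   // all map to bank 0: 16 cycles
    }
    std::printf("strided: %d cycle(s), same-bank: %d cycle(s)\n",
                conflictCycles(strided, 16), conflictCycles(conflicting, 16));
}
```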
There must be reasons why it is not available. Innovation is usually made as people work around problems and constraints. It is difficult to predict innovations by ignoring those constraints.

> And just because such gather/scatter hardware is not available to the public yet doesn't make it an argument against it. Innovation would be at a standstill if we only looked at existing solutions.
There is locality in the problem space and locality with respect to the linear address space. If there is locality within the address space, why can't a few wide loads and shuffling the values around suffice?

> Texel lookups are typically very localized (with lots of reuse between adjacent pixels). But it can't be implemented with wide loads and shuffle.
That is consistent with what I said. I went on to say that vector operations are not as generic as scalar ones.

> Gather/scatter is no less generic than any other vector operation.
We're on a utilization kick here; why are we now pulling peak performance (at the expense of utilization) into the argument?

> But just like arithmetic vector operations are very valuable because in the typical case they provide a major parallel speedup,
This has been asserted, not substantiated.

> so does gather/scatter help a great deal, with a low hardware cost. So there's no need to point out the corner cases.
My statement was a shot at implementing a specialized load on a generic architecture, not a specialized design.

> Generic load/store caches as implemented in Fermi and Cayman both only offer half the bandwidth compared to ALU throughput. And because they're multi-banked they're not without area compromises either.
When exactly is idle hardware good or bad?

> And yes, while the gather/scatter-specific logic remains idle with regular wide loads, note that current GPUs have lots of different specialized caches which are frequently either a bottleneck or idle.
I was curious about what settings you used to arrive at your numbers.

> Tell me the settings you want me to test (just one combination please).
True multi-porting is more expensive than banking, since it increases the size of the storage cells and adds word lines.

> Combining true multi-porting and gather/scatter seems close to ideal to me to achieve high performance at a reasonable hardware cost.
In the absence of consideration for power, area, and overall performance within those constraints, probably true.

> I'm confident that even the simplest out-of-order execution implementation would allow GPUs to reduce the number of threads needed to achieve good utilization.
Fermi is already there.

> The TEX:ALU ratio is still going down. At the same time, more generic memory accesses are needed. So instead of having multiple specialized caches with varying utilization, you could have one cache hierarchy which doesn't disadvantage any usage.
That is partly because on a CPU you do a lot more ALU operations than you would with some fixed-function hardware (filtering, texture address generation and the like), so it's easier to keep even an otherwise ALU-poor (relatively speaking, of course) architecture fed.

> And yet 2-way SMT works like a charm for SwiftShader. Combined with out-of-order execution and prefetching, memory access latency is not an issue.
True, but for that a more flexible memory hierarchy is much more important, as memory latency is 30x higher than ALU latency and rising.

> The number of threads is becoming a significant problem for GPUs. A lot of GPGPU applications achieve only a fraction of the core utilization because they just don't have enough data to process in parallel. And it's not getting any better as the core counts increase! Also, with ever longer code the context for each thread requires lots of on-die storage.
Will a smaller ALU latency vanquish the need for hiding memory latency?

> The only solution is reducing latencies. And because it allows a smaller register set it doesn't have to cost a lot, or even anything at all. Plus it makes the architecture capable of running a wider variety of workloads.
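A back-of-the-envelope illustration of how latency translates into required thread context, which is the point being argued above; the latency and issue-rate numbers are made up purely for illustration.

```cpp
#include <cstdio>

// Little's-law style estimate: to keep an execution unit busy, you need
// roughly (latency to hide) / (cycles of independent work per thread)
// threads in flight. The numbers below are illustrative, not measured.
int threadsNeeded(int latencyCycles, int independentWorkCycles) {
    return (latencyCycles + independentWorkCycles - 1) / independentWorkCycles;
}

int main() {
    // e.g. 400-cycle vs. 40-cycle memory latency, with ~8 cycles of
    // independent ALU work between dependent loads.
    std::printf("400-cycle latency: ~%d threads in flight\n", threadsNeeded(400, 8));
    std::printf(" 40-cycle latency: ~%d threads in flight\n", threadsNeeded(40, 8));
}
```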
The Fermi architecture analysis on B3D explained how Fermi does it.

> I'm not aware how Fermi does its screen tiling. Would you be so kind as to update me on this technology? I'm quite interested to know how this would allow one to forget about triangle sizes.
Where are these "more programmable chips of the competition...."?Keep in mind that for every hardware designer, there are probably hundreds of software developers. So if you dedicate transistors to feature A, and applications implement feature X, Y and Z in software, your GPU will either be slower than the more programmable chip of the competition, or it will have a higher power consumption to reach the same performance.
A full-custom rasterizer which can do large and small triangles with high throughput will take an area much closer to 4 mm² than to the ~20 mm² of Fermi's cores. Besides, large triangles are becoming less and less common, so you can probably cut corners there without having to worry too much.

> That's for a micropolygon rasterizer only. What I said is that "a dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large". And as I also noted before, their area and power consumption numbers are for the isolated rasterizer. An actual integrated rasterizer unit with all of its supporting circuitry will be larger and hotter.
Meh, a 32- or 64-deep FIFO should be enough. I doubt it would take even 0.1 mm² of area.

> Their work also doesn't include any hardware to make the wide pixel pipeline process fragments from multiple micropolygons.
No, overall perf/W factors in whatever utilization actually took place. Besides, if the fixed-function hardware is cleanly isolated, it is easier to power down than a bit of hardware deeply embedded inside the data path.

> Performance/Watt assumes 100% utilization. That makes all fixed-function hardware look great, even stuff that's practically useless!
Let it be; that's hardly something to bother about when the rest of the 250 mm² worth of silicon is right next door.

> Not necessarily. Effective total performance depends on utilization. If you have long shaders, the rasterizer will be idle most of the time.
How many shader units does 4 mm² buy these days?

> So you're wasting space, which could have been used for more shader units instead. To achieve the same effective performance you'd have to overclock the available cores (frequency + voltage), which consumes more power than just having more shader units available.
In 10 years, sure, why not?

> At a certain point it simply complicates things more to fit dedicated hardware into your architecture than to just do it in software.
IIRC, LRB1 could do scatter/gather to one cache line from L1 in 4 clocks.

> http://forum.beyond3d.com/showthread.php?p=1533681#post1533681 There's no hard data, but LRB1 likely always required multiple cycles to complete a gather operation, while it is suggested that LRB3 can do it in one cycle in the best case.
Which, if every one of those tiles is a cache line, means about 75% redundant data read.

> Take for example a 4x4 pixel tile. Without texture tiling, you need to access 5-7 cache lines. With 4x4 texel tiles, the pixel tile typically hits 4 texel tiles.
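The ~75% figure checks out under the same illustrative assumptions used in the earlier sketches (4-byte texels, 64-byte lines, 4 full tiles fetched for roughly 16 texels actually sampled):

```cpp
#include <cstdio>

// Back-of-the-envelope check of the "~75% redundant" figure, assuming
// 4-byte texels, 64-byte cache lines, and 4 texel tiles fetched per 4x4
// pixel tile (the same illustrative assumptions as in the earlier sketches).
int main() {
    int bytesFetched = 4 * 64;  // 4 tiles, one cache line each
    int bytesNeeded  = 16 * 4;  // ~16 texels actually sampled
    std::printf("redundant: %d%%\n", 100 * (bytesFetched - bytesNeeded) / bytesFetched);
}
```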
What you and a lot of people don't understand is that the majority of space taken up by shader units is data flow. NVidia has also talked about using lots of distributed cache to reduce power consumption, because data flow is the big problem there, too.

> That's hardly a relevant question. You have all of the programmable cores available for any task, exactly when you need them and for as long as you need them. So the area, power consumption and performance of dedicated versus programmable rasterization is a complex function of the utilization and other factors. Simply comparing the area for dedicated hardware versus programmable cores tells you nearly nothing.
Again, it's about data. Look at steps 1 and 3 on page 12. How do you parallelize step 1 in shaders? It's easy to pack bounding boxes into your wavefront one per clock, but not beyond that unless you want to make your wavefront an incoherent mess. What about early Z-reject? You need access to the HiZ cache and then have to do Z decompression. Step 3 is not trivial in shaders either. You either have to keep those parts fixed-function or immensely complicate your shader units.

> Here's a presentation which suggests that efficient software rasterization is within reach: https://attila.ac.upc.edu/wiki/images/9/95/CGI10_microtriangles_presentation.pdf
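For reference, the core of a software rasterizer's inner loop (bounding-box setup plus edge-function coverage test) looks roughly like the sketch below. It is a generic textbook formulation, not the method from the linked presentation, and it deliberately leaves out HiZ, Z decompression, fill rules and fragment ordering, which are exactly the contested parts of the argument above.

```cpp
#include <algorithm>
#include <cstdio>

// Bare-bones half-space (edge-function) rasterizer: compute the triangle's
// screen bounding box, then test every pixel in it against the three edges.
// No hierarchical Z, no compression, no fill-rule or ordering handling.
struct Vec2 { float x, y; };

float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

int rasterize(const Vec2& v0, const Vec2& v1, const Vec2& v2, int w, int h) {
    int minX = std::max(0, (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(w - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0, (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(h - 1, (int)std::max({v0.y, v1.y, v2.y}));

    int covered = 0;
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p = {x + 0.5f, y + 0.5f};
            if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0)
                ++covered;  // a real renderer would shade the fragment here
        }
    return covered;
}

int main() {
    std::printf("covered pixels: %d\n",
                rasterize({10, 10}, {100, 20}, {30, 90}, 128, 128));
}
```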
Bump mapping is not a good example, as pixel shaders evolved from DOT3 and EMBM (i.e. PS is just bump mapping extended). Same thing for T&L; in fact, ATI's first T&L-capable chip, the R100, had a vertex shader just short of DX8 specs to implement all the fixed-function DX7 vertex stuff. There's nothing ridiculous about alpha testing, as it's still fixed-function. You just think that it's "general" because they gave it an instruction name.

> But in the long term fixed-function just isn't interesting. Features like bump mapping, which once wowed us all and demanded a hardware upgrade, are now ridiculously simple to 'emulate' in the shaders, and actually we now want high-quality tessellation instead. There's no doubt that one day dedicating any hardware to tessellation will be as ridiculous as fixed-function bump mapping, alpha testing, fog tables, T&L, etc.
First of all, any algorithm that puts the burden of motion and defocus blur on the rasterizer is very slow compared to other techniques with nearly as good results. Secondly, nobody is planning to implement those features, and rpg.314 wasn't suggesting it either.

> That's for a micropolygon rasterizer only. What I said is that "a dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large".
The data routing and pixel ordering challenges in a real GPU are even harder to address with ALU rasterization, so you're not helping your case.

> So I sincerely doubt that it would take only 4 mm² to make GPUs capable of efficiently processing micropolygons.
Not really. What matters is idle consumption. If you can stop your fixed-function unit from using power when not in use (and AMD/NVidia are pretty good at that today), then fixed-function will always clobber general purpose.

> Performance/Watt assumes 100% utilization. That makes all fixed-function hardware look great, even stuff that's practically useless!
There's no need to state the obvious. What you're glossing over is that once you put reasonable estimates of figures into the argument, we're not even close to seeing a win from eliminating rasterizers.

> Not necessarily. Effective total performance depends on utilization. If you have long shaders, the rasterizer will be idle most of the time. So you're wasting space, which could have been used for more shader units instead.
If you look at the area associated with a rasteriser, you couldn't put sufficient ALUs down in that area to give equivalent performance, so you end up with ALUs that are bigger consumers of power being active for longer; the net result is higher power consumption.

> It all depends on the utilization. Clearly, when a dedicated component is never used, it means that to achieve the same performance with the rest of the chip you'd have a higher power consumption than when you have the entire chip for performing the task. Note that reaching higher clock frequencies also requires increasing voltage, so the power consumption increase can be substantial. So dedicated hardware has to have a certain utilization level before it becomes worthwhile. More software diversity means the average utilization of non-generic components is decreasing, though.
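The frequency/voltage point can be made quantitative with the usual first-order dynamic-power relation P ≈ C·V²·f; the 10% voltage bump assumed for a 30% overclock below is purely illustrative, not a measured figure.

```cpp
#include <cstdio>

// First-order dynamic power: P ~ C * V^2 * f (capacitance, voltage, frequency).
// Compare getting +30% throughput by overclocking (which needs a voltage bump)
// versus by adding 30% more units at the same clock and voltage.
// The 10% voltage increase for the higher clock is an illustrative assumption.
int main() {
    double base        = 1.0 * 1.0 * 1.0;          // C * V^2 * f, normalized
    double overclocked = 1.0 * (1.1 * 1.1) * 1.3;  // +10% V, +30% f
    double moreUnits   = 1.3 * 1.0 * 1.0;          // +30% C (more hardware)
    std::printf("baseline %.2fx, overclock %.2fx, wider %.2fx power\n",
                base, overclocked, moreUnits);
}
```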
Also note that making things programmable enables clever software optimizations. Today's applications take a lot of detours just to fit the graphics pipeline. It's utterly ridiculous that on modern GPUs you often get no speedup from using LOD techniques. I'd rather use that headroom for something more useful, and I'm not alone.
Even simply using dynamic code generation with specialization can avoid a lot of useless work. "The fastest instruction is the one that never gets executed."
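Dynamic code generation is hard to show in a few lines, but the effect being described (folding pipeline state into the generated code so that disabled paths simply disappear) can be sketched with compile-time specialization as an analogy; this is not SwiftShader's actual code generator, and the alpha-test state below is just a made-up example.

```cpp
#include <cstdio>

// The effect of specialization: when pipeline state is baked into the
// generated code, disabled features cost nothing at all. A template
// parameter stands in here for JIT-time constant folding; this is an
// analogy, not how any particular renderer's code generator works.
struct State { bool alphaTest; float alphaRef; };

template <bool kAlphaTest>
int shadeSpan(const float* alpha, int count, float alphaRef) {
    int written = 0;
    for (int i = 0; i < count; ++i) {
        if (kAlphaTest && alpha[i] < alphaRef)  // dead code when kAlphaTest == false
            continue;
        ++written;  // stand-in for writing the shaded pixel
    }
    return written;
}

int shadeSpanFor(const State& s, const float* alpha, int count) {
    // One branch on the state selects a fully specialized routine.
    return s.alphaTest ? shadeSpan<true>(alpha, count, s.alphaRef)
                       : shadeSpan<false>(alpha, count, 0.0f);
}

int main() {
    float alpha[4] = {0.1f, 0.9f, 0.5f, 0.8f};
    State off = {false, 0.0f}, on = {true, 0.5f};
    std::printf("%d %d\n", shadeSpanFor(off, alpha, 4), shadeSpanFor(on, alpha, 4));
}
```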
That depends on how you design your pipeline. For example, the SGX unified shader handles narrower data types at a higher rate within the same data path; this means there is no power hit for processing pixel data down the same path as vertex data. The streaming of vertex data into or out of the shader pipeline is not unified and is handled by a separate (tiny) piece of hardware, as I believe it is in all other GPU designs. The same is true for texture units, i.e. they are not part of the unified design, for good reason.

> Despite both being programmable, vertex and pixel processing were quite different (at the time of the unification). Vertices need high precision, they need streaming memory access, and they use lots of matrix transforms. Pixels can generally use lower precision, need lots of texture accesses, use exp/log for gamma correction, use interpolator data, etc.
Also from a power consumption point of view, unifying them certainly wasn't logical. The GeForce 7900 and GeForce 8800 had competitive performance for contemporary games, but the latter had a significantly higher power consumption.
Despite these things, they did unify vertex and pixel processing, and they never looked back. So the real reason for the unification was to enable new capabilities and new workloads. And nowadays developers are still pushing the hardware to its limits by using it for things it was hardly designed for. So while it might be hard to imagine what exactly they'll do with it, more programmability is always welcome. It's an endless cycle: hardware designers say that dedicated hardware is more power-efficient, and software developers say they care more about the flexibility that allows more creativity.
For performance scaling to continue, power consumption must keep scaling too. Even desktops are now approaching a practical power cap (at least in terms of the true mass market), which will restrict that scaling. In the mobile space that power cap is around three orders of magnitude lower (1-2 W TDP vs ~1 kW TDP); there is simply never going to be the power budget to make everything programmable in mobile.

> Performance scales at a dazzling rate anyhow. So you might as well make sure that it can be used for something more interesting than what was done with the last generation of hardware. Obviously fixed-function hardware can't be replaced with programmable hardware overnight, but surely by the end of this decade hundreds of TFLOPS will be cheap and power-efficient, and there will be no point making it more expensive with fixed-function hardware, or making it even cheaper by crippling certain uses.
I disagree. Rasterization is evolving, and even the very idea of polygon rasterization is crumbling. There's a lot of research on micropolygons, ray tracing, volumetric rendering, etc., but IHVs had better think twice before they dedicate considerable die space to it. Some games may love the ability to have fine-grained displacement mapping and advanced blur effects, others prefer accurate interreflections and advanced refraction effects, and others don't care about these things at all (and I'm not necessarily talking about GPGPU / HPC applications).
You see, every application has some task that would be executed more efficiently with dedicated hardware, but you simply can't cater for all of them. It's overall more interesting to have a fully programmable chip than something which favors certain usage.
When John Carmack first proposed floating-point pixel processing, certain people (including some from this very forum) practically called him insane. But these people (understandably) never envisioned that by now we'd have Shader Model 5.0 and OpenCL, and that it would still be evolving toward allowing longer and more complex code. So with all due respect, I think it would be really shortsighted to think that ten years from now dedicated hardware will be as important as it is today.
Applications such as Crysis 2 still make extensive use of fixed-function hardware, texturing units being the most obvious; I'm pretty sure rasterisation is heavily utilised as well...

> The existence of cards with very high power consumption doesn't mean those are the norm. It also doesn't mean the programmability is excessive. Nobody cares whether Unreal Tournament 2004 could run faster and more efficiently with fixed-function hardware. People care about Crysis 2 and every other contemporary application, which depend heavily on shading performance. The next generation of consoles will also spur the creation of more diverse games which demand generic computing cores.
The simple reality is that GPUs won't increase in theoretical performance as fast as they did before. They've only been able to exceed Moore's Law because they played catch-up in process technology, aggressively increased the die size, and made the GPU the hottest part of your system. The only way they're now able to significantly increase performance is by waiting for another process shrink to allow more transistors for a given power envelope. But do you want that to benefit particular applications, or do you want it to benefit all applications? So far the programmability has always increased, and I don't see this ending any time soon...
Nitpicking here, but to be clear, you mean that the majority of space inside shader units is taken up by data flow.

> What you and a lot of people don't understand is that the majority of space taken up by shader units is data flow. NVidia has also talked about using lots of distributed cache to reduce power consumption, because data flow is the big problem there, too.
That's exactly the sort of time frame I'm talking about.

> In 10 years, sure, why not?