#1
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Hey all,
Just wanted to throw out the links for three new DX11 demos from the Advances in Real-Time Rendering in 3D Graphics and Games and Beyond Programmable Shading courses at SIGGRAPH (slides will be posted soon). For now, you can run the demos (if you have a DX11 card) and check out the source at:
http://visual-computing.intel-resear...red_rendering/
http://visual-computing.intel-resear...ications/sdsm/
http://visual-computing.intel-resear...ications/avsm/
Enjoy!
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 31-Jul-2010 at 05:08.

#2
Merrily dodgy
Join Date: Aug 2003
Location: The continent
Posts: 1,077
Had to get the latest DX11 redist to get d3dcompiler_43.dll, and had to install the 32-bit version of the VC2010 redist (running Win7 x64).
Those shadows on sdsm are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too (without CP overrides that is). Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses. And sadly avsm.exe crashes for me:
Problem signature:
Problem Event Name: APPCRASH
Application Name: AVSM.exe
Application Version: 0.0.0.0
Application Timestamp: 4c4bb4a6
Fault Module Name: nvwgf2um.dll
Fault Module Version: 8.17.12.5721
Fault Module Timestamp: 4c0d6d9b
Exception Code: c0000005
Exception Offset: 0030d49a
OS Version: 6.1.7600.2.0.0.768.3
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789
Cheers for posting
__________________
Hemlock - When crushed, the leaves and root emit a rank, unpleasant odour

#3
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Yup it mentions those in the readme
Quote:
Quote:
Also noted in the readme: it crashes the shader compiler on current NVIDIA cards (works on ATI). They know about it though and I imagine will have it fixed soon.
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 31-Jul-2010 at 05:36.

#4
Merrily dodgy
Join Date: Aug 2003
Location: The continent
Posts: 1,077
Ooh readme.txt, right...radical concept

#5
Senior Member
Join Date: Mar 2005
Posts: 1,157
Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got in around the time to see the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but because he painted such a depressing picture of what he had to work with.
__________________
Life is veritably the exact opposite of a vacuum cleaner. Vacuums tend to suck less and less as time goes on.

#6
Crazy coder
Very interesting stuff. Haven't had the time to look deeper into it yet, but the demos look nice.

#7
Senior Member
Quote:

#8
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
The AVSM and SDSM presentations from the "Advances in Real-Time Rendering" course will be available at the end of this week.
__________________
[my blog] Isn't it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too? [Douglas Adams] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way

#9
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
The remaining presentations (SDSM and AVSM) are now available at the above links as well as a video for SDSM. Enjoy!

#10
Regular
In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?
This is only a shared memory atomic operation, isn't it? Not a global atomic? (Notes on that slide imply it is shared.)
One of the peculiar things about the atomics in OpenCL (1.0 and 1.1) is that they always return a value, whereas the D3D11 atomics make a return value optional. (I presume the SDSM algorithm ignores the return value.)
http://msdn.microsoft.com/en-us/libr...(v=VS.85).aspx
Why not offer both in OpenCL? Do NVidia atomics always return a value, in hardware? Is that the motivation for the OpenCL spec making the return not optional? Is the performance problem really due to atomics on NVidia?
__________________
Sweet-spot + tick-tock = monster

#11
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Quote:
Quote:
Good question, unless they assume that the compilers will detect when the return value is unused.
I can only assume so, as that shader isn't doing much else. Check out the source code if you want the details, but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
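For anyone following along without the source open, here is a minimal sketch of the "pile of local atomics followed by a few global atomics" pattern described above. It is plain Python standing in for a compute shader, not the demo's actual code: `NUM_BINS`, `depth_to_bin`, the uniform binning of a normalised depth, and the choice to skip empty bins when merging are all assumptions made for illustration.

```python
NUM_BINS = 8  # hypothetical bin count, kept small for illustration


def depth_to_bin(z, num_bins=NUM_BINS):
    """Map a normalised depth in [0, 1] to a bin index (assumed uniform binning)."""
    return min(int(z * num_bins), num_bins - 1)


def tiled_histogram(depths, tile_size, num_bins=NUM_BINS):
    """Build one local histogram per tile, then merge it into the global one.

    Returns the global histogram and the number of global updates issued.
    """
    global_hist = [0] * num_bins
    global_updates = 0
    for start in range(0, len(depths), tile_size):
        local_hist = [0] * num_bins
        for z in depths[start:start + tile_size]:
            # Stands in for a local (shared memory) atomic add.
            local_hist[depth_to_bin(z, num_bins)] += 1
        for b, count in enumerate(local_hist):
            if count:
                # Stands in for a global atomic add; empty bins are skipped.
                global_hist[b] += count
                global_updates += 1
    return global_hist, global_updates


depths = [0.05, 0.06, 0.07, 0.9, 0.05, 0.95, 0.5, 0.51]
hist, updates = tiled_histogram(depths, tile_size=4)
print(hist, updates)
```

In this toy run every sample costs one local update, while the number of global updates depends only on how many distinct bins each tile touches.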

#12
Regular
Quote:
Quote:
What I don't get is why there's such a severe slowdown when shared memory banking should mean that non-colliding addresses are often full speed (purely a question of bank conflicts - though Fermi architecture is more susceptible to bank conflicts than GT200).
Quote:
Quote:
http://forum.beyond3d.com/showthread...37#post1305337 when we had quite a bit of fun with the treacherous "Dickens word count" histogram, you were alluding to non-random distributions of data causing severe problems for GPU algorithms that depend on some form of software-managed cache. I suppose this is that application: depth is generally not random at all.
I suppose your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate, and disastrous slowdown on NVidia. So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...
Quote:
Jawed

#13
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Yup although it was really the combination of that and finding the robustness issues with the more complex schemes (that require the full histogram) that made the reduce path the clear winner.
Quote:
Quote:
1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane swizzling mechanism because they all try to pretend it's not SIMD.
2) Bin distribution for a spatial region. Each core is handling a tile of the screen and in that tile (or even the whole distribution sometimes) only a small number of bins are touched. Statically allocating local memory for the whole local histogram here and incidentally reducing the number of wavefronts in flight due to this is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data-sharing between cores at all.
Note that this was disastrously worse when we had to use vertex scatter + ROPs back in the pre-DX11 days... collisions in that pipeline were ridiculously expensive since they amounted to global atomics even with some clever use of multiple accumulation bins + reductions. Local atomics are the only thing that make this sort of algorithm feasible at all on GPUs but we still have a ways to go in the hardware.
Quote:
Quote:
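To make point 1 of this post concrete, here is a rough sketch of what "compare addresses across the lanes and combine collisions before touching memory" could look like. It is a Python simulation only: no real lane-swizzle intrinsic is used or implied, and `SIMD_WIDTH`, `combine_lanes` and `histogram_with_combining` are hypothetical names chosen for illustration.

```python
from collections import Counter

SIMD_WIDTH = 16  # hypothetical SIMD width for the simulation


def combine_lanes(bins_per_lane):
    """Collapse lanes that target the same bin into one (bin, count) update."""
    return sorted(Counter(bins_per_lane).items())


def histogram_with_combining(bin_indices, num_bins, simd_width=SIMD_WIDTH):
    """Accumulate a histogram one SIMD group at a time.

    Returns the histogram, the update count without combining (one per lane),
    and the update count after combining colliding lanes.
    """
    hist = [0] * num_bins
    naive_updates = 0
    combined_updates = 0
    for start in range(0, len(bin_indices), simd_width):
        group = bin_indices[start:start + simd_width]
        naive_updates += len(group)
        for b, count in combine_lanes(group):
            hist[b] += count  # one update per distinct bin in the group
            combined_updates += 1
    return hist, naive_updates, combined_updates


# Highly coherent input, as depth tends to be within a screen tile.
bins = [3] * 12 + [5] * 4 + [3] * 16
print(histogram_with_combining(bins, num_bins=8))
```

With coherent input like this, most lanes in a group hit the same bin, so the combined path issues far fewer updates than one per lane. Whether that is a win on real hardware depends on the cost of the cross-lane comparison, which this sketch does not model.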
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 11-Aug-2010 at 19:39.

#14
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 4,210
If your programming model doesn't explicitly expose horizontal ops and you have to do interlane/interthread communication vertically, your hardware and/or compiler better be very good at local atomics.

#15
Regular
Quote:
I guess someone could analyse this stuff in detail. Different patterns exploiting either address conflicts or bank conflicts, in varying counts per hardware thread, would enable a decent characterisation.
Quote:
Quote:
I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days. Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...
Quote:
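As a starting point for the kind of characterisation suggested in this post, the sketch below classifies one hardware thread's worth of shared-memory word addresses into address conflicts (several lanes hitting the same word) and bank conflicts (distinct words that land in the same bank). The bank count of 32 and the `address % NUM_BANKS` mapping are assumptions for illustration, and `conflict_stats` is a hypothetical helper; it measures nothing on real hardware.

```python
from collections import Counter, defaultdict

NUM_BANKS = 32  # assumption: 32 banks, one word wide each


def conflict_stats(word_addresses, num_banks=NUM_BANKS):
    """Return (max lanes on one address, max distinct addresses in one bank)."""
    per_address = Counter(word_addresses)
    banks = defaultdict(set)
    for addr in per_address:
        banks[addr % num_banks].add(addr)
    max_same_address = max(per_address.values(), default=0)
    max_addresses_per_bank = max((len(s) for s in banks.values()), default=0)
    return max_same_address, max_addresses_per_bank


patterns = {
    "linear": list(range(32)),                  # one word per bank
    "all_same": [7] * 32,                       # pure address conflict
    "stride_32": [i * 32 for i in range(32)],   # pure bank conflict
}
for name, addrs in patterns.items():
    print(name, conflict_stats(addrs))
```

Feeding generated patterns like these into a timed atomic-add kernel, one pattern per run, would separate the cost of the two conflict types.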

#16
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Quote:
Quote:
On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today.
Quote:
Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.

#17
Regular
Quote:
Quote:
I think there are two conflicting things going on here:
The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.
Quote:
Quote:
Quote:
Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture? You must be working on something else... Guess you can't say.

#18
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Quote:
Similar programming model problems arise when trying to implement persistent threads in current APIs.
Quote:
Quote:
I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation, so maybe with all those smart heads there and a touch of motivation/competition we'll get something.
Quote:
Quote:

#19
Regular
Quote:
As for SIMD width, 16 looks like a safe minimum for a few years yet.
Quote:
Quote:
Cruft in the software is an on-going problem. The size of shared memory is just one variable out of many that results in "over-optimisation" for today's hardware. Cache-line size, L1$ size, SIMD width, register file size, DDR burst length and DDR banking are some others.
NVidia, with CUDA, has attempted to obfuscate some parameters to prevent "over-optimisation". Guaranteeing warp size of 32 for the foreseeable future is good, though shared memory in Fermi (or at least GF100) has new bank conflict issues. Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".
Quote:
Quote:
I still haven't looked at your code, but these are three generic ideas:
Quote:

#20
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Quote:
Quote:
Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).
Quote:
Quote:
Of course the large majority of these bins don't get touched for a given tile so if you only had a cache...
Quote:
I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that they have free rein to reorganize the work items in the group. The lookup from the Z-buffer is *definitely* not the bottleneck though so I'm hesitant to try and optimize this much more.
Quote:
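The remark in this post that most bins go untouched for a given tile can be illustrated with a toy comparison between a statically sized local histogram and one that only stores the bins actually hit. This is a Python illustration, not a model of any GPU cache: a `dict` stands in for "storage allocated on demand", and `sparse_tile_histogram`, `touched_fraction` and the 1024-bin figure are hypothetical.

```python
def sparse_tile_histogram(tile_bins):
    """Count only the bins a tile actually touches (dict as on-demand storage)."""
    hist = {}
    for b in tile_bins:
        hist[b] = hist.get(b, 0) + 1
    return hist


def touched_fraction(tile_bins, num_bins):
    """Fraction of a statically allocated histogram that this tile would use."""
    return len(sparse_tile_histogram(tile_bins)) / num_bins


# A coherent tile: almost every sample falls into a couple of neighbouring bins.
tile = [100, 100, 101, 100, 900]
print(sparse_tile_histogram(tile))
print(touched_fraction(tile, num_bins=1024))
```

In this example the tile uses 3 of 1024 bins, so a static allocation would reserve storage that is almost entirely idle for that tile.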

#21
Regular
Quote:
Quote:
Quote:
Quote:
Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores. So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.
Quote:
Quote:
So around 200 cycles (or 50 for 4xMSAA?) per sample? The inner loop is 31 cycles according to GPU Shader Analyzer.
Quote:
Quote:
Quote:
Quote:
As the camera translates the histogram is, in general, shifting coherently, isn't it?

#22
Senior Member
Join Date: Jun 2003
Posts: 2,073
Quote:
Quote:
__________________
Aaron Spink speaking for myself inc.

#23
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,090
Quote:
Quote:
Quote:
I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.
Quote:
Quote:
Quote:
I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this... DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.
For complete correctness, yes. It's like occlusion culling... even a minor shift from one frame to the next could reveal a new object (or occlude an old one). You can start to play games but you immediately lose the guarantee that every screen-space sample will have a shadow map sample. How important this is depends on stuff like the frequency of geometry in the frame and the speed of movement of the camera and objects.

#24
Regular
SCCC doesn't have coherency, that's my point.

#25
Regular
Quote:
Quote:
Quote:
Quote:
Quote:
In the end ALU:TEX is so high that Z fetches shouldn't be in the picture.
What is the worst case with 100% collision rate?