NVIDIA GF100 & Friends speculation

Fermi's support for register-indirect branching is something they've beaten the drum about. Is that actually needed for DX11, or in competing chip lines?
I think this is required in CS5.0 and AMD's IL has virtual function support. I still don't understand the gritty detail of the requirements, and the capabilities of the competing chips.
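
As a minimal CUDA sketch of what that indirect branching buys on the programming side - device-side virtual dispatch, which needs compute capability 2.0, i.e. Fermi. The Shape/Square classes and the kernel are purely illustrative, not taken from any real code:

Code:
#include <cuda_runtime.h>
#include <cstdio>

struct Shape {
    __device__ virtual float area() const = 0;
};

struct Square : public Shape {
    float side;
    __device__ Square(float s) : side(s) {}
    __device__ virtual float area() const { return side * side; }
};

__global__ void area_kernel(float* out)
{
    // The object must be constructed on the device for its vtable to be valid;
    // the call through the base pointer compiles to an indirect branch.
    Square sq(2.0f);
    Shape* s = &sq;
    out[threadIdx.x] = s->area();
}

int main()
{
    float* d_out;
    float  h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    area_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("area = %f\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}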

The same goes for Fermi's exception handling capability, which overlaps with the indirection in control flow.
There isn't any automatic exception handling. There aren't any flags, either. See section G.2.

Jawed
 
But I'm intrigued to see what it is you're thinking of specifically. I haven't spent time on CUDA 3.0 to see what clues lie therein. A quick rummage in G.1:

I would add to that list pinned memory and concurrent kernels (AFAIK CS doesn't support the latter). One obvious thing is the limitations CS puts on the buffers you can work with.
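
For a rough illustration of those two (standard CUDA runtime calls; the buffer size and kernel are made up), pinned host memory plus kernels issued to separate streams look something like this - Fermi can overlap the two kernels where earlier parts would serialise them:

Code:
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n    = 1 << 20;
    const int half = n / 2;
    float *h_buf, *d_buf;

    // Pinned (page-locked) host allocation: enables truly asynchronous copies.
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream copies and processes its own half of the buffer.
    cudaMemcpyAsync(d_buf,        h_buf,        half * sizeof(float),
                    cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_buf + half, h_buf + half, half * sizeof(float),
                    cudaMemcpyHostToDevice, s1);
    scale<<<(half + 255) / 256, 256, 0, s0>>>(d_buf,        2.0f, half);
    scale<<<(half + 255) / 256, 256, 0, s1>>>(d_buf + half, 0.5f, half);

    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}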

What about OpenCL 1.1 which is due this summer-ish?

I have no idea what's planned for OpenCL but it will presumably target the lowest common denominator between Fermi and Cypress.

I'm not saying you're wrong. Just curious to see what you're thinking of specifically.

Well, on the hardware side: the geometry bit mentioned earlier, as well as tweaks to the texture units to retrieve multiple point samples per clock - according to the info we have so far, the corresponding API instruction was added at Nvidia's behest.

On a broader level though, the hardware can reasonably accommodate a less rigid rendering pipeline than what's currently offered by DirectX. The cache hierarchy lifts earlier restrictions on intermediate buffer sizes and, in combination with robust support for atomics, it facilitates faster communication between the various pipeline stages as well - see the use of L1/L2 cache for routing data between the SM and polymorph engine, or practical raytracing.
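
As a hedged sketch of the kind of inter-stage hand-off global atomics make cheap (the kernel, predicate and names are illustrative only, not anything from NV's docs): one stage compacts surviving items into a buffer through a single atomic counter, and a later kernel/stage consumes that buffer:

Code:
#include <cuda_runtime.h>
#include <cstdio>

// One "stage" compacts surviving items into 'out' through a single atomic
// counter; a later kernel/stage would consume 'out' and '*out_count'.
__global__ void compact(const float* in, float* out, int* out_count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f) {                        // keep items passing some test
        int slot = atomicAdd(out_count, 1);    // grab a unique output slot
        out[slot] = in[i];
    }
}

int main()
{
    const int n = 1024;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = (i % 2) ? 1.0f : -1.0f;

    float *d_in, *d_out;  int* d_count;  int h_count = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    compact<<<(n + 255) / 256, 256>>>(d_in, d_out, d_count, n);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d of %d items survived\n", h_count, n);   // expect 512

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return 0;
}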

edit: To that last point, isn't it time for DirectX to start defining data structures and traversal algorithms to support raytracing? It might be nominal at first but it has to start somewhere and the hardware seems up to the task.
 
There isn't any automatic exception handling. There aren't any flags, either. See section G.2.

That would be a significant deviation from the information given out during Nvidia's Tesla introduction.
We have articles concerning this from Anandtech and Realworldtech.

There's a pdf about Tesla on Nvidia's own page that points out exception handling.
 
You're taking a different slant. You're saying the API doesn't explicitly prevent something. I'm saying that the hardware enables something not explicitly enabled by the API. See the difference? With your perspective you can always say the API is as advanced as the hardware since it defines the output.
No, you're taking a different slant, by assuming that something is beyond an existing API just because it's new and cool. Sometimes an API deliberately only defines the output so that vendors can do whatever they like hardware-wise to get the speed/features they want. In this, Direct3D (and OpenGL before it) have always defined geometry processing explicitly in terms of the output so that vendors can innovate in the hardware. AFAIK there is nothing in the geometry engine parallelisation that isn't supported via D3D/OGL (I suspect you can't even turn off the in-order requirement at a hardware level, but I don't know for sure).
My point has not been that it's not a cool new feature, but that you claimed it's beyond D3D/OGL and so would require a change to the API; in this case it's precisely the fact that it just works out of the box that makes it so good :D

Something like new CUDA features that aren't supported under CS 5.0 is beyond the API, because the CS 5.0 spec explicitly doesn't allow the freedom to add new instructions (it does, however, allow the freedom to implement the existing instructions any way they like, e.g. Sin/Cos have gone from fast to slow over the generations as the quality requirements and general computing requirements have gone up).
 
I would add to that list pinned memory and concurrent kernels (AFAIK CS doesn't support the latter).
Nothing that I know of in the CS specs limits it to a single kernel invocation; each launch is placed into the command buffer and then execution runs at the hardware/scheduler's behest.
I believe LRB has always been multi-threaded with its own complex scheduler.
 
Every design, even the tightest, most transistor budget-conscious, can sometimes deliver a few extra compute or rendering features (beyond what's required by spec), whether by accident or design. However, I think at this point no one expects that Cypress has a giant, completely outside of spec feature along the lines of tessellation (prior to DX11) hidden away, but they may have a few more esoteric things that are of interest to GPGPU developers.

If they do have this giant proprietary feature built in to Cypress, I admire their restraint in keeping it secret for all this time post launch. In fact, I would have to wonder, why not tell developers about it and get a head start on Fermi, being able to point to shipping products using this unique / proprietary feature?

Back to speculation, has this been posted?

Don't see anything new in there, just an amalgamation of items from existing speculation articles / posts.
 
__syncthreads_count()

Btw, isn't that the same operation we were discussing a while back in the prefix sum patent?
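
For reference, my understanding of __syncthreads_count() (compute 2.0+): it's a barrier that also hands every thread in the block the count of threads whose predicate was non-zero - essentially a block-wide reduction of a flag, which is why it looks like the counting step in that prefix-sum discussion. A small illustrative kernel and host driver (names are made up):

Code:
#include <cuda_runtime.h>
#include <cstdio>

// __syncthreads_count(pred): barrier that also returns, to every thread in the
// block, the number of threads whose pred was non-zero. Needs -arch=sm_20+.
__global__ void count_above(const float* in, int* block_counts, float threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = in[i] > threshold;              // per-thread flag

    int count = __syncthreads_count(pred);     // barrier + block-wide popcount

    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = count;      // one partial count per block
}

int main()
{
    const int n = 256;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float* d_in;  int* d_count;  int h_count;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    count_above<<<1, n>>>(d_in, d_count, 100.0f);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d values above threshold\n", h_count);   // expect 155

    cudaFree(d_in); cudaFree(d_count);
    return 0;
}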

AFAIK there is nothing in the geometry engine parallelisation that isn't supported via D3D/OGL (I suspect you can't even turn off the in-order requirement at a hardware level, but I don't know for sure).

Again, you're equating the absence of obstacles with explicit support. Just because the hardware guys were able to work around API shortcomings doesn't mean the API is all hunky-dory.

My point has not been that it's not a cool new feature, but that you claimed it's beyond D3D/OGL and so would require a change to the API; in this case it's precisely the fact that it just works out of the box that makes it so good :D

D3D has no support for raytracing acceleration structures for example. Are you saying that won't take a change to the API to introduce?

Nothing that I know of in the CS specs limits it to a single kernel invocation; each launch is placed into the command buffer and then execution runs at the hardware/scheduler's behest.
I believe LRB has always been multi-threaded with its own complex scheduler.

Once again, the absence of an explicit limitation doesn't imply explicit support. That would mean everything under the sun is supported unless they explicitly state that it's not the case :)
 
If they do have this giant proprietary feature built in to Cypress, I admire their restraint in keeping it secret for all this time post launch. In fact, I would have to wonder, why not tell developers about it and get a head start on Fermi, being able to point to shipping products using this unique / proprietary feature?

It's not quite a Fermi-related question, but I suspect it's because AMD, the CPU company, is not particularly interested in a big monolithic HPC GPU, and Cypress and its ilk are not meant to persist in that sector.

A lot of people would agree with this, given how much the rumor mill loves it some "Nvidia x86 skunkworks" grist.
 
I would add to that list pinned memory
A feature of existing APIs and host-side, not GPU side.

and concurrent kernels (AFAIK CS doesn't support the latter).
Does Fermi do anything more than overlapping the execution of successive kernels?

One obvious thing is the limitations CS puts on the buffers you can work with.
How does that relate to OpenCL 1.0/1.1?

I have no idea what's planned for OpenCL but it will presumably target the lowest common denominator between Fermi and Cypress.
Along with some extensions :p

Well on the hardware side, the geometry bit mentioned earlier as well as tweaks to texture units to retrieve multiple point samples per clock - according to the info we have so far the corresponding API instruction was added at Nvidia's behest.
You're confusing a performance optimisation in hardware for the capability to support an instruction in the language. The former is not a "feature" that goes beyond D3D11.

Jawed
 
edit: To that last point, isn't it time for DirectX to start defining data structures and traversal algorithms to support raytracing? It might be nominal at first but it has to start somewhere and the hardware seems up to the task.
Just exposing the compute shader has to be enough here. Microsoft can't afford to standardise how developers do RT inside of D3D and limit flexibility, since there's no upper bound to real-time RT R&D yet, and no really good fit to common hardware either.
 
And here is where everyone points out that GDDR5 != DDR3. The economies of scale are significantly different. 1GB of high-speed GDDR5 is in the realm of 80-100 by itself. In other words, GDDR5 commands a significant price premium above and beyond the HIGHEST VOLUME PRODUCTION DRAM IN THE WORLD? Who would have thought?


Yeah, the price I quoted was the on-the-street price for DDR3 ;) please re-read my post.
 
Every design, even the tightest, most transistor budget-conscious, can sometimes deliver a few extra compute or rendering features (beyond what's required by spec), whether by accident or design. However, I think at this point no one expects that Cypress has a giant, completely outside of spec feature along the lines of tessellation (prior to DX11) hidden away, but they may have a few more esoteric things that are of interest to GPGPU developers.
GDS is a bit special. "LDS for all work items". But a cache system that did the same would be more generally useful I reckon.

If they do have this giant proprietary feature built in to Cypress, I admire their restraint in keeping it secret for all this time post launch. In fact, I would have to wonder, why not tell developers about it and get a head start on Fermi, being able to point to shipping products using this unique / proprietary feature?
AMD is still getting the basic stuff working.

Jawed
 