NVIDIA Maxwell Speculation Thread

IF (a big one) the recent rumor about Maxwell is true, the specs of GM100 are a bit strange: how can downgrading the FP32:FP64 throughput ratio from Fermi's 2:1, to Kepler's 3:1, to Maxwell's 4:1 be considered an optimization of FP64 performance?

The only logic I can see behind this is that Nvidia is going the Kepler route again: adding more floating-point units at the expense of the device's integer performance.

So we can expect an FP32 monster that is pretty weak at integer/bit operations, even weaker than Kepler in terms of peak Giops vs. peak Gflops, etc.

I hope not, since many algorithms depend on integer ops, and Kepler's weak integer throughput has already become the performance bottleneck on GK110 for some of my applications, to the degree that I have to offload some of that work to the CPU.
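To make the concern concrete, here is a toy CUDA kernel of the kind that is bound by integer/bit throughput rather than FP32 throughput; the kernel and its mixing constants are purely illustrative and not taken from any of the applications mentioned above.

Code:
#include <cstdio>
#include <cstdint>

// Toy kernel dominated by 32-bit integer multiplies, shifts and XORs.
// On an architecture whose integer rate is a small fraction of its FP32
// rate, this kind of loop runs far below the advertised peak Gflops.
__global__ void int_mix(uint32_t *data, int n, int rounds)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint32_t x = data[i];
    for (int r = 0; r < rounds; ++r) {
        // Murmur-style integer mixing: all ALU work, no floating point.
        x ^= x >> 16;
        x *= 0x85ebca6bu;
        x ^= x >> 13;
        x *= 0xc2b2ae35u;
        x ^= x >> 16;
    }
    data[i] = x;
}

int main()
{
    const int n = 1 << 20;
    uint32_t *d;
    cudaMalloc(&d, n * sizeof(uint32_t));
    cudaMemset(d, 0x5a, n * sizeof(uint32_t));

    int_mix<<<(n + 255) / 256, 256>>>(d, n, 1000);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}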
 
What is the difference to the "Unified Virtual Addressing" they have had so far? (Or was that a Win7/8 64-bit-only feature?)
 

No, UVA has worked on Linux for quite a while now. UVM lets the GPU page non-page-locked CPU memory. The first implementation, which NVIDIA demonstrated at GTC in the spring and called UVM-lite, works on Kepler as well, but requires that the memory be allocated through a special malloc that uses a kernel extension to handle page faults coming from the GPU. If you allocate memory with this allocator, you don't have to explicitly move data to and from the GPU; it gets paged back and forth as required by your program. The full UVM that Maxwell brings should remove the need for a special allocator, letting the GPU access any memory in the system.
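To illustrate, here is a rough sketch of what the "special allocator" flow looks like compared with explicit staging; cudaMallocManaged is used here as a stand-in name for that allocator, since the actual API had not been published at the time of this thread.

Code:
#include <cstdio>

__global__ void scale(float *v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main()
{
    const int n = 1 << 20;

    // Today: explicit staging through device memory.
    // float *h = (float *)malloc(n * sizeof(float));
    // float *d; cudaMalloc(&d, n * sizeof(float));
    // cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    // scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    // cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // UVM-lite style: one pointer, no explicit copies; pages migrate
    // on demand when the GPU (or CPU) touches them.
    float *v;
    cudaMallocManaged(&v, n * sizeof(float));    // stand-in for the special allocator
    for (int i = 0; i < n; ++i) v[i] = 1.0f;     // touched on the CPU

    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f); // touched on the GPU
    cudaDeviceSynchronize();

    printf("%f\n", v[0]);                        // paged back to the CPU
    cudaFree(v);
    return 0;
}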
 
Thanks for the detailed explanation.
So, with Maxwell, the copies to the GPU won't be driven by software-emulated page faults any more, but by native HW access?

@MfA
I think if the GPU and CPU share the virtual address space, then it's the process space of your application, just like it's done now in the emulated way RecessionCone has described.
 
There will still be page faults, it's just that the GPU and CPU will share the same page tables with Maxwell. When you write an application using UVM, you won't need to copy data to the GPU explicitly. You just allocate data on the CPU as normal and run the program on the GPU using pointers from your CPU program. When the GPU accesses data that's sitting on the CPU, the GPU page faults and requests the page from the CPU. The CPU then pages the memory out as it normally would with virtual memory, except that instead of paging it to disk, it pages it to the GPU. When the CPU accesses memory that's sitting on the GPU, the CPU page faults and pages it back in from the GPU. For simple applications, you won't have to worry about where your memory is, which will make it easier to program GPUs.

As I said, I'd guess that this particular kernel extension is for UVM-lite, which is very similar, except it can only operate on memory allocated with NVIDIA's allocator: it can't access arbitrary memory in the process. But UVM-lite runs on Kepler and so it's a step towards the full UVM.
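For comparison with the UVM-lite sketch above, the full UVM described here would accept any process pointer; the sketch below is hypothetical, since no such driver had shipped at the time, but it shows the programming model the post describes.

Code:
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main()
{
    const int n = 1 << 20;

    // Full UVM as described above: plain malloc'd memory, no special
    // allocator. The GPU would page-fault on first touch and the pages
    // would migrate, driven by the shared page tables.
    float *v = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) v[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f);  // hypothetical: pages faulted over on demand
    cudaDeviceSynchronize();

    printf("%f\n", v[0]);                         // faulted back to the CPU on access
    free(v);
    return 0;
}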
 

Has UVM-lite already been released for Kepler GPUs in the Windows drivers? I never heard Nvidia make a big fuss about it. Or is this still to be released?
 

No, it hasn't been released yet. Nvidia demoed it at GTC 2013 running on Linux, but even the Linux version hasn't shipped, and I'd imagine Windows support will be more difficult for them to implement.
 
We got programmable. We got unified. We got general compute. We went from VLIW to "scalar". We have scalable tessellation.

What's the next frontier in graphics architecture? There are some rumblings about Maxwell adding chunkier caches, but that on its own isn't very exciting.
 

Au contraire, more caches, more coherency and more CPUs on die are very exciting.
 

Yeah, more is always good, but on the surface it doesn't seem that thrilling. More bandwidth and more flops are nice too, but that's standard fare.

Unified memory space. GPU can access system RAM and CPU can access video RAM.

This is definitely cool and will help developers write cleaner, faster code but it's still DMA over PCIe for discrete setups. It's probably a bigger win for HPC/Linux anyway. Does DirectCompute even have any kind of DMA support?
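For what it's worth, a discrete GPU can already reach system RAM today through mapped (zero-copy) host memory, with every access going over PCIe; a minimal CUDA sketch of that existing path, for contrast with the unified memory space discussed above:

Code:
#include <cstdio>

__global__ void inc(int *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;   // each access crosses PCIe: no migration, just bus reads/writes
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);          // allow mapping host memory into the GPU's address space

    int *h, *d;
    cudaHostAlloc(&h, n * sizeof(int), cudaHostAllocMapped);  // pinned, GPU-visible system RAM
    cudaHostGetDevicePointer(&d, h, 0);             // device-side alias of the same buffer

    for (int i = 0; i < n; ++i) h[i] = i;

    inc<<<(n + 255) / 256, 256>>>(d, n);            // GPU touches system RAM directly over the bus
    cudaDeviceSynchronize();

    printf("%d\n", h[0]);                           // CPU sees the update without any cudaMemcpy
    cudaFreeHost(h);
    return 0;
}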
 
More cache (or global RAM, or global store, or whatever you call it) would be great: if even the low-end GPUs have at least, say, 1024K, you can plan to use algorithms in your game or app that need that amount of storage to work.

Little observation: the GeForce 700 series (barring laptop stuff) all have at least 512K of L2. Even the 750 Ti has a 256-bit GK104, which allows it to keep the full cache size, and the small GK208 (seen in a few all-in-ones, perhaps sold on future retail GT 7xx cards) has as much L2 cache.
You can thus run compute/shader stuff that works fine with 512K, but would absolutely tank on a GPU with less L2 because of all the traffic hitting the slow RAM.
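A sketch of how an application could act on that observation at runtime, picking a working-set size based on the L2 size the driver reports (the 512K cutoff simply mirrors the observation above and is otherwise arbitrary):

Code:
#include <cstdio>

int main()
{
    int dev = 0, l2_bytes = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, dev);

    // Pick an algorithm variant whose working set fits in L2.
    const int big_tile   = 512 * 1024;   // assumes at least 512K of L2, as observed for the 700 series
    const int small_tile = 128 * 1024;   // fallback for GPUs with smaller caches
    int tile = (l2_bytes >= big_tile) ? big_tile : small_tile;

    printf("L2 = %d bytes, using a %d-byte working set per pass\n", l2_bytes, tile);
    return 0;
}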
 