The OS manages every process's virtual address space. Migrating pages from one physical location to another while maintaining coherency cannot bypass the OS in any way.
Largely for security (obvious arbitrary-memory-read issues), but at a lower level the hardware should be able to perform the operations without the OS being involved, and it need not be coherent either. What I'm saying is that if a CPU resided on the GPU, at a low level it would have full access to system memory without requiring a host CPU to be involved. On P100, as well as most GCN models, what you suggested is likely accurate. The exception would be the higher-bandwidth links (IBM's NVLink and AMD's APUs), which likely have more robust interfaces with the system memory controller. The SSGs, for example, obviously don't have to go through the OS to access their local storage; as I recall, AMD demonstrated this to be faster than even reading from system memory over PCIe.
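For what it's worth, on P100 this path is exposed through CUDA's unified memory API, and the driver (and thus the OS) still owns the page tables. A minimal sketch of driver-mediated migration (the API calls are real CUDA 8+ functions; the kernel is just an illustration):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* p, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;  // first GPU touch faults the page over if not prefetched
}

int main() {
    const size_t n = 1 << 20;
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));       // pages managed by the driver/OS
    for (size_t i = 0; i < n; ++i) p[i] = 0.0f;     // CPU touch: pages become CPU-resident
    cudaMemPrefetchAsync(p, n * sizeof(float), 0);  // explicit migration to GPU 0
    touch<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();
    printf("%f\n", p[0]);                           // CPU access can migrate pages back
    cudaFree(p);
    return 0;
}
```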
Most mobile games use double-rate FP16 extensively; both Unity and Unreal Engine are optimized for FP16 on mobile. So far no PC discrete card has gained noticeable performance from FP16 support, so these optimizations are not yet enabled on PC. But things are changing. PS4 Pro already supports double-rate FP16, and both of these engines support it. Vega is coming soon to consumer PCs with double-rate FP16, and Intel's Broadwell and Skylake iGPUs support it as well. Nvidia also supports it on P100 and on mobile, but has disabled it on consumer PC products. As soon as games show noticeable gains from FP16 on competitor hardware, I am sure NV will enable it in future consumer products as well.
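For anyone wondering where the gains come from: double-rate FP16 packs two half-precision values into one 32-bit register and issues a single instruction on the pair, doubling throughput per ALU. A minimal CUDA sketch using the standard half2 intrinsics (the kernel itself is just an illustration; it needs sm_53 or newer to compile the FP16 path):

```cuda
#include <cuda_fp16.h>

// One packed FMA operates on two FP16 values at once, which is where
// the "double rate" over FP32 comes from on supporting hardware.
__global__ void fma_fp16x2(const __half2* a, const __half2* b,
                           __half2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __hfma2(a[i], b[i], out[i]);  // two mul-adds per instruction
}
```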
I'd imagine Volta, possibly a Pascal refresh, will have the support enabled if only for marketing.
More immediately, would it be suitable for accelerating tessellation? The marketing material suggests over twice the geometry throughput. I could see packed FP16 offering enough accuracy for offsets within a small enough patch: switch to FP16 to generate all the triangles, then convert whatever survives culling back into world coordinates. Maybe drop to FP16 once you get past the 8x(?) tessellation level or determine the primitive to be sufficiently small.
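Roughly what I mean, sketched in CUDA form (the patch layout and every name here are my own invention, not anything AMD has shown): keep displacements as patch-local FP16 values, where the small range keeps half precision acceptable, and promote back to FP32 world space only for triangles that survive.

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch: tessellate in patch-local FP16, promote survivors.
// patchOrigin/patchScale define a small patch domain, so the packed FP16
// coordinates only ever span a short range where half precision holds up.
__device__ float3 toWorld(float3 patchOrigin, float patchScale,
                          __half2 uv, __half height) {
    float u = __half2float(__low2half(uv));   // unpack the two FP16 lanes
    float v = __half2float(__high2half(uv));
    float h = __half2float(height);
    return make_float3(patchOrigin.x + patchScale * u,
                       patchOrigin.y + patchScale * h,
                       patchOrigin.z + patchScale * v);
}
```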
I certainly hope that improvements like this get adopted by other IHVs, just as Nvidia adopted Intel's PixelSync and Microsoft made it a core DX12.1 feature (ROVs). Same for conservative rasterization.
Programming question for you. Primitive shaders look like a more specialized compute shader, along the lines of the GeometryFX library AMD presented and the culling mentioned in those papers; that seems to be their origin. So how practical would it be to emulate the capability on existing hardware? I'm guessing the issue is performance, because there is insufficient parallel work? Would this be something where that flexible scalar unit, for example, could reasonably enable the capability?
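For context, GeometryFX-style culling is essentially a compute pass that tests each triangle (backface, small-primitive, frustum) and stream-compacts the survivors into a new index buffer before the fixed-function front end ever sees them. A rough CUDA sketch of the idea (only the backface test is shown, and the buffer layout is my own assumption):

```cuda
// One thread per triangle; survivors are compacted with an atomic counter.
__global__ void cullTriangles(const float4* clipPos, const uint3* tris,
                              uint3* outTris, unsigned* outCount, int numTris) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;
    uint3 tri = tris[t];
    float4 a = clipPos[tri.x], b = clipPos[tri.y], c = clipPos[tri.z];
    // 3x3 determinant of the (x, y, w) rows in clip space;
    // <= 0 means backfacing or degenerate.
    float det = a.x * (b.y * c.w - c.y * b.w)
              - a.y * (b.x * c.w - c.x * b.w)
              + a.w * (b.x * c.y - c.x * b.y);
    if (det <= 0.0f) return;                  // culled
    unsigned slot = atomicAdd(outCount, 1u);  // stream compaction
    outTris[slot] = tri;
}
```

That gives one thread per triangle, so parallelism only dries up on small draws, which may be exactly the performance issue you're guessing at.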
It is really, really common to have a USB debugger connected to a microprocessor's serial debug bus on a development board, which can access internal state or perform whatever manipulation it was designed for. I sincerely doubt it is anything new at all...
Common, but I can't say I've come across many prototype boards with an extension and a USB3 port affixed. Normally boards just add headers for USB and/or a separate breakout board if needed; that works for development and quality control. The only times I've seen actual ports were on dev kits or training devices that were more widely distributed.
It could be power related, but I doubt there are so many lanes coming off the package to warrant the extension; a simple serial interface and headers would make more sense. The Infinity Fabric in theory doesn't extend to the PCB, so the extension wouldn't be required to wire that in. My guess would be a breakout board for all the spare PCIe or IO lanes on the board, something partners can use to test a variety of interfaces (Thunderbolt, USB, Fibre Channel, storage, etc.) with a riser or ribbon cable, in cases where putting all the headers on the board would be impractical.
Vega 20 lists a PCIe 4.0 host interface, meaning it might be just a 4-lane host link for future NVMe SSDs. The current Fiji SSG uses two NVMe Samsung 950 Pros in RAID0 (theoretical 6.4 GB/s read speeds) behind a PLX controller. A single PCIe 4.0 x4 link should deliver about the same bandwidth with just one SSD.
4 lanes is sufficient for most NVMe devices. I'd say there is a good chance Vega has more than just 20-24 lanes without using a bridge chip; I'm guessing 32 lanes: 16 for the host interface and another 16 for IO on the board. That should be sufficient for the fastest Fibre Channel currently available (128 Gbit/s, about 12.8 GB/s in each direction), and the next Thunderbolt interface is 10 GB/s for 8K UHD. Odds are the board won't run 8K UHD and a SAN at the same time, so that should be sufficient IO without a lot of bridge chips, and it also works for communicating with other chips. Vega 20 in theory would use the next iteration of the standards, with everything doubled.
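Quick sanity check on those numbers, assuming PCIe 4.0's 16 GT/s per lane with 128b/130b encoding (plain host code):

```cuda
#include <cstdio>

int main() {
    // PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 8 bits per byte.
    double perLaneGBs = 16.0 * (128.0 / 130.0) / 8.0;      // ~1.97 GB/s per lane
    printf("PCIe 4.0 x4:  %.1f GB/s\n", 4 * perLaneGBs);   // ~7.9, vs 6.4 for the RAID0 SSG
    printf("PCIe 4.0 x16: %.1f GB/s\n", 16 * perLaneGBs);  // ~31.5, covers 12.8 GB/s Fibre Channel
    return 0;
}
```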
I remember it from an interview with Raja Koduri.
I think the logic was that if you have enough memory bandwidth to spare, general performance won't be affected by more frequent page swaps over the PCIe bus.
This logic is BS. I still maintain that someone just pulled it out of their behind or at least misinterpreted what was said by AMD.
It was the basis for all those charts of how much RAM games actually used, along with their marketing of future boxes on bandwidth in addition to capacity.
EDIT: Around 4:20 in the video linked above.