How many more hours before the NDA reviews go up...?
Exciting times... Fury X (512GB/s) or 980 Ti G1... I need an upgrade.
I think the reviews are supposed to go up tomorrow at noon, GMT.
Probably not.

You're right, I forgot about Morton ordering... so that would be twice the number of tiles per cache line's worth of bytes. But I was told in another thread https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/ that, at least on GCN, the CB and DB caches don't go through the L2... did that change for Tonga / GCN with framebuffer compression?
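(For illustration only: a minimal Morton/Z-order sketch, assuming 32-bit pixels and a 64-byte cache line; part1by1 and morton2d are made-up helper names and the actual GCN tiling modes are more involved, so take this purely as a picture of why a cache line's worth of bytes covers a square footprint rather than a scanline strip.)

#include <cstdint>
#include <cstdio>

// Spread the low 16 bits of v out to the even bit positions (standard bit trick).
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton (Z-order) index: x bits in the even positions, y bits in the odd ones.
static uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}

int main() {
    // With 32-bit pixels, a 64-byte cache line holds 16 pixels; in Morton order
    // the 16 consecutive indices 0..15 cover a 4x4 square, not a 16x1 strip.
    for (uint32_t y = 0; y < 4; ++y)
        for (uint32_t x = 0; x < 4; ++x)
            printf("pixel (%u,%u) -> linear index %u\n", x, y, morton2d(x, y));
    return 0;
}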
Why would TMUs need access to this? If you mean render target to texture, doesn't the driver resolve the compressed RT first manually? At least that's what I assumed.

Because it's a bandwidth eater to decompress color RTs and depth/stencil buffers (which have had similar compression for a long time, not to mention color already had compression for MSAA), and sampling them isn't uncommon (of course it eats bandwidth there too if you sample from uncompressed data). And yes, up to GCN 1.1 this had to be done with a decompress blit, but Tonga should not have to, and nvidia hasn't had to since Fermi, unless I'm badly mistaken. As said earlier, I don't know exactly how it works such that both the TMUs and ROPs end up with access to the uncompressed data.
I meant memory bandwidth-wise, as writing a few bytes spread out here and there is quite wasteful.

Any write will affect the entire 8x8 block because it involves re-encoding the differences (unless the differences are additionally encoded with a global codebook, which is unlikely). The first paper assumes that by keeping the entire compressed block in a dedicated cache this still gives significant memory bandwidth savings, assuming typical compression ratios of 2.5-4:1. Of course it all depends on the details of the compression format, which AFAIK is not disclosed by AMD.
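(To spell out that write-amplification point: a toy scheme where an 8x8 tile stores one anchor pixel plus deltas at a single per-tile bit width. CompressedTile, encode, decode and write_pixel are all invented for this sketch, and the real, undisclosed format is certainly cleverer; the point is only that the widest delta dictates how every other delta is packed, so even a one-pixel write is a full decode and re-encode of the block.)

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Toy delta-compressed 8x8 tile: one anchor pixel plus 63 signed deltas stored
// at a single per-tile bit width (kept as loose int32s here instead of bit-packed).
struct CompressedTile {
    uint32_t anchor;                 // first pixel, stored verbatim
    int delta_bits;                  // 2, 3, 4, ... up to 32 == effectively uncompressed
    std::array<int32_t, 63> deltas;
};

static std::array<uint32_t, 64> decode(const CompressedTile& t) {
    std::array<uint32_t, 64> px{};
    px[0] = t.anchor;
    for (int i = 1; i < 64; ++i) px[i] = t.anchor + uint32_t(t.deltas[i - 1]);
    return px;
}

static CompressedTile encode(const std::array<uint32_t, 64>& px) {
    CompressedTile t{px[0], 2, {}};
    int32_t max_mag = 0;
    for (int i = 1; i < 64; ++i) {
        t.deltas[i - 1] = int32_t(px[i] - px[0]);
        max_mag = std::max(max_mag, std::abs(t.deltas[i - 1]));
    }
    // The per-tile delta width is set by the *largest* difference, so changing
    // one pixel can force every delta in the block to be re-packed.
    while (t.delta_bits < 32 && max_mag >= (1 << (t.delta_bits - 1))) ++t.delta_bits;
    return t;
}

// A single-pixel write is a full decode + re-encode of the 8x8 block.
static CompressedTile write_pixel(CompressedTile t, int index, uint32_t value) {
    std::array<uint32_t, 64> px = decode(t);
    px[index] = value;
    return encode(px);
}

int main() {
    std::array<uint32_t, 64> flat{};
    flat.fill(0x20202020u);               // flat tile: all deltas are zero
    CompressedTile t = encode(flat);
    printf("before write: %d-bit deltas\n", t.delta_bits);
    t = write_pixel(t, 37, 0xffffffffu);  // one outlier pixel...
    printf("after write:  %d-bit deltas\n", t.delta_bits);  // ...re-packs the whole tile
    return 0;
}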
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...?
Interesting read from post 75 to end ...
http://forums.hardwarezone.com.sg/hardware-clinic-2/[gpu-review]-sapphire-amd-r9-fury-x-rise-5087633-5.html#post94716834

In the Heaven bench it's barely faster than my 780Ti here.
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...?

Fiji is supposedly the same GCN level, so should work just the same.
Of course, if Tonga already has this tech (in perhaps the exact same implementation), then we could just use one of those cards instead.
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...?

Not with DirectX: it hasn't allowed direct framebuffer access since DirectDraw was retired with DirectX 8. You can only set a standard front/back buffer "texture" format, and the physical representation is completely abstracted by the driver.
Should be pretty trivial to get that data out if you're using some hacks with the open source linux driver... Not sure though if it's really all that interesting.

Hacking Linux drivers is far from trivial, but in the end I'd guess it would be a variation of a color-difference scheme.
If you bind the compressed resources with a graphics operation, you might be able to hack up a compute shader to grab any piece of memory mapped to the GPU, including metadata.
DmitryKo said:
Video adapter memory is indeed directly mapped to CPU virtual address space and can be accessed using on-chip DMA engines; however, it's in kernel space, the upper half of the 64-bit virtual address space, and is only accessible by the kernel-mode video driver which owns that memory range.

Most video memory is not mapped to the CPU at all, due to the existence of 32-bit OSes. Resizable BAR has been supported by many video cards for a while, but OS support has been spotty at best.
Could it be done in Linux with a modified open-source driver, with an added function call to DMA out raw video RAM data? Or is AMD's driver not open source, perhaps? *shrug* This is getting rather academic, heh...
Well, you bought it for a hell of a lot of routing.
If, say, you implement it as a pipeline fed by what you were drawing, and you can manage an outstanding queue, you might actually get away with 2 cycles of sustained throughput per tile, which would be very nice. But I have my doubts that such an implementation matches the latency requirements of the render feedback loop (I mean tiles circulating between the raster output and memory). I suspect you can't give away a single cycle of added latency, or otherwise: "do it again".
Sure, the approach is always the same for this stuff (depth buffer, MSAA, etc.): when you overflow you just drop to the uncompressed baseline.
If the delta values fit into 2 bits you go, say, 1:8; if they need 3 bits then 1:4; with 4 bits, 1:2; otherwise dump the tile at 1:1. It's never really cheap to figure out the right permutation for the encoding, but if it's within 4-8 cycles per tile, it might be in a better place than the pipelined implementation above.
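In other words, something along these lines; the thresholds and the pick_mode name are just the guesses above, not a documented format, and finding the widest delta per tile is the part that actually costs cycles:

// Map the widest per-pixel delta in a tile to one of the four footprints
// described above; anything that doesn't fit is dumped uncompressed (1:1).
enum class TileMode { Ratio8to1, Ratio4to1, Ratio2to1, Uncompressed };

TileMode pick_mode(int widest_delta_bits) {
    if (widest_delta_bits <= 2) return TileMode::Ratio8to1;
    if (widest_delta_bits <= 3) return TileMode::Ratio4to1;
    if (widest_delta_bits <= 4) return TileMode::Ratio2to1;
    return TileMode::Uncompressed;   // overflow: store the tile as-is
}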
The acceptability very much depends on the exact statistical profile of the ROP-cache/MC transfers. An implementation might just not give a damn about latency because tiles are only ever re-touched every 1k cycles. Or, if going to memory already takes 400 cycles, what does it matter if we add 50-100 more? There might be a window where you can pick a dog-slow implementation, like with the XB1's zlib compression. If that thing were serial, or you had to use it serially, it would be a killer; if it's asynchronous you might not care that much.
My preference, though, is to try to make it really tight (better safe than sorry) without forgetting about compression efficiency.
Anyway. Little bit off-topic, sorry for the distraction.
I don't consider it off-topic.
Between AMD and NVidia we have claims of 25-40% bandwidth savings. That's well short of 2:1 on average, let alone 4:1 or 8:1. It appears that the returns diminish rapidly. The transistor budgets are hardly lacking.
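Back-of-the-envelope, treating those figures as overall traffic saved (my arithmetic, not anything the vendors have broken down):

#include <cstdio>

// If total traffic drops by a fraction S, the average effective compression
// ratio is 1 / (1 - S): 25-40% saved is only about 1.33:1 to 1.67:1 overall,
// nowhere near the per-tile best cases of 2:1, 4:1 or 8:1.
int main() {
    const double savings[] = {0.25, 0.40};
    for (double s : savings)
        printf("%2.0f%% saved -> %.2f:1 effective\n", s * 100.0, 1.0 / (1.0 - s));
    return 0;
}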
NVidia apparently chose to use delta colour compression in order to make more fillrate viable.
I can't work out what AMD's view of the balance is, with the meagre fillrates of Tonga/Antigua and Fiji in direct contrast to NVidia. There appears to be an argument that as resolutions rise, delta colour compression becomes less useful. If true, that would hint that latency is getting the better of NVidia.