Relooking at the HBM slides, the 5.48x7.29mm figure (~40mm²) is for the base die, not the exposed die on top. Using the 7.29mm edge as the reference has it reaching massive proportions of ~700mm².
http://forums.anandtech.com/showthread.php?t=2430478&page=71
Anyone know how this color compression scheme works, anyhow? It has to be lossless, and surely you don't want it to be so cumbersome to extract the value of an individual pixel that you end up constantly fetching more data than you would without compression... how would you go about implementing something like this?

I think there are some patents for tile-based compression, where the framebuffer is divided up into squares, deltas are generated from that data, the tile is subdivided, those subdivisions are run through the process, their outputs are run through the process, and so on, generating a hierarchy of compressed data and encoding information.
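To make that hierarchy concrete, here is a minimal sketch in plain C++ of one way the "split into squares, delta against an anchor, subdivide, repeat" idea could work: one 8x8 tile, one 8-bit channel, each block's top-left pixel as its anchor, and every anchor delta-coded against the anchor one level up. This is my own toy construction, not AMD's (undisclosed) scheme; all names in it are invented. It is lossless and round-trips exactly.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Toy hierarchical tile/delta coder for one 8x8 tile of one 8-bit channel.
// Not AMD's actual DCC; just an illustration of the patent description above.

constexpr int TILE = 8;
using Tile = std::vector<uint8_t>;   // TILE*TILE pixels, row-major

// Delta-code the block's anchor (top-left pixel) against the parent anchor,
// then recurse into four quadrants. The top-left quadrant shares its anchor
// with this block, so its anchor is not re-encoded (write_anchor = false).
static void encode_block(const Tile& src, Tile& deltas, int x, int y,
                         int size, uint8_t parent, bool write_anchor)
{
    uint8_t anchor = src[y * TILE + x];
    if (write_anchor)
        deltas[y * TILE + x] = uint8_t(anchor - parent);   // wraps mod 256
    if (size == 1) return;
    int h = size / 2;
    encode_block(src, deltas, x,     y,     h, anchor, false);
    encode_block(src, deltas, x + h, y,     h, anchor, true);
    encode_block(src, deltas, x,     y + h, h, anchor, true);
    encode_block(src, deltas, x + h, y + h, h, anchor, true);
}

// Exact mirror of encode_block: rebuild each anchor from its parent anchor.
static void decode_block(const Tile& deltas, Tile& dst, int x, int y,
                         int size, uint8_t parent, bool read_anchor)
{
    uint8_t anchor = read_anchor ? uint8_t(deltas[y * TILE + x] + parent)
                                 : parent;
    dst[y * TILE + x] = anchor;
    if (size == 1) return;
    int h = size / 2;
    decode_block(deltas, dst, x,     y,     h, anchor, false);
    decode_block(deltas, dst, x + h, y,     h, anchor, true);
    decode_block(deltas, dst, x,     y + h, h, anchor, true);
    decode_block(deltas, dst, x + h, y + h, h, anchor, true);
}

int main()
{
    Tile src(TILE * TILE), deltas(TILE * TILE, 0), out(TILE * TILE, 0);
    for (int y = 0; y < TILE; ++y)           // smooth gradient: the kind of
        for (int x = 0; x < TILE; ++x)       // content where deltas stay small
            src[y * TILE + x] = uint8_t(100 + 2 * x + 3 * y);

    encode_block(src, deltas, 0, 0, TILE, 0, true);
    decode_block(deltas, out, 0, 0, TILE, 0, true);

    int max_abs_delta = 0;
    for (int i = 1; i < TILE * TILE; ++i)    // skip the raw anchor at [0]
        max_abs_delta = std::max(max_abs_delta,
                                 std::abs(int(int8_t(deltas[i]))));

    std::printf("lossless: %s, largest delta: %d\n",
                (src == out) ? "yes" : "no", max_abs_delta);
    return 0;
}
```

The payoff would be that on smooth content the stored deltas are small, so a real implementation could pack each level into fewer bits per tile plus a little metadata saying how wide those deltas are, and random access stays reasonable: decoding any single pixel only needs the short chain of anchors above it, not the whole tile history.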
3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvement (the hull, domain and vertex shader stages have load balancing issues because of the data passing between them). Those guys could pop in and give their in-depth knowledge. I have not programmed that much for DX10 Radeons (as consoles skipped DX10), but GCN seems to share a lot of its geometry processing design with the 5000/6000 series Radeons.
I think people misunderstood my definition of "cache". I was not talking about spilling registers to L1 or implementing a caching scheme similar to the memory caches. Let me explain what I meant.

Sorry, I don't think it's a good idea from a hardware designer's point of view.
While registers and cache memory serve the same purpose, i.e. providing fast transistor-based memory to offset slow capacitor-based RAM, they belong to very different parts of the microarchitecture and are implemented with very different design techniques.

Registers are part of the instruction set architecture: they sit much closer to the ALUs and are directly accessible from the microcode, so they are "hard-coded" as transistor logic. Caches are part of the memory subsystem, accessible only through the memory address bus and built as blocks of memory "cells"; they are transparent to the programmer and not really part of the ISA.
https://lh4.googleusercontent.com/-...AAALJs/62rqQEaLQ-g/w2178-h1225-no/desktop.jpg
1002-67C8, 1150MHz core freq, 8GB HBM.
source: http://www.chiphell.com/forum.php?mod=redirect&goto=findpost&ptid=1302682&pid=29101917
Instead, the Radeon R9 Fury X will be the flagship video card, a watercooled part based on the Fiji XT GPU. Under that, we'll have the Radeon R9 Fury, which should be based on the Fiji PRO GPU, with an entire restack of current cards. Under these two new High Bandwidth Memory-powered video cards we'll have the Radeon R9 390X, Radeon R9 390, R9 380, R7 370 and R7 360.
The Radeon R9 Fury X will be a reference card with AIBs not able to change the cooler, but TweakTown can confirm that it will be the short card that has been spotted in the leaked images. The Radeon R9 Fury will see aftermarket coolers placed onto it, so we should see some very interesting cards released under the Radeon R9 Fury family.
The Radeon R9 Fury X has a rumored MSRP of $849, making it $200 more than the NVIDIA GeForce GTX 980 Ti, but $150 cheaper than the Titan X. The Fury X branding is a nice change from AMD, but it does sound awfully close to the Titan X with that big, shiny, overpowering 'X' in its name, doesn't it?
Someone went slightly overboard with the thermal grease there
LDS is optimized for different purposes. LDS can do (fast) atomics by itself (accumulate, etc.). LDS reads and writes have a separate index for every thread (64 threads in a wave); this is 64-way scatter/gather. Register loads & stores (64 lanes at a time, guaranteed to be at sequential addresses) would be much simpler to implement in hardware. When scatter & gather are not needed, the register file could even be split by lanes (or groups of 16 lanes) if that increases register locality. Emulating a bigger register file with LDS is both slower and costs much more power than native support.

@sebbbi That cache is already the LDS for a lot of variables.
But you are right that with the current GCN hardware, it is sometimes a good idea to offload seldom used registers to LDS.
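For what that offloading looks like in practice, here is a hedged sketch using CUDA C++ with __shared__ memory standing in for GCN's LDS (the analogous on-chip scratchpad); the kernel, its name, the block size and the constants are all invented for illustration. A per-thread value that is needed only once at the very end is parked in shared memory so it does not occupy a register across the hot loop, which is essentially the "offload seldom-used registers to LDS" idea.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: park a rarely used per-thread value in shared memory (CUDA's
// analogue of GCN's LDS) instead of holding it in a register for the whole
// kernel. Names and numbers are made up for illustration.
__global__ void scale_and_bias(const float* in, float* out, int n)
{
    __shared__ float parked[256];   // one slot per thread in a 256-thread block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Needed only once at the end: store it in shared memory ("LDS") now.
    parked[threadIdx.x] = in[tid] * 0.5f;

    // Hot loop works purely out of registers.
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += in[tid] * (float)i;

    // Fetch the parked value back just before writing the result.
    out[tid] = acc + parked[threadIdx.x];
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_and_bias<<<1, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("out[0] = %f\n", h_out[0]);   // 1*(0+1+...+15) + 0.5 = 120.5
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Whether this actually wins depends on the kernel: as noted above, the LDS round trip costs extra instructions and power compared to simply having enough registers, and it competes with other uses of LDS, so it only pays off when register pressure is what limits occupancy.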