> @3dgci, aren't you agd5f?

I don't know what that means. I suppose it's someone's username, so no.
Another reason AMD should have planned for an earlier announcement. Now it's been almost two months of rumors, and the majority seem to be negative.
The comparison I was thinking of was between a more conventional hierarchy and one with a 128MB additive cache layer.
If the RT workload is random enough and more sensitive to round-trip latency, the RT blocks could see limited upside.
GCN's L1 hit latency is ~114 cycles, L2 is ~190, and an L2 miss ~350, per a GDC2018 presentation on graphics optimization.
RDNA has another layer of cache, although the exact latencies aren't clear. A ~10% improvement in overall latency was mooted, but it was attributed to the overall increase in capacity rather than any individual layer being sped up.
If an L2 hit is on the order of 200 cycles and the L2 is 4MB, a cache 32x larger sitting outside the L2 could add enough additional latency that the average latency for pointer-chasing RT blocks is worse than if it weren't there.
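To make that additive-latency concern concrete, here's a minimal back-of-the-envelope sketch. The cycle counts loosely follow the GCN figures above; the extra-level hit rate and the ~100-cycle lookup penalty are pure assumptions for illustration, not measured RDNA2 numbers.

```python
# Minimal average-latency model. Each entry is (hit rate for accesses that
# reach this level, round-trip latency in cycles for a hit at this level).
# All numbers are assumptions for illustration only.
def avg_latency(levels):
    total, reaching = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reaching * hit_rate * latency
        reaching *= (1.0 - hit_rate)
    return total

# Conventional hierarchy: L1 -> L2 -> DRAM (GCN-like cycle counts from above).
baseline = avg_latency([(0.6, 114), (0.5, 190), (1.0, 350)])

# Same hierarchy with a big extra level between L2 and DRAM. For a
# pointer-chasing working set that rarely hits it (20% assumed), the extra
# ~100-cycle lookup on the way to DRAM mostly just adds latency.
with_big_cache = avg_latency([(0.6, 114), (0.5, 190), (0.2, 290), (1.0, 450)])

print(f"baseline:         ~{baseline:.0f} cycles average")
print(f"with extra level: ~{with_big_cache:.0f} cycles average")
```

Under those assumed hit rates the extra level makes the average slightly worse, which is the scenario being described.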
For AMD, it's not an uncommon case that on-die traffic is as bad or worse than missing to DRAM.
Naughty Dog listed a worst-case scenario for the PS4 CPU being a remote hit to a Jaguar L1, and Zen 1 had noticeable inter-CCX latency on the order of missing to DRAM.
> Not really. What AMD should do is follow their own release schedule and completely ignore all the noise. There's no need for subtle marketing attempts with thinking faces and the like. Let the product do the talking. If it's great, it makes the doubters look like complete idiots. If it's not great, you save face by not hyping it up with poor marketing attempts.

The absence of official information is a problem. I don't know why AMD waits so long to release all their products. An Oct 8th announcement barely puts them in time for holiday sales, and unless they have extremely good availability it will make a lot of people switch companies.
They're not "waiting so long to release their products", they're releasing them as soon as they're ready to be released. Can you imagine the shitstorm if they did proper paper release at this point?the absence of official information is a problem. I don't know why AMD waits so long to release all their products. Oct 8th announcement barely puts them in time for holiday sales and unless thye have extremely good avalibilty it will make a lot of people switch companies
Wait, so you expect it not to be a paper release in Oct? I have a feeling Oct 8th is an announcement, with availability late Oct or even early Nov.
RDNA2 is Oct 28th.
Thanks for correcting me. So yeah, we may not even have it before the end of Nov at this point for shipping cards.
> Microsoft has used its own particular wording for hardware blocks before, going by its naming convention for compute units in the current gen. The coloration of the color/depth blocks is also green versus the blue cache blocks in the diagram. Seems like an omission to not note a 128MB collection next to the 4MB L2, for example.

As I noted more recently, the labelled XSX die shot has nothing like a massive quantity of RAM. So my comments about these "color/depth" blocks are irrelevant.
> As great as it might be to have a massive frame buffer on-die, it seems like something could be done to make more use of it than closing the ROP memory loop on-die.

Xenos benefitted hugely from the daughter die's performance.
> For example, the geometry engine and binning rasterizers would likely have a decent idea of how many screen tiles may be reasonably needed in a given time window, and that could leave much of that cache available for something other than ROP exports.

NVidia's tiled rendering (with tile sizes that vary depending upon the count of vertex parameters, pixel format, etc.) is some kind of cache scheme that only seems to "touch" relatively few tiles at any given time.
> One possible way to have the necessary area is to have a cache layer inside of an active interposer containing the memory controllers and a fabric network.

I like this. I have a fuzzy memory of a previous discussion about an active interposer used this way.
That wouldn't crowd the logic above, and it might be more amenable to the thermal conditions below an active die and the disruption due to TSVs.
I'm still haunted by "Nexgen Memory":

> Such a GPU would also not be constrained to a single memory controller type[...]
> The context seems to be that the cache is meant to make up for not having an extremely wide GDDR6 bus or HBM. A more modest GPU might have bandwidth needs low enough to be satisfied by a regular GDDR6 bus, without incurring a significant die cost that a console may not be able to justify.

Frankly I don't believe there is such a large "cache". The frame rate targets for XSX/XSS without an obvious hunk of memory in the die shot (frame rates that we should expect from 6800XT/6700XT, it seems) tell me that "Infinity Cache" in a giant amount is not part of the Navi 2x architecture.
If there were some kind of advanced integration necessary, that might lead to such a thing being ruled out.
That's GDDR6.
6 and 6X certainly have features/performance (channels and signalling) that are beyond 5.
IMO immediate mode GPUs these days are too programmable to pull off the same feat again. Schemes like “tile based rendering” and “DSBR” are basically trying to improve the spatial locality of caches within the current programming model. They don't change the intrinsic nature of immediate mode: it can't guarantee that the GPU (or a specific screen-space partition of it) will only touch one rasteriser bin/tile at a time. So the worst case scenario remains chaotic/pathological API submissions that lead to little to no binning in practice, where the partition always has to be ready to deal with any number of tiles simultaneously. In other words, memory use is not bounded/fixed, should on-chip tile memory be introduced to an immediate mode GPU. That's unlike a TBDR, which sorts and bins all primitives before rasterisation and fragment shading.
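As a toy illustration of that binning point (not a model of any real rasteriser), the sketch below counts how many distinct screen tiles a window of in-flight primitives covers when submission order is screen-space coherent versus chaotic; the tile size, window size and primitive counts are arbitrary assumptions.

```python
import random

# Toy setup: a 4K target split into 32x32 tiles, 16 small primitives per tile.
TILE, WIDTH, HEIGHT, WINDOW = 32, 3840, 2160, 256
tiles_x, tiles_y = WIDTH // TILE, HEIGHT // TILE
prims = [(x, y) for y in range(tiles_y) for x in range(tiles_x)
         for _ in range(16)]           # each prim tagged with the tile it hits

def worst_live_tiles(order):
    """Largest number of distinct tiles touched by a batch of WINDOW
    consecutive primitives (batches taken back-to-back)."""
    worst = 0
    for i in range(0, len(order) - WINDOW + 1, WINDOW):
        worst = max(worst, len(set(order[i:i + WINDOW])))
    return worst

coherent = prims[:]                    # screen-space sorted submission
chaotic = prims[:]
random.shuffle(chaotic)                # pathological, unsorted submission

print("coherent order:", worst_live_tiles(coherent), "tiles per batch")
print("chaotic order: ", worst_live_tiles(chaotic), "tiles per batch")
```

With coherent submission each batch stays within a handful of tiles; with chaotic submission nearly every primitive in a batch lands in a different tile, which is the unbounded case described above.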
> And if it doesn't have a latency advantage I don't see what it could be for. You're still going to get slowed down by a narrow bus anytime you read from GDDR with this scheme.

The cited benefit was compensating for the lack of a major update to the external memory bus, so bandwidth amplification seems to be the primary motivation.
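For what it's worth, the usual bandwidth-amplification arithmetic (a rough model with assumed numbers, not vendor figures): if a fraction h of traffic is serviced on-die, DRAM only sees the remaining (1 - h), so sustainable throughput scales roughly as dram_bw / (1 - h).

```python
# Rough bandwidth-amplification estimate; 512 GB/s stands in for a 256-bit
# GDDR6 bus at an assumed 16 Gbps. Hit rates are illustrative guesses.
dram_bw = 512  # GB/s
for h in (0.0, 0.25, 0.5, 0.75):
    print(f"on-die hit rate {h:.0%}: effective ~{dram_bw / (1 - h):.0f} GB/s")
```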
> Overall, a giant eDRAM cache isn't a new idea but, afaik, IBM is the only one continuing to use it to much extent.

eDRAM has specific process needs. IBM's use extends to Power9, which is fabricated on the 14nm SOI process IBM sold to GF. IBM's Power10 is to be on Samsung's 7nm node and reverts to SRAM.
There may be a limit of sorts to the pathological case, in that we don't know how many batches the DSBR can close and have in-flight for any given screen tile. The capacity for tracking API order for primitives in batches exists, or at least we know the opposite exists: the GPU can be told to ignore API order for cases like depth/coverage passes.
Xenos's daughter die didn't have the capacity for all render target formats that could be bound by the GPU, so there were performance uncertainties and cliffs (full speed 4xMSAA was limited to 720p, I think - fuzzy memory alert).
"128MB" is gargantuan in comparison, 16 bytes per pixel at 4K, but I guess there will be cases where delta colour compression or MSAA sample compression fail to satisfy the 16 bytes limit.
NVidia's tiled rendering (with tile sizes that vary depending upon the count of vertex parameters, pixel format, etc.) is some kind of cache scheme that only seems to "touch" relatively few tiles at any given time.
By comparison the rumoured "128MB Infinity Cache" seems like dumb brute force if it were dedicated solely to ROPs.
I'll be honest I think "128MB Infinity Cache" is a hoax or at the very least a grave misunderstanding.
Besides, bus die size isn't what seems to be holding GPUs back today, at least not more than nominally. Ampere shows how easily you can run out of thermal and power budget. Addressing that would seem a far more obvious target than trying to save a bit of die area by going for a smaller bus.
RGT is doubling down on a 256-bit bus (with GDDR6). If true, it will be interesting to see the bandwidth efficiency of RDNA2.
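For scale, the raw-bandwidth arithmetic the efficiency question hinges on (the data rates are assumptions, not confirmed specs):

```python
# GB/s for a GDDR bus: width in bits / 8 * per-pin data rate in Gbps.
def bus_gbs(width_bits, gbps_per_pin):
    return width_bits / 8 * gbps_per_pin

print(bus_gbs(256, 16))    # 256-bit at an assumed 16 Gbps     -> 512 GB/s
print(bus_gbs(384, 19.5))  # 384-bit at 19.5 Gbps (GA102-class) -> 936 GB/s
```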