I believe these numbers could be possible, but not on this underclocked, bandwidth-starved silicon with the increased latencies the iGPU brings. A full-speed, L3-equipped model would perform better.
I believe the bigger problem with the BD family L1I cache is associativity (or rather the lack thereof...), not size, so I wonder why they didn't change the structure and go with a 64kB cache at four times the associativity instead. Maybe just increasing the size was simpler, but that still looks like the more expensive solution to me.
So, apparently the i-cache size scales with the associativity, although in an "odd" manner, which means the bank structure is unchanged.
The idea that a DRAM-free board hanging off of PCIe can be cheap presupposes that a highly non-standard and standard-violating board with a dubious business case and non-standard GPU can be cheap.
If someone is so cost-conscious that even inexpensive DRAM is too much, they're probably already down to the most stripped-down, non-expandable motherboards they can find.
A graphics unit without access to local memory hasn't been practical since early in the last decade, and I doubt even the vaunted latency-hiding capabilities of a GPU can hide the impact of having no local framebuffer. The ROPs would probably be among the first elements to falter, with the required batch sizes and local caching becoming too large to be practical.
The following is more speculative, but pure PCIe accesses may also subject the GPU to more stringent ordering constraints than its aggressive memory pipeline can tolerate, negating the GPU's ability to use the link well.
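To put rough numbers on how far a PCIe link falls short of even a modest local framebuffer, here is a back-of-the-envelope sketch of mine (the 256-bit, 6 Gbps GDDR5 configuration is an assumed example, not something from the posts above):

[code]
/* Back-of-the-envelope bandwidth comparison (illustrative figures only). */
#include <stdio.h>

int main(void)
{
    /* PCIe: GT/s per lane, 16 lanes, 128b/130b encoding (gen3/gen4), bits -> bytes. */
    double pcie3_gbs = 8.0  * 16 * (128.0 / 130.0) / 8.0;   /* ~15.75 GB/s per direction */
    double pcie4_gbs = 16.0 * 16 * (128.0 / 130.0) / 8.0;   /* ~31.5  GB/s per direction */

    /* Assumed mid-range GDDR5 card: 256-bit bus at 6 Gbps per pin. */
    double gddr5_gbs = 6.0 * 256 / 8.0;                     /* 192 GB/s local */

    printf("PCIe 3.0 x16 : %6.1f GB/s per direction\n", pcie3_gbs);
    printf("PCIe 4.0 x16 : %6.1f GB/s per direction (~%.0f GB/s both ways)\n",
           pcie4_gbs, pcie4_gbs * 2);
    printf("GDDR5 256-bit: %6.1f GB/s local\n", gddr5_gbs);
    return 0;
}
[/code]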
I assume at some point (soon hopefully) we'll see HSA-like functionality in discrete GPUs, in that the GPU can freely read/write main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCIe (although if implemented on PCIe 4.0, 64GB/s of total bandwidth is nothing to sneeze at), but at least it would do away with having to copy data back and forth.
Does that sound feasible?

Thanks to 32-bit OSes, only a portion of the graphics card's memory is exposed to the CPU. For AMD parts, this is around 256MB of the total video memory. This means the CPU cannot directly access arbitrary regions of video memory.
With XB1 & PS4, 32-bit will become less relevant for big games in a year or less... even in PC space. At least for all the titles that matter, I believe (EA, UBI, TAKE2).
We have to support our customers and that means people upgrading machines with older OSes.
No, not possible. What would you do if multiple applications were trying to access different 256MB windows? (This is supposed to be seamless, right?) Adding in new APIs to change the mappings isn't seamless (or simple or bulletproof), so you may as well stick with existing technology.

pMax said: I bet you can remap it easily, or use a GART for that. It would lead to the hassle of addressing memory in 256MB chunks, maybe - but possible, no?
...of course, you have to support legacy 32 bit customers.
[..] Once everyone is convinced that 64-bit is the de facto standard then we can improve the situation.
Ouch, yeah - you are right, I tend to forget... Anyway, it could just be a "singleton" resource in 32-bit desktop space (or maybe two, each with half the space), with full access in 64-bit space. But probably not worth the effort, I suppose.
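For what it's worth, here is a toy model of the problem being described (entirely my own construction - the sizes and remapping policy are invented and bear no relation to how any real driver, BAR or GART works): with a single relocatable 256MB CPU-visible window, two clients working on regions that can't share a window end up remapping it on nearly every access.

[code]
/* Toy model of a single relocatable 256MB CPU-visible window into VRAM.
 * Purely illustrative - not how any real driver, BAR or GART is implemented. */
#include <stdio.h>
#include <stdlib.h>

#define VRAM_SIZE   (1UL << 30)    /* pretend 1 GB of video memory */
#define WINDOW_SIZE (256UL << 20)  /* 256 MB CPU-visible aperture  */

static unsigned char *vram;        /* stand-in for video memory    */
static size_t window_base = 0;     /* VRAM offset the window maps  */
static unsigned long remaps = 0;   /* how often we had to remap    */

/* A "CPU" write to an arbitrary VRAM offset, routed through the window. */
static void cpu_write(size_t vram_offset, unsigned char value)
{
    if (vram_offset < window_base || vram_offset >= window_base + WINDOW_SIZE) {
        /* Offset not currently visible: move the window (expensive for real). */
        window_base = vram_offset & ~(WINDOW_SIZE - 1);
        remaps++;
    }
    vram[vram_offset] = value;     /* in reality: a write through the mapped BAR */
}

int main(void)
{
    vram = calloc(VRAM_SIZE, 1);
    if (!vram) return 1;

    /* Two "applications" touching buffers that live 512 MB apart in VRAM:
     * every alternating access lands outside the current window. */
    for (int i = 0; i < 1000; i++) {
        cpu_write((size_t)i * 4096, 0xAA);                  /* app A, low VRAM  */
        cpu_write((512UL << 20) + (size_t)i * 4096, 0xBB);  /* app B, high VRAM */
    }

    printf("window remaps for 2000 accesses: %lu\n", remaps);
    free(vram);
    return 0;
}
[/code]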
CPU writes are fine as long as you fill write-combine buffers.

...you say WB is ok ...but you have to do a read-modify, no?
I wouldn't bother supporting 32bit any more, just like I wouldn't bother with DX9 (and wouldn't have for years now).

Both Windows 7 and Windows 8 have 32 bit versions. Most users with computers running the 32 bit version of either OS do not know about it. It would be a customer support nightmare to explain to a paying customer that their brand new computer with Windows 8 doesn't support our game. I was personally hoping that Microsoft would have released only a 64 bit version of Windows 8, but unfortunately that didn't happen, so we will have computers sporting the latest GPUs with DirectX 11.2 and a 32 bit OS. And new computers are sold with 32 bit Windows 8 every day.
Unfortunately, games designed for a 32 bit OS cannot be designed to access more than 2 GB of memory (since that's the upper per-process limit in a 32 bit OS). Thus this is the practical memory limit for most PC games. Not many game developers are willing to design a game with reduced game logic (smaller levels, etc) just for 32 bit computers. That would add quite a bit to development costs.
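As an aside of mine (not from the thread): the ceiling is easy to see for yourself. Built as a 32-bit binary, the little allocator loop below gives out well short of installed RAM, while a 64-bit build of the same code sails past it (hence the safety stop).

[code]
/* Allocate 64 MB blocks until malloc fails, then report the total.
 * As a 32-bit process this tops out around 2 GB (less in practice, since the
 * address space is shared with code, stacks and the allocator's own overhead). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t block = 64u << 20;          /* 64 MB per allocation */
    unsigned long long total = 0;

    /* Blocks are deliberately leaked; the OS reclaims everything at exit. */
    while (malloc(block) != NULL) {
        total += block;
        if (total >= 16ULL << 30)            /* safety stop for 64-bit builds */
            break;
    }

    printf("allocated %llu MB before giving up\n", total >> 20);
    return 0;
}
[/code]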
It's not as simple as that, as it requires changing the VBIOS behavior at POST, when it doesn't know what flavor of OS you will load later.
But you have already rewritten the driver for 64 bit support, so you could easily add the feature for the 64 bit family only. 64 bit code and 32 bit code can't mix.
Let the devs deal with 32-bit porting; they can then choose what to do. Most of today's game customers should already have had a 64 bit OS for years - at least for the top-selling games that would love to use such features, I think.
Yes, write-combine buffers were added to the North Bridge in the AGP days IIRC; now they are integrated into the CPU. There used to be varying numbers of WC buffers per North Bridge, which caused confusion for low-level programmers.

pMax said: ...you say WB is ok ...but you have to do a read-modify, no?
Aaah, I see. You mean the final write is delayed in the CPU until it needs to be flushed. Interesting, thanks.
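To make the "fill the write-combine buffers" advice concrete, here's a small sketch of the usual pattern (my own example, not from the thread): write the destination sequentially in whole 64-byte chunks with streaming stores and never read it back, so each WC buffer fills completely and drains as one burst.

[code]
/* Sketch of write-combining-friendly copying with SSE2 streaming stores.
 * 'dst' stands in for a WC-mapped region (e.g. a mapped GPU buffer);
 * here it is ordinary memory so the example runs on any SSE2 machine. */
#include <emmintrin.h>
#include <stdlib.h>
#include <string.h>

static void wc_copy(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Write 64 bytes (one whole WC buffer / cache line) per iteration,
     * strictly sequentially, with no reads of the destination in between. */
    for (size_t i = 0; i < bytes / 64; i++, d += 4, s += 4) {
        _mm_stream_si128(d + 0, _mm_loadu_si128(s + 0));
        _mm_stream_si128(d + 1, _mm_loadu_si128(s + 1));
        _mm_stream_si128(d + 2, _mm_loadu_si128(s + 2));
        _mm_stream_si128(d + 3, _mm_loadu_si128(s + 3));
    }
    _mm_sfence();   /* make the streamed writes globally visible */
}

int main(void)
{
    enum { N = 1 << 20 };
    /* Streaming stores require 16-byte aligned destinations. */
    void *src = aligned_alloc(64, N);
    void *dst = aligned_alloc(64, N);
    if (!src || !dst) return 1;

    memset(src, 0xAB, N);
    wc_copy(dst, src, N);
    return memcmp(dst, src, N) != 0;
}
[/code]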
One thing to ponder is what 28nm SHP portends for Excavator.
It is a bulk process, but still one tweaked somewhat for the benefit of AMD and high-speed CPUs over the standard processes GPUs thrive in.
Perhaps this is the last remnant of benefit from AMD's initial 32nm work, or a concession it or GF made in the last few years of process drama between the two.
Kaveri's clock rates regressed at least in part because of the loss of SOI and a reduction in the upper clock range in exchange for density.
Where does this take a design that is still in the Bulldozer lineage at 20nm or below, given the constriction in process choices and a heavy focus on SOC parameters?
The elements that allowed Kaveri to be a performance wash or disappointment may not persist, and 20nm planar is not looking to be that great even for SOC clients, much less a speed racer design.
The rumored TDP reduction in Carrizo may be a reflection of this.
If this is the case, then Bulldozer is very much an architecture for a different universe than the one we inhabit. Part of its design targets took into account maintaining high speeds on an inferior CPU logic process, but AMD's flaky mainline core strategy and better success with Jaguar point to BD's whole philosophy being incompatible with a process that goes beyond being inferior for high-speed logic and simply not being meant for it.
Another question: going by the AMD slides indicating that thicker metal layers allow higher clocks, coupled with the clock regression, does this mean 28nm SHP has compromised on that as well?
And what of the (apparently mostly unused) resonant clock mesh, which was primarily meant for the higher clock speeds and which has a taste for heavier metalization?
[URL=http://www.theregister.co.uk/2014/01/14/amd_unveils_kaveri_hsa_enabled_apu/?page=3]The Register[/URL] said: There were reasons to go with 28nm rather than 22nm, Macri told us, that were discovered during the design process. That process was run by what he identified as a "cross-functional team" composed of "CPU guys, graphics guys, mixed-signal folks, our process team, the backend, layout team." […] The problem, he said, was that "our IDsat was unpleasant" at 22nm […] "So what we saw was the frequency just fall off the cliff," he said. "This is why it's so important to get to FinFET."
Well yes, I hope AMD did some analysis too. Though given how the BD family looks, you have to wonder what kind of analysis they really did...

Between you and AMD I expect AMD to have done more real code analysis on capacity vs associativity misses.. well, I hope they have at least. Note that others seem to be trending in this direction too: Apple increased icache (and dcache) size with Cyclone, and ARM is doing the same sort of extension, from 32KB/2-way in Cortex-A15 to 48KB/3-way in Cortex-A57. Then there's nVidia's Denver with a 128KB icache, although that could be more influenced by VLIW or overhead from translation or whatever Denver is doing.
Maybe it's justified by code getting more bloated and library/middleware-ridden, and more JITed code being executed.
Intel could do it too but they've stuck with alias-free caches, so if they want to go above 32KB they'd have to increase beyond 8-way set associativity which is already high.
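The alias-free point falls out of simple arithmetic: in a virtually-indexed, physically-tagged L1, the per-way size (total size divided by ways) must not exceed the 4KB page so the index bits stay within the page offset. A quick sketch of mine for the cache geometries mentioned in this thread (64-byte lines assumed):

[code]
/* L1I geometry: sets, way size, and whether a VIPT cache of this shape
 * can be alias-free with 4 KB pages (way size must not exceed the page). */
#include <stdio.h>

static void describe(const char *name, unsigned size_kb, unsigned ways)
{
    const unsigned line = 64;                      /* assumed line size       */
    unsigned way_bytes = size_kb * 1024 / ways;    /* one way = one "bank"    */
    unsigned sets      = way_bytes / line;

    printf("%-20s %3u KB %2u-way: %4u sets, way size %2u KB, alias-free: %s\n",
           name, size_kb, ways, sets, way_bytes / 1024,
           way_bytes <= 4096 ? "yes" : "no");
}

int main(void)
{
    describe("Bulldozer L1I",      64, 2);   /* 32 KB per way                  */
    describe("Steamroller L1I",    96, 3);   /* same 32 KB way, one more way   */
    describe("Cortex-A15 L1I",     32, 2);
    describe("Cortex-A57 L1I",     48, 3);
    describe("Intel big-core L1I", 32, 8);   /* 4 KB per way -> alias-free     */
    return 0;
}
[/code]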
It's also a more straightforward way of enhancing a cache without having to rearchitect it. Adjusting capacity by changing the number of ways also works in the other direction: Intel disables portions of the L3 in its processors in lockstep with reductions in associativity.
Looks like only Intel is out of the "trend", but their main architecture line has already sported a rather fast, low-latency L2 as a standard feature for the last 5 years. Curiously, Intel has been more generous with the i-cache sizes in its Itaniums for several generations now, but that's mainly because of the overhead of the weird instruction bundle format.

It is quite true that CPUs seem to be increasing cache sizes lately. I don't know what the L1I associativity of Swift/Cyclone/Denver is, though, so for comparison that's only half the story. A15/A57 are still half the size at the same associativity compared to BD and SR respectively, and none of those CPUs have to feed two cores with one L1I; so considering that, SR still doesn't really have a large cache (and it has terrible associativity).
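To illustrate why the low associativity can hurt more than the raw size suggests, especially with two cores' worth of hot code sharing one L1I, here is a toy LRU set-associative cache simulation (entirely my own construction, with made-up sizes and access patterns): a handful of hot blocks that happen to alias to the same set thrash a 2-way cache while a higher-associativity cache of the same size absorbs them.

[code]
/* Toy LRU set-associative cache: count misses when several hot blocks
 * alias to the same set. Sizes and patterns are invented for illustration. */
#include <stdio.h>
#include <string.h>

#define LINE   64
#define MAXWAY 16

static unsigned long simulate(unsigned size_bytes, unsigned ways,
                              const unsigned long *addrs, unsigned n_addrs,
                              unsigned iterations)
{
    unsigned sets = size_bytes / (ways * LINE);
    static unsigned long tag[4096][MAXWAY];   /* tag[set][way], 0 = empty */
    static unsigned      age[4096][MAXWAY];   /* LRU ages                 */
    unsigned long misses = 0;

    memset(tag, 0, sizeof tag);
    memset(age, 0, sizeof age);

    for (unsigned it = 0; it < iterations; it++) {
        for (unsigned i = 0; i < n_addrs; i++) {
            unsigned long line = addrs[i] / LINE;
            unsigned set = (unsigned)(line % sets);
            unsigned long t = line / sets + 1;   /* +1 so 0 means "empty" */
            unsigned hit_way = ways, victim = 0;

            for (unsigned w = 0; w < ways; w++) {
                if (tag[set][w] == t) hit_way = w;
                if (age[set][w] > age[set][victim]) victim = w;
            }
            if (hit_way == ways) {               /* miss: evict the LRU way */
                misses++;
                hit_way = victim;
                tag[set][hit_way] = t;
            }
            for (unsigned w = 0; w < ways; w++) age[set][w]++;
            age[set][hit_way] = 0;               /* mark most recently used */
        }
    }
    return misses;
}

int main(void)
{
    /* Four hot instruction blocks that all land in the same set because they
     * sit exactly one way-size (32 KB for a 64 KB / 2-way cache) apart. */
    unsigned long hot[4] = { 0x0000, 0x8000, 0x10000, 0x18000 };

    printf("64KB 2-way: %lu misses\n", simulate(64 * 1024, 2, hot, 4, 1000));
    printf("64KB 8-way: %lu misses\n", simulate(64 * 1024, 8, hot, 4, 1000));
    return 0;
}
[/code]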