I Can Hazwell?

Gubbi · Apr 25, 2013

3dilettante said:
As a cache, that amount of memory is going to have a lot of memory dedicated to tags, if this bandwidth cache is organized and accessed as if it were a standard cache.

Remember, you don't store the LSBs of the address in tags and 128MB amounts to 27 address bits. Existing Core i7 support 46 bit physical addresses which leaves us with a tag size of just 19 bits.

If they utilize the EDRAM as a 8-way victim cache with 128 byte cache lines and 64 byte sectors we get 19 address tag bits, 2 sector bits, 7 PLRU bits and some state (MOESI-XYZ ... whatever) bits, around four bytes in total.

With one million lines that amounts to 4MB of SRAM, a lot, but doable. They could use 256 byte cache lines with 4 sectors, then they'd have two more sector bits x 2^19 cache lines, - a little more than 2MB.

The GPU benefits from the EDRAM, but as a cache the CPU would too.

Cheers

rpg.314 · Apr 25, 2013

3dilettante said:
As a cache, that amount of memory is going to have a lot of memory dedicated to tags, if this bandwidth cache is organized and accessed as if it were a standard cache.

The tags could be kept on the EDRAM, although that burns bandwidth by firing off accesses to main memory and the EDRAM, unless there's something else like a page table entry filtering out accesses.

I am sceptical that all the CPU accesses will be routed via this buffer. The CPU will heavily interfere with the rendertargets as it runs much faster and has very little overlap with the GPU data. While a CPU could access it, I am guessing it will be made available only via the graphics driver.

sebbbi · Apr 25, 2013

pcchen said:
My point is, if GT3e only uses the eDRAM for frame buffer, that'd be pretty boring. However, just using the eDRAM for a "texture cache" (not the same as the traditional on-chip texture cache) is not ideal, because the relatively small size of the eDRAM would cause a lot of cache thrashing. A better way is to make the eDRAM and the main memory as some sort of "unified" memory, only that some parts are quicker. The system will have to determine which texture (or which part of some textures) are more frequently used and should be stored in the eDRAM, while others should stay in main memory.

Gubbi said:
Example: A 1920x1080 render target with 4x MSAA and 16bytes per fragment G-buffer data is 128MB, but since we only need the 12 non-Z bytes of a fragment's G-buffer when we're rendering edges the actual touched memory footprint is much lower. Assuming 40% extra fragments for 4xMSAA we get 1920x1080 pixels x 4 Z-samples/pixel x 4 bytes/Z-sample+ 1920x1080 pixels x 1.4 fragments/pixel x 12 bytes/fragment = 65MB , leaving 63 MB for textures which should be plenty, - it's more than 22 bytes per fragment.

I wouldn't expect the eDRAM to be a fully functional L4 cache for both CPU and GPU (with 64 byte cache lines, coherency, etc goodies). The cache logic + memory for tags, etc would just take too much die space (as the 128 MB cache is huge compared to traditional CPU designs).

It could however work in a similar way than AMDs PRT (partially resident textures) and 3d Labs hardware virtual texturing (http://www.graphicshardware.org/previous/www_1999/presentations/v-textures.pdf) works.
The GPU could be able to sample textures from both DDR memory and from the eDRAM, and whenever a page is accessed that is not in eDRAM, the page starts to be moved to eDRAM (and simultaneously the accessed cache lines obviously to the GPU L1/L2 caches to serve the memory request with minimal latency). The texture page then would be in the eDRAM for the next access (during the same frame, or during the next frame). Assuming 64 KB pages (like in Tahiti PRT), the cache management would be much simpler compared to 64 byte cache lines, as there's 1024x less items in cache. That wouldn't require much extra logic / on die memory at all.

I have been programming software based virtual texture algorithms for a few years now, and I have analyzed the memory requirements for different resolutions (for various types of source content). Basically in the worst case (lots if discontinuities, 64KB pages), you need to have around 16 million texture samples in memory to render a 720p scene. That's equivalent of a single 4096x4096 texture (if you have only a color map). In our game we have 3xDXT5 textures sampled per pixel, so the worst memory requirement for textures at 720p is 3 bytes * 16 million pixels = 48 megabytes. And this allows unique 1:1 pixel sharp texel for each screen pixel (source texture resolution doesn't matter at all).

Optimized g-buffer layout at 720 (similar to Cryengine 3) uses: 720p * (8888*2 + D24S8) = ~ 10.5 MB memory. Add the texture pages mapped, and you get ~60 MB required to render a frame (for opaque geometry). As the data set changes slowly from frame to frame (animation must be smooth to look good), there's not many texture pages you have to move in/out of the eDRAM every frame. 16 x 64 KB pages (per frame) would be enough for most of the time (except for camera teleports), and since we should assume the GPU can also sample textures directly from DDR memory, the eDRAM<->DDR memory transfer bandwidth would never be a bottleneck (you could amortize the transfers over several frames).

With an optimized texture layout (3 bytes per pixel) + optimized g-buffer layout (12 bytes per pixel) 1080p would need (worst case)... ~ 60 MB * 2.25 = ~135 MB. That could often be below 128 MB, allowing us to fully texture our frame from eDRAM. My calculations are however not taking account how much data the shadow maps require, or how much data the transparent passes require (particles, windows, etc). However this data could be sampled directly from the DDR memory, assuming the memory management can set priorities correctly (and try to keep a most often reused subset of 64 KB texture pages in the eDRAM).

The interesting thing here is that Haswell is the first Intel chip to have this kind of large eDRAM die. If we had instead 256 MB of eDRAM, the GPU wouldn't need to access DDR at all (except for camera teleports). Everything a single frame needs would fit to the eDRAM. The system would just need to gradually move the 64 KB pages in/out of the eDRAM based on GPU access patterns. If the system would be well designed, there wouldn't be a need to move more than a few megabytes of data every frame between DDR<->eDRAM (assuming the access patterns are similar to our games using software based virtual texturing).

Gubbi · Apr 25, 2013

sebbbi said:
I wouldn't expect the eDRAM to be a fully functional L4 cache for both CPU and GPU (with 64 byte cache lines, coherency, etc goodies). The cache logic + memory for tags, etc would just take too much die space (as the 128 MB cache is huge compared to traditional CPU designs).

<snip>
Assuming 64 KB pages (like in Tahiti PRT), the cache management would be much simpler compared to 64 byte cache lines, as there's 1024x less items in cache.

Simpler, but also a lot less efficient.

I'm pretty confident Intel will use 4KB pages if anything. However, manually swapping 4K pages around and busting the MMU/TLBs on every "miss" is very unlikely, IMO.

And they don't have to use 64 byte cache lines. As I wrote above, they could settle with 256 byte cache lines, each with 4 sectors for a total of 2MB of tags+state for the EDRAM, perfectly doable.

Cheers

sebbbi · Apr 25, 2013

Gubbi said:
I'm pretty confident Intel will use 4KB pages if anything. However, manually swapping 4K pages around and busting the MMU/TLBs on every "miss" is very unlikely, IMO.

As you would need to update only around 100 (4 KB) pages per frame, the TLB updates would be pretty much free (and so would be the 400 KB memory traffic). If you are programming with AMD PRT (OpenGL extension) you are remapping similar amount of GPU pages every frame already, and it is very fast indeed. Of course with 4 KB pages you would need to map more than with 64 KB pages (Tahiti's PRT page size is 64 KB). I don't see any problem with this kind of solution (it would be both cheap and efficient for normal rendering scenarios).

Obviously everyone would be happier to have fully coherent huge L4 cache that both the GPU and the CPU could use. That would be fantastic for GPGPU workloads. IBM A2 chips have big eDRAM caches that have transactional memory synchronization and memory versioning support. Intel's current TSX implementation is based on L1 caches (only 64 KB). It would be huge, if Intel could bring server / supercomputer market a chip with 128 MB of L4 cache with TSX support. According to some rumors Haswell-E chips will have up to 20 cores. A big shared L4 (with TSX) would be godsend for chips like that.

Gubbi · Apr 25, 2013

sebbbi said:
Intel's current TSX implementation is based on L1 caches (only 64 KB). It would be huge, if Intel could bring server / supercomputer market a chip with 128 MB of L4 cache with TSX support. According to some rumors Haswell-E chips will have up to 20 cores. A big shared L4 (with TSX) would be godsend for chips like that.

The problem with big transactions in memory is the same as in databases. The amount of processing time within a critical section is proportional (as a first order approximation) to the size of the transaction. The chance of some other core touching a data item in your transaction is thus proportional to the size squared.

I do believe the GT3e Haswells are a testbed for future Xeon products. I also expect Intel to significantly up the amount of EDRAM in future versions. There is a reason they are pushing for 450mm wafers, they want to produce the CPU, the GPU and the fast, high value, EDRAM themselves.

Cheers

entity279 · Apr 25, 2013

Gubbi said:
There is a reason they are pushing for 450mm wafers, they want to produce the CPU, the GPU and the fast, high value, EDRAM themselves.

They are pushing it because they can and they are positioning themselves even better against the competition.

Don't see any connection between 450mm and EDRAM here, no.

Grall · Apr 25, 2013

450mm wafers have been seen as the way to bring down the $/transistor ratio that is otherwise projected to remain static with upcoming nodes...so it would help intel to build bigger cache chips (capacity-wise, if not sqmm-wise) without prices galloping away for them.

Gubbi · Apr 25, 2013

entity279 said:
Don't see any connection between 450mm and EDRAM here, no.

You don't see the connection between Intel increasing production capacity, while introducing new products which grab a larger fraction of the high margin Si that goes into a computer ?

Cheers

DSC · Apr 26, 2013

http://newsroom.intel.com/community...ip-shot-4th-generation-intel-core-coming-soon

UniversalTruth · Apr 29, 2013

&

http://www.chinadiy.com.cn/html/45/n-8845.html

Grall · Apr 30, 2013

Latest news from intel-camp is that haswell in its lowest-power modes draw so tiny a current from the +12V rail (as low as 0.05A or somesuch) that the under-current protection will trigger in many power supplies and shut down the PC entirely.

Therefore, intel recommends that mobo BIOSes (well, not anymore) have the ability to turn these low power modes off - presumably even defaulting off, or the user would get a nasty surprise upon booting into windows...

Laptops would of course be safe, as the entire system can be tailored with this in mind.

Albuquerque · Apr 30, 2013

UniversalTruth said:
&

http://www.chinadiy.com.cn/html/45/n-8845.html

I know that the "E" series of Sandy Bridges supported those exact same higher FSB's as well (bclk of 125 / 166) that could additionally be tweaked just a tad more, say +/- 5Mhz not unlike what is shown in that first screen cap.

I personally run my 3930k running at 125bclk on an Intel motherboard at the "stock" turbo 36x multiplier for 4.5Ghz, and it works flawlessly.

itsmydamnation · Apr 30, 2013

my I7 920 has a bclock of 220.............

Albuquerque · Apr 30, 2013

itsmydamnation said:
my I7 920 has a bclock of 220.............

And Nehalem was the last line of Core CPU's that would deal with that kind of bclk rate. All of the non-E Sandy Bridge and Ivy Bridge processors would encounter massive instability in most cases where bclk varied more than ~5% from stock.

Intel provided something in the "E" series of Sandy Bridge that allowed for 100, 125 and 166 bclk values directly from the firmware itself. Throw in that additional 5% I mentioned above, and you get the 170Mhz result in the screenshot above.

jlippo · May 2, 2013

http://www.anandtech.com/show/6926/intel-iris-iris-pro-graphics-haswell-gt3gt3e-gets-a-brand
Really hope that the 128MB is accessible for CPU as well.

Blazkowicz · May 2, 2013

Albuquerque said:
Intel provided something in the "E" series of Sandy Bridge that allowed for 100, 125 and 166 bclk values directly from the firmware itself. Throw in that additional 5% I mentioned above, and you get the 170Mhz result in the screenshot above.

It's just multipliers so that you don't run with overclocked DMI and PCIe bus, which is a very bad thing (like when I ended up killing my PC with a 83MHz FSB and 41.5MHz PCI bus)

There was some news about it at IDF 2013.
http://www.hardware.fr/news/13056/overclocking-par-bus-haswell.html
With Intel saying "* Only some processors enable part or all of these features" it's likely the ratios will be locked to 1:1 on non-K processors.

Pressure · May 2, 2013

jlippo said:
http://www.anandtech.com/show/6926/intel-iris-iris-pro-graphics-haswell-gt3gt3e-gets-a-brand
Really hope that the 128MB is accessible for CPU as well.

It acts as an L4 cache but it will only be available in a BGA form factor.

Blazkowicz · May 2, 2013

It would be funny if you could boot the PC without additional RAM. Not likely, but with 128MB you would at least have a nominally usable PC when your RAM is dead or just not present. (bare ubuntu 13.04 + lxde + stuff)

When L2 was hitting 512K, 1M, 2M I was thinking, damn!

if the CPU could consider it as if it were main RAM then we would have a usable DOS machine, even if it were just for running old software and games, flashing micro-controllers/EPROM etc. though you could probably run some Mandelbrot generator too, or use the network card (ssh, telnet, text mode web browser..) - they tend to all have a DOS driver.

Grall · May 2, 2013

Pressure said:
It acts as an L4 cache but it will only be available in a BGA form factor.

There's been information of a 4-core i5 version in socket format as well, with L3 cut-down to 6MB.

I Can Hazwell?

Gubbi

rpg.314

sebbbi

Gubbi

sebbbi

Gubbi

entity279

Grall

Invisible Member

Gubbi

DSC

UniversalTruth

Grall

Invisible Member

Albuquerque

Red-headed step child

itsmydamnation

Albuquerque

Red-headed step child

jlippo

Blazkowicz

Pressure

Blazkowicz

Grall

Invisible Member