eSRAM astrophysics *spin-off*

sorry i got mixed up with your other post saying it would be xbox's 12 CUs vs ps4's 12 CUs, because 2 would be used by the os and the other 4 are only good for compute/gpgpu stuff. but with the other stuff you said about the display planes helping xbox one games hit 60fps easily compared to ps4 games, and the esram bandwidth increase + what you said about the xbox one cpu having more bandwidth than the ps4 cpu (40GB/s compared to 20GB/s), i assumed it meant xbox would be ahead. I'll send you a pm, i always want to learn more. i will admit im a bit confused now since you've clarified.

This is complete rubbish. All the CU's can be used for rendering, but you might be better off doing compute on graphical structures and getting some post FX that way instead, due to how compute can get past the traditional limitations of the GFX pipeline.

You can use all 18 CU's for graphics if you want, and they will ALL give the same performance regardless of how many you use. To assume that the PS4 will reserve ~2x as much of the GPU as the XBONE does for the OS is also pretty wacky.


cool, and the gpu sees that as one big pool so it's really 210GB/s? man, ms engineers are amazing. all these tech upgrades + it being more efficient because of the DMEs and whatnot makes xbox sound just ridiculously powerful. proof is in the pudding, ryse is pretty much the most amazing thing i saw at e3.

some time ago you suspected that the display planes will allow for more games to be 1080p and 60fps. is that the limit of the display planes, or can they do, say, 4k gaming? i remember ms saying that xbox one was capable of it, would it be wrong to credit the display planes for that?

DME's do not make the XBONE magically more efficient any more than having DMA controllers makes your desktop more efficient (hint: your desktop probably has in the range of 10-20 DMA controllers).
 
sorry i got mixed up with your other post saying it would be xbox's 12 CUs vs ps4's 12 CUs, because 2 would be used by the os and the other 4 are only good for compute/gpgpu stuff. but with the other stuff you said about the display planes helping xbox one games hit 60fps easily compared to ps4 games, and the esram bandwidth increase + what you said about the xbox one cpu having more bandwidth than the ps4 cpu (40GB/s compared to 20GB/s), i assumed it meant xbox would be ahead. I'll send you a pm, i always want to learn more. i will admit im a bit confused now since you've clarified.

Oh, that's if you include OS stuff as per bkilian's random guesstimate about PS4 reserving 2CU's for that. Someone else said that may not be necessary. That TXB thread was me trying to ignite some discussion on the subject since nobody was actually on topic. Anyhow, don't do versus stuff or it'll just get deleted. PM me if ya want. ;)

some time ago you suspected that the display planes will allow for more games to be 1080p and 60fps. is that the limit of the display planes, or can they do, say, 4k gaming? i remember ms saying that xbox one was capable of it, would it be wrong to credit the display planes for that?

X1 can do 4k gaming and it'd definitely use them for accomplishing that. The more novel use would actually be in handling 3D games. Apparently it's quite helpful for that sorta thing. I've heard that only 1st party devs had access to these display planes until final dev kits went out a few weeks ago. Wonder if that includes 2nd party devs like Crytek or Double Helix etc.
 
Betanumerical,

Sounds to me like there's more to this than 'double pumping' and that it's something else we're not seeing; maybe compression algorithms, and they are talking in uncompressed terms.

What did you mean here? How would this fit in with any info we have on the subject? Surely MS knew full well what decompression the DME's were doing and that wasn't a surprise. That'd be misleading math too, and they were telling devs about the bandwidth boost. Don't really see how that meshes well with what we have heard or what we know.
 
X1 can do 4k gaming and it'd definitely use them for accomplishing that. The more novel use would actually be in handling 3D games. Apparently it's quite helpful for that sorta thing. I've heard that only 1st party devs had access to these display planes until final dev kits went out a few weeks ago. Wonder if that includes 2nd party devs like Crytek or Double Helix etc.

Then you heard wrong because the display planes have been in since the beta dev kits at the start of the year.


Betanumerical,



What did you mean here?

It sounds and feels to me like, when they talk of more bandwidth, they are talking about the compression that GCN cards employ on certain tasks, and instead of quoting actual bandwidth they are quoting it vs no compression.
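
As a rough illustration of the distinction being drawn here between an "effective" figure and what's actually on the wire (all numbers below are made up purely for illustration, not XB1 figures):

```python
# Purely illustrative: how an "effective" bandwidth figure can be quoted when
# data moving over the bus is compressed. All numbers are hypothetical.

actual_bw_gbs     = 100.0  # raw bandwidth actually available on the wire (hypothetical)
compression_ratio = 1.5    # average uncompressed bytes per byte transferred (hypothetical)

effective_bw_gbs = actual_bw_gbs * compression_ratio
print(f"{effective_bw_gbs:.1f} GB/s 'effective' vs {actual_bw_gbs:.1f} GB/s actually on the wire")
# -> 150.0 GB/s 'effective' vs 100.0 GB/s actually on the wire
```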
 
Then you heard wrong because the display planes have been in since the beta dev kits at the start of the year.

Link to someone saying as much? I thought it was interference here who said no 3rd parties had it until final kits.

It sounds and feels to me like, when they talk of more bandwidth, they are talking about the compression that GCN cards employ on certain tasks, and instead of quoting actual bandwidth they are quoting it vs no compression.

No, it's actual real world bandwidth.
 
You design all 8 cycles to double pump if that's your goal, sure. But this wasn't what we got here. If my theory on what happened is right, I don't see anything particularly shocking about it. Just good luck on MS's part. It could have very well been something that only popped up as they were finalizing their production testing.
You don't design a motorcycle and find out halfway through tooling for mass production that it has four wheels and a moon roof.

You cannot Gilligan yourself into a functioning high-speed digital memory pipeline. If the hardware is supposed to change state at specific points on a specific clock signal, it is doing it because the professional engineers made fundamental design decisions early in the process of laying out a multibillion transistor device operating with margins measured in picoseconds.

Picking up on signals not provisioned for in the design isn't a happy accident; that means part of a defined set of state transitions drawn in silicon and wire has gone off the reservation and violated assumptions built into whole sections of the pipeline. The control units, if they weren't designed for this, would not be able to change their signals for the other half of the clock cycle. Either the interface starts spitting out signals that do not make sense, or the chip starts predicting the commands it should have gotten but never will. It then may decide to rebel against its fleshy masters.

For instance, you can get faster state changes in the transistors by making them smaller due to the way quantum tunneling works.
Are you stating they stumbled on a node shrink or three while validating their SRAM, or has the industry missed something in the 28nm range that's been in use for years?

Maybe they got the eSRAM smaller and more dense during some production test runs and didn't realize it was small enough for the states to switch fast enough to open up the window in the pulse for double pumping? It's possible.
I'm at least a little skeptical.
 
@3dilettante:
Well said!

@astrograd:
I thought it would be fun to read how you come up with otherworldly "explanations" for this coincidental symmetry of a 16/15 clock bump and the alleged 15/16 (x2) ratio for the eSRAM peak bandwidth. But I thought you wouldn't post it here. In any case, it reads like utter crap. As 3dilettante (and others before him) said correctly, you don't suddenly find a wider interface to your eSRAM, nor that it is magically able to run DDR in production silicon, if it was not designed that way from the beginning.
 
Thank god that was put to rest.

Oh I hope not. I'm hoping for an eSRAM theory involving quantum entanglement and/or room temperature superconducting in response :)

I wanna live in a world where Microsoft Magic knows no bounds because reality is way more boring
 
It's in some of the documentation that vgleaks and others have; it says there's been 1 display plane since the alpha kits, 2 since beta.



Link to someone saying as much?

It's in the DF article on the subject if ya want a link. At that time when it wasn't upclocked yet it was 133GB/s on the eSRAM for real world usage (as opposed to 192GB/s theoretical peak). I also know from my source who told me about the eSRAM boost back in May that it's absolutely real bandwidth and not some funny numerology.

And 1 display plane means absolutely nothing. You need 2 for the game itself in order to even work with them as intended. The VGLeaks entry on them has no mention whatsoever of dev kits getting them, so please provide a link.
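
For reference, here is how the quoted figures line up, assuming the commonly cited 128-byte-per-clock interface behind the 102.4 GB/s number plus the read+write-on-7-of-8-cycles behaviour discussed in this thread (the 133 GB/s figure above is a measured real-world number, not derived here):

```python
# Back-of-the-envelope eSRAM peak bandwidth, assuming a 128-byte-per-clock
# interface plus a second, opposite-direction transfer on 7 of every 8 cycles
# (the 15/16 x2 ratio mentioned earlier in the thread).

BYTES_PER_CLOCK = 128

def esram_peaks(clock_hz):
    base    = BYTES_PER_CLOCK * clock_hz   # one transfer per cycle
    boosted = base * 15 / 8                # (8 + 7) transfers per 8 cycles
    return base / 1e9, boosted / 1e9       # in GB/s

print(esram_peaks(800e6))  # (102.4, 192.0)   -> the pre-upclock figures
print(esram_peaks(853e6))  # ~(109.2, 204.7)  -> after the 16/15 clock bump
```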
 
Interesting thing to notice. :smile:

Maybe the eSRAM was found to double pump on 7 out of every 8 cycles due to the low latency opening up a window to do the double pumping on rising/falling pulse edges. They boost the clock and narrow the pulses, but if they go too far they lose the window for double pumping on those 7 cycles since it takes some amount of time to switch state to do the other operation, so they take it up as far as they can without losing the double pumping. Perhaps they feel bandwidth is more important than raw GPU power?

I would like to see someone with more experience on this than me comment, but it sounds like total rubbish to me. This wouldn't be something you suddenly realise near the end of your design; it would be something you realise early on imo.

Llano and Trinity were bottlenecked by 2-channel DDR3-1600, but not by DDR3-1800 unless OC'd. I would imagine bandwidth is not a major problem for XB1 and its quad-channel DDR3-2133. Especially since it's HSA.
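
For reference, the main-memory side of that works out roughly as follows, assuming four 64-bit DDR3-2133 channels (the usual reading of "quad channel"):

```python
# Rough peak bandwidth for quad-channel DDR3-2133 (assuming 64-bit channels).
channels       = 4
bytes_per_xfer = 64 // 8      # 64-bit channel width
transfers_s    = 2133e6       # DDR3-2133 -> 2133 MT/s

peak_gbs = channels * bytes_per_xfer * transfers_s / 1e9
print(f"{peak_gbs:.1f} GB/s")  # ~68.3 GB/s
```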
 
You don't design a motorcycle and find out halfway through tooling for mass production that it has four wheels and a moon roof.

Not even remotely close to an accurate analogy here.

You cannot Gilligan yourself into a functioning high-speed digital memory pipeline. If the hardware is supposed to change state at specific points on a specific clock signal, it is doing it because the professional engineers made fundamental design decisions early in the process of laying out a multibillion transistor device operating with margins measured in picoseconds.

Neither of us have any idea about the context of this discovery. Let's stop pretending we were in the room when they discovered this. I was told it wasn't something they expected to have on the table, which is similar to what DF was told. We don't really have a justified reason to cast this literally as a totally unforeseen accident that they had no idea was possible. That's a narrative ppl push in an effort to dismiss what has evidently happened. Instead of trying to wish it away with dumb analogies I'd rather speculate as to how it might have happened, especially if it relates to other stuff (like capping the clock at 853MHz).

And multibillion transistor devices are still engineered with error margins.

Picking up on signals not provisioned for in the design isn't a happy accident; that means part of a defined set of state transitions drawn in silicon and wire has gone off the reservation and violated assumptions built into whole sections of the pipeline.

You are adding color to your description where it doesn't belong. Loaded language trying to paint this as a wholly chaotic and spontaneous flickering of electrons swarming all over the place isn't going to help anyone explain what I was told in May nor what DF was told in June. Violating assumptions isn't always a bad thing. As I said, if they were being conservative, which I submit is a rather reasonable and obvious scenario given the context surrounding the manufacturing/design of the eSRAM, then things turning out better than expected shouldn't be twisted into something dangerously foreboding.

The control units, if they weren't designed for this, would not be able to change their signals for the other half of the clock cycle. Either the interface starts spitting out signals that do not make sense, or the chip starts predicting the commands it should have gotten but never will.

What specifically do you mean here? If you are referring to current leakage causing issues then that depends on the type of transistors being used.

Are you stating they stumbled on a node shrink or three while validating their SRAM, or has the industry missed something in the 28nm range that's been in use for years?

If you read my other posts you will see what I am suggesting. Namely that it's possible that during manufacturing/testing they found out their conservative designs in the massive pool of eSRAM turned out to give them more wiggle room than they realized they had. Have ppl been manufacturing 32MB pools of eSRAM for yrs?
 
Let's make up a more reasonable sounding theory:
The hardware engineers know their SRAM well and said it will have a bandwidth of 128 Byte/clock, hence 102.4 GB/s at 800 MHz.
Final silicon came back and the eSRAM has indeed 102.4 GB/s bandwidth at 800 MHz. Now some stupid testing guys ran a few fillrate tests and stumbled upon the fact that the 16 ROPs support half-rate blending with a 4xFP16 framebuffer format. While 8 pixel/clock * 2 * 8 Byte/pixel is indeed 128 Byte/clock, they figured that the Z buffer is accessed in parallel, adding another 32 bit per ROP and clock and bringing the total amount of data up to 192 Byte per clock. This somehow gets mixed up to mean some 192 GB/s theoretical peak in the end.
Let's say the fillrate tests at the revised clockspeed of 853 MHz gave them ~98% of the theoretical peak fillrate (at that mentioned 4xFP16 format with blending, which makes it half rate), i.e. 6.68 Gpixel/s. How did they convert this to a memory bandwidth? Easy! Each pixel requires 16 byte for the blending operation plus 4 byte for the 32bit Z buffer. That results in an allegedly used bandwidth of 6.68 Gpixel/s * 20 Bytes/pixel = 133.6 GB/s. 'Great!' thinks the testing guy and hastily writes a post to some internal developer blog.

Who spots the errors I intentionally put in? Personally I think such a fuckup is more probable than MS discovering AMD actually designed an eSRAM array with twice the bandwidth as specified. Or, even more improbable, that AMD didn't know the theoretical bandwidth of their design. Either it was a miscommunication in the beginning or it is now.
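
For anyone following along, here are those numbers laid out step by step, exactly as stated above (deliberate errors and all); this is not an endorsement of them:

```python
# The fillrate-to-bandwidth arithmetic exactly as described above, including
# whatever was intentionally planted.

rops      = 16
clock_hz  = 853e6
half_rate = 0.5              # 4xFP16 blending is half rate -> 8 pixels/clock
px_bytes  = 2 * 8            # read + write of an 8-byte 4xFP16 pixel = 16 B
z_bytes   = 4                # 32-bit Z per pixel, accessed in parallel

bytes_per_clock = rops * half_rate * px_bytes + rops * z_bytes   # 128 + 64 = 192
peak_fill  = rops * half_rate * clock_hz                         # 6.824 Gpixel/s
meas_fill  = 0.98 * peak_fill                                    # ~6.69 Gpixel/s "measured"
alleged_bw = meas_fill * (px_bytes + z_bytes)                    # 20 bytes per pixel

print(bytes_per_clock, peak_fill / 1e9, meas_fill / 1e9, alleged_bw / 1e9)
# -> 192.0, 6.824, ~6.69, ~133.8 (the post rounds the fillrate to 6.68 first, hence 133.6)
```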
 
@3dilettante:
Well said!

@astrograd:
I thought it would be fun to read how you come up with otherworldly "explanations" for this coincidental symmetry of a 16/15 clock bump and the alleged 15/16 (x2) ratio for the eSRAM peak bandwidth. But I thought you wouldn't post it here. In any case, it reads like utter crap. As 3dilettante (and others before him) said correctly, you don't suddenly find a wider interface to your eSRAM, nor that it is magically able to run DDR in production silicon, if it was not designed that way from the beginning.

You can assert it's "impossible" all day long. It happened. And why would you need a wider interface if it's acting in DDR mode? You're just increasing the transfer rate, not widening the bus. Nobody is claiming they ended up with a DDR system out of nowhere that is akin to one designed that way. What I'm saying is evidently they have a system that can read/write simultaneously on 7/8 of the cycles.

They may have designed it to be a DDR setup but weren't sure if they could get the right temperatures to close those timing windows for it to actually work. None of us here (maybe 3dcgi) know the circumstances of how this came about. I was personally told in May that it was related to timing windows that they didn't expect to have as large as they ended up getting. You guys are using loaded language to paint a picture of them designing one chip and having it magically transform into something completely different. Nobody is suggesting that was the case.
 
Let's make up a more reasonable sounding theory:
The hardware engineers know their SRAM well and said it will have a bandwidth of 128 Byte/clock, hence 102.4 GB/s at 800 MHz.
Final silicon came back and the eSRAM has indeed 102.4 GB/s bandwidth at 800 MHz. Now some stupid testing guys ran a few fillrate tests and stumbled upon the fact that the 16 ROPs support half-rate blending with a 4xFP16 framebuffer format. While 8 pixel/clock * 2 * 8 Byte/pixel is indeed 128 Byte/clock, they figured that the Z buffer is accessed in parallel, adding another 32 bit per ROP and clock and bringing the total amount of data up to 192 Byte per clock. This somehow gets mixed up to mean some 192 GB/s theoretical peak in the end.
Let's say the fillrate tests at the revised clockspeed of 853 MHz gave them ~98% of the theoretical peak fillrate (at that mentioned 4xFP16 format with blending, which makes it half rate), i.e. 6.68 Gpixel/s. How did they convert this to a memory bandwidth? Easy! Each pixel requires 16 byte for the blending operation plus 4 byte for the 32bit Z buffer. That results in an allegedly used bandwidth of 133 GB/s. 'Great!' thinks the testing guy and hastily writes a post to some internal developer blog.

Who spots the errors I intentionally put in? Personally I think such a fuckup is more probable than MS discovering AMD actually designed an eSRAM array with twice the bandwidth as specified. Or, even more improbable, that AMD didn't know the theoretical bandwidth of their design. Either it was a miscommunication in the beginning or it is now.

The issue is entirely based on timing windows which are very, very sensitive to voltages, temperatures, and transistor densities. All of those things could have very well been in flux during manufacturing tests (obviously).

It's hardly unreasonable to think that when tasked with designing a 32MB pool of eSRAM for mass market applications the engineers were rather conservative in their design. It's likewise not unreasonable to think that during product testing and early manufacturing tests these conservative estimates may have been found to significantly overstate how much wiggle room they needed, and as a result the windows for working with data transfers were larger than they could have reliably banked on a priori.

And I've been told it's actual bandwidth, not some numerology game nor some math mistake.
 
I assert it didn't happen. ;)

Well my source says otherwise. So does DF's.

Because it is simply impossible to run an interface designed for SDR operation in DDR mode.

Says who? If you narrow the timings on certain operations, you open up a window for transferring data in. Do that enough, and what is stopping you from reading/writing simultaneously?
 
The issue is entirely based on timing windows which are very, very sensitive to voltages, temperatures, and transistor densities.
All designs are susceptible to these factors. If your planned timings were too aggressive, you have to lower the clock to get it running; if they were very conservative, you have some room to raise the clocks. But it NEVER enables you to run something in a completely different operating mode (like DDR) if it was not designed for that from the beginning. We don't have to discuss that.
 
Says who? If you narrow the timings on certain operations you open up the window for transferring data on. Do that enough and what is stopping you from reading/writing simultaneously?
Simple logic. The actual physical implementation of the design stops you from doing it. If it is designed for SDR operation, it is doing something on each rising clock edge. How should the transistors know that they have to react on both the rising and the falling clock edges? That's not something you can patch in easily just by applying a few different timings. One has to design for that explicitly. Or has anybody ever succeeded in running a PC-133 SDRAM module as DDR-133? :rolleyes:
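
To make the SDR/DDR distinction concrete, a toy sketch (purely illustrative, no relation to the actual eSRAM macros): whether a latch reacts to one clock edge or both is a property of the design, not of the timing margin.

```python
# Toy model: a latch that samples on the rising edge only (SDR-style) versus
# one that samples on both edges (DDR-style). Illustrative only.

def capture(clock, data, edges):
    """Return the data values latched on the selected clock edges."""
    out, prev = [], clock[0]
    for clk, d in zip(clock[1:], data[1:]):
        rising  = prev == 0 and clk == 1
        falling = prev == 1 and clk == 0
        if (rising and "rising" in edges) or (falling and "falling" in edges):
            out.append(d)
        prev = clk
    return out

clock = [0, 1, 0, 1, 0, 1, 0, 1]   # four clock cycles
data  = list("abcdefgh")           # a new value every half cycle

print(capture(clock, data, {"rising"}))             # SDR: ['b', 'd', 'f', 'h']
print(capture(clock, data, {"rising", "falling"}))  # DDR: ['b', 'c', 'd', 'e', 'f', 'g', 'h']
```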
 
All designs are susceptible to these factors. If your planned timings were too aggressive, you have to lower the clock to get it running, if they were very conservative, you have some room to raise the clocks. But it NEVER enables you to run something in a completely different operating mode (like DDR) if it was not designed for that from the beginning. We don't have to discuss about that.

If the timings are tight enough to fit the window given, then what is stopping you from using both rising and falling edges? I don't want you to simply repeat your assumptions. That's not a discussion.

I'll remind you that we aren't necessarily talking about something designed and intended/expected to be a DDR setup. But the design of DDR setups evolved specifically as ppl struggled to get timings tight enough. You could do DDR in principle on an SDR design if those timings were narrowed, but ya couldn't get it to work reliably on all cycles. Here we likewise don't have it working on all cycles.

All the academic research on moving to DDR was done to make it a reliable method, redesigned from the ground up to get those transfer rates up on every clock cycle, i.e. as a robust alternative to trying to exploit SDR. That's where the nuances of redesigning an interface to run that kind of operation came from. That doesn't mean you can't do DDR on an SDR setup though, IF you get around the limitations on the SDR exploits from 6 or 7 yrs ago. Since that shift towards DDR occurred, we have also seen significant shifts in the areas that were limiting those SDR exploits. I submit that it's possible that having it embedded with extremely low latency did much to trim the fat on those timings.
 