AMD: Speculation, Rumors, and Discussion (Archive)

mczak · Jun 11, 2016

Jawed said:
Sorry, I was wrong.

Ah ok. It looked plausible enough

(Just because it looks like it should be possible to fit 16 tris into a wave doesn't mean the hw really can do it...). In any case, with 16 tris per wave that means you may need 24kB of LDS just for attributes in the worst case (per wave).
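One set of assumptions that reproduces the 24kB figure, as a rough sketch: the post only gives the 16 tris per wave and the 24kB total. The per-vertex numbers below (32 vec4 attributes, 32-bit components, all three vertices of each triangle stored) are assumed for illustration, not taken from the thread.

```python
# Worst-case LDS footprint for attribute data per wave.
# Only TRIS_PER_WAVE comes from the post; the other constants are assumptions.
TRIS_PER_WAVE = 16        # from the post
VERTS_PER_TRI = 3         # per-vertex attribute data kept for each triangle
ATTRS_PER_VERTEX = 32     # assumed worst case: 32 vec4 attributes
COMPONENTS_PER_ATTR = 4   # vec4
BYTES_PER_COMPONENT = 4   # 32-bit float


def lds_bytes_per_wave(tris: int = TRIS_PER_WAVE) -> int:
    """Bytes of LDS needed to hold attributes for `tris` triangles."""
    bytes_per_vertex = ATTRS_PER_VERTEX * COMPONENTS_PER_ATTR * BYTES_PER_COMPONENT
    return tris * VERTS_PER_TRI * bytes_per_vertex


total = lds_bytes_per_wave()
print(f"{total} bytes = {total / 1024} KiB")  # 24576 bytes = 24.0 KiB
```

With fewer attributes per vertex the footprint shrinks linearly, which is why this is a worst case.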

kalelovil · Jun 11, 2016

That particular part sounds like yet another reheated Cape Verde.
Perhaps Apple will offer Polaris 11 as an upgrade option/in a higher SKU.

CarstenS · Jun 11, 2016

Love_In_Rio said:
Performance over a 980 and Fury in graphics test 4 could imply great tessellation improvement or new triangle culling hardware being really good.

I hope it's the former, because I'm still not entirely convinced the latter is a hardware solution (as in dedicated logic) rather than GeometryFX-like driver injection.

FWIW, Fury X with tessellation set to a max of 2x is around 59-ish fps in that GT4, while tessellation switched off entirely allows for some fps north of 70. Note that I am not saying that this is what AMD is/will be doing. I was just curious how much performance impact tessellation has in that particular test and RSCE offers such a convenient way to try it.
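To put those two numbers in frame-time terms, here is a small sketch. It takes 59 fps for the 2x cap and exactly 70 fps for tessellation off; since the post says "north of 70", the difference computed here is a lower bound rather than a measured value.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


tess_2x_ms = frame_time_ms(59.0)    # tessellation capped at 2x (from the post)
tess_off_ms = frame_time_ms(70.0)   # tessellation off; 70 is an assumed floor

delta_ms = tess_2x_ms - tess_off_ms  # extra time per frame with 2x tessellation
share = delta_ms / tess_2x_ms        # fraction of the 2x frame spent on that extra

print(f"2x cap: {tess_2x_ms:.2f} ms, off: {tess_off_ms:.2f} ms")
print(f"difference: {delta_ms:.2f} ms ({share:.1%} of the 2x frame)")
```

Comparing frame times rather than fps makes it easier to see how much of each frame the remaining tessellation work accounts for.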

Malo · Jun 11, 2016

What % of PC enthusiasts have clear side panels and PC in a position to actually see inside as well? What the GPU looks like should only be relevant to those designs specifically designed for that particular consumer market, with LEDs etc. Most GPU designs should be focused on efficiency.

Kaotik · Jun 11, 2016

Malo said:
What % of PC enthusiasts have clear side panels and PC in a position to actually see inside as well? What the GPU looks like should only be relevant to those designs specifically designed for that particular consumer market, with LEDs etc. Most GPU designs should be focused on efficiency.

TBH more and more, because more and more cases have either windows or even full glass sidepanels. (I myself don't have either one, though I think, without checking under the desk, that my sidepanel does have a grille for a fan.)

Grall · Jun 11, 2016

Malo said:
What the GPU looks like should only be relevant to those designs specifically designed for that particular consumer market, with LEDs etc.

Just knowing that a graphics card has LEDs improves framerates and rendering quality.

Scientific fact!

Anarchist4000 · Jun 11, 2016

CarstenS said:
I hope it's the former, because I'm still not entirely convinced the latter is a hardware solution (as in dedicated logic) rather than GeometryFX-like driver injection.

FWIW, Fury X with tessellation set to a max of 2x is around 59-ish fps in that GT4, while tessellation switched off entirely allows for some fps north of 70. Note that I am not saying that this is what AMD is/will be doing. I was just curious how much performance impact tessellation has in that particular test and RSCE offers such a convenient way to try it.

May be worth withholding judgment until we see the architecture. Fast scalar processors and the potential ability to regroup waves, while sort of a software solution, would seem to change some of those dynamics significantly. A software solution in that case may be a far more flexible option and maintain performance. The culling process could be occurring in stages as well. Seems likely Nvidia will be doing something similar with Volta, as there are a lot of papers about scalars floating about for both architectures.

CSI PC · Jun 11, 2016

Anarchist4000 said:
May be worth withholding judgment until we see the architecture. Fast scalar processors and the potential ability to regroup waves, while sort of a software solution, would seem to change some of those dynamics significantly. A software solution in that case may be a far more flexible option and maintain performance. The culling process could be occurring in stages as well. Seems likely Nvidia will be doing something similar with Volta, as there are a lot of papers about scalars floating about for both architectures.

Nvidia also has a fair few papers on Temporal SIMT (including patents), alongside work from several independent engineers, and Bill Dally has talked about it for maybe 2018 in relation to their Tesla architecture.
They started to present the Temporal SIMT concept once the patent went public in 2013.
This is what they outlined as their 2018 vision a while back:

2018 Vision: Echelon Compute Node & System
Key architectural features:
• Malleable memory hierarchy
• Hierarchical register files
• Hierarchical thread scheduling
• Place coherency/consistency
• Temporal SIMT & scalarization

I was debating whether it made sense to follow up in this thread with the other research papers/presentations on the relevance of Temporal SIMT that I posted about earlier in post #457:
Possibly more relevant for now is Jan Lucas explaining why Temporal SIMT has advantages over scalarization in conventional GPUs; he mentions GCN as an example, but he may be more focused on the Temporal SIMT solution, and I appreciate his views may not sit well with everyone: http://lpgpu.org/wp/wp-content/uploads/2014/09/lpgpu_scalarization_169_Jan_Lucas.pdf
He is heavily involved with this concept and was in one of the papers in the previous post showing real results (DART) using Temporal SIMT based upon an early NVIDIA CUDA core.

Post #453 as reference: https://forum.beyond3d.com/threads/...max-110-130w-range.58003/page-23#post-1920449
Cheers

Edit:
I should say context is Temporal SIMT with Scalarization.

Razor1 · Jun 11, 2016

https://www.flickr.com/photos/107825676@N07/27573213556/

Don't think this is real......

renderstate · Jun 11, 2016

What's the difference between temporal SIMT and a bunch of independent scalar cores?

Silent_Buddha · Jun 11, 2016

Malo said:
What % of PC enthusiasts have clear side panels and PC in a position to actually see inside as well? What the GPU looks like should only be relevant to those designs specifically designed for that particular consumer market, with LEDs etc. Most GPU designs should be focused on efficiency.

I just want a GPU that runs at stock speed and exhausts the heat out of the back of the case. I'm long past caring about what my computer, much less the innards of the computer, looks like.

And I want the heat exhausted out the back because I run my computer semi-passively. There's just one 120 mm silent fan that rarely turns on servicing the entire computer (no CPU fan) with a hybrid PSU that usually runs without activating its fan.

Regards,
SB

CarstenS · Jun 11, 2016

Anarchist4000 said:
May be worth withholding judgment until we see the architecture. Fast scalar processors and the potential ability to regroup waves, while sort of a software solution, would seem to change some of those dynamics significantly. A software solution in that case may be a far more flexible option and maintain performance. The culling process could be occurring in stages as well. Seems likely Nvidia will be doing something similar with Volta, as there are a lot of papers about scalars floating about for both architectures.

Oh, I am not judging. I am merely giving the impression I am currently under. That can and will change as more details emerge. Actually, I would find it kind of funny seeing Radeons running circles around Geforces in titles with excessive tessellation usage.

Razor1 · Jun 11, 2016

renderstate said:
What's the difference between temporal SIMT and a bunch of independent scalar cores?

http://www.cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/ieee-micro-echelon.pdf

This is a pretty good start on that.

renderstate · Jun 11, 2016

I read it but it doesn't really say how each lane can fetch and decode independent instructions. If each lane can fetch a different instruction that's 8X more ICACHE BW!

Perhaps there is a clever solution for this, but to me it sounds like you've got to have N scalar cores that share a bunch of other logic to amortize their cost. Temporal SIMT sounds better than 'super dumb LIW cores that share some resources'
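The 8X figure follows from simple counting, sketched below. The lane count of 8 is implied by the post; the 8-byte instruction size and the one-fetch-per-cycle model are assumptions made only to get concrete numbers, and the function is a hypothetical illustration rather than a model of any real design.

```python
def icache_bytes_per_cycle(lanes: int, instr_bytes: int, independent_fetch: bool) -> int:
    """Peak instruction-fetch bandwidth for one group of lanes.

    Shared fetch (classic SIMT): one instruction per cycle feeds every lane.
    Independent fetch: each lane may pull its own instruction every cycle.
    """
    fetches_per_cycle = lanes if independent_fetch else 1
    return fetches_per_cycle * instr_bytes


LANES = 8        # implied by the "8X" in the post
INSTR_BYTES = 8  # assumed instruction size, for illustration only

shared = icache_bytes_per_cycle(LANES, INSTR_BYTES, independent_fetch=False)
independent = icache_bytes_per_cycle(LANES, INSTR_BYTES, independent_fetch=True)

print(f"shared fetch: {shared} B/cycle, independent fetch: {independent} B/cycle")
print(f"ratio: {independent // shared}x")
```

The ratio equals the lane count regardless of the assumed instruction size, so the worst-case bandwidth cost scales directly with how many lanes are allowed to diverge at once.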

Grall · Jun 12, 2016

Silent_Buddha said:
I just want a GPU that runs at stock speed that exhausts the heat out of the back of the case.

Then the reference 480(X) should be pretty ideal. Its design seems a marked improvement over previous AMD designs: putting power delivery at the back of the PCB instead of up front made the PCB shorter and gave the blower the ability to suck air from both sides of the card. It's disappointing, though, that AMD chose to gimp power delivery so much with just a single auxiliary six-pin power connector, with such a smart cooler design and all.

They always have to F up in some way don't they.

CSI PC · Jun 12, 2016

renderstate said:
I read it but it doesn't really say how each lane can fetch and decode independent instructions. If each lane can fetch a different instruction that's 8X more ICACHE BW!

Perhaps there is a clever solution for this, but to me it sounds like you've got to have N scalar cores that share a bunch of other logic to amortize their cost. Temporal SIMT sounds better than 'super dumb LIW cores that share some resources'

Well, Nvidia went that route of LIW publicly back in 2011, as shown in the paper linked.
But Nvidia only went public in 2013 with a more advanced solution, Temporal SIMT + scalarization.
Did you take a look at the patent and the associated papers (such as the Jan Lucas paper/presentations)?
http://www.freepatentsonline.com/y2013/0042090.html

Thanks
Edit:
If we go further into the definition of Temporal SIMT and the terminology used by Nvidia, additional posts should maybe go in Nvidia-specific threads rather than here.
I was just responding to the thoughts about AMD and Nvidia using a similar scalar solution.
Nvidia is looking at an evolved solution, possibly for CUDA cores, around the concept of Temporal SIMT with scalarization, which they and others have been pushing since 2013 with a vision of around 2018.

renderstate · Jun 12, 2016

Even in this case it's not discussed how instructions are magically fetched from memory without having to provide an insanely large (in terms of area) instruction cache and instruction bandwidth. I am sure you can build it, but is it worth it? I am a bit skeptical.

CSI PC · Jun 12, 2016

renderstate said:
Even in this case it's not discussed how instructions are magically fetched from memory without having to provide an insanely large (in terms of area) instruction cache and instruction bandwidth. I am sure you can build it, but is it worth it? I am a bit skeptical.

They were still talking about it last year as a requirement for eliminating waste/redundancy and improving efficiency, all of which are critical when looking to evolve towards an exascale-capable Tesla GPU.
P100 is part of that push, and yeah, any future technology raises the question of what headaches it will cause across Tesla/Quadro/consumer and the lower cards.
But we digress.
Cheers

Razor1 · Jun 12, 2016

renderstate said:
Even in this case it's not discussed how instructions are magically fetched from memory without having to provide an insanely large (in terms of area) instruction cache and instruction bandwidth. I am sure you can build it, but is it worth it? I am a bit skeptical.

Well, Echelon has been on the roadmaps, and we've been hearing about it for what, like 7 years lol? Yeah, it's been pushed out because it wasn't worth making. But eventually it will become worth it, and it is coming up, probably in a couple of years.

Deleted member 2197 · Jun 12, 2016
