Is Quadro's SP count more Marketing than performance?

TARTOFCP · Aug 26, 2010

First of all, I'm sorry for the provocative title and for my poor English.

Recently I came across someone (P) who claims that 'Quadro's SP count is more marketing than performance'.

-----
P's claim: 'Quadro's SP count is marketing'
1. For OpenGL polygon throughput, what matters is the GPC count, not the processor (CUDA core) count.
2. Regardless of the SM (SP) count, the Quadro 5000 processes 3 polygons per clock cycle and the Quadro 6000 processes 4.
3. That is why the ROPs were not cut down. (GTX 470: 40 ROPs, 320-bit; Quadro 6000: 48 ROPs, 384-bit)

These links are the basis of P's opinion.
http://techreport.com/articles.x/19404/4
http://www.behardware.com/articles/787-8/r...tx-480-470.html
-----

So I read several GF100 architecture reviews (and the GF100 whitepaper).
They all say that GF100 is a parallel geometry processing architecture with 16 PolyMorph Engines and 4 Raster Engines.

http://techreport.com/articles.x/18332/2
http://www.bjorn3d.com/read.php?cID=1778&pageID=8317
http://www.scribd.com/doc/35710178/NVIDIA-GF100-Whitepaper

'To facilitate high triangle rates, we designed a scalable geometry engine called the PolyMorph Engine.
Each of the 16 PolyMorph engines has its own dedicated vertex fetch unit and tessellator, greatly expanding geometry performance.'

I think P simply puts the emphasis on the GPC's Raster Engine.

1. - I don't understand why he mentions only the GPC (Raster Engine).

2. - That is only natural.
...... Quadro 5000: 352 CUDA cores (3 GPCs); Quadro 6000: 448 CUDA cores (4 GPCs)
...... 1 GPC needs 1 to 4 SMs (1 Raster Engine per GPC)
...... 1 SM = 32 CUDA cores on GF100 (1 PolyMorph Engine per SM)

As far as I know, the PolyMorph Engine (SM) and the Raster Engine (GPC) are closely related.
techreport.com/articles.x/18332/2
'Once the polymorph engines have finished their work, the resulting data are forwarded the GF100's four raster engines.'

3. - The ROPs can explain AA performance. (GeForce: up to 32x; Quadro: up to 64x)
http://techreport.com/articles.x/18332/4
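
The unit counts in point 2 can be checked with a few lines. This is only a sketch: it assumes the 32-cores-per-SM and one-PolyMorph-Engine-per-SM figures quoted above, takes the GPC counts as stated in this thread, and the function name is made up for illustration.

```python
CORES_PER_SM = 32  # GF100: 32 CUDA cores per SM (figure quoted in the post)


def sm_count(cuda_cores: int) -> int:
    """Number of SMs (and so PolyMorph Engines) implied by a CUDA core count."""
    if cuda_cores % CORES_PER_SM != 0:
        raise ValueError("core count is not a multiple of 32")
    return cuda_cores // CORES_PER_SM


# (name, CUDA cores, GPC count as claimed in this thread)
cards = [("Quadro 5000", 352, 3), ("Quadro 6000", 448, 4)]

for name, cores, gpcs in cards:
    sms = sm_count(cores)
    print(f"{name}: {sms} SMs / PolyMorph Engines, {gpcs} Raster Engines")
```

So, under these assumptions, the two cards differ in PolyMorph Engine count as well as in Raster Engine count.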

Also, I can show why the SP count is not only marketing:
Adobe Premiere Pro CS5's Mercury Playback Engine GPU acceleration (or RapiHD, the Elemental Accelerator, on GT200),
mental images iray, Arion Render, Octane Render, etc. (see the CUDA showcase).

And these:
http://www.awn.com/articles/article/fermi-entering-era-computational-visualization/page/1,1
http://pressroom.nvidia.com/easyir/...rsion=live&releasejsp=release_157&prid=645616

Reference 1
NVIDIA Fermi Quadro 6000
GPU clock: 574 MHz
CUDA cores: 448, clock 1148 MHz
Memory: 384-bit, 6 GB, clock 1500 (750 * 2) MHz
ROPs: 48
OpenGL 4.x
Shader Model 5.x
1.3 billion triangles per second (based on GLperf, run by NVIDIA Performance Lab)

Could you explain it so I can understand more easily?

1. Is the SM (PolyMorph Engine) / SP (CUDA core) count not particularly useful for OpenGL performance?

2. Why does the Quadro have more ROPs than the GeForce? (For OpenGL? For AA? For memory (bus width, capacity)?)

3. Why is the Quadro 6000 rated at 1.3 BTris/s? (Why not 1.9 to 2.4 BTris/s? How is it derived?)
ex) GTX 470: 2428 MTris/s = 4 * 607 (4 GPCs * GPU clock in MHz)
I don't understand how the 1.3 BTris/s result comes about (but I think the SMs (PolyMorph Engines) influence it).

4. Which matters more for OpenGL performance, the PolyMorph Engine or the Raster Engine?
(Both, surely, but I think the PE matters more than the RE.)

Reference 2
'Once the polymorph engines have finished their work, the resulting data are forwarded the GF100's four raster engines.
Optimally, each one of those engines can process a single triangle per clock cycle.
The GF100 can thus claim a peak theoretical throughput rate of four polygons per cycle, although Alben called that "the impossible-to-achieve rate," since other factors will limit throughput in practice.
Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.'

'Fermi can (theoretically) produce 4 triangles at once. The reality is that it can process about 2.5 - 2.7 simultaneously.
That might not seem like a lot but previous GPU's processed one so even 2.5 per clock is a 250% polygon processing performance increase.'
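
To put the per-clock figures quoted above next to the 1.3 billion number, here is a rough sketch. It assumes the triangle rate is simply triangles per clock times the 574 MHz GPU clock from Reference 1; the function names are made up for illustration and nothing here comes from NVIDIA.

```python
QUADRO_6000_CLOCK_MHZ = 574  # GPU clock from Reference 1


def mtris_per_second(tris_per_clock: float, clock_mhz: float) -> float:
    """Triangle rate in millions per second for a given per-clock rate."""
    return tris_per_clock * clock_mhz


def implied_tris_per_clock(mtris: float, clock_mhz: float) -> float:
    """Per-clock rate that a measured Mtris/s figure corresponds to."""
    return mtris / clock_mhz


# Per-clock rates mentioned in the two quotes above
for label, rate in [("theoretical peak", 4.0),
                    ("directed tests", 3.2),
                    ("review estimate, low", 2.5),
                    ("review estimate, high", 2.7)]:
    mtris = mtris_per_second(rate, QUADRO_6000_CLOCK_MHZ)
    print(f"{label}: {rate} tris/clock -> {mtris:.0f} Mtris/s")

# What the advertised 1.3 billion triangles/s would mean per clock
print(f"{implied_tris_per_clock(1300, QUADRO_6000_CLOCK_MHZ):.2f} tris/clock")
```

If this simple model holds, the advertised figure sits below even the lower review estimate, which is the gap I am asking about.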

Each rasterizer can do 8 pixels per clock, for a total of 32 pixels per clock over the entirety of GF100.
4 GPCs = 32 pixels per clock * 574 MHz (Quadro 6000) = 18.4 Gpixels/s
48 ROPs = 48 pixels per clock * 574 MHz (Quadro 6000) = 27.6 Gpixels/s
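
The same two fill-rate figures as a small sketch. It assumes one pixel per ROP per clock and 8 pixels per rasterizer per clock, as in the lines above; the function name is only illustrative.

```python
def gpixels_per_second(pixels_per_clock: int, clock_mhz: float) -> float:
    """Pixel rate in Gpixels/s for a given per-clock pixel count."""
    return pixels_per_clock * clock_mhz / 1000.0


CLOCK_MHZ = 574           # Quadro 6000 GPU clock
RASTER_ENGINES = 4        # one per GPC
PIXELS_PER_RASTER = 8     # pixels per clock per rasterizer
ROPS = 48                 # assumed 1 pixel per clock each

raster_rate = gpixels_per_second(RASTER_ENGINES * PIXELS_PER_RASTER, CLOCK_MHZ)
rop_rate = gpixels_per_second(ROPS, CLOCK_MHZ)

print(f"rasterizers: {raster_rate:.2f} Gpixels/s")
print(f"ROPs:        {rop_rate:.2f} Gpixels/s")
print(f"lower of the two: {min(raster_rate, rop_rate):.2f} Gpixels/s")
```

Under these assumptions the rasterizers, not the ROPs, would be the lower of the two limits.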

Thank you for reading.

cho · Aug 26, 2010

I think the "1.3 BTris" figure comes from some low-level benchmark, for example GLperf.

CarstenS · Aug 26, 2010

TARTOFCP said:
Could you explain it so I can understand more easily?

1. Is the SM (PolyMorph Engine) / SP (CUDA core) count not particularly useful for OpenGL performance?

2. Why does the Quadro have more ROPs than the GeForce? (For OpenGL? For AA? For memory (bus width, capacity)?)

3. Why is the Quadro 6000 rated at 1.3 BTris/s? (Why not 1.9 to 2.4 BTris/s? How is it derived?)
ex) GTX 470: 2428 MTris/s = 4 * 607 (4 GPCs * GPU clock in MHz)
I don't understand how the 1.3 BTris/s result comes about (but I think the SMs (PolyMorph Engines) influence it).

4. Which matters more for OpenGL performance, the PolyMorph Engine or the Raster Engine?
(Both, surely, but I think the PE matters more than the RE.)

1. You seem to confuse OpenGL performance with wireframe or geometry performance. If that is all you're talking about, the major bottleneck is still the front end of the pipeline; only if you're doing more sophisticated stuff with your polygons (maybe even at the pixel level) will you not run into the limitation imposed by that first part of the pipeline.

2. Simply put: memory. Each ROP partition is tied to a 64-bit memory controller, and only with the full ROP count can you use the full amount of memory, which is imperative for professional workloads.

3. I've asked the same question. The answer was as cho already said: it is not the theoretical peak but the observed performance in a low-level benchmark.

4. See 1. It depends on what you are going to do with your OpenGL programs. Do a lot of fancy stuff adding or animating polys: PME. Just throwing millions and millions of triangles into a mesh: raster engine.
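
Regarding point 2, the link between ROP count and memory interface can be sketched like this. It assumes 8 ROPs per partition with one 64-bit controller per partition; the names below are made up for illustration.

```python
ROPS_PER_PARTITION = 8      # assumed GF100 layout: 8 ROPs per partition
BITS_PER_CONTROLLER = 64    # one 64-bit memory controller per partition


def memory_bus_width(rops: int) -> int:
    """Memory bus width in bits implied by the number of enabled ROPs."""
    partitions, remainder = divmod(rops, ROPS_PER_PARTITION)
    if remainder:
        raise ValueError("ROP count is not a whole number of partitions")
    return partitions * BITS_PER_CONTROLLER


for name, rops in [("GTX 470", 40), ("Quadro 6000", 48)]:
    print(f"{name}: {rops} ROPs -> {memory_bus_width(rops)}-bit bus")
```

Disabling one partition therefore takes its memory channel, and the memory attached to it, along with it.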

3dcgi · Aug 27, 2010

I agree with Carsten. Many workstation apps don't involve fancy texturing and shading (vertex or fragment), so geometry and rasterization performance is what matters most in these situations. The two most important things for great performance in these apps are the raster engine and optimized drivers.

Artists that use programs like Maya might have viewports supporting fancy shading, but much of the time they'll still work with wireframe and untextured models.

TARTOFCP · Aug 27, 2010

Thank you for all the answers.

However, there are still parts I do not understand.

1. Forgive me.
My grasp of the concept is still lacking.

I ask because I have seen these:

http://www.nvidia.com/object/quadro-fermi-highlights.html
-Scalable Geometry Engine-

http://www.nvidia.com/object/IO_89569.html
NVIDIA_GF100_Whitepaper

'GF100’s entire graphics pipeline is designed to deliver high performance in tessellation and geometry throughput.
GF100 replaces the traditional geometry processing architecture at the front end of the graphics pipeline with an entirely new distributed geometry processing architecture that is implemented using multiple “PolyMorph Engines”.
Each PolyMorph Engine includes a tessellation unit, an attribute setup unit, and other geometry processing units.
Each SM has its own dedicated PolyMorph Engine (we provide more details on the Polymorph Engine in the GF100 architecture sections below).
Newly generated primitives are converted to pixels by four Raster Engines that operate in parallel (compared to a single Raster Engine in prior generation GPUs).
On-chip L1 and L2 caches enable high bandwidth transfer of primitive attributes between the SM and the tessellation unit as well as between different SMs.
Tessellation and all its supporting stages are performed in parallel on GF100, enabling breathtaking geometry throughput.

While GF100 includes many enhancements and performance improvements over past GPU architectures, the ability to perform parallel geometry processing is possibly the single most important GF100 architectural improvement.'

'Game developers tend to use relatively simple geometric models due to the limited bandwidth of the PCI Express bus and the modest geometry throughput of current GPUs.'

'Using GPU-based tessellation, a game developer can send a compact geometric representation of an object or character, and the tessellator unit can produce the correct geometric complexity for the specific scene. We’ll now go into greater detail discussing the characteristics and benefits of tessellation in combination with displacement mapping.'

'To facilitate high triangle rates, we designed a -Scalable Geometry Engine- called the PolyMorph Engine.'

'Each of the 16 PolyMorph engines has its own dedicated vertex fetch unit and tessellator, greatly expanding geometry performance.
In conjunction, we also designed four parallel Raster Engines, allowing up to four triangles to be setup per clock.
Together, they enable breakthrough triangle fetch, tessellation, and rasterization performance.'

'Tessellation requires new levels of triangle and rasterization performance.
The PolyMorph Engine dramatically increases triangle, tessellation, and Stream Out performance.
Four parallel Raster Engines provide sustained throughput in triangle setup and rasterization.
By having a dedicated tessellator for each SM, and a Raster Engine for each GPC, GF100 delivers up to 8× the geometry performance of GT200.'

2. Memory: that answer was a bit surprising.
(I know the structure a little: 1 partition = 8 ROPs + one 64-bit memory controller.)

I had thought AA was the main reason (though memory is important too).
GeForce: up to 32x; Quadro: up to 64x (single card)

3. I need an explanation of what "low level" means here,
because I was thinking the PolyMorph Engines would affect the GLperf result.

P's links have similar data.

http://www.behardware.com/articles/787-7/report-nvidia-geforce-gtx-480-470.html
GTX 470, tessellation with high culling: 1311 Mtri/s.
(OpenGL 4.0 supports tessellation.)

And perhaps data like the following involves only the raster engine.

GTX 470, 100% culled: 1959 Mtri/s.
GTX 470: 607 MHz * 3.2 = 1942 Mtri/s.
(Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.)
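
A small sketch of the comparison I am making: it turns the two measured GTX 470 figures into triangles per clock, assuming the rate is simply Mtri/s divided by the 607 MHz GPU clock. The function name is only illustrative.

```python
GTX_470_CLOCK_MHZ = 607  # GPU clock used in the calculations above


def tris_per_clock(mtris: float, clock_mhz: float) -> float:
    """Triangles per clock implied by a measured Mtri/s figure."""
    return mtris / clock_mhz


# Measured GTX 470 figures cited from the behardware article
measurements = {
    "100% culled": 1959,
    "tessellation, high culling": 1311,
}

for label, mtris in measurements.items():
    rate = tris_per_clock(mtris, GTX_470_CLOCK_MHZ)
    print(f"{label}: {mtris} Mtri/s -> {rate:.2f} tris/clock")
```

On this reading, the fully culled result is close to the 3.2 triangles per clock from the directed tests, while the tessellation result is well below it.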

4. That was helpful.
(It depends on what you are going to do with your OpenGL programs.)

I have a question about what 3dcgi wrote.
Wouldn't the PolyMorph Engines (or the SM and SP count) matter for VFX (OpenGL effects)?
As far as I know, that market is larger.

I forgot an important question.

112. I would like to hear people's opinion about this.
'is Quadro's SP more marketing than performance?'

It looks that way to me.
Judging by NVIDIA's own Quadro material, the SM/SP count seems to be used mainly for promotion.

http://www.nvidia.com/object/quadro-fermi-highlights.html

2010 Analyst Day Presentations - Jeff Brown - Quadro
http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Mzk2ODB8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

Thank you for reading.
(Please understand that I am not good at English.)

trinibwoy · Aug 28, 2010

TARTOFCP said:
GTX 470: 2428 MTris/s = 4 * 607 (4 GPCs * GPU clock in MHz)
I don't understand how the 1.3 BTris/s result comes about (but I think the SMs (PolyMorph Engines) influence it).

I think the correct calculation uses the culling rate, which according to hardware.fr is # SMs * clock / 4. So it would be 14 * 607 / 4 = 2124 MTris/s. On a full chip with no SMs disabled, this would match the GPC-based calculation.

Also, the theoretical peak is only for triangles 8 pixels in size or smaller. Anything bigger than that would require multiple cycles in the rasterizer.
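
Both points as a quick sketch. The culling formula is the hardware.fr one mentioned above; the rasterizer part is a deliberately simplified model (8 pixels per clock, rounding up, ignoring everything else in the pipeline), and the function names are made up for illustration.

```python
import math


def cull_rate_mtris(sm_count: int, clock_mhz: float) -> float:
    """Culling rate in MTris/s: number of SMs * clock / 4."""
    return sm_count * clock_mhz / 4


def raster_cycles(triangle_pixels: int, pixels_per_clock: int = 8) -> int:
    """Rasterizer cycles for one triangle in the simplified model."""
    return max(1, math.ceil(triangle_pixels / pixels_per_clock))


print(cull_rate_mtris(14, 607))  # GTX 470 with 14 SMs enabled
print(cull_rate_mtris(16, 607))  # full chip, equals 4 GPCs * 607 MHz

# Larger triangles occupy a rasterizer for more than one cycle
for pixels in (4, 8, 9, 64):
    print(f"{pixels}-pixel triangle: {raster_cycles(pixels)} cycle(s)")
```

So with 16 SMs the SM-based and GPC-based numbers agree, and the peak only applies while triangles stay at or below 8 pixels.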
