Is Quadro's SP count more Marketing than performance?

Discussion in 'Beginners Zone' started by TARTOFCP, Aug 26, 2010.

  1. TARTOFCP

    Newcomer

    Joined:
    Aug 26, 2010
    Messages:
    8
    Likes Received:
    1
    First of all, I'm sorry for the provocative title and my poor English.


    Recently I came across a person (P) who claims that 'Quadro's SP count is more marketing than performance'.

    -----
    P's claim: 'Quadro's SP count is marketing'
    1. For polygon throughput in OpenGL, what matters is the number of GPCs, not the processor (CUDA core) count.
    2. Regardless of the SM (SP) count, the Quadro 5000 processes 3 polygons per clock cycle and the Quadro 6000 processes 4.
    3. That is why the ROP count was not cut down. (GTX 470: 40 ROPs, 320-bit; Quadro 6000: 48 ROPs, 384-bit)

    These links are the basis of P's opinion.
    http://techreport.com/articles.x/19404/4
    http://www.behardware.com/articles/787-8/r...tx-480-470.html
    -----


    So I read several GF100 architecture reviews (and the GF100 whitepaper).
    They all say that GF100 has a parallel geometry processing architecture: 16 PolyMorph Engines and 4 Raster Engines.

    http://techreport.com/articles.x/18332/2
    http://www.bjorn3d.com/read.php?cID=1778&pageID=8317
    http://www.scribd.com/doc/35710178/NVIDIA-GF100-Whitepaper

    'To facilitate high triangle rates, we designed a scalable geometry engine called the PolyMorph Engine.
    Each of the 16 PolyMorph engines has its own dedicated vertex fetch unit and tessellator, greatly expanding geometry performance.'


    I think P is simply placing all the emphasis on the GPC's Raster Engine.

    1. - I don't understand why he mentions only the GPC (Raster Engine).

    2. - That is only natural.
    ...... Quadro 5000: 352 CUDA cores (3 GPCs); Quadro 6000: 448 CUDA cores (4 GPCs)
    ...... 1 GPC = 1 to 4 SMs (1 Raster Engine per GPC)
    ...... 1 SM = 32 CUDA cores on GF100 (1 PolyMorph Engine per SM)

    As far as I know, the PolyMorph Engine (one per SM) and the Raster Engine (one per GPC) are closely related.
    techreport.com/articles.x/18332/2
    'Once the polymorph engines have finished their work, the resulting data are forwarded the GF100's four raster engines.'

    3. - The ROP count can be explained by AA performance. (GeForce: up to 32x; Quadro: up to 64x)
    http://techreport.com/articles.x/18332/4
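
    To make the unit counts in point 2 concrete, here is a minimal Python sketch. It assumes only the figures quoted above (32 CUDA cores per SM on GF100, one PolyMorph Engine per SM, and the GPC counts P gives); the function and variable names are made up for illustration.

```python
# Illustrative arithmetic only; all names here are hypothetical.
CORES_PER_SM = 32  # GF100: 32 CUDA cores per SM, one PolyMorph Engine per SM


def sm_count(cuda_cores: int) -> int:
    """Return the number of enabled SMs implied by a CUDA core count."""
    if cuda_cores % CORES_PER_SM != 0:
        raise ValueError("core count is not a multiple of 32")
    return cuda_cores // CORES_PER_SM


# (CUDA cores, GPCs) as quoted in this post; the GPC counts are P's claim.
boards = {"Quadro 5000": (352, 3), "Quadro 6000": (448, 4)}

for name, (cores, gpcs) in boards.items():
    sms = sm_count(cores)
    print(f"{name}: {sms} SMs / PolyMorph Engines, {gpcs} Raster Engines")
```

    If these figures are right, the two boards differ in both Raster Engine count (3 vs 4) and PolyMorph Engine count (11 vs 14), so the GPC count alone does not tell them apart.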


    Also, I can give examples of why the SP count is not only marketing:
    Adobe Premiere Pro CS5's Mercury Playback Engine GPU acceleration (or RapiHD = Elemental Accelerator on GT200),
    mental images iray, Arion Render, Octane Render, etc. (see the CUDA showcase).

    And these:
    http://www.awn.com/articles/article/fermi-entering-era-computational-visualization/page/1,1
    http://pressroom.nvidia.com/easyir/...rsion=live&releasejsp=release_157&prid=645616



    Reference 1
    NVIDIA Fermi Quadro 6000
    GPU clock: 574 MHz
    CUDA cores: 448, clock 1148 MHz
    Memory: 384-bit, 6 GB, clock 1500 (750 * 2) MHz
    ROPs: 48
    OpenGL 4.x
    Shader Model 5.x
    1.3 billion triangles per second (based on GLperf, run by NVIDIA Performance Lab)




    Could you explain the following so that I can understand it more easily?

    1. Is the SM (PolyMorph Engine) / SP (CUDA core) count really not particularly useful for OpenGL performance?

    2. Why does the Quadro have more ROPs than the GeForce? (OpenGL? AA? Memory (bus width, capacity)?)

    3. Why is the Quadro 6000 rated at 1.3 BTris/s? (Why not 1.9 to 2.4 BTris/s? How is it derived?)
    e.g. GTX 470: 2428 MTris/s = 4 * 607 (4 GPCs * GPU clock in MHz)
    I don't understand how the result comes out at 1.3 BTris/s. (But I think the SMs (PolyMorph Engines) influence the result.)

    4. Which has more effect on (or is more important for) OpenGL performance: the PolyMorph Engine or the Raster Engine?
    (Both, surely, but I think the PolyMorph Engine more than the Raster Engine.)




    Reference 2
    'Once the polymorph engines have finished their work, the resulting data are forwarded the GF100's four raster engines.
    Optimally, each one of those engines can process a single triangle per clock cycle.
    The GF100 can thus claim a peak theoretical throughput rate of four polygons per cycle, although Alben called that "the impossible-to-achieve rate," since other factors will limit throughput in practice.
    Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.'

    'Fermi can (theoretically) produce 4 triangles at once. The reality is that it can process about 2.5 - 2.7 simultaneously.
    That might not seem like a lot but previous GPU's processed one so even 2.5 per clock is a 250% polygon processing performance increase.'

    Each rasterizer can do 8 pixels per clock, for a total of 32 pixels per clock over the entirety of GF100.
    4 GPCs = 32 pixels per clock * 574 MHz (Quadro 6000) = approx. 18.4 Gpixels/s
    48 ROPs = 48 pixels per clock * 574 MHz (Quadro 6000) = approx. 27.6 Gpixels/s
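
    The two fill-rate lines above are just pixels per clock multiplied by the core clock. A small Python sketch of that arithmetic, assuming the 574 MHz clock from Reference 1 and using hypothetical helper names:

```python
GPU_CLOCK_MHZ = 574        # Quadro 6000 core clock, from Reference 1
PIXELS_PER_RASTERIZER = 8  # per clock, per raster engine (one per GPC)


def gpixels_per_second(pixels_per_clock: int, clock_mhz: float) -> float:
    """Peak pixel rate in Gpixels/s for a given per-clock width."""
    return pixels_per_clock * clock_mhz / 1000.0


raster_rate = gpixels_per_second(4 * PIXELS_PER_RASTERIZER, GPU_CLOCK_MHZ)
rop_rate = gpixels_per_second(48, GPU_CLOCK_MHZ)

print(f"4 raster engines: {raster_rate:.3f} Gpixels/s")
print(f"48 ROPs:          {rop_rate:.3f} Gpixels/s")
```

    On these numbers the ROPs can accept more pixels per clock than the four raster engines produce.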


    Thank you for reading.
     
  2. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
    I think the "1.3 BTris" figure comes from some low-level benchmark, for example GLperf.
     
    #2 cho, Aug 26, 2010
    Last edited by a moderator: Aug 31, 2010
  3. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    1. You seem to confuse OpenGL performance with wireframe or geometry performance. If you're talking only about that, the major bottleneck is still the front end of the shading pipeline; only if you're doing more sophisticated stuff with your polygons (maybe even at the pixel level) will you not run into the limitation imposed by the first part of the pipeline.

    2. Simply put: memory. Each ROP partition is firmly tied to a 64-bit memory controller, and only with the full ROP count can you use the full amount of memory, which is imperative in professional applications.

    3. I've asked the same question. The answer was as cho already said: it is not the theoretical peak but the observed performance in a low-level benchmark.

    4. See 1. It depends on what you are going to do with your OpenGL programs. If you do a lot of fancy stuff, adding or animating polys: PolyMorph Engine (PME). If you're just throwing millions and millions of triangles into a mesh: raster engine.
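
    As a rough illustration of point 2, the sketch below derives the memory bus width from the ROP count. It assumes 8 ROPs per 64-bit memory controller, which is consistent with the figures quoted in the first post (40 ROPs on 320 bits, 48 ROPs on 384 bits); the names are hypothetical.

```python
ROPS_PER_PARTITION = 8    # assumption: 8 ROPs share one memory controller
BITS_PER_CONTROLLER = 64  # one 64-bit controller per ROP partition


def bus_width_bits(rops: int) -> int:
    """Memory bus width implied by the number of enabled ROPs."""
    partitions, remainder = divmod(rops, ROPS_PER_PARTITION)
    if remainder:
        raise ValueError("ROP count is not a whole number of partitions")
    return partitions * BITS_PER_CONTROLLER


for name, rops in {"GTX 470": 40, "Quadro 6000": 48}.items():
    print(f"{name}: {rops} ROPs -> {bus_width_bits(rops)}-bit bus")
```

    Under that assumption, disabling a ROP partition also removes its memory controller and whatever memory hangs off it.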
     
  4. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    I agree with Carsten. Many workstation apps don't involve fancy texturing and shading (vertex or fragment), so geometry and rasterization performance is what matters most in these situations. The two most important things for great performance in these apps are the raster engine and optimized drivers.

    Artists that use programs like Maya might have viewports supporting fancy shading, but much of the time they'll still work with wireframe and untextured models.
     
  5. TARTOFCP

    Newcomer

    Joined:
    Aug 26, 2010
    Messages:
    8
    Likes Received:
    1
    Thank you for all the answers.

    However, there are still parts I do not understand.

    1. Forgive me.
    My grasp of the concepts is still lacking.

    My view comes from the articles I have read.




    2. Memory: that answer was a bit surprising.
    (I know a little about the structure: 1 module = 8 ROPs + one 64-bit MC.)

    I thought AA was the main reason. (Memory is important too, though.)
    GeForce: up to 32x; Quadro: up to 64x (single card)




    3. I need an explanation of what 'low level' means here.
    I was thinking that the GLperf results would be affected by the PolyMorph Engines (PME).

    P's links contain similar data.
    And perhaps tests like those exercise only the raster engine.


    4. That was helpful.
    ("It depends on what you are going to do with your OpenGL programs.")



    I have a question about 3dcgi's post.
    Wouldn't the PolyMorph Engine (or the SM and SP count) be important for VFX (OpenGL effects)?
    As far as I know, that market is larger.



    I forgot an important question.

    112. I would like to hear people's opinions on this:
    'Is Quadro's SP count more marketing than performance?'



    This is how it looks to me:
    according to NVIDIA's own material, the Quadro's SM/SP count seems to be used mainly for promotion.


    Thank you for reading.
    (Please understand that I am not good at English.)
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    I think the correct calculation is to use the culling rate, which is # SM * clock / 4 according to hardware.fr. So it would be 14 * 607 / 4 = approx. 2124 MTris/s. On a full chip with no SMs disabled this would match the GPC-based calculation.

    Also, the theoretical peak is only for triangles 8 pixels in size or smaller. Anything bigger than that would require multiple cycles in the rasterizer.
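
    A small Python sketch of the two estimates, for comparison. The helper names are made up, and the per-triangle cycle count is only a naive model that assumes the rasterizer handles 8 pixels per clock:

```python
import math


def gpc_rate_mtris(gpcs: int, clock_mhz: float) -> float:
    """Peak rate if each GPC's raster engine sets up one triangle per clock."""
    return gpcs * clock_mhz


def sm_rate_mtris(sms: int, clock_mhz: float) -> float:
    """Rate from the '# SM * clock / 4' formula quoted above."""
    return sms * clock_mhz / 4


def raster_cycles(triangle_pixels: int, pixels_per_clock: int = 8) -> int:
    """Naive assumption: one cycle per 8 covered pixels, at least one cycle."""
    return max(1, math.ceil(triangle_pixels / pixels_per_clock))


print(gpc_rate_mtris(4, 607))  # GTX 470, GPC-based estimate
print(sm_rate_mtris(14, 607))  # GTX 470, 14 SMs enabled
print(sm_rate_mtris(16, 607))  # hypothetical full chip at the same clock
# Quadro 6000 figures from the first post: 448 cores / 32 = 14 SMs at 574 MHz
print(sm_rate_mtris(14, 574))
print(raster_cycles(8), raster_cycles(9))
```

    With all 16 SMs enabled the two formulas agree, which is the point about a full chip made above.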
     