DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Reznor007

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    629
    Likes Received:
    69
    Location:
    Norman, OK, USA
    R9 390X driver version 15.20.1062.1004
    Compute only:
    1. 52.28ms
    2. 52.43ms
    3. 52.24ms
    4. 52.25ms
    5. 52.28ms
    6. 52.23ms
    7. 52.26ms
    8. 52.34ms
    9. 52.27ms
    10. 52.23ms
    11. 52.24ms
    12. 52.24ms
    13. 52.24ms
    14. 52.29ms
    15. 52.24ms
    16. 52.28ms
    17. 52.25ms
    18. 52.24ms
    19. 52.23ms
    20. 52.30ms
    21. 52.21ms
    22. 52.27ms
    23. 52.22ms
    24. 52.25ms
    25. 52.19ms
    26. 52.24ms
    27. 52.28ms
    28. 52.27ms
    29. 52.26ms
    30. 52.33ms
    31. 52.32ms
    32. 52.18ms
    33. 52.31ms
    34. 52.21ms
    35. 52.47ms
    36. 52.24ms
    37. 52.23ms
    38. 52.28ms
    39. 52.25ms
    40. 52.27ms
    41. 52.17ms
    42. 52.22ms
    43. 52.25ms
    44. 52.24ms
    45. 52.25ms
    46. 52.25ms
    47. 52.26ms
    48. 52.17ms
    49. 52.22ms
    50. 52.22ms
    51. 52.24ms
    52. 52.28ms
    53. 52.25ms
    54. 52.35ms
    55. 52.20ms
    56. 52.35ms
    57. 52.26ms
    58. 52.20ms
    59. 52.23ms
    60. 52.27ms
    61. 52.26ms
    62. 52.60ms
    63. 52.56ms
    64. 52.24ms
    65. 52.27ms
    66. 52.25ms
    67. 52.28ms
    68. 52.31ms
    69. 52.21ms
    70. 52.21ms
    71. 52.24ms
    72. 52.36ms
    73. 52.25ms
    74. 52.34ms
    75. 52.33ms
    76. 52.18ms
    77. 52.21ms
    78. 52.40ms
    79. 52.21ms
    80. 52.53ms
    81. 52.20ms
    82. 52.22ms
    83. 52.20ms
    84. 52.35ms
    85. 52.24ms
    86. 52.33ms
    87. 52.32ms
    88. 52.33ms
    89. 52.69ms
    90. 52.23ms
    91. 52.26ms
    92. 52.21ms
    93. 52.24ms
    94. 52.28ms
    95. 52.28ms
    96. 52.24ms
    97. 52.21ms
    98. 52.28ms
    99. 52.26ms
    100. 52.26ms
    101. 52.23ms
    102. 52.33ms
    103. 52.26ms
    104. 52.33ms
    105. 52.33ms
    106. 52.30ms
    107. 52.29ms
    108. 52.25ms
    109. 52.24ms
    110. 52.22ms
    111. 52.28ms
    112. 52.26ms
    113. 52.13ms
    114. 52.27ms
    115. 52.26ms
    116. 52.61ms
    117. 52.25ms
    118. 52.24ms
    119. 52.26ms
    120. 52.53ms
    121. 52.23ms
    122. 52.20ms
    123. 52.34ms
    124. 52.33ms
    125. 52.33ms
    126. 52.35ms
    127. 52.36ms
    128. 52.26ms
    Graphics only: 27.55ms (60.89G pixels/s)
    Graphics + compute:
    1. 53.07ms (31.62G pixels/s)
    2. 52.47ms (31.97G pixels/s)
    3. 52.57ms (31.91G pixels/s)
    4. 52.57ms (31.91G pixels/s)
    5. 52.85ms (31.74G pixels/s)
    6. 52.81ms (31.77G pixels/s)
    7. 52.53ms (31.94G pixels/s)
    8. 52.69ms (31.84G pixels/s)
    9. 52.98ms (31.67G pixels/s)
    10. 52.74ms (31.81G pixels/s)
    11. 52.74ms (31.81G pixels/s)
    12. 52.97ms (31.67G pixels/s)
    13. 53.56ms (31.33G pixels/s)
    14. 53.20ms (31.54G pixels/s)
    15. 53.52ms (31.35G pixels/s)
    16. 52.66ms (31.86G pixels/s)
    17. 52.92ms (31.70G pixels/s)
    18. 52.58ms (31.91G pixels/s)
    19. 54.12ms (31.00G pixels/s)
    20. 53.10ms (31.60G pixels/s)
    21. 53.10ms (31.60G pixels/s)
    22. 52.54ms (31.93G pixels/s)
    23. 53.03ms (31.64G pixels/s)
    24. 53.28ms (31.49G pixels/s)
    25. 53.01ms (31.65G pixels/s)
    26. 54.17ms (30.97G pixels/s)
    27. 52.78ms (31.79G pixels/s)
    28. 54.23ms (30.94G pixels/s)
    29. 53.23ms (31.52G pixels/s)
    30. 53.43ms (31.40G pixels/s)
    31. 53.30ms (31.47G pixels/s)
    32. 53.42ms (31.40G pixels/s)
    33. 54.75ms (30.65G pixels/s)
    34. 52.56ms (31.92G pixels/s)
    35. 53.64ms (31.28G pixels/s)
    36. 53.16ms (31.56G pixels/s)
    37. 56.06ms (29.93G pixels/s)
    38. 53.72ms (31.23G pixels/s)
    39. 53.36ms (31.44G pixels/s)
    40. 53.40ms (31.42G pixels/s)
    41. 53.46ms (31.38G pixels/s)
    42. 53.89ms (31.13G pixels/s)
    43. 52.63ms (31.88G pixels/s)
    44. 54.40ms (30.84G pixels/s)
    45. 52.55ms (31.93G pixels/s)
    46. 55.17ms (30.41G pixels/s)
    47. 53.35ms (31.45G pixels/s)
    48. 53.36ms (31.44G pixels/s)
    49. 52.58ms (31.91G pixels/s)
    50. 53.41ms (31.41G pixels/s)
    51. 54.21ms (30.95G pixels/s)
    52. 52.57ms (31.91G pixels/s)
    53. 55.68ms (30.13G pixels/s)
    54. 54.22ms (30.94G pixels/s)
    55. 54.40ms (30.84G pixels/s)
    56. 54.30ms (30.90G pixels/s)
    57. 53.94ms (31.10G pixels/s)
    58. 56.26ms (29.82G pixels/s)
    59. 54.38ms (30.85G pixels/s)
    60. 54.93ms (30.54G pixels/s)
    61. 54.52ms (30.77G pixels/s)
    62. 56.51ms (29.69G pixels/s)
    63. 52.55ms (31.92G pixels/s)
    64. 53.81ms (31.18G pixels/s)
    65. 53.58ms (31.31G pixels/s)
    66. 54.19ms (30.96G pixels/s)
    67. 54.53ms (30.76G pixels/s)
    68. 56.66ms (29.61G pixels/s)
    69. 54.79ms (30.62G pixels/s)
    70. 54.37ms (30.86G pixels/s)
    71. 56.02ms (29.95G pixels/s)
    72. 52.53ms (31.94G pixels/s)
    73. 52.65ms (31.87G pixels/s)
    74. 52.74ms (31.81G pixels/s)
    75. 58.51ms (28.67G pixels/s)
    76. 52.61ms (31.89G pixels/s)
    77. 56.75ms (29.56G pixels/s)
    78. 52.76ms (31.80G pixels/s)
    79. 52.55ms (31.93G pixels/s)
    80. 57.43ms (29.21G pixels/s)
    81. 53.99ms (31.07G pixels/s)
    82. 57.87ms (28.99G pixels/s)
    83. 55.15ms (30.42G pixels/s)
    84. 58.63ms (28.62G pixels/s)
    85. 53.88ms (31.14G pixels/s)
    86. 58.06ms (28.90G pixels/s)
    87. 52.59ms (31.90G pixels/s)
    88. 55.23ms (30.38G pixels/s)
    89. 55.30ms (30.34G pixels/s)
    90. 55.16ms (30.42G pixels/s)
    91. 55.45ms (30.26G pixels/s)
    92. 54.03ms (31.05G pixels/s)
    93. 57.21ms (29.33G pixels/s)
    94. 55.55ms (30.20G pixels/s)
    95. 54.34ms (30.87G pixels/s)
    96. 52.55ms (31.93G pixels/s)
    97. 56.54ms (29.67G pixels/s)
    98. 55.46ms (30.25G pixels/s)
    99. 58.49ms (28.68G pixels/s)
    100. 52.78ms (31.79G pixels/s)
    101. 54.59ms (30.73G pixels/s)
    102. 56.09ms (29.91G pixels/s)
    103. 52.75ms (31.81G pixels/s)
    104. 57.93ms (28.96G pixels/s)
    105. 52.56ms (31.92G pixels/s)
    106. 57.76ms (29.05G pixels/s)
    107. 55.86ms (30.03G pixels/s)
    108. 58.50ms (28.68G pixels/s)
    109. 52.76ms (31.80G pixels/s)
    110. 54.43ms (30.82G pixels/s)
    111. 54.52ms (30.77G pixels/s)
    112. 55.31ms (30.33G pixels/s)
    113. 59.58ms (28.16G pixels/s)
    114. 55.44ms (30.26G pixels/s)
    115. 55.07ms (30.47G pixels/s)
    116. 56.05ms (29.94G pixels/s)
    117. 54.61ms (30.72G pixels/s)
    118. 56.32ms (29.79G pixels/s)
    119. 58.34ms (28.76G pixels/s)
    120. 52.76ms (31.80G pixels/s)
    121. 52.59ms (31.90G pixels/s)
    122. 54.54ms (30.76G pixels/s)
    123. 54.55ms (30.75G pixels/s)
    124. 52.73ms (31.82G pixels/s)
    125. 54.64ms (30.71G pixels/s)
    126. 58.86ms (28.50G pixels/s)
    127. 52.52ms (31.94G pixels/s)
    128. 52.63ms (31.88G pixels/s)
     
  2. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    It's running from 1 to 128 single-lane compute kernels, which are quite long and require a minimal amount of bandwidth. The graphics queue is basically just pushing fillrate, with triangles covering a 4k x 4k offscreen render target.
    So basically the best possible case.
     
    ToTTenTranz likes this.
  3. Dygaza

    Newcomer

    Joined:
    Aug 27, 2015
    Messages:
    40
    Likes Received:
    39
    Fury X - Fiji GCN 1.2
    15.20.1062.1004

    Compute only:
    1. 49.65ms
    2. 49.66ms
    3. 49.66ms
    4. 49.66ms
    5. 49.66ms
    6. 49.65ms
    7. 49.66ms
    8. 49.66ms
    9. 49.65ms
    10. 49.66ms
    11. 49.66ms
    12. 49.65ms
    13. 49.66ms
    14. 49.64ms
    15. 49.66ms
    16. 49.66ms
    17. 49.66ms
    18. 49.65ms
    19. 49.65ms
    20. 49.67ms
    21. 49.66ms
    22. 49.66ms
    23. 49.66ms
    24. 49.66ms
    25. 49.67ms
    26. 49.66ms
    27. 49.66ms
    28. 49.66ms
    29. 49.66ms
    30. 49.66ms
    31. 49.66ms
    32. 49.66ms
    33. 49.66ms
    34. 49.66ms
    35. 49.67ms
    36. 49.65ms
    37. 49.65ms
    38. 49.66ms
    39. 49.66ms
    40. 49.67ms
    41. 49.66ms
    42. 49.66ms
    43. 49.66ms
    44. 49.66ms
    45. 49.66ms
    46. 49.66ms
    47. 49.65ms
    48. 49.66ms
    49. 49.66ms
    50. 49.66ms
    51. 49.66ms
    52. 49.66ms
    53. 49.66ms
    54. 49.66ms
    55. 49.66ms
    56. 49.66ms
    57. 49.66ms
    58. 49.66ms
    59. 49.66ms
    60. 49.65ms
    61. 49.66ms
    62. 49.66ms
    63. 49.66ms
    64. 49.66ms
    65. 49.68ms
    66. 49.66ms
    67. 49.68ms
    68. 49.67ms
    69. 49.66ms
    70. 49.67ms
    71. 49.66ms
    72. 49.66ms
    73. 49.65ms
    74. 49.67ms
    75. 49.67ms
    76. 49.66ms
    77. 49.66ms
    78. 49.66ms
    79. 49.66ms
    80. 49.67ms
    81. 49.66ms
    82. 49.65ms
    83. 49.66ms
    84. 49.66ms
    85. 49.67ms
    86. 49.65ms
    87. 49.67ms
    88. 49.66ms
    89. 49.66ms
    90. 49.66ms
    91. 49.65ms
    92. 49.66ms
    93. 49.67ms
    94. 49.67ms
    95. 49.67ms
    96. 49.67ms
    97. 49.65ms
    98. 49.66ms
    99. 49.66ms
    100. 49.66ms
    101. 49.66ms
    102. 49.67ms
    103. 49.67ms
    104. 49.66ms
    105. 49.67ms
    106. 49.66ms
    107. 49.67ms
    108. 49.67ms
    109. 49.66ms
    110. 49.68ms
    111. 49.67ms
    112. 49.67ms
    113. 49.67ms
    114. 49.67ms
    115. 49.68ms
    116. 49.67ms
    117. 49.67ms
    118. 49.67ms
    119. 49.67ms
    120. 49.68ms
    121. 49.66ms
    122. 49.69ms
    123. 49.67ms
    124. 49.67ms
    125. 49.66ms
    126. 49.66ms
    127. 49.65ms
    128. 49.67ms
    Graphics only: 25.18ms (66.62G pixels/s)
    Graphics + compute:
    1. 55.93ms (30.00G pixels/s)
    2. 56.01ms (29.95G pixels/s)
    3. 49.76ms (33.72G pixels/s)
    4. 49.76ms (33.72G pixels/s)
    5. 49.75ms (33.72G pixels/s)
    6. 49.82ms (33.68G pixels/s)
    7. 56.03ms (29.94G pixels/s)
    8. 56.05ms (29.93G pixels/s)
    9. 49.85ms (33.66G pixels/s)
    10. 49.79ms (33.69G pixels/s)
    11. 49.77ms (33.71G pixels/s)
    12. 49.80ms (33.69G pixels/s)
    13. 56.06ms (29.93G pixels/s)
    14. 62.31ms (26.92G pixels/s)
    15. 49.78ms (33.70G pixels/s)
    16. 62.34ms (26.91G pixels/s)
    17. 49.80ms (33.69G pixels/s)
    18. 62.40ms (26.89G pixels/s)
    19. 56.00ms (29.96G pixels/s)
    20. 62.35ms (26.91G pixels/s)
    21. 56.13ms (29.89G pixels/s)
    22. 56.01ms (29.95G pixels/s)
    23. 62.33ms (26.92G pixels/s)
    24. 49.82ms (33.68G pixels/s)
    25. 62.27ms (26.94G pixels/s)
    26. 49.76ms (33.72G pixels/s)
    27. 56.00ms (29.96G pixels/s)
    28. 56.07ms (29.92G pixels/s)
    29. 62.31ms (26.93G pixels/s)
    30. 56.12ms (29.90G pixels/s)
    31. 68.61ms (24.45G pixels/s)
    32. 49.77ms (33.71G pixels/s)
    33. 56.01ms (29.95G pixels/s)
    34. 62.27ms (26.94G pixels/s)
    35. 62.35ms (26.91G pixels/s)
    36. 68.59ms (24.46G pixels/s)
    37. 55.99ms (29.96G pixels/s)
    38. 75.62ms (22.19G pixels/s)
    39. 49.79ms (33.70G pixels/s)
    40. 49.80ms (33.69G pixels/s)
    41. 49.79ms (33.69G pixels/s)
    42. 49.77ms (33.71G pixels/s)
    43. 49.76ms (33.72G pixels/s)
    44. 74.78ms (22.44G pixels/s)
    45. 56.01ms (29.95G pixels/s)
    46. 49.79ms (33.69G pixels/s)
    47. 49.81ms (33.68G pixels/s)
    48. 56.15ms (29.88G pixels/s)
    49. 56.07ms (29.92G pixels/s)
    50. 49.81ms (33.68G pixels/s)
    51. 62.28ms (26.94G pixels/s)
    52. 49.78ms (33.70G pixels/s)
    53. 62.30ms (26.93G pixels/s)
    54. 49.82ms (33.68G pixels/s)
    55. 62.34ms (26.91G pixels/s)
    56. 56.04ms (29.94G pixels/s)
    57. 56.05ms (29.93G pixels/s)
    58. 56.03ms (29.94G pixels/s)
    59. 49.77ms (33.71G pixels/s)
    60. 62.38ms (26.90G pixels/s)
    61. 49.82ms (33.68G pixels/s)
    62. 56.02ms (29.95G pixels/s)
    63. 56.05ms (29.93G pixels/s)
    64. 56.07ms (29.92G pixels/s)
    65. 56.05ms (29.93G pixels/s)
    66. 56.03ms (29.94G pixels/s)
    67. 56.09ms (29.91G pixels/s)
    68. 49.79ms (33.70G pixels/s)
    69. 62.24ms (26.95G pixels/s)
    70. 49.79ms (33.70G pixels/s)
    71. 62.32ms (26.92G pixels/s)
    72. 49.81ms (33.68G pixels/s)
    73. 55.98ms (29.97G pixels/s)
    74. 49.81ms (33.68G pixels/s)
    75. 49.79ms (33.69G pixels/s)
    76. 49.76ms (33.71G pixels/s)
    77. 55.98ms (29.97G pixels/s)
    78. 56.10ms (29.90G pixels/s)
    79. 49.82ms (33.67G pixels/s)
    80. 62.48ms (26.85G pixels/s)
    81. 49.77ms (33.71G pixels/s)
    82. 49.79ms (33.69G pixels/s)
    83. 55.96ms (29.98G pixels/s)
    84. 49.78ms (33.70G pixels/s)
    85. 49.78ms (33.70G pixels/s)
    86. 49.78ms (33.70G pixels/s)
    87. 62.28ms (26.94G pixels/s)
    88. 68.57ms (24.47G pixels/s)
    89. 62.28ms (26.94G pixels/s)
    90. 56.00ms (29.96G pixels/s)
    91. 62.43ms (26.87G pixels/s)
    92. 68.55ms (24.47G pixels/s)
    93. 68.58ms (24.46G pixels/s)
    94. 49.77ms (33.71G pixels/s)
    95. 62.44ms (26.87G pixels/s)
    96. 49.80ms (33.69G pixels/s)
    97. 56.02ms (29.95G pixels/s)
    98. 56.06ms (29.93G pixels/s)
    99. 56.03ms (29.94G pixels/s)
    100. 56.03ms (29.95G pixels/s)
    101. 55.98ms (29.97G pixels/s)
    102. 56.02ms (29.95G pixels/s)
    103. 74.82ms (22.42G pixels/s)
    104. 62.31ms (26.92G pixels/s)
    105. 56.13ms (29.89G pixels/s)
    106. 62.26ms (26.95G pixels/s)
    107. 49.79ms (33.69G pixels/s)
    108. 56.07ms (29.92G pixels/s)
    109. 49.78ms (33.71G pixels/s)
    110. 49.78ms (33.70G pixels/s)
    111. 56.05ms (29.93G pixels/s)
    112. 56.05ms (29.94G pixels/s)
    113. 56.12ms (29.90G pixels/s)
    114. 74.67ms (22.47G pixels/s)
    115. 62.26ms (26.95G pixels/s)
    116. 56.01ms (29.96G pixels/s)
    117. 49.83ms (33.67G pixels/s)
    118. 49.78ms (33.70G pixels/s)
    119. 49.78ms (33.70G pixels/s)
    120. 49.78ms (33.70G pixels/s)
    121. 49.81ms (33.69G pixels/s)
    122. 55.96ms (29.98G pixels/s)
    123. 62.45ms (26.87G pixels/s)
    124. 75.07ms (22.35G pixels/s)
    125. 56.07ms (29.92G pixels/s)
    126. 62.38ms (26.90G pixels/s)
    127. 56.08ms (29.92G pixels/s)
    128. 62.24ms (26.96G pixels/s)
     
    fellix likes this.
  4. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,525
    Likes Received:
    460
    Location:
    Varna, Bulgaria
    It looks like GCN flatlines in compute-only loads and, unlike Nvidia parts, its performance doesn't benefit from a lower batch count. :???:
     
  5. Godfavor

    Joined:
    Aug 31, 2015
    Messages:
    2
    Likes Received:
    0
    Laptop 8970M, i7-4700MQ
    Compute only:
    1. 61.52ms
    2. 61.52ms
    3. 61.52ms
    4. 61.58ms
    5. 61.47ms
    6. 61.51ms
    7. 61.51ms
    8. 61.52ms
    9. 61.50ms
    10. 61.51ms
    11. 61.52ms
    12. 61.52ms
    13. 61.52ms
    14. 61.52ms
    15. 61.53ms
    16. 61.51ms
    17. 61.51ms
    18. 61.52ms
    19. 61.52ms
    20. 61.50ms
    21. 61.52ms
    22. 61.52ms
    23. 61.52ms
    24. 61.51ms
    25. 61.53ms
    26. 61.50ms
    27. 61.53ms
    28. 61.51ms
    29. 61.53ms
    30. 61.52ms
    31. 61.52ms
    32. 61.51ms
    33. 61.52ms
    34. 61.52ms
    35. 61.49ms
    36. 61.49ms
    37. 61.50ms
    38. 61.51ms
    39. 61.52ms
    40. 61.52ms
    41. 61.52ms
    42. 61.51ms
    43. 61.54ms
    44. 61.52ms
    45. 61.52ms
    46. 61.51ms
    47. 61.52ms
    48. 61.51ms
    49. 61.52ms
    50. 61.54ms
    51. 61.52ms
    52. 61.52ms
    53. 61.51ms
    54. 61.53ms
    55. 61.51ms
    56. 61.50ms
    57. 61.52ms
    58. 61.52ms
    59. 61.51ms
    60. 61.52ms
    61. 61.51ms
    62. 61.50ms
    63. 61.50ms
    64. 61.53ms
    65. 61.50ms
    66. 61.39ms
    67. 61.36ms
    68. 61.39ms
    69. 61.43ms
    70. 61.51ms
    71. 61.54ms
    72. 61.51ms
    73. 61.52ms
    74. 61.51ms
    75. 61.50ms
    76. 61.52ms
    77. 61.54ms
    78. 61.51ms
    79. 61.53ms
    80. 61.52ms
    81. 61.50ms
    82. 61.51ms
    83. 61.51ms
    84. 61.52ms
    85. 61.48ms
    86. 61.53ms
    87. 61.52ms
    88. 61.48ms
    89. 61.50ms
    90. 61.52ms
    91. 61.50ms
    92. 61.51ms
    93. 61.51ms
    94. 61.52ms
    95. 61.51ms
    96. 61.50ms
    97. 61.52ms
    98. 61.52ms
    99. 61.52ms
    100. 61.52ms
    101. 61.52ms
    102. 61.52ms
    103. 61.52ms
    104. 61.50ms
    105. 61.51ms
    106. 61.51ms
    107. 61.52ms
    108. 61.53ms
    109. 61.52ms
    110. 61.50ms
    111. 61.50ms
    112. 61.51ms
    113. 61.52ms
    114. 61.52ms
    115. 61.50ms
    116. 61.51ms
    117. 61.52ms
    118. 61.50ms
    119. 61.51ms
    120. 61.52ms
    121. 61.54ms
    122. 61.49ms
    123. 61.52ms
    124. 61.52ms
    125. 61.52ms
    126. 61.51ms
    127. 61.50ms
    128. 61.51ms
    Graphics only: 59.03ms (28.42G pixels/s)
    Graphics + compute:
    1. 62.97ms (26.64G pixels/s)
    2. 63.01ms (26.63G pixels/s)
    3. 63.00ms (26.63G pixels/s)
    4. 63.01ms (26.63G pixels/s)
    5. 62.99ms (26.64G pixels/s)
    6. 63.01ms (26.63G pixels/s)
    7. 63.00ms (26.63G pixels/s)
    8. 63.01ms (26.63G pixels/s)
    9. 63.00ms (26.63G pixels/s)
    10. 62.99ms (26.64G pixels/s)
    11. 63.02ms (26.62G pixels/s)
    12. 63.00ms (26.63G pixels/s)
    13. 63.00ms (26.63G pixels/s)
    14. 62.98ms (26.64G pixels/s)
    15. 63.01ms (26.63G pixels/s)
    16. 62.94ms (26.65G pixels/s)
    17. 63.01ms (26.63G pixels/s)
    18. 62.99ms (26.63G pixels/s)
    19. 62.99ms (26.64G pixels/s)
    20. 63.00ms (26.63G pixels/s)
    21. 63.00ms (26.63G pixels/s)
    22. 63.00ms (26.63G pixels/s)
    23. 63.00ms (26.63G pixels/s)
    24. 63.01ms (26.62G pixels/s)
    25. 63.00ms (26.63G pixels/s)
    26. 63.01ms (26.62G pixels/s)
    27. 62.96ms (26.65G pixels/s)
    28. 63.02ms (26.62G pixels/s)
    29. 62.99ms (26.63G pixels/s)
    30. 63.01ms (26.63G pixels/s)
    31. 63.01ms (26.63G pixels/s)
    32. 63.01ms (26.63G pixels/s)
    33. 62.99ms (26.63G pixels/s)
    34. 63.01ms (26.63G pixels/s)
    35. 63.00ms (26.63G pixels/s)
    36. 63.02ms (26.62G pixels/s)
    37. 62.99ms (26.63G pixels/s)
    38. 63.00ms (26.63G pixels/s)
    39. 63.01ms (26.63G pixels/s)
    40. 63.01ms (26.62G pixels/s)
    41. 63.00ms (26.63G pixels/s)
    42. 63.00ms (26.63G pixels/s)
    43. 63.01ms (26.62G pixels/s)
    44. 62.96ms (26.65G pixels/s)
    45. 63.01ms (26.63G pixels/s)
    46. 63.00ms (26.63G pixels/s)
    47. 63.01ms (26.63G pixels/s)
    48. 62.99ms (26.63G pixels/s)
    49. 63.01ms (26.63G pixels/s)
    50. 62.97ms (26.64G pixels/s)
    51. 63.01ms (26.63G pixels/s)
    52. 62.97ms (26.64G pixels/s)
    53. 63.01ms (26.63G pixels/s)
    54. 63.00ms (26.63G pixels/s)
    55. 63.00ms (26.63G pixels/s)
    56. 63.01ms (26.63G pixels/s)
    57. 63.01ms (26.62G pixels/s)
    58. 63.00ms (26.63G pixels/s)
    59. 62.99ms (26.63G pixels/s)
    60. 63.00ms (26.63G pixels/s)
    61. 63.01ms (26.62G pixels/s)
    62. 63.01ms (26.62G pixels/s)
    63. 62.99ms (26.63G pixels/s)
    64. 63.00ms (26.63G pixels/s)
    65. 62.98ms (26.64G pixels/s)
    66. 63.02ms (26.62G pixels/s)
    67. 62.98ms (26.64G pixels/s)
    68. 63.01ms (26.62G pixels/s)
    69. 63.00ms (26.63G pixels/s)
    70. 63.01ms (26.62G pixels/s)
    71. 63.01ms (26.63G pixels/s)
    72. 63.03ms (26.62G pixels/s)
    73. 63.02ms (26.62G pixels/s)
    74. 63.01ms (26.63G pixels/s)
    75. 63.01ms (26.63G pixels/s)
    76. 62.99ms (26.63G pixels/s)
    77. 63.02ms (26.62G pixels/s)
    78. 63.00ms (26.63G pixels/s)
    79. 63.00ms (26.63G pixels/s)
    80. 62.98ms (26.64G pixels/s)
    81. 63.02ms (26.62G pixels/s)
    82. 63.01ms (26.63G pixels/s)
    83. 63.00ms (26.63G pixels/s)
    84. 63.00ms (26.63G pixels/s)
    85. 63.01ms (26.63G pixels/s)
    86. 63.00ms (26.63G pixels/s)
    87. 63.01ms (26.63G pixels/s)
    88. 63.01ms (26.63G pixels/s)
    89. 63.01ms (26.62G pixels/s)
    90. 63.00ms (26.63G pixels/s)
    91. 63.00ms (26.63G pixels/s)
    92. 62.99ms (26.63G pixels/s)
    93. 63.01ms (26.63G pixels/s)
    94. 63.00ms (26.63G pixels/s)
    95. 62.99ms (26.63G pixels/s)
    96. 63.02ms (26.62G pixels/s)
    97. 63.01ms (26.62G pixels/s)
    98. 63.01ms (26.63G pixels/s)
    99. 62.97ms (26.64G pixels/s)
    100. 63.01ms (26.63G pixels/s)
    101. 63.01ms (26.62G pixels/s)
    102. 62.99ms (26.64G pixels/s)
    103. 63.00ms (26.63G pixels/s)
    104. 63.00ms (26.63G pixels/s)
    105. 63.00ms (26.63G pixels/s)
    106. 62.99ms (26.63G pixels/s)
    107. 63.00ms (26.63G pixels/s)
    108. 62.98ms (26.64G pixels/s)
    109. 63.01ms (26.63G pixels/s)
    110. 62.99ms (26.64G pixels/s)
    111. 63.01ms (26.62G pixels/s)
    112. 63.01ms (26.63G pixels/s)
    113. 63.01ms (26.63G pixels/s)
    114. 63.00ms (26.63G pixels/s)
    115. 63.00ms (26.63G pixels/s)
    116. 62.99ms (26.64G pixels/s)
    117. 63.01ms (26.62G pixels/s)
    118. 63.00ms (26.63G pixels/s)
    119. 63.00ms (26.63G pixels/s)
    120. 63.01ms (26.63G pixels/s)
    121. 63.02ms (26.62G pixels/s)
    122. 62.99ms (26.63G pixels/s)
    123. 62.97ms (26.64G pixels/s)
    124. 63.02ms (26.62G pixels/s)
    125. 62.98ms (26.64G pixels/s)
    126. 62.98ms (26.64G pixels/s)
    127. 62.99ms (26.64G pixels/s)
    128. 63.01ms (26.63G pixels/s)
     
  6. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    448
    Likes Received:
    488
    Looks like GCN is single-thread and latency limited, since the test is single lane. I wonder whether Maxwell benefits here from its narrower, 32-thread-optimized pipeline and FMA + load/store dual issue.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,215
    Likes Received:
    1,412
    Location:
    London
    That's a good point, the compute kernel might need to be normalised against a version that computes a set of 100,000 work items. Or a variant that ramps up the count of work-items, allowing work groups to be filled...

    I dare say the results being posted are fascinating even for such a simple test. Fillrate on GM200 (980Ti) starts at a relatively high percentage of native fillrate, 61/94 = 65%, and falls to 29/94 = 31%. Fiji starts worse, but stays roughly constant at about 50%.

    ---

    Having looked at the shader code and seeing the results on GM200, I think two things need to be said:

    1. performance steps in increments of 32, which is the work group size on GM200. This looks to me as if NVidia's driver has observed the static configuration here and discerned that 32 work groups can be packed into a single work group. There are no barriers (which are a work-group wide synchronisation) and there's no use of memory fences, so the kernel is trivial to pack into 32-wide work-groups instead of 1-wide.

    2. the kernel contains a loop iterator with hard-coded bounds. This is bait for any compiler to pre-compute the result and stick in a constant. If the compiler spots this, then the kernel runs in near-0 ALU cycles.
     
  8. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,525
    Likes Received:
    460
    Location:
    Varna, Bulgaria
    AMD pours more drama in the AofS case:
    Reddit
     
  9. wirtold

    Joined:
    Aug 31, 2015
    Messages:
    6
    Likes Received:
    0
    Very good app. But I have some questions: you are rendering graphics to a 4k x 4k surface, which means about 16M pixel shader invocations are in flight. As far as I can see, on NVidia HW compute is serialized after the graphics task, so maybe the graphics task has some kind of higher priority and can't be preempted by compute tasks. Maybe you could refine your app so that fewer pixel shaders are in flight? But I have no idea how to do it ;) A small triangle of about 500px will just finish very fast, and many small triangles will be executed in parallel and will also take all the resources...
     
  10. madyasiwi

    Newcomer

    Joined:
    Oct 7, 2008
    Messages:
    194
    Likes Received:
    32
    Picture speaks louder, visualizing results from Fellix and Dygaza. To scale. Lower is better.

    [Image: chart of the results from Fellix and Dygaza]
     
    Lightman, Kodiack, DegustatoR and 4 others like this.
  11. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,708
    Likes Received:
    6,597

    Can you raise this number to 256 or 512, for example?
    None of the GCN chips have their performance even budging between 1 and 128 kernels, and with async there's hardly any difference at all. Perhaps it'd be interesting to see how many more are needed until we see the latency stepping up like the Kepler and Maxwell chips.
     
  12. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    605
    Likes Received:
    320
    The XB1 and PS4 SDKs and documentation are under NDA.... But, if I remember correctly, the November 2014 SDK of the XB1 was leaked many months ago...
     
  13. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    That's more a matter of which part finishes first. The graphics load is static, the compute load changes. So the 980 Ti starts at a high percentage because compute is that much quicker to complete. It doesn't go parallel with graphics though.
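
    The fill rate figures posted earlier in the thread are consistent with this. A minimal sketch, assuming (this is inferred from the posted numbers, not stated by the author) that the test shades about 100 full layers of the 4096 x 4096 target and divides that fixed pixel count by the total elapsed time. The names `PIXELS_PER_RUN` and `fill_rate_gpix` are illustrative only:

```python
# Assumption: 100 layers of a 4096 x 4096 render target. The layer count is
# inferred from "27.55ms (60.89G pixels/s)" in the R9 390X graphics-only
# result; it is not taken from the test's source code.
TARGET_SIDE = 4096
ASSUMED_LAYERS = 100
PIXELS_PER_RUN = TARGET_SIDE * TARGET_SIDE * ASSUMED_LAYERS


def fill_rate_gpix(elapsed_ms: float) -> float:
    """Fill rate in G pixels/s when a fixed pixel count is divided by elapsed time."""
    return PIXELS_PER_RUN / (elapsed_ms / 1000.0) / 1e9


if __name__ == "__main__":
    # Elapsed times copied from the results posted in this thread.
    samples = [
        ("R9 390X, graphics only", 27.55),
        ("R9 390X, graphics + 1 compute kernel", 53.07),
        ("Fury X, graphics only", 25.18),
        ("Fury X, graphics + 3 compute kernels", 49.76),
    ]
    for label, ms in samples:
        print(f"{label}: {fill_rate_gpix(ms):.2f} G pixels/s")
```

    If that assumption holds, the reported fill rate is simply a restatement of the total time, so it drops whenever the compute side is the part that finishes last.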

    It steps in groups of 32 on GM200, groups of 16 on GTX 750 and groups of 8 on GTX 680. I think you'll also find it interesting that the jump is actually at 31 not at 32.
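
    A rough way to picture the stepping, as a sketch only: if kernels are executed in groups of a fixed width and each group costs about the same time, the total grows as a staircase. The group widths come from the post; `per_batch_ms` is a hypothetical placeholder, and this simple model puts the first step at 32, so it does not reproduce the observed jump at 31:

```python
import math

# Group widths as reported in the post above.
GROUP_WIDTH = {"GM200": 32, "GTX 750": 16, "GTX 680": 8}


def batches(kernel_count: int, width: int) -> int:
    """Number of groups needed if kernels run `width` at a time."""
    return math.ceil(kernel_count / width)


def modeled_time_ms(kernel_count: int, width: int, per_batch_ms: float) -> float:
    """Staircase model: every group is assumed to take the same time."""
    return batches(kernel_count, width) * per_batch_ms


if __name__ == "__main__":
    # per_batch_ms=10.0 is a placeholder value, not a measurement.
    for gpu, width in GROUP_WIDTH.items():
        steps = [batches(n, width) for n in (1, 32, 33, 64, 128)]
        print(gpu, steps, modeled_time_ms(128, width, per_batch_ms=10.0))
```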

    Well, the compiler would also have to assume that the thread id is actually constant. It could do that, I guess, given it's also there in the code, but the numbers we're seeing don't suggest anywhere near 0 ALU cycles. I just wanted something simple that would keep an SMM/SMX/CU busy for a long, easily configurable time, so Fibonacci with a small change seemed like the best idea.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,215
    Likes Received:
    1,412
    Location:
    London
    Are you saying that you are computing fillrate based on the total execution time, even if the triangle was completed much earlier?

    The boundaries are 31, 64, 96 and 128. So the first boundary is the outlier in this case, though it appears that the first boundary always behaves this way on the 3 NVidia architectures documented so far...

    Also 16 and 8, do those sizes correspond with the native width of the SIMD in each case?

    Wasn't it stated before that Maxwell 2 supports 32 queues for compute? I now realise that this could be the reason for the magic number and so the match with the SIMD width is pure coincidence.

    I'm now wondering if the compute latencies being reported are system-wide latencies. ~50ms for a single kernel on Fiji is mostly system latency, since the theoretical execution time at 1.05 GHz is 7.6ms. So the kernel itself doesn't run for long enough to hide system latencies.

    Of course, system latencies are a thing to quantify.

    On Fiji, though, you'd need about 17,000 kernels before you'd fill the GPU with enough work to use up 50ms of execution time. That's 50ms/7.6ms = a factor of about 6.6, times 256 SIMDs with 10 work-groups each. Arguably fewer kernel launches, since they won't all be launched concurrently.
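
    The estimate can be checked with a few lines. This is only a sketch of the arithmetic in the post: the 7.6ms kernel time, the 50ms observed time, 256 SIMDs and 10 resident work-groups per SIMD are the figures quoted above, and `kernels_to_fill` is an illustrative name:

```python
FIJI_SIMDS = 256        # 64 CUs x 4 SIMDs
SLOTS_PER_SIMD = 10     # "10 work-groups each" in the post
KERNEL_MS = 7.6         # theoretical single-kernel time quoted in the post
OBSERVED_MS = 50.0      # roughly what the compute-only runs report


def kernels_to_fill(observed_ms: float, kernel_ms: float,
                    simds: int, slots_per_simd: int) -> float:
    """Kernels needed to keep every slot busy for `observed_ms`.

    Assumes one single-lane kernel occupies one slot and that kernels
    run back to back with no launch overhead.
    """
    rounds = observed_ms / kernel_ms       # how many kernels fit end to end
    return rounds * simds * slots_per_simd


if __name__ == "__main__":
    total = kernels_to_fill(OBSERVED_MS, KERNEL_MS, FIJI_SIMDS, SLOTS_PER_SIMD)
    print(f"about {total:,.0f} kernels")
```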

    I checked and on GCN it compiles like this:

    Code:
    shader csMain
      asic(SI)
      type(CS)
    
      v_add_i32     v0, vcc, s8, v0                             // 00000000: 4A000008
      v_add_i32     v1, vcc, s9, v1                             // 00000004: 4A020209
      v_cvt_f32_u32  v0, v0                                     // 00000008: 7E000D00
      v_cvt_f32_u32  v1, v1                                     // 0000000C: 7E020D01
      v_add_f32     v0, 1.0, v0                                 // 00000010: 060000F2
      v_add_f32     v1, 1.0, v1                                 // 00000014: 060202F2
      v_mov_b32     v2, 0                                       // 00000018: 7E040280
      s_movk_i32    s0, 0x0000                                  // 0000001C: B0000000
    label_0008:
      s_cmp_ge_i32  s0, 0x00100000                              // 00000020: BF03FF00 00100000
      s_cbranch_scc1  label_0012                                // 00000028: BF850007
      v_add_f32     v0, v0, v1                                  // 0000002C: 06000300
      v_mul_f32     v2, 0x3f000011, v0                          // 00000030: 100400FF 3F000011
      s_add_u32     s0, s0, 1                                   // 00000038: 80008100
      v_mov_b32     v0, v1                                      // 0000003C: 7E000301
      v_mov_b32     v1, v2                                      // 00000040: 7E020302
      s_branch      label_0008                                  // 00000044: BF82FFF6
    label_0012:
      v_lshl_b64    v[0:1], 0, 0                                // 00000048: D2C20000 00010080
      buffer_store_dword  v2, v[0:1], s[4:7], 0 offen idxen     // 00000050: E0703000 80010200
      s_endpgm                                                  // 00000058: BF810000
    end
    Which results in an 8 cycle inner loop and theoretical 7.6ms execution time.
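
    For readers who don't read GCN ISA, here is a hedged Python transcription of the loop above, plus the cycle arithmetic. It is a sketch under assumptions: the math is done in double precision rather than f32, so values are not bit-exact; the `s8`/`s9` offsets are folded into the `tid_x`/`tid_y` arguments; and the 8 cycles per iteration and the 1.05 GHz clock are the figures quoted in this post. All function names are illustrative:

```python
import struct


def f32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


SCALE = f32_from_bits(0x3F000011)   # literal in v_mul_f32, about 0.500001
LOOP_BOUND = 0x00100000             # literal in s_cmp_ge_i32 (1,048,576)


def run_kernel(tid_x: int = 0, tid_y: int = 0, iterations: int = LOOP_BOUND) -> float:
    """Fibonacci-like recurrence from the disassembly; returns the stored value v2."""
    v0 = float(tid_x) + 1.0          # v_cvt_f32_u32 + v_add_f32 1.0
    v1 = float(tid_y) + 1.0
    v2 = 0.0                         # v_mov_b32 v2, 0
    for _ in range(iterations):
        v0 = v0 + v1                 # v_add_f32 v0, v0, v1
        v2 = SCALE * v0              # v_mul_f32 v2, 0x3f000011, v0
        v0 = v1                      # v_mov_b32 v0, v1
        v1 = v2                      # v_mov_b32 v1, v2
    return v2


def loop_time_ms(cycles_per_iteration: int, iterations: int, clock_hz: float) -> float:
    """Execution time if every iteration costs a fixed number of cycles."""
    return cycles_per_iteration * iterations / clock_hz * 1000.0


if __name__ == "__main__":
    print(run_kernel(iterations=1000))
    print(loop_time_ms(8, 1_000_000, 1.05e9))   # a round million iterations
    print(loop_time_ms(8, LOOP_BOUND, 1.05e9))  # loop bound from the disassembly
```

    With a round 1,000,000 iterations this arithmetic gives roughly the 7.6ms quoted; with the 0x00100000 bound shown in the disassembly it comes out closer to 8.0ms. The scale factor just above 0.5 keeps the recurrence from blowing up, since with exactly 0.5 and equal starting values the sequence would stay constant.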

    Someone should be able to get the compiled code for GM200. ~10ms being reported seems like a reasonable indication, since it should be ~25% slower than Fiji in the worst-case (9.9ms).

    All the same, it would be nice to eliminate system-wide latencies on both platforms. But I think NVidia has historically had radically lower kernel launch latencies in compute, so this pattern isn't surprising.

    Fiji's 40+ms overhead is looking pretty useless.
     
  15. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    Correct. Though now that you mention it... WaitForMultipleObjects does return which event finished first, so this would be nice to see, yes.

    Warp is 32 on all NV platforms, though I don't think this plays a role here. It was stated as 31 compute queues + 1 graphics/compute.

    A 40+ms launch overhead would indeed make it completely useless. That would cap any game making a dispatch call to 25 fps. Can you try to shorten the loop to 1024 and see how it reacts? You have a GCN board, right?
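
    The 25 fps figure is just the reciprocal of the overhead. A minimal sketch, assuming one such dispatch has to complete inside every frame and nothing overlaps with it (`max_fps` is an illustrative name):

```python
def max_fps(serial_overhead_ms: float) -> float:
    """Upper bound on frame rate if each frame must wait out this overhead."""
    return 1000.0 / serial_overhead_ms


if __name__ == "__main__":
    print(max_fps(40.0))  # overhead figure discussed in the thread
```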

    I'm still a bit bothered by Intel, which I'd use as a kind of control. But I still can't get the Haswell GPU to do multiple dispatches in parallel, and Andrew says it can :). So it really makes you wonder what is actually enabled in current D3D12 drivers.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,215
    Likes Received:
    1,412
    Location:
    London
    I've delayed W10 installation until I'm reasonably sure it's not going to piss me off. Can I be bothered to knock-up an OpenCL version of the test...

    As for the launch overhead, I think that's partially related to the test which fills then drains the device queues. Normally games never let device queues run dry. It might be an idea to enqueue a hundred iterations of the 128-kernel scenario, for example, to see the effect.

    Still there's something fishy. If a shadow-buffer pass takes <5ms, then on AMD is that finished before the compute kernel that's supposed to run in parallel even starts? Does D3D allow the developer to synchronise kernel launches?

    Is your test multi-threaded? Should the compute and graphics tasks be running on distinct CPU threads?

    I know very little about the D3D execution model...
     
  17. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    890
    Likes Received:
    309
    I once measured ~0.01 ms for the AMD 295 and ~1.0 ms for the Nvidia 780Ti in a DX11 engine path using passthrough compute shaders directly after a graphics task. Compute dispatch was abysmally slower on Nvidia than on AMD. I don't remember how it was when compute followed compute. This 40ms shouldn't be dispatch latency.
     
  18. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    605
    Likes Received:
    320
    Does anyone here know how to get that result? Did NVIDIA add such a query to the NDA version of NVAPI? -_-
     
  19. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,708
    Likes Received:
    6,597
    An AMD official weighs in on the Oxide Games employee's post:

     
  20. hesido

    Regular

    Joined:
    Mar 28, 2004
    Messages:
    553
    Likes Received:
    85
    Unfortunately, with the current market share, Nvidia not being capable of async compute doesn't hurt Nvidia. It hurts AMD, which is supposed to gain from async compute, because it somewhat reduces the incentive for developers to use these techniques. Sorry for the business-related side of things in this technical thread..
     
    pharma likes this.