SiS Xabre pipeline shenanigans?

I've replied to your points in various places, so let me collate them here:

  • Addressing...

    Dave H said:

    I have this response (to the empty space next to someone I'm not talking to... :p ).
  • Addressing...
    Dave H said:
    ...
    And in general I'm afraid I don't think "N shixel pipelines" is ever a useful term without discussing what constitutes each pipeline. Again, with fixed-function pipelines, the only relevant constituent has been the number of TMUs. (And the amount of filtering they can do. And, for MSAA, the number of z-units. And maybe something else I'm forgetting.) But with shixel pipelines, you also need to talk about the ALU resources in each pipeline as well. And if you want to be accurate, you need to talk about them in some detail. I'm afraid that, for PS 2.0 type shaders at least, counting up the number of pipelines will have nearly zero bearing on overall performance, as the only thing it really limits is the number of pixels issued or retired on any given clock, and these limits are never going to be a limiting factor. (Well it also limits texturing efficiency to the degree that you're reading e.g. an odd number of textures with a 4x2 vs. an 8x1 organization. But we already have 4x2 and 8x1 as terms.)

    It's sort of like...it's sort of like we're trying to measure the area of a rectangle and you're just telling me the width.
    ...

    I share my thoughts in these comments. Also, addressing your triangle-edge comments...

    demalion said:
    ...
    Note also that it is my understanding that the R300 acts like an 8-proxel-pipeline lock-step processor...I think it would be a matter of the organization of the pixels being rendered (4x2? 2x4? I could swear someone actually said at some point) and statistics whether it ever performed lower than the "4x2" as you are calling it for non-branching shaders.

    I'll point out further that it seems to me the difference is more an issue of how the pixel coverage is organized. The R300 processes more pixels, and will not statistically be worse than the nv30 (unless it is organized 8 across and 1 down, I think...but it is 4 by 2 AFAIK). But since it is, by my assumption and my term (perhaps there is a better one), lock-step (which could be ameliorated as another benefit of conditional pixel shading of arbitrary length, methinks), it could statistically suffer more waste. That doesn't give 4 pipelines an (overall) advantage compared to 8 at all, it just makes them more efficient per pipeline. "1x1" is perfectly efficient for dispatching shader execution, but is it superior to other organizations?
    I do think (in my complete lack of hardware designing experience in the field :LOL: ) the best case would be to have independent "1x1" shader handling, replicated however many times and in whatever organization desired, or serving whatever limitations existed in the execution of the design. I had thought at the beginning the nv30 might offer that, but unfortunately it doesn't seem to, at least with current drivers (anyone want to run some pixel shaded line benchmarks? :p). I'm not sure if that is planned for the nv40, though they seem to be heading in that direction, and I actively doubt the nv35 would offer it; the only problem with this for nvidia is that ATI has demonstrated the ability to reinvent more thoroughly in century revisions than nvidia has in their decade revisions. Even then, it would likely take very specific circumstances to show a benefit, and I'm not sure (not knowing how much complexity it would add) if it is worthwhile unless it can be utilized to facilitate other improvements at the same time.

    Hmm...come to think of it, has anyone completely tested to eliminate the possibility of independent pixel shading assignment for the 4 pipelines, or proven/disproven that the nv30 offers this advantage with current benchmarks?
  • Addressing...
    Dave H said:
    ...[Wait a sec: just to be sure, by "proxel"(/"shixel") we are counting the number of fragments shaded in parallel, not the number of ops applied per clock, right? Because now that I think about it, the analogy with "texel" would probably suggest the latter definition. Anyways, it still doesn't capture all the necessary constraints.] ...

    I point you to my proxel explanation again:

    demalion said:
    ...The basic idea behind it is to have a "proxel pipeline" relate to "proxel fillrate" as "pixel pipelines" used to be related to pixel fillrate. It is in the "fine" tradition of "texels" and, more recently, "zixels", to offer more opportunities to accurately portray strengths and weaknesses of a design.
    This could allow 4x? notations that actually made sense for shading (see below), but until people got over their laziness and performed the multiplication themselves, it would most likely be best to simply count the pipelines or focus on proxel fillrate. The nv30 would still have problems with its architecture expressed in proxel pipelines (they couldn't play as fast and loose with the definition of proxel), but that's where proxel fillrate (similar to texel fillrate) comes in. Also, a valid case could be made for calling the nv30 "8x0.5", but simply calling it 8 pipelines would be inaccurate.

    The proposed measurements were: minimum proxel fillrate, which illustrates worst-case behavior (for the nv30 this would be fp32/texturing, I think); maximum proxel fillrate (for the nv30, intermixed integer and fp16 with no texturing access), which indicates the best case where actual calculations occur (again, see my mention of completeness and think of the z-only shader performance figures); and a standardized measurement (which would likely resemble pocketmoon's benchmark testing examples), which would let NV30 (and, hopefully more significantly, NV35) optimizations and R300 >1-op-per-clock circumstances be represented in a way related to real-world usage.
    ...

    The notations I'm discussing are proxel pipeline notations. Note that the standardized proxel measurement figure I mention might facilitate a notation, but I think it would be too confusing given the other factors affecting execution, so the point of "proxel" is to be used only for "proxel pipeline count" and for the "proxel fill rate" specifications I list. Quite simply, the variations in calculation performance possible don't allow the concept to be successfully simplified into the full "XxY" notation as for pixel pipelines. While less ubiquitous for its purpose than "pixel", those stipulations allow it to retain the simplicity of comparison...more points and parallels to the pixel/texel relationship are offered in that post. (A small sketch of the proxel fillrate computation follows after this list.)
  • Addressing...
    Dave H said:
    ...
    But if we combine the fact that I don't want to talk about shader performance because there are too many other variables with the fact that I only want to talk about performance hits that actually mean something, I think you end up almost necessarily in a situation with trilinear and/or AF. Unless you turn on MSAA or something, but in that case you're going to be bandwidth limited, so no point in blaming anything on low fillrate!
    ...

    I already addressed that in isolation here, if my understanding of both these comments and your prior comments is accurate.
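As a rough illustration of the "proxel fillrate" idea above (and of Dave H's rectangle analogy, where pipeline count is only the "width"), here is a minimal sketch. The per-pipeline ops-per-clock figures and the clock rate are made-up placeholders for illustration, not actual NV30 or R300 specs:

Code:
# Sketch of "proxel fillrate": shader throughput is the pipeline count times
# the ops each pipeline can retire per clock, times the clock rate. A bare
# pipeline count is only the "width" of the rectangle; the ops/clock figure
# is the "height". All numbers below are hypothetical placeholders.

def proxel_fillrate(pipelines, ops_per_pipeline_per_clock, core_mhz):
    """Shader ops per second = pipelines * ops/pipeline/clock * clock rate."""
    return pipelines * ops_per_pipeline_per_clock * core_mhz * 1_000_000

# Worst-case vs. best-case figures for a made-up 4-pipeline part at 500 MHz:
worst = proxel_fillrate(4, 0.5, 500)  # e.g. fp32 ops interleaved with texturing
best = proxel_fillrate(4, 2.0, 500)   # e.g. co-issued integer and fp16 ops

print(f"minimum proxel fillrate: {worst / 1e9:.1f} Gops/s")  # 1.0 Gops/s
print(f"maximum proxel fillrate: {best / 1e9:.1f} Gops/s")   # 4.0 Gops/s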

I'll note that my point in addressing your comments is that pointing out "it doesn't matter given the rest of the nv30 design" is something I would agree with when the pipeline count is indicative of either the actual pixel pipeline count or the performance of the rest of the nv30 design. It only has to do one of those for the pipeline count to be justified in my mind. That is, if the "proxel pipeline" count was effectively 8 (the hope of those expecting very significant performance gains in the future for the nv30) and the pixel pipeline count was 4, I might still be proposing "proxel" to avoid confusion, but I wouldn't be finding fault with nvidia for calling it an 8 pipeline part (at least, I wouldn't still be finding fault with them). That's what I meant by my comments about the nv35: if it had 8x1 pixel pipelines but still had insufficient "proxel" pipelines to match, I wouldn't be criticizing nvidia for calling it 8x1, though I would likely be criticizing the nv35, in that case, for being an underperforming part at executing shaders.

Oh, and to answer another question, I'd say the point of terms such as those the Xabre uses is to be a substitute for actual specs. As I said, just because some people don't read the actual specs doesn't mean it is OK to put anything you want there (but we agree on that in any case). BTW, I feel I must again point out that there are people smart enough to recognize such terms as pure BS, even without fully understanding the (supposedly) factual specs elsewhere, and who are able to critically compare the figures presented in the specs. Heck, that's the kind of person I was when I was beginning to learn about computers, and would still be if I hadn't had the time and desire to progress further.
 
1) 4x2 is more flexible than 8x1 in the sense that it gets a reduced triangle-edge penalty. (Because each "pixel cohort" has to be from the same triangle, and in general is taken in a specific rectangular pattern as I understand it; pixels outside the triangle mean wasted pipelines.) This gets more important with small triangles and low resolutions (i.e. in future games), so it can even be seen as being forward-looking in some way. Point is, there are performance benefits in common situations as well as performance drawbacks.
It seems like demailion addresses this, but I'll do so in a shorter post for those that (like myself) sometimes get lost in longer posts.

The above statement would be correct if all 4 or all 8 "pipelines" were tied together. In reality they probably are not. R300 might have 8 1x1 pipelines, 4 2x1 pipelines, etc. These interior pipelines will be fed by FIFOs whose purpose is to squash pipeline bubbles resulting from small triangles.
I only know the details of one architecture that had 4 marketing pipelines; internally it was split up into two 2-pixel pipelines, so it operated in a SIMD fashion on 2 pixels' worth of data. I think demailion is guessing that R300 is 2 4x1 pipelines.
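To put rough numbers on the triangle-edge penalty being discussed, here is a toy sketch that rasterizes a small triangle with two block shapes: a 4-pixel 2x2 block versus an 8-pixel 4x2 block. The block shapes and the triangle are illustrative assumptions, not the actual R300 or NV30 rasterization patterns, and decoupled pipelines fed by FIFOs (as described above) would hide much of this waste:

Code:
# Toy illustration of the triangle-edge penalty: when pixels are dispatched in
# fixed rectangular blocks, any block touching the triangle still occupies all
# of its pipeline slots, so pixels falling outside the triangle are wasted.
# The block shapes and the triangle here are illustrative assumptions only.

def inside(x, y):
    # A small right triangle with both legs 6 pixels long.
    return x >= 0 and y >= 0 and (x + y) < 6

def wasted_slots(block_w, block_h, screen=8):
    waste = 0
    for by in range(0, screen, block_h):
        for bx in range(0, screen, block_w):
            covered = [inside(bx + i, by + j)
                       for j in range(block_h) for i in range(block_w)]
            if any(covered):                    # block gets dispatched at all
                waste += covered.count(False)   # slots spent on outside pixels
    return waste

print("2x2 blocks (4 pixels/clock) waste", wasted_slots(2, 2), "slots")  # 3
print("4x2 blocks (8 pixels/clock) waste", wasted_slots(4, 2), "slots")  # 11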
 
WHAT!? I'm aghast that it's not 8 separate pipelines!? I've been duped and I'm morally outraged.

Especially now that my 9500 performs so much worse than it did before I knew that.

;)
 
To support the PS2.0 dsx/dsy instructions and do proper mipmapping with dependent texturing, you pretty much need to operate on pixels in groups of 4 (2x2 pixels). I would guess myself that the R300 is internally organized as two 4-pipeline cores working more or less independently of each other (possibly enabling it to render 2 polygons per clock in short bursts), especially considering how easy it was for ATI to just turn off 4 pipelines in the 9500np.
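For what it's worth, here is a minimal sketch of why derivative instructions push the hardware toward 2x2 groups: dsx/dsy can be taken as simple differences between a pixel and its horizontal or vertical neighbour in the same quad. This is a simplified model for illustration, not the exact PS2.0 hardware scheme:

Code:
# Why 2x2 quads help: dsx/dsy-style derivatives (used to pick mip levels, even
# for dependent texture reads) can be formed as differences between neighbouring
# pixels in the same 2x2 group. Simplified model, not the exact hardware scheme.

def quad_derivatives(quad):
    """quad[y][x] holds one shader value for each pixel of a 2x2 group."""
    dsx = [[quad[y][1] - quad[y][0]] * 2 for y in range(2)]                # x-neighbour difference, per row
    dsy = [[quad[1][x] - quad[0][x] for x in range(2)] for _ in range(2)]  # y-neighbour difference, per column
    return dsx, dsy

# Example: a texture coordinate growing by 0.25 per pixel in x and 0.5 in y.
quad = [[0.00, 0.25],
        [0.50, 0.75]]
dsx, dsy = quad_derivatives(quad)
print("dsx:", dsx)  # [[0.25, 0.25], [0.25, 0.25]]
print("dsy:", dsy)  # [[0.5, 0.5], [0.5, 0.5]]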
 
RussSchultz said:
WHAT!? I'm aghast that it's not 8 separate pipelines!? I've been duped and I'm morally outraged.

Especially now that my 9500 performs so much worse than it did before I knew that.

;)
pfffft. :rolleyes:
 
3dcgi said:
...
I think demailion is guessing that R300 is 2 4x1 pipelines.

No, I primarily assumed worst case for the R300 (potentially less efficient than the NV30 sometimes) and just pointed out that this wasn't necessarily always true (like you did)...though I do have a recollection that an ATI employee specifically mentioned it was two 4-pixel sets (EDIT: and, in my recollection, I have the impression of each being 2x2). I didn't find that post and can't recall for sure whether shader operations were specified in that discussion.
But others seem to recall it as well...?

EDIT: Oh, "demailion" is a new one...hmm...what, do I remind you of a disgruntled postal employee? :p I swear, I don't own an Uzi...
 
3dcgi said:
I only know the details of one architecture that had 4 marketing pipelines...

Wow, marketing pipelines.
Those producing marketing pixels? ;)

I foresee we'll talk about mixel/s numbers. :LOL:
 