How might we efficiently ensure that the triangles are in front-to-back or back-to-front order? My favorite fantasy idea would be to use a lock-free linked list, but can anybody think of some more down-to-earth ideas?
dsaksena
Hi @Bryce, I am assuming that by a lock-free linked list you mean launching a thread per triangle and keeping a lock-free linked list for each (x, y).
Now you take the triangle's vertices, figure out all the (x, y) samples it covers and the corresponding z values, then insert each element at the right sorted location in the linked list according to its z.
This sounds pretty well optimized to me; I would just suggest calculating and maintaining the alpha values in the linked list itself, which might be a further optimization.
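(For concreteness, here is a minimal sketch of the per-sample structure described above, with hypothetical names: one depth-sorted list per (x, y) sample. This version is single-threaded; an actual lock-free variant would replace the pointer updates with atomic compare-and-swap on the `next` links.)

```cpp
// Hypothetical fragment record for one (x, y) sample.
struct FragmentNode {
    float z;                      // depth of the fragment
    float r, g, b, a;             // color and alpha
    FragmentNode* next = nullptr;
};

// Insert a fragment so the per-sample list stays sorted front-to-back
// (smallest z first). Single-threaded sketch; a lock-free variant would
// use atomic compare-and-swap on the 'next' pointers instead.
void insert_sorted(FragmentNode** head, FragmentNode* frag) {
    FragmentNode** cur = head;
    while (*cur != nullptr && (*cur)->z < frag->z) {
        cur = &(*cur)->next;
    }
    frag->next = *cur;
    *cur = frag;
}
```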
jezimmer
One of the problems with linked lists (in general, but especially in graphics) is that they have terrible locality. This means that oftentimes they achieve the optimal big-O bounds but are still terribly slow in practice. This gets worse when you try to optimize for a GPU, where all you really have are contiguous hunks of memory; using linked lists there is costly.
With this in mind, what we'd like to do is replace the linked list at each (x, y) coordinate with something a little more cost-effective. One strategy I can think of would be to somehow guess how deep the linked lists will be on average, and allocate a 3D array where the depth of this buffer is some constant factor of that average length (see the sketch below). From here, there are a couple of options:
set your constant factor high enough so that you handle, say, 90% of the triangles in that space
keep track of some extra booleans indicating that the number of triangles covering a sample has exceeded the allocated buffer space, and handle those samples specifically with linked lists (so we fall back to linked-list computation infrequently).
I'm not quite sure how you'd go about sampling the expected average depth of the buffer, but I'd bet you could do it well enough offline by sampling scenes that you anticipate having to render, and computing the depth on those scenes.
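(A minimal sketch of the fixed-depth buffer idea above, assuming a hypothetical `FragmentBuffer` whose `max_depth` was chosen offline as some constant factor of the measured average list depth. Samples that overflow just set a flag so they can be re-processed with a slower linked-list fallback, which is not shown.)

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fragment record stored in the flat, contiguous 3D buffer.
struct Fragment { float z, r, g, b, a; };

struct FragmentBuffer {
    int width, height;
    int max_depth;                      // constant factor * expected average list depth
    std::vector<Fragment> frags;        // width * height * max_depth entries, contiguous
    std::vector<int>      count;        // fragments stored so far at each sample
    std::vector<uint8_t>  overflowed;   // 1 if a sample exceeded max_depth

    FragmentBuffer(int w, int h, int d)
        : width(w), height(h), max_depth(d),
          frags(size_t(w) * h * d),
          count(size_t(w) * h, 0),
          overflowed(size_t(w) * h, 0) {}

    // Append a fragment at sample (x, y); mark overflow if the slot is full so
    // that this sample can fall back to a linked list later.
    void push(int x, int y, const Fragment& f) {
        size_t s = size_t(y) * width + x;
        if (count[s] < max_depth) {
            frags[s * max_depth + count[s]] = f;
            ++count[s];
        } else {
            overflowed[s] = 1;
        }
    }
};
```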
skygao
@jezimmer Good point. You would definitely want contiguous memory access whenever you can. Particularly in this case, we are not doing too much math and this whole process is very likely going to be bandwidth limited. Putting linked lists aside, an alternative way to deal with this problem is Order Independent Transparency.
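(One family of approximate OIT techniques, in the spirit of weighted, blended OIT, works roughly like the sketch below; the names are illustrative, not from the lecture. Fragments are accumulated in whatever order they arrive, a separate term tracks how visible the background remains, and each sample is resolved once at the end, so no per-sample sorting or unbounded storage is needed.)

```cpp
#include <algorithm>
#include <cmath>

// Running per-sample state; fragments may arrive in ANY order.
struct OITSample {
    float accum[4] = {0, 0, 0, 0};  // sum over fragments of w * (a*r, a*g, a*b, a)
    float revealage = 1.0f;         // product of (1 - a): how visible the background stays
};

// Depth-based weight; any smoothly decreasing function of z in [0, 1] works here.
static float weight(float z, float a) {
    return a * std::max(1e-2f, 3e3f * std::pow(1.0f - z, 3.0f));
}

void add_fragment(OITSample& s, float r, float g, float b, float a, float z) {
    float w = weight(z, a);
    s.accum[0] += r * a * w;
    s.accum[1] += g * a * w;
    s.accum[2] += b * a * w;
    s.accum[3] += a * w;
    s.revealage *= (1.0f - a);
}

// Resolve against the opaque color already in the framebuffer.
void resolve(const OITSample& s, const float bg[3], float out[3]) {
    float denom = std::max(s.accum[3], 1e-5f);
    for (int i = 0; i < 3; ++i)
        out[i] = (s.accum[i] / denom) * (1.0f - s.revealage) + bg[i] * s.revealage;
}
```

The result is only approximate; that is exactly the trade-off, in exchange for constant space per sample, that comes up again further down in this thread.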
BryceSummers
@jezimmer Those sound like some good ideas. How will you deal with the excessive memory movement inherent in sorting values in an array, and can you get per-triangle parallelism with one array per sample?
Another idea is to take out our triangle-cutting scissors and cut the triangles up into pieces that don't overlap; then the order is well defined and we can just presort the pieces into front-to-back order.
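(A sketch of the presorting half of that idea, assuming the scissor step has already produced pieces that never interpenetrate or form occlusion cycles; the splitting itself, e.g. with a BSP tree, is the hard part and is omitted. Once the pieces don't overlap, sorting on centroid depth gives a valid order.)

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v[3]; };

// Camera-space depth of the piece's centroid (assuming z grows away from the
// camera). This is only a safe sort key once the pieces no longer overlap in depth.
static float centroid_depth(const Triangle& t) {
    return (t.v[0].z + t.v[1].z + t.v[2].z) / 3.0f;
}

// Presort the non-overlapping pieces front-to-back before rasterizing.
void sort_front_to_back(std::vector<Triangle>& pieces) {
    std::sort(pieces.begin(), pieces.end(),
              [](const Triangle& a, const Triangle& b) {
                  return centroid_depth(a) < centroid_depth(b);
              });
}
```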
kayvonf
I like this discussion. There are two attributes of the traditional Z-buffer that make it very attractive for performance-centric programming.
O(1) cost per operation
Fixed-size memory footprint. The size of the data structure is only a function of the size of the output image, not of the size of the scene.
In real-time systems, predictability is very important. An algorithm with pathological cases might execute very efficiently in many situations, but then have an edge case that makes the frame rate lag under specific conditions. (This unpredictability is bad; e.g., consider how upset you'd be if a game dropped dramatically in frame rate from certain views!) When using a Z-buffer, as a programmer I know that regardless of the size of the scene, the data structures needed to handle occlusion and the final output buffer always have the same size, and thus can be accessed with predictable performance and always fit in memory.
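(For concreteness, a minimal sketch of the classic Z-buffer update that has both properties: one compare and at most two writes per fragment, against buffers whose size depends only on the output resolution. Names here are illustrative.)

```cpp
#include <limits>
#include <vector>

// Occlusion state sized purely by the output image, never by the scene.
struct ZBuffer {
    int width, height;
    std::vector<float> depth;   // one depth per sample
    std::vector<float> color;   // RGB per sample

    ZBuffer(int w, int h)
        : width(w), height(h),
          depth(size_t(w) * h, std::numeric_limits<float>::infinity()),
          color(3 * size_t(w) * h, 0.0f) {}

    // O(1) work per fragment: one compare, then (maybe) one depth write and one color write.
    void shade_fragment(int x, int y, float z, const float rgb[3]) {
        size_t s = size_t(y) * width + x;
        if (z < depth[s]) {
            depth[s] = z;
            for (int i = 0; i < 3; ++i) color[3 * s + i] = rgb[i];
        }
    }
};
```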
Of course, this presents a major challenge with transparency, since transparency requires either:
Fixed space per sample, but objects must be rendered in a particular order, or
Buffering all the fragments covering a sample and then sorting them prior to compositing. This is the approach taken with the A-buffer, which maintains a per-sample list of covering fragments. But note that this violates desirable property #2: it requires potentially unbounded space per sample.
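(A minimal sketch of the second option, an A-buffer-style resolve, with hypothetical names: each sample buffers every covering fragment, then sorts by depth and composites front-to-back with over at resolve time. The unbounded per-sample `std::vector` is exactly the violation of property #2.)

```cpp
#include <algorithm>
#include <vector>

// One buffered fragment: depth plus straight-alpha color.
struct Frag { float z, r, g, b, a; };

// Resolve a single sample: sort the (unbounded) fragment list by depth, then
// composite front-to-back. The fragments carry straight alpha; the running
// result (r, g, b, a) is accumulated in premultiplied form.
void resolve_sample(std::vector<Frag>& frags, float out[4]) {
    std::sort(frags.begin(), frags.end(),
              [](const Frag& p, const Frag& q) { return p.z < q.z; });

    float r = 0, g = 0, b = 0, a = 0;
    for (const Frag& f : frags) {
        float t = (1.0f - a) * f.a;   // how much of this fragment shows through
        r += t * f.r;
        g += t * f.g;
        b += t * f.b;
        a += t;
    }
    out[0] = r; out[1] = g; out[2] = b; out[3] = a;
}
```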
Here's a list of recent work that gives approximately correct transparency results for any draw order, but with constant space per sample (note the key point is "approximate"):
So I'm not sure why we have to perform a depth test here if we're using alpha compositing. We're assuming that our triangles are rasterized in order, correct? And if that's the case, then shouldn't there be no depth test, because we're relying on the alpha values to account for occlusion / depth?
kayvonf
@dvernet: Good question. This code is assuming that all opaque surfaces have been rasterized first (prior to rasterizing the transparent geometry in order). You'll see this end-to-end algorithm discussed on the next slide. If you render all the opaque surfaces before any of the transparent surfaces, you'll need to perform the depth check. I went ahead and gave you code on this slide that reflects what's done in practice.
Question 1: Why might drawing all the opaque stuff first be a good idea?
Question 2: Notice that there is a depth check, but no depth write. Why?
dvernet
Ahhh, I see.
Because it saves you the overhead of computing $\texttt{over}$ for transparent fragments that will just end up being covered by an opaque sample.
Once we know that a transparent sample isn't covered by an opaque sample, depth is irrelevant because we're using alpha compositing.
@dvernet: correct.
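(Putting the two questions and dvernet's answers together, here is a sketch of the per-fragment logic for the transparent pass, with hypothetical names; it assumes the opaque pass has already filled depth and color, the transparent geometry is drawn back-to-front, and colors use straight alpha.)

```cpp
// Per-sample state left behind by the opaque pass: depth was written only by
// opaque geometry; color holds the opaque result that transparency blends over.
struct Sample { float depth; float r, g, b; };

// Shade one transparent fragment. Note the depth *test* against the opaque
// depth, but no depth *write*: later (nearer) transparent fragments must still
// be able to blend over this one.
void shade_transparent_fragment(Sample& s, float z,
                                float r, float g, float b, float a) {
    if (z >= s.depth) return;          // behind an opaque surface: skip the 'over' entirely
    s.r = a * r + (1.0f - a) * s.r;    // back-to-front over: C = a*C_src + (1 - a)*C_dst
    s.g = a * g + (1.0f - a) * s.g;
    s.b = a * b + (1.0f - a) * s.b;
    // s.depth intentionally left unchanged.
}
```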