Why do we focus only on 3D images here? What about 2D image generation? Or is 2D simply 3D minus the projection step?


Thinking of image generation as a series of stages in a pipeline is conceptually clean, but are there any performance benefits from taking "bigger steps" that might be messier?