Calculating a point's (u,v) turned out to be much simpler than I had expected. Is this really all there is to it?
That's the basic principle of the simple pinhole model! It can get more complicated, though. For example, real camera lenses produce images with visual distortions and depth-of-field effects, neither of which can be modeled with a simple pinhole camera.
Is there any particular reason why z is the horizontal distance here, rather than the vertical one as is conventional?
The typical convention in computer graphics (and computer vision) is that the plane spanned by the x- and y-axes is parallel to the image plane, and the z-axis represents distance/depth from the camera (i.e., the z-axis points into the screen). I'm not aware of the origin of this convention, however.
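To make the convention concrete, here is a minimal sketch of the projection under it. It assumes a camera at the origin looking along the positive z-axis and an image plane at unit distance (focal length 1); the function name `project_pinhole` is hypothetical and only for illustration.

```python
def project_pinhole(x, y, z):
    """Project a camera-space point (x, y, z) to image coordinates (u, v).

    Assumes the x/y plane is parallel to the image plane, z is the depth
    from the camera, and the focal length is 1 (an assumption of this sketch).
    """
    if z <= 0:
        # Points at or behind the camera have no meaningful projection here.
        raise ValueError("point must be in front of the camera (z > 0)")
    return x / z, y / z


# The same (x, y) offset appears smaller in the image as depth increases.
near = project_pinhole(2.0, 4.0, 2.0)
far = project_pinhole(2.0, 4.0, 4.0)
```

Doubling the depth halves both image coordinates, which is the perspective foreshortening the model is meant to capture.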
Shouldn't v be -y/z because it's inverted?
Perhaps. It is true that the image should be inverted for a pinhole camera (or even a camera with a lens), and negating the value of v would represent this.
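As a rough sketch of the two sign conventions being discussed, the snippet below computes v either as y/z or as -y/z. It assumes a unit focal length and a point in front of the camera; the function name `project_with_flip` and the `flip_v` parameter are hypothetical, chosen only for this illustration.

```python
def project_with_flip(x, y, z, flip_v=False):
    """Return (u, v) for a camera-space point, optionally negating v.

    flip_v=False gives v = y / z (upright image).
    flip_v=True gives v = -y / z (vertically inverted image).
    Assumes focal length 1 and z > 0 (assumptions of this sketch).
    """
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = x / z
    v = -y / z if flip_v else y / z
    return u, v


upright = project_with_flip(2.0, 4.0, 2.0)
inverted = project_with_flip(2.0, 4.0, 2.0, flip_v=True)
```

Which convention to use is mostly a matter of whether you want coordinates on the physical image behind the pinhole or on an upright "virtual" image plane in front of it.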