At a very basic level, you mark common points found in multiple images. The software then calculates a vector from each camera view to each of those points, and solves for the relative position of each camera such that the vectors from all the cameras intersect at the same point. Automatic methods use pattern recognition to figure out which points correspond on their own. Once the camera views are matched, it can triangulate the position of any point in the images.
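Here's a minimal sketch of just that last triangulation step, assuming the camera positions have already been solved. In practice the two rays rarely intersect exactly due to noise, so a common trick is to take the midpoint of the shortest segment between them. The function name and example numbers are illustrative, not from any particular library:

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two camera rays.

    c1, c2: camera centers; d1, d2: ray directions toward the
    tracked point. Returns the midpoint of the shortest segment
    between the two rays, since noisy rays rarely intersect exactly.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    w0 = c1 - c2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b  # approaches 0 when the rays are parallel
    if abs(denom) < 1e-9:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    t1 = (b * e - c * d) / denom   # closest-approach parameter on ray 1
    t2 = (a * e - b * d) / denom   # closest-approach parameter on ray 2
    p1 = c1 + t1 * d1
    p2 = c2 + t2 * d2
    return (p1 + p2) / 2

# Two cameras a meter apart, both seeing a point near (0.5, 0, 2)
c1, c2 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
target = np.array([0.5, 0.0, 2.0])
print(triangulate_midpoint(c1, target - c1, c2, target - c2))
# -> [0.5 0.  2. ]
```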
The big problem with photogrammetry, though, is that it requires visible features that can be tracked. If you put a blank whiteboard in front of the camera, for example, it won't be able to detect any depth at all. That's why other solutions project a laser or an infrared grid onto the surface to create points that can be tracked (that's how the Kinect works).