Since the images do not align without the XMPs, there is probably not enough overlap between them.
How many features are detected in the images during the alignment process?
For interiors, it is better to capture the images as:
If you wish, you can share the data with us. I will send you an invitation for the data upload.
