We could use a process called structure from motion to get depth information out of our single camera, and use that to make a map. Structure from motion uses parallax, the apparent shift of objects between video frames as the camera moves, to estimate the distance to objects in the camera's field of view. To do this, the process has to match points from one video frame to the next, so the images need a lot of detail: textures and edges, which the inside of a house may not have. Wherever points cannot be matched, the map is left with voids (holes) that have to be filled in. There has been a lot of interesting work in this area, and I have seen some promising results.
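To make the parallax idea concrete, here is a minimal sketch of the simplest case. It assumes an ideal pinhole camera that moves purely sideways by a known distance between two frames, and that a feature has already been matched in both frames. The function name and parameters are hypothetical, chosen for illustration. A full structure from motion pipeline also has to estimate the camera's motion, which this sketch takes as given.

```python
def depth_from_parallax(focal_px, baseline_m, shift_px):
    """Estimate the distance (in meters) to a matched point.

    focal_px   -- camera focal length, in pixels (assumed known)
    baseline_m -- sideways distance the camera moved between frames, in meters
    shift_px   -- how far the matched point moved in the image, in pixels
    """
    if shift_px <= 0:
        # No measurable parallax, so the depth is unknown.
        # Points like this become the voids (holes) in the map.
        return None
    # Similar triangles: nearby points shift more than distant ones.
    return focal_px * baseline_m / shift_px


# The camera moved 0.25 m sideways and the feature shifted 50 pixels.
distance = depth_from_parallax(focal_px=600, baseline_m=0.25, shift_px=50)
print(distance)  # 3.0
```

Notice that the depth depends entirely on measuring the pixel shift of a matched point. On a blank wall there is nothing to match, so no shift can be measured and no depth can be computed.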
Here is a survey article on various approaches to Structure from Motion, if you are interested ...