I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm, matching the features detected in the new images against the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
- Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and downfalls of this approach.
- "Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously intensive.
- Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at
-30,10, and40degrees ( from surface normal), use the descriptor from the10degree image. This, to me, seems like the most promising solution.
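To make this concrete, here is a rough sketch of what I mean by the first two options. It is purely illustrative; I'm assuming each 3D point stores the list of 128-dimensional SIFT descriptors collected during SfM, and the variable names are just placeholders:

```python
import numpy as np

# point_descriptors: list of (k_i, 128) arrays, one per 3D point, holding the
# SIFT descriptors of all images the point was observed in (hypothetical layout).

def mean_descriptors(point_descriptors):
    """Option 1: represent each 3D point by the average of its descriptors."""
    return np.vstack([d.mean(axis=0) for d in point_descriptors]).astype(np.float32)

def pinned_descriptors(point_descriptors):
    """Option 2: keep every descriptor and remember which 3D point owns it."""
    descs = np.vstack(point_descriptors).astype(np.float32)
    owner = np.concatenate(
        [np.full(len(d), i) for i, d in enumerate(point_descriptors)])
    return descs, owner   # owner[j] = index of the 3D point behind descriptor row j
```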
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SfM systems are rotation and scale invariant (and, to some extent, robust to intensity changes). That is why we are able to match them from different viewpoints in the first place. So, in general, it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SfM was done correctly, the descriptors of the reprojections of a 3D point from your point cloud in any of its observations should be very close, so you can use any of them [1].
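(You can check this on your own data by looking at the spread of the descriptors within a single track; `track_descriptors` below is a hypothetical (k, 128) array holding the SIFT descriptors of all k observations of one 3D point.)

```python
import numpy as np

def descriptor_spread(track_descriptors):
    """Max pairwise L2 distance between the descriptors of one 3D point's track.

    If the SfM matching was correct, this should be small compared to the
    distance to descriptors of unrelated points.
    """
    d = track_descriptors.astype(np.float32)
    diff = d[:, None, :] - d[None, :, :]          # (k, k, 128) pairwise differences
    return np.linalg.norm(diff, axis=-1).max()
```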
Also, it seems to me that you are trying to directly match the 2D points to the 3D points. From a computational point of view, I think this is not a very good idea, because by matching 2D points with 3D ones, you lose the spatial information of the images and have to search for matches in a brute-force manner, which in turn can introduce noise. But if you do your matching from image to image and then propagate the results to the 3D points, you will be able to enforce priors (if you roughly know where you are, e.g. from an IMU, or if you know that your images are close), you can restrict the neighborhood where you look for matches in your images, and so on. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do that if you haven't done any 2D/2D matching, but only 2D/3D matching? (A rough sketch of this propagation follows below.)
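Here is a minimal sketch of this "match 2D/2D, then propagate" idea with OpenCV, assuming your SfM pipeline can export, for one of its images I_1, the keypoints, the descriptors, and a map from keypoint index to triangulated 3D point (all the names below are placeholders, not anything your pipeline necessarily provides):

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def pose_from_sfm_image(img_0, kps_1, desc_1, kp_to_point3d, points3d, K):
    """Match query image I_0 against one SfM image I_1, propagate the 2D/2D
    matches to 2D/3D correspondences, and run PnP.

    kp_to_point3d : dict {keypoint index in I_1 -> row index into points3d}
    points3d      : (N, 3) array of triangulated 3D points
    K             : (3, 3) intrinsics of the query camera
    """
    kps_0, desc_0 = sift.detectAndCompute(img_0, None)

    # 2-NN matching with Lowe's ratio test (standard 2D/2D matching).
    knn = matcher.knnMatch(desc_0, desc_1, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    # Propagate: keep a 2D/2D match only if the I_1 keypoint was triangulated.
    pts2d, pts3d = [], []
    for m in good:
        idx = kp_to_point3d.get(m.trainIdx)
        if idx is not None:
            pts2d.append(kps_0[m.queryIdx].pt)
            pts3d.append(points3d[idx])

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), K, None, reprojectionError=4.0)
    return ok, rvec, tvec, inliers
```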
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from your SfM, etc.). As an example, let's call your candidate image I_0, and let's call the images from your SfM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2D point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3D point Q. To ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2D/3D match between q_0 and Q, and so on (this check is sketched below).

I don't have enough information about your data and your application, but I think that, depending on your constraints (real-time or not, etc.), you could come up with some variation of the above. The key idea, as I said previously, is to try to match from frame to frame and then propagate to the 3D case.
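A sketch of that consistency check, with the same hypothetical data layout as above (`pose_2` is the known SfM pose of I_2 and `matches_02` the ratio-tested matches between I_0 and I_2):

```python
import cv2
import numpy as np

def reprojection(Q, rvec, tvec, K):
    """Project the 3D point Q into an SfM image whose pose (rvec, tvec) is known."""
    q, _ = cv2.projectPoints(np.float32([Q]), rvec, tvec, K, None)
    return q.ravel()

def match_is_consistent(q0_idx, Q, matches_02, kps_2, pose_2, K, tol_px=4.0):
    """Keep the 2D/3D match (q_0, Q) only if the point q_0 is matched to in I_2
    lands close to q_2, the reprojection of Q in I_2."""
    q2 = reprojection(Q, *pose_2, K)
    for m in matches_02:
        if m.queryIdx == q0_idx:                        # q_0's match in I_2
            return np.linalg.norm(np.float32(kps_2[m.trainIdx].pt) - q2) < tol_px
    return False    # q_0 has no match in I_2: be conservative and drop it
```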
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's call F(s_1, ..., s_n) your aggregated descriptor (which can be an average or a concatenation of the SIFT descriptors s_i from their corresponding images I_i, etc.). When matching s_0 with F, you will only want to use the subset of the s_i that belong to images with viewpoints close to I_0 (because of the 30-degree problem that you mention, although I think it should be 50 degrees). That means you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't need PnP). As a result, you can't really determine this weight. Therefore I think there are two conclusions/options here:

- SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor; there is some literature on the subject.
- Try to keep some visual information in the form of "key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway; just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2D matches to the 3D case. (A sketch of picking such key-frames follows below.)
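As an illustration of that last option, one simple way to pick key-frames that are well distributed pose-wise is a greedy farthest-point selection over the SfM camera centers (a sketch only; `camera_centers` is a hypothetical (n, 3) array of camera positions, and you could do the same on viewing directions instead):

```python
import numpy as np

def select_keyframes(camera_centers, k):
    """Greedily pick k SfM cameras that are spread out in space."""
    n = len(camera_centers)
    selected = [0]                                   # start from an arbitrary camera
    dist = np.linalg.norm(camera_centers - camera_centers[0], axis=1)
    for _ in range(1, min(k, n)):
        nxt = int(np.argmax(dist))                   # farthest from the current selection
        selected.append(nxt)
        dist = np.minimum(
            dist, np.linalg.norm(camera_centers - camera_centers[nxt], axis=1))
    return selected
```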
If you only match between the 2D points of your query and the 3D descriptors without any form of consistency check (such as the one I proposed earlier), you will introduce a lot of noise...

tl;dr: I would keep some images.
[1] Since you say that you obtain your 3D reconstruction from an SfM pipeline, some of them are probably considered inliers and some outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and their position has been refined multiple times, so you can trust any of their descriptors.