Unsupervised Object Keypoint Learning Using Local Spatial Predictability

In this paper, the authors propose a new approach to unsupervised keypoint learning. The previous SoTA approach, Transporter, relies on motion between frames to learn keypoints. The authors point out possible flaws of that training procedure and introduce a new one based on the error of predicting an area from its surroundings. The proposed pipeline is tested on Atari environments for RL purposes. The authors name their method PermaKey, for "Prediction ERror MAp based KEYpoints".

Method

Modules of the proposed system:

Spatial Feature Embedding. The authors model features of the input images with a VAE. They then compare embeddings from different encoder layers to balance the information carried by last-layer features against the localisation ability of first-layer features.
Local Spatial Prediction Task. The authors train a network to predict each feature vector from its neighbouring feature vectors, using the squared difference between the predicted and actual vectors as the loss. This step is designed to find areas that are poorly determined by their neighbourhood. Those areas are expected to correspond to important interactive objects.
Keypoints Extraction. The authors employ PointNet to locate keypoints. The network is trained to reconstruct the error maps from the previous step. PointNet compresses the error map into a mixture of Gaussians with a fixed covariance matrix, so that each Gaussian represents a different high-error area and, therefore, a distinct keypoint.
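The last two modules can be illustrated with a small NumPy sketch. It is not the authors' implementation: the trained predictor is replaced by a hypothetical stand-in (the mean of the 8 neighbours), the keypoint is taken as the argmax of the error map instead of being produced by a trained PointNet, and the names `local_prediction_error`, `render_gaussians`, `predict_fn` and `sigma` are assumptions made for illustration only.

```python
import numpy as np


def local_prediction_error(features, predict_fn=None):
    """Squared error between each feature vector and a prediction
    made from its 8 spatial neighbours.

    features:   array of shape (H, W, C).
    predict_fn: hypothetical stand-in for the trained predictor; maps the
                stacked neighbours (H, W, 8, C) to predictions (H, W, C).
                Defaults to the neighbour mean (an assumption, not the paper's network).
    """
    h, w, _ = features.shape
    # Replicate border values so that every position has 8 neighbours.
    padded = np.pad(features, ((1, 1), (1, 1), (0, 0)), mode="edge")
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    neighbours = np.stack(
        [padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] for dy, dx in offsets],
        axis=2,
    )  # shape (H, W, 8, C)
    if predict_fn is None:
        prediction = neighbours.mean(axis=2)
    else:
        prediction = predict_fn(neighbours)
    # Sum of squared differences over channels gives one error value per location.
    return ((features - prediction) ** 2).sum(axis=-1)


def render_gaussians(keypoints, shape, sigma=1.0):
    """Render one isotropic Gaussian map (fixed covariance) per keypoint.

    keypoints: sequence of (row, col) pairs; returns an array of shape (K, H, W).
    """
    keypoints = np.asarray(keypoints, dtype=float)
    h, w = shape
    rows = np.arange(h)[None, :, None]
    cols = np.arange(w)[None, None, :]
    ky = keypoints[:, 0][:, None, None]
    kx = keypoints[:, 1][:, None, None]
    return np.exp(-((rows - ky) ** 2 + (cols - kx) ** 2) / (2.0 * sigma ** 2))


# Toy feature map: uniform background with a single "object" at (2, 2).
features = np.zeros((5, 5, 2))
features[2, 2] = 1.0

error_map = local_prediction_error(features)
# Stand-in for keypoint extraction: take the location of the largest error.
peak = np.unravel_index(error_map.argmax(), error_map.shape)
heatmaps = render_gaussians([peak], error_map.shape)
```

In this toy example the background is perfectly predictable from its neighbours, so the error concentrates on the isolated object, which is the behaviour the method relies on.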

To test the method on Atari environments, the authors follow the procedure from the Transporter paper with one notable alteration. While the original procedure used a CNN to embed keypoint information for the RL agent, the authors propose to use a GNN. Each node of this GNN corresponds to one keypoint and receives as input the average of the encoder features, weighted by the corresponding Gaussian.
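The Gaussian-weighted averaging that produces the node inputs can be sketched as follows. This is a minimal illustration under assumptions: the function name `keypoint_node_features`, the array shapes, and the `eps` stabiliser are hypothetical, and the GNN itself is not shown.

```python
import numpy as np


def keypoint_node_features(features, heatmaps, eps=1e-8):
    """Average encoder features under each keypoint's Gaussian map.

    features: (H, W, C) encoder feature map.
    heatmaps: (K, H, W) one Gaussian map per keypoint.
    Returns a (K, C) array, one feature vector per graph node.
    """
    # Normalise each map so that its weights sum to (almost exactly) one.
    weights = heatmaps / (heatmaps.sum(axis=(1, 2), keepdims=True) + eps)
    # Weighted sum over spatial positions for every keypoint and channel.
    return np.einsum("khw,hwc->kc", weights, features)


# Toy check with one-hot "heatmaps": each node simply picks out one location.
features = np.arange(32, dtype=float).reshape(4, 4, 2)
heatmaps = np.zeros((2, 4, 4))
heatmaps[0, 1, 1] = 1.0
heatmaps[1, 2, 3] = 1.0

nodes = keypoint_node_features(features, heatmaps)
```

With wider Gaussians each node would instead blend the features around its keypoint, which is presumably what lets the agent see local appearance as well as position.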

Experiments

First, the authors show keypoints that look more plausible than those of Transporter:

Predicted keypoints and feature vector prediction maps shown

Moreover, they report quantitatively better results:

Mean score and std reported for different cases

As an additional test, the authors add color stripes as noise to the input image. They show that their method is more robust to this disturbance, while Transporter mostly reacts to the distractors:

Qualitative keypoints overview

Quantitative ablation results