PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows

1 Introduction

Point clouds are becoming popular as a 3D representation because they can capture a much higher resolution than voxel grids and are a stepping stone to more sophisticated representations such as meshes. Learning a generative model of point clouds could benefit a wide range of point cloud synthesis tasks such as reconstruction and super-resolution, by providing a better prior over point clouds. However, a major roadblock in generating point clouds is the complexity of the space of point clouds. A cloud of points corresponding to a chair is best thought of as samples from a distribution that corresponds to the surface of the chair, and the chair itself is best thought of as a sample from a distribution of chair shapes. As a result, in order to generate a chair according to this formulation, we need to characterize a distribution of distributions, which is under-explored by existing generative models.

In this paper, we propose PointFlow, a principled generative model for 3D point clouds that learns a distribution of distributions: a distribution over shapes, where each shape is itself a distribution of points. Our key insight is that instead of directly parametrizing the distribution of points in a shape, we model this distribution as an invertible parameterized transformation of 3D points from a prior distribution (e.g., a 3D Gaussian). Intuitively, under this model, generating points for a given shape involves sampling points from a generic Gaussian prior and then moving them according to this parameterized transformation to their new location in the target shape, as illustrated in the teaser figure. In this formulation, a given shape is then simply the variable that parametrizes such a transformation, and a category is simply a distribution of this variable. Interestingly, we find that representing this distribution too as a transformation of a prior distribution leads to a more expressive model of shapes. In particular, we use the recently proposed continuous normalizing flow framework to model both kinds of transformations [40, 5, 16].

This parameterization confers several advantages. The invertibility of these transformations allows us to not just sample but also estimate probability densities. The ability to estimate probability densities in turn allows us to train these models in a principled manner using the variational inference framework [27], where we maximize a variational lower bound on the log-likelihood of a training set of point clouds. This probabilistic framework for training further lets us avoid the complexities of training GANs or hand-crafting good distance metrics for measuring the difference between two sets of points. Experiments show that PointFlow outperforms previous state-of-the-art generative models of point clouds, and achieves compelling results in point cloud reconstruction and unsupervised feature learning.

2 Related work

Deep learning for point clouds. Deep learning has been introduced to improve performance on various point cloud discriminative tasks including classification [38, 39, 51, 55], segmentation [38, 43], and critical points sampling [10]. Recently, substantial progress has been made on point cloud synthesis tasks such as auto-encoding [1, 51, 17], single-view 3D reconstruction [12, 22, 29, 31, 13], stereo reconstruction [45], and point cloud completion [54, 53]. Many point cloud synthesis works convert a point distribution into a fixed-size matrix by sampling a pre-defined number of points from the distribution, so that existing generative models are readily applicable. For example, Gadelha et al. [13] apply variational auto-encoders (VAEs) [27] and Zamorski et al. [56] apply adversarial auto-encoders (AAEs) [34] to point cloud generation. Achlioptas et al. [1] explore generative adversarial networks (GANs) [15, 2, 19] for point clouds in both the raw data space and the latent space of a pre-trained auto-encoder. In the above methods, the auto-encoders are trained with heuristic loss functions that measure the distance between two point sets, such as the Chamfer distance (CD) or the earth mover's distance (EMD). Sun et al. [44] apply auto-regressive models [47] with a discrete point distribution to generate one point at a time, also using a fixed number of points per shape.

However, treating a point cloud as a fixed-dimensional matrix has several drawbacks. First, the model is restricted to generate a fixed number of points. Getting more points for a particular shape requires separate up-sampling models such as [54, 53, 52]. Second, it ignores the permutation invariance property of point sets, which might lead to suboptimal parameter efficiency. Heuristic set distances are also far from ideal objectives from a generative modeling perspective since they make the original probabilistic interpretation of VAE/AAE no longer applicable when used as the reconstruction objective. In addition, exact EMD is slow to compute while approximations could lead to biased or noisy gradients. CD has been shown to incorrectly favor point clouds that are overly concentrated in the mode of the marginal point distribution[1].

Some recent works introduce sophisticated decoders consisting of a cascade [51] or a mixture [17] of smaller decoders to map one (or a mixture of) 2D uniform distribution(s) to the target point distribution, overcoming the shortcomings of using a fixed number of points. However, they still rely on heuristic set distances that lack a probabilistic guarantee. Also, their methods only learn the distribution of points for each shape, but not the distribution of shapes. Li et al. [30] propose a "sandwiching" reconstruction objective that combines a variant of the WGAN [2] loss with EMD. They also train another GAN in the latent space to learn the shape distribution, similar to Achlioptas et al. [1]. In contrast, our method is trained end-to-end by maximizing a variational lower bound on the log-likelihood, does not require multi-stage training, and does not have the instability issues common to GAN-based methods.

Generative models. There are several popular frameworks of deep generative models, including generative adversarial networks[15, 2, 23], variational auto-encoders[27, 41], auto-regressive models[35, 47], and flow-based models[8, 40, 9, 25]. In particular, flow-based models and auto-regressive models can both perform exact likelihood evaluation, while flow-based models are much more efficient to sample from. Flow-based models have been successfully applied to a variety of generation tasks such as image generation[25, 9, 8], video generation[28], and voice synthesis[37]. Also, there has been recent work that combines flows with other generative models, such as GAN[18, 7], auto-regressive models[20, 36, 26], and VAEs[26, 46, 6, 40, 46, 5, 16].

Most existing deep generative models aim at learning the distribution of fixed-dimensional variables. Learning the distribution of distributions, where the data consists of a set of sets, is still under-explored. Edwards and Storkey [11] propose a hierarchical VAE named the Neural Statistician that consumes a set of sets. They are mostly interested in the few-shot setting where each set has only a few samples, and they focus on classifying sets or generating new samples from a given set. While our method is also applicable to these tasks, our focus is on learning the distribution of sets and generating new sets (point clouds in our case). In addition, our model employs a tighter lower bound on the log-likelihood, thanks to the use of normalizing flows in modeling both the reconstruction likelihood and the prior.

3 Overview

Consider a set of shapes from a particular class of object, where each shape is represented as a set of 3D points. As discussed in Section 1, each point is best thought of as being sampled from a point distribution, usually a uniform distribution over the surface of the object. Each shape is itself a sample from a distribution over shapes that captures what shapes in this category look like.

Our goal is to learn the distribution of shapes, each itself being a distribution of points. In other words, our generative model should be able to both sample shapes and sample an arbitrary number of points from a shape.

We propose to use continuous normalizing flows to model the distribution of points given a shape. A continuous normalizing flow can be thought of as a vector field in 3D Euclidean space, which induces a distribution of points by transforming a generic prior distribution (e.g., a standard Gaussian). To sample points from the induced distribution, we simply sample points from the prior and move them according to the vector field. Moreover, the continuous normalizing flow is invertible, which means we can move data points back to the prior distribution to compute the exact likelihood. This model is highly intuitive and interpretable, allowing a close inspection of the generative process, as shown in the teaser figure.

We parametrize each continuous normalizing flow with a latent variable that represents the shape. As a result, modeling the distribution of shapes can be reduced to modeling the distribution of the latent variable. Interestingly, we find continuous normalizing flow also effective in modeling the latent distribution. Our full generative model thus consists of two levels of continuous normalizing flows, one modeling the shape distribution by modeling the distribution of the latent variable, and the other modeling the point distribution given a shape.

In order to optimize the generative model, we construct a variational lower bound on the log-likelihood by introducing an inference network that infers a latent variable distribution from a point cloud. Here, we benefit from the fact that the invertibility of the continuous normalizing flow enables likelihood computation. This allows us to train our model end-to-end in a stable manner, unlike previous work based on GANs that requires two-stage training[1, 30]. As a side benefit, we find the inference network learns a useful representation of point clouds in an unsupervised manner.

In Section 4 we introduce some background on continuous normalizing flows and variational auto-encoders. We then describe our model and training in detail in Section 5.

4 Background

4.1 Continuous normalizing flow

A normalizing flow [40] is a series of invertible mappings that transform an initial known distribution into a more complicated one. Formally, let $f_1, \ldots, f_n$ denote a series of invertible transformations we want to apply to a latent variable $y$ with distribution $P(y)$. The output variable is $x = f_n \circ f_{n-1} \circ \cdots \circ f_1(y)$, and its density is given by the change of variables formula:

$$\log P(x) = \log P(y) - \sum_{k=1}^{n} \log \left| \det \frac{\partial f_k}{\partial y_{k-1}} \right|, \tag{1}$$

where $y_k \triangleq f_k \circ \cdots \circ f_1(y)$ (with $y_0 = y$) and $y$ can be computed from $x$ using the inverse flow: $y = f_1^{-1} \circ \cdots \circ f_n^{-1}(x)$. In practice, $f_1, \ldots, f_n$ are usually instantiated as neural networks with an architecture that makes the determinant of the Jacobian easy to compute. The normalizing flow has been generalized from a discrete sequence to a continuous transformation [16, 5] by defining the transformation via a continuous-time dynamic $\frac{\partial y(t)}{\partial t} = f(y(t), t)$, where $f$ is a neural network with an unrestricted architecture. The continuous normalizing flow (CNF) model for a variable $x$ with prior distribution $P(y)$ at the start time $t_0$ can then be written as:

$$x = y(t_1) = y(t_0) + \int_{t_0}^{t_1} f(y(t), t)\, dt, \qquad \log P(x) = \log P(y(t_0)) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial f}{\partial y(t)}\right) dt, \tag{2}$$

where $y(t_0) \sim P(y)$, and $y(t_0)$ can be computed using the inverse flow $y(t_0) = x + \int_{t_1}^{t_0} f(y(t), t)\, dt$. A black-box ordinary differential equation (ODE) solver can be applied to estimate the outputs and the input gradients of a continuous normalizing flow [16, 5].
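The following is a minimal sketch (not the authors' released implementation) of how the CNF density in Equation (2) can be evaluated for 3D points, using a fixed-step Euler solve backwards from the data to the prior and an exact Jacobian trace; the network widths and step count are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn as nn

class Dynamics(nn.Module):
    """Unrestricted network f(y, t) defining the continuous-time dynamics dy/dt."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 3),
        )

    def forward(self, y, t):
        t_col = torch.full_like(y[:, :1], t)            # broadcast the scalar time
        return self.net(torch.cat([y, t_col], dim=1))

def jacobian_trace(dy, y):
    """Exact Tr(df/dy): one autograd call per dimension (cheap, since dim = 3)."""
    trace = 0.0
    for i in range(dy.shape[1]):
        trace = trace + torch.autograd.grad(
            dy[:, i].sum(), y, create_graph=True, retain_graph=True)[0][:, i]
    return trace

def log_prob(x, dynamics, steps=40, t0=0.0, t1=1.0):
    """Approximate log P(x) in Eq. (2) by integrating from t1 (data) back to t0 (prior)."""
    dt = (t1 - t0) / steps
    y = x.detach().clone().requires_grad_(True)
    int_trace = torch.zeros(x.shape[0], device=x.device)
    for k in range(steps):
        t = t1 - k * dt
        dy = dynamics(y, t)
        int_trace = int_trace + dt * jacobian_trace(dy, y)   # accumulates int Tr(df/dy) dt
        y = y - dt * dy                                       # Euler step towards t0
    prior_logp = (-0.5 * y.pow(2) - 0.5 * math.log(2 * math.pi)).sum(dim=1)
    return prior_logp - int_trace                             # log P(y(t0)) - int Tr dt

# Usage: nll = -log_prob(torch.randn(2048, 3), Dynamics()).mean()
```

In practice, FFJORD [16] replaces the fixed-step solve above with a black-box adaptive ODE solver and an unbiased Hutchinson trace estimator, but the fixed-step version keeps the mechanics of Equation (2) explicit.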

4.2 Variational auto-encoder

Suppose we have a random variable $x$ for which we are building generative models. The variational auto-encoder (VAE) is a framework that allows one to learn the distribution of $x$ from a dataset of its observations [27, 41]. The VAE performs generation via a latent variable $z$ with a prior distribution $P_\psi(z)$, and a decoder $P_\theta(x|z)$ which captures the (hopefully simpler) distribution of $x$ given $z$. At test time, the latent variable is sampled from the prior and then the decoder is used to sample $x$ conditioned on $z$.

The VAE is trained on a set of observations of $x$. During training, it additionally learns an inference model (or encoder) $Q_\phi(z|x)$. The encoder and decoder are jointly trained to maximize a lower bound on the log-likelihood of the observed variable:

$$\log P_\theta(x) \geq \mathbb{E}_{Q_\phi(z|x)}\left[\log P_\theta(x|z)\right] - D_{\mathrm{KL}}\left(Q_\phi(z|x) \,\|\, P_\psi(z)\right) \triangleq \mathcal{L}(x; \phi, \psi, \theta), \tag{3}$$

which is also called the evidence lower bound (ELBO). One can interpret the ELBO as the sum of the negative reconstruction error (the first term) and a latent space regularizer (the second term). In practice, $Q_\phi(z|x)$ is usually modeled as a diagonal Gaussian whose mean $\mu_\phi(x)$ and standard deviation $\sigma_\phi(x)$ are predicted by a neural network with parameters $\phi$. To efficiently optimize the ELBO, sampling from $Q_\phi(z|x)$ is done by reparameterizing $z$ as $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.

5 Model

We now have the paraphernalia needed to define our generative model of point clouds. Using the terminology of the VAE, we need three modules: an encoder $Q_\phi(z|X)$ that encodes a point cloud $X$ into a shape representation $z$, a prior $P_\psi(z)$ over shape representations, and a decoder $P_\theta(X|z)$ that models the distribution of points given the shape representation. We use a simple permutation-invariant encoder to predict $Q_\phi(z|X)$, following the architecture of Achlioptas et al. [1]. We use continuous normalizing flows for both the prior $P_\psi(z)$ and the generator $P_\theta(X|z)$, which are described below.

5.1 Flow-based point generation from shape representations

We first decompose the reconstruction log-likelihood of a point set into the sum of the log-likelihoods of its individual points:

$$\log P_\theta(X \mid z) = \sum_{x \in X} \log P_\theta(x \mid z). \tag{4}$$

We propose to model $P_\theta(x \mid z)$ using a conditional extension of the CNF. Specifically, a point $x$ in the point set is the result of transforming some point $y(t_0)$ sampled from the 3D Gaussian prior $P(y) = \mathcal{N}(0, I)$ using a CNF conditioned on the shape representation $z$:

$$x = G_\theta(y(t_0); z) \triangleq y(t_0) + \int_{t_0}^{t_1} g_\theta(y(t), t, z)\, dt, \qquad y(t_0) \sim P(y),$$

where $g_\theta$ is the continuous-time dynamics of the flow $G_\theta$ conditioned on $z$. Note that the inverse of $G_\theta$ is given by $G_\theta^{-1}(x; z) = x + \int_{t_1}^{t_0} g_\theta(y(t), t, z)\, dt$ with $y(t_1) = x$. The reconstruction likelihood of $x$ given $z$ follows Equation (2):

$$\log P_\theta(x \mid z) = \log P\!\left(G_\theta^{-1}(x; z)\right) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial g_\theta}{\partial y(t)}\right) dt. \tag{5}$$

Note that $\log P(G_\theta^{-1}(x; z))$ can be computed in closed form with the Gaussian prior.

Figure 1: Model architecture. (a) At training time, the encoder $Q_\phi(z|X)$ infers a posterior over shape representations given an input point cloud $X$, and samples a shape representation $z$ from it. We then compute the probability of $z$ under the prior distribution $P_\psi(z)$ through an inverse CNF $F_\psi^{-1}$, and compute the reconstruction likelihood of $X$ through another inverse CNF $G_\theta^{-1}$ conditioned on $z$. The model is trained end-to-end to maximize the evidence lower bound (ELBO), which is the sum of the prior term, the reconstruction likelihood, and the entropy of the posterior $Q_\phi(z|X)$. (b) At test time, we sample a shape representation $z$ by sampling $w$ from a Gaussian prior and transforming it with $F_\psi$. To sample points from the shape represented by $z$, we first sample points from the 3D Gaussian prior and then move them according to the CNF $G_\theta$ conditioned on $z$.

5.2 Flow-based prior over shapes

Although it is possible to use a simple Gaussian prior over shape representations, it has been shown that a restricted prior tends to limit the performance of VAEs [6]. To alleviate this problem, we use another CNF to parametrize a learnable prior. Formally, we rewrite the KL divergence term in Equation 3 as

$$D_{\mathrm{KL}}\left(Q_\phi(z|X) \,\|\, P_\psi(z)\right) = -\mathbb{E}_{Q_\phi(z|X)}\left[\log P_\psi(z)\right] - H\left[Q_\phi(z|X)\right], \tag{6}$$

where $H$ is the entropy and $P_\psi(z)$ is the prior distribution with learnable parameters $\psi$, obtained by transforming a simple Gaussian $P(w) = \mathcal{N}(0, I)$ with a CNF:

$$z = F_\psi(w) \triangleq w(t_0) + \int_{t_0}^{t_1} f_\psi(w(t), t)\, dt, \qquad w(t_0) = w \sim P(w),$$

where $f_\psi$ is the continuous-time dynamics of the flow $F_\psi$. Similarly to the conditional CNF described above, the inverse of $F_\psi$ is given by $F_\psi^{-1}(z) = z + \int_{t_1}^{t_0} f_\psi(w(t), t)\, dt$ with $w(t_1) = z$. The log probability of the prior distribution can be computed by:

$$\log P_\psi(z) = \log P\!\left(F_\psi^{-1}(z)\right) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial f_\psi}{\partial w(t)}\right) dt. \tag{7}$$

5.3 Final training objective

Plugging Equations 4, 5, 6, and 7 into Equation 3, the ELBO of a point set $X$ can finally be written as

$$\mathcal{L}(X; \phi, \psi, \theta) = \mathbb{E}_{Q_\phi(z|X)}\left[\log P_\psi(z) + \log P_\theta(X|z)\right] + H\left[Q_\phi(z|X)\right]. \tag{8}$$

Our model is trained end-to-end by maximizing the ELBO of all point sets in the dataset. We can interpret this objective as the sum of three parts:

  1. Prior: $\mathbb{E}_{Q_\phi(z|X)}[\log P_\psi(z)]$ encourages the encoded shape representation to have a high probability under the prior, which is modeled by a CNF as described in Section 5.2. We use the reparameterization trick [27] to enable a differentiable Monte Carlo estimate of the expectation, evaluating $\log P_\psi$ at $z = \mu + \epsilon \odot \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the isotropic Gaussian posterior $Q_\phi(z|X)$ and $\epsilon$ is sampled from the standard Gaussian distribution $\mathcal{N}(0, I)$.

  2. Reconstruction likelihood: $\mathbb{E}_{Q_\phi(z|X)}[\log P_\theta(X|z)]$ is the reconstruction log-likelihood of the input point set, computed as described in Section 5.1. The expectation is also estimated using Monte Carlo sampling.

  3. Posterior entropy: $H[Q_\phi(z|X)]$ is the entropy of the approximate posterior, which has a closed form for the diagonal Gaussian: $H = \frac{D}{2}(1 + \log(2\pi)) + \sum_{d} \log \sigma_d$, where $D$ is the dimensionality of $z$. A minimal sketch of how the three terms are combined is shown after this list.
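As a minimal illustration of how the three terms combine (a sketch under assumed interfaces, not the released training code), the snippet below estimates the ELBO with a single reparameterized sample; `encoder`, `prior_log_prob` (the prior CNF density), and `recon_log_prob` (the point CNF density) are placeholder callables.

```python
import math
import torch

def elbo(points, encoder, prior_log_prob, recon_log_prob):
    """Single-sample Monte Carlo estimate of the ELBO for one point cloud of shape (M, 3)."""
    mu, log_sigma = encoder(points)                      # parameters of Q(z|X)
    eps = torch.randn_like(mu)
    z = mu + log_sigma.exp() * eps                       # reparameterization trick
    prior_term = prior_log_prob(z)                       # log P_psi(z), via the inverse prior CNF
    recon_term = recon_log_prob(points, z).sum()         # sum over points of log P_theta(x | z)
    entropy = (log_sigma + 0.5 * math.log(2 * math.pi * math.e)).sum()   # H[Q(z|X)]
    return prior_term + recon_term + entropy             # maximize this quantity
```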

All training details (e.g., hyper-parameters, model architectures) are included in Section B of the appendix.

5.4 Sampling

To sample a shape representation, we first draw $w \sim \mathcal{N}(0, I)$ and then pass it through the prior CNF to get $z = F_\psi(w)$. To generate a point given the shape representation $z$, we first sample a point $y$ from the 3D Gaussian prior, then pass $y$ through the point CNF conditioned on $z$ to produce a point on the shape: $x = G_\theta(y; z)$. To sample a point cloud with $M$ points, we simply repeat this second step $M$ times. Combining these two steps allows us to sample a point cloud with $M$ points from our model.
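A minimal sketch of this two-step sampling procedure is shown below; `prior_flow` and `point_flow` stand in for the forward passes of the two CNFs, and the latent dimensionality is an illustrative assumption.

```python
import torch

def sample_point_cloud(prior_flow, point_flow, num_points, latent_dim=128):
    w = torch.randn(1, latent_dim)        # w ~ N(0, I)
    z = prior_flow(w)                     # shape representation z = F(w)
    y = torch.randn(num_points, 3)        # one 3D Gaussian sample per output point
    return point_flow(y, z)               # move each sample along the flow conditioned on z
```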

6 Experiments

In this section, we first introduce existing metrics for point cloud generation, discuss their limitations, and introduce a new metric that overcomes these limitations. We then compare the proposed method with previous state-of-the-art generative models of point clouds, using both previous metrics and the proposed one. We additionally evaluate the reconstruction and representation learning ability of the auto-encoder part of our model.

6.1 Evaluation metrics

Following prior work, we use the Chamfer distance (CD) and the earth mover's distance (EMD) to measure the similarity between point clouds. Formally, they are defined as follows:

$$\mathrm{CD}(X, Y) = \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert_2^2 + \sum_{y \in Y} \min_{x \in X} \lVert x - y \rVert_2^2,$$

$$\mathrm{EMD}(X, Y) = \min_{\phi : X \to Y} \sum_{x \in X} \lVert x - \phi(x) \rVert_2,$$

where $X$ and $Y$ are two point clouds with the same number of points and $\phi$ is a bijection between them. Note that most previous methods use either CD or EMD in their training objectives, and thus tend to be favored when evaluated under the same metric. Our method, however, does not use CD or EMD during training.
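Below is a small reference sketch of the two set distances (illustrative NumPy/SciPy code, not the CUDA kernels typically used in practice); the exact EMD here solves the assignment problem with SciPy and is only feasible for small point sets.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chamfer_distance(x, y):
    """x, y: (M, 3) arrays; sum of squared nearest-neighbor distances in both directions."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (M, M) pairwise squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def earth_mover_distance(x, y):
    """Cost of the optimal bijection phi minimizing the total point-to-point distance."""
    d = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(-1))
    rows, cols = linear_sum_assignment(d)
    return d[rows, cols].sum()
```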

Table 1: Generation results. ↑: the higher the better; ↓: the lower the better. The best scores are highlighted in bold. Scores of the real shapes that are worse than some of the generated shapes are marked in gray. MMD-CD, MMD-EMD, and JSD scores are rescaled for readability.

Let $S_g$ be the set of generated point clouds and $S_r$ be the set of reference point clouds, with $|S_r| = |S_g|$. To evaluate generative models, we first consider the three metrics introduced by Achlioptas et al. [1]:

  • Jensen-Shannon divergence (JSD) is computed between the marginal point distributions, $\mathrm{JSD}(P_g, P_r)$, where $P_r$ and $P_g$ are the marginal distributions of points in the reference and generated sets respectively, approximated by discretizing the space into voxels and assigning each point to one of them. However, JSD only considers the marginal point distributions and not the distribution of individual shapes. A model that always outputs the "average shape" can obtain a perfect JSD score without learning any meaningful shape distribution.

  • Coverage (COV) measures the fraction of point clouds in the reference set that are matched to at least one point cloud in the generated set. For each point cloud in the generated set, its nearest neighbor in the reference set is marked as a match:

    $$\mathrm{COV}(S_g, S_r) = \frac{\left|\left\{\arg\min_{Y \in S_r} D(X, Y) \mid X \in S_g\right\}\right|}{|S_r|},$$

    where $D(\cdot, \cdot)$ can be either CD or EMD. While coverage is able to detect mode collapse, it does not evaluate the quality of generated point clouds. In fact, it is possible to achieve a perfect coverage score even if the distances between generated and reference point clouds are arbitrarily large.

  • Minimum matching distance (MMD) is proposed to complement coverage as a metric that measures quality. For each point cloud in the reference set, the distance to its nearest neighbor in the generated set is computed and averaged:

    $$\mathrm{MMD}(S_g, S_r) = \frac{1}{|S_r|} \sum_{Y \in S_r} \min_{X \in S_g} D(X, Y),$$

    where $D(\cdot, \cdot)$ can be either CD or EMD. However, MMD is actually very insensitive to low-quality point clouds in $S_g$, since they are unlikely to be matched to real point clouds in $S_r$. In the extreme case, one can imagine that $S_g$ consists of mostly very low-quality point clouds with one additional point cloud in each mode of $S_r$, yet has a reasonably good MMD score.

As discussed above, all existing metrics have their limitations. As will be shown later, we also empirically find all these metrics sometimes give generated point clouds even better scores than real point clouds, further casting doubt on whether they can ensure a fair model comparison. We therefore introduce another metric that we believe is better suited for evaluating generative models of point clouds:

  • 1-nearest neighbor accuracy (1-NNA) is proposed by Lopez-Paz and Oquab [32] for two-sample tests, assessing whether two distributions are identical. It has also been explored as a metric for evaluating GANs [50]. Let $S_{-X} = S_r \cup S_g - \{X\}$ and let $N_X$ be the nearest neighbor of $X$ in $S_{-X}$. 1-NNA is the leave-one-out accuracy of the 1-NN classifier:

    $$\text{1-NNA}(S_g, S_r) = \frac{\sum_{X \in S_g} \mathbb{1}\left[N_X \in S_g\right] + \sum_{Y \in S_r} \mathbb{1}\left[N_Y \in S_r\right]}{|S_g| + |S_r|},$$

    where $\mathbb{1}[\cdot]$ is the indicator function. For each sample, the 1-NN classifier classifies it as coming from $S_r$ or $S_g$ according to the label of its nearest sample. If $S_g$ and $S_r$ are sampled from the same distribution, the accuracy of such a classifier should converge to 50% given a sufficient number of samples. The closer the accuracy is to 50%, the more similar $S_g$ and $S_r$ are, and therefore the better the model is at learning the target distribution. In our setting, the nearest neighbor can be computed using either CD or EMD. Unlike JSD, 1-NNA considers the similarity between shape distributions rather than between marginal point distributions. Unlike COV and MMD, 1-NNA directly measures distributional similarity and takes both diversity and quality into account. A minimal sketch of this metric follows the list.
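The sketch below computes 1-NNA from a pre-computed pairwise distance matrix (under CD or EMD) between all generated and reference point clouds; it is an illustrative re-implementation, not the evaluation code released with the paper.

```python
import numpy as np

def one_nn_accuracy(dist, num_generated):
    """dist: (K, K) symmetric distance matrix over S_g followed by S_r;
    num_generated: |S_g|. Returns the leave-one-out 1-NN classification accuracy."""
    labels = np.arange(dist.shape[0]) < num_generated   # True for generated samples
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)                          # leave-one-out: exclude the sample itself
    nearest = d.argmin(axis=1)
    correct = labels == labels[nearest]                  # 1-NN predicts its neighbor's label
    return correct.mean()                                # values near 0.5 indicate similar distributions
```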

6.2 Generation

We compare our method with three existing generative models for point clouds: raw-GAN [1], latent-GAN [1], and PC-GAN [30], using their official implementations that are either publicly available or obtained by contacting the authors. We train each model using point clouds from one of three categories in the ShapeNet [3] dataset: airplane, chair, and car. The point clouds are obtained by sampling points uniformly from the mesh surface. All points in each category are normalized to be zero-mean per axis and unit-variance globally. Following prior convention [1], we use 2048 points for each shape during both training and testing, although our model is able to sample an arbitrary number of points. We additionally report the performance of point clouds sampled from the training set, which is regarded as an upper bound since they come from the target distribution.

In Table 1, we report the performance of different models, as well as their number of parameters in total (full) or in the generative pathways (gen). We first note that all the previous metrics (JSD, MMD, and COV) sometimes assign a better score to point clouds generated by models than to those from the training set (marked in gray). The 1-NNA metric does not seem to have this problem and always gives a better score to shapes from the training set. Our model outperforms all baselines across all three categories according to 1-NNA and also obtains the best score in most cases as evaluated by the other metrics. Besides, our model has the fewest parameters among the compared models. In Section C of the appendix, we perform additional ablation studies to show the effectiveness of different components of our model. Figure 2 shows examples of novel point clouds generated by our model. Figure 3 shows examples of point clouds reconstructed from given inputs.

Figure 2: Examples of point clouds generated by our model. From top to bottom: airplane, chair, and car.
Figure 3: Examples of point clouds reconstructed from inputs. From top to bottom: airplane, chair, and car. On each side of the figure we show the input point cloud on the left and the reconstructed point cloud on the right.
Method MN40 (%) MN10 (%)
SPH[24] 68.2 79.8
LFD[4] 75.5 79.9
T-L Network[14] 74.4 -
VConv-DAE[42] 75.5 80.5
3D-GAN[48] 83.3 91.0
l-GAN (EMD)[1] 84.0 95.4
l-GAN (CD)[1] 84.5 95.4
PointGrow[44] 85.7 -
MRTNet-VAE[13] 86.4 -
FoldingNet[51] 88.4 94.4
l-GAN (CD)[1] 87.0 92.8
l-GAN (EMD)[1] 86.7 92.2
PointFlow (ours) 86.8 93.7
  • The last two l-GAN rows are obtained by running the official code of l-GAN on our pre-processed dataset using the same encoder architecture as our model.

Table 2: Unsupervised feature learning. Models are first trained on ShapeNet to learn shape representations, which are then evaluated on ModelNet40 (MN40) and ModelNet10 (MN10) by comparing the accuracy of off-the-shelf SVMs trained using the learned representations.

6.3 Auto-encoding

We further quantitatively compare the reconstruction ability of our flow-based auto-encoder with l-GAN [1] and AtlasNet [17]. Following the setting of AtlasNet, the state of the art in this task, we train our auto-encoder on all shapes in the ShapeNet dataset. The auto-encoder is trained with the reconstruction likelihood objective only. At test time, we sample 4096 points per shape and split them into an input set and a reference set, each consisting of 2048 points. We then compute the distance (CD or EMD) between the reconstructed input set and the reference set. (We use a separate reference set because we expect the auto-encoder to learn the point distribution; exactly reproducing the input points is acceptable behavior, but should not be given a higher score than randomly sampling points from the underlying point distribution.) Although our model is not directly trained with EMD, it obtains the best EMD score, even higher than l-GAN trained with EMD and AtlasNet, which has many times more parameters.

Table 3: Auto-encoding performance evaluated by CD and EMD. AtlasNet is trained with CD, and l-GAN is trained with either CD or EMD. Our method is trained with neither. CD and EMD scores are rescaled for readability.

6.4 Unsupervised representation learning

We finally evaluate the representation learning ability of our auto-encoder. Specifically, we extract the latent representations of our auto-encoder trained on the full ShapeNet dataset and train a linear SVM classifier on top of them on ModelNet10 or ModelNet40 [49]. Only for this task, we normalize each individual point cloud to be zero-mean per axis and unit-variance globally, following prior work [55, 1]. We also apply random rotations along the gravity axis when training the auto-encoder.
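A hedged sketch of this evaluation step is given below; `train_feats`/`test_feats` are assumed to be the frozen shape representations extracted by our encoder, and the SVM hyper-parameter is an illustrative default rather than the value used in the paper.

```python
from sklearn.svm import LinearSVC

def evaluate_features(train_feats, train_labels, test_feats, test_labels):
    """Train an off-the-shelf linear SVM on frozen shape representations and report accuracy."""
    clf = LinearSVC(C=1.0)          # illustrative default regularization strength
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```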

A problem with this task is that different authors have been using different encoder architectures with different numbers of parameters, making it hard to perform an apples-to-apples comparison. In addition, different authors may use different pre-processing protocols (as also noted by Yang et al. [51]), which could also affect the numbers.

In Table 2, we still show the numbers reported by previous papers, but also include a comparison with l-GAN [1] trained using the same encoder architecture and the exact same data as our model. On ModelNet10, the accuracy of our model is 1.5% and 0.9% higher than l-GAN (EMD) and l-GAN (CD), respectively. On ModelNet40, the performance of the three models is very close.

7 Conclusion and future works

In this paper, we propose PointFlow, a generative model for point clouds consisting of two levels of continuous normalizing flows trained with variational inference. Future work includes applications to other tasks such as point cloud reconstruction from a single image.

8 Acknowledgement

This work was supported in part by a research gift from Magic Leap. Xun Huang was supported by NVIDIA Graduate Fellowship.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. In ICML , 2018.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML , 2017.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • [4] D.-Y. Chen, X.-P. Tian, E. Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. Comput. Graph. Forum, 22:223–232, 2003.
  • [5] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In NeurIPS , 2018.
  • [6] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In ICLR, 2016.
  • [7] I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and gan-based training of real nvps. arXiv preprint arXiv:1705.05263 , 2017.
  • [8] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. CoRR , abs/1410.8516, 2014.
  • [9] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. In ICLR , 2017.
  • [10] O. Dovrat, I. Lang, and S. Avidan. Learning to sample. arXiv preprint arXiv:1812.01659 , 2018.
  • [11] H. A. Edwards and A. J. Storkey. Towards a neural statistician. In ICLR , 2017.
  • [12] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR , 2017.
  • [13] M. Gadelha, R. Wang, and S. Maji. Multiresolution tree networks for 3d point cloud processing. In ECCV , 2018.
  • [14] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV , 2016.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS , 2014.
  • [16] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In ICLR , 2019.
  • [17] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR , 2018.
  • [18] A. Grover, M. Dhar, and S. Ermon. Flow-gan: Combining maximum likelihood and adversarial learning in generative models. In AAAI , 2018.
  • [19] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NeurIPS , 2017.
  • [20] C.-W. Huang, D. Krueger, A. Lacoste, and A. C. Courville. Neural autoregressive flows. In ICML , 2018.
  • [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015.
  • [22] L. Jiang, S. Shi, X. Qi, and J. Jia. Gal: Geometric adversarial loss for single-view 3d-object reconstruction. In ECCV , 2018.
  • [23] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR , 2019.
  • [24] M. M. Kazhdan, T. A. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on Geometry Processing , 2003.
  • [25] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS , 2018.
  • [26] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. In NeurIPS , 2016.
  • [27] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR , 2014.
  • [28] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434 , 2019.
  • [29] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In WACV , 2018.
  • [30] C.-L. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov. Point cloud gan. arXiv preprint arXiv:1810.05795 , 2018.
  • [31] K. Li, T. Pham, H. Zhan, and I. D. Reid. Efficient dense point cloud object reconstruction using deformation vector fields. In ECCV , 2018.
  • [32] D. Lopez-Paz and M. Oquab. Revisiting classifier two-sample tests. In ICLR , 2017.
  • [33] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 , 2015.
  • [35] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • [36] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In NeurIPS , 2017.
  • [37] R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. CoRR , abs/1811.00002, 2018.
  • [38] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR , 2017.
  • [39] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS , 2017.
  • [40] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML , 2015.
  • [41] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [42] A. Sharma, O. Grau, and M. Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In ECCV Workshops , 2016.
  • [43] M. Shoef, S. Fogel, and D. Cohen-Or. Pointwise: An unsupervised point-wise feature learning network. arXiv preprint arXiv:1901.04544 , 2019.
  • [44] Y. Sun, Y. Wang, Z. Liu, J. E. Siegel, and S. E. Sarma. Pointgrow: Autoregressively learned point cloud generation with self-attention. arXiv preprint arXiv:1810.05591 , 2018.
  • [45] V. Usenko, J. Engel, J. Stückler, and D. Cremers. Reconstructing street-scenes in real-time from a driving car. In 3DV , 2015.
  • [46] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. In UAI , 2018.
  • [47] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In NeurIPS , 2016.
  • [48] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NeurIPS , 2016.
  • [49] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR , 2015.
  • [50] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger. An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755 , 2018.
  • [51] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In CVPR , 2018.
  • [52] W. Yifan, S. Wu, H. Huang, D. Cohen-Or, and O. Sorkine-Hornung. Patch-based progressive 3d point set upsampling. arXiv preprint arXiv:1811.11286 , 2018.
  • [53] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng. Ec-net: an edge-aware point set consolidation network. In ECCV , 2018.
  • [54] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng. Pu-net: Point cloud upsampling network. In CVPR , 2018.
  • [55] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In NeurIPS , 2017.
  • [56] M. Zamorski, M. Zieba, R. Nowak, W. Stokowiec, and T. Trzciński. Adversarial autoencoders for generating 3d point clouds. arXiv preprint arXiv:1811.07605 , 2018.

Appendix A Overview

In the appendix, we first describe the detailed hyper-parameters and model architectures for our experiments in Section B. We then compare our model with additional baselines to understand the effect of different model components in Section C. Limitations and typical failure cases are discussed in Section D. Finally, additional visualizations of the latent space t-SNE, interpolations, and flow transformations are presented in Section E, Section F, and Section G, respectively.

Appendix B Training details

In this section, we provide details about our network architectures and training hyper-parameters. We will release the code to reproduce our experiments.

Encoder. The architecture of our encoder follows that of Achlioptas et al. [1]. Specifically, we first use 1D convolutions to process each point independently and then use max pooling to create a global feature, as done in PointNet [38]. Such a feature is invariant to the permutation of points due to the max pooling. Finally, we apply a three-layer MLP to convert the permutation-invariant feature into the final shape representation. For the unsupervised representation learning experiment, we set the dimension of this representation following prior convention; a different dimension is used for all other experiments.
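A minimal sketch of such a permutation-invariant encoder is given below; the channel widths, MLP hidden sizes, and output dimension are illustrative assumptions, since the exact values are not reproduced here.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style encoder: per-point 1D convolutions, max pooling, then an MLP
    predicting the mean and log standard deviation of the Gaussian posterior."""
    def __init__(self, zdim=128, channels=(128, 128, 256, 512)):   # assumed widths
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=1), nn.ReLU()]
            in_ch = out_ch
        self.point_net = nn.Sequential(*layers)
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2 * zdim),            # concatenated mu and log sigma
        )

    def forward(self, points):                    # points: (B, M, 3)
        feat = self.point_net(points.transpose(1, 2))   # (B, C, M) per-point features
        feat = feat.max(dim=2).values                   # max pooling: permutation invariant
        mu, log_sigma = self.mlp(feat).chunk(2, dim=1)
        return mu, log_sigma
```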

CNF prior. The CNF prior models the distribution $P_\psi(z)$. We follow FFJORD [16]'s released code and use three concatsquash layers to model the dynamics $f_\psi$. A concatsquash layer is defined as:

$$\mathrm{CS}(x, t) = (W_1 x + b_1) \odot \sigma(W_2 t + b_2) + (W_3 t + b_3),$$

where $W_1$, $W_2$, $W_3$, $b_1$, $b_2$, and $b_3$ are all trainable parameters and $\sigma$ is the sigmoid function. The dynamics $f_\psi$ uses three concatsquash layers, with Tanh as the non-linearity between layers.

We use a Moving Batch Normalization layer to learn the scale of each dimension before and after the CNF, following FFJORD's released code [16]. Specifically, Moving Batch Normalization is defined as

$$y = \gamma \frac{x - \hat{\mu}}{\hat{\sigma}} + \beta,$$

where $\gamma$ and $\beta$ are trainable parameters. Different from batch normalization proposed by Ioffe and Szegedy [21], $\hat{\mu}$ and $\hat{\sigma}$ are running averages of the batch mean and standard deviation. MovingBatchNorm is invertible: $x = \hat{\sigma} \frac{y - \beta}{\gamma} + \hat{\mu}$. Its log determinant is given as:

$$\log \left| \det \frac{\partial y}{\partial x} \right| = \sum_{i} \left( \log |\gamma_i| - \log \hat{\sigma}_i \right).$$

CNF decoder. The CNF decoder models the reconstruction likelihood $P_\theta(X|z)$. We extend the concatsquash layer to condition on the latent vector $z$:

$$\mathrm{CCS}(x, t, z) = (W_1 x + b_1) \odot \sigma(W_2 [t, z] + b_2) + (W_3 [t, z] + b_3),$$

where $[t, z]$ denotes the concatenation of the time value and the latent vector, and $W_1$, $W_2$, $W_3$, $b_1$, $b_2$, and $b_3$ are all learnable parameters. The CNF decoder uses four conditional concatsquash layers to model the dynamics $g_\theta$. The non-linearity between layers is Tanh. Similar to the CNF prior model, we also add a Moving Batch Normalization layer before and after the CNF. In this case, all 3D points (from different shapes) in a batch are used to compute the batch statistics.
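A sketch of the conditional concatsquash layer described above is given below (an illustrative re-implementation following the equation, not the released code); the gating and bias hyper-networks take the concatenation of the time value and the latent vector as input.

```python
import torch
import torch.nn as nn

class ConditionalConcatSquash(nn.Module):
    """CCS(x, t, z) = (W1 x + b1) * sigmoid(W2 [t, z] + b2) + (W3 [t, z] + b3)."""
    def __init__(self, dim_in, dim_out, dim_z):
        super().__init__()
        self.layer = nn.Linear(dim_in, dim_out)      # W1, b1
        self.gate = nn.Linear(dim_z + 1, dim_out)    # W2, b2
        self.bias = nn.Linear(dim_z + 1, dim_out)    # W3, b3

    def forward(self, x, t, z):
        # x: (N, dim_in) per-point features, t: scalar time, z: (N, dim_z) latent vector
        # (already broadcast to every point of the corresponding shape).
        t_col = torch.full_like(z[:, :1], t)
        ctx = torch.cat([t_col, z], dim=1)
        return self.layer(x) * torch.sigmoid(self.gate(ctx)) + self.bias(ctx)
```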

Hyper-parameters. We use the Adam optimizer. The learning rate decays linearly starting at the 2000th epoch, and we do not use any weight decay. We also learn the integration time during training by back-propagation [5].

Appendix C Additional comparisons

Table 4: Ablation studies. ↑: the higher the better; ↓: the lower the better. The best scores are highlighted in bold. MMD-CD, MMD-EMD, and JSD scores are rescaled for readability.

In this section, we compare our model to more baselines to show the effectiveness of the model design. The first baseline is the Neural Statistician (NS) [11], a state-of-the-art generative model for sets. We modify its official code for generating 2D spatial coordinates of MNIST digits to make it work with 3D point cloud coordinates. We use the same encoder architecture as our model, and use the VAE decoder provided by the authors with the input dimension changed from 2 to 3. It differs from our model mainly in 1) using VAEs instead of CNFs to model the reconstruction likelihood, and 2) using a simple Gaussian prior instead of a flow-based one. The second baseline is VAECNF, where we use the CNF to model the reconstruction likelihood but not the prior. Specifically, the VAECNF optimizes the ELBO in the following form:

$$\mathcal{L}(X; \phi, \theta) = \mathbb{E}_{Q_\phi(z|X)}\left[\log P_\theta(X \mid z)\right] - D_{\mathrm{KL}}\left(Q_\phi(z|X) \,\|\, \mathcal{N}(0, I)\right),$$

where $\mathcal{N}(0, I)$ is a standard Gaussian and $D_{\mathrm{KL}}$ is the KL-divergence. As another baseline, we follow l-GAN [1] and train a WGAN [19] in the latent space of our pretrained auto-encoder. Both the discriminator and the generator are MLPs with batch normalization between layers. The generator has three layers with hidden dimension 256. The discriminator has three layers with hidden dimension 512.

The results are presented in Table4. Neural Statistician[11] is able to learn the marginal point distribution but fails to learn the correct shape distribution, as it obtains the best marginal JSD but very poor scores according to metrics that measure similarities between shape distributions. Also, using a flexible prior parameterized by a CNF (PointFlow) is better than using a simple Gaussian prior (VAECNF) or a prior learned with a latent GAN (WGAN-CNF) that requires two-stage training.

Appendix D Limitation and failure cases

In this section, we discuss the limitations of our model and present visualizations of difficult cases where our model fails. As mentioned in FFJORD [16], each integration requires evaluating the neural networks modeling the dynamics multiple times. The number of function evaluations tends to increase as training proceeds, since the dynamics become more complex and more function evaluations are needed to achieve the same numerical precision. This issue limits our model size and makes convergence slow. Grathwohl et al. [16] indicate that using regularization such as weight decay could alleviate this issue, but we empirically find that such regularization tends to hurt performance. Future advances in invertible models like CNFs might help improve this issue. Typical failure cases appear when reconstructing or generating rare shapes or shapes with many thin structures, as presented in Figure 4.

Figure 4: Difficult cases for our model. Rare shapes or shapes that contain many thin structures are usually hard to reconstruct in high quality.

Appendix E Latent space visualizations

We provide a visualization of the sampled latent vectors in Figure 5. We sample a set of latent vectors and run t-SNE [33] to visualize them in 2D. Shapes with similar styles are close in the latent space.

Figure 5: Visualization of latent space.

Appendix F Interpolation

In this section, we present interpolations between two different shapes using our model. For two shapes, we first compute the means of their posterior distributions using the encoder and use them as the latent representations of the two shapes. We then use the inverse prior flow to transform both latent representations back to the prior space. We apply spherical interpolation between the two resulting vectors to obtain a series of intermediate vectors in the prior space. For each intermediate vector, we use the prior CNF and the CNF decoder to generate the corresponding shape. Figure 6 contains examples of the interpolation, and a minimal sketch of the spherical interpolation step is shown below.
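The spherical interpolation step can be sketched as follows (a standard slerp between the two prior-space vectors; the released code may implement this differently).

```python
import torch

def slerp(w_a, w_b, num_steps):
    """Spherical interpolation between two prior-space vectors (assumed not (anti)parallel)."""
    a, b = w_a / w_a.norm(), w_b / w_b.norm()
    omega = torch.acos((a * b).sum().clamp(-1.0, 1.0))   # angle between the two directions
    ts = torch.linspace(0.0, 1.0, num_steps)
    return torch.stack([
        (torch.sin((1 - t) * omega) * w_a + torch.sin(t * omega) * w_b) / torch.sin(omega)
        for t in ts
    ])
```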

Appendix G More flow transformation

Figure 7 presents more examples of flow transformations from the Gaussian prior to different shapes.

Figure 6: Feature space interpolation. The left-most and the right-most shapes are sampled from scratch. The shapes in between are generated by interpolating the two shapes in the prior space.
Figure 7: Additional visualizations on the process of transforming prior to point cloud.
