VMD is free to download from the UIUC Theoretical and Computational Biophysics Group website. I installed Version 1.9.3, as recommended. The installation is standard.
VMD supports many file types. Here we focus on .cube files, which are used by Gaussian09. In PySCF, the tools module cubegen.py can produce .cube files storing the electron density, molecular orbitals, and the molecular electrostatic potential. Here is an example:
from pyscf import gto, scf
from pyscf.tools import cubegen
mol = gto.M(atom='''O 0.00000000, 0.000000, 0.000000
H 0.761561, 0.478993, 0.00000000
H -0.761561, 0.478993, 0.00000000''', basis='6-31g*')
mf = scf.RHF(mol).run()
cubegen.density(mol, 'h2o_den.cube', mf.make_rdm1()) #makes total density
cubegen.mep(mol, 'h2o_pot.cube', mf.make_rdm1()) #makes the molecular electrostatic potential
cubegen.orbital(mol, 'h2o_mo1.cube', mf.mo_coeff[:,0]) #makes the first (lowest-energy) molecular orbital
In the above example, one can change mf.mo_coeff[:, 0]
to other orbitals by changing the orbital index.
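For example, to plot the HOMO (shown further below), one can determine its index from the SCF occupation numbers. A minimal sketch, where h2o_homo.cube is just an example file name:
import numpy
nocc = numpy.count_nonzero(mf.mo_occ > 0)  # number of occupied RHF orbitals
homo_idx = nocc - 1                        # 0-based index of the HOMO
cubegen.orbital(mol, 'h2o_homo.cube', mf.mo_coeff[:, homo_idx])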
In the VMD main window, click on File → New Molecule, then click on Browse to select the cube file to load. Finally click on Load.
One can keep loading other .cube files into the current molecule by selecting the molecule and then File → Load Data into Molecule.
The default color scheme can be hard to read, so one should change the background and the representations for a better view.
One can choose a solid or gradient background color in the VMD Main window: Display → Background.
To change the color, open Graphics → Colors. In the Categories window, select Display, then in the Names window, select Background to change the solid color; BackgroundTop and BackgroundBot to change the gradient color.
By default, molecules are drawn as points; to make them prettier, one needs to change the representation. In VMD Main, open Graphics → Representations.
The CPK style can be generated with two layers of representations. One can click Create Rep to create a new layer and Delete Rep to delete a layer. To obtain the CPK style, create one layer with the VDW drawing method and one layer with Dynamic Bonds.
The .cube file with orbital information is created by cubegen.orbital. On top of the molecule created above, one can plot the charge density; the following settings are still in Graphics → Representations. An example of the HOMO of H2O is shown below.
One can save the image in PostScript (.ps) format. Open File → Render. In the "Render the current scene using:" window, choose PostScript (vector graphics). One can set the path and file name before saving.
To convert the .ps file into a PDF, install Ghostscript from the command line with sudo apt install ghostscript, then run ps2pdf file.ps file.pdf.
This paper proved that non-linear canonical transformations of fermionic basic operators at one local point form the Lie group $SU(2) \otimes SU(2) \otimes U(1) \otimes \mathbb{Z}_2$. The authors tried to enumerate all possible point-local gauge transformations acting on spin-1/2 operators, and applied their findings to identify subgroups that form local and global gauge symmetries of the Hubbard-Heisenberg model.
Comment: This paper only tried to use the symmetries to analyze a Hamiltonian. I am more interested in whether we can find a better representation to reduce the computational complexity.
For simplicity, we use basic operators to denote creation and annihilation operators.
A natural basis for the space of $4\times 4$ matrices consists of the 16 matrices $\{m(i,j)\}_{i,j=1}^4$, where $m(i,j)$ is the matrix satisfying $[m(i,j)]_{ij} = 1$, with all other elements equal to 0. An $SU(4)$ matrix $x$ is a $4\times 4$ matrix that satisfies $x^\dagger x = x x^\dagger = I$ and $\det(x) = 1$.
We choose the Fock vectors of a single-site system as the basis: $\psi_1 = |0\rangle$, $\psi_2 = |\uparrow\downarrow\rangle$, $\psi_3 = |\uparrow\rangle$, $\psi_4 = |\downarrow\rangle$. Then $m(i,j)$ can be represented by the non-linear operations onto the Fock basis:
\[m(i,j) \psi_k = \psi_i \delta_{jk}\]And the super-matrix of $m(i,j)$ can be represented by a matrix of non-linear operators
\[m = \begin{pmatrix}1 - n_\uparrow - n_\downarrow + n_{\uparrow}n_{\downarrow} & c_{\downarrow}c_{\uparrow} & (1 - n_\downarrow)c_{\uparrow} & (1 - n_\uparrow)c_{\downarrow} \\ c_{\uparrow}^\dagger c_{\downarrow}^\dagger & n_{\uparrow}n_{\downarrow} & -c_{\downarrow}^\dagger n_{\uparrow} & c_{\uparrow}^\dagger n_\downarrow\\ c_{\uparrow}^\dagger (1 - n_\downarrow) & -n_{\uparrow}c_{\downarrow} & n_\uparrow(1 - n_\downarrow) & c_{\uparrow}^\dagger c_{\downarrow} \\ c_{\downarrow}^\dagger(1 - n_\uparrow) & n_\downarrow c_{\uparrow} & c_{\downarrow}^\dagger c_{\uparrow} & n_\downarrow (1 - n_\uparrow)\end{pmatrix}\]where $c^\dagger_\sigma$ and $c_\sigma$ are the basic operators on this site. Note that each element of $m$ is a $4\times 4$ matrix, so it is effectively a $16\times 16$ matrix.
With the above settings, we can define the general form of a canonical transformation for creation and annihilation operators of a single-site system as
\[X = \sum_{ij} x_{ij} m(j,i) = \text{Tr}(xm) \equiv P(x)\]where $x$ is an SU(4) matrix, so $x_{ij}$ is a scalar, while $m(j,i)$ is a $4\times 4$ matrix, or a one-site operator.
The Lie bracket here is the anticommutation relation $\{c_\sigma^\dagger, c_{\sigma'}\} = \delta_{\sigma,\sigma'}$, where $\sigma$ is the spin degree of freedom. Let $x$ be an SU(4) matrix and $U = P(x)$; then the canonical transformation is
\[c_\sigma \rightarrow U^\dagger c_\sigma U, \quad c_\sigma^\dagger \rightarrow U^\dagger c_\sigma^\dagger U\]which preserves the Lie bracket.
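As a quick check (a standard identity, not from the paper), conjugation by a unitary $U$ preserves the anticommutator:
\[\{U^\dagger c_\sigma U, U^\dagger c_{\sigma'}^\dagger U\} = U^\dagger c_\sigma U U^\dagger c_{\sigma'}^\dagger U + U^\dagger c_{\sigma'}^\dagger U U^\dagger c_\sigma U = U^\dagger \{c_\sigma, c_{\sigma'}^\dagger\} U = \delta_{\sigma,\sigma'}\]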
Comment: In the paper, it says "since no subgroup of this SU(4) commutes with all of $\{c_\uparrow^\dagger, c_\downarrow^\dagger, c_\uparrow, c_\downarrow\}$, there is a one-to-one relation between elements of SU(4) and the point-localized canonical transformations."
The anticommutation relation for a multi-site system is
\[\{c_{\sigma}(r), c_{\sigma'}^\dagger(r')\} = \delta_{\sigma, \sigma'}\delta_{r,r'}\]This constraint further restricts $U = P(x)$: only terms that are products of an odd number of basic operators at site $r$ are permitted. If we look at the matrix $m$, we observe that only the two off-diagonal $2\times 2$ blocks contain products of an odd number of basic operators. Therefore, the SU(4) matrix $x$ should have the form
\[x = \begin{pmatrix} ue^{i\theta/2} & -v^* e^{i\theta/2} & 0 & 0 \\ ve^{i\theta/2} & u^*e^{i\theta/2} & 0 & 0 \\ 0 & 0 & ge^{i\theta/2} & -h^*e^{i\theta/2} \\ 0 & 0 & he^{i\theta/2} & g^* e^{i\theta/2}\end{pmatrix}\]where $|u|^2 + |v|^2 = 1$ and $|g|^2 + |h|^2 = 1$. Therefore, $x$ can be further reduced to $SU(2) \otimes SU(2) \otimes U(1)$, where the $U(1)$ corresponds to the phase $e^{i\theta/2}$. (Comment: there appears to be a typo in the paper.)
The entire symmetry group also contains the discrete transformations generating particle-hole exchanges separately in each spin component. However, we only need the transformations of one spin, and the other one can be derived from it.
The particle-hole exchange for the down-spin corresponds to
\[x_{\text{p-h},\downarrow} = \begin{pmatrix}0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\ \end{pmatrix}\]which leads to $P(x_{\text{p-h},\downarrow}) = c_\downarrow^\dagger + c_\downarrow$, and $P(x_{\text{p-h},\downarrow}) c_{\downarrow}^\dagger |0\rangle = |0\rangle$ (particle to hole), $P(x_{\text{p-h},\downarrow}) |0\rangle = c_{\downarrow}^\dagger|0\rangle$ (hole to particle).
The particle-hole exchange for the up-spin then can be derived by multiplying $x_{\text{p-h},\downarrow}$ by the $SU(2) \otimes SU(2) \otimes U(1)$ matrix with $u = 0, v = 1, g = 0, h = 1, \theta = 0$.
Note that $(x_{\text{p-h},\downarrow})^2 = I$, so it generates a $\mathbb{Z}_2$ group, and the full group of permissible unitary transformations for spin-1/2 fermions is
\[G \equiv SU(2) \otimes SU(2) \otimes U(1) \otimes \mathbb{Z}_2\]Up to now, we have reached the first goal: to prove that non-linear canonical transformations of fermionic basic operators at one local point form a Lie group of $SU(2) \otimes SU(2) \otimes U(1) \otimes \mathbb{Z}_2$.
Next, the authors listed some natural subgroups of $G$. Parameters not written explicitly take their default values $u = 1, v = 0, g = 1, h = 0, \theta = 0$, i.e., the identity matrix.
These are the usual spin rotations.
If there is a spin $\alpha$ on the site, then create a spin $\beta$ and gain a phase $e^{2i\theta}$; otherwise, create a spin $\beta$ with no additional phase.
The spin-up particle becomes a hole.
The goal of a generative model is to learn the true data distribution $p(\mathbf{x})$ given a sample $\mathbf{x}$ from this distribution.
One application is generating new samples from the approximated distribution.
Some popular generative models are summarized in the following table and the basic structures are shown in the figure following it.
Class | Examples | Note |
---|---|---|
Generative Adversarial Networks (GANs) | | learned in an adversarial manner, difficult to train |
Likelihood-based models | Autoregressive models, Variational Autoencoders (VAEs) | optimization |
Energy-based models | | optimization |
Score-based models | Paper | optimization |
The data $\mathbf{x}$ we observe is usually represented or generated by a latent variable $\mathbf{z}$. This paper used the example of Plato’s Allegory of the Cave, from The Republic. People in the cave can only see shadows on the wall they face, but the two-dimensional shadows are projections of three-dimensional objects.
In generative modeling, however, people try to learn lower-dimensional latent representations instead of higher-dimensional ones. My reasoning is 1) going from a low dimension to the higher dimension it belongs to requires additional information; 2) most of the data humans have is redundant and can be reduced to its important components.
We can view the data we observe $\mathbf{x}$ and the latent variable $\mathbf{z}$ as correlated vectors, with joint distribution $p(\mathbf{x}, \mathbf{z})$. We know the sample distribution of our data $\mathbf{x}$, and we want to learn the true distribution $p(\mathbf{x})$ that best describes our data $\mathbf{x}$.
First, we should notice that there are two ways of deriving $p(\mathbf{x})$ from $p(\mathbf{x}, \mathbf{z})$:
\[p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z})\,\mathrm{d}\mathbf{z} \quad \text{(integration)}, \qquad p(\mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z}\vert\mathbf{x})} \quad \text{(chain rule)}\]
Both approaches remove the information of $\mathbf{z}$.
In the integration approach, the information is removed by the direction $\mathbf{z}\rightarrow \mathbf{x}$, i.e., for each $\mathbf{z}$ there is a distribution of $\mathbf{x}$, then summing up all $\mathbf{z}$ values will give the distribution of $\mathbf{x}$.
In the chain rule approach, the information is removed in the direction $\mathbf{x}\rightarrow \mathbf{z}$. Analogous to the chain rule for derivatives, we view $\mathbf{z}$ as a function of $\mathbf{x}$ through $p(\mathbf{z}\vert\mathbf{x})$, and $p(\mathbf{x}, \mathbf{z})$ as a function of $\mathbf{x}$ and $\mathbf{z}$; then dividing $p(\mathbf{x}, \mathbf{z})$ by $p(\mathbf{z}\vert\mathbf{x})$ removes the information of $\mathbf{z}$ and results in the distribution of $\mathbf{x}$.
The ELBO is a lower bound of the evidence, which is defined as the log-likelihood of the observed data, $\log p(\mathbf{x})$; the latter is hard to compute.
In likelihood-based models, we want to maximize the likelihood $p(\mathbf{x})$. Why? Because we want the model to describe our data as well as possible: if the probability of our $\mathbf{x}$ is small, the model puts most of its weight on unseen data, and it will describe the known data $\mathbf{x}$ poorly.
We first give the expression of the ELBO, then prove that it is indeed a lower bound of $\log p(\mathbf{x})$, which is what makes the ELBO variational. The ELBO satisfies
\[\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x})}\right] \leq \log p(\mathbf{x})\]where $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ is an approximation to $p(\mathbf{z}\vert\mathbf{x})$ with tunable parameters $\phi$. Next we prove that the above equation is indeed the lower bound of the log likelihood $\log p(\mathbf{x})$ from two perspectives:
1) using Jensen’s inequality for convex and concave functions
\[\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\mathrm{d}\mathbf{z} = \log \int \frac{p(\mathbf{x}, \mathbf{z})q_{\phi}(\mathbf{z}|\mathbf{x})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\mathrm{d}\mathbf{z} \\ = \log \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x})}\left[\frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] \geq \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x})}\right]\]where the $\geq$ sign used Jensen’s inequality since $\log$ is a concave function.
2) we can actually figure out the difference between the ELBO and the true log likelihood by
\[\log p(\mathbf{x}) = \log p(\mathbf{x}) \underbrace{\int q_{\phi}(\mathbf{z}\vert\mathbf{x})\mathrm{d}\mathbf{z}}_{=1} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x})(\log p(\mathbf{x}))\mathrm{d}\mathbf{z} \\ =\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x})}[\log p(\mathbf{x})] = \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})q_{\phi}(\mathbf{z}|\mathbf{x})}\right]\\ = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}\right]\\ = \mathrm{ELBO} + D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})) \geq \mathrm{ELBO}\]where $D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}\vert\mathbf{x}) \vert\vert p(\mathbf{z}\vert\mathbf{x}))$ is the Kullback-Leibler divergence (or KL divergence or relative entropy) between $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ and $p(\mathbf{z}\vert\mathbf{x})$, and is always $\geq 0$ due to the Gibbs inequality: the cross entropy $-\sum q\log p$ is always greater than or equal to the entropy $-\sum q\log q$.
Therefore we figured out that the difference between the ELBO and the log likelihood is the KL-divergence of the conditional probability $p(\mathbf{z}\vert\mathbf{x})$ and the approximated one $q_{\phi}(\mathbf{z}\vert\mathbf{x})$.
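As a quick numerical illustration, here is a small sketch (with an arbitrary discrete joint distribution and an arbitrary $q$, not tied to any particular model) that checks $\log p(\mathbf{x}) = \mathrm{ELBO} + D_{\mathrm{KL}}$ and that the ELBO is indeed a lower bound:
import numpy as np
rng = np.random.default_rng(0)
p_xz = rng.random((3, 4)); p_xz /= p_xz.sum()                        # arbitrary joint p(x, z): 3 x-values, 4 z-values
q_zx = rng.random((3, 4)); q_zx /= q_zx.sum(axis=1, keepdims=True)   # arbitrary q(z|x)
x = 0                                                                # one observed x value
log_px = np.log(p_xz[x].sum())                                       # the evidence log p(x)
elbo = np.sum(q_zx[x] * np.log(p_xz[x] / q_zx[x]))                   # E_q[ log p(x,z)/q(z|x) ]
p_zx = p_xz[x] / p_xz[x].sum()                                       # true posterior p(z|x)
kl = np.sum(q_zx[x] * np.log(q_zx[x] / p_zx))                        # KL( q(z|x) || p(z|x) )
print(np.isclose(log_px, elbo + kl), elbo <= log_px)                 # True True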
The name variational comes from the fact that the ELBO is a lower bound of the log-likelihood, so the method is variational. The name autoencoder comes from the traditional autoencoder model, where the input data is trained to predict itself after passing through an intermediate bottleneck representation. The idea (left) and the network (right) are shown as follows. The bottleneck representation is the red layer, i.e., the latent variables, and it looks like the neck of a bottle.
The encoder is $q(\mathbf{z}\vert\mathbf{x})$, and the decoder is $p(\mathbf{x}\vert\mathbf{z})$.
Practically, $p(\mathbf{x}\vert\mathbf{z})$ is what we want to learn (together with the prior over $\mathbf{z}$ it gives us $p(\mathbf{x})$), while the true posterior $p(\mathbf{z}\vert\mathbf{x})$ is intractable, so we approximate it with $q(\mathbf{z}\vert\mathbf{x})$. Therefore, in the following, we assign parameters to them: $p(\mathbf{x})\rightarrow p_\theta(\mathbf{x})$, $q(\mathbf{z}\vert\mathbf{x})\rightarrow q_\phi(\mathbf{z}\vert\mathbf{x})$, and our goal is to optimize the parameters $\theta$ and $\phi$ in order to maximize the likelihood.
Next we want to rewrite the expression of $p_\theta(\mathbf{x})$ to explicitly contain the encoder $q_\phi(\mathbf{z}\vert\mathbf{x})$ and the decoder $p_\theta(\mathbf{x}\vert\mathbf{z})$.
Following the second expression of the ELBO, we have
\[\log p_\theta(\mathbf{x}) = \mathrm{ELBO} + D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}\vert\mathbf{x}) \vert\vert p_\theta(\mathbf{z}|\mathbf{x}))\\ = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] + D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p_\theta(\mathbf{z}|\mathbf{x}))\]The ELBO can be further written as
\[\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}| \mathbf{z})p(\mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right]\\ = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right] + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] \\ = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})|| p(\mathbf{z}))\]Therefore, the expression of $p_\theta(\mathbf{x})$ is
\[\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right] - D_{\mathrm{KL}}[q_\phi(\mathbf{z}|\mathbf{x})|| p(\mathbf{z})] + D_{\mathrm{KL}}[q_{\phi}(\mathbf{z}|\mathbf{x}) || p_\theta(\mathbf{z}|\mathbf{x})] \\ \geq \underbrace{\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right]}_{\mathrm{reconstruction\ term}} - \underbrace{D_{\mathrm{KL}}[q_\phi(\mathbf{z}|\mathbf{x})|| p(\mathbf{z})]}_{\mathrm{prior\ matching\ term}} = \mathrm{ELBO}\]Computing $D_{\mathrm{KL}}[q_{\phi}(\mathbf{z}\vert\mathbf{x}) \vert\vert p_\theta(\mathbf{z}\vert\mathbf{x})]$ is intractable, since it involves the true posterior, but we know it is non-negative; therefore we only need to evaluate the ELBO, which consists of two terms: the reconstruction term and the prior matching term.
In order to maximize the ELBO, we want to maximize the reconstruction term and minimize the prior matching term.
\[\max_{\phi,\theta}\left(\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right] - D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathbf{x})|| p(\mathbf{z}))\right)\]Let's look at the second term: minimizing the KL divergence between $q_\phi(\mathbf{z}\vert\mathbf{x})$ and $p(\mathbf{z})$ means making them as similar as possible. $p(\mathbf{z})$ is a prior distribution over the latent variables that we choose in advance. It is easy to see that the choice of $p(\mathbf{z})$ does not constrain the expressiveness of $p(\mathbf{x})$, so theoretically we could choose any $p(\mathbf{z})$. In practice, $p(\mathbf{z})$ is chosen to be a standard multivariate Gaussian
\[p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})\]The general form of a multivariate Gaussian with $n$ variables is
\[\mathcal{N}(\mathbf{z}; \mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{n}\det [\mathbf{\Sigma}]}} \exp\left[-\frac{1}{2}(\mathbf{z} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1}(\mathbf{z}-\mathbf{\mu})\right]\]where $\mathbf{\mu}$ is the mean of $\mathbf{z}$, and $\mathbf{\Sigma}$ is the $n\times n$ covariance matrix: $\Sigma_{ij} = \mathbb{E}[(z_i-\mu_i)(z_j - \mu_j)]$.
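As a small worked example (with arbitrary numbers, checked against SciPy's implementation), the density can be evaluated directly:
import numpy as np
from scipy.stats import multivariate_normal
mu = np.array([0.0, 1.0])                        # arbitrary mean
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])       # arbitrary covariance
z = np.array([0.5, 0.5])
n = len(mu)
dens = np.exp(-0.5 * (z - mu) @ np.linalg.inv(Sigma) @ (z - mu)) / np.sqrt((2 * np.pi)**n * np.linalg.det(Sigma))
print(np.isclose(dens, multivariate_normal(mu, Sigma).pdf(z)))  # True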
Then the encoder $q_\phi(\mathbf{z}\vert\mathbf{x})$ is commonly chosen to be a multivariate Gaussian with diagonal covariance
\[q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \mathbf{\mu}_{\phi}(\mathbf{x}), \mathbf{\sigma}^2_\phi(\mathbf{x})\mathbf{I})\]Then the KL divergence $D_{\mathrm{KL}}[q_\phi(\mathbf{z}\vert\mathbf{x})\vert\vert p(\mathbf{z})]$ can be computed analytically.
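For this choice the KL divergence has a simple closed form, $D_{\mathrm{KL}} = \frac{1}{2}\sum_i\left(\sigma_{\phi,i}^2 + \mu_{\phi,i}^2 - 1 - \log \sigma_{\phi,i}^2\right)$; here is a minimal sketch with made-up encoder outputs:
import numpy as np
def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))                    # 0.0 when q equals the prior
print(kl_to_standard_normal(np.array([0.5, -0.2]), np.array([0.8, 1.1])))  # > 0 otherwise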
This term ensures that the learned distribution is modeling effective latent data that the original data can be regenerated from. This term can be evaluated using Monte Carlo sampling:
\[\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}| \mathbf{z})\right] \approx \frac{1}{L}\sum_{l=1}^L \log p_\theta(\mathbf{x}| \mathbf{z}^{(l)})\]where the latents $\{\mathbf{z}^{(l)}\}_{l=1}^L$ are sampled from $q_\phi(\mathbf{z}\vert\mathbf{x})$ for every $\mathbf{x}$ in the dataset.
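A minimal sketch of this Monte Carlo estimate, assuming a hypothetical Gaussian decoder log-likelihood and stand-in samples for $q_\phi(\mathbf{z}\vert\mathbf{x})$:
import numpy as np
rng = np.random.default_rng(0)
def log_p_x_given_z(x, z):
    # hypothetical decoder: unit-variance Gaussian around z (up to an additive constant)
    return -0.5 * np.sum((x - z)**2)
x = np.array([0.3, -0.7])
z_samples = rng.standard_normal((8, 2))                       # stand-in for z^(l) ~ q_phi(z|x), L = 8
recon = np.mean([log_p_x_given_z(x, z) for z in z_samples])   # Monte Carlo estimate of the reconstruction term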
If we directly sample the $\mathbf{z}$ values from $q_\phi(\mathbf{z}\vert\mathbf{x})$, then since $\{\mathbf{z}^{(l)}\}_{l=1}^L$ are generated stochastically, the gradient with respect to $\phi$ cannot be evaluated easily. This can be resolved by reparameterization. The idea is to move the stochastic node out of the network, as follows (lecture slide from MIT).
The reparameterization trick rewrites a random variable as a deterministic function of a noise variable (i.e. another random variable). For example, sampling from a normal distribution $x\sim \mathcal{N}(x;\mu, \sigma^2)$ with arbitrary $\mu$ and $\sigma$ can be derived from a standard normal distribution of an auxiliary noise variable $\epsilon \sim \mathcal{N}(\epsilon;0, 1)$ from
\[x = \mu + \sigma \epsilon, \mathrm{\ with \ } \epsilon \sim \mathcal{N}(\epsilon;0, 1)\]where we shifted the mean of $\epsilon$ by $\mu$ and stretched the variance of $\epsilon$ by $\sigma^2$. Thus each $\mathbf{z}^{(l)}$ can be computed by
\[\mathbf{z}^{(l)} = \mathbf{\mu}_\phi(\mathbf{x}) + \mathbf{\sigma}_\phi(\mathbf{x})\cdot \mathbf{\epsilon}^{(l)}\]where $\mathbf{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{\epsilon}; \mathbf{0}, \mathbf{I})$.
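A minimal sketch of this reparameterized sampling, with made-up values standing in for the encoder outputs $\mathbf{\mu}_\phi(\mathbf{x})$ and $\mathbf{\sigma}_\phi(\mathbf{x})$:
import numpy as np
rng = np.random.default_rng(0)
mu_phi = np.array([0.5, -1.0])      # made-up encoder mean for one input x
sigma_phi = np.array([0.1, 0.3])    # made-up encoder standard deviation
L = 4                               # number of samples
eps = rng.standard_normal((L, 2))   # eps^(l) ~ N(0, I)
z = mu_phi + sigma_phi * eps        # z^(l): deterministic in (mu, sigma), randomness isolated in eps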
Next, we can generate new data $\mathbf{x}$ from $\mathbf{z}^{(l)}$ with the decoder. In a VAE, the dimension of $\mathbf{z}$ is typically chosen to be smaller than the dimension of $\mathbf{x}$.
The algorithm is shown as follows (lecture slides from Stanford):
An example code of VAE is here.
The idea of Hierarchical VAE is straightforward: instead of one layer of latent variables, multiple layers of latent variables are used, as follows
The chain rule for Markovian HVAE is
\[p_\theta(\mathbf{x},\mathbf{z}_{1:T}) = p_\theta(\mathbf{x}|\mathbf{z}_1)\left(\prod_{t=2}^T p_\theta(\mathbf{z}_{t-1}|\mathbf{z}_t)\right)p(\mathbf{z}_T),\\ q_\phi(\mathbf{z}_{1:T}|\mathbf{x}) = \left(\prod_{t=2}^T q_\phi(\mathbf{z}_t|\mathbf{z}_{t-1})\right)q_\phi(\mathbf{z}_1|\mathbf{x})\]And the ELBO is
\[\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{1:T}|\mathbf{x})}\left[\log\frac{p_\theta(\mathbf{x},\mathbf{z}_{1:T})}{q_\phi(\mathbf{z}_{1:T}|\mathbf{x})}\right]\]A VDM has the same structure as the Markovian Hierarchical VAE, with three constraints: (1) the latent dimension equals the data dimension; (2) the encoder at each time step is not learned but fixed to a linear Gaussian centered on the previous step; (3) the Gaussian parameters vary over time such that the distribution at the final step is a standard Gaussian.
The idea is to evolve the data into pure Gaussian noise, with the information about the data distribution encoded in the hyperparameters of each layer. This is a coarse-graining procedure, similar to a tensor-network contraction, as shown in the following figure:
We can take each layer as a time step $t$, and the latent variables are $\mathbf{x}_t$. $\mathbf{x}_0$ is our data, while $\mathbf{x}_T$ is the final latent variable, which follows a standard Gaussian. The chain rule becomes
\[q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1})\]At time $t$, the distribution of $\mathbf{x}_t$ is the Gaussian around $\mathbf{x}_{t-1}$.
\[q(\mathbf{x}_{t}|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\sqrt{\alpha_t}\mathbf{x}_{t-1}, (1-\alpha_t)\mathbf{I})\]where $\alpha_t$ is a parameter we choose at time $t$ to ensure that $\mathbf{x}_T$ is a standard Gaussian. Note that the encoding process is variance-preserving: if $\mathbf{x}_{t-1}$ has unit variance, then the variance of $\mathbf{x}_t$ is $\alpha_t + (1-\alpha_t) = 1$.
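A minimal sketch of this forward (noising) process for a toy one-dimensional data point, with a hypothetical schedule for $\alpha_t$:
import numpy as np
rng = np.random.default_rng(0)
T = 1000
alphas = np.linspace(0.9999, 0.98, T)   # hypothetical noise schedule
x = np.array([1.5])                     # toy data point x_0
for t in range(T):
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(alphas[t]) * x + np.sqrt(1.0 - alphas[t]) * eps   # sample from q(x_t | x_{t-1})
print(x)   # x_T, approximately a sample from N(0, 1)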
In the opposite direction, going from time $T$ to time $0$, we have
\[p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})\\ p_{\theta}(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\]where only $p(\mathbf{x}_T)$ has a fixed expression.
Since $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$ is pre-chosen, we only need to learn the backward transitions $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$, i.e., the parameters $\theta$.
Once $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ are learned, one can start from a standard Gaussian, and iteratively run the denoising transitions $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ for $T$ steps to generate new data $\mathbf{x}_0$.
We skip the derivation of the ELBO for a VDM and write down only the final equation. One can find the derivation on pages 9-10 of the paper by Calvin Luo.
\[\log p(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)]}_{\text{reconstruction term}} - \underbrace{D_{\text{KL}}(q(\mathbf{x}_T|\mathbf{x}_0)|| p(\mathbf{x}_T))}_{\text{prior matching term}} \\ - \underbrace{\sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}[D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))]}_{\text{denoising matching term}}\]The three terms correspond to
The reconstruction term can be approximated and optimized using Monte Carlo.
The prior matching term measures how close the final distribution is to a standard Gaussian; it has no trainable parameters, so we can treat it as a constant (effectively zero) and ignore it.
The denoising matching term goes in the reverse time direction. The ground truth denoising transition step is $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$, and we train the parameters $\theta$ to learn $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$.
This post explains how forward-mode and reverse-mode automatic differentiation algorithms work, using the Wikipedia example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ and Python implementations. Autodiff is conceptually easy to understand but nontrivial to implement; unless you can write your own autodiff code, you cannot say confidently that you understand how it works.
If you have some basic understanding of autodiff, you can skip my writing and jump to the implementation.
The first step is to divide $f$ into operation units labeled by $w^{n}_i$, where $n$ stands for the hierarchy level and $i$ indexes the members within the same level. The lowest level corresponds to the variables, i.e., $x_1$ and $x_2$, and the highest level corresponds to the function, i.e., $f$. Therefore, we have
\[w^2 = w^1_1 + w^1_2;\\ w^1_1 = w^0_1 * w^0_2, \quad w^1_2 = \sin(w^0_1); \\ w^0_1 = x_1, \quad w^0_2 = x_2.\]Suppose we are interested in $\partial f/\partial x_1$ and we want to fix $x_2$. The forward-mode corresponds to evaluating $\partial f/\partial x_1$ from bottom to top. We use $\dot{w} = \partial w/\partial x_1$ and work all the way up:
\[\dot{w}_1^0 = 1, \quad \dot{w}_2^0 = 0; \\ \dot{w}_1^1 = \frac{\partial w^1_1}{\partial w_1^0} \dot{w}_1^0 + \frac{\partial w^1_1}{\partial w_2^0} \dot{w}_2^0, \quad \dot{w}_2^1 = \frac{\partial w^1_2}{\partial w_1^0} \dot{w}_1^0 + \frac{\partial w^1_2}{\partial w_2^0} \dot{w}_2^0 \\ \dot{w}^2 = \frac{\partial w^2}{\partial w_1^1} \dot{w}_1^1 + \frac{\partial w^2}{\partial w_2^1} \dot{w}_2^1\]Therefore, we need to keep track of the gradient of all operation units and build them up. Because many gradient forms involve the value of the variable, e.g. $\partial (x_1 x_2)/\partial x_1 = x_2$, we also need to keep track of the values of the operation units.
When we need the derivative with respect to $x_2$, we set $\dot{w}_1^0 = 0, \dot{w}_2^0 = 1$ and redo the above. Therefore, the forward mode requires $\mathcal{O}(N M)$ operations to evaluate the full gradient, where $N$ is the number of variables and $M$ is the number of operation units.
We use a class to keep track of the value $w_i^n$ and the gradient $\dot{w}_i^n$, and we have to define the operation units ourselves, because each operation now returns not only the value but also the gradient. A vanilla Python implementation is provided in the following.
We first define the data structure fw_var:
'''
Forward-mode automatic differentiation for the function:
f = x1*x2 + sin(x1)
'''
import numpy as np
class fw_var():
# define a new data structure to store both value and gradient
def __init__(self, v=0, g=1):
self.val = v # value
self.grad = g # gradient
Next we define the operation units.
def f_prod(x, y):
    # evaluate x * y and its derivative (product rule)
    var = fw_var()
    var.val = x.val * y.val
    var.grad = x.val * y.grad + x.grad * y.val
    return var
def f_add(x, y):
    # evaluate x + y and its derivative
    var = fw_var()
    var.val = x.val + y.val
    var.grad = x.grad + y.grad
    return var
def f_sin(x):
    # evaluate sin(x) and its derivative (chain rule); used by func() below
    var = fw_var()
    var.val = np.sin(x.val)
    var.grad = np.cos(x.val) * x.grad
    return var
And finally we build $f(x_1, x_2)$ and evaluate $\partial f/\partial x_1$.
def func(x1, x2):
# The function to be evaluated
# f = x1*x2 + sin(x1)
return f_add(f_prod(x1, x2), f_sin(x1))
x1 = fw_var(0, 1)
x2 = fw_var(1, 0)
var = func(x1, x2)
print(f"Value of f: {var.val}")
print(f"Grad wrt x1: {var.grad}")
One can also redefine the +, -, * and / operators by adding __add__(), __sub__(), __mul__() and __truediv__() methods to the class fw_var. You can find many fancier versions online :)
The reverse mode is not as straightforward as the forward mode. However, it requires fewer operations, at the cost of more memory. We start from the hierarchy again
\[w^2 = w^1_1 + w^1_2;\\ w^1_1 = w^0_1 * w^0_2, \quad w^1_2 = \sin(w^0_1); \\ w^0_1 = x_1, \quad w^0_2 = x_2.\]Now we work from top to bottom, and the quantity we care about is $\tilde{w}_i^n = \partial f/\partial w_i^n$. The routine is
\[\tilde{w}^2 = \frac{\partial f}{\partial w^2} = 1 \\ \tilde{w}^1_1 = \tilde{w}^2 \frac{\partial w^2 }{\partial w^1_1}, \quad \tilde{w}^1_2 = \tilde{w}^2 \frac{\partial w^2 }{\partial w^1_2}; \\ \tilde{w}^0_1 = \tilde{w}^1_1 \frac{\partial w^1_1 }{\partial w^0_1} + \tilde{w}^1_2 \frac{\partial w^1_2}{\partial w^0_1} = \frac{\partial f}{\partial w^0_1}, \quad \tilde{w}^0_2 = \tilde{w}^1_1 \frac{\partial w^1_1}{\partial w^0_2} + \tilde{w}^1_2 \frac{\partial w^1_2}{\partial w^0_2} = \frac{\partial f}{\partial w^0_2}\]Now we see that, in order to evaluate $\tilde{w}^k$, we need the information of $\tilde{w}^{k+1}$ as well as the coefficients $\partial w^{k+1}/ \partial w^k$.
We call $w^{k+1}$ the parents and $w^k$ the children. For each child, we would need to find all of its parents and the corresponding coefficients, which is not easy. The easy direction is the opposite: given a parent, find all of its children recursively. The recursion ends at $x_1$ and $x_2$, which have no children.
First, we define a slightly more complicated data structure re_var, where self.children is a list of tuples (coeff, child); each coefficient corresponds to $\partial\,\mathrm{parent}/\partial\,\mathrm{child}$.
import numpy as np
class re_var():
    # the data structure for reverse mode
    def __init__(self, v=1, children=None):
        self.val = v # value of the variable
        # list of tuples (coeff, re_var); avoid a mutable default argument
        self.children = children if children is not None else []
        self.grad = 0 # gradient, accumulated later by calc_grad()
Then we define the unit operations
def f_add(x, y):
# x + y
var = re_var()
var.val = x.val + y.val
var.children = [(1, x), (1, y)] # child x with coefficient 1, and y with 1
return var
def f_prod(x, y):
# x * y
var = re_var()
var.val = x.val * y.val
var.children = [(y.val, x), (x.val, y)]
return var
def f_sin(x):
# sin(x)
var = re_var()
var.val = np.sin(x.val)
var.children = [(np.cos(x.val), x)]
return var
Next we define a recursive function that visits all the offspring of a given re_var object and accumulates the corresponding coefficients into the offspring's gradient values.
def calc_grad(var, acc_grad=1):
# var: the unit at the top hierarchy
# acc_grad: accumulated gradient from parents
# evaluate the gradient recursively, so every unit has a gradient
var.grad += acc_grad
for coef, child in var.children:
calc_grad(child, coef * acc_grad)
Finally, we define $x_1$, $x_2$ and $f(x_1, x_2)$, and evaluate the gradients.
def func(x1, x2):
return f_add(f_prod(x1, x2), f_sin(x1))
# define x1 and x2
x1 = re_var(0)
x2 = re_var(1)
var = func(x1, x2)
calc_grad(var)
print("value: ", var.val)
print("gradient wrt x1: ", x1.grad)
print("gradient wrt x2: ", x2.grad)
The complexity is $\mathcal{O}(M+N)$, which is much smaller than $\mathcal{O}(MN)$ for the forward mode.