
Matchbox recommender message 7 #428

Open
makobot-sh opened this issue Dec 29, 2022 · 12 comments


makobot-sh commented Dec 29, 2022

Hi everyone,

I've been looking for where message 7 ($m_{\ast\rightarrow s}$, as labelled in the factor graph of the Matchbox paper [1]) is implemented in the Infer.NET code, but I can't seem to find it. As a study exercise, @glandfried and I have been trying to implement the collaborative filtering version of the Matchbox recommender algorithm, as described in the Matchbox paper. We have been able to implement almost everything. However, our implementation of approximate message 7 differs greatly from the exact version of the message, which is bimodal. In contrast, we had no issues with approximate message 2.

If anyone could help me find the Infer.NET implementation of the message so we can compare, I would appreciate it. So far I could only find approximate message 2 ($m_{\ast\rightarrow z}$) at
infer/src/Runtime/Factors/Product.cs#ProductAverageLogarithm.

As a reminder, I copy here the factor graph,

[figure: Matchbox factor graph]

and the approximation equations (approximate messages 2 and 7 respectively),

[figure: approximate messages (2) and (7) from the Matchbox paper]

from the original Matchbox paper. I interpret them as follows:

[figure: my interpretation of approximate messages (2) and (7)]

Here $\mu_t$ and $\sigma_t^2$ denote the mean and variance of the (Gaussian) marginal distribution $p(t)$, and $\mu_{z\rightarrow\ast}$ and $\sigma^2_{z\rightarrow\ast}$ denote the mean and variance of message 6.

It would also be nice to get a hint on how to derive the approximations for messages 2 and 7 on our own (or a reference).

Thanks in advance!

[1] Matchbox Paper: https://www.microsoft.com/en-us/research/publication/matchbox-large-scale-bayesian-recommendations/


tminka commented Dec 30, 2022

Message (7) is implemented in infer/src/Runtime/Factors/Product.cs#AAverageLogarithm. Messages (2) and (7) come from Variational Message Passing section 4.2. A concise form of those messages can be found in Table 1 of Gates: A Graphical Notation for Mixture Models.
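For reference, here is a minimal moment-form sketch of the kind of update AAverageLogarithm performs for the product factor z = a*b (a paraphrase in terms of Gaussian moments, not the actual Product.cs code):

# Sketch of the VMP message to a for the product factor z = a*b,
# written in terms of Gaussian moments (a paraphrase, not the literal Product.cs code).
def vmp_message_to_a(mean_z_msg, var_z_msg, mean_b, var_b):
    """Gaussian message to a, given the Gaussian message arriving on z
    and the current moments of b."""
    Eb = mean_b                 # <b>
    Eb2 = mean_b**2 + var_b     # <b^2>
    precision = Eb2 / var_z_msg
    mean_times_precision = Eb * mean_z_msg / var_z_msg
    return mean_times_precision / precision, 1.0 / precision  # (mean, variance)

# With the moments used later in this thread (message on z: N(2,1), marginal of t: N(1,4)):
print(vmp_message_to_a(2.0, 1.0, 1.0, 4.0))  # -> (0.4, 0.2)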

@glandfried

Hi and thanks Tom.

The reference has been very helpful. It allowed us to verify that the exact message 7 we calculated was in fact correct, since when we apply the AverageLog to it we get almost exactly the approximate Gaussian proposed in the Matchbox paper. We know that the variational approximations should minimize the KL divergence with respect to the exact message. However, the proposed Gaussian is a good approximation of the transformation of the exact message, but not of the exact message itself. Indeed, the distance between the approximate message and the exact one is very large. For example, in the following figure we can see the three distributions: the exact message (blue), the transformation of the exact message (orange), and the approximate Gaussian proposed in the Matchbox paper (green).

[figure: approx-m7-t1 — exact message 7 (blue), its AverageLog transform (orange), and the approximate Gaussian from the paper (green)]

Why, then, are we using this approximation, which does not minimize the KL divergence with respect to the exact message? Even if applying the inverse of the AverageLog to it gave us a good approximation to the exact message, we would be back to the original problem: an intractable distribution. What am I missing?

Thank you for all your contributions to modern Bayesian inference.
Gustavo Landfried


Appendix

The code to replicate the figure.

import matplotlib.pyplot as plt
from scipy.stats import norm
import math
import numpy as np

# Grid and constants
step = 0.025 # to create grids
tk = np.arange(-6, 6.1, step) # grid for t_k variable
zk = np.arange(-6, 6.1, step) # grid for z_k variable (defined for completeness; not used below)
mtk = 1; mzk = 2; sdzk = 1; sdtk = 2 # mean and standard deviation for t_k and z_k 

# The exact message 7 (an integral for each value of s_k) 
mensaje_7_exacto = []
Sk = np.arange(-9,9+step, step)
for s in Sk:
    integral = sum(norm.pdf(s*np.array(tk),mzk,sdzk)*norm.pdf(np.array(tk),mtk,sdtk))*step
    mensaje_7_exacto.append(integral)

# AverageLog of the exact message 7
mensaje_7_AverageLog = []
Sk = np.arange(-9,9+step, step)
for s in Sk:
    integral = np.exp(sum(norm.logpdf(s*np.array(tk),mzk,sdzk)*norm.pdf(np.array(tk),mtk,sdtk))*step) # logpdf avoids log(0) underflow in the tails
    mensaje_7_AverageLog.append(integral)

fig = plt.figure()
plt.plot(Sk,mensaje_7_exacto/(sum(mensaje_7_exacto)*step), label="Exact m7")
plt.plot(Sk,mensaje_7_AverageLog/(sum(mensaje_7_AverageLog)*step), label="Exact m7 with AverageLog")
plt.plot(Sk, norm.pdf(Sk, mtk*mzk /(sdtk**2 + mtk**2),  sdzk /np.sqrt((sdtk**2 + mtk**2)) ) , label="Approx m7" )
plt.legend(loc="upper left",fontsize=10)
fig.savefig("approx-m7-t1.pdf", bbox_inches = 'tight')


tminka commented Jul 2, 2023

The reason for using this approximation is in section 3.1.2 of the paper you linked.

@glandfried

I still don't understand why the approximate message 7 does not minimize the KL divergence with respect to the exact message 7.


tminka commented Jul 10, 2023

It minimizes KL divergence with the arguments swapped. Is your issue that the code doesn't do what the paper says, or that you disagree with what the paper says to do?


glandfried commented Jul 14, 2023

Our original goal was to implement the Matchbox model from scratch, as an exercise to learn as much as possible from the methods you created. Matchbox is particularly interesting because it offers an analytical approximation (the messages 2 and 7 proposed in the paper) to a common problem: the product of two Gaussian variables. The issue is that, during the implementation, we found that the approximate message 7 proposed in the Matchbox paper does not minimize the reverse KL divergence with respect to the exact message 7. Since we couldn't find an error in our calculations, we decided to examine the original code implemented by you. Thanks to your initial response, we were able to verify that the exact message 7 we calculated was correct: when we apply the AverageLog to it, we obtain exactly the approximate Gaussian proposed in the Matchbox paper. The question now is: why does the approximate message 7 proposed in the paper not minimize the reverse KL divergence (at least according to our calculations)?

Exact analytic message

The following is the exact analytic message.
[figure: exact analytic message 7]

Each message 7 represents an integral of the joint distribution below the isoline defined by s_k. In the following images, we present the joint distribution with four isolines on the left side, and the corresponding areas below those isolines on the right side.
[figure: joint distribution with four isolines (left) and the areas below those isolines (right)]

Collectively, these integrals form the likelihood that the exact message 7 sends to the latent variable $s_k$, which naturally exhibits two peaks and two very long tails.

The reverse KL divergence

To evaluate the reverse KL divergence, we implemented a simple numerical integration. For example, when approximating a mixture of Gaussians with a unimodal Gaussian under the reverse KL divergence, we obtained the following result: there are two local minima, one for each peak, and a single global minimum corresponding to the highest peak.

[figures: (a) the Gaussian minimizing the reverse KL divergence to the mixture; (b) the reverse KL divergence over a grid of means and standard deviations]

Reverse KL divergence between the approximate and the exact messages

The exact message 7 has a similar structure. However, unlike the mixture of Gaussians, its tails are extremely wide, which forces the reverse-KL-minimizing approximation to find a very different compromise than in the mixture-of-Gaussians example.

[figures: reverse KL divergence for a range of means and standard deviations; the best approximation found; the divergence evaluated point by point]

It seems that the width of the tails has a more significant impact than the two peaks, which, ultimately, are not that far apart.
[figure: min-KL-MB-proposed]

Thanks for everything

We provided this explanation because we are genuinely interested in understanding the details of the valuable methodology that you have developed. We will always be indebted to you. Thanks again, @tminka.


tminka commented Nov 22, 2023

Please write out the formula that you think should be minimized. I suspect there is some confusion about what "minimize the KL divergence" actually means.

@glandfried

The mathematical definition of the KL-Divergence is,

$$KL(p||q) = - \sum_{x \in X} p(x) * \log( \frac{q(x)}{p(x)} )$$

with $p$ the true distribution and $q$ the approximating distribution. The reverse KL divergence is then $KL(q||p)$ (in Python we can compute it as the sum of scipy.special.rel_entr(q, p), which returns the elementwise terms $q\log(q/p)$).
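A quick numerical illustration of the two directions, using an arbitrary pair of discretized Gaussians (the parameter values are placeholders):

import numpy as np
from scipy.stats import norm
from scipy.special import rel_entr

# Discretize two densities on a grid (placeholder example)
step = 0.01
x = np.arange(-10, 10, step)
p = norm.pdf(x, 0, 1) * step   # "true" distribution
q = norm.pdf(x, 1, 2) * step   # approximating distribution
p /= p.sum(); q /= q.sum()     # normalize the discrete masses

forward_kl = rel_entr(p, q).sum()  # KL(p||q)
reverse_kl = rel_entr(q, p).sum()  # KL(q||p), the direction minimized here
print(forward_kl, reverse_kl)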

In the previous message I minimized this divergence in a well-known example. In the following figure (already shown in the previous message), the red Gaussian distribution minimizes the reverse KL divergence (within the Gaussian family) with respect to the blue distribution (a mixture of Gaussians), $KL(\text{red}||\text{blue})$.

[figures: (a) the Gaussian minimizing the reverse KL divergence to the mixture; (b) the reverse KL divergence over a grid of means and standard deviations]

Here is the code that reproduces the left-side figure:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm # Gaussian distribution
from scipy.special import rel_entr # KL-Divergence
from scipy.optimize import basinhopping # Optimization algorithm

def Dkl(p, q):
    return rel_entr(p,q)

# Grid
step = 0.025
x_grid = np.arange(-6,6,step)

# True distribution (a mixture of Gaussians) and the reverse-KL objective for its Gaussian approximation
p_mixture = norm.pdf(x_grid,2,0.5)*0.6 + norm.pdf(x_grid,-2,0.5)*0.4
def objetive_reverse_KL(mean_sd):
    # Discrete analogues of the continuous densities (mass = density x bin width)
    P = p_mixture * step
    Q = norm.pdf(x_grid, mean_sd[0], mean_sd[1]) * step
    return sum(Dkl(Q,P)) # reverse KL: KL(Q||P)

muMix_QP, sigmaMix_QP = basinhopping(objetive_reverse_KL, [2,0.5])["x"]

fig = plt.figure()
plt.plot(x_grid,p_mixture )
plt.plot(x_grid,norm.pdf(x_grid,muMix_QP,sigmaMix_QP))
plt.show()

We can verify that the optimized mean and standard deviation are correct, since they correspond to the global minimum of the right-side figure, where we plot the value of the reverse KL divergence for each combination of mean and standard deviation.
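For completeness, a brute-force grid search that reproduces this check (it reuses x_grid, p_mixture and step from the code above):

# Brute-force verification of the basinhopping optimum (reuses x_grid, p_mixture, step)
best = (np.inf, None, None)
for mu in np.arange(-4, 4.1, 0.1):
    for sigma in np.arange(0.1, 4.1, 0.1):
        P = p_mixture * step
        Q = norm.pdf(x_grid, mu, sigma) * step
        d = sum(rel_entr(Q, P))  # reverse KL: KL(Q||P)
        if d < best[0]:
            best = (d, mu, sigma)
print("grid-search minimum:", best)  # should land near the highest peak of the mixture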

Exactly the same procedure was used to compute the reverse KL divergence between the exact distribution (known as message 7 in the Matchbox paper) and the approximate distribution proposed in the paper and implemented in Infer.NET.

As a reminder, the factor graph of MatchBox is,

[figure: Matchbox factor graph]

Then, the exact distribution can be computed as,

[figure: exact message 7]

and the proposed distribution in the paper is,

[figure: proposed approximate message 7]

where $\mu_t$ and $\sigma_t^2$ denote the mean and variance of message 1 of the factor graph (which is equivalent to the prior of variable $t$), and $\mu_{z\rightarrow\ast}$ and $\sigma^2_{z\rightarrow\ast}$ denote the mean and variance of message 6 of the factor graph. The subscript $k$ is the plate index in the factor graph.
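With the values used in the code below ($\mu_{t_k} = 1$, $\sigma_{t_k} = 2$, $\mu_{6_k} = 2$, $\sigma_{6_k} = 1$), the proposed approximation works out to

$$\mathcal{N}\!\left(s_k \,\middle|\, \frac{2 \cdot 1}{2^2 + 1^2},\ \frac{1^2}{2^2 + 1^2}\right) = \mathcal{N}(s_k \mid 0.4,\ 0.2),$$

that is, a Gaussian with mean $0.4$ and standard deviation $1/\sqrt{5} \approx 0.447$.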

In the following figure we plot the exact distribution (blue) and the approximated distribution (green).

[figure: exact distribution (blue) and approximate distribution (green)]

Even with the naked eye we can see that the approximate distribution is not close to the exact distribution. Here is the code to reproduce the figure:

import matplotlib.pyplot as plt
from scipy.stats import norm
import math
import numpy as np

# Mean and SD of message 1 and message 6 at the MatchBox factor graph
mtk = 1; sdtk = 2 # Message 1 (equivalent to the prior of variable t)
m6k = 2; sd6k = 1 # Message 6

# Grid
step = 0.025
tk = np.arange(-6, 6.1, step) # grid for t_k variable
zk = np.arange(-6, 6.1, step) # grid for z_k variable (defined for completeness; not used below)
Sk = np.arange(-9,9+step, step) # grid for s_k variable

# The proposed approximation
message_7_approx = norm.pdf(Sk, mtk*m6k /(sdtk**2 + mtk**2),  sd6k /np.sqrt((sdtk**2 + mtk**2)))

# The exact message 7 (an integral for each value of s_k)
message_7_exact = []
for s in Sk:
    integral = sum(norm.pdf(s*np.array(tk),m6k,sd6k)*norm.pdf(np.array(tk),mtk,sdtk))*step
    message_7_exact.append(integral)

message_7_exact = message_7_exact/(sum(message_7_exact)*step) # Normalization

# Plot
fig = plt.figure()
plt.plot(Sk, message_7_exact, label="Exact m7")
plt.plot(Sk, message_7_approx, label="Approx m7")
plt.legend(loc="upper left",fontsize=10)
plt.show()

After computing the reverse KL divergence for all combinations of mean and standard deviation, $KL(\mathcal{N}(\mu,\sigma^2)||\text{Exact Distribution})$, we get a single minimum, and it is not similar to the proposed approximate distribution.

[figures: reverse KL divergence for a range of means and standard deviations; the best approximation found]

Here is the code to reproduce the figure:

import matplotlib.pyplot as plt
from scipy.stats import norm
import math
import numpy as np
from scipy.special import rel_entr # KL-Divergence

# Mean and SD of message 1 and message 6 at the MatchBox factor graph
mtk = 1; sdtk = 2 # Message 1 (equivalent to the prior of variable t)
m6k = 2; sd6k = 1 # Message 6

# Grid for variable t and variable s
step_long=0.05; Sk_long = np.arange(-100,100+0.05, step_long); tk_long = Sk_long

# Exact distribution (message 7 in MatchBox factor graph)
message_7_exact_long = []
for s in Sk_long:
    # message_7(s) = integral of N(s*t | m6k, sd6k) * N(t | mtk, sdtk) dt, approximated by quadrature over the t grid
    integral = sum(norm.pdf(s*np.array(tk_long),m6k,sd6k) * norm.pdf(np.array(tk_long),mtk,sdtk))*step_long
    message_7_exact_long.append(integral)

message_7_exact_long = message_7_exact_long/(sum(message_7_exact_long)*step_long)


# Compute the KL Divergence for a range of MU and SD
mus = np.arange(-15,15.25,0.25)
sigmas = np.arange(1,48,1)
muXsigma = []; min_divergence = np.inf;
for m in mus:
    muXsigma.append([])
    for s in sigmas:
        P = message_7_exact_long * step_long # Deriving discrete analogues (exact dist)
        Q = norm.pdf(Sk_long,m,s) * step_long # Deriving discrete analogues (approx dist)
        Dkl_value = sum(rel_entr(Q,P)) # Compute numerical reverse KL Divergence
        muXsigma[-1].append( Dkl_value )
        # Follow the distribution with minimum reverse KL divergence
        if muXsigma[-1][-1] < min_divergence:
            min_divergence = muXsigma[-1][-1]
            argmin = (m, s)


fig = plt.figure()
plt.imshow(muXsigma, interpolation='nearest', origin='lower', aspect='auto', extent=(np.min(sigmas), np.max(sigmas), np.min(mus), np.max(mus))) # origin='lower' keeps the image aligned with the contour overlay
contours = plt.contour(sigmas, mus, muXsigma, 10, colors='black')
plt.clabel(contours, inline=1, fontsize=10)
plt.axhline(y=0, color="gray")
plt.scatter(argmin[1],argmin[0], color="black" )
plt.xlabel("Sigma",fontsize=14)
plt.ylabel("Mu",fontsize=14)
plt.show()

Even though the true distribution has two modes, in this example the tails are extremely wide (unlike in the mixture-of-Gaussians example). This explains why the distribution that minimizes the reverse KL divergence does not select one mode, as it did in the mixture-of-Gaussians example.

We'd greatly appreciate any comment, since we are genuinely interested in understanding the details of the methodology that you have developed. If we've made a mistake somewhere along the way, we'd love to understand it. I fixed some minor errors, but I haven't found any major mistake. We are not obsessed with the fact that the proposed approximate distribution does not minimize the reverse KL divergence (I am not a mathematician, and I don't even know how to derive an analytical formula for the approximate distribution). However, Macarena Piaggio (@makobot-sh) has replicated the Matchbox tutorial using exactly the same code available in the Infer.NET repository (see the diff between files) and has obtained low performance both in terms of estimates and evidence (issue 449). So we find ourselves stumped on both the math side and the implementation side.


tminka commented Nov 23, 2023

The issue is that this is not what "minimizing the KL divergence" means. The messages in EP and VMP minimize KL divergence of the approximate posterior to the true posterior (the thing that actually matters), not the KL divergence between the "exact message" (as you call it) and the approximate message. Messages are not distributions so it doesn't even make sense to ask for their KL divergence.


glandfried commented Nov 24, 2023

Yes, you are right: messages 8 (or 7) are likelihoods, so they are not distributions. It is our homework to set up the KL divergence correctly. Thank you very much. As we mentioned before, we have tried to replicate the output of the Matchbox tutorial using the source code available in Infer.NET, modifying as few lines as possible (diff between files), but we have obtained results that do not match those reported by you (issue 449). We are going to review the code again before asking another question. Thank you very much. Is there an active community where we can ask this kind of question?

@glandfried

Hello again,

Here we follow your technical report "Divergence measures and message passing" [1], rephrasing it as closely as possible. To recap, we want the best approximation $q$ that minimizes the reverse KL divergence with respect to the true distribution $p$,

$$\text{arg min}_{q} KL(q||p)$$

First, we must write $p$ as a product of factors,

$$p(x) = \prod_a f_a(x)$$

This defines the preferred factorization, which is not unique. Then, each factor will be approximated by a member of a family $\hat{f}_a \in \mathcal{F}$, here a Gaussian distribution. We get the best approximation $q$ to the posterior $p$ by approximating each individual factor in $p$, $f_a \approx \hat{f}_a$.

$$q(x) = \prod_a \hat{f}_a(x)$$

Here we want the factors $\hat{f}_a$ that solve $\text{arg min}_{\hat{f}_a} KL(\hat{f}_a\,q^{\setminus a}\,||\,f_a\,p^{\setminus a})$, where $q^{\setminus a} = \prod_{b\neq a}\hat{f}_b$ and $p^{\setminus a}= \prod_{b\neq a}f_b$ are the products of all the other factors.

The best approximation of each factor depends on the rest of the network, giving a chicken-and-egg problem, which is solved by an iterative message-passing procedure: each factor sends its best approximation to the rest of the net, and then recomputes its best approximation based on the messages it receives. To make this tractable, we must assume that the approximations we have already made, $q^{\setminus a}$, are a good approximation to the rest of the network, $p^{\setminus a}\approx q^{\setminus a}$, at least for the purpose of solving for $\hat{f}_a$. The problem then becomes,

$$\hat{f}_a = \text{arg min}_{\hat{f}_a} KL(\hat{f}_a\,q^{\setminus a}||f_a\,q^{\setminus a})$$

Because $q$ is fully factorized, each factor approximation $\hat{f}_a(\vec{x})$ can itself be factorized into messages that factor $a$ sends to each individual variable $j$ in the vector $\vec{x}$, expressing its approximation as,

$$\hat{f}_a(\vec{x}) = \prod_j \hat{m}_{a\rightarrow j}(x_j)$$

Individual messages need not be normalized, and need not be proper distributions. In a fully factorized context we only need to propagate messages between a factor and its variables.
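As a compact summary (our paraphrase of the standard fully-factorized VMP fixed point, not a quote from the report), each approximate marginal satisfies

$$q(x_j) \;\propto\; \exp\!\Big(\sum_{a \in \text{ne}(j)} \mathbb{E}_{q(\vec{x} \setminus x_j)}\big[\log f_a(\vec{x})\big]\Big),$$

so each message a factor sends to a variable is the exponentiated expected log-factor, with the expectation taken under the current approximate marginals of the other variables.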

In this issue we are discussing whether the approximation of message 7 proposed in the Matchbox recommender paper is indeed the best one or not. We believe we have found a counterexample. In the next figure we recall Matchbox's factor graph, with labels for the main messages.

[figure: Matchbox factor graph with the main messages labelled]

In particular, at each iteration of the message-passing procedure, we know that the true posterior of variable $s_k$ given an observation $r$ is proportional to the following factorized marginal distribution,

$$p(s_k|r) \propto p(s_k,r) \approx q(s_k) = \underbrace{m_{s_k \rightarrow \ast}(s_k)}_{\text{Message 1 for }s_k} \ \ \underbrace{m_{\ast \rightarrow s_k}(s_k)}_{\text{Message 7 for }s_k}$$

In the collaborative filtering version of Matchbox, message 1 for variable $s_k$ is just the prior for an individual $i$, $p(u_{ki}) = \mathcal{N}(u_{ki}|\mu_{u_{ki}}, \, \sigma^2_{u_{ki}})$. By the sum-product algorithm we also know that message 7 is,

$$m_{\ast \rightarrow s_k}(s_k) = \int \int \delta(z_k = s_k \, t_k) \ \underbrace{m_{z_k \rightarrow \ast}(z_k)}_{\text{Message 6 for }z_k} \ \ \underbrace{m_{t_k \rightarrow \ast}(t_k)}_{\text{Message 1 for }t_k} \ \ dt_k \, dz_k$$

where $\delta(\cdot)$ is the Dirac delta, and it can be shown that both incoming messages are Gaussian. Then, the exact message 7 becomes,

$$\begin{align} \tag{1} \underbrace{m_{\ast \rightarrow s_k}(s_k)}_{\text{Exact message 7}} & = \int_{z_k} \int_{t_k} \delta(z_k = s_k \, t_k) \ \mathcal{N}(z_k|\mu_{6_k}, \, \sigma^2_{6_k}) \ \mathcal{N}(t_k|\mu_{t_k}, \, \sigma^2_{t_k}) \ dt_k \, dz_k \\ & = \int_{t_k} \mathcal{N}(s_k\,t_k|\mu_{6_k}, \, \sigma^2_{6_k}) \ \mathcal{N}(t_k|\mu_{t_k}, \, \sigma^2_{t_k}) \ dt_k \end{align}$$

We can compute a good numerical approximation of the exact message 7 using a simple quadrature procedure on a dense grid. Therefore, we must minimize the following reverse KL divergence,

$$\begin{align*} \hat{f}_{\ast}(s_k) & = \text{arg min}_{\hat{f}} \ KL\big(\hat{f}_{\ast}(s_k)\ q^{\setminus \ast}(s_k)\ ||\ f_{\ast}(s_k)\ q^{\setminus \ast}(s_k)\big) \\ & \underset{(55,\ \text{minka2005})}{=} \text{arg min}_{\hat{f}} \ KL\big(\hat{f}_{\ast}(s_k)\ m_{s_k\rightarrow \ast}(s_k)\ ||\ f_{\ast}(s_k)\ m_{s_k\rightarrow \ast}(s_k)\big) \\ & \underset{(53,\ \text{minka2005})}{=} \text{arg min}_{\hat{m}} \ KL\big(\underbrace{\hat{m}_{\ast\rightarrow s_k}(s_k)}_{\text{Approx message 7}}\ \underbrace{m_{s_k\rightarrow \ast}(s_k)}_{\text{Message 1 for }s_k}\ ||\ \underbrace{m_{\ast\rightarrow s_k}(s_k)}_{\text{Exact message 7}}\ \underbrace{m_{s_k\rightarrow \ast}(s_k)}_{\text{Message 1 for }s_k}\big) \\ & = \text{arg min}_{\hat{q}} \ KL\big(\hat{q}(s_k)\ ||\ q(s_k)\big) \end{align*}$$

where $\hat{q}$ is the approximate marginal and $q$ the exact marginal in a given iteration of the message-passing procedure. The proposed approximation to message 7 in Matchbox's paper is,

$$\begin{align*} \hat{m}_{\ast \rightarrow s_k}(s_k) & = \mathcal{N}\left(s_k \,\middle|\, \frac{\langle m_{z_k \rightarrow \ast} \rangle \langle t_k \rangle}{\langle t_k^2 \rangle} , \ \frac{\langle m_{z_k \rightarrow \ast}^2 \rangle - \langle m_{z_k \rightarrow \ast} \rangle^2 }{\langle t_k^2 \rangle } \right) \\ & \overset{\dagger}{=} \mathcal{N}\left(s_k \,\middle|\, \frac{\mu_{6_k} \, \mu_{t_k}}{\sigma_{t_k}^2 + \mu_{t_k}^2 }, \ \frac{(\mu_{6_k}^2 + \sigma_{6_k}^2) - \mu_{6_k}^2}{\mu_{t_k}^2 + \sigma_{t_k}^2} \right) \end{align*}$$

where $\overset{\dagger}{=}$ follows from the moments of the Gaussian messages, and $\mu_{t_k}$, $\sigma_{t_k}^2$, $\mu_{6_k}$, $\sigma_{6_k}^2$ are the means and variances that appear in equation (1) for the exact message 7; this is what is implemented in infer/src/Runtime/Factors/Product.cs#AAverageLogarithm. We can evaluate whether the proposed approximation of message 7 is indeed the best one by setting initial values for each of these quantities and computing the reverse KL divergence between the approximate posterior of $s_k$ and the exact posterior of $s_k$ (the normalized versions of $\hat{q}$ and $q$).

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from scipy.special import kl_div # KL-Divergence
from scipy.optimize import basinhopping # to search for the global minimum

mu_uki = 1; sd_uki = 2 # Prior of U_ki
mu_tk = 1; sd_tk = 2 # Message 1 for tk
mu_6k = 2; sd_6k = 1 # Message 6

# Grid
step = 0.025
grid = list(np.arange(-19,19+step, step)); tk = grid; sk = grid; uki = grid;

# The proposed approximate message 7
mu_7sk = (mu_tk*mu_6k) / (sd_tk**2 + mu_tk**2)
sd_7sk = np.sqrt( (sd_6k**2) / (sd_tk**2 + mu_tk**2) )
message_7_approx = norm.pdf(sk, mu_7sk, sd_7sk )

## Exact message 7 (an integral for each value of s_k)
message_7_exact = []
for s in sk:
    integral = sum(norm.pdf(s*np.array(tk),mu_6k,sd_6k)*norm.pdf(np.array(tk),mu_tk,sd_tk))*step
    message_7_exact.append(integral)

# Exact message 1 (same as approx message 1)
message_1 = norm.pdf(sk, mu_uki, sd_uki)

# Posteriors (exact and approx)
exact_marginal = np.array(message_1) * np.array(message_7_exact)
exact_posterior = exact_marginal/sum(exact_marginal*step)
approx_marginal = np.array(message_1) * message_7_approx
approx_posterior = approx_marginal/sum(approx_marginal*step)

# Reverse KL minimization: find the Gaussian closest (in KL(Q||P)) to the exact posterior
def objetive_QP(mu_sd):
    return sum(kl_div(norm.pdf(sk, *mu_sd)*step, exact_posterior*step))

mu_minQP, sd_minQP = basinhopping(objetive_QP, [2,0.5])["x"]

# Plot
plt.plot(sk, exact_posterior, label="Exact posterior for sk")
plt.plot(sk, approx_posterior, label="MatchBox posterior for sk")
plt.plot(sk, norm.pdf(sk, mu_minQP, sd_minQP), label=r"Min KL(N($\mu,\sigma^2$)||Exact)")
plt.legend(loc="upper left", title="First iteration", fontsize=9)
plt.show()

[figure: exact posterior for s_k, Matchbox posterior for s_k, and the Gaussian minimizing KL(N(μ,σ²)||Exact)]

Please let us know if we've made any wrong assumptions along the way. Thank you for your help.


[1] T. Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd., 2005.


tminka commented Jan 11, 2024

Message 7 doesn't minimize that objective either. It seems that all of your confusion comes from a misunderstanding of the objective of Variational Message Passing. The objective is explained in the VMP paper that I referenced in my first reply. Questions about the mathematics of VMP should probably be asked on Cross-Validated rather than here.
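Concretely, for this product factor the VMP message to $s_k$ is the exponentiated expected log of the incoming Gaussian message on $z_k$, with the expectation taken under the current marginal of $t_k$ (a sketch of the standard derivation, not a quote from either paper):

$$\hat{m}_{\ast\rightarrow s_k}(s_k) \;\propto\; \exp\Big(\mathbb{E}_{q(t_k)}\big[\log \mathcal{N}(s_k t_k \mid \mu_{6_k},\ \sigma^2_{6_k})\big]\Big) \;\propto\; \mathcal{N}\!\left(s_k \,\middle|\, \frac{\mu_{6_k}\langle t_k\rangle}{\langle t_k^2\rangle},\ \frac{\sigma^2_{6_k}}{\langle t_k^2\rangle}\right),$$

which reproduces the formula in the Matchbox paper and the AverageLog computation above. Combined with the other incoming messages to form the new marginal of $s_k$, this is the coordinate step that minimizes the global objective $KL(q||p)$ over the full posterior with the other marginals held fixed; it is not the minimizer of a KL divergence between an individual approximate message (or marginal) and its exact counterpart.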
