A Discrete Bayesian Graph #210

Open
erlebach opened this issue May 27, 2022 · 16 comments

erlebach commented May 27, 2022

I am implementing a discrete Bayesian network using a few Nonlinear nodes and cannot get the messagePassingAlgorithm() method to run without error. Every edge of the resulting graph is terminated either by a Clamp or by a placeholder. Here is the code, which is rather simple:

using ForneyLab
using Plots

gA = 3
gB = 5
gC = 8

# Three deterministic functions
function grade(i, d)
    if i == 0 && d == 0
        g1,g2,g3 = [0.3, 0.4, 0.3]
    elseif i == 0 && d == 1
        g1,g2,g3 = [0.05, 0.25, 0.7]
    elseif i == 1 && d == 0
        g1,g2,g3 = [0.9, 0.08, 0.02]
    else
        g1,g2,g3 = [0.5, 0.3, 0.2]
    end
    
    return [g1, g2, g3]
end

function letter(g)
    if g == gA
        l = .9
    elseif g == gB
        l = .6
    else
        l = 0.01
    end
    return l
end

function score(i)
    if i == 0
        s = 0.05
    else
        s = 0.8
    end
    return s
end

# The model
gr = FactorGraph()

# @RV b1 ~ Clamp(0.4)
# @RV b2 ~ Clamp(0.3)
@RV d ~ Bernoulli(0.4)
@RV i ~ Bernoulli(0.3)
@RV grade_ ~ Nonlinear{Sampling}(i, d, g=grade, n_samples=50)
@RV g ~ Categorical(grade_)
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50)
@RV l ~ Bernoulli(letter_)
@RV score_ ~ Nonlinear{Sampling}(i, g=score, n_samples=50)
@RV s ~ Bernoulli(score_)

# placeholder(d, :d)
placeholder(i, :i)
placeholder(s, :s)
placeholder(l, :l)
placeholder(g, :g)
# The line that produces the error:
algo = messagePassingAlgorithm([d])    # <<<<< ERROR
source_code = algorithmSourceCode(algo)
eval(Meta.parse(source_code));

The error produced is:

No applicable SumProductRule{Categorical} update for Categorical node with inbound types: Message{SampleList}, Nothing

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] inferUpdateRule!(entry::ScheduleEntry, rule_type::Type{SumProductRule{Categorical}}, inferred_outbound_types::Dict{Interface, Type})
   @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/sum_product.jl:75
 [3] inferUpdateRules!(schedule::Vector{ScheduleEntry}; inferred_outbound_types::Dict{Interface, Type})
   @ ForneyLab ~/src/2022/ForneyLab.jl/src/message_passing.jl:203
 [4] messagePassingSchedule(pf::PosteriorFactor)
   @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/posterior_factor.jl:75
 [5] messagePassingAlgorithm(target_variables::Vector{Variable}, pfz::PosteriorFactorization; ep_sites::Vector{Tuple}, id::Symbol, free_energy::Bool)
   @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/inference_algorithm.jl:82
 [6] messagePassingAlgorithm (repeats 2 times)
   @ ~/src/2022/ForneyLab.jl/src/algorithms/inference_algorithm.jl:58 [inlined]
 [7] top-level scope
   @ In[31]:1
 [8] eval
   @ ./boot.jl:373 [inlined]
 [9] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

Clearly, this means that there is no sum-product rule to handle a Categorical node with a SampleList message as input. So my question is: how is this case treated? Would a straightforward variational technique be appropriate? Thanks.

albertpod (Collaborator) commented May 29, 2022

Hi @erlebach!

I am not sure I understand your model. Can you provide the equations for your generative model, or explain what you are trying to model with this graph? It seems strange to me that you have a shared placeholder for the output of a Bernoulli distribution and the input of a Nonlinear node (e.g. placeholder(i, :i)).

First of all, there is indeed no closed-form solution (i.e. no sum-product message) in ForneyLab.jl for these incoming messages, i.e. Message{SampleList}, Nothing.

To circumvent the problem, you need to introduce a recognition distribution (the "variational technique" you mentioned) around the Categorical node. In the code below I've added one more line before the algorithm construction:

# The model
gr = FactorGraph()

# @RV b1 ~ Clamp(0.4)
# @RV b2 ~ Clamp(0.3)
@RV d ~ Bernoulli(0.4)
@RV i ~ Bernoulli(0.3)
@RV grade_ ~ Nonlinear{Sampling}(i, d, g=grade, n_samples=50)
@RV g ~ Categorical(grade_)
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50)
@RV l ~ Bernoulli(letter_)
@RV score_ ~ Nonlinear{Sampling}(i, g=score, n_samples=50)
@RV s ~ Bernoulli(score_)

# placeholder(d, :d)
placeholder(i, :i)
placeholder(s, :s)
placeholder(l, :l)
placeholder(g, :g)
# The line that produces the error:

q = PosteriorFactorization(g, grade_, ids=[:g, :grade_]) 
algo = messagePassingAlgorithm([d])    # <<<<< ERROR

However, this code won't work either, because one of your recognition factors, g, is clamped (connected via placeholder(g, :g)).

What you can do is remove placeholder(g, :g) so the algorithm can be built:

# The model
gr = FactorGraph()

# @RV b1 ~ Clamp(0.4)
# @RV b2 ~ Clamp(0.3)
@RV d ~ Bernoulli(0.4)
@RV i ~ Bernoulli(0.3)
@RV grade_ ~ Nonlinear{Sampling}(i, d, g=grade, n_samples=50)
@RV g ~ Categorical(grade_)
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50)
@RV l ~ Bernoulli(letter_)
@RV score_ ~ Nonlinear{Sampling}(i, g=score, n_samples=50)
@RV s ~ Bernoulli(score_)

# placeholder(d, :d)
placeholder(i, :i)
placeholder(s, :s)
placeholder(l, :l)
# placeholder(g, :g)
# The line that previously produced the error:

q = PosteriorFactorization(g, grade_, ids=[:g, :grade_]) 
algo = messagePassingAlgorithm([d])    # now builds without error
source_code = algorithmSourceCode(algo)
eval(Meta.parse(source_code));

THIS WORKS!
However, you might encounter other issues later. Hence, I would suggest we first discuss your model and then think about the inference.

erlebach (Author) commented May 29, 2022

Hi @albertpod:
Thank you for your investigation and suggestion, which I will try out. Let me provide some context. This problem was proposed as a class demonstration of probabilistic graphical models. The objective is to understand the relationships between intelligence (I), SAT score (S), problem difficulty (D), test grade (G), and whether the student receives a letter of recommendation (L). Here is the proposed model:

P(I,S,D,G,L) = P(I) P(D) P(G|D,I) P(L|G) P(S|I)

P(I) and P(D) are modeled as Bernoulli distributions. P(G|D,I) is modeled as a categorical distribution (we assume three different grades). The categorical probabilities depend on the inputs (D,I), so there are four different categorical triplets (whose elements each sum to 1). P(L|G) and P(S|I) are Bernoulli distributions whose probability parameter depends on their conditioning variables, so each has two probabilities. These last three conditional distributions are captured through the functions letter, score, and grade. The former two can be interpreted as mixtures of Bernoullis, and grade as a mixture of categorical distributions. (I believe the TransitionMixture factor could be of help, but I have not looked into it.)

I was initially hoping to be able to compute marginals on any unobserved variable without appealing to variational inference. I wished to experiment with choosing different subsets of observed variables and examine their effect. Furthermore, I want to be able to specify which variables are observed via a dictionary passed to some function, and have the appropriate inference executed. It should be noted that I am only providing a single observation at this stage, and the distributions of the nodes are fully specified, so an analytical solution is possible. By analytical, I mean the following: the marginal of I is a Bernoulli with parameter p=0.3. If I specify that s=1, for example, then I becomes a Bernoulli with a different, increased parameter. This parameter can be computed exactly. I can also perform a Monte Carlo simulation to compute it, which is what I did in PyMC. I thought that ForneyLab would be able to apply expectation propagation to do this, since it can be done manually (with pencil and paper). I do realize that if I place a prior on the probability of intelligence, then its posterior can be evaluated via variational inference. But that was not my initial intent.
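
A minimal sketch of this exact computation (hypothetical, not ForneyLab code), by brute-force enumeration of the joint, reusing the grade, letter, and score functions defined above:

# Exact P(I | S = 1) by enumerating the joint P(I, D, G, L, S).
# Assumes the CPTs encoded by the grade(), letter(), and score() functions above.
pI(i) = i == 1 ? 0.3 : 0.7                     # prior P(I): Bernoulli(0.3)
pD(d) = d == 1 ? 0.4 : 0.6                     # prior P(D): Bernoulli(0.4)
pS(s, i) = s == 1 ? score(i) : 1 - score(i)    # likelihood P(S | I)
pL(l, g) = l == 1 ? letter(g) : 1 - letter(g)  # likelihood P(L | G)

grades = (gA, gB, gC)   # the three grade outcomes

post = zeros(2)         # unnormalized posterior over I, given s = 1
for i in 0:1, d in 0:1, (k, g) in enumerate(grades), l in 0:1
    post[i + 1] += pI(i) * pD(d) * grade(i, d)[k] * pL(l, g) * pS(1, i)
end
post ./= sum(post)
println(post)   # ≈ [0.127, 0.873]: observing s = 1 raises P(I = 1) from 0.3 to ≈ 0.87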

Note that I might want to observe g. What does one do in that case? In reality, we can consider the current model to have a plate with N measurements, and I am considering the case of N=1, i.e., the simplest case. That should be possible, no?

I have implemented the model successfully in PyMC, if that is of interest to you. These implementations serve as exercises to develop expertise in a variety of frameworks in Julia and Python.

For reference, I have never heard the term recognition distribution before.

@erlebach (Author)

How about we move this to the discussions?

@albertpod (Collaborator)

Sorry, I noticed the new topic on Discussions before I read your answer, @erlebach.

As for the recognition distribution q: this term is also referred to as an instrumental distribution. It arises when we approximate our posterior distribution p with some reference probability distribution q (the recognition distribution). In this case, we try to minimize the KL divergence between q and p, since the log evidence can't be computed.
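
In symbols (standard variational inference, stated here for reference): the recognition distribution is chosen as

$$q^{\ast} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\left[\, q \,\|\, p(\cdot \mid \mathrm{data}) \,\right],$$

and since the log evidence cannot be computed, this is done indirectly by minimizing the variational free energy (the negative evidence lower bound).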

We had a short discussion with @ThijsvdLaar about your issue. It seems like you could be just fine with TransitionMixture for this problem. @ThijsvdLaar will get back to you regarding this.

I understand your wish to observe g, but then (to my understanding) your graph will change as well, i.e., you won't need P(G|D, I).

In the meantime, I will try to come up with the inference solution for your model using a Nonlinear node.

@erlebach (Author)

Thank you! Quick question on something I never understood: if a variable is observed, its probability distribution is quite irrelevant, is that not true? However, when estimating a likelihood, its probability distribution still impacts the posterior through the remaining non-observed variables. If true, P(G|D, I) is still needed. In any case, I would ideally like the flexibility of observing or not observing any subset of random variables without having to write different methods for each combination of observables. Couldn't an observation be implemented in a general manner through multiplication by a delta function? This is possible with MCMC, but perhaps not with message passing.

@albertpod (Collaborator)

I am not sure I understood your question. What do you mean by an irrelevant distribution? (Let's continue this part in the Discussion; it is always helpful to draw the graph and inspect how the messages flow given your graph structure.)

As for the solution with a Nonlinear node: it seems we have found a bug that doesn't allow us to proceed with the mix of Bernoulli and Categorical distributions. However, a Bernoulli can always be converted to a Categorical distribution, e.g. Bernoulli(0.3) <-> Categorical([0.3, 0.7]). Given that, I rewrote your model in the following way:

using ForneyLab
using ProgressMeter

gA = 3
gB = 5
gC = 8

# Three deterministic functions
function grade(i, d)
    if i == 0 && d == 0
        g1,g2,g3 = [0.3, 0.4, 0.3]
    elseif i == 0 && d == 1
        g1,g2,g3 = [0.05, 0.25, 0.7]
    elseif i == 1 && d == 0
        g1,g2,g3 = [0.9, 0.08, 0.02]
    else
        g1,g2,g3 = [0.5, 0.3, 0.2]
    end
    
    return [g1, g2, g3]
end

function letter(g)
    if g == gA
l = .9
    elseif g == gB
        l = .6
    else
        l = 0.01
    end
return [l, 1-l] # changed to comply with Categorical distribution
end

function score(i)
    if i[1] == 0
        s = 0.05
    else
        s = 0.8
    end
return [s, 1-s] # changed to comply with Categorical distribution
end

# The model
gr = FactorGraph()

@RV d ~ Categorical([0.4, 0.6]) # equivalent to Bernoulli(0.4)
@RV i ~ Categorical([0.3, 0.7]) # equivalent to Bernoulli(0.3)
@RV grade_ ~ Nonlinear{Sampling}(i, d, g=grade, n_samples=50, dims=[(3,),(2,),(2,)]) # we need to specify the dimensionality for each interface, i.e. grade_, i, d
@RV g ~ Categorical(grade_)
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50, dims=[(2,),(2,),(2,)])
@RV l ~ Categorical(letter_)
@RV score_ ~ Nonlinear{Sampling}(i, g=score, n_samples=50, dims=[(2,),(2,),(2,)])
@RV s ~ Categorical(score_)

placeholder(s, :s, dims=(2,))
placeholder(l, :l, dims=(2,))


q = PosteriorFactorization(g, grade_, ids=[:G, :Grade]) 
algo = messagePassingAlgorithm([d])
source_code = algorithmSourceCode(algo)
eval(Meta.parse(source_code))

# Prior statistics

vmp_its = 100
data = Dict(:s => [1.0, 0.0], :l=>[0.0, 1.0])


marginals = Dict{Any, Any}(:g => ProbabilityDistribution(Univariate, Categorical, p=ones(3)/3)) # uniform initial marginal over the three grades

messages = initGrade()
@showprogress for i in 1:vmp_its
    stepGrade!(data, marginals, messages)
    stepG!(data, marginals, messages)
end

println(ForneyLab.unsafeMean(marginals[:i]))
println(ForneyLab.unsafeMean(marginals[:d]))
println(ForneyLab.unsafeMean(marginals[:g]))

As a side note: at the moment, our lab focuses on ReactiveMP.jl. Only three people can tackle ForneyLab issues (two of them are writing theses, and one will be unavailable for a few days). As a consequence, our replies will be slow, but they will come sooner or later :) Thanks for trying FL in this setting; it helps to improve the framework.

@erlebach (Author)

Thanks! One question: you wrote:

@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50, dims=[(2,),(2,),(2,)])

Shouldn't this be written as

@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50, dims=[(2,),(2,)])

I am surprised this would have worked for you.

@albertpod (Collaborator)

@erlebach ah, right, I made a typo. Your variant is correct.

FL just ignored the last dimensionality specification. We should throw an error or a warning when a user does this.

@erlebach (Author)

Great. I ran the code, and it does not work. But I have a question. You wrote:

data = Dict(:s =>[1.0, 0.0], :l => [0.0, 1.0])

Why is that? When sampled, doesn't a Categorical return a single number (0 or 1, or 1 or 2)? Why are you specifying a vector of length 2? Or are you specifying two observed values? I just want to make sure, since in my original problem I only assigned one observed value.

When executing stepGRADE!, I get the error:

TypeError: in keyword argument m, expected Number, got a value of type Vector{Float64}.

Looking more closely, the error seemed to be in ProgressMeter, but I removed ProgressMeter and still get the same error.
More specifically:

TypeError: in keyword argument m, expected Number, got a value of type Vector{Float64}

Stacktrace:
 [1] stepGRADE!(data::Dict{Symbol, Vector{Float64}}, marginals::Dict{Any, Any}, messages::Vector{Message})
   @ Main ./none:17
 [2] top-level scope
   @ ./In[69]:3
 [3] eval
   @ ./boot.jl:373 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

@albertpod (Collaborator)

I have a feeling that you haven't specified the dimensionality of your placeholders as I did, i.e. placeholder(s, :s, dims=(2,)).
Are you sure you are using exactly the same code?

As for the data: the Categorical node in FL defines a one-dimensional probability distribution over the standard basis vectors of dimension d. The output is one-hot encoded, meaning that for outcome 1 we use (1, 0), for outcome 2 we use (0, 1), etc.
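
A tiny hypothetical helper (not part of FL's API) illustrating this encoding:

# One-hot encode outcome x ∈ {1, ..., k} as a length-k vector.
onehot(x::Int, k::Int) = [j == x ? 1.0 : 0.0 for j in 1:k]

onehot(1, 2)  # [1.0, 0.0] — first category
onehot(2, 2)  # [0.0, 1.0] — second category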

@erlebach (Author)

You are correct, @albertpod; I had not specified the dimensionality of the placeholders. I did not copy/paste your code, but instead modified mine. That is how I learn faster.

I did not realize that a Categorical was one-hot encoded. Makes sense (different from PyMC).

The code now runs. Whether it runs correctly is another matter. I am surprised that 100 iterations of a variational method run slower than PyMC running 5,000 iterations of a Monte Carlo sampler. Of course, the variational approach provides a parametrized distribution, and I have not been careful to make precise measurements. I guess the answer is that I am using the nonlinear sampler three times in my model. :-)

Why do you use unsafeMean rather than mean()? I cannot find any documentation on unsafeMean. What does it do?

albertpod (Collaborator) commented May 30, 2022

I understand, no worries.

Well, actually you also use sampling. ForneyLab knows how to combine different message-passing techniques; here you have used hybrid message passing, i.e. SP, VMP, and sampling-based messages (around the Nonlinear nodes). It's hard to compare iterations from one package to another, but I am also surprised by your observation.

I would also be suspicious about the correctness of the result. Unfortunately, for this model we cannot inspect the free energy (our minimization target), so we don't know when the algorithm converges: within 1, 2, 3, or 100 iterations. Maybe you can check how your marginals change over iterations.

We still need to check a few things on our side. Perhaps, we'll come up with a better solution.

As for unsafeMean, that's just a habit (old-school FL); you can safely use mean now.
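
For example, the last lines of the script above could then read (assuming mean is in scope, e.g. via using Statistics):

println(mean(marginals[:i]))
println(mean(marginals[:d]))
println(mean(marginals[:g]))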

erlebach (Author) commented May 30, 2022

All good points. Actually, I am not convinced of convergence either. Marginals look fine. To check my marginals over iterations, I would have to output data using a callback function. Is that possible?

I can also create a LaTeX document explaining more clearly what I am doing at this point. You'll note that I have not placed a prior on the various probabilities that I am specifying when stating my model, so I am not doing inference in the usual sense of the word. I have a joint distribution that is fully known and am simply fixing some parameter values.

By the way, fixing the data as :s => [1., 0.] is as expected, since the output of a Bernoulli is [1,0] or [0,1]. What happens if I set my data to [0,0] or [0.5, 0.5]? I would not expect [0,0] to work, since the elements do not sum to 1. [0.5, 0.5] probably won't work either, since my samples can only be [0,1] or [1,0]. At least that is my expectation when using Bernoullis.

If you have any experience with PyMC, I can send you my code (which works) so you can get a better idea of what I am doing. Or I can email you a PDF that explains things in more detail.

albertpod (Collaborator) commented Jun 1, 2022

Hi @erlebach!

Here's pseudocode for tracking your marginals:

marginals = Dict{Any, Any}(:g => ProbabilityDistribution(Univariate, Categorical, p=ones(3)/3))
marginal_storage = Dict{Any, Any}(:g => [], :i => [], :d => []) # or anything else you are interested in
messages = initGrade()
@showprogress for i in 1:vmp_its
    marginals = stepGrade!(data, marginals, messages)
    push!(marginal_storage[:i], marginals[:i])
    push!(marginal_storage[:d], marginals[:d])
    marginals = stepG!(data, marginals, messages)
    push!(marginal_storage[:g], marginals[:g])
end
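
As a hypothetical follow-up (not from the thread), one way to check convergence from the stored marginals is to look at how much the posterior means move between iterations:

# Largest change in the posterior mean of :i between successive iterations.
means  = [ForneyLab.unsafeMean(m) for m in marginal_storage[:i]]
deltas = [maximum(abs.(means[t] .- means[t-1])) for t in 2:length(means)]
println(findfirst(d -> d < 1e-6, deltas))  # first iteration where the mean barely moves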

Having a rigorous explanation of the model will be helpful, but I am afraid we won't have much time to look into it at the moment.

As for the data: in FL, the Categorical node does not care how you encode the observation. For example, you can say :s => [100.0, 20.0]. Given this observation, the node will compute a backward message. In the case of a Categorical likelihood, the backward message turns out to be a Dirichlet, which is parametrized by concentration parameters (one per category). The only constraint on the concentration parameters is that they are greater than 0. Given :s => [100.0, 20.0], the message carrying the Dirichlet distribution will have concentration parameters 101 and 21. If you don't have a good intuition about the concentration parameters of a Dirichlet, I recommend checking the wiki. The question, however, is whether the observation [100.0, 20.0] encodes your actual observations. Maybe. But I would rather go with one-hot encoding: when mapping Bernoulli output to Categorical, you can say that 0.0 <-> [1.0, 0.0] and 1.0 <-> [0.0, 1.0].
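
A small numerical sketch of that claim (assuming, as described above, that the backward message is a Dirichlet whose concentration parameters are the observation counts plus one):

o = [100.0, 20.0]
α = o .+ 1              # concentration parameters: [101.0, 21.0]
println(α ./ sum(α))    # implied mean over the two categories: ≈ [0.828, 0.172]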

I recommend inspecting the generated code:
println(source_code)
Drawing FG also helps:
draw()

P.S. Changing Bernoulli distributions to Categorical is a temporary solution. Your issue indicated that there is a bug in FL. We will start working on it soon.

erlebach (Author) commented Jun 2, 2022

Thank you for this explanation. I will try all this out!

@albertpod (Collaborator)

@erlebach hi!
Not sure if this is still relevant, but ReactiveMP.jl has released message-passing with a boolean function.

You may want to look at this demo.
