Bayesian Networks¶
Bayesian Networks Syntax¶
- Nodes: variables (with domains)
- Arcs: interactions
- Indicate "direct influence" between variables
- For now: imagine that arrows mean direct causation (in general, they may not!)
- Formally: encode conditional independence
- No cycles are allowed!
- A directed acyclic graph
- Conditional distributions for each node given its "parent variables" in the graph
A Bayes net = Topology (graph) + Local conditional probabilities
Bayesian Networks Semantics¶
- Bayes nets encode the joint distribution as a product of local conditional distributions, one per variable:
\[ P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i \mid \text{parents}(X_i)) \]
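For example, a hypothetical three-node chain \(A \to B \to C\) encodes
\[ P(a, b, c) = P(a)\, P(b \mid a)\, P(c \mid b) \]
so three small local CPTs replace a full joint table over \(A, B, C\).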
Conditional Independence Semantics¶
- Every variable is conditionally independent of its non-descendants given its parents
- Conditional independence semantics \(\Leftrightarrow\) global semantics
Markov Blanket¶
- A variable’s Markov blanket consists of parents, children, children’s other parents
- Every variable is conditionally independent of all other variables given its Markov blanket
D-Separation¶
- A triple is active in the following three cases:
- Causal chain \(A \to B \to C\) where B is unobserved (either direction)
- Common cause \(A \leftarrow B \to C\) where B is unobserved
- Common effect (aka v-structure) \(A \to B \leftarrow C\) where \(B\) or one of its descendants is observed
- A path is active if each triple along the path is active
- A path is blocked if it contains at least one inactive triple
- If all paths from \(X\) to \(Y\) are blocked, then \(X\) is said to be "d-separated" from \(Y\) by \(Z\)
- If d-separated, then \(X\) and \(Y\) are conditionally independent given \(Z\)
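As a quick worked example, consider a hypothetical network \(A \to B \to C \leftarrow D\). With evidence \(Z = \{B\}\), the only path from \(A\) to \(C\) contains the now-inactive chain triple \(A \to B \to C\), so \(A\) is d-separated from \(C\) and \(A \perp C \mid B\). If instead \(C\) is observed, the v-structure \(B \to C \leftarrow D\) becomes active, so \(B\) and \(D\) are no longer d-separated.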
Markov Networks¶
Markov network = undirected graph + potential functions
- For each clique (or max clique), a potential function is defined
- A potential function is not locally normalized, i.e., it doesn't encode probabilities
- The joint probability is proportional to the product of potentials:
\[ P(\mathbf{x}) = \frac{1}{Z} \prod_C \phi_C(\mathbf{x}_C) \]
where \(\phi_C(\mathbf{x}_C)\) is the potential over clique \(C\) and
\[ Z = \sum_{\mathbf{x}} \prod_C \phi_C(\mathbf{x}_C) \]
is the normalization constant (aka. partition function)
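A minimal sketch of these definitions, assuming a hypothetical chain-structured MN \(A - B - C\) with binary variables, made-up potential values, and cliques \(\{A, B\}\) and \(\{B, C\}\):

```python
# Hypothetical Markov network A - B - C with cliques {A, B} and {B, C}.
from itertools import product

# Clique potentials (made-up values); note they are not locally normalized.
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    # The joint is proportional to the product of clique potentials.
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Partition function Z: sum of the potential product over all assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

print(unnormalized(1, 1, 1) / Z)   # P(A=1, B=1, C=1)
```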
Bayesian Network \(\to\) Markov Network¶
- Steps:
- Moralization: connect ("marry") the parents of each node and drop edge directions (sketched below)
- Construct potential functions from CPTs
- The BN and MN encode the same distribution
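A minimal sketch of the moralization step, assuming a hypothetical BN given as a dict mapping each node to its parent list (the potential-construction step is omitted):

```python
# Moralization: marry co-parents and drop edge directions.
from itertools import combinations

parents = {                         # hypothetical structure: A -> C <- B, C -> D
    "A": [], "B": [], "C": ["A", "B"], "D": ["C"],
}

def moralize(parents):
    """Return the undirected (moral) graph as a set of edges."""
    edges = set()
    for child, pars in parents.items():
        # Drop directions: connect each node to its parents.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of co-parents.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges

print(sorted(tuple(sorted(e)) for e in moralize(parents)))  # includes the added edge (A, B)
```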
Generative vs. Discriminative Models¶
- Generative Models:
- A generative model represents a joint distribution \(P(X_1, X_2, \dots, X_n)\)
- Both BN and MN are generative models
- Discriminative Models:
- In some scenarios, we only care about predicting queries from evidence
- A discriminative model represents a conditional distribution
\[ P(Y_1, Y_2, \dots, Y_n \mid X) \]
- It does not model \(P(X)\)
Conditional Random Fields (CRFs)¶
- An extension of MNs (aka. Markov Random Fields) in which everything is conditioned on an input \(x\):
\[ P(y \mid x) = \frac{1}{Z(x)} \prod_C \phi_C(y_C, x) \]
where \(\phi_C(y_C, x)\) is the potential over clique \(C\) and
\[ Z(x) = \sum_{y} \prod_C \phi_C(y_C, x) \]
is the normalization constant.
BN Exact Inference¶
Enumeration¶
- Step 1: Select the entries consistent with the evidence.
- Step 2: Sum out \(H\) to get joint of Query and Evidence.
- Step 3: Normalize the joint to get the posterior.
Problem: sums of exponentially many products!
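A minimal sketch of these three steps on a hypothetical chain Rain \(\to\) Traffic \(\to\) Late with made-up CPTs, answering \(P(\text{Rain} \mid \text{Late} = 1)\):

```python
# Inference by enumeration on a hypothetical chain Rain -> Traffic -> Late.
P_R = {1: 0.1, 0: 0.9}                                       # P(Rain)
P_T = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}   # P(Traffic=t | Rain=r), keyed by (t, r)
P_L = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8}   # P(Late=l | Traffic=t), keyed by (l, t)

def posterior_rain(l_obs=1):
    unnormalized = {}
    for r in (0, 1):
        # Step 1: keep only entries consistent with the evidence Late = l_obs.
        # Step 2: sum out the hidden variable Traffic.
        unnormalized[r] = sum(P_R[r] * P_T[(t, r)] * P_L[(l_obs, t)] for t in (0, 1))
    # Step 3: normalize the joint of query and evidence.
    Z = sum(unnormalized.values())
    return {r: p / Z for r, p in unnormalized.items()}

print(posterior_rain())   # posterior over Rain given Late = 1
```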
Variable Elimination¶
Factors¶
- A factor is a multi-dimensional array used to represent \(P(Y_1, \dots, Y_N \mid X_1, \dots, X_M)\)
- If a variable is assigned a value (written in lower case), its dimension is dropped (selected) from the array.
Operation 1: Join Factors¶
- Given multiple factors, build a new factor over the union of the variables involved.
- Each entry is computed by pointwise products.
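A minimal sketch of a join, assuming factors are stored as dicts keyed by assignment tuples (hypothetical factors \(P(R)\) and \(P(T \mid R)\) over binary variables):

```python
# Join two factors into a factor over the union of their variables.
f_R  = {(1,): 0.1, (0,): 0.9}                                   # keyed by (r,)
f_TR = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}     # keyed by (t, r)

# New factor over {R, T}; each entry is a pointwise product.
f_RT = {(r, t): f_R[(r,)] * f_TR[(t, r)]
        for r in (0, 1) for t in (0, 1)}
print(f_RT)   # e.g. f_RT[(1, 1)] = 0.1 * 0.8 ≈ 0.08
```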
Operation 2: Eliminate a Variable¶
- Take a factor and sum out (marginalize) a variable.
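A minimal sketch of summing out a variable, using the same dict-based representation and the joined factor over \((R, T)\) from the previous sketch:

```python
# Eliminate (marginalize) R from a factor over (R, T).
f_RT = {(1, 1): 0.08, (1, 0): 0.02, (0, 1): 0.09, (0, 0): 0.81}  # keyed by (r, t)

# For each value of T, add the entries over all values of R.
f_T = {}
for (r, t), p in f_RT.items():
    f_T[(t,)] = f_T.get((t,), 0.0) + p
print(f_T)   # R summed out, leaving a factor over T only
```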
Variable Elimination Algorithm¶
- Query: \(P(Q \mid E_1 = e_1, \dots, E_k = e_k)\)
- Start with initial factors: Local CPTs (but instantiated by evidence)
- While there are still hidden variables (not \(Q\) or evidence):
- Pick a hidden variable \(H\).
- Join all factors mentioning \(H\).
- Eliminate (sum out) \(H\).
- Join all remaining factors and normalize.
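A minimal end-to-end sketch of this algorithm on a hypothetical chain \(R \to T \to L\) with made-up CPTs, answering \(P(R \mid L = 1)\) by eliminating the hidden variable \(T\):

```python
# Variable elimination on a hypothetical chain R -> T -> L.
from itertools import product

def join(f1, v1, f2, v2):
    """Join two factors (dicts keyed by assignment tuples) by pointwise product."""
    vs = v1 + [v for v in v2 if v not in v1]
    out = {}
    for a in product([0, 1], repeat=len(vs)):
        val = dict(zip(vs, a))
        out[a] = f1[tuple(val[x] for x in v1)] * f2[tuple(val[x] for x in v2)]
    return out, vs

def eliminate(f, vs, var):
    """Sum out `var`, returning a smaller factor over the remaining variables."""
    i = vs.index(var)
    out = {}
    for a, p in f.items():
        key = a[:i] + a[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, [v for v in vs if v != var]

# Local CPTs as factors; the evidence L = 1 is already instantiated in f_L1T.
f_R,  v_R  = {(1,): 0.1, (0,): 0.9}, ["R"]
f_TR, v_TR = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}, ["T", "R"]
f_L1T, v_L1T = {(1,): 0.7, (0,): 0.2}, ["T"]           # P(L=1 | T)

# Pick the hidden variable T: join all factors mentioning T, then sum it out.
f1, v1 = join(f_TR, v_TR, f_L1T, v_L1T)
f2, v2 = eliminate(f1, v1, "T")

# Join the remaining factors and normalize.
f3, v3 = join(f_R, v_R, f2, v2)
Z = sum(f3.values())
print({a: p / Z for a, p in f3.items()})               # P(R | L = 1), keyed by (r,)
```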
Variable Elimination on Polytrees¶
Message Passing on General Graphs¶
BN Approximate Inference¶
Sampling¶
- Sampling from a Discrete Distribution
- Step 1: Get sample \(u\) from uniform distribution over \([0, 1)\)
- Step 2: Convert this sample \(u\) into an outcome for the given distribution by associating each outcome \(x\) with a \(P(x)\)-sized sub-interval of \([0, 1)\).
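A minimal sketch of these two steps, with a made-up three-outcome distribution:

```python
# Sampling from a discrete distribution via the sub-interval trick.
import random

def sample_discrete(dist):
    """dist: mapping outcome -> probability (assumed to sum to 1)."""
    u = random.random()                 # Step 1: u ~ Uniform[0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p                 # Step 2: find the P(x)-sized
        if u < cumulative:              # sub-interval that contains u
            return outcome
    return outcome                      # guard against rounding error

print(sample_discrete({"red": 0.6, "green": 0.1, "blue": 0.3}))
```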
Prior Sampling¶
Steps¶
- Goal: generate \(N\) samples from the joint prior; each sample is drawn as follows
- For \(i = 1, 2, \dots, n\) (in topological order)
- Sample \(X_i\) from \(P(X_i | \text{parents}(X_i))\)
- Return \((x_1, x_2, \dots, x_n)\)
Consistency¶
- This process generates samples with probability: $$ S_{PS}(x_1, \dots, x_n) = \prod_{i=1}^n P(X_i = x_i \mid \text{parents}(X_i)) = P(x_1, \dots, x_n) $$
i.e. the BN's joint probability.
- Let the number of samples of an assignment be \(N_{PS}(x_1, \dots, x_n)\).
- So \(\hat{P}(x_1, \dots, x_n) = N_{PS}(x_1, \dots, x_n) / N\).
- Then $$ \lim_{N\to \infty} \hat{P}(x_1, \dots, x_n) = P(x_1, \dots, x_n) $$
- i.e. the sampling procedure is consistent.
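A minimal sketch of prior sampling, assuming the BN is given as a list of (variable, parents, CPT) triples in topological order, where each CPT maps a tuple of parent values to \(P(X_i = 1 \mid \text{parents}(X_i))\); the structure and numbers are hypothetical:

```python
# Prior sampling on a hypothetical chain R -> T -> L.
import random

bn = [
    ("R", [],    {(): 0.1}),                 # P(R=1)
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),    # P(T=1 | R)
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),    # P(L=1 | T)
]

def prior_sample(bn):
    sample = {}
    for var, parents, cpt in bn:             # topological order
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = 1 if random.random() < p_true else 0
    return sample

# Estimate the joint by counting samples.
N = 10000
counts = {}
for _ in range(N):
    s = tuple(prior_sample(bn).values())     # assignment (r, t, l)
    counts[s] = counts.get(s, 0) + 1
print({s: c / N for s, c in counts.items()})
```

As \(N\) grows, the printed relative frequencies approach the BN's joint probabilities, illustrating the consistency argument above.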
Rejection Sampling¶
Steps¶
- Input: evidence \(e_1, ..., e_k\)
- For \(i = 1, 2, \dots, n\)
- Sample \(x_i\) from \(P(X_i \mid \text{parents}(X_i))\)
- If \(x_i\) is not consistent with the evidence,
- Reject: Return, and no sample is generated in this cycle.
- Return \((x_1, x_2, \dots, x_n)\)
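A minimal sketch of rejection sampling on the same hypothetical chain, estimating \(P(R = 1 \mid L = 1)\) by discarding samples that contradict the evidence:

```python
# Rejection sampling on the hypothetical chain R -> T -> L with evidence L = 1.
import random

bn = [
    ("R", [],    {(): 0.1}),
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),
]
evidence = {"L": 1}

def rejection_sample(bn, evidence):
    sample = {}
    for var, parents, cpt in bn:
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = 1 if random.random() < p_true else 0
        if var in evidence and sample[var] != evidence[var]:
            return None                      # reject: no sample this cycle
    return sample

kept = [s for s in (rejection_sample(bn, evidence) for _ in range(20000)) if s]
# Guard against the (unlikely) case of zero accepted samples.
print(sum(s["R"] for s in kept) / max(len(kept), 1))   # estimate of P(R=1 | L=1)
```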
Likelihood Weighting¶
- Problem with rejection sampling:
- If evidence is unlikely, rejects lots of samples.
- Evidence not exploited as you sample.
- Idea: Fix evidence variables, sample the rest.
- Problem: Sample distribution not consistent!
- Solution: Weight each sample by probability of evidence variables given parents.
Steps¶
- Input: evidence \(e_1, ..., e_k\)
- \(w = 1.0\)
- For \(i = 1, 2, \dots, n\)
- If \(X_i\) is an evidence variable,
- \(x_i = \text{observed value}_i\) for \(X_i\)
- Set \(w = w \cdot P(X_i = x_i \mid \text{parents}(X_i))\)
- Else
- Sample \(x_i\) from \(P(X_i \mid \text{parents}(X_i))\)
- Return \((x_1, x_2, \dots, x_n)\) with weight \(w\).
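A minimal sketch of likelihood weighting on the same hypothetical chain: evidence variables are fixed and contribute to the weight, while the other variables are sampled as usual:

```python
# Likelihood weighting on the hypothetical chain R -> T -> L with evidence L = 1.
import random

bn = [
    ("R", [],    {(): 0.1}),
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),
]
evidence = {"L": 1}

def weighted_sample(bn, evidence):
    sample, w = {}, 1.0
    for var, parents, cpt in bn:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:                  # fix evidence variables ...
            sample[var] = evidence[var]
            w *= p_true if sample[var] == 1 else 1 - p_true   # ... and weight them
        else:                                # sample the rest as usual
            sample[var] = 1 if random.random() < p_true else 0
    return sample, w

num = den = 0.0
for _ in range(20000):
    s, w = weighted_sample(bn, evidence)
    num += w * s["R"]
    den += w
print(num / den)   # weighted estimate of P(R=1 | L=1)
```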
Gibbs Sampling¶
- Generate each sample by making a random change to the preceding sample
- Evidence variables remain fixed. For each non-evidence variable, sample its value conditioned on all the other variables
- \(X_i' \sim P(X_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)\)
- In a Bayes network,
\[ P(X_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(X_i \mid \text{Markov blanket}(X_i)) = \alpha\, P(X_i \mid u_1, \dots, u_m) \prod_{j} P(y_j \mid \text{parents}(Y_j)) \]
where \(u_1, \dots, u_m\) are the parents of \(X_i\) and the product ranges over the children \(Y_j\) of \(X_i\)
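A minimal sketch of Gibbs sampling on the same hypothetical chain \(R \to T \to L\) with evidence \(L = 1\). For clarity it computes each variable's conditional by renormalizing the full joint; the factors not involving that variable cancel, so only its Markov blanket actually matters:

```python
# Gibbs sampling on the hypothetical chain R -> T -> L with evidence L = 1.
import random

P_R = {(): 0.1}
P_T = {(1,): 0.8, (0,): 0.1}
P_L = {(1,): 0.7, (0,): 0.2}

def joint(state):
    def p(cpt, value, parent_vals):
        q = cpt[parent_vals]
        return q if value == 1 else 1 - q
    return (p(P_R, state["R"], ()) *
            p(P_T, state["T"], (state["R"],)) *
            p(P_L, state["L"], (state["T"],)))

def gibbs(evidence, steps=20000):
    state = {"R": 0, "T": 0, **evidence}     # arbitrary initial state, evidence fixed
    samples = []
    for _ in range(steps):
        for var in ("R", "T"):               # resample non-evidence variables only
            weights = []
            for value in (0, 1):
                state[var] = value
                weights.append(joint(state)) # unnormalized P(var = value | all others)
            p1 = weights[1] / (weights[0] + weights[1])
            state[var] = 1 if random.random() < p1 else 0
        samples.append(dict(state))          # (burn-in omitted for brevity)
    return samples

samples = gibbs({"L": 1})
print(sum(s["R"] for s in samples) / len(samples))   # estimate of P(R=1 | L=1)
```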