Bayesian Networks¶
Bayesian Networks Syntax¶
- Nodes: variables (with domains)
- Arcs: interactions
- Indicate "direct influence" between variables
- For now: imagine that arrows mean direct causation (in general, they may not!)
- Formally: encode conditional independence
- No cycles are allowed!
- A directed acyclic graph
- Conditional distributions for each node given its "parent variables" in the graph
A Bayes net = Topology (graph) + Local conditional probabilities
Bayesian Networks Semantics¶
- Bayes nets encode the joint distribution as a product of local conditional distributions, one per variable:
\[ P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i \mid \text{parents}(X_i)) \]
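For example, a hypothetical three-node chain \(A \to B \to C\) encodes
\[ P(a, b, c) = P(a)\, P(b \mid a)\, P(c \mid b) \]
so three small local CPTs replace a full joint table over \(A, B, C\).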
Conditional Independence Semantics¶
- Every variable is conditionally independent of its non-descendants given its parents
- Conditional independence semantics \(\Leftrightarrow\) global semantics
Markov Blanket¶
- A variable’s Markov blanket consists of parents, children, children’s other parents
- Every variable is conditionally independent of all other variables given its Markov blanket
D-Separation¶
- A triple is active in the following three cases:
- Causal chain \(A \to B \to C\) where B is unobserved (either direction)
- Common cause \(A \leftarrow B \to C\) where B is unobserved
- Common effect (aka v-structure) \(A \to B \leftarrow C\) where \(B\) or one of its descendants is observed
- A path is active if each triple along the path is active
- A path is blocked if it contains at least one inactive triple
- If all paths from \(X\) to \(Y\) are blocked, then \(X\) is said to be "d-separated" from \(Y\) by \(Z\)
- If d-separated, then \(X\) and \(Y\) are conditionally independent given \(Z\)
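As a quick worked example, consider a hypothetical network \(A \to B \to C \leftarrow D\). With evidence \(Z = \{B\}\), the only path from \(A\) to \(C\) contains the now-inactive chain triple \(A \to B \to C\), so \(A\) is d-separated from \(C\) and \(A \perp C \mid B\). If instead \(C\) is observed, the v-structure \(B \to C \leftarrow D\) becomes active, so \(B\) and \(D\) are no longer d-separated.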
Markov Networks¶
Markov network = undirected graph + potential functions
- For each clique (or max clique), a potential function is defined
- A potential function is not locally normalized, i.e., it doesn't encode probabilities
- The joint probability is proportional to the product of potentials:
\[ P(\mathbf{x}) = \frac{1}{Z} \prod_C \phi_C(\mathbf{x}_C) \]
where \(\phi_C(\mathbf{x}_C)\) is the potential over clique \(C\) and
\[ Z = \sum_{\mathbf{x}} \prod_C \phi_C(\mathbf{x}_C) \]
is the normalization constant (aka. partition function)
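A minimal sketch of these definitions, assuming a hypothetical chain-structured MN \(A - B - C\) with binary variables, made-up potential values, and cliques \(\{A, B\}\) and \(\{B, C\}\):

```python
# Hypothetical Markov network A - B - C with cliques {A, B} and {B, C}.
from itertools import product

# Clique potentials (made-up values); note they are not locally normalized.
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    # The joint is proportional to the product of clique potentials.
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Partition function Z: sum of the potential product over all assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

print(unnormalized(1, 1, 1) / Z)   # P(A=1, B=1, C=1)
```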
Bayesian Network \(\to\) Markov Network¶
- Steps:
- Moralization: connect ("marry") the parents of each node and drop edge directions (sketched below)
- Construct potential functions from CPTs
- The BN and MN encode the same distribution
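A minimal sketch of the moralization step, assuming a hypothetical BN given as a dict mapping each node to its parent list (the potential-construction step is omitted):

```python
# Moralization: marry co-parents and drop edge directions.
from itertools import combinations

parents = {                         # hypothetical structure: A -> C <- B, C -> D
    "A": [], "B": [], "C": ["A", "B"], "D": ["C"],
}

def moralize(parents):
    """Return the undirected (moral) graph as a set of edges."""
    edges = set()
    for child, pars in parents.items():
        # Drop directions: connect each node to its parents.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of co-parents.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges

print(sorted(tuple(sorted(e)) for e in moralize(parents)))  # includes the added edge (A, B)
```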
Generative vs. Discriminative Models¶
- Generative Models:
- A generative model represents a joint distribution \(P(X_1, X_2, \dots, X_n)\)
- Both BN and MN are generative models
- Discriminative Models:
- In some scenarios, we only care about predicting queries from evidence
- A discriminative model represents a conditional distribution
\[ P(Y_1, Y_2, \dots, Y_n \mid X) \]
- It does not model \(P(X)\)
Conditional Random Fields (CRFs)¶
- An extension of MNs (aka. Markov Random Fields) in which everything is conditioned on an input \(x\):
\[ P(y \mid x) = \frac{1}{Z(x)} \prod_C \phi_C(y_C, x) \]
where \(\phi_C(y_C, x)\) is the potential over clique \(C\) and
\[ Z(x) = \sum_{y} \prod_C \phi_C(y_C, x) \]
is the normalization constant.
BN Exact Inference¶
Enumeration¶
- Step 1: Select the entries consistent with the evidence.
- Step 2: Sum out \(H\) to get joint of Query and Evidence.
- Step 3: Normalize the joint to get the posterior.
Problem: sums of exponentially many products!
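A minimal sketch of these three steps on a hypothetical chain Rain \(\to\) Traffic \(\to\) Late with made-up CPTs, answering \(P(\text{Rain} \mid \text{Late} = 1)\):

```python
# Inference by enumeration on a hypothetical chain Rain -> Traffic -> Late.
P_R = {1: 0.1, 0: 0.9}                                       # P(Rain)
P_T = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}   # P(Traffic=t | Rain=r), keyed by (t, r)
P_L = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8}   # P(Late=l | Traffic=t), keyed by (l, t)

def posterior_rain(l_obs=1):
    unnormalized = {}
    for r in (0, 1):
        # Step 1: keep only entries consistent with the evidence Late = l_obs.
        # Step 2: sum out the hidden variable Traffic.
        unnormalized[r] = sum(P_R[r] * P_T[(t, r)] * P_L[(l_obs, t)] for t in (0, 1))
    # Step 3: normalize the joint of query and evidence.
    Z = sum(unnormalized.values())
    return {r: p / Z for r, p in unnormalized.items()}

print(posterior_rain())   # posterior over Rain given Late = 1
```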
Variable Elimination¶
Factors¶
- A factor is a multi-dimensional array used to represent \(P(Y_1, \dots, Y_N \mid X_1, \dots, X_M)\)
- If a variable is assigned a value (written in lower case), its dimension is dropped (selected) from the array.
Operation 1: Join Factors¶
- Given multiple factors, build a new factor over the union of the variables involved.
- Each entry is computed by pointwise products.
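A minimal sketch of a join, assuming factors are stored as dicts keyed by assignment tuples (hypothetical factors \(P(R)\) and \(P(T \mid R)\) over binary variables):

```python
# Join two factors into a factor over the union of their variables.
f_R  = {(1,): 0.1, (0,): 0.9}                                   # keyed by (r,)
f_TR = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}     # keyed by (t, r)

# New factor over {R, T}; each entry is a pointwise product.
f_RT = {(r, t): f_R[(r,)] * f_TR[(t, r)]
        for r in (0, 1) for t in (0, 1)}
print(f_RT)   # e.g. f_RT[(1, 1)] = 0.1 * 0.8 ≈ 0.08
```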
Operation 2: Eliminate a Variable¶
- Take a factor and sum out (marginalize) a variable.
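A minimal sketch of summing out a variable, using the same dict-based representation and the joined factor over \((R, T)\) from the previous sketch:

```python
# Eliminate (marginalize) R from a factor over (R, T).
f_RT = {(1, 1): 0.08, (1, 0): 0.02, (0, 1): 0.09, (0, 0): 0.81}  # keyed by (r, t)

# For each value of T, add the entries over all values of R.
f_T = {}
for (r, t), p in f_RT.items():
    f_T[(t,)] = f_T.get((t,), 0.0) + p
print(f_T)   # R summed out, leaving a factor over T only
```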
Variable Elimination Algorithm¶
- Query: \(P(Q \mid E_1 = e_1, \dots, E_k = e_k)\)
- Start with initial factors: Local CPTs (but instantiated by evidence)
- While there are still hidden variables (not \(Q\) or evidence):
- Pick a hidden variable \(H\).
- Join all factors mentioning \(H\).
- Eliminate (sum out) \(H\).
- Join all remaining factors and normalize.
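A minimal end-to-end sketch of this algorithm on a hypothetical chain \(R \to T \to L\) with made-up CPTs, answering \(P(R \mid L = 1)\) by eliminating the hidden variable \(T\):

```python
# Variable elimination on a hypothetical chain R -> T -> L.
from itertools import product

def join(f1, v1, f2, v2):
    """Join two factors (dicts keyed by assignment tuples) by pointwise product."""
    vs = v1 + [v for v in v2 if v not in v1]
    out = {}
    for a in product([0, 1], repeat=len(vs)):
        val = dict(zip(vs, a))
        out[a] = f1[tuple(val[x] for x in v1)] * f2[tuple(val[x] for x in v2)]
    return out, vs

def eliminate(f, vs, var):
    """Sum out `var`, returning a smaller factor over the remaining variables."""
    i = vs.index(var)
    out = {}
    for a, p in f.items():
        key = a[:i] + a[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, [v for v in vs if v != var]

# Local CPTs as factors; the evidence L = 1 is already instantiated in f_L1T.
f_R,  v_R  = {(1,): 0.1, (0,): 0.9}, ["R"]
f_TR, v_TR = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}, ["T", "R"]
f_L1T, v_L1T = {(1,): 0.7, (0,): 0.2}, ["T"]           # P(L=1 | T)

# Pick the hidden variable T: join all factors mentioning T, then sum it out.
f1, v1 = join(f_TR, v_TR, f_L1T, v_L1T)
f2, v2 = eliminate(f1, v1, "T")

# Join the remaining factors and normalize.
f3, v3 = join(f_R, v_R, f2, v2)
Z = sum(f3.values())
print({a: p / Z for a, p in f3.items()})               # P(R | L = 1), keyed by (r,)
```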
Variable Elimination on Polytrees¶
Message Passing on General Graphs¶
BN Approximate Inference¶
Sampling¶
- Sampling from a Discrete Distribution
- Step 1: Get sample \(u\) from uniform distribution over \([0, 1)\)
- Step 2: Convert this sample \(u\) into an outcome for the given distribution by associating each outcome \(x\) with a \(P(x)\)-sized sub-interval of \([0, 1)\).
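A minimal sketch of these two steps, with a made-up three-outcome distribution:

```python
# Sampling from a discrete distribution via the sub-interval trick.
import random

def sample_discrete(dist):
    """dist: mapping outcome -> probability (assumed to sum to 1)."""
    u = random.random()                 # Step 1: u ~ Uniform[0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p                 # Step 2: find the P(x)-sized
        if u < cumulative:              # sub-interval that contains u
            return outcome
    return outcome                      # guard against rounding error

print(sample_discrete({"red": 0.6, "green": 0.1, "blue": 0.3}))
```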
Prior Sampling¶
Steps¶
- Goal: generate \(N\) samples from the joint prior; each sample is drawn as follows
- For \(i = 1, 2, \dots, n\) (in topological order)
- Sample \(X_i\) from \(P(X_i | \text{parents}(X_i))\)
- Return \((x_1, x_2, \dots, x_n)\)
Consistency¶
- This process generates samples with probability: $$ S_{PS}(x_1, \dots, x_n) = \prod_{i=1}^n P(X_i = x_i \mid \text{parents}(X_i)) = P(x_1, \dots, x_n) $$
i.e. the BN's joint probability.
- Let the number of samples of an assignment be \(N_{PS}(x_1, \dots, x_n)\).
- So \(\hat{P}(x_1, \dots, x_n) = N_{PS}(x_1, \dots, x_n) / N\).
- Then $$ \lim_{N\to \infty} \hat{P}(x_1, \dots, x_n) = P(x_1, \dots, x_n) $$
- i.e. the sampling procedure is consistent.
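A minimal sketch of prior sampling, assuming the BN is given as a list of (variable, parents, CPT) triples in topological order, where each CPT maps a tuple of parent values to \(P(X_i = 1 \mid \text{parents}(X_i))\); the structure and numbers are hypothetical:

```python
# Prior sampling on a hypothetical chain R -> T -> L.
import random

bn = [
    ("R", [],    {(): 0.1}),                 # P(R=1)
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),    # P(T=1 | R)
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),    # P(L=1 | T)
]

def prior_sample(bn):
    sample = {}
    for var, parents, cpt in bn:             # topological order
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = 1 if random.random() < p_true else 0
    return sample

# Estimate the joint by counting samples.
N = 10000
counts = {}
for _ in range(N):
    s = tuple(prior_sample(bn).values())     # assignment (r, t, l)
    counts[s] = counts.get(s, 0) + 1
print({s: c / N for s, c in counts.items()})
```

As \(N\) grows, the printed relative frequencies approach the BN's joint probabilities, illustrating the consistency argument above.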
Rejection Sampling¶
Steps¶
- Input: evidence \(e_1, ..., e_k\)
- For \(i = 1, 2, \dots, n\)
- Sample \(x_i\) from \(P(X_i \mid \text{parents}(X_i))\)
- If \(x_i\) is not consistent with the evidence,
- Reject: Return, and no sample is generated in this cycle.
- Return \((x_1, x_2, \dots, x_n)\)
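A minimal sketch of rejection sampling on the same hypothetical chain, estimating \(P(R = 1 \mid L = 1)\) by discarding samples that contradict the evidence:

```python
# Rejection sampling on the hypothetical chain R -> T -> L with evidence L = 1.
import random

bn = [
    ("R", [],    {(): 0.1}),
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),
]
evidence = {"L": 1}

def rejection_sample(bn, evidence):
    sample = {}
    for var, parents, cpt in bn:
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = 1 if random.random() < p_true else 0
        if var in evidence and sample[var] != evidence[var]:
            return None                      # reject: no sample this cycle
    return sample

kept = [s for s in (rejection_sample(bn, evidence) for _ in range(20000)) if s]
# Guard against the (unlikely) case of zero accepted samples.
print(sum(s["R"] for s in kept) / max(len(kept), 1))   # estimate of P(R=1 | L=1)
```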
Likelihood Weighting¶
- Problem with rejection sampling:
- If evidence is unlikely, rejects lots of samples.
- Evidence not exploited as you sample.
- Idea: Fix evidence variables, sample the rest.
- Problem: Sample distribution not consistent!
- Solution: Weight each sample by probability of evidence variables given parents.
Steps¶
- Input: evidence \(e_1, ..., e_k\)
- \(w = 1.0\)
- For \(i = 1, 2, \dots, n\)
- If \(X_i\) is an evidence variable,
- \(x_i = \text{observed value}_i\) for \(X_i\)
- Set \(w = w \cdot P(X_i = x_i \mid \text{parents}(X_i))\)
- Else
- Sample \(x_i\) from \(P(X_i \mid \text{parents}(X_i))\)
- Return \((x_1, x_2, \dots, x_n)\) with weight \(w\).
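A minimal sketch of likelihood weighting on the same hypothetical chain: evidence variables are fixed and contribute to the weight, while the other variables are sampled as usual:

```python
# Likelihood weighting on the hypothetical chain R -> T -> L with evidence L = 1.
import random

bn = [
    ("R", [],    {(): 0.1}),
    ("T", ["R"], {(1,): 0.8, (0,): 0.1}),
    ("L", ["T"], {(1,): 0.7, (0,): 0.2}),
]
evidence = {"L": 1}

def weighted_sample(bn, evidence):
    sample, w = {}, 1.0
    for var, parents, cpt in bn:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:                  # fix evidence variables ...
            sample[var] = evidence[var]
            w *= p_true if sample[var] == 1 else 1 - p_true   # ... and weight them
        else:                                # sample the rest as usual
            sample[var] = 1 if random.random() < p_true else 0
    return sample, w

num = den = 0.0
for _ in range(20000):
    s, w = weighted_sample(bn, evidence)
    num += w * s["R"]
    den += w
print(num / den)   # weighted estimate of P(R=1 | L=1)
```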
Gibbs Sampling¶
- Generate each sample by making a random change to the preceding sample
- Evidence variables remain fixed. For each non-evidence variable, sample its value conditioned on all the other variables
- \(X_i' \sim P(X_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)\)
- In a Bayes network,
\[ P(X_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(X_i \mid \text{Markov blanket}(X_i)) = \alpha\, P(X_i \mid u_1, \dots, u_m) \prod_{j} P(y_j \mid \text{parents}(Y_j)) \]
where \(u_1, \dots, u_m\) are the parents of \(X_i\) and the product ranges over the children \(Y_j\) of \(X_i\)
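A minimal sketch of Gibbs sampling on the same hypothetical chain \(R \to T \to L\) with evidence \(L = 1\). For clarity it computes each variable's conditional by renormalizing the full joint; the factors not involving that variable cancel, so only its Markov blanket actually matters:

```python
# Gibbs sampling on the hypothetical chain R -> T -> L with evidence L = 1.
import random

P_R = {(): 0.1}
P_T = {(1,): 0.8, (0,): 0.1}
P_L = {(1,): 0.7, (0,): 0.2}

def joint(state):
    def p(cpt, value, parent_vals):
        q = cpt[parent_vals]
        return q if value == 1 else 1 - q
    return (p(P_R, state["R"], ()) *
            p(P_T, state["T"], (state["R"],)) *
            p(P_L, state["L"], (state["T"],)))

def gibbs(evidence, steps=20000):
    state = {"R": 0, "T": 0, **evidence}     # arbitrary initial state, evidence fixed
    samples = []
    for _ in range(steps):
        for var in ("R", "T"):               # resample non-evidence variables only
            weights = []
            for value in (0, 1):
                state[var] = value
                weights.append(joint(state)) # unnormalized P(var = value | all others)
            p1 = weights[1] / (weights[0] + weights[1])
            state[var] = 1 if random.random() < p1 else 0
        samples.append(dict(state))          # (burn-in omitted for brevity)
    return samples

samples = gibbs({"L": 1})
print(sum(s["R"] for s in samples) / len(samples))   # estimate of P(R=1 | L=1)
```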