Structure Learning in Bayesian Networks Eran Segal Weizmann Institute


Structure Learning: Motivation
- Network structure is often unknown
- Purposes of structure learning:
  - Discover the dependency structure of the domain: goes beyond statistical correlations between individual variables and detects direct vs. indirect dependencies
  - Set expectations: at best, we can recover the structure up to its I-equivalence class
  - Density estimation: estimate a statistical model of the underlying distribution and use it to reason about and predict new instances
Advantages of Accurate Structure
Structure Learning Approaches
- Constraint-based methods: view the Bayesian network as representing dependencies; find a network that best explains the observed dependencies. Limitation: sensitive to errors in individual independence tests.
- Score-based approaches: view learning as a model-selection problem; define a scoring function specifying how well a model fits the data, then search for a high-scoring network structure. Limitation: super-exponential search space.
- Bayesian model averaging methods: average predictions across all possible structures; can be done exactly (in some cases) or approximately.
Constraint-Based Approaches
- G is an I-map for P if I(G) ⊆ I(P)
- G is a minimal I-map if deleting any edge from G renders it no longer an I-map
- G is a P-map for P if I(G) = I(P)
- Strategy: query the distribution for independence relationships that hold between sets of variables, then construct a network that is the best minimal I-map for P
Constructing Minimal I-Maps
- Reverse factorization theorem
- Algorithm for constructing a minimal I-map:
  - Fix an ordering of the nodes X1, …, Xn
  - Select the parents of Xi as a minimal subset of {X1, …, Xi-1} such that Ind(Xi ; {X1, …, Xi-1} − Pa(Xi) | Pa(Xi))
- (Outline of) proof that the result is a minimal I-map:
  - I-map: the factorization holds by construction
  - Minimal: by construction, removing any edge destroys the factorization
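The construction above can be sketched as follows. The independence oracle `ind(x, rest, parents)`, answering whether Ind(x ; rest | parents) holds (vacuously true when `rest` is empty), is a hypothetical interface, and the brute-force search over predecessor subsets is exponential; it is only meant to illustrate the definition of a minimal parent set.

```python
from itertools import combinations

def minimal_i_map(variables, ind):
    """Construct a minimal I-map for a fixed ordering.

    `variables` is an ordered list; `ind(x, rest, parents)` is an
    independence oracle returning True iff Ind(x ; rest | parents).
    """
    parents = {}
    for i, x in enumerate(variables):
        preds = variables[:i]
        chosen = preds
        # Smallest predecessor subset U with Ind(x ; preds - U | U);
        # trying sizes in increasing order guarantees minimality.
        for size in range(len(preds) + 1):
            found = None
            for u in combinations(preds, size):
                rest = [v for v in preds if v not in u]
                if ind(x, rest, list(u)):
                    found = list(u)
                    break
            if found is not None:
                chosen = found
                break
        parents[x] = chosen
    return parents
```

For example, with an oracle encoding the chain A → B → C (so Ind(C ; A | B) holds), the procedure recovers exactly the chain's parent sets.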
Constructing P-Maps
- Simplifying assumptions:
  - The network has bounded in-degree d per node
  - An oracle can answer independence queries over up to 2d + 2 variables
  - The distribution P has a P-map
- Algorithm:
  - Step I: find the skeleton
  - Step II: identify immoralities (v-structures)
  - Step III: direct the constrained edges
Step I: Identifying the Skeleton
- For each pair X, Y, query all candidate sets Z for Ind(X ; Y | Z)
  - X–Y is in the skeleton if no such Z is found
  - If the graph's in-degree is bounded by d, the running time is O(n^(2d+2)), since if no direct edge exists, Ind(X ; Y | Pa(X), Pa(Y))
- Reminder: if there is no Z for which Ind(X ; Y | Z) holds, then X→Y or Y→X in G*
  - Proof: assume no such Z exists, yet G* has neither X→Y nor Y→X. Then we can find a set Z such that every path from X to Y is blocked, so G* implies Ind(X ; Y | Z), and since G* is a P-map, the independence holds in P. Contradiction.
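A minimal sketch of Step I, assuming a hypothetical independence oracle `ind(x, y, z)` for Ind(X ; Y | Z); with in-degree bounded by d, conditioning sets of size up to 2d suffice:

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I: X-Y is in the skeleton iff no witness set Z with
    Ind(X ; Y | Z) exists; `ind(x, y, z)` is the independence
    oracle and d bounds the in-degree, so |Z| <= 2d suffices."""
    skeleton = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = any(
            ind(x, y, set(z))
            for size in range(2 * d + 1)
            for z in combinations(others, size)
        )
        if not separated:
            skeleton.add(frozenset((x, y)))
    return skeleton
```

On the chain A → B → C, where the only non-trivial independence is Ind(A ; C | B), the recovered skeleton is {A–B, B–C}.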
Step II: Identifying Immoralities
- For each pair X, Y, query candidate triplets X, Y, Z
  - Direct as X→Z←Y if no W containing Z is found with Ind(X ; Y | W)
  - If the graph's in-degree is bounded by d, the running time is O(n^(2d+3))
  - If some W with Ind(X ; Y | W) exists and X–Z–Y is not an immorality, then Z ∈ W
- Reminder: if there is no W such that Z ∈ W and Ind(X ; Y | W), then X→Z←Y is an immorality
  - Proof: assume no such W exists but X–Z–Y is not an immorality. Then one of X→Z→Y, X←Z←Y, or X←Z→Y exists, so the path X–Z–Y can be blocked by conditioning on Z. Since X and Y are not adjacent, we can then find a W that includes Z such that Ind(X ; Y | W). Contradiction.
Answering Independence Queries
- Basic query: determine whether two variables are independent
- A well-studied question in statistics; it is common to frame the query as hypothesis testing
  - Null hypothesis H0: the data was sampled from P*(X,Y) = P*(X)·P*(Y)
  - Need a procedure that will accept or reject the hypothesis
  - χ² test to assess the deviance of the data from the hypothesis
  - Alternatively, use the mutual information between X and Y
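The two alternatives on this slide are closely related: the G-statistic 2·M·I(X;Y) (with mutual information in nats) is asymptotically χ²-distributed under H0. A small sketch, assuming binary variables so that the critical value 3.841 (df = 1, α = 0.05) applies:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(X;Y) in nats from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def looks_independent(pairs, critical=3.841):
    """Accept H0 (independence) if the G-statistic 2*M*I(X;Y) falls
    below the chi-square critical value (3.841: df=1, alpha=0.05)."""
    return 2 * len(pairs) * mutual_information(pairs) < critical
```

Perfectly correlated samples are rejected, while samples drawn uniformly over all four (x, y) combinations are accepted.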
Score-Based Approaches
- Strategy: define a scoring function for each candidate structure, then search for a high-scoring structure
- Likelihood-based scores
- Bayesian scores
Likelihood Scores
- Goal: find (G, θ) that maximize the likelihood
- Score_L(G : D) = log P(D | G, θ̂_G), where θ̂_G is the MLE of the parameters for G
- Find the G that maximizes Score_L(G : D)
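A minimal sketch of computing the likelihood score from data, exploiting its decomposition into per-family terms; the MLE of each parameter is the ratio of family counts M[x, pa] / M[pa], so the score is Σ M[x, pa]·log(M[x, pa] / M[pa]). Variable names and the dict-based data layout are illustrative assumptions:

```python
import math
from collections import Counter

def likelihood_score(data, parents):
    """Decomposed likelihood score: sum over families of
    M[x, pa] * log( M[x, pa] / M[pa] ), i.e. the log-likelihood
    at the MLE parameters. `data` is a list of dicts mapping
    variable -> value; `parents` maps variable -> parent list."""
    score = 0.0
    for x, pa in parents.items():
        joint = Counter((tuple(s[p] for p in pa), s[x]) for s in data)
        marg = Counter(tuple(s[p] for p in pa) for s in data)
        score += sum(
            c * math.log(c / marg[u]) for (u, _), c in joint.items()
        )
    return score
```

On data where Y deterministically copies X, the structure with the edge X → Y scores strictly higher than the empty structure, as the mutual-information view of the score predicts.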
Example
General Decomposition
The likelihood score decomposes as:
  Score_L(G : D) = M Σ_i I_P̂(Xi ; Pa(Xi)) − M Σ_i H_P̂(Xi)
where P̂ is the empirical distribution, I denotes mutual information, and H denotes entropy.
General Decomposition (cont.)
- The second (entropy) term does not depend on the network structure and is thus irrelevant when selecting between two structures
- The score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
- After some manipulation, one can also show to what extent the implied Markov assumptions are valid
Limitations of the Likelihood Score
Adding an edge can never decrease the likelihood score, so the fully connected network always scores at least as high as any other; the likelihood score therefore overfits the training data.
Avoiding Overfitting
- A classical problem in machine learning. Solutions:
  - Restricting the hypothesis space limits the learner's capacity to overfit; e.g., restrict the number of parents or the number of parameters
  - Minimum description length: description length measures complexity; prefer models that compactly describe the training data
  - Bayesian methods: average over all possible parameter values and use prior knowledge
Bayesian Score
Marginal Likelihood of Data Given G
Marginal Likelihood: Binomial Case
- Assume a sequence of M coin tosses
- Recall that for Dirichlet priors:
  P(x[m+1] = H | x[1], …, x[m]) = (α_H + M_H^m) / (α + m)
  where M_H^m is the number of heads in the first m examples
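Chaining these predictive probabilities over the whole sequence gives the closed-form marginal likelihood in terms of Gamma functions; a sketch using log-space arithmetic for numerical stability (the default Beta(1, 1) hyperparameters are an illustrative choice):

```python
from math import lgamma

def log_marginal_likelihood(m_h, m_t, a_h=1.0, a_t=1.0):
    """Log marginal likelihood of m_h heads and m_t tails under a
    Beta(a_h, a_t) prior (alpha = a_h + a_t, M = m_h + m_t):

      P(D) = Gamma(alpha)     / Gamma(alpha + M)
           * Gamma(a_h + m_h) / Gamma(a_h)
           * Gamma(a_t + m_t) / Gamma(a_t)
    """
    alpha = a_h + a_t
    m = m_h + m_t
    return (lgamma(alpha) - lgamma(alpha + m)
            + lgamma(a_h + m_h) - lgamma(a_h)
            + lgamma(a_t + m_t) - lgamma(a_t))
```

Sanity check: under the uniform Beta(1, 1) prior, any specific sequence of length M with m_h heads has probability 1 / ((M + 1)·C(M, m_h)), e.g. 1/6 for one head and one tail.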
Marginal Likelihood: Binomial Case
Marginal Likelihood Example Actual experiment with P(H) = 0.25
Marginal Likelihood: BayesNets Network structure determines form of marginal likelihood
Idealized Experiment
- P(X = H) = 0.5
- P(Y = H | X = H) = 0.5 + p
- P(Y = H | X = T) = 0.5 − p
Marginal Likelihood: BayesNets
The marginal likelihood has the form:
  P(D | G) = Π_i Π_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + M(pa_i))] · Π_{x_i} [Γ(α(x_i, pa_i) + M(x_i, pa_i)) / Γ(α(x_i, pa_i))]
where M(·) are the counts from the data and α(·) are the hyperparameters for each family.
Priors
- Structure prior P(G):
  - Uniform prior: P(G) ∝ constant
  - Prior penalizing the number of edges: P(G) ∝ c^|G|, with 0 < c < 1
  - The normalizing constant is the same across networks and can thus be ignored
Priors (cont.)
- Parameter prior P(θ | G): the BDe prior
  - M0: equivalent sample size
  - B0: network representing the prior probability of events
  - Set α(xi, pai^G) = M0 · P(xi, pai^G | B0)
  - Note: pai^G are not necessarily the same as the parents of Xi in B0
  - Compute P(xi, pai^G | B0) using standard inference in B0
- BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M' and P'
Bayesian Score: Asymptotic Behavior
- As M → ∞, a network G with Dirichlet priors satisfies
  log P(D | G) = ℓ(θ̂_G : D) − (log M / 2)·Dim[G] + O(1)
- This approximation is called the BIC score
- The score exhibits a tradeoff between fit to the data and model complexity: the mutual-information (likelihood) term grows linearly in M, while the complexity penalty grows only logarithmically in M, so as M grows, more emphasis is placed on fit to the data
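The BIC approximation is simple enough to state directly in code; this sketch takes the maximized log-likelihood and the number of independent parameters as inputs:

```python
import math

def bic_score(loglik, num_params, num_samples):
    """BIC approximation to the Bayesian score:
    score = l(theta_hat : D) - (log M / 2) * Dim[G],
    where loglik is the log-likelihood at the MLE, num_params
    is Dim[G], and num_samples is M."""
    return loglik - (math.log(num_samples) / 2) * num_params
```

At a fixed log-likelihood the score strictly decreases as parameters are added, which is exactly how the penalty suppresses spurious edges.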
Bayesian Score: Asymptotic Behavior (cont.)
- As M → ∞, the true structure G* maximizes the score
  - Spurious edges do not contribute to the likelihood and are penalized
  - Required edges are added, because the likelihood term grows linearly in M while the model-complexity penalty grows only logarithmically
Summary: Network Scores
- Likelihood, MDL, and (log) BDe scores all have the same decomposable form
- BDe requires assessing a prior network, and can naturally incorporate prior knowledge
- BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
- All are score-equivalent: if G is I-equivalent to G', then Score(G) = Score(G')
Optimization Problem
- Input: training data; a scoring function (including priors, if needed); the set of possible structures, including prior knowledge about structure
- Output: a network (or networks) that maximizes the score
- Key property, decomposability: the score of a network is a sum of per-family terms
Learning Trees
- Trees: at most one parent per variable
- Why trees?
  - Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm)
  - Sparse parameterization: avoids overfitting while still adapting to the data
Learning Trees (cont.)
- Let p(i) denote the parent of Xi, or 0 if Xi has no parent
- We can write the score as: Score = sum of edge scores + constant
Learning Trees: Algorithm
- Construct a complete graph with vertices 1, …, n
- Set w(i→j) = Score(Xj | Xi) − Score(Xj)
- Find the tree (or forest) with maximal weight
  - This can be done with standard algorithms in low-order polynomial time by building the tree greedily (Kruskal's maximum spanning tree algorithm)
- Theorem: the procedure finds the tree with maximal score
- When the score is the likelihood, w(i→j) is proportional to I(Xi ; Xj); this is known as the Chow & Liu method
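A compact sketch of the Chow & Liu method for the likelihood score: weight each pair of variables by empirical mutual information, then run Kruskal's algorithm with a union-find structure. The dict-based data layout is an illustrative assumption:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_tree(data, variables):
    """Chow & Liu: weight each pair by empirical mutual information,
    then take a maximum-weight spanning tree (Kruskal + union-find).
    Returns the tree as a list of undirected (a, b) edges."""
    n = len(data)

    def mi(a, b):
        pab = Counter((s[a], s[b]) for s in data)
        pa = Counter(s[a] for s in data)
        pb = Counter(s[b] for s in data)
        return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
                   for (x, y), c in pab.items())

    # Process candidate edges in decreasing order of mutual information.
    edges = sorted(((mi(a, b), a, b) for a, b in combinations(variables, 2)),
                   reverse=True)
    parent = {v: v for v in variables}

    def find(v):  # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it joins two components
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

On data where A and B are perfectly correlated and C is independent of both, the A–B edge is always selected, and the result spans all three variables.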
Learning Trees: Example
Beyond Trees
- The problem is not easy for more complex networks; in fact, no efficient algorithm is known
- Theorem: finding the maximal-scoring network structure with at most k parents per variable is NP-hard for k > 1
Fixed Ordering
- For any decomposable scoring function Score(G : D) and fixed ordering ≺, the maximal-scoring network can be found by choosing each variable's parents from among its predecessors
- For a fixed ordering, these choices are independent problems, one per variable
- If we bound the in-degree per variable by d, the complexity is exponential in d
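The independence of the per-variable problems can be sketched directly: for each variable, brute-force the best parent subset of size at most d among its predecessors. The `family_score(x, parents)` callback is a hypothetical interface standing in for any decomposable score's family term:

```python
from itertools import combinations

def best_network_for_ordering(order, d, family_score):
    """With a fixed ordering, a decomposable score lets each
    variable's parent set be chosen independently among its
    predecessors; brute force over subsets of size <= d, so the
    cost is exponential in d but not in n."""
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        candidates = (set(u) for k in range(min(d, len(preds)) + 1)
                      for u in combinations(preds, k))
        parents[x] = max(candidates, key=lambda u: family_score(x, u))
    return parents
```

For instance, with a toy family score that simply rewards larger parent sets and d = 1, each non-root variable picks exactly one predecessor as parent.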
Heuristic Search
- We address the problem by using heuristic search
- Define a search space: nodes are candidate structures; edges denote adjacency between structures
- Traverse this space looking for high-scoring structures
- Search techniques: greedy hill-climbing, best-first search, simulated annealing, …
Heuristic Search
Exploiting Decomposability Caching: To update the score after a local change, we only need to rescore the families that were changed in the last move
Greedy Hill Climbing
- The simplest heuristic local search
- Start with a given network: the empty network, the best tree, or a random network
- At each iteration, consider the local changes (edge addition, deletion, reversal) and apply the one that most improves the score
- Stop when no modification improves the score
- With caching, each step requires evaluating only O(n) new changes
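A minimal hill-climbing sketch over edge additions and deletions (edge reversal is omitted for brevity). It rescores the whole network per candidate move via a `score(parents)` callback, whereas a practical implementation would exploit decomposability and cache family scores as described below:

```python
def hill_climb(variables, score, max_iters=100):
    """Greedy hill-climbing over single-edge additions and deletions.
    `score(parents)` scores the whole network from its parent sets."""
    parents = {v: set() for v in variables}

    def creates_cycle(u, v):
        # Adding u -> v creates a cycle iff v is already an ancestor of u.
        stack, seen = list(parents[u]), set()
        while stack:
            w = stack.pop()
            if w == v:
                return True
            if w not in seen:
                seen.add(w)
                stack.extend(parents[w])
        return False

    current = score(parents)
    for _ in range(max_iters):
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if x in parents[y]:            # candidate: delete x -> y
                    parents[y].discard(x)
                    delta = score(parents) - current
                    parents[y].add(x)
                    move = ('del', x, y)
                elif not creates_cycle(x, y):  # candidate: add x -> y
                    parents[y].add(x)
                    delta = score(parents) - current
                    parents[y].discard(x)
                    move = ('add', x, y)
                else:
                    continue
                if delta > best_delta:
                    best_delta, best_move = delta, move
        if best_move is None:  # no strict improvement: local maximum
            break
        op, x, y = best_move
        if op == 'add':
            parents[y].add(x)
        else:
            parents[y].discard(x)
        current += best_delta
    return parents
```

With a toy score that rewards the single edge X → Y, the search adds exactly that edge and then stops at the local (here also global) maximum.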
Greedy Hill Climbing: Pitfalls
- Greedy hill-climbing can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaus: some one-edge changes leave the score unchanged; this happens because I-equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both: random restarts, TABU search
Equivalence Class Search
- Idea: search the space of equivalence classes
- Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
- Advantages: the PDAG space has fewer local maxima and plateaus, and there are fewer PDAGs than DAGs
Equivalence Class Search (cont.)
- Disadvantages: evaluating changes is more expensive, since in addition to searching we need to score a consistent network
- These algorithms are more complex to implement
Learning Example: Alarm Network
Model Selection
- So far, we focused on a single model: find the best-scoring model and use it to predict the next example
- Implicit assumption: the best-scoring model dominates the weighted sum
- Pros: we get a single structure, which allows efficient use in our tasks
- Cons: we commit to the independencies of one particular structure, even though other structures might be nearly as probable given the data
Model Selection (cont.)
- Density estimation: one structure may suffice if its joint distribution is similar to those of other high-scoring structures
- Structure discovery: define features f(G) (e.g., an edge, a sub-structure, a d-separation query) and compute P(f | D) = Σ_G f(G) P(G | D)
- This still requires summing over exponentially many structures
Model Averaging Given an Order
- Assumptions: a known total order over the variables; a maximum in-degree d per variable
- Marginal likelihood
Model Averaging Given an Order (cont.)
- Posterior probability of a general feature
- Posterior probability of a particular choice of parents
- Posterior probability of a particular edge choice
Model Averaging
- In general, we cannot assume the order is known
- Solution: sample from the posterior distribution P(G | D) and estimate each feature probability by averaging f over the samples
- Sampling can be done by MCMC
- Sampling can be done over orderings rather than structures
Notes on Learning Local Structures
- The score may decompose further: e.g., with tree-CPDs, the score decomposes by leaves
- The prior may need to be extended: e.g., with tree-CPDs, a penalty for the tree structure of each CPD
- The search operators must be extended to local structure: e.g., with tree-CPDs, we also need to search over the per-CPD tree structures, either by a local encapsulated search or by defining new global operators
Search: Summary
- A discrete optimization problem, NP-hard in general, so we must resort to heuristic search
- In practice, search is relatively fast (~100 variables in ~10 minutes), thanks to decomposability and sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem