资料下载：模式识别与机器学习.pdf

模式识别与机器学习.pdf

http://www.100md.com 2020年11月18日

第1页

第8页

第12页

第28页

第35页

第201页

参见附件(10365KB，749页)。

模式识别与机器学习内容深入浅出，生动有趣，力求反映这一领域的核心知识体系和新的发展趋势。每章内容都尽可能做到丰富完整，并附有思考与计算题，便于读者对知识的巩固和融会贯通。

    内容简介

    模式识别与机器学习系统介绍了模式识别与机器学习的基础理论、模型与算法，同时兼顾了前沿知识的适当融入。本书以贝叶斯学习的思想贯穿始终，并适时与其他重要知识点(如支持向量机、深度学习)等进行交叉和关联，便于读者在形成良好知识体系的同时保持对整个领域知识的全面把握。

    全书共14章和4个附录，循序渐进地对模式识别与机器学习领域进行剖析。首先介绍贝叶斯学习基础、逻辑回归、概率图模型基础、隐马尔可夫模型和条件随机场，接着介绍支持向量机、人工神经网络与深度学习、高斯过程、聚类、主成分分析与相关的谱方法，最后介绍确定性近似推理、随机近似推理和强化学习。附录包括传统的模式识别与机器学习方法近邻法和决策树，还有向量微积分和随机变量的变换等与本学科方向强相关的重要知识点。

    作者简介

    孙仕亮，清华大学博士，华东师范大学教授。在华东师范大学从事研究工作，主讲本科和研究生“模式识别与机器学习”“高级机器学习”等课程，并在英国伦敦大学学院、美国哥伦比亚大学从事访问合作研究。在模式识别与机器学习领域的国际著名期刊和会议发表学术论文100余篇，承担多项国j级、省部级科研项目及国际知名企业的合作研究项目，研究成果多次获得省部级科学技术奖励。

    赵静，华东师范大学博士，讲师。从事模式识别与机器学习领域的研究，包括概率模型、贝叶斯学习、近似推理与优化。主讲研究生“高级机器学习”和本科生“可信机器学习”等课程。入选上海市2016年度“扬帆计划”和2019年度“晨光计划”。发表论文近20篇，代表性成果发表于JMLR、T-CYB、IJCAI等国际顶j期刊和会议。

    本书特色

    详细阐述模式识别与机器学习的核心知识体系，汇集了模式识别与机器学习中传统与现代、概率与非概率的基础理论、模型与算法，并兼顾新的发展趋势。

    内容深入浅出，生动有趣，通俗易懂，逻辑性强。部分章节包含对相似知识点的解析，便于读者梳理不同方法间的逻辑关系。

    知识阐述丰富完整，图文并茂，配有典型示例和直观示意图，并附有思考与计算题，帮助读者深入理解内容。附录包含重要数学基础，全书自我完整性好。

    适合高等院校人工智能相关专业的本科生和研究生(硕/博)以及希望从事人工智能相关工作的科技工作者使用。建议将全书作为研究生课程的内容，将第1、2.3.4、7，8，10.11、14章作为本科生课程的内容。

    贝叶斯决策

    贝叶斯决策(Bayesian decision)是概率框架下实施决策的基本方法，它通过综合考虑决策的后验分布和错误决策的损失做出决策，其中，贝叶斯公式被用于计算后验分布，贝叶斯决策的前提是假设：D决策问题可以以概分布的形式描述：@与决策有关的概率分布均是可计算的。

    同样，以熊猫的性别分类问题为例讲解贝叶斯决策，即根据熊猫的形态特征来判断熊猫的性别，使用w表示性别.w-1时表示雕性.w-2时表示雄性。由于熊猫属于哪一种性别并不确定，因此w是一个随机变量。在不同地区(动物同或某个野生动物保护区)，熊猫的雌雄比例有可能不同，因此假定熊猫为雄性的先验概率为p(w=1).为维性的先验概，为p(w=2)，并且p(w-1)+p(w-2)-1，除了变量以外，还需要另外一个随机变量来刻画熊猫的形态特征，假设，表示观测变量，其中x-1表示熊猜是干净整洁的.x-0反之，在某种类别下，计算得到观测变量的概率分布，即为似然函数pC.rlwo)

    在给定决策问题的概，描述(先验概率分布和似然函数)之后，贝叶斯决策使用贝叶斯公式推导出性别变量的后验分布pcaelax)，然后通过决策规则做出决策。这里主要介绍两种比较常用的贝叶斯决策规则：最小错误率和最小风险

    模式识别与机器学习截图

    Information Science and Statistics

    Series Editors:

    M. Jordan

    J. Kleinberg

    B. Scho ¨ lkopfInformation Science and Statistics

    Akaike and Kitagawa: The Practice of Time Series Analysis.

    Bishop: Pattern Recognition and Machine Learning.

    Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and

    Expert Systems.

    Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.

    Fine: Feedforward Neural Network Methodology.

    Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.

    Jensen: Bayesian Networks and Decision Graphs.

    Marchette: Computer Intrusion Detection and Network Monitoring:

    A Statistical Viewpoint.

    Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to

    Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.

    Studeny: Probabilistic Conditional Independence Structures.

    Vapnik: The Nature of Statistical Learning Theory, Second Edition.

    Wallace: Statistical and Inductive Inference by Minimum Massage Length. Christopher M. Bishop

    Pattern Recognition and

    Machine LearningChristopher M. Bishop F.R.Eng.

    Assistant Director

    Microsoft Research Ltd

    Cambridge CB3 0FB, U.K.

    cmbishop@microsoft.com

    http:research.microsoft.comcmbishop

    Series Editors

    Michael Jordan

    Department of Computer

    Science and Department

    of Statistics

    University of California,Berkeley

    Berkeley, CA 94720

    USA

    Professor Jon Kleinberg

    Department of Computer

    Science

    Cornell University

    Ithaca, NY 14853

    USA

    Bernhard Scho ¨ lkopf

    Max Planck Institute for

    Biological Cybernetics

    Spemannstrasse 38

    72076 Tu ¨bingen

    Germany

    Library of Congress Control Number: 2006922522

    ISBN-10: 0-387-31073-8

    ISBN-13: 978-0387-31073-2

    Printed on acid-free paper.

    2006 Springer Science+Business Media, LLC

    All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher

    (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection

    with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation,computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

    The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such,is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

    Printed in Singapore. (KYO)

    987654321

    springer.comThis book is dedicated to my family:

    Jenna, Mark, and Hugh

    Total eclipse of the sun, Antalya, Turkey, 29 March 2006.Preface

    Pattern recognition has its origins in engineering, whereas machine learning grew

    out of computer science. However, these activities can be viewed as two facets of

    the same ?eld, and together they have undergone substantial development over the

    past ten years. In particular, Bayesian methods have grown from a specialist niche to

    become mainstream, while graphical models have emerged as a general framework

    for describing and applying probabilistic models. Also, the practical applicability of

    Bayesian methods has been greatly enhanced through the development of a range of

    approximate inference algorithms such as variational Bayes and expectation propa-

    gation. Similarly, new models based on kernels have had signi?cant impact on both

    algorithms and applications.

    This new textbook re?ects these recent developments while providing a compre-

    hensive introduction to the ?elds of pattern recognition and machine learning. It is

    aimed at advanced undergraduates or ?rst year PhD students, as well as researchers

    and practitioners, and assumes no previous knowledge of pattern recognition or ma-

    chine learning concepts. Knowledge of multivariate calculus and basic linear algebra

    is required, and some familiarity with probabilities would be helpful though not es-

    sential as the book includes a self-contained introduction to basic probability theory.

    Because this book has broad scope, it is impossible to provide a complete list of

    references, and in particular no attempt has been made to provide accurate historical

    attribution of ideas. Instead, the aim has been to give references that offer greater

    detail than is possible here and that hopefully provide entry points into what, in some

    cases, is a very extensive literature. For this reason, the references are often to more

    recent textbooks and review articles rather than to original sources.

    The book is supported by a great deal of additional material, including lecture

    slides as well as the complete set of ?gures used in the book, and the reader is

    encouraged to visit the book web site for the latest information:

    http:research.microsoft.com～cmbishopPRML

    viiviii PREFACE

    Exercises

    The exercises that appear at the end of every chapter form an important com-

    ponent of the book. Each exercise has been carefully chosen to reinforce concepts

    explained in the text or to develop and generalize them in signi?cant ways, and each

    is graded according to dif?culty ranging from (), which denotes a simple exercise

    taking a few minutes to complete, through to (), which denotes a signi?cantly

    more complex exercise.

    It has been dif?cult to know to what extent these solutions should be made

    widely available. Those engaged in self study will ?nd worked solutions very ben-

    e?cial, whereas many course tutors request that solutions be available only via the

    publisher so that the exercises may be used in class. In order to try to meet these

    con?icting requirements, those exercises that help amplify key points in the text, or

    that ?ll in important details, have solutions that are available as a PDF ?le from the

    book web site. Such exercises are denoted by www . Solutions for the remaining

    exercises are available to course tutors by contacting the publisher (contact details

    are given on the book web site). Readers are strongly encouraged to work through

    the exercises unaided, and to turn to the solutions only as required.

    Although this book focuses on concepts and principles, in a taught course the

    students should ideally have the opportunity to experiment with some of the key

    algorithms using appropriate data sets. A companion volume (Bishop and Nabney,2008) will deal with practical aspects of pattern recognition and machine learning,and will be accompanied by Matlab software implementing most of the algorithms

    discussed in this book.

    Acknowledgements

    First of all I would like to express my sincere thanks to Markus Svens′ en who

    has provided immense help with preparation of ?gures and with the typesetting of

    the book in L A T EX. His assistance has been invaluable.

    I am very grateful to Microsoft Research for providing a highly stimulating re-

    search environment and for giving me the freedom to write this book (the views and

    opinions expressed in this book, however, are my own and are therefore not neces-

    sarily the same as those of Microsoft or its af?liates).

    Springer has provided excellent support throughout the ?nal stages of prepara-

    tion of this book, and I would like to thank my commissioning editor John Kimmel

    for his support and professionalism, as well as Joseph Piliero for his help in design-

    ing the cover and the text format and MaryAnn Brickner for her numerous contribu-

    tions during the production phase. The inspiration for the cover design came from a

    discussion with Antonio Criminisi.

    I also wish to thank Oxford University Press for permission to reproduce ex-

    cerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop,1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are repro-

    duced with the permission of Arvin Calspan Advanced Technology Center. I would

    also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1,and Bernhard Sch¨ olkopf for permission to use his kernel PCA code to plot Fig-

    ure 12.17.PREFACE ix

    Many people have helped by proofreading draft material and providing com-

    ments and suggestions, including Shivani Agarwal, C′ edric Archambeau, Arik Azran,Andrew Blake, Hakan Cevikalp, Michael Fourman, Brendan Frey, Zoubin Ghahra-

    mani, Thore Graepel, Katherine Heller, Ralf Herbrich, Geoffrey Hinton, Adam Jo-

    hansen, Matthew Johnson, Michael Jordan, Eva Kalyvianaki, Anitha Kannan, Julia

    Lasserre, David Liu, TomMinka, Ian Nabney, Tonatiuh Pena, Yuan Qi, Sam Roweis,Balaji Sanjiya, Toby Sharp, Ana Costa e Silva, David Spiegelhalter, Jay Stokes, Tara

    Symeonides,Martin Szummer,Marshall Tappen, Ilkay Ulusoy, ChrisWilliams, John

    Winn, and Andrew Zisserman.

    Finally, I would like to thank my wife Jenna who has been hugely supportive

    throughout the several years it has taken to write this book.

    Chris Bishop

    Cambridge

    February 2006Mathematical notation

    I have tried to keep the mathematical content of the book to the minimum neces-

    sary to achieve a proper understanding of the ?eld. However, this minimum level is

    nonzero, and it should be emphasized that a good grasp of calculus, linear algebra,and probability theory is essential for a clear understanding of modern pattern recog-

    nition and machine learning techniques. Nevertheless, the emphasis in this book is

    on conveying the underlying concepts rather than on mathematical rigour.

    I have tried to use a consistent notation throughout the book, although at times

    this means departing from some of the conventions used in the corresponding re-

    search literature. Vectors are denoted by lower case bold Roman letters such as

    x, and all vectors are assumed to be column vectors. A superscript T denotes the

    transpose of a matrix or vector, so that xT will be a row vector. Uppercase bold

    roman letters, such as M, denote matrices. The notation (w1,...,wM) denotes a

    row vector with M elements, while the corresponding column vector is written as

    w =(w1,...,wM)T.

    The notation [a, b] is used to denote the closed interval from a to b, that is the

    interval including the values a and b themselves, while (a, b) denotes the correspond-

    ing open interval, that is the interval excluding a and b. Similarly, [a, b) denotes an

    interval that includes a but excludes b. For the most part, however, there will be

    little need to dwell on such re?nements as whether the end points of an interval are

    included or not.

    The M × M identity matrix (also known as the unit matrix) is denoted IM,which will be abbreviated to I where there is no ambiguity about it dimensionality.

    It has elements Iij that equal 1 if i = j and 0 if i

    = j.

    A functional is denoted f[y] where y(x) is some function. The concept of a

    functional is discussed in Appendix D.

    The notation g(x)= O(f(x)) denotes that |f(x)g(x)| is bounded as x →∞.

    For instance if g(x)=3x2 +2, then g(x)= O(x2).

    The expectation of a function f(x, y) with respect to a random variable x is de-

    noted by Ex[f(x, y)]. In situations where there is no ambiguity as to which variable

    is being averaged over, this will be simpli?ed by omitting the suf?x, for instance

    xixii MATHEMATICAL NOTATION

    E[x]. If the distribution of x is conditioned on another variable z, then the corre-

    sponding conditional expectation will be written Ex[f(x)|z]. Similarly, the variance

    is denoted var[f(x)], and for vector variables the covariance is written cov[x, y].We

    shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expecta-

    tions and covariances are introduced in Section 1.2.2.

    If we have N values x1,..., xN of a D-dimensional vector x =(x1,...,xD)T,we can combine the observations into a data matrix X in which the nth row of X

    corresponds to the row vector xT

    n. Thus the n, i element of X corresponds to the

    i

    th element of the nth observation xn. For the case of one-dimensional variables we

    shall denote such a matrix by x, which is a column vector whose nth element is xn.

    Note that x (which has dimensionality N) uses a different typeface to distinguish it

    from x (which has dimensionality D).Contents

    Preface vii

    Mathematical notation xi

    Contents xiii

    1 Introduction 1

    1.1 Example: Polynomial Curve Fitting . . ............. 4

    1.2 Probability Theory . . ........................ 12

    1.2.1 Probability densities . .................... 17

    1.2.2 Expectations and covariances ................ 19

    1.2.3 Bayesian probabilities .................... 21

    1.2.4 The Gaussian distribution . . ............. 24

    1.2.5 Curve ?tting re-visited .................... 28

    1.2.6 Bayesian curve ?tting .................... 30

    1.3 Model Selection . . ........................ 32

    1.4 The Curse of Dimensionality . .................... 33

    1.5 Decision Theory . . ........................ 38

    1.5.1 Minimizing the misclassi?cation rate . . ......... 39

    1.5.2 Minimizing the expected loss . . .............. 41

    1.5.3 The reject option . . .................... 42

    1.5.4 Inference and decision .................... 42

    1.5.5 Loss functions for regression . . ............. 46

    1.6 Information Theory . . ........................ 48

    1.6.1 Relative entropy and mutual information . ......... 55

    Exercises . . ............................... 58

    xiiixiv CONTENTS

    2 Probability Distributions 67

    2.1 Binary Variables . . ........................ 68

    2.1.1 The beta distribution . .................... 71

    2.2 Multinomial Variables ........................ 74

    2.2.1 The Dirichlet distribution . . ............. 76

    2.3 The Gaussian Distribution . . .................... 78

    2.3.1 Conditional Gaussian distributions . ............. 85

    2.3.2 Marginal Gaussian distributions . . ............. 88

    2.3.3 Bayes’ theorem for Gaussian variables . . ......... 90

    2.3.4 Maximum likelihood for the Gaussian . . ......... 93

    2.3.5 Sequential estimation . .................... 94

    2.3.6 Bayesian inference for the Gaussian ............. 97

    2.3.7 Student’s t-distribution .................... 102

    2.3.8 Periodic variables . . .................... 105

    2.3.9 Mixtures of Gaussians .................... 110

    2.4 The Exponential Family . . .................... 113

    2.4.1 Maximum likelihood and suf?cient statistics . . 116

    2.4.2 Conjugate priors . . .................... 117

    2.4.3 Noninformative priors .................... 117

    2.5 Nonparametric Methods . . .................... 120

    2.5.1 Kernel density estimators . . ............. 122

    2.5.2 Nearest-neighbour methods . . ............. 124

    Exercises . . ............................... 127

    3 Linear Models for Regression 137

    3.1 Linear Basis Function Models .................... 138

    3.1.1 Maximum likelihood and least squares . . ......... 140

    3.1.2 Geometry of least squares . . ............. 143

    3.1.3 Sequential learning . . .................... 143

    3.1.4 Regularized least squares . . ............. 144

    3.1.5 Multiple outputs . . .................... 146

    3.2 The Bias-Variance Decomposition . . ............. 147

    3.3 Bayesian Linear Regression . .................... 152

    3.3.1 Parameter distribution .................... 152

    3.3.2 Predictive distribution .................... 156

    3.3.3 Equivalent kernel . . .................... 159

    3.4 Bayesian Model Comparison . .................... 161

    3.5 The Evidence Approximation .................... 165

    3.5.1 Evaluation of the evidence function ............. 166

    3.5.2 Maximizing the evidence function . ............. 168

    3.5.3 Effective number of parameters . . ............. 170

    3.6 Limitations of Fixed Basis Functions . . ............. 172

    Exercises . . ............................... 173CONTENTS xv

    4 Linear Models for Classi?cation 179

    4.1 Discriminant Functions ........................ 181

    4.1.1 Two classes . . ........................ 181

    4.1.2 Multiple classes ........................ 182

    4.1.3 Least squares for classi?cation ................ 184

    4.1.4 Fisher’s linear discriminant . . ............. 186

    4.1.5 Relation to least squares . . ............. 189

    4.1.6 Fisher’s discriminant for multiple classes . ......... 191

    4.1.7 The perceptron algorithm . . ............. 192

    4.2 Probabilistic Generative Models . . ................ 196

    4.2.1 Continuous inputs . . .................... 198

    4.2.2 Maximum likelihood solution . . ............. 200

    4.2.3 Discrete features . . .................... 202

    4.2.4 Exponential family . . .................... 202

    4.3 Probabilistic Discriminative Models . . ............. 203

    4.3.1 Fixed basis functions . .................... 204

    4.3.2 Logistic regression . . .................... 205

    4.3.3 Iterative reweighted least squares .............. 207

    4.3.4 Multiclass logistic regression . . .............. 209

    4.3.5 Probit regression . . .................... 210

    4.3.6 Canonical link functions . . ............. 212

    4.4 The Laplace Approximation . .................... 213

    4.4.1 Model comparison and BIC . . .............. 216

    4.5 Bayesian Logistic Regression .................... 217

    4.5.1 Laplace approximation .................... 217

    4.5.2 Predictive distribution .................... 218

    Exercises . . ............................... 220

    5 Neural Networks 225

    5.1 Feed-forward Network Functions . . ............. 227

    5.1.1 Weight-space symmetries . . .............. 231

    5.2 Network Training . . ........................ 232

    5.2.1 Parameter optimization .................... 236

    5.2.2 Local quadratic approximation . . ............. 237

    5.2.3 Use of gradient information . . .............. 239

    5.2.4 Gradient descent optimization . . ............. 240

    5.3 Error Backpropagation ........................ 241

    5.3.1 Evaluation of error-function derivatives . . ......... 242

    5.3.2 A simple example ...................... 245

    5.3.3 Ef?ciency of backpropagation . . ............. 246

    5.3.4 The Jacobian matrix . .................... 247

    5.4 The Hessian Matrix . . ........................ 249

    5.4.1 Diagonal approximation . . .............. 250

    5.4.2 Outer product approximation . ................ 251

    5.4.3 Inverse Hessian ........................ 252xvi CONTENTS

    5.4.4 Finite differences....................... 252

    5.4.5 Exact evaluation of the Hessian . . ............. 253

    5.4.6 Fast multiplication by the Hessian . ............. 254

    5.5 Regularization in Neural Networks . . ............. 256

    5.5.1 Consistent Gaussian priors . . .............. 257

    5.5.2 Early stopping ........................ 259

    5.5.3 Invariances . . ........................ 261

    5.5.4 Tangent propagation . .................... 263

    5.5.5 Training with transformed data . . ............. 265

    5.5.6 Convolutional networks . . ............. 267

    5.5.7 Soft weight sharing . . .................... 269

    5.6 Mixture Density Networks . . .................... 272

    5.7 Bayesian Neural Networks . . .................... 277

    5.7.1 Posterior parameter distribution . . ............. 278

    5.7.2 Hyperparameter optimization . . ............. 280

    5.7.3 Bayesian neural networks for classi?cation ......... 281

    Exercises . . ............................... 284

    6 Kernel Methods 291

    6.1 Dual Representations . ........................ 293

    6.2 Constructing Kernels . ........................ 294

    6.3 Radial Basis Function Networks . . ................ 299

    6.3.1 Nadaraya-Watson model . . ............. 301

    6.4 Gaussian Processes . . ........................ 303

    6.4.1 Linear regression revisited . . ............. 304

    6.4.2 Gaussian processes for regression . ............. 306

    6.4.3 Learning the hyperparameters . . ............. 311

    6.4.4 Automatic relevance determination ............. 312

    6.4.5 Gaussian processes for classi?cation ............. 313

    6.4.6 Laplace approximation .................... 315

    6.4.7 Connection to neural networks . . ............. 319

    Exercises . . ............................... 320

    7 Sparse Kernel Machines 325

    7.1 Maximum Margin Classi?ers .................... 326

    7.1.1 Overlapping class distributions . . .............. 331

    7.1.2 Relation to logistic regression . . .............. 336

    7.1.3 Multiclass SVMs . . .................... 338

    7.1.4 SVMs for regression . .................... 339

    7.1.5 Computational learning theory . . ............. 344

    7.2 Relevance Vector Machines . .................... 345

    7.2.1 RVM for regression . . .................... 345

    7.2.2 Analysis of sparsity . . .................... 349

    7.2.3 RVM for classi?cation .................... 353

    Exercises . . ............................... 357CONTENTS xvii

    8 Graphical Models 359

    8.1 Bayesian Networks . . ........................ 360

    8.1.1 Example: Polynomial regression . . ............. 362

    8.1.2 Generative models . . .................... 365

    8.1.3 Discrete variables . . .................... 366

    8.1.4 Linear-Gaussian models . . ............. 370

    8.2 Conditional Independence . . .................... 372

    8.2.1 Three example graphs .................... 373

    8.2.2 D-separation . ........................ 378

    8.3 Markov Random Fields . . .................... 383

    8.3.1 Conditional independence properties ............. 383

    8.3.2 Factorization properties . . ............. 384

    8.3.3 Illustration: Image de-noising . . ............. 387

    8.3.4 Relation to directed graphs . . ............. 390

    8.4 Inference in Graphical Models .................... 393

    8.4.1 Inference on a chain . .................... 394

    8.4.2 Trees ............................. 398

    8.4.3 Factor graphs . ........................ 399

    8.4.4 The sum-product algorithm . . .............. 402

    8.4.5 The max-sum algorithm . . ............. 411

    8.4.6 Exact inference in general graphs . ............. 416

    8.4.7 Loopy belief propagation . . ............. 417

    8.4.8 Learning the graph structure . . .............. 418

    Exercises . . ............................... 418

    9 Mixture Models and EM 423

    9.1 K-means Clustering . ........................ 424

    9.1.1 Image segmentation and compression . . ......... 428

    9.2 Mixtures of Gaussians ........................ 430

    9.2.1 Maximum likelihood . .................... 432

    9.2.2 EM for Gaussian mixtures . . ............. 435

    9.3 An Alternative View of EM . .................... 439

    9.3.1 Gaussian mixtures revisited . . .............. 441

    9.3.2 Relation to K-means . .................... 443

    9.3.3 Mixtures of Bernoulli distributions . ............. 444

    9.3.4 EM for Bayesian linear regression . ............. 448

    9.4 The EM Algorithm in General .................... 450

    Exercises . . ............................... 455

    10 Approximate Inference 461

    10.1 Variational Inference . ........................ 462

    10.1.1 Factorized distributions .................... 464

    10.1.2 Properties of factorized approximations . . ......... 466

    10.1.3 Example: The univariate Gaussian . ............. 470

    10.1.4 Model comparison . . .................... 473

    10.2 Illustration: Variational Mixture of Gaussians . . ......... 474xviii CONTENTS

    10.2.1 Variational distribution .................... 475

    10.2.2 Variational lower bound . . ............. 481

    10.2.3 Predictive density . . .................... 482

    10.2.4 Determining the number of components . . ......... 483

    10.2.5 Induced factorizations .................... 485

    10.3 Variational Linear Regression .................... 486

    10.3.1 Variational distribution .................... 486

    10.3.2 Predictive distribution .................... 488

    10.3.3 Lower bound . ........................ 489

    10.4 Exponential Family Distributions . . ............. 490

    10.4.1 Variational message passing . . .............. 491

    10.5 Local Variational Methods . . .................... 493

    10.6 Variational Logistic Regression . . ................ 498

    10.6.1 Variational posterior distribution . . ............. 498

    10.6.2 Optimizing the variational parameters . . ......... 500

    10.6.3 Inference of hyperparameters ................ 502

    10.7 Expectation Propagation . . .................... 505

    10.7.1 Example: The clutter problem . . ............. 511

    10.7.2 Expectation propagation on graphs . ............. 513

    Exercises . . ............................... 517

    11 Sampling Methods 523

    11.1 Basic Sampling Algorithms . .................... 526

    11.1.1 Standard distributions .................... 526

    11.1.2 Rejection sampling . . .................... 528

    11.1.3 Adaptive rejection sampling . . ............. 530

    11.1.4 Importance sampling . .................... 532

    11.1.5 Sampling-importance-resampling . ............. 534

    11.1.6 Sampling and the EM algorithm . . ............. 536

    11.2 Markov Chain Monte Carlo . .................... 537

    11.2.1 Markov chains ........................ 539

    11.2.2 The Metropolis-Hastings algorithm ............. 541

    11.3 Gibbs Sampling . . ........................ 542

    11.4 Slice Sampling . . ........................ 546

    11.5 The Hybrid Monte Carlo Algorithm . . ............. 548

    11.5.1 Dynamical systems . . .................... 548

    11.5.2 Hybrid Monte Carlo . .................... 552

    11.6 Estimating the Partition Function . . ............. 554

    Exercises . . ............................... 556

    12 Continuous Latent Variables 559

    12.1 Principal Component Analysis .................... 561

    12.1.1 Maximum variance formulation . . ............. 561

    12.1.2 Minimum-error formulation . . ............. 563

    12.1.3 Applications of PCA . .................... 565

    12.1.4 PCA for high-dimensional data . . ............. 569CONTENTS xix

    12.2 Probabilistic PCA . . ........................ 570

    12.2.1 Maximum likelihood PCA . . ............. 574

    12.2.2 EM algorithm for PCA .................... 577

    12.2.3 Bayesian PCA ........................ 580

    12.2.4 Factor analysis ........................ 583

    12.3 Kernel PCA .............................. 586

    12.4 Nonlinear Latent Variable Models . . ............. 591

    12.4.1 Independent component analysis . . ............. 591

    12.4.2 Autoassociative neural networks . . ............. 592

    12.4.3 Modelling nonlinear manifolds . . ............. 595

    Exercises . . ............................... 599

    13 Sequential Data 605

    13.1 Markov Models . . ........................ 607

    13.2 Hidden Markov Models . . .................... 610

    13.2.1 Maximum likelihood for the HMM ............. 615

    13.2.2 The forward-backward algorithm . ............. 618

    13.2.3 The sum-product algorithm for the HMM . ......... 625

    13.2.4 Scaling factors ........................ 627

    13.2.5 The Viterbi algorithm . .................... 629

    13.2.6 Extensions of the hidden Markov model . . ......... 631

    13.3 Linear Dynamical Systems . . .................... 635

    13.3.1 Inference in LDS . . .................... 638

    13.3.2 Learning in LDS . . .................... 642

    13.3.3 Extensions of LDS . . .................... 644

    13.3.4 Particle ?lters . ........................ 645

    Exercises . . ............................... 646

    14 Combining Models 653

    14.1 Bayesian Model Averaging . . .................... 654

    14.2 Committees . . ........................ 655

    14.3 Boosting ............................... 657

    14.3.1 Minimizing exponential error . . ............. 659

    14.3.2 Error functions for boosting . . .............. 661

    14.4 Tree-based Models . . ........................ 663

    14.5 Conditional Mixture Models . .................... 666

    14.5.1 Mixtures of linear regression models ............. 667

    14.5.2 Mixtures of logistic models . . ............. 670

    14.5.3 Mixtures of experts . . .................... 672

    Exercises . . ............................... 674

    Appendix A Data Sets 677

    Appendix B Probability Distributions 685

    Appendix C Properties of Matrices 695xx CONTENTS

    Appendix D Calculus of Variations 703

    Appendix E Lagrange Multipliers 707

    References 711

    Index 7291

    Introduction

    The problem of searching for patterns in data is a fundamental one and has a long and

    successful history. For instance, the extensive astronomical observations of Tycho

    Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of

    planetary motion, which in turn provided a springboard for the development of clas-

    sical mechanics. Similarly, the discovery of regularities in atomic spectra played a

    key role in the development and veri?cation of quantum physics in the early twenti-

    eth century. The ?eld of pattern recognition is concerned with the automatic discov-

    ery of regularities in data through the use of computer algorithms and with the use of

    these regularities to take actions such as classifying the data into different categories.

    Consider the example of recognizing handwritten digits, illustrated in Figure 1.1.

    Each digit corresponds to a 28×28 pixel image and so can be represented by a vector

    x comprising 784 real numbers. The goal is to build a machine that will take such a

    vector x as input and that will produce the identity of the digit 0,..., 9 as the output.

    This is a nontrivial problem due to the wide variability of handwriting. It could be

    12 1. INTRODUCTION

    Figure 1.1 Examples of hand-written dig-

    its taken from US zip codes.

    tackled using handcrafted rules or heuristics for distinguishing the digits based on

    the shapes of the strokes, but in practice such an approach leads to a proliferation of

    rules and of exceptions to the rules and so on, and invariably gives poor results.

    Far better results can be obtained by adopting a machine learning approach in

    which a large set of N digits {x1,..., xN} called a training set is used to tune the

    parameters of an adaptive model. The categories of the digits in the training set

    are known in advance, typically by inspecting them individually and hand-labelling

    them. We can express the category of a digit using target vector t, which represents

    the identity of the corresponding digit. Suitable techniques for representing cate-

    gories in terms of vectors will be discussed later. Note that there is one such target

    vector t for each digit image x.

    The result of running the machine learning algorithm can be expressed as a

    function y(x) which takes a new digit image x as input and that generates an output

    vector y, encoded in the same way as the target vectors. The precise form of the

    function y(x) is determined during the training phase, also known as the learning

    phase, on the basis of the training data. Once the model is trained it can then de-

    termine the identity of new digit images, which are said to comprise a test set. The

    ability to categorize correctly new examples that differ from those used for train-

    ing is known as generalization. In practical applications, the variability of the input

    vectors will be such that the training data can comprise only a tiny fraction of all

    possible input vectors, and so generalization is a central goal in pattern recognition.

    For most practical applications, the original input variables are typically prepro-

    cessed to transform them into some new space of variables where, it is hoped, the

    pattern recognition problem will be easier to solve. For instance, in the digit recogni-

    tion problem, the images of the digits are typically translated and scaled so that each

    digit is contained within a box of a ?xed size. This greatly reduces the variability

    within each digit class, because the location and scale of all the digits are now the

    same, which makes it much easier for a subsequent pattern recognition algorithm

    to distinguish between the different classes. This pre-processing stage is sometimes

    also called feature extraction. Note that new test data must be pre-processed using

    the same steps as the training data.

    Pre-processing might also be performed in order to speed up computation. For

    example, if the goal is real-time face detection in a high-resolution video stream,the computer must handle huge numbers of pixels per second, and presenting these

    directly to a complex pattern recognition algorithm may be computationally infeasi-

    ble. Instead, the aim is to ?nd useful features that are fast to compute, and yet that1. INTRODUCTION 3

    also preserve useful discriminatory information enabling faces to be distinguished

    from non-faces. These features are then used as the inputs to the pattern recognition

    algorithm. For instance, the average value of the image intensity over a rectangular

    subregion can be evaluated extremely ef?ciently (Viola and Jones, 2004), and a set of

    such features can prove very effective in fast face detection. Because the number of

    such features is smaller than the number of pixels, this kind of pre-processing repre-

    sents a form of dimensionality reduction. Care must be taken during pre-processing

    because often information is discarded, and if this information is important to the

    solution of the problem then the overall accuracy of the system can suffer.

    Applications in which the training data comprises examples of the input vectors

    along with their corresponding target vectors are known as supervised learning prob-

    lems. Cases such as the digit recognition example, in which the aim is to assign each

    input vector to one of a ?nite number of discrete categories, are called classi?cation

    problems. If the desired output consists of one or more continuous variables, then

    the task is called regression. An example of a regression problem would be the pre-

    diction of the yield in a chemical manufacturing process in which the inputs consist

    of the concentrations of reactants, the temperature, and the pressure.

    In other pattern recognition problems, the training data consists of a set of input

    vectors x without any corresponding target values. The goal in such unsupervised

    learning problems may be to discover groups of similar examples within the data,where it is called clustering, or to determine the distribution of data within the input

    space, known as density estimation, or to project the data from a high-dimensional

    space down to two or three dimensions for the purpose of visualization.

    Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is con-

    cerned with the problem of ?nding suitable actions to take in a given situation in

    order to maximize a reward. Here the learning algorithm is not given examples of

    optimal outputs, in contrast to supervised learning, but must instead discover them

    by a process of trial and error. Typically there is a sequence of states and actions in

    which the learning algorithm is interacting with its environment. In many cases, the

    current action not only affects the immediate reward but also has an impact on the re-

    ward at all subsequent time steps. For example, by using appropriate reinforcement

    learning techniques a neural network can learn to play the game of backgammon to a

    high standard (Tesauro, 1994). Here the network must learn to take a board position

    as input, along with the result of a dice throw, and produce a strong move as the

    output. This is done by having the network play against a copy of itself for perhaps a

    million games. A major challenge is that a game of backgammon can involve dozens

    of moves, and yet it is only at the end of the game that the reward, in the form of

    victory, is achieved. The reward must then be attributed appropriately to all of the

    moves that led to it, even though some moves will have been good ones and others

    less so. This is an example of a credit assignment problem. A general feature of re-

    inforcement learning is the trade-off between exploration, in which the system tries

    out new kinds of actions to see how effective they are, and exploitation, in which

    the system makes use of actions that are known to yield a high reward. Too strong

    a focus on either exploration or exploitation will yield poor results. Reinforcement

    learning continues to be an active area of machine learning research. However, a4 1. INTRODUCTION

    Figure 1.2 Plot of a training data set of N =

    10 points, shown as blue circles,each comprising an observation

    of the input variable x along with

    the corresponding target variable

    t. The green curve shows the

    function sin(2πx) used to gener-

    ate the data. Our goal is to pre-

    dict the value of t for some new

    value of x, without knowledge of

    the green curve.

    x

    t

    0 1

    1

    0

    1

    detailed treatment lies beyond the scope of this book.

    Although each of these tasks needs its own tools and techniques, many of the

    key ideas that underpin them are common to all such problems. One of the main

    goals of this chapter is to introduce, in a relatively informal way, several of the most

    important of these concepts and to illustrate them using simple examples. Later in

    the book we shall see these same ideas re-emerge in the context of more sophisti-

    cated models that are applicable to real-world pattern recognition applications. This

    chapter also provides a self-contained introduction to three important tools that will

    be used throughout the book, namely probability theory, decision theory, and infor-

    mation theory. Although these might sound like daunting topics, they are in fact

    straightforward, and a clear understanding of them is essential if machine learning

    techniques are to be used to best effect in practical applications.

    1.1. Example: Polynomial Curve Fitting

    We begin by introducing a simple regression problem, which we shall use as a run-

    ning example throughout this chapter to motivate a number of key concepts. Sup-

    pose we observe a real-valued input variable x and we wish to use this observation to

    predict the value of a real-valued target variable t. For the present purposes, it is in-

    structive to consider an arti?cial example using synthetically generated data because

    we then know the precise process that generated the data for comparison against any

    learned model. The data for this example is generated from the function sin(2πx)

    with random noise included in the target values, as described in detail in Appendix A.

    Now suppose that we are given a training set comprising N observations of x,written x ≡ (x1,...,xN)T, together with corresponding observations of the values

    of t, denoted t ≡ (t1,...,tN)T. Figure 1.2 shows a plot of a training set comprising

    N =10 data points. The input data set x in Figure 1.2 was generated by choos-

    ing values of xn, for n =1,...,N, spaced uniformly in range [0, 1], and the target

    data set t was obtained by ?rst computing the corresponding values of the function1.1. Example: Polynomial Curve Fitting 5

    sin(2πx) and then adding a small level of random noise having a Gaussian distri-

    bution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in

    order to obtain the corresponding value tn. By generating data in this way, we are

    capturing a property of many real data sets, namely that they possess an underlying

    regularity, which we wish to learn, but that individual observations are corrupted by

    random noise. This noise might arise from intrinsically stochastic (i.e. random) pro-

    cesses such as radioactive decay but more typically is due to there being sources of

    variability that are themselves unobserved.

    Our goal is to exploit this training set in order to make predictions of the value

     t of the target variable for some new value x of the input variable. As we shall see

    later, this involves implicitly trying to discover the underlying function sin(2πx).

    This is intrinsically a dif?cult problem as we have to generalize from a ?nite data

    set. Furthermore the observed data are corrupted with noise, and so for a given x

    there is uncertainty as to the appropriate value for t. Probability theory, discussed

    in Section 1.2, provides a framework for expressing such uncertainty in a precise

    and quantitative manner, and decision theory, discussed in Section 1.5, allows us to

    exploit this probabilistic representation in order to make predictions that are optimal

    according to appropriate criteria.

    For the moment, however, we shall proceed rather informally and consider a

    simple approach based on curve ?tting. In particular, we shall ?t the data using a

    polynomial function of the form

    y(x,w)= w0 + w1x + w2x2

    + ... + wMxM =

    M

    j=0

    wjxj

    (1.1)

    where M is the order of the polynomial, and xj

    denotes x raised to the power of j.

    The polynomial coef?cients w0,...,wM are collectively denoted by the vector w.

    Note that, although the polynomial function y(x,w) is a nonlinear function of x,it

    is a linear function of the coef?cients w. Functions, such as the polynomial, which

    are linear in the unknown parameters have important properties and are called linear

    models and will be discussed extensively in Chapters 3 and 4.

    The values of the coef?cients will be determined by ?tting the polynomial to the

    training data. This can be done by minimizing an error function that measures the

    mis?t between the function y(x,w), for any given value of w, and the training set

    data points. One simple choice of error function, which is widely used, is given by

    the sum of the squares of the errors between the predictions y(xn,w) for each data

    point xn and the corresponding target values tn, so that we minimize

    E(w)= 1

    2

    N

    n=1

    {y(xn,w) ? tn}

    2

    (1.2)

    where the factor of 12 is included for later convenience. We shall discuss the mo-

    tivation for this choice of error function later in this chapter. For the moment we

    simply note that it is a nonnegative quantity that would be zero if, and only if, the6 1. INTRODUCTION

    Figure 1.3 The error function (1.2) corre-

    sponds to (one half of) the sum of

    the squares of the displacements

    (shown by the vertical green bars)

    of each data point from the function

    y(x,w).

    t

    x

    y(xn,w)

    tn

    xn

    function y(x,w) were to pass exactly through each training data point. The geomet-

    rical interpretation of the sum-of-squares error function is illustrated in Figure 1.3.

    We can solve the curve ?tting problem by choosing the value of w for which

    E(w) is as small as possible. Because the error function is a quadratic function of

    the coef?cients w, its derivatives with respect to the coef?cients will be linear in the

    elements of w, and so the minimization of the error function has a unique solution,denoted by w, which can be found in closed form. The resulting polynomial is Exercise 1.1

    given by the function y(x,w).

    There remains the problem of choosing the order M of the polynomial, and as

    we shall see this will turn out to be an example of an important concept called model

    comparison or model selection. In Figure 1.4, we show four examples of the results

    of ?tting polynomials having orders M =0, 1, 3,and 9 to the data set shown in

    Figure 1.2.

    We notice that the constant (M =0) and ?rst order (M =1) polynomials

    give rather poor ?ts to the data and consequently rather poor representations of the

    function sin(2πx). The third order (M =3) polynomial seems to give the best ?t

    to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a

    much higher order polynomial (M =9), we obtain an excellent ?t to the training

    data. In fact, the polynomial passes exactly through each data point and E(w)=0.

    However, the ?tted curve oscillates wildly and gives a very poor representation of

    the function sin(2πx). This latter behaviour is known as over-?tting.

    As we have noted earlier, the goal is to achieve good generalization by making

    accurate predictions for new data. We can obtain some quantitative insight into the

    dependence of the generalization performance on M by considering a separate test

    set comprising 100 data points generated using exactly the same procedure used

    to generate the training set points but with new choices for the random noise values

    included in the target values. For each choice ofM, we can then evaluate the residual

    value of E(w) given by (1.2) for the training data, and we can also evaluate E(w)

    for the test data set. It is sometimes more convenient to use the root-mean-square1.1. Example: Polynomial Curve Fitting 7

    x

    t

    M =0

    0 1

    1

    0

    1

    x

    t

    M =1

    0 1

    1

    0

    1

    x

    t

    M =3

    0 1

    1

    0

    1

    x

    t

    M =9

    0 1

    1

    0

    1

    Figure 1.4 Plots of polynomials having various orders M, shown as red curves, ?tted to the data set shown in

    Figure 1.2.

    (RMS) error de?ned by

    ERMS =

    2E(w)N (1.3)

    in which the division by N allows us to compare different sizes of data sets on

    an equal footing, and the square root ensures that ERMS is measured on the same

    scale (and in the same units) as the target variable t. Graphs of the training and

    test set RMS errors are shown, for various values of M, in Figure 1.5. The test

    set error is a measure of how well we are doing in predicting the values of t for

    new data observations of x. We note from Figure 1.5 that small values of M give

    relatively large values of the test set error, and this can be attributed to the fact that

    the corresponding polynomials are rather in?exible and are incapable of capturing

    the oscillations in the function sin(2πx). Values of M in the range 3 M 8

    give small values for the test set error, and these also give reasonable representations

    of the generating function sin(2πx), as can be seen, for the case of M =3, from

    Figure 1.4.8 1. INTRODUCTION

    Figure 1.5 Graphs of the root-mean-square

    error, de?ned by (1.3), evaluated

    on the training set and on an inde-

    pendent test set for various values

    of M.

    M

    ERMS

    0 3 6 9

    0

    0.5

    1

    Training

    Test

    For M =9, the training set error goes to zero, as we might expect because

    this polynomial contains 10 degrees of freedom corresponding to the 10 coef?cients

    w0,...,w9, and so can be tuned exactly to the 10 data points in the training set.

    However, the test set error has become very large and, as we saw in Figure 1.4, the

    corresponding function y(x,w) exhibits wild oscillations.

    This may seem paradoxical because a polynomial of given order contains all

    lower order polynomials as special cases. The M =9 polynomial is therefore capa-

    ble of generating results at least as good as theM =3 polynomial. Furthermore, we

    might suppose that the best predictor of new data would be the function sin(2πx)

    from which the data was generated (and we shall see later that this is indeed the

    case). We know that a power series expansion of the function sin(2πx) contains

    terms of all orders, so we might expect that results should improve monotonically as

    we increase M.

    We can gain some insight into the problem by examining the values of the co-

    ef?cients w obtained from polynomials of various order, as shown in Table 1.1.

    We see that, as M increases, the magnitude of the coef?cients typically gets larger.

    In particular for the M =9 polynomial, the coef?cients have become ?nely tuned

    to the data by developing large positive and negative values so that the correspond-

    Table 1.1 Table of the coef?cients w for

    polynomials of various order.

    Observe how the typical mag-

    nitude of the coef?cients in-

    creases dramatically as the or-

    der of the polynomial increases.

    M =0 M =1 M =6 M =9

    w

    0 0.19 0.82 0.31 0.35

    w

    1 -1.27 7.99 232.37

    w

    2 -25.43 -5321.83

    w

    3 17.37 48568.31

    w

    4 -231639.30

    w

    5 640042.26

    w

    6 -1061800.52

    w

    7 1042400.18

    w

    8 -557682.99

    w

    9 125201.431.1. Example: Polynomial Curve Fitting 9

    x

    t

    N =15

    0 1

    1

    0

    1

    x

    t

    N = 100

    0 1

    1

    0

    1

    Figure 1.6 Plots of the solutions obtained by minimizing the sum-of-squares error function using the M =9

    polynomial for N =15 data points (left plot) and N = 100 data points (right plot). We see that increasing the

    size of the data set reduces the over-?tting problem.

    ing polynomial function matches each of the data points exactly, but between data

    points (particularly near the ends of the range) the function exhibits the large oscilla-

    tions observed in Figure 1.4. Intuitively, what is happening is that the more ?exible

    polynomials with larger values ofM are becoming increasingly tuned to the random

    noise on the target values.

    It is also interesting to examine the behaviour of a given model as the size of the

    data set is varied, as shown in Figure 1.6. We see that, for a given model complexity,the over-?tting problem become less severe as the size of the data set increases.

    Another way to say this is that the larger the data set, the more complex (in other

    words more ?exible) the model that we can afford to ?t to the data. One rough

    heuristic that is sometimes advocated is that the number of data points should be

    no less than some multiple (say 5 or 10) of the number of adaptive parameters in

    the model. However, as we shall see in Chapter 3, the number of parameters is not

    necessarily the most appropriate measure of model complexity.

    Also, there is something rather unsatisfying about having to limit the number of

    parameters in a model according to the size of the available training set. It would

    seem more reasonable to choose the complexity of the model according to the com-

    plexity of the problem being solved. We shall see that the least squares approach

    to ?nding the model parameters represents a speci?c case of maximum likelihood

    (discussed in Section 1.2.5), and that the over-?tting problem can be understood as

    a general property of maximum likelihood. By adopting a Bayesian approach, the Section 3.4

    over-?tting problem can be avoided. We shall see that there is no dif?culty from

    a Bayesian perspective in employing models for which the number of parameters

    greatly exceeds the number of data points. Indeed, in a Bayesian model the effective

    number of parameters adapts automatically to the size of the data set.

    For the moment, however, it is instructive to continue with the current approach

    and to consider how in practice we can apply it to data sets of limited size where we10 1. INTRODUCTION

    x

    t

    ln λ = ?18

    0 1

    1

    0

    1

    x

    t

    ln λ =0

    0 1

    1

    0

    1

    Figure 1.7 Plots of M =9 polynomials ?tted to the data set shown in Figure 1.2 using the regularized error

    function (1.4) for two values of the regularization parameter λ corresponding to ln λ = ?18 and ln λ =0. The

    case of no regularizer, i.e., λ =0, corresponding to ln λ = ?∞, is shown at the bottom right of Figure 1.4.

    may wish to use relatively complex and ?exible models. One technique that is often

    used to control the over-?tting phenomenon in such cases is that of regularization,which involves adding a penalty termto the error function (1.2) in order to discourage

    the coef?cients from reaching large values. The simplest such penalty term takes the

    formof a sumof squares of all of the coef?cients, leading to amodi?ed error function

    of the form

     E(w)= 1

    2

    N

    n=1

    {y(xn,w) ? tn}

    2

    + λ

    2

    w2

    (1.4)

    where w2 ≡ wTw = w2

    0 +w2

    1 +...+w2

    M, and the coef?cient λ governs the rel-

    ative importance of the regularization term compared with the sum-of-squares error

    term. Note that often the coef?cient w0 is omitted from the regularizer because its

    inclusion causes the results to depend on the choice of origin for the target variable

    (Hastie et al., 2001), or it may be included but with its own regularization coef?cient

    (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function

    in (1.4) can be minimized exactly in closed form. Techniques such as this are known Exercise 1.2

    in the statistics literature as shrinkage methods because they reduce the value of the

    coef?cients. The particular case of a quadratic regularizer is called ridge regres-

    sion (Hoerl and Kennard, 1970). In the context of neural networks, this approach is

    known as weight decay.

    Figure 1.7 shows the results of ?tting the polynomial of order M =9 to the

    same data set as before but now using the regularized error function given by (1.4).

    We see that, for a value of ln λ = ?18, the over-?tting has been suppressed and we

    now obtain a much closer representation of the underlying function sin(2πx). If,however, we use too large a value for λ then we again obtain a poor ?t, as shown in

    Figure 1.7 for ln λ =0. The corresponding coef?cients from the ?tted polynomials

    are given in Table 1.2, showing that regularization has the desired effect of reducing1.1. Example: Polynomial Curve Fitting 11

    Table 1.2 Table of the coef?cients w for M =

    9 polynomials with various values for

    the regularization parameter λ. Note

    that ln λ = ?∞ corresponds to a

    model with no regularization, i.e., to

    the graph at the bottom right in Fig-

    ure 1.4. We see that, as the value of

    λ increases, the typical magnitude of

    the coef?cients gets smaller.

    ln λ = ?∞ ln λ = ?18 ln λ =0

    w

    0 0.35 0.35 0.13

    w

    1 232.37 4.74 -0.05

    w

    2 -5321.83 -0.77 -0.06

    w

    3 48568.31 -31.97 -0.05

    w

    4 -231639.30 -3.89 -0.03

    w

    5 640042.26 55.28 -0.02

    w

    6 -1061800.52 41.32 -0.01

    w

    7 1042400.18 -45.95 -0.00

    w

    8 -557682.99 -91.53 0.00

    w

    9 125201.43 72.68 0.01

    the magnitude of the coef?cients.

    The impact of the regularization term on the generalization error can be seen by

    plotting the value of the RMS error (1.3) for both training and test sets against ln λ,as shown in Figure 1.8. We see that in effect λ now controls the effective complexity

    of the model and hence determines the degree of over-?tting.

    The issue of model complexity is an important one and will be discussed at

    length in Section 1.3. Here we simply note that, if we were trying to solve a practical

    application using this approach of minimizing an error function, we would have to

    nd a way to determine a suitable value for the model complexity. The results above

    suggest a simple way of achieving this, namely by taking the available data and

    partitioning it into a training set, used to determine the coef?cients w, and a separate

    validation set, also called a hold-out set, used to optimize the model complexity

    (either M or λ). In many cases, however, this will prove to be too wasteful of

    valuable training data, and we have to seek more sophisticated approaches. Section 1.3

    So far our discussion of polynomial curve ?tting has appealed largely to in-

    tuition. We now seek a more principled approach to solving problems in pattern

    recognition by turning to a discussion of probability theory. As well as providing the

    foundation for nearly all of the subsequent developments in this book, it will also

    Figure 1.8 Graph of the root-mean-square er-

    ror (1.3) versus ln λ for the M =9

    polynomial.

    ERMS

    ln λ ?35 ?30 ?25 ?20

    0

    0.5

    1

    Training

    Test12 1. INTRODUCTION

    give us some important insights into the concepts we have introduced in the con-

    text of polynomial curve ?tting and will allow us to extend these to more complex

    situations.

    1.2. Probability Theory

    A key concept in the ?eld of pattern recognition is that of uncertainty. It arises both

    through noise on measurements, as well as through the ?nite size of data sets. Prob-

    ability theory provides a consistent framework for the quanti?cation and manipula-

    tion of uncertainty and forms one of the central foundations for pattern recognition.

    When combined with decision theory, discussed in Section 1.5, it allows us to make

    optimal predictions given all the information available to us, even though that infor-

    mation may be incomplete or ambiguous.

    We will introduce the basic concepts of probability theory by considering a sim-

    ple example. Imagine we have two boxes, one red and one blue, and in the red box

    we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange.

    This is illustrated in Figure 1.9. Now suppose we randomly pick one of the boxes

    and from that box we randomly select an item of fruit, and having observed which

    sort of fruit it is we replace it in the box from which it came. We could imagine

    repeating this process many times. Let us suppose that in so doing we pick the red

    box 40% of the time and we pick the blue box 60% of the time, and that when we

    remove an item of fruit from a box we are equally likely to select any of the pieces

    of fruit in the box.

    In this example, the identity of the box that will be chosen is a random variable,which we shall denote by B. This random variable can take one of two possible

    values, namely r (corresponding to the red box) or b (corresponding to the blue

    box). Similarly, the identity of the fruit is also a random variable and will be denoted

    by F. It can take either of the values a (for apple) or o (for orange).

    To begin with, we shall de?ne the probability of an event to be the fraction

    of times that event occurs out of the total number of trials, in the limit that the total

    number of trials goes to in?nity. Thus the probability of selecting the red box is 410

    Figure 1.9 We use a simple example of two

    coloured boxes each containing fruit

    (apples shown in green and or-

    anges shown in orange) to intro-

    duce the basic ideas of probability.1.2. Probability Theory 13

    Figure 1.10 We can derive the sum and product rules of probability by

    considering two random variables, X, which takes the values {xi} where

    i =1,...,M, and Y , which takes the values {yj } where j =1,...,L.

    In this illustration we have M =5 and L =3. If we consider a total

    number N of instances of these variables, then we denote the number

    of instances where X = xi and Y = yj by nij , which is the number of

    points in the corresponding cell of the array. The number of points in

    column i, corresponding to X = xi, is denoted by ci, and the number of

    points in row j, corresponding to Y = yj , is denoted by rj .

    }

    } ci

    rj yj

    xi

    nij

    and the probability of selecting the blue box is 610. We write these probabilities

    as p(B = r)=410 and p(B = b)=610. Note that, by de?nition, probabilities

    must lie in the interval [0, 1]. Also, if the events are mutually exclusive and if they

    include all possible outcomes (for instance, in this example the box must be either

    red or blue), then we see that the probabilities for those events must sum to one.

    We can now ask questions such as: “what is the overall probability that the se-

    lection procedure will pick an apple?”, or “given that we have chosen an orange,what is the probability that the box we chose was the blue one?”. We can answer

    questions such as these, and indeed much more complex questions associated with

    problems in pattern recognition, once we have equipped ourselves with the two el-

    ementary rules of probability, known as the sum rule and the product rule. Having

    obtained these rules, we shall then return to our boxes of fruit example.

    In order to derive the rules of probability, consider the slightly more general ex-

    ample shown in Figure 1.10 involving two random variables X and Y (which could

    for instance be the Box and Fruit variables considered above). We shall suppose that

    X can take any of the values xi where i =1,...,M, and Y can take the values yj

    where j =1,...,L. Consider a total of N trials in which we sample both of the

    variables X and Y , and let the number of such trials in which X = xi and Y = yj

    be nij . Also, let the number of trials in which X takes the value xi (irrespective

    of the value that Y takes) be denoted by ci, and similarly let the number of trials in

    which Y takes the value yj be denoted by rj .

    The probability that X will take the value xi and Y will take the value yj is

    written p(X = xi,Y = yj ) and is called the joint probability of X = xi and

    Y = yj . It is given by the number of points falling in the cell i,j as a fraction of the

    total number of points, and hence

    p(X = xi,Y = yj )= nij

    N . (1.5)

    Here we are implicitly considering the limit N →∞. Similarly, the probability that

    X takes the value xi irrespective of the value of Y is written as p(X = xi) and is

    given by the fraction of the total number of points that fall in column i, so that

    p(X = xi)= ci

    N . (1.6)

    Because the number of instances in column i in Figure 1.10 is just the sum of the

    number of instances in each cell of that column, we have ci = j nij and therefore,14 1. INTRODUCTION

    from (1.5) and (1.6), we have

    p(X = xi)=

    L

    j=1

    p(X = xi,Y = yj ) (1.7)

    which is the sum rule of probability. Note that p(X = xi) is sometimes called the

    marginal probability, because it is obtained by marginalizing, or summing out, the

    other variables (in this case Y ).

    If we consider only those instances for which X = xi, then the fraction of

    such instances for which Y = yj is written p(Y = yj |X = xi) and is called the

    conditional probability of Y = yj given X = xi. It is obtained by ?nding the

    fraction of those points in column i that fall in cell i,j and hence is given by

    p(Y = yj |X = xi)= nij

    ci

    . (1.8)

    From (1.5), (1.6), and (1.8), we can then derive the following relationship

    p(X = xi,Y = yj )= nij

    N = nij

    ci

    ·

    ci

    N

    = p(Y = yj |X = xi)p(X = xi) (1.9)

    which is the product rule of probability.

    So far we have been quite careful to make a distinction between a random vari- ......

http://www.100md.com/html/file/202011/185692.htm

您现在查看是摘要介绍页，详见PDF附件(10365KB，749页)。