Organizers
Alan Fern, Oregon State University
Liang Huang, Oregon State University
Jana Doppa, Washington State University (Contact Organizer)
Time and Location
1:45 to 5:30pm in Room 2 on July 11, 2016
Description
Structured Prediction (SP) deals with the task of mapping a structured input
(e.g., a sequence of words) to a structured output (e.g., a sequence of
part-of-speech tags). Many important applications in natural language
processing, computer vision, and bioinformatics can be naturally
formulated as structured output prediction problems. Some examples are
as follows: parsing (predicting the parse tree for a given sentence);
information extraction (detecting entity and event mentions in a
given text); co-reference resolution (recognizing and resolving
co-references of entities and events); machine translation (translating
a sentence from one language to another); object detection (detecting
objects in a given image); semantic segmentation (labeling each pixel
in an image with the corresponding semantic label); object tracking in
videos (predicting the tracks of multiple objects in a video);
text-to-speech and speech-to-text mapping; and protein structure
prediction (predicting the three-dimensional structure of a protein
from its amino acid sequence).
Each of these prediction
problems has a huge number of possible interpretations or outputs
(e.g., many possible part of speech taggings for a sentence). A
standard approach to structured prediction is to learn a cost function
for scoring a potential output for each input. Given such a cost
function and a new input, the output computation involves solving the
so-called ``Argmin inference problem,'' which is to find the minimum
cost output for the corresponding input. Unfortunately, exactly solving
the Argmin inference problem is often intractable (computationally very
hard) except for some special cases.
This tutorial will focus on
computational frameworks that integrate "learning" and "search" in a
principled manner to solve structured prediction problems.
Outline of the Tutorial
Part 1: Introduction
Motivation via Diverse Application Domains
- Natural language processing
- Computer vision
- Bioinformatics
- Planning
Classification to Structured Prediction
- Standard machine learning setup w/ simple outputs
- Simple outputs vs. Structured outputs
- Key challenge: large number of candidate structured outputs
Part 2: Cost Function Learning Framework and Argmin Inference Challenge
Cost Function Learning Algorithms: Generalizations of Classification Algorithms
- Perceptron => Structured Perceptron
- SVM => Structured SVM
- Naive Bayes => HMMs
- Logistic Regression => CRFs
Key Elements of Cost Function Learning Framework
- Joint feature function over structured input-output pairs
- (Loss-augmented) Argmin inference solver
- Optimization algorithm to tune the weights of the cost function
Generic Learning Framework and Challenges
- Repeatedly call the Argmin inference solver with the current weights on each training input and update the weights via an online or batch optimization algorithm (a minimal perceptron-style sketch appears after this list)
- Expressiveness of features vs. Time complexity of Argmin inference solver
- Learning w/ "Exact" Inference vs. "Approximate" Inference
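As a minimal sketch of this generic loop, here is a structured-perceptron-style instantiation in Python; the training data, the joint feature function phi, and the argmin_inference solver are placeholder components assumed for illustration, not part of the tutorial materials.

    import numpy as np

    def structured_perceptron(train_data, phi, argmin_inference, dim, num_epochs=10):
        """train_data: list of (x, y) pairs with structured outputs y (e.g., label tuples).
        phi(x, y): joint feature vector of dimension dim (numpy array).
        argmin_inference(x, w): minimum-cost output for x under weights w."""
        w = np.zeros(dim)
        for _ in range(num_epochs):
            for x, y_true in train_data:
                # Solve the Argmin inference problem with the current weights.
                y_pred = argmin_inference(x, w)
                if y_pred != y_true:
                    # Online update: lower the cost of the true output and
                    # raise the cost of the predicted output.
                    w += phi(x, y_pred) - phi(x, y_true)
        return w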
Focus of Tutorial
- Integrating "Learning" and "Search", two fundamental branches of AI, to solve structured prediction problems
- Structured Prediction as a search process and learning to improve the accuracy of search using training data
Part 3: Overview of Search Concepts: Search Space, Search Strategy, Search Control Knowledge etc.
Primitive Search Spaces:
- Initial State Function, Successor Function, and Terminal State Function
- Ordered vs. Unordered Search Spaces: making the decisions in a fixed order (e.g., left-to-right in sequence labeling) vs. no fixed ordering
- Single target path from initial state to goal state vs. Multiple target paths
Search Procedures: Best-First Search
- No pruning at one extreme and greedy search at the other extreme
- Beam search with beam width B \in [1, \infty] to cover the entire spectrum (a small breadth-first sketch follows this list)
- Best-first vs. Breadth-first Beam search
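A small Python sketch of breadth-first beam search over such a primitive search space; initial_state, successors, is_terminal, and the state-scoring function score are assumed interfaces for illustration. Beam width 1 recovers greedy search, and an unbounded width recovers search with no pruning.

    def breadth_first_beam_search(initial_state, successors, is_terminal, score, beam_width):
        """Assumes every path reaches a terminal state after finitely many steps."""
        beam = [initial_state]
        while True:
            candidates = []
            for state in beam:
                if is_terminal(state):
                    candidates.append(state)       # terminal states are kept, not expanded
                else:
                    candidates.extend(successors(state))
            # Prune: keep only the beam_width best-scoring states.
            candidates.sort(key=score)
            beam = candidates[:beam_width]
            if all(is_terminal(s) for s in beam):
                return beam[0]                     # best-scoring terminal state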
Search Control Knowledge
- Policies, Pruning Rules, and Heuristic Functions
Cost Function
- Scoring function to evaluate the terminal states
Part 4: Control Knowledge Learning Framework: Greedy Methods
Control Knowledge Learning Framework
- Given an ordered or unordered search space and a cost function, how to learn a greedy policy that quickly guides the search towards low-cost terminal states?
Greedy Methods
- Ordered Search Space: Classifier-based structured prediction
- Unordered Search Space: Easy-first approach
Classifier-based Structured Prediction
- Connections to Imitation Learning (IL)
- Exact imitation and error propagation (Theoretical results from Khardon, Syed & Schapire, Ross & Bagnell)
- Advanced Imitation Learning Algorithms: SEARN, SMILe, DAgger, AggreVaTe, LOLS
- DAgger algorithm as a representative example (a rough sketch follows this list)
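A rough Python sketch of a DAgger-style loop for this setting; the rollout routine, the oracle (expert) policy, and the base classifier trainer are assumed components, and the usual mixing of expert and learned policies across iterations is omitted for brevity, so this is an illustration rather than the exact published algorithm.

    def dagger(train_inputs, expert_policy, rollout, train_classifier, num_iterations=10):
        """rollout(x, policy): the list of states visited while predicting the
        output for x with policy; expert_policy(state): the oracle's action."""
        dataset = []             # aggregated (state, expert action) pairs
        policy = expert_policy   # the first iteration rolls out the expert itself
        for _ in range(num_iterations):
            for x in train_inputs:
                for state in rollout(x, policy):
                    # Label every visited state with the expert's action.
                    dataset.append((state, expert_policy(state)))
            # Retrain the policy on the aggregated dataset.
            policy = train_classifier(dataset)
        return policy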
Easy-First Approach
- Key idea: learn to make the easy prediction decisions first so that they constrain the harder decisions, akin to constraint satisfaction algorithms (a small decoding sketch follows this list)
- Imitation learning with a non-deterministic oracle policy: ties are broken with the learned policy (scoring function)
- Standard training approach: best-scoring target node vs. best-scoring non-target node
- Optimization-based learning approach: best-scoring target node vs. all violated non-target nodes with a passive-aggressive regularizer
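A small Python sketch of the easy-first decoding loop, under the assumption of hypothetical candidate_actions and apply_action interfaces for an unordered search space; at each step the learned scoring function commits to the most confident remaining decision first.

    def easy_first_decode(x, candidate_actions, apply_action, score):
        state = x
        while True:
            actions = candidate_actions(state)
            if not actions:
                return state                   # all decisions have been made
            # Commit to the highest-scoring ("easiest") decision first; its
            # outcome constrains the remaining, harder decisions.
            best = max(actions, key=lambda a: score(state, a))
            state = apply_action(state, best)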
Part 5: Control Knowledge Learning Framework: Beam Search Methods
Given an ordered or unordered search space and a cost function, how to learn a heuristic function that quickly guides the search procedure towards low-cost terminal states?
General Framework: 3 Choices
- How to define the notion of search error?
- How to learn (update weights) when we encounter a search error?
- What to do after learning?
Different Instantiations
- Early update instantiation (a condensed sketch follows this list)
- Max-violation update instantiation
- LaSO instantiation
- Convergence results
- Rate of convergence is inversely proportional to the beam width (amount of search allowed)
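A condensed Python sketch of the early-update instantiation: run breadth-first beam search with the current weights and, as soon as the target path falls off the beam, perform a perceptron-style update and abandon the example. The search-space interface and the gold_state_at helper are assumptions made for illustration.

    import numpy as np

    def early_update_epoch(train_data, initial_state, successors, phi, gold_state_at, w, beam_width):
        """gold_state_at(x, y_gold, depth): the state on the target path at the
        given depth (a hypothetical helper; assumes a recoverable target path)."""
        for x, y_gold in train_data:
            beam, depth = [initial_state(x)], 0
            while beam:
                depth += 1
                candidates = [s2 for s in beam for s2 in successors(s)]
                if not candidates:
                    break                              # all beam states were terminal
                # Keep the beam_width lowest-cost states under the current weights.
                candidates.sort(key=lambda s: np.dot(w, phi(x, s)))
                beam = candidates[:beam_width]
                gold = gold_state_at(x, y_gold, depth)
                if gold not in beam:
                    # Search error: the target path fell off the beam. Update
                    # the weights and move to the next example (early update).
                    w += phi(x, beam[0]) - phi(x, gold)
                    break
        return w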
Part 6: HC-Search: A Unifying framework for Cost Function Learning and Control Knowledge Learning
Unifying View
- Adding C to control knowledge learning or Adding H to cost function learning
- Without H, becomes cost function learning
- Without C, becomes control knowledge learning
Overview
- Key idea: generate high-quality candidate outputs via a search procedure guided by a learned heuristic H, and score the candidate outputs with a learned cost function C to select the least-cost candidate as the prediction (a minimal prediction-loop sketch follows this list)
- HC-Search supports learning to improve both speed and accuracy of structured prediction
- H can be learned either in a primitive space (see IJCAI'16 paper on incremental parsing) or in a complete output space (see also IJCAI'16 paper on heuristic search for computing M-Best Modes)
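A minimal Python sketch of HC-Search prediction as summarized above; the search routine and the learned H and C models are assumed interfaces, not a fixed API.

    def hc_search_predict(x, search, H, C, search_bound):
        # Generate high-quality candidate outputs via heuristic-guided search.
        candidates = search(x, heuristic=H, bound=search_bound)
        # Score the candidates with the cost function and return the least-cost one.
        return min(candidates, key=lambda y: C(x, y))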
HC-Search: Learning Approach
- How to learn H and C given a complete output space, search procedure, search bound, and loss function over a training set?
- Loss decomposition: generation error and selection error
- Greedy stage-wise loss minimization
- Learning H via Imitation Learning
- Learning C via Rank Learning
Search Space Design
- Quality of a search space: concept of (expected) target depth of a search space
- Flipbit search space (a tiny successor-function sketch follows this list)
- Limited Discrepancy Search (LDS) Space parameterized by a recurrent classifier or greedy policy (see also IJCAI'16 paper on LDS for AND/OR search spaces)
- Optimizing the quality of the LDS space by optimizing the accuracy of the recurrent classifier
- Sparse LDS spaces to trade off speed and accuracy
- Randomized Segmentation Space for computer vision problems: probabilistically sample likely object configurations in the image from a hierarchical segmentation tree
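A tiny Python sketch of the successor function for the flipbit search space over complete outputs: each state is a full labeling, and its successors differ from it in exactly one position. The tuple representation and label set are assumptions for illustration.

    def flipbit_successors(y, label_set):
        """y: a complete structured output represented as a tuple of labels."""
        successors = []
        for i, current in enumerate(y):
            for label in label_set:
                if label != current:
                    # Flip position i to a different label.
                    successors.append(y[:i] + (label,) + y[i + 1:])
        return successors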
Engineering Methodology for Applying HC-Search
Relation to Other Methods
- Inference in HC-Search vs. Inference in CRF/SSVM
- Re-ranking style algorithms (generative models, M-Best diverse modes, DPPs, etc.)
Part 7: Future Directions
- Designing and optimizing search spaces for complex structured prediction problems
- Leveraging deep learning techniques to enhance (search-based) structured prediction approaches
- What architectures are more suitable for "Anytime predictions" and what is the best way to perform learning for making anytime predictions
- Theoretical analysis: sample complexity and generalization bounds
- More ...
Background of Presenters
Alan Fern is a Professor of Computer Science at Oregon State University. He
received his Ph.D. (2004) and M.S. (2000) in Computer Engineering from
Purdue University under the direction of Robert Givan, and his B.S.
(1997) in Electrical Engineering from the University of Maine. He
received a National Science Foundation CAREER award in 2006, is an
associate editor of the Machine Learning Journal, and currently serves
on the editorial boards of the Journal of Artificial Intelligence
Research and the Artificial Intelligence Journal. His research
interests span a range of topics in artificial intelligence, including
machine learning and automated planning, with a particular interest in
work falling at their intersection. He is also interested in
applications of AI, including AI for real-time strategy games and
activity recognition from videos of American football and other sports.
Liang Huang
is an Assistant Professor of Computer Science at Oregon State
University. He was most recently an assistant professor at the City
University of New York (CUNY) and a part-time research scientist at IBM
T. J. Watson Research Center. He graduated in 2008 from the University
of Pennsylvania and has worked as a research scientist at Google and a
research assistant professor at USC/ISI. Most of his work develops fast
algorithms and provable theory to speed up large-scale natural language
processing and structured machine learning. He has received a Best
Paper Award at ACL 2008, several best paper nominations (ACL 2007,
EMNLP 2008, and ACL 2010), two Google Faculty Research Awards (2010 and
2013), a Yahoo Faculty Research Award (2015), and a University Teaching
Prize at Penn (2005). His research has been supported by DARPA, NSF,
Google, and Yahoo. He also co-authored a best-selling textbook in China
on algorithms for programming contests.
Jana Doppa is
an Assistant Professor of Computer Science at Washington State
University, Pullman. He earned his PhD working with the Artificial
Intelligence group at Oregon State University (2014) and his MTech
from Indian Institute of Technology (IIT), Kanpur, India (2006). His
general research interests are in the broad field of Artificial
Intelligence (AI) and its applications including planning, natural
language processing, and computer vision. He received an Outstanding
Paper Award for his structured prediction work at the AAAI (2013)
conference and a Google Faculty Research Award (2015). His PhD
dissertation entitled ``Integrating Learning and Search for Structured
Prediction'' was nominated for the ACM Distinguished Dissertation Award
(2015). He has organized a successful workshop at ICML (2013) on
structured prediction and interactions between inference and learning.