Online Learning 1: Bias-detecting online learners
Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.
Introduction
Thanks to the recent paper on logical inductors, we are currently interested in what can be done with weighted ensembles of predictors. In particular, it would be neat if the idea of finding the fixed point of a set of predictors could be applied to online learning.

This approach fits neatly into the framework of prediction with expert advice in online convex optimization. Usually, in online convex optimization, you can achieve low cumulative regret with respect to experts that make fixed predictions. By finding the experts' fixed point, however, we can obtain low cumulative regret with respect to experts that make modifications to our predictions.
Setup
In each round $t = 1, \dots, T$, the learner predicts a vector $w_t \in Y$, where $Y$ is some compact convex subset of Euclidean space. It does this with the advice of a set of expert predictors $E = \{e_1, \dots, e_n\}$. Each expert $e_i : Y \to Y$ is a continuous function that maps a prediction to a prediction. These experts are interpreted as hypotheses about how predictions can be improved, and they are known to the learner. After lodging its prediction, the learner is informed of the loss that each expert's advice would have incurred: $\{\ell_t(e_1(w_t)), \dots, \ell_t(e_n(w_t))\}$. The loss function $\ell_t : Y \to \mathbb{R}$ is assumed to be convex and $L_t$-Lipschitz with respect to $\|\cdot\|_1$.

The learner's goal is to minimize its regret at time $T$, relative to the advice of its experts, $\sup_{e \in E} \mathrm{Regret}_T(e)$, where:

$$\mathrm{Regret}_T(e) := \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(e(w_t))$$
Proposed learning approach
The basic idea is that the learner maintains a weighting $v_t$ that indicates its relative level of trust in each expert. This lies on the simplex, $v_t \in \Delta^n := \{x \in [0,1]^n : \sum_{i=1}^n x_i = 1\}$, with each element $v_{t,i}$ giving the credence in the $i$th expert. At each time step, this weighting is used to make a prediction $w_t$. Specifically, the prediction is chosen so that the weighted ensemble of experts would not alter it (i.e. it lies at the ensemble's fixed point).
Formally, define the weighted combination of the experts at time step $t$, $e^*_t : Y \to Y$, as:

$$e^*_t(w) := \sum_{i=1}^n v_{t,i}\, e_i(w)$$
For the first time step, each $v_{1,i}$ may simply be initialized to $\frac{1}{n}$, giving equal weight to each of the expert predictors. Then, at each time step $t$, the learner predicts a vector $w_t$ that lies at a fixed point of $e^*_t$, so that:

$$e^*_t(w_t) = w_t$$
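To make the fixed-point step concrete, here is a minimal sketch in Python. The toy experts, the target set $Y = [0,1]^2$, and the iteration scheme are all illustrative assumptions: Brouwer guarantees a fixed point exists, but plain iteration only finds it when the ensemble happens to be a contraction, as these toy experts are.

```python
import numpy as np

def ensemble(experts, v):
    """Weighted combination e*(w) = sum_i v_i * e_i(w)."""
    return lambda w: sum(vi * e(w) for vi, e in zip(v, experts))

def fixed_point(f, w0, tol=1e-10, max_iter=10_000):
    """Find w with f(w) = w by simple iteration.

    Illustrative assumption: f is a contraction, so iteration
    converges. In general Brouwer gives existence only, and a
    different root-finding method would be needed.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_next = f(w)
        if np.linalg.norm(w_next - w) < tol:
            return w_next
        w = w_next
    return w

# Two hypothetical experts on Y = [0, 1]^2: each nudges the
# prediction halfway toward its own preferred point.
experts = [lambda w: 0.5 * w + 0.5 * np.array([1.0, 0.0]),
           lambda w: 0.5 * w + 0.5 * np.array([0.0, 1.0])]
v = np.array([0.5, 0.5])  # uniform initial weighting v_1
w = fixed_point(ensemble(experts, v), np.zeros(2))
# At the fixed point, the weighted ensemble leaves w unchanged.
```

With the uniform weighting, the ensemble map is $w \mapsto \frac{1}{2}w + (\frac{1}{4}, \frac{1}{4})$, whose fixed point is $(\frac{1}{2}, \frac{1}{2})$.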
By Brouwer’s fixed point theorem, at least one such fixed point exists, since $e^*_t$ is a continuous map from the compact convex set $Y$ to itself.
After submitting this prediction, the learner discovers the loss that each expert would have incurred. It uses these losses to compute a new weighting $v_{t+1}$ for the next time step. This is done by performing an update step with respect to $\ell'_t : \Delta^n \to \mathbb{R}$, which will be convex and $L_t$-Lipschitz with respect to $\|\cdot\|_1$:

$$\ell'_t(v) := \ell_t\!\left(\sum_{i=1}^n v_i\, e_i(w_t)\right)$$
This can be done using the exponentiated gradients algorithm, or some other online convex optimization algorithm. If exponentiated gradients is used, we obtain a regret bound of [1]:

$$\sum_{t=1}^T \ell'_t(v_t) - \sum_{t=1}^T \ell'_t(v^*) \le L\sqrt{2\log(n)T}$$

where $v^* \in \arg\min_{v \in \Delta^n} \sum_{t=1}^T \ell'_t(v)$ is the optimal fixed weighting over experts and $L = \sqrt{\frac{1}{T}\sum_{t=1}^T L_t^2}$ depends on the Lipschitz constants of the loss functions. Recall also that $n$ is the number of experts.
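A single exponentiated-gradient step is simple to sketch. The step size and the choice to feed in per-expert losses as the gradient of a linearized loss are illustrative assumptions, not prescriptions from the analysis above:

```python
import numpy as np

def eg_update(v, grad, eta):
    """One exponentiated-gradient step on the simplex.

    v    : current weighting on the simplex
    grad : gradient of the round's loss l'_t at v
    eta  : step size (in practice chosen from the regret analysis;
           the value below is an arbitrary illustration)
    """
    w = v * np.exp(-eta * grad)
    return w / w.sum()  # renormalize back onto the simplex

# Toy round with three experts: use the per-expert losses as the
# gradient of a linearized loss (an illustrative assumption).
v = np.ones(3) / 3
expert_losses = np.array([0.9, 0.1, 0.5])
v_new = eg_update(v, expert_losses, eta=0.5)
# Weight shifts toward the low-loss expert (index 1).
```

The multiplicative form of the update keeps the weights strictly positive, so no expert is ever ruled out permanently; it is merely down-weighted exponentially in its accumulated gradient.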
Equivalently:

$$\sum_{t=1}^T \ell_t\!\left(\sum_{i=1}^n v_{t,i}\, e_i(w_t)\right) - \sum_{t=1}^T \ell_t\!\left(\sum_{i=1}^n v^*_i\, e_i(w_t)\right) \le L\sqrt{2\log(n)T}$$
This bounds the regret with respect to the advice of any particular expert, because each expert $i$ corresponds to the case in which that expert is given the full weighting $v_i = 1$, and because $\sum_{i=1}^n v_{t,i}\, e_i(w_t) = e^*_t(w_t) = w_t$:

$$\forall i \quad \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(e_i(w_t)) \le L\sqrt{2\log(n)T}$$

That is:

$$\sup_{e \in E} \mathrm{Regret}_T(e) \le L\sqrt{2\log(n)T}$$
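Putting the pieces together, the whole learner can be sketched as a short loop: predict the ensemble's fixed point, observe each expert's counterfactual loss, and take an exponentiated-gradient step. Everything concrete here is an illustrative assumption: the two toy contraction experts, the squared loss against a noisy target, the step size, and the use of fixed-point iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy experts on Y = [0, 1]^2, as continuous maps.
experts = [lambda w: 0.5 * w + 0.5 * np.array([1.0, 0.0]),
           lambda w: 0.5 * w + 0.5 * np.array([0.0, 1.0])]
n, T, eta = len(experts), 200, 0.1
target = np.array([0.9, 0.1])  # expert 0's advice is better here

v = np.ones(n) / n
learner_loss = np.zeros(T)
expert_loss = np.zeros((T, n))
for t in range(T):
    # 1. Predict the fixed point of the weighted ensemble (plain
    #    iteration, valid here because the toy experts are contractions).
    w = np.zeros(2)
    for _ in range(200):
        w = sum(vi * e(w) for vi, e in zip(v, experts))
    # Round-t loss: l_t(y) = ||y - z_t||^2 for a noisy target z_t.
    z = target + 0.05 * rng.standard_normal(2)
    learner_loss[t] = np.sum((w - z) ** 2)
    advice = np.stack([e(w) for e in experts])
    expert_loss[t] = np.sum((advice - z) ** 2, axis=1)
    # 2. EG step on l'_t(v) = l_t(sum_i v_i e_i(w)); its gradient is
    #    grad_i = 2 (w - z) . e_i(w), since sum_i v_{t,i} e_i(w) = w.
    grad = 2.0 * advice @ (w - z)
    v = v * np.exp(-eta * grad)
    v /= v.sum()

regret = learner_loss.sum() - expert_loss.sum(axis=0).min()
# The weighting concentrates on expert 0, whose advice pulls
# predictions toward the target, so per-round regret shrinks.
```

In this toy run the weights settle near the equilibrium where neither expert's modification would help, illustrating the intended behavior: the learner's prediction ends up at a point that the trusted ensemble would not alter.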
Discussion
This is more general than the usual use of prediction with expert advice, because the experts are more varied: instead of representing a constant prediction, each expert represents a continuous map from the space of prediction vectors to itself. One limitation, however, is that in order for the fixed point argument to work, the prediction vectors $w_t$ must lie in a compact convex set. So long as these vectors predict some quantity with known bounds, though, this is not a severe limitation. Another limitation is that locating the fixed point of the experts may be very computationally difficult.
Footnotes
[1] Given in Corollary 2.14, page 140 of the review: Shalev-Shwartz, Shai. “Online learning and online convex optimization.” Foundations and Trends in Machine Learning 4.2 (2011): 107–194.