Samaneh Azadi Suvrit Sra UC Berkeley Max Planck Institute...
-
Towards optimal stochastic ADMM
Samaneh Azadi, Suvrit Sra
UC Berkeley, Max Planck Institute, Tübingen
Thanks: Aaditya Ramdas (CMU)
-
Problem
Linearly constrained stochastic convex optimization
min E_ξ[F(x, ξ)] + h(y)   s.t.   Ax + By = b,  and  x ∈ X, y ∈ Y
∀ξ, F(·, ξ): closed and convex; h(y): closed and convex; X and Y: compact convex sets.
- f ≡ E[F(x, ξ)]: loss function
- h: regularizer; encodes generalization and structure
2 / 24
-
Comparison of convergence rates
Previous methods:
- ADMM: not a stochastic optimization method.
- SADMM: Ouyang et al. (2013) and Suzuki (2013), leading to suboptimal convergence rates.

                          SADMM          Optimal SADMM
  strongly convex         O(log k / k)   O(1/k)
  Lipschitz cont. grads   O(1/k)         O(1/k²)
3 / 24
-
ADMM
min_{x∈X, y∈Y} f(x) + h(y),   s.t.  Ax + By − b = 0

Introducing the augmented Lagrangian:

Lβ(x, y, λ) := f(x) + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂

λ: dual variable; β: penalty parameter.
Algorithm 1:
Initialize: x0, y0, and λ0.
for k ≥ 0 do
    xk+1 ← argmin_{x∈X} Lβ(x, yk, λk)
    yk+1 ← argmin_{y∈Y} Lβ(xk+1, y, λk)
    λk+1 ← λk − β(Axk+1 + Byk+1 − b)
end
4 / 24
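The updates of Algorithm 1 can be sketched on a toy problem where both subproblems have closed forms. The instance below (a two-variable quadratic with A = B = 1) is a hypothetical illustration, not an example from the talk:

```python
# Toy ADMM instance (hypothetical, for illustration only):
#   min 0.5*(x - c)**2 + 0.5*(y - d)**2   s.t.  x + y = b   (A = B = 1)
# Both subproblems of Algorithm 1 are solved in closed form by setting
# the gradient of the augmented Lagrangian to zero.

def admm(c, d, b, beta=1.0, iters=200):
    x = y = lam = 0.0
    for _ in range(iters):
        # x-step: argmin_x L_beta(x, y, lam); gradient
        # (x - c) - lam + beta*(x + y - b) = 0
        x = (c + lam + beta * (b - y)) / (1.0 + beta)
        # y-step: argmin_y L_beta(x, y, lam), using the fresh x
        y = (d + lam + beta * (b - x)) / (1.0 + beta)
        # dual update on the constraint residual
        lam = lam - beta * (x + y - b)
    return x, y

x_hat, y_hat = admm(c=1.0, d=3.0, b=2.0)
# Analytic solution: x* = (b + c - d)/2, y* = (b + d - c)/2
```

With c=1, d=3, b=2 the iterates converge geometrically to (x*, y*) = (0, 2).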
-
SADMM
- For stochastic problems over a potentially unknown distribution,
- Modified augmented Lagrangian:

Lᵏβ(x, y, λ) := f(xk) + ⟨gk, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2ηk))‖x − xk‖²₂    (1)

⟹ Linearize f(x); gk: a stochastic (sub)gradient of f.
5 / 24
-
SADMM
- For stochastic problems over a potentially unknown distribution,
- Modified augmented Lagrangian:

Lᵏβ(x, y, λ) := f(xk) + ⟨gk, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2ηk))‖x − xk‖²₂    (2)

The prox-term ‖x − xk‖²₂ ensures that (2) has a unique solution and aids the convergence analysis.
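When X is the whole space and A = I, the linearized x-step of the modified Lagrangian has a closed form obtained by setting the gradient of the quadratic-plus-prox objective to zero. A minimal scalar sketch under those simplifying assumptions (all names are illustrative):

```python
# Closed-form linearized x-step, assuming X = R^n and A = I (scalar for
# brevity; names are illustrative).  Setting to zero the gradient of
#   <g_k, x> - <lam, x + By - b> + (beta/2)*(x + By - b)**2
#     + (1/(2*eta))*(x - x_k)**2
# gives:  g_k - lam + beta*(x + By - b) + (x - x_k)/eta = 0.

def sadmm_x_step(xk, gk, lam, By_minus_b, beta, eta):
    return (xk / eta + lam - gk - beta * By_minus_b) / (beta + 1.0 / eta)

x_new = sadmm_x_step(xk=1.0, gk=0.5, lam=0.2, By_minus_b=0.3, beta=2.0, eta=0.5)
```

A quick check is that the gradient of the subproblem objective vanishes at the returned point.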
6 / 24
-
SADMM for strongly convex f
Assumptions:
- Bounded subgradients
- Compact X, Y; bounded dual variables.
Algorithm:
Similar to the SADMM algorithm.
7 / 24
-
SADMM for strongly convex f
Modification:
x̄k := (2/(k(k+1))) ∑_{j=0}^{k−1} (j+1) xj,    ȳk := (2/(k(k+1))) ∑_{j=1}^{k} j yj    (3)
Using nonuniform averaging of the iterates
Aim:
Giving higher weight to more recent iterates.
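The nonuniform averaging in (3) can be sketched directly; weighting xj by (j + 1) tilts the average toward recent iterates:

```python
# Nonuniform averaging of (3), sketched for scalar iterates:
#   x_bar_k = 2/(k(k+1)) * sum_{j=0}^{k-1} (j+1) * x_j

def weighted_average(xs):
    k = len(xs)  # xs = [x_0, ..., x_{k-1}]
    return 2.0 * sum((j + 1) * x for j, x in enumerate(xs)) / (k * (k + 1))

# The (j+1) weights favor recent iterates: for [0, 0, 3] the weighted
# average is 1.5, versus 1.0 for the plain uniform average.
```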
8 / 24
-
SADMM for strongly convex f
Result:
For a specific stepsize ηk :
E[f(x̄k) − f(x*) + h(ȳk) − h(y*) + ρ‖Ax̄k + Bȳk − b‖₂] ≤ 2G²/(μ(k+1)) + βD²_Y/(2(k+1)) + 2ρ²/(β(k+1))
=⇒ Convergence rate: O(1/k)
9 / 24
-
SADMM for smooth f
Assumptions:
- Bounded noise variance
- Compact X, Y; bounded dual variables.
Algorithm 2:
Input: sequence (γk) of interpolation parameters; stepsizes ηk = (L + αk)⁻¹.
Initialize: x0 = z0, y0.
for k ≥ 0 do
    pk ← (1 − γk)xk + γk zk
    E[gk] = ∇f(pk)
    ...
10 / 24
-
SADMM for smooth f
    zk+1 ← argmin_{x∈X} L̂ᵏβ(x, yk, λk)
Interpolatory sequences (pk) and (zk), and "stepsizes" γk based on fast-gradient methods.
    xk+1 ← (1 − γk)xk + γk zk+1
Updating x by first computing zk+1, using a weighted prox-term enforcing proximity to zk.
    yk+1 ← argmin_{y∈Y} L̂ᵏβ(zk+1, y, λk)
Updating y using an AL term that depends on zk+1 instead of xk+1, for simplification.
    λk+1 ← λk − β(Azk+1 + Byk+1 − b)
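The interpolation pattern above mirrors fast-gradient methods. The sketch below shows that shape on a toy unconstrained smooth objective, replacing the zk+1-subproblem with a plain gradient step of size 1/(L·γk), so it is an accelerated-gradient illustration under those assumptions, not the full constrained update:

```python
# Shape of the accelerated iteration, sketched on a toy unconstrained
# smooth objective f(x) = 0.5*(x - 5)**2 with L = 1.  The z-subproblem
# is replaced by a gradient step, so this is a fast-gradient sketch,
# not the constrained SADMM update.

def accelerated_steps(grad, x0, L, iters=50):
    x = z = x0
    for k in range(iters):
        gamma = 2.0 / (k + 2)             # interpolation weight gamma_k
        p = (1 - gamma) * x + gamma * z   # p_k <- (1-gamma_k) x_k + gamma_k z_k
        z = z - grad(p) / (L * gamma)     # proxy for the z_{k+1} subproblem
        x = (1 - gamma) * x + gamma * z   # x_{k+1} <- (1-gamma_k) x_k + gamma_k z_{k+1}
    return x

x_final = accelerated_steps(lambda p: p - 5.0, x0=0.0, L=1.0)
```

On this quadratic the iterates converge to the minimizer 5.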
11 / 24
-
SADMM for smooth f
Other modifications:
- A modified augmented Lagrangian term based on suitable parameters (θk, γk).
- Averaging the iterates generated by Algorithm 2 non-uniformly.
- Smooth f(x) ⇒ no need to average over x.
12 / 24
-
SADMM for smooth f
Results:
- For specific αj and γj parameters, smooth f and non-smooth h:
- For σ = 0,

E[f(x̄k) − f(x*) + h(ȳk) − h(y*) + ρ‖Az̄k + Bȳk − b‖₂] ≤ 2LR²/(k+1)² + 2βD²_Y/(k+1) + 2ρ²/(β(k+1))
=⇒ Convergence rate of the smooth part: O(1/k2)
13 / 24
-
GFLasso with smooth loss
- Graph-guided fused lasso (GFlasso):
  - Uses a graph-based regularizer,
  - Variables: vertices of the graph (xi),
  - Penalizes the difference between adjacent variables according to the edge weight (wi,j: the weight for the edge between xi and xj).
Ouyang et al., ICML 2013
14 / 24
-
GFLasso with smooth loss
- Graph-guided fused lasso (GFlasso):
  - Problem formulation:

    min E[L(x, ξ)] + (λ/2)‖x‖²₂ + ν‖y‖₁,   s.t.  Fx − y = 0.

  - L(x, ξ) = ½(l − xᵀs)² for a feature-label pair (s, l) in the training sample ξ.
  - Fij = wij, Fji = −wij for all edges {i, j} ∈ E.
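One common way to realize the edge-difference matrix F is one row per edge; this is a sketch of that construction (the slide only gives the entries Fij = wij, Fji = −wij, so the row-per-edge layout here is an assumption):

```python
# Hypothetical construction of F for GFlasso: one row per edge {i, j},
# with w_ij in column i and -w_ij in column j, so row r of F @ x equals
# w_ij * (x_i - x_j) -- the penalized difference of adjacent variables.

def build_F(n_vars, edges):
    # edges: list of (i, j, w_ij) triples
    F = [[0.0] * n_vars for _ in edges]
    for r, (i, j, w) in enumerate(edges):
        F[r][i] = w
        F[r][j] = -w
    return F

F = build_F(3, [(0, 1, 2.0), (1, 2, 0.5)])
```

Applying F to x = (1, 0, 4) gives the weighted adjacent differences (2·(1−0), 0.5·(0−4)).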
15 / 24
-
GFLasso with smooth loss
- Comparing the following methods on the 20newsgroups dataset:
- Purpose: classifying papers into 4 categories based on the words they include.
[Figure: classification accuracy (%) and the objective value (1/Ntrain) ∑i ½(1 − li xᵀsi)² + (γ/2)‖x‖²₂ + ν‖Fx‖₁ versus number of iterations, comparing SGD, Proximal SGD, Online-RDA, RDA-Admm, SADMM, and optimal-SADMM.]
16 / 24
-
Overlapped group lasso
- Formulation:

  f(x, ξ) = 0.1 ∑_{j=1}^{10} L(x, ξj),
  h(y) = C(‖x⁽¹⁾‖₁ + (1/√123)‖x⁽²⁾‖block).

- L(x, ξj) = log(1 + e^(−lj sjᵀx)) → logistic loss,
- h(y): overlapping group lasso regularizer,
- y = Ax: a concatenation of m repetitions of x,
- ‖x‖block = ∑i ‖Xi,·‖₂ + ∑j ‖X·,j‖₂, where X denotes x reshaped into a square matrix,
- x⁽¹⁾: the first 123 elements of x,
- x⁽²⁾: the remaining elements of x.
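The block norm can be sketched by reshaping x into a square matrix and summing the 2-norms of all rows and all columns (a hypothetical helper, assuming len(x) is a perfect square):

```python
import math

# Sketch of the block norm: reshape x into a square matrix X and sum
# the 2-norms of all rows and all columns.

def block_norm(x):
    n = math.isqrt(len(x))
    assert n * n == len(x), "x must reshape to a square matrix"
    X = [x[i * n:(i + 1) * n] for i in range(n)]
    rows = sum(math.sqrt(sum(v * v for v in row)) for row in X)
    cols = sum(math.sqrt(sum(X[i][j] ** 2 for i in range(n))) for j in range(n))
    return rows + cols
```

For x = (3, 4, 0, 0), reshaped to [[3, 4], [0, 0]], the row norms sum to 5 and the column norms to 7, giving 12.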
17 / 24
-
Overlapped group lasso
- Using the "adult" dataset
- Purpose: binary classification
[Figure: classification accuracy (%) and the objective value versus CPU time (s), comparing RDA-Admm, SADMM, and Optimal SADMM.]
18 / 24
-
Strongly convex loss functions
- Using the hinge loss (L(x, ξ) = max{0, 1 − l sᵀx}) in the two previous examples,
- For the "20newsgroup" dataset,
[Figure: classification accuracy (%) and the objective value versus number of iterations, comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]
19 / 24
-
Strongly convex loss functions
- Using the hinge loss (L(x, ξ) = max{0, 1 − l sᵀx}) in the two previous examples,
- For the "adult" dataset,
[Figure: classification accuracy (%) and the objective value (1/Ntrain) ∑i f(x, ξ) + h(y) versus number of iterations, comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]
20 / 24
-
Accuracy improvement vs. number of features
- Generating synthetic data,
- Running GFlasso with smooth loss,
- Reporting the percentage improvement of Optimal-SADMM over SADMM in classification accuracy, as a function of the number of features,
[Figure: % accuracy improvement of accelerated SADMM versus number of features (10 to 500).]
21 / 24
-
Conclusion
- Presenting two new accelerated versions of stochastic ADMM:
- A variant attaining the theoretically optimal O(1/k) convergence rate for strongly convex stochastic problems,
- A variant with an optimal O(1/k²) dependence on the smooth stochastic part.
- Notable performance of our accelerated variants over their non-accelerated counterparts.
22 / 24
-
Future work
- Transferring the O(log k/k) convergence rate of the last iterate, as done for SGD, to the SADMM setting.
- Obtaining high-probability bounds under light-tailed assumptions on the stochastic error.
- Sampling multiple stochastic gradients to decrease the variance of the gradient estimates.
- Deriving a mirror-descent version.
- Improving the rate dependence of the augmented Lagrangian part to O(1/k²) for smooth problems.
23 / 24
-
The End
Questions?
24 / 24