- Office: CSB 725
- Curriculum Vitæ

*Xiangru Lian* (

Most distributed machine learning systems nowadays, including TensorFlow and
CNTK, are built in a centralized fashion. One bottleneck of centralized
algorithms lies on high communication cost on the central node. Motivated by
this, we ask, can decentralized algorithms be faster than its centralized counterpart?

Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.

Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.

The stochastic composition optimization proposed recently by
Wang et al. [2014] minimizes the objective with the compositional
expectation form: $\min_x~(\mathbb{E}_iF_i \circ \mathbb{E}_j G_j)(x).$ It
summarizes many important applications in machine learning, statistics, and
finance. In this paper, we consider the finite-sum scenario for composition
optimization:
\[\min_x f (x) := \frac{1}{n} \sum_{i = 1}^n F_i \left(\frac{1}{m} \sum_{j = 1}^m G_j (x) \right). \]
We propose two algorithms to solve this problem by combining the stochastic
compositional gradient descent (SCGD) and the stochastic variance reduced
gradient (SVRG) technique. A constant linear convergence rate is proved for
strongly convex optimization, which substantially improves the sublinear rate
$O(K^{-0.8})$ of the best known algorithm.

(* means equal contribution)

In this paper, we propose and study an Asynchronous parallel Greedy Coordinate
Descent (Asy-GCD) algorithm for minimizing a smooth function with
bounded constraints. At each iteration, workers asynchronously
conduct greedy coordinate descent updates on a block of variables.
In the first part of the paper, we analyze the theoretical
behavior of Asy-GCD and prove a linear convergence rate. In the
second part, we develop an efficient kernel SVM solver based on
Asy-GCD in the shared memory multi-core setting. Since our
algorithm is fully asynchronous---each core does not need to idle
and wait for the other cores---the resulting algorithm enjoys good
speedup and outperforms existing multi-core kernel SVM solvers
including asynchronous stochastic coordinate descent and
multi-core LIBSVM.

Asynchronous parallel optimization received substantial successes and extensive
attention recently. One of core theoretical questions is how much speedup (or
benefit) the asynchronous parallelization can bring us. This paper provides a
comprehensive and generic analysis to study the speedup property for a broad
range of asynchronous parallel stochastic algorithms from the zeroth order to
the first order methods. Our result recovers or improves existing analysis on
special cases, provides more insights for understanding the asynchronous
parallel behaviors, and suggests a novel asynchronous parallel zeroth order
method for the first time. Our experiments provide novel applications including
model blending problems using the proposed asynchronous parallel zeroth order
method.

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD (ASGD) has been widely adopted for accelerating the training of large-scale deep networks in a distributed computing environment. However, in practice it is quite challenging to tune the training hyperparameters (such as learning rate) when using ASGD so as achieve convergence and linear speedup, since the stability of the optimization algorithm is strongly influenced by the asynchronous nature of parameter updates. In this paper, we propose a variant of the ASGD algorithm in which the learning rate is modulated according to the gradient staleness and provide theoretical guarantees for convergence of this algorithm. Experimental verification is performed on commonly-used image classification benchmarks: CIFAR10 and Imagenet to demonstrate the superior effectiveness of the proposed approach, compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm.

Asynchronous parallel implementations of stochastic gradient (SG)
have been broadly used in solving deep neural network and received
many successes in practice recently. However, existing theories
cannot explain their convergence and speedup properties, mainly due
to the nonconvexity of most deep learning formulations and the
asynchronous parallel mechanism. To fill the gaps in theory and
provide theoretical supports, this paper studies two asynchronous
parallel implementations of SG: one is over a
computer network and the other is on a shared
memory system. We establish an ergodic convergence rate
$O(1/\sqrt{K})$ for both algorithms and prove that the linear
speedup is achievable if the number of workers is bounded by
$\sqrt{K}$ ($K$ is the total number of iterations). Our results
generalize and improve existing analysis for convex minimization.

We report on ${^{7}Li}$ and ${^{77}Se}$ nuclear magnetic resonance (NMR) measurements in the newly discovered iron-based superconductor (Li0.8Fe0.2)OHFeSe. Both remarkable change in spectrum and divergence of spin-spin relaxation (T2) are observed with decreasing temperature, suggesting a static magnetic ordering in this material. Moreover, the broadened ${^{7}Li}$ spectrum also shows a characteristic three-peak structure. Comparison of linewidth between ${^{7}Li}$ and ${^{77}Se}$ spectra indicates that there is a ferromagnetic component along the external field direction and it happens on Fe sites in the (Li0.8Fe0.2)OH layer rather than the FeSe layer, which is also responsible for the three-peak structure of the ${^{7}Li}$ spectrum with (Li0.8Fe0.2) random occupation. The field-dependent spectrum and T2 suggest that the above ferromagnetic component is induced by external magnetic field. Our present results indicate that the superconductivity in (Li0.8Fe0.2)OHFeSe is very robust against the observed field-induced ferromagnetism and both of them could coexist under external magnetic field.

Last Updated on:

Author: Xiangru Lian

We are maverick 救済なんていらない