Xiangru Lian ( 廉相如 れんしょうじょ ), a Ph.D. student in the Department of Computer Science at the University of Rochester, welcomes you to his homepage.

I am in the process of getting rid of non-free software: roadmap (TBD).

## Research Areas and Publications

#### Asynchronous Decentralized Parallel Stochastic Gradient Descent

###### Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu
@misc{1710.06952,
Author = {Xiangru Lian and Wei Zhang and Ce Zhang and Ji Liu},
Title = {Asynchronous Decentralized Parallel Stochastic Gradient Descent},
Year = {2017},
Eprint = {arXiv:1710.06952},
}

Recent work shows that decentralized parallel stochastic gradient descent (D-PSGD) can outperform its centralized counterpart both theoretically and practically. Asynchronous parallelism is a powerful technique for improving the efficiency of distributed machine learning platforms, and it is widely used in popular machine learning software and solvers built on centralized parallel protocols, such as TensorFlow; however, it remains unclear how to apply asynchronous parallelism to improve the efficiency of decentralized parallel algorithms. This paper proposes an asynchronous decentralized parallel stochastic gradient descent algorithm that brings asynchronous parallelism to decentralized algorithms. Our theoretical analysis provides the convergence rate, or equivalently the computational complexity, which is consistent with many special cases and indicates that a linear speedup can be achieved as the number of nodes or the batch size increases. Extensive experiments in deep learning validate the proposed algorithm.
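The core mechanic of the asynchronous decentralized scheme can be sketched as follows: each worker repeatedly takes a local stochastic gradient step and then averages its model with one randomly chosen neighbor, with no global synchronization. This is a serial simulation for illustration only; all function and parameter names are assumptions, not the paper's implementation.

```python
import random

def adpsgd_sketch(grad, x0, neighbors, steps=1000, lr=0.05, seed=0):
    """Serial simulation of AD-PSGD-style updates (illustrative sketch only).

    Each step, one worker takes a stochastic-gradient step on its local
    model copy and then averages that copy with a randomly chosen
    neighbor's copy (gossip averaging).
    """
    rng = random.Random(seed)
    n = len(neighbors)
    x = [x0] * n  # one local model copy per worker
    for _ in range(steps):
        i = rng.randrange(n)            # the worker that wakes up
        x[i] -= lr * grad(x[i], rng)    # local stochastic gradient step
        j = rng.choice(neighbors[i])    # gossip: average with one neighbor
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg
    return sum(x) / n  # consensus estimate

# Toy objective f(x) = (x - 3)^2 / 2 with noisy gradients; minimizer is 3.
def noisy_grad(x, rng):
    return (x - 3.0) + rng.gauss(0.0, 0.1)

ring = [[1, 3], [0, 2], [1, 3], [2, 0]]  # 4 workers on a ring topology
print(adpsgd_sketch(noisy_grad, 0.0, ring))  # close to 3.0
```

Despite the asynchrony, the worker models reach consensus near the minimizer, which is the behavior the convergence analysis formalizes.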

#### Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

###### Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu
@misc{1705.09056,
Author = {Xiangru Lian and Ce Zhang and Huan Zhang and Cho-Jui Hsieh and Wei Zhang and Ji Liu},
Title = {Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent},
Year = {2017},
Eprint = {arXiv:1705.09056},
}

Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms is the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts?
Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
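A single synchronous D-PSGD iteration can be sketched with a doubly stochastic mixing matrix over the communication graph: each node averages its model with its neighbors' models and then takes a local gradient step. The key point the paper quantifies is that node i only communicates with its graph neighbors (entries with W[i][j] > 0), not with all n nodes, so the busiest node's communication cost is bounded by its degree. This is a minimal sketch with scalar models; names and values are illustrative.

```python
def dpsgd_step(xs, grads, W, lr):
    """One synchronous D-PSGD iteration (illustrative sketch).

    xs[i] is node i's model (a scalar here), W a doubly stochastic mixing
    matrix over the communication graph. Each node needs only the models
    of its neighbors, i.e. the j with W[i][j] > 0.
    """
    n = len(xs)
    mixed = [sum(W[i][j] * xs[j] for j in range(n)) for i in range(n)]
    return [mixed[i] - lr * grads[i] for i in range(n)]

# Ring of 4 nodes: each averages itself with its two neighbors.
W = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]
xs = [0.0, 4.0, 8.0, 4.0]
print(dpsgd_step(xs, [0.0] * 4, W, 0.1))  # mixing only: [2.0, 4.0, 6.0, 4.0]
```

With zero gradients the step is pure gossip averaging: the models contract toward their common mean while the total (and hence the average) is preserved, because W is doubly stochastic.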

#### Finite-sum Composition Optimization via Variance Reduced Gradient Descent

###### Xiangru Lian, Mengdi Wang, and Ji Liu
@InProceedings{pmlr-v54-lian17a,
title = 	 {{Finite-sum Composition Optimization via Variance Reduced Gradient Descent}},
author = 	 {Xiangru Lian and Mengdi Wang and Ji Liu},
booktitle = 	 {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics},
pages = 	 {1159--1167},
year = 	 {2017},
editor = 	 {Aarti Singh and Jerry Zhu},
volume = 	 {54},
series = 	 {Proceedings of Machine Learning Research},
address = 	 {Fort Lauderdale, FL, USA},
month = 	 {20--22 Apr},
publisher = 	 {PMLR},
pdf = 	 {http://proceedings.mlr.press/v54/lian17a/lian17a.pdf},
abstract = 	 {The stochastic composition optimization proposed recently by Wang et al. [2014] minimizes the objective with the compositional expectation form: $\min_x~(\mathbb{E}_iF_i \circ \mathbb{E}_j G_j)(x).$ It summarizes many important applications in machine learning, statistics, and finance. In this paper, we consider the finite-sum scenario for composition optimization: $\min_x f (x) := \frac{1}{n} \sum_{i = 1}^n F_i \left(\frac{1}{m} \sum_{j = 1}^m G_j (x) \right).$ We propose two algorithms to solve this problem by combining the stochastic compositional gradient descent (SCGD) and the stochastic variance reduced gradient (SVRG) technique. A constant linear convergence rate is proved for strongly convex optimization, which substantially improves the sublinear rate $O(K^{-0.8})$ of the best known algorithm.}
}
The stochastic composition optimization proposed recently by Wang et al. [2014] minimizes the objective with the compositional expectation form: $\min_x~(\mathbb{E}_iF_i \circ \mathbb{E}_j G_j)(x).$ It summarizes many important applications in machine learning, statistics, and finance. In this paper, we consider the finite-sum scenario for composition optimization: $\min_x f (x) := \frac{1}{n} \sum_{i = 1}^n F_i \left(\frac{1}{m} \sum_{j = 1}^m G_j (x) \right).$ We propose two algorithms to solve this problem by combining the stochastic compositional gradient descent (SCGD) and the stochastic variance reduced gradient (SVRG) technique. A constant linear convergence rate is proved for strongly convex optimization, which substantially improves the sublinear rate $O(K^{-0.8})$ of the best known algorithm.
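What makes the compositional objective harder than an ordinary finite sum is its chain-rule structure: the inner average $\bar{G}(x)$ and its derivative must be formed before any outer derivative $F_i'$ can be evaluated, which is exactly what SCGD/SVRG-style methods estimate stochastically. A minimal scalar sketch of the exact (non-stochastic) gradient, with a toy instance chosen for illustration:

```python
def full_composition_grad(x, Gs, dGs, Fs, dFs):
    """Exact gradient of f(x) = (1/n) sum_i F_i( (1/m) sum_j G_j(x) )
    for scalar x. Illustrative sketch of the chain-rule structure only."""
    m, n = len(Gs), len(Fs)
    g = sum(G(x) for G in Gs) / m         # inner value  G_bar(x)
    dg = sum(dG(x) for dG in dGs) / m     # inner slope  G_bar'(x)
    return dg * sum(dF(g) for dF in dFs) / n  # G_bar'(x) * F_bar'(G_bar(x))

# Toy instance: G_1(x)=x, G_2(x)=2x -> G_bar(x)=1.5x;  F_1(y)=y^2, F_2(y)=3y.
Gs  = [lambda x: x,   lambda x: 2 * x]
dGs = [lambda x: 1.0, lambda x: 2.0]
Fs  = [lambda y: y ** 2, lambda y: 3 * y]
dFs = [lambda y: 2 * y,  lambda y: 3.0]
print(full_composition_grad(2.0, Gs, dGs, Fs, dFs))  # 6.75
```

Checking by hand: $f(x) = \frac{1}{2}[(1.5x)^2 + 4.5x]$, so $f'(x) = 2.25x + 2.25 = 6.75$ at $x = 2$, matching the code. The stochastic methods replace the full sums over $i$ and $j$ with samples, which is where the variance-reduction technique earns its linear rate.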

#### Asynchronous Parallel Greedy Coordinate Descent

###### Yang You*, Xiangru Lian*, Ji Liu, Hsiang-Fu Yu, Inderjit Dhillon, James Demmel, and Cho-Jui Hsieh (* means equal contribution)
@inproceedings{you2016asynchronous,
title={Asynchronous parallel greedy coordinate descent},
author={You, Yang and Lian, Xiangru and Liu, Ji and Yu, Hsiang-Fu and Dhillon, Inderjit S and Demmel, James and Hsieh, Cho-Jui},
booktitle={Advances In Neural Information Processing Systems},
pages={4682--4690},
year={2016}
}

In this paper, we propose and study an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) algorithm for minimizing a smooth function with bounded constraints. At each iteration, workers asynchronously conduct greedy coordinate descent updates on a block of variables. In the first part of the paper, we analyze the theoretical behavior of Asy-GCD and prove a linear convergence rate. In the second part, we develop an efficient kernel SVM solver based on Asy-GCD in the shared memory multi-core setting. Since our algorithm is fully asynchronous---each core does not need to idle and wait for the other cores---the resulting algorithm enjoys good speedup and outperforms existing multi-core kernel SVM solvers including asynchronous stochastic coordinate descent and multi-core LIBSVM.
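The serial core of the greedy rule can be sketched in a few lines: at each iteration, update only the coordinate whose gradient has the largest magnitude (Asy-GCD runs this rule asynchronously across workers on blocks of variables, which this sequential sketch does not model). All names and constants here are illustrative assumptions.

```python
def greedy_cd(grad, x, lr=0.5, steps=200):
    """Sequential greedy coordinate descent (illustrative sketch).

    Each iteration updates only the coordinate with the largest gradient
    magnitude, the selection rule underlying Asy-GCD.
    """
    for _ in range(steps):
        g = grad(x)
        k = max(range(len(x)), key=lambda i: abs(g[i]))  # greedy choice
        x[k] -= lr * g[k]
    return x

# Separable quadratic f(x) = sum_i c_i (x_i - t_i)^2 / 2, minimized at t.
t = [1.0, -2.0, 3.0]
c = [1.0, 2.0, 0.5]
grad = lambda x: [c[i] * (x[i] - t[i]) for i in range(3)]
x = greedy_cd(grad, [0.0, 0.0, 0.0])
print(x)  # close to [1.0, -2.0, 3.0]
```

Greedy selection naturally spends more updates on the coordinates whose gradients stay large, which is the source of its fast per-iteration progress compared with cyclic or random coordinate choice.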

#### A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-order to First-order

###### Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu
@incollection{lianNIPS2016_6551,
title = {A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order},
author = {Lian, Xiangru and Zhang, Huan and Hsieh, Cho-Jui and Huang, Yijun and Liu, Ji},
booktitle = {Advances in Neural Information Processing Systems 29},
pages = {3054--3062},
year = {2016}
}

Asynchronous parallel optimization has received substantial success and extensive attention recently. One of the core theoretical questions is how much speedup (or benefit) asynchronous parallelization can bring. This paper provides a comprehensive and generic analysis of the speedup property for a broad range of asynchronous parallel stochastic algorithms, from zeroth-order to first-order methods. Our result recovers or improves existing analyses of special cases, provides more insight into asynchronous parallel behavior, and suggests a novel asynchronous parallel zeroth-order method for the first time. Our experiments include novel applications, such as model blending, using the proposed asynchronous parallel zeroth-order method.
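A zeroth-order method needs only function values, not gradients. A standard two-point estimator of the kind such methods build on: sample a random direction $u$ and return $\frac{f(x + \mu u) - f(x)}{\mu}\, u$, which is an unbiased estimate of $\nabla f(x)$ up to an $O(\mu)$ smoothing bias. This sketch averages many estimates to show the estimator is centered on the true gradient; it is not the paper's algorithm, and the names are illustrative.

```python
import random

def zo_grad(f, x, rng, mu=1e-4):
    """Two-point zeroth-order gradient estimator (illustrative sketch).

    Estimates the gradient of f at x from two function values along a
    random Gaussian direction u: ((f(x + mu*u) - f(x)) / mu) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    fx = f(x)
    fxu = f([xi + mu * ui for xi, ui in zip(x, u)])
    s = (fxu - fx) / mu
    return [s * ui for ui in u]

# Average many estimates for f(x) = x0^2 + x1^2 at (1, 2); true grad (2, 4).
f = lambda x: x[0] ** 2 + x[1] ** 2
rng = random.Random(0)
n = 20000
acc = [0.0, 0.0]
for _ in range(n):
    g = zo_grad(f, [1.0, 2.0], rng)
    acc = [a + gi / n for a, gi in zip(acc, g)]
print(acc)  # close to the true gradient (2.0, 4.0)
```

The single-sample estimator is noisy (its variance grows with the dimension), which is why the zeroth-order convergence rates carry an extra dimension-dependent factor compared with first-order methods.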

#### Staleness-aware Async-SGD for Distributed Deep Learning

###### Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu
@inproceedings{zhang2016staleness,
author = {Zhang, Wei and Gupta, Suyog and Lian, Xiangru and Liu, Ji},
title = {Staleness-aware async-SGD for Distributed Deep Learning},
booktitle = {Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence},
series = {IJCAI'16},
year = {2016},
isbn = {978-1-57735-770-4},
location = {New York, New York, USA},
pages = {2350--2356},
numpages = {7},
url = {http://dl.acm.org/citation.cfm?id=3060832.3060950},
acmid = {3060950},
publisher = {AAAI Press},
}

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks, and asynchronous SGD (ASGD) has been widely adopted for accelerating the training of large-scale deep networks in a distributed computing environment. However, in practice it is quite challenging to tune the training hyperparameters (such as the learning rate) when using ASGD so as to achieve convergence and linear speedup, since the stability of the optimization algorithm is strongly influenced by the asynchronous nature of parameter updates. In this paper, we propose a variant of the ASGD algorithm in which the learning rate is modulated according to the gradient staleness, and provide theoretical guarantees for the convergence of this algorithm. Experimental verification is performed on commonly used image classification benchmarks (CIFAR-10 and ImageNet) to demonstrate the superior effectiveness of the proposed approach compared to SSGD (synchronous SGD) and the conventional ASGD algorithm.
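The staleness-modulation idea can be sketched in one line: scale the step size down for gradients computed from older parameters, so very stale updates move the model less. The inverse-staleness rule below is an assumption chosen for illustration; the paper's exact modulation may differ.

```python
def staleness_lr(base_lr, staleness):
    """Staleness-modulated step size (illustrative sketch).

    A gradient computed from parameters that are `staleness` updates old
    is applied with a learning rate scaled down inversely with the
    staleness; fresh gradients (staleness <= 1) use the full rate.
    """
    return base_lr / max(1, staleness)

print([staleness_lr(0.1, s) for s in (0, 1, 2, 4)])  # [0.1, 0.1, 0.05, 0.025]
```

Damping stale updates this way trades a little progress per update for much better stability, which is what lets the modulated ASGD match synchronous SGD's final accuracy.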

#### Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

###### Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu
@inproceedings{Lian:2015:APS:2969442.2969545,
author = {Lian, Xiangru and Huang, Yijun and Li, Yuncheng and Liu, Ji},
title = {Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization},
booktitle = {Proceedings of the 28th International Conference on Neural Information Processing Systems},
series = {NIPS'15},
year = {2015},
pages = {2737--2745},
numpages = {9},
url = {http://dl.acm.org/citation.cfm?id=2969442.2969545},
acmid = {2969545},
publisher = {MIT Press},
}

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in training deep neural networks and have achieved many successes in practice recently. However, existing theory cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill the gaps in theory and provide theoretical support, this paper studies two asynchronous parallel implementations of SG: one over a computer network and the other on a shared-memory system. We establish an ergodic convergence rate of $O(1/\sqrt{K})$ for both algorithms and prove that linear speedup is achievable if the number of workers is bounded by $\sqrt{K}$ ($K$ is the total number of iterations). Our results generalize and improve existing analyses for convex minimization.
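The essential feature the analysis must handle is that each applied gradient was computed from parameters read several updates earlier. A serial simulation of this bounded-staleness dynamic on a toy quadratic (all names and constants are illustrative, not the paper's setup):

```python
import collections

def async_sgd_stale(grad, x0, delay=4, lr=0.02, steps=2000):
    """Serial simulation of asynchronous SGD with bounded staleness (sketch).

    Each applied gradient is evaluated at the parameter value from `delay`
    iterations ago, mimicking workers that read the model, compute, and
    write back while other workers keep updating it.
    """
    hist = collections.deque([x0] * (delay + 1), maxlen=delay + 1)
    x = x0
    for _ in range(steps):
        stale_x = hist[0]            # parameters as read `delay` steps ago
        x -= lr * grad(stale_x)      # apply the stale gradient
        hist.append(x)
    return x

# Quadratic f(x) = (x - 5)^2 / 2; with bounded delay, SGD still converges.
x = async_sgd_stale(lambda v: v - 5.0, 0.0)
print(round(x, 4))  # 5.0
```

With the delay bounded and the step size small enough relative to it, the iterates still reach the minimizer; the paper's $\sqrt{K}$ bound on the number of workers plays the role of this delay condition.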

#### NMR Evidence for Field-induced Ferromagnetism in Li0.8Fe0.2OHFeSe Superconductor

###### Yongping Wu, Dan Zhao, Xiangru Lian, Xiufang Lu, Naizhou Wang, Xigang Luo, Xianhui Chen, and Tao Wu
@article{PhysRevB.91.125107,
title = {NMR evidence for field-induced ferromagnetism in (${\mathrm{Li}}_{0.8}\phantom{\rule{0.16em}{0ex}}{\mathrm{Fe}}_{0.2}$)OHFeSe superconductor},
author = {Wu, Y. P. and Zhao, D. and Lian, X. R. and Lu, X. F. and Wang, N. Z. and Luo, X. G. and Chen, X. H. and Wu, T.},
journal = {Phys. Rev. B},
volume = {91},
issue = {12},
pages = {125107},
numpages = {5},
year = {2015},
month = {Mar},
publisher = {American Physical Society},
doi = {10.1103/PhysRevB.91.125107},
}

We report on $^{7}$Li and $^{77}$Se nuclear magnetic resonance (NMR) measurements in the newly discovered iron-based superconductor (Li0.8Fe0.2)OHFeSe. Both a remarkable change in the spectrum and a divergence of the spin-spin relaxation time ($T_2$) are observed with decreasing temperature, suggesting static magnetic ordering in this material. Moreover, the broadened $^{7}$Li spectrum also shows a characteristic three-peak structure. A comparison of the linewidths of the $^{7}$Li and $^{77}$Se spectra indicates that there is a ferromagnetic component along the external field direction and that it arises on the Fe sites in the (Li0.8Fe0.2)OH layer rather than the FeSe layer, which also accounts for the three-peak structure of the $^{7}$Li spectrum given the random (Li0.8Fe0.2) occupation. The field-dependent spectrum and $T_2$ suggest that this ferromagnetic component is induced by the external magnetic field. Our results indicate that the superconductivity in (Li0.8Fe0.2)OHFeSe is very robust against the observed field-induced ferromagnetism and that the two can coexist under an external magnetic field.