
Advanced Database Techniques for Processing Scientific Multi-Dimensional Data and Stochastic Gradient Descent on Highly-Parallel Architectures: Multi-core CPU or GPU? Synchronous or Asynchronous?

Event Info

  • Date

    22-05-2019

  • Location

    LNCC, Petrópolis, RJ, Brazil

Presentation 1: Advanced Database Techniques for Processing Scientific Multi-Dimensional Data

Abstract

Scientific applications are generating an ever-increasing volume of multi-dimensional data that require fast analytics to extract meaningful results. The database community has developed distributed array databases to alleviate this problem. In this talk, we introduce four classical techniques that we extend to array databases. The first is a novel distributed similarity join operator for multi-dimensional arrays that minimizes the overall data transfer and network congestion while providing load balancing, without completely repartitioning and replicating the input arrays. The second technique is materialized array views and incremental view maintenance under batch updates. We give a three-stage heuristic that finds effective update plans and continuously repartitions the array and the view, based on a window of past updates, as a side effect of view maintenance. The third technique is User-Defined Functions (UDF) for structural locality operations on arrays. We propose an in-situ UDF mechanism, called ArrayUDF, that allows users to define computations on adjacent array cells without the use of join operations and executes the UDF directly on arrays stored in data files. The fourth technique is a distributed framework for cost-based caching of multi-dimensional arrays in native format. We design cache eviction and placement heuristic algorithms that consider the historical query workload. These techniques are motivated by the Palomar Transient Factory (PTF) astronomical project and are implemented in its real-time transient detection pipeline. They played a pivotal role in the first-ever observation of a neutron star merger, which produced gravitational waves and turned out to be the origin of heavy elements, including gold. This has led to a Science magazine article that has received extensive media coverage on ACM TechNews, Slashdot, FiveThirtyEight, and Quanta Magazine, among others.
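
To make the structural-locality idea concrete, here is a minimal sketch in the spirit of ArrayUDF: a user-defined function is applied to each cell together with its 3x3 neighborhood, so no self-join on coordinates is needed. This is a toy, in-memory NumPy illustration; the names (apply_stencil_udf, udf) are assumptions for exposition and not the actual ArrayUDF API, which executes UDFs in situ on arrays stored in data files.

    import numpy as np

    def apply_stencil_udf(array, udf):
        # Apply a user-defined function to every interior cell and its
        # 3x3 structural neighborhood; boundary cells are left at zero.
        out = np.zeros_like(array, dtype=float)
        rows, cols = array.shape
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                # The UDF sees the adjacent cells directly, which is the
                # structural-locality access pattern described above.
                out[i, j] = udf(array[i - 1:i + 2, j - 1:j + 2])
        return out

    if __name__ == "__main__":
        data = np.random.rand(8, 8)
        # Example UDF: smooth each cell with the mean of its neighborhood.
        print(apply_stencil_udf(data, lambda nbhd: nbhd.mean()))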


Presentation 2: Stochastic Gradient Descent on Highly-Parallel Architectures: Multi-core CPU or GPU? Synchronous or Asynchronous?


Abstract

There is an increased interest, both in industry and academia, in building data analytics frameworks with advanced algebraic capabilities. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly-parallel kernels executed on GPU. Stochastic gradient descent (SGD) is the most popular optimization method for model training and is implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well known, there is an intense debate on which of the many SGD variants is better in practice. In this work, we perform a comprehensive experimental study of parallel SGD for training machine learning models. We consider the impact of three factors: the computing architecture (multi-core CPU or GPU), synchronous versus asynchronous model updates, and data sparsity. We evaluate them on three measures: hardware efficiency, statistical efficiency, and time to convergence. We draw several interesting findings from our experiments with logistic regression (LR), support vector machines (SVM), and deep neural nets (MLP) on five real datasets. As expected, GPU always outperforms parallel CPU for synchronous SGD. The gap is, however, only 2-5X for simple models, and below 7X even for fully-connected deep nets. For asynchronous SGD, CPU is undoubtedly the optimal solution, outperforming GPU in time to convergence even when the GPU achieves a speedup of 10X or more. The choice between synchronous GPU and asynchronous CPU is not straightforward and depends on the task and the characteristics of the data. Thus, CPU should not be easily discarded for machine learning workloads. We hope that our insights provide a useful guide for applying parallel SGD in practice and, more importantly, for choosing the appropriate computing architecture.
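
The synchronous versus asynchronous distinction studied in the talk can be summarized with a short sketch. Below, synchronous SGD updates the model once per mini-batch, while the asynchronous variant follows the Hogwild-style pattern of letting workers update a shared model without locks. This is only an illustration of the two update schemes on a toy logistic regression problem with synthetic data; the dataset, hyperparameters, and use of Python threads (which do not provide true hardware parallelism) are assumptions for exposition, not the paper's experimental setup.

    import numpy as np
    from threading import Thread

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    y = (X @ rng.standard_normal(20) > 0).astype(float)

    def gradient(w, xi, yi):
        # Gradient of the logistic loss for a single example.
        p = 1.0 / (1.0 + np.exp(-xi @ w))
        return (p - yi) * xi

    def synchronous_sgd(epochs=5, batch=32, lr=0.1):
        # Synchronous: one gradient per mini-batch, one model update per step.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for s in range(0, len(X), batch):
                xb, yb = X[s:s + batch], y[s:s + batch]
                p = 1.0 / (1.0 + np.exp(-xb @ w))
                w -= lr * (xb.T @ (p - yb)) / len(xb)
        return w

    def asynchronous_sgd(epochs=5, lr=0.1, workers=4):
        # Asynchronous (Hogwild-style): workers read and write the shared
        # model without locks, tolerating stale reads of w.
        w = np.zeros(X.shape[1])
        def worker(indices):
            for i in indices:
                w[:] -= lr * gradient(w, X[i], y[i])  # lock-free in-place update
        for _ in range(epochs):
            chunks = np.array_split(rng.permutation(len(X)), workers)
            threads = [Thread(target=worker, args=(c,)) for c in chunks]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        return w

    if __name__ == "__main__":
        for name, w in [("sync", synchronous_sgd()), ("async", asynchronous_sgd())]:
            print(name, "accuracy:", ((X @ w > 0) == y).mean())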
