Google Summer of Code Final Report

1. Overview

This is the final report for a Google Summer of Code project under the Haskell organization to enhance the Goal library with GPU operations. All commits were made to the experimental branch.

2. Working Period

2.1. Developer setup

I originally planned to create a nix shell for development. During the proposal period I experimented with Hasktorch and Goal using a nix template provided by Hasktorch. Ultimately, however, we settled on Docker for the development environment, as it may be more straightforward to use with Apptainer for HPC tasks. The final Docker image clocked in at over 14 GB; I tried to reduce this but only managed to trim off about half a gigabyte.

zarak@nixos ~ ❯ docker history f71d81c4404d
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
f71d81c4404d   2 months ago   /bin/bash -c #(nop)  CMD ["/bin/bash"]          0B
153417880ada   2 months ago   /bin/bash -c #(nop)  ENV PATH=/usr/local/bin…   0B
116decd9c5a0   2 months ago   /bin/bash -c #(nop)  ENV NVIDIA_DRIVER_CAPAB…   0B
ea3546bf7926   2 months ago   /bin/bash -c #(nop)  ENV NVIDIA_VISIBLE_DEVI…   0B
c55774b975f8   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   4.03GB
3e713b88518a   2 months ago   /bin/bash -c #(nop) COPY file:1642f81b2fc032…   1.22kB
b354f7d1e1be   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   66.9kB
bee356d14e95   2 months ago   /bin/bash -c #(nop)  SHELL [/bin/bash -c]       0B
8bb54779d991   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   20MB
1a9db6f64989   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   3.8GB
7009ce892ba2   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   4.78GB
1c62275ed751   2 months ago   /bin/sh -c #(nop)  ARG CABAL=latest             0B
231a648e7653   2 months ago   /bin/sh -c #(nop)  ARG GHC=9.0.2                0B
a8e5be427c4f   2 months ago   |2 DEBIAN_FRONTEND=noninteractive GPG_KEY=77…   15.4MB
6be00d62872f   2 months ago   |2 DEBIAN_FRONTEND=noninteractive GPG_KEY=77…   3.29kB
01c26ac2d371   2 months ago   /bin/sh -c #(nop)  ARG GPG_KEY=7784930957807…   0B
9db92454b75e   2 months ago   |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   1.54GB
6f33782b4774   2 months ago   /bin/sh -c #(nop)  ENV BLAS=OpenBlas            0B
56468c103876   2 months ago   /bin/sh -c #(nop)  ENV TZ=Europe/Berlin         0B
1f2add153ab6   2 months ago   /bin/sh -c #(nop)  ARG DEBIAN_FRONTEND=nonin…   0B
20fffa419e3a   3 months ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      3 months ago   /bin/sh -c #(nop) ADD file:00dae10e79b05c4e1…   72.8MB

This Docker image does not cache the Hasktorch build artifacts, which means that I have to rebuild all of Hasktorch each time I run a fresh instance of the container. Another slight annoyance: if I left the container running and my system was suspended, libtorch would occasionally stop detecting the GPU, which necessitated a container restart and hence a rebuild. Rather than spend time tackling these issues, however, I proceeded with the primary objective of the project.

2.2. Goal Core

Core primarily contains functions that are thin wrappers around their Hasktorch counterparts in Vector.hs. If a required function was not provided by Hasktorch, I wrote it using Hasktorch primitives.

For example, scale is a thin wrapper around Hasktorch’s mulScalar function:

scale :: Scalar x => x -> Vector n a d -> Vector n a d
{-# INLINE scale  #-}
scale x = UnsafeVector . mulScalar x . toTorch

On the other hand, I used einsum to implement a function to compute the outer product, as this was not available natively in Hasktorch:

outerProduct
    :: (KnownNat m, KnownNat n, Num x)
    => Vector m x d
    -> Vector n x d
    -> Matrix m n x d
{-# INLINE outerProduct #-}
outerProduct v1 v2 =
    Matrix $ UnsafeVector $ flattenAll $ einsum "i,j->ij" [toTorch v1, toTorch v2]

Another example is matrixRoot, which I implemented by referring to the hmatrix source code:

-- | Computes the square root of a 'Matrix' using Denman-Beavers iteration.
-- A slightly modified version of the [implementation](https://hackage.haskell.org/package/hmatrix-0.20.2/docs/src/Internal.Algorithms.html#sqrtmInv) in hmatrix-0.20.2.
matrixRoot
    :: forall n x d. (KnownNat n, Num x, Scalar x, HasTensorOptions x d)
    => Matrix n n x d
    -> Matrix n n x d
{-# INLINE matrixRoot #-}
matrixRoot mtx = Matrix $ UnsafeVector $ flattenAll $ fst $ fixedPoint $ iterate f (t, eyeSquare n opts)
    where n = natValInt (Proxy :: Proxy n)
          opts = tensorOpts (Proxy :: Proxy x) (Proxy :: Proxy d)
          f (y, z) = (0.5 * (y + Torch.inverse z),
                      0.5 * (Torch.inverse y + z))
          t = reshape [n, n] $ toTorch $ toVector mtx
          fixedPoint (a:b:rest) | asValue $ le (l1Loss ReduceMean (fst a) (fst b)) peps = a
                                | otherwise = fixedPoint (b:rest)
          fixedPoint _ = error "fixed point with impossible inputs"
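
For reference, the Denman-Beavers iteration starts from Y_0 = A and Z_0 = I and repeats the coupled update below; the first component of the fixed point converges to the matrix square root (and the second to its inverse square root), which is why the code takes fst of the fixed point:

\[
Y_{k+1} = \tfrac{1}{2}\bigl(Y_k + Z_k^{-1}\bigr), \qquad
Z_{k+1} = \tfrac{1}{2}\bigl(Z_k + Y_k^{-1}\bigr), \qquad
Y_k \to A^{1/2}, \quad Z_k \to A^{-1/2}.
\]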

Unlike the hmatrix version, however, when checking for convergence I took the mean rather than the sum of the absolute values when computing the L1 norm, because with the sum reduction I sometimes failed to achieve convergence even though I theoretically should have. This function wasn’t critical within the scope of this project, so rather than investigating further I moved on.

2.2.1. Test suite

I wrote a suite of unit tests for the functions in Core. The suite caught errors that were missed by the type system (I used Hasktorch’s untyped API), such as vector shape errors and the conflation of CPU and CUDA tensors. A few functions in Core internally use Hasktorch factory functions that create tensors on the CPU by default, so the test suite was particularly helpful in identifying the functions with these errors.
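
A minimal sketch of the kind of test involved follows. Hspec is assumed, and fromList and toList are hypothetical helpers standing in for however Goal converts between Haskell lists and Vectors; the 'CUDA device index is also an assumption:

-- Hspec test sketch; 'fromList' and 'toList' are hypothetical helpers, not Goal's API.
import Test.Hspec

spec :: Spec
spec = describe "scale" $
    it "scales every element without silently falling back to a CPU tensor" $ do
        let v = fromList [1, 2, 3] :: Vector 3 Double 'CUDA
        toList (scale 2 v) `shouldBe` [2, 4, 6]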

2.3. Goal Geometry

2.3.1. Runtime exceptions

The runtime exceptions caused by mixing CPU and CUDA tensors were resolved, but we decided to add another type parameter for the device on which a tensor is initialized. As a result, the syntax is not as clean, but the type safety is probably worth it. The unit test suite picked up some new errors after this change: certain data types are not supported for all operations on the GPU. For example, matrix multiplication only works with floating-point types. Goal only uses floating-point types with Points on a Manifold, so this wouldn’t present any problems, but one possible way of conditionally constraining data types depending on the device follows:

type CUDANotSupportedError (b :: Type) =
    'Text "You attempted to use " ':<>: 'ShowType b ':$$:
    'Text "which is not supported on CUDA"

type CUDASupportedTypes = '[Float, Double]

-- 'If' comes from Data.Type.Bool; 'TypeError' and the error-message
-- constructors come from GHC.TypeLits.
type family ValidType (a :: DeviceType) (b :: Type) :: Constraint where
    ValidType 'CPU  b = ()
    ValidType 'CUDA b =
        If (Elem b CUDASupportedTypes) (() :: Constraint) (TypeError (CUDANotSupportedError b))

type family Elem (e :: Type) (es :: [Type]) :: Bool where
    Elem e '[]      = 'False
    Elem e (e : es) = 'True
    Elem e (_ : es) = Elem e es

The above includes a helpful error message if someone were to use an unsupported type.
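
As a hypothetical illustration (okOnCUDA is not part of Goal), a supported element type satisfies the constraint, while an unsupported one trips the custom error at compile time:

-- Illustrative only: a supported element type satisfies the constraint...
okOnCUDA :: ValidType 'CUDA Double => ()
okOnCUDA = ()

-- ...while uncommenting the following fails to compile with
-- "You attempted to use Int which is not supported on CUDA":
-- badOnCUDA :: ValidType 'CUDA Int => ()
-- badOnCUDA = ()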

2.3.2. Benchmarks

After propagating the Hasktorch primitives through the Geometry package, I ran some benchmarks comparing the original Goal backend to both the CPU- and CUDA-based Hasktorch backends. There isn’t a large difference in CPU performance between the two backends, but the GPU shows a substantial performance boost across the board.

2.3.3. Profiling

After implementing most of the code in goal-graphical, I was able to profile the gradient-descent script. I discovered that there was a flaw in my implementation of l2Norm:

l2Norm
    :: forall k d. (KnownNat k, HasTensorOptions Double d)
    => Vector k Double d
    -> Double
l2Norm = asValue . Torch.sqrt . (* fromIntegral k) . mseLoss (zeros [k] opts) . toTorch
    where k = natValInt (Proxy :: Proxy k)
          opts = tensorOpts (Proxy :: Proxy Double) (Proxy :: Proxy d)

In the above code, I create a (non-sparse) tensor of zeros to use with Hasktorch’s mseLoss function: the mean squared error against a zero vector is the mean of the squared entries, so multiplying by k recovers the sum of squares. Allocating this tensor is not necessary, however, as I can compute the l2Norm more directly:

l2Norm
    :: forall k d. (KnownNat k)
    => Vector k Double d
    -> Double
l2Norm = asValue . Torch.sqrt . sumAll . pow (2 :: Int) . toTorch

2.4. Goal Probability

This was the most challenging Goal package to implement, and several outstanding issues remain. One problem is that a neural network using the Hasktorch backend does not learn at all if there are multiple neurons in a layer; it does, however, learn if there is only a single neuron in the hidden layer. This suggests a bug in the implementation of Replicated Manifolds; there may be an incorrect reshaping operation where a Vector is not constructed in row-major order. I was not able to diagnose the problem after nearly a week of troubleshooting, so I proceeded to benchmarking and profiling.

2.4.1. Random tensors

I restructured the code to generate random tensors using Hasktorch instead of mwc-random. For example, the following generates a Vector of Bernoulli random variables:

bernoulli
  :: forall n d . (KnownNat n, HasTensorOptions Double d)
  => Double
  -> State Generator (Vector n Bool d)
bernoulli p = do
  v <- state T.randomUniformVector :: State Generator (Vector n Double d)
  pure $ T.lessThan v p
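
A small usage sketch (the 'CPU device index and the vector length are arbitrary choices here): runState threads the generator so it can be reused for the next draw.

-- Usage sketch: returns the sample together with the updated generator.
drawSpikes :: Generator -> (Vector 10 Bool 'CPU, Generator)
drawSpikes = runState (bernoulli 0.5)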

The implementation of Random could be improved as it currently has an unused parameter.

2.4.2. Runtime exceptions

I debugged the runtime exceptions here with the GHCi debugger by running :set -fbreak-on-error in the REPL, followed by :trace. This was an enormous help, although it did take some getting used to.

The first exception arose from a faulty implementation of fromInteger in the Num instance of Vector. I had originally derived the Num instance for Vector, which produces a size-0 tensor when casting an integer to a Vector, whereas the correct behavior is to generate a Vector of n copies of the integer, where n is the size of the Vector.
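
A minimal sketch of the corrected behaviour follows, reusing the helpers (tensorOpts, natValInt, UnsafeVector) that appear elsewhere in this report; replicateInteger is a hypothetical name for what the fixed fromInteger effectively does:

-- Sketch only: build a length-n tensor filled with the integer instead of
-- the rank-0 tensor produced by the derived instance.
replicateInteger
    :: forall n d. (KnownNat n, HasTensorOptions Double d)
    => Integer -> Vector n Double d
replicateInteger i = UnsafeVector $ mulScalar (fromInteger i :: Double) (ones [n] opts)
  where n    = natValInt (Proxy :: Proxy n)
        opts = tensorOpts (Proxy :: Proxy Double) (Proxy :: Proxy d)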

The second exception was due to a faulty implementation of lowerTriangular in Core. The fix involved correctly masking the triangular elements.

2.4.3. Benchmarks

The plan was to benchmark backpropagation, but instead I only benchmarked the forward pass.

Performance is substantially worse using the Hasktorch backend, which is surprising given that a forward pass is essentially a series of simple matrix operations. The problem may lie with the >$> operator, which is basically a wrapper around matrixMap and similar functions. It takes a list of Vectors and performs the map as a single matrix-matrix multiplication; the resulting matrix is then split back into a list of Vectors, which may not be very efficient, so once more we turn to profiling.

2.4.4. Profiling

Unlike in Goal Geometry, where it was easy to identify a single function (l2Norm) with poor performance, here an obvious bottleneck (that is, one limited to one or a few functions) was harder for me to identify.

Hasktorch CPU backend

There are calls to retryWithGC' each time a Hasktorch tensor is used, which seems to indicate that memory allocation is failing for some reason.

Hasktorch CUDA backend

The CUDA backend shows the same problem. I don’t have a good enough grasp of the low-level details to analyze this further at the moment, but it’s something I shall return to soon.

What is clear is that the matrix operations involved in the forward pass perform well when executed directly, but my implementation suffers a severe performance loss. One possible solution may be to avoid using lists and pass tensors around directly, similar to how Replicated Manifolds are handled.
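
As a rough sketch of that idea (not Goal’s current code), the whole batch could be kept in a single Hasktorch tensor and the layer applied with one einsum, so no per-sample list of Vectors needs to be rebuilt:

-- Sketch: 'weights' has shape [out, in], 'batch' has shape [batchSize, in];
-- the result has shape [batchSize, out].
forwardBatch :: Tensor -> Tensor -> Tensor
forwardBatch weights batch = einsum "oi,bi->bo" [weights, batch]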

3. Future Work

  • Implement the remaining functions in Goal Core, such as the pseudoinverse
  • Fix the logic error in the implementation of neural networks
  • Diagnose the performance issues in the forward pass with more thorough profiling
  • Rewrite the code to resolve the performance issues
  • Check whether the problem persists with backpropagation, and if so, address it
  • Developer setup
    1. Use nix to provide a developer environment for Goal, similar to the template Hasktorch provides.
    2. Cache the Hasktorch build artifacts in Docker to dramatically reduce build times.

Author: Zarak Mahmud

Created: 2022-09-12 Mon 20:18