Google Summer of Code Final Report
1. Overview
This is the final report for my Google Summer of Code project with the Haskell organization: enhancing the `Goal` library with GPU operations. All commits were made to the experimental branch.
2. Working Period
2.1. Developer setup
I originally planned to create a nix shell for development. During the proposal period I experimented with Hasktorch and Goal using a nix template provided by Hasktorch. Ultimately, however, we settled on Docker for the dev environment, as it may be more straightforward to use with Apptainer for HPC tasks. The final Docker image clocked in at over 14 GB. I tried to reduce this but only managed to trim off about half a gigabyte.
```
zarak@nixos ~ ❯ docker history f71d81c4404d
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
f71d81c4404d   2 months ago   /bin/bash -c #(nop)  CMD ["/bin/bash"]          0B
153417880ada   2 months ago   /bin/bash -c #(nop)  ENV PATH=/usr/local/bin…   0B
116decd9c5a0   2 months ago   /bin/bash -c #(nop)  ENV NVIDIA_DRIVER_CAPAB…   0B
ea3546bf7926   2 months ago   /bin/bash -c #(nop)  ENV NVIDIA_VISIBLE_DEVI…   0B
c55774b975f8   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   4.03GB
3e713b88518a   2 months ago   /bin/bash -c #(nop) COPY file:1642f81b2fc032…   1.22kB
b354f7d1e1be   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   66.9kB
bee356d14e95   2 months ago   /bin/bash -c #(nop)  SHELL [/bin/bash -c]       0B
8bb54779d991   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   20MB
1a9db6f64989   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   3.8GB
7009ce892ba2   2 months ago   |4 CABAL=latest DEBIAN_FRONTEND=noninteracti…   4.78GB
1c62275ed751   2 months ago   /bin/sh -c #(nop)  ARG CABAL=latest             0B
231a648e7653   2 months ago   /bin/sh -c #(nop)  ARG GHC=9.0.2                0B
a8e5be427c4f   2 months ago   |2 DEBIAN_FRONTEND=noninteractive GPG_KEY=77…   15.4MB
6be00d62872f   2 months ago   |2 DEBIAN_FRONTEND=noninteractive GPG_KEY=77…   3.29kB
01c26ac2d371   2 months ago   /bin/sh -c #(nop)  ARG GPG_KEY=7784930957807…   0B
9db92454b75e   2 months ago   |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   1.54GB
6f33782b4774   2 months ago   /bin/sh -c #(nop)  ENV BLAS=OpenBlas            0B
56468c103876   2 months ago   /bin/sh -c #(nop)  ENV TZ=Europe/Berlin         0B
1f2add153ab6   2 months ago   /bin/sh -c #(nop)  ARG DEBIAN_FRONTEND=nonin…   0B
20fffa419e3a   3 months ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      3 months ago   /bin/sh -c #(nop) ADD file:00dae10e79b05c4e1…   72.8MB
```
This docker image does not cache the Hasktorch library build artifacts, which means that I have to rebuild all of Hasktorch each time I run a fresh instance of the container. Another slight annoyance occurred if I left the container running and my system was suspended; libtorch would occasionally stop detecting the GPU, which necessitated a container restart and hence rebuild. Rather than spend time tackling all these issues, however, I proceeded with the primary objective of the project.
2.2. Goal Core
`Core` primarily contains functions in `Vector.hs` that are thin wrappers around their Hasktorch counterparts. If a required function was not provided by Hasktorch, I wrote it using Hasktorch primitives.
For example, `scale` is a thin wrapper around Hasktorch’s `mulScalar` function:
```haskell
scale :: Scalar x => x -> Vector n a d -> Vector n a d
{-# INLINE scale #-}
scale x = UnsafeVector . mulScalar x . toTorch
```
On the other hand, I used `einsum` to implement a function to compute the outer product, as this was not available natively in Hasktorch:
```haskell
outerProduct
    :: (KnownNat m, KnownNat n, Num x)
    => Vector m x d -> Vector n x d -> Matrix m n x d
{-# INLINE outerProduct #-}
outerProduct v1 v2 =
    Matrix $ UnsafeVector $ flattenAll $ einsum "i,j->ij" [toTorch v1, toTorch v2]
```
Another example is `matrixRoot`, which I implemented by referring to the `hmatrix` source code:
```haskell
-- | Computes the square root of a 'Matrix' using Denman-Beavers iteration.
-- A slightly modified version of the
-- [implementation](https://hackage.haskell.org/package/hmatrix-0.20.2/docs/src/Internal.Algorithms.html#sqrtmInv)
-- in hmatrix-0.20.2.
matrixRoot
    :: forall n x d. (KnownNat n, Num x, Scalar x, HasTensorOptions x d)
    => Matrix n n x d -> Matrix n n x d
{-# INLINE matrixRoot #-}
matrixRoot mtx =
    Matrix $ UnsafeVector $ flattenAll $ fst $ fixedPoint $ iterate f (t, eyeSquare n opts)
  where
    n = natValInt (Proxy :: Proxy n)
    opts = tensorOpts (Proxy :: Proxy x) (Proxy :: Proxy d)
    f (y, z) = (0.5 * (y + Torch.inverse z), 0.5 * (Torch.inverse y + z))
    t = reshape [n, n] $ toTorch $ toVector mtx
    fixedPoint (a : b : rest)
        | asValue $ le (l1Loss ReduceMean (fst a) (fst b)) peps = a
        | otherwise = fixedPoint (b : rest)
    fixedPoint _ = error "fixed point with impossible inputs"
```
Unlike in the `hmatrix` version, however, when checking for convergence I took the mean rather than the sum of the absolute differences when computing the l1 norm, because with the sum reduction I sometimes failed to achieve convergence even though it was theoretically expected. This function wasn’t critical within the scope of this project, so rather than investigating further I moved on.
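For reference, the Denman-Beavers recurrence implemented by `f` above starts from $Y_0 = M$ and $Z_0 = I$ and iterates

$$Y_{k+1} = \tfrac{1}{2}\left(Y_k + Z_k^{-1}\right), \qquad Z_{k+1} = \tfrac{1}{2}\left(Z_k + Y_k^{-1}\right),$$

with $Y_k$ converging to $M^{1/2}$ (and $Z_k$ to $M^{-1/2}$).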
2.2.1. Test suite
I wrote a suite of unit tests for the functions in `Core`. The suite caught errors that were missed by the type system (I used Hasktorch’s untyped API), such as vector shape mismatches and the conflation of CPU and CUDA tensors. A few functions in `Core` internally use Hasktorch factory functions, which create tensors on the CPU by default, so the test suite was particularly helpful in identifying the functions with these errors.
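As a hedged illustration of the style of these checks (this is not the actual test suite, and it assumes hspec is available), the following test pins down the CPU default of a Hasktorch factory function:

```haskell
import Test.Hspec
import qualified Torch

-- Hasktorch factory functions such as zeros' allocate on the CPU by default,
-- so a test along these lines flags any wrapper that silently produces a CPU
-- tensor inside a CUDA pipeline.
main :: IO ()
main = hspec $
    describe "tensor factories" $
        it "zeros' allocates on the CPU by default" $
            Torch.device (Torch.zeros' [3]) `shouldBe` Torch.Device Torch.CPU 0
```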
2.3. Goal Geometry
2.3.1. Runtime exceptions
The runtime exceptions due to the mixing of CPU and CUDA tensors were resolved, but we decided to add another type parameter for the device on which a tensor is initialized. As a result, the syntax is not as clean, but the type safety is probably worth it. The unit test suite picked up some new errors after this change: certain data types are not supported for all operations on the GPU. For example, matrix multiplication only works with floating-point types. `Goal` only uses floating-point types with `Point`s on a `Manifold`, so this wouldn’t present any problems, but one possible way of conditionally constraining data types depending on the device follows:
```haskell
type CUDANotSupportedError (b :: Type) =
         'Text "You attempted to use " ':<>: 'ShowType b
    ':$$: 'Text "which is not supported on CUDA"

type CUDASupportedTypes = [Float, Double]

type family ValidType (a :: DeviceType) (b :: Type) :: Constraint where
    ValidType 'CPU b = ()
    ValidType 'CUDA b =
        TypeError (CUDANotSupportedError b) ('True ~ Elem b CUDASupportedTypes)

type family Elem (e :: Type) (es :: [Type]) where
    Elem e '[] = 'False
    Elem e (e : es) = 'True
    Elem e (_ : es) = Elem e es
```
The above includes a helpful error message if someone were to use an unsupported type.
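To sketch how this might be used, the constraint would simply be attached to device-sensitive operations. `safeMatrixMultiply` and the underlying `matrixMatrixMultiply` below are hypothetical names for illustration, not existing Goal functions:

```haskell
-- Hypothetical wrapper: ValidType rejects element types that the CUDA
-- backend cannot handle, while leaving CPU usage unconstrained.
safeMatrixMultiply
    :: (KnownNat m, KnownNat n, KnownNat k, ValidType d x)
    => Matrix m n x d -> Matrix n k x d -> Matrix m k x d
safeMatrixMultiply = matrixMatrixMultiply
```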
2.3.2. Benchmarks
After propagating the Hasktorch primitives through the Geometry package, I ran some benchmarks comparing the original Goal backend to both the CPU and CUDA based Hasktorch backends. There isn’t a large difference in CPU performance between the backends, but the GPU shows a substantial performance boost across the board.
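A minimal criterion-style sketch of such a comparison (the workloads below are stand-ins rather than the actual Goal benchmarks) would look something like this:

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)

-- Stand-in workloads; in the real benchmarks these would be the same Goal
-- operation backed by hmatrix, Hasktorch CPU, and Hasktorch CUDA respectively.
goalOp, torchCpuOp, torchCudaOp :: [Double] -> Double
goalOp = sum
torchCpuOp = sum
torchCudaOp = sum

main :: IO ()
main = defaultMain
    [ bgroup "vector workload"
        [ bench "goal backend"   $ nf goalOp xs
        , bench "hasktorch CPU"  $ nf torchCpuOp xs
        , bench "hasktorch CUDA" $ nf torchCudaOp xs
        ]
    ]
  where
    xs = [1 .. 10000] :: [Double]
```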
2.3.3. Profiling
After implementing most of the code in `goal-graphical`, I was able to profile the gradient-descent script. I discovered that there was a flaw in my implementation of `l2Norm`:
```haskell
l2Norm
    :: forall k d. (KnownNat k, HasTensorOptions Double d)
    => Vector k Double d -> Double
l2Norm =
    asValue . Torch.sqrt . (* fromIntegral k) . mseLoss (zeros [k] opts) . toTorch
  where
    k = natValInt (Proxy :: Proxy k)
    opts = tensorOpts (Proxy :: Proxy Double) (Proxy :: Proxy d)
```
In the above code, I create a (non-sparse) tensor of zeros to use with Hasktorch’s `mseLoss` function. Allocating this memory is not necessary, however, as I can compute the `l2Norm` more directly:
```haskell
l2Norm :: forall k d. (KnownNat k) => Vector k Double d -> Double
l2Norm = asValue . Torch.sqrt . sumAll . pow (2 :: Int) . toTorch
```
2.4. Goal Probability
This was the most challenging Goal package to implement, and several outstanding issues remain. One problem is that a neural network using the Hasktorch backend does not learn at all if there are multiple neurons in a layer. It does, however, learn if there is only a single neuron in the hidden layer. This suggests a bug in the implementation of Replicated `Manifold`s. There may be an incorrect reshaping operation where a `Vector` is not constructed in row-major order (a toy illustration of this kind of layout mix-up follows below). I was not able to diagnose the problem after nearly a week of troubleshooting, so I proceeded to benchmarking and profiling.
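As that toy illustration (plain Hasktorch, not Goal code): the same six weights read back row-major versus column-major give different matrices, while for a single neuron per layer the two layouts coincide, which would match the observed behavior:

```haskell
import Torch (asTensor, reshape, transpose2D)

main :: IO ()
main = do
    let ws = asTensor ([1, 2, 3, 4, 5, 6] :: [Float])
    -- row-major interpretation of the six weights
    print $ reshape [2, 3] ws                -- [[1,2,3],[4,5,6]]
    -- column-major interpretation of the same weights
    print $ transpose2D (reshape [3, 2] ws)  -- [[1,3,5],[2,4,6]]
```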
2.4.1. Random tensors
I structured the code to randomly generate tensors using Hasktorch instead of RandomMWC. For example, the following generates a `Vector` of Bernoulli random variables:
```haskell
bernoulli
    :: forall n d. (KnownNat n, HasTensorOptions Double d)
    => Double -> State Generator (Vector n Bool d)
bernoulli p = do
    v <- state T.randomUniformVector :: State Generator (Vector n Double d)
    pure $ T.lessThan v p
```
The implementation of Random could be improved as it currently has an unused parameter.
2.4.2. Runtime exceptions
I debugged the runtime exceptions here with the GHCi debugger, by running `:set -fbreak-on-error` in a REPL followed by `:trace`. This was an enormous help, although it did take some getting used to.
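A typical session looks roughly like this; once the breakpoint fires, GHCi’s `:history` and `:back` commands let you walk backwards through the evaluation steps that led to the exception:

```
ghci> :set -fbreak-on-error
ghci> :trace main
ghci> :history
ghci> :back
```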
The first exception arose due to a faulty implementation of `fromInteger` in the `Num` instance of `Vector`. I originally used a stock deriving strategy to define the `Num` instance for `Vector`. This would produce a size-0 tensor when casting an integer to a `Vector`, whereas the correct behavior is to generate a `Vector` containing n copies of the integer, where n is the size of the `Vector`.
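A minimal sketch of the corrected behaviour, written here as a standalone function (reusing the helpers that appear in the earlier snippets) rather than as the actual `Num` instance:

```haskell
-- Replicate the integer across all n entries instead of yielding an empty
-- tensor; ones and mulScalar are Hasktorch primitives.
fromIntegerVector
    :: forall n d. (KnownNat n, HasTensorOptions Double d)
    => Integer -> Vector n Double d
fromIntegerVector i =
    UnsafeVector $ mulScalar (fromInteger i :: Double) (ones [n] opts)
  where
    n = natValInt (Proxy :: Proxy n)
    opts = tensorOpts (Proxy :: Proxy Double) (Proxy :: Proxy d)
```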
The second exception was due to a faulty implementation of `lowerTriangular` in `Core`. The fix involved correctly masking the triangular elements.
2.4.3. Benchmarks
The plan was to benchmark backpropagation, but instead I only benchmarked the forward pass.
The performance is substantially worse using the Hasktorch backend, which is surprising given that a forward pass is essentially a series of simple matrix operations. The problem may lie with the `>$>` operator, which is basically a wrapper around `matrixMap` and similar functions. This function takes a list of `Vector`s and performs the `map` operation as a single matrix-matrix multiplication. The resulting matrix is then split back into a list of `Vector`s, which may not be very efficient (a rough sketch of the pattern follows below), so once more we turn to profiling.
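For concreteness, here is that rough sketch of the map-as-matmul pattern over raw Hasktorch tensors. This is not the actual Goal code; the stacking and splitting strategy shown (`stack`, `chunk`, `squeezeAll`) is just one way to express it:

```haskell
import Torch (Dim (..), Tensor, chunk, matmul, squeezeAll, stack, transpose2D)

-- Apply one (m x n) weight matrix to a list of length-n input vectors by
-- stacking them into a single (batch x n) matrix, performing one matmul,
-- and splitting the (batch x m) result back into per-sample vectors.
-- The final split back into a list is the step suspected of being costly.
mapAsMatmul :: Tensor -> [Tensor] -> [Tensor]
mapAsMatmul weights inputs =
    map squeezeAll
        . chunk (length inputs) (Dim 0)
        $ stack (Dim 0) inputs `matmul` transpose2D weights
```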
2.4.4. Profiling
Unlike in `Goal Geometry`, where it was easy to identify a single function (`l2Norm`) with poor performance, here an obvious bottleneck (that is, one limited to one or a few functions) was harder for me to identify.
Hasktorch CPU backend
There are calls to `retryWithGC'` each time a Hasktorch tensor is used, which seems to indicate that memory allocation is failing for some reason.
Hasktorch CUDA backend
The CUDA backend shows the same problem. I don’t have a good enough grasp of the low-level details to analyze this further at this time, but it’s something I shall return to soon.
What’s clear is that performing the matrix operations involved in the forward pass directly gives good performance, but my implementation suffers a severe performance loss. One possible solution may be to avoid using lists and pass tensors around directly, similar to how Replicated `Manifold`s are handled.
3. Future Work
- There are a few functions that have yet to be implemented in `Goal Core`, such as the pseudo-inverse
- The logic error in the implementation of neural networks should be fixed
- Diagnose the performance issues in the forward pass with more thorough profiling
- Rewrite the code to resolve the performance issues
- Check if the problem persists with backpropagation, and if so, address it
- Developer setup
  - Use nix to provide a developer environment for Goal, similar to the template Hasktorch provides.
  - Cache the Hasktorch build artifacts in docker to dramatically reduce build times.