In GMRES it is necessary to build an orthonormal basis for the Krylov subspace, because in finite precision the basis vectors quickly become linearly dependent. In Jacobi-Davidson we require orthonormality of the search subspace and the converged Schur vectors as well, so that converged Schur vectors do not re-enter the search subspace. Also, throughout both algorithms we simplify expressions via equalities like $V^TV = I$. Due to rounding errors this identity might however not hold exactly. Therefore it is important to have some guarantee of orthogonality.
The benchmarks in this post are performed on an Intel i5-4460 @ 3.20GHz using 16GB of DDR3 memory at 1600MHz.
Classical and Modified Gram-Schmidt
Given a matrix $V \in \mathbb{R}^{n \times m}$ with orthonormal columns, where $m \ll n$, and a vector $w \in \mathbb{R}^n$, our goal is to remove the components of $w$ spanned by the columns of $V$.
This comes down to updating $w \leftarrow (I - VV^T)w$ and subsequently normalizing $w$. In GMRES we want to store the projection $h = V^Tw$ as well.
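Written out in the quantities that appear in the code below, one orthogonalization step is

$$h = V^T w, \qquad w \leftarrow w - Vh, \qquad w \leftarrow \frac{w}{\|w\|_2},$$

which is exactly what the two gemv calls and the final scale! implement; in GMRES the pair $(h, \|w\|_2)$ — the h and nrm returned by the functions — forms the next column of the Hessenberg matrix.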
Classical Gram-Schmidt in Julia would look like this:
import Base.LinAlg.BLAS: gemv!, gemv
function classical_gram_schmidt!{T}(V::StridedMatrix{T}, w::StridedVector{T})
    # Orthogonalize
    h = gemv('T', one(T), V, w)
    gemv!('N', -one(T), V, h, one(T), w)
    # Normalize
    nrm = norm(w)
    scale!(w, one(T) / nrm)
    h, nrm
end
It is well-known that due to rounding errors the above procedure is not stable. Modified Gram-Schmidt tries to “solve” this problem by performing the projection sequentially:
import Base.LinAlg.BLAS: axpy!
function modified_gram_schmidt!{T}(V::StridedMatrix{T}, w::StridedVector{T})
    k = size(V, 2)
    h = zeros(T, k)
    # Orthogonalize
    for i = 1 : k
        column = @view(V[:, i])
        h[i] = dot(column, w)
        axpy!(-h[i], column, w)
    end
    # Normalize
    nrm = norm(w)
    scale!(w, one(T) / nrm)
    h, nrm
end
Performance-wise this for-loop is a major bottleneck. The problem is that we have gone from BLAS-2 (a matrix-vector product) to BLAS-1 (inner products and vector updates).
If we bench the above with m = 20 and n = 10_000_000 we’ll quickly see the problem:
using BenchmarkTools
function bench(; m = 20)
    srand(1)
    suite = BenchmarkGroup()
    for threads = [1, 2, 4]
        BLAS.set_num_threads(threads)
        thread_key = string(threads)
        suite[thread_key] = BenchmarkGroup()
        for n = [1_000_000, 10_000_000]
            V, _ = qr(rand(n, m))
            new_vec = rand(n)
            key = string(n)
            suite[thread_key][key] = BenchmarkGroup()
            suite[thread_key][key]["cgs"] = @benchmark classical_gram_schmidt!($V, a) setup=(a = copy($new_vec))
            suite[thread_key][key]["mgs"] = @benchmark modified_gram_schmidt!($V, a) setup=(a = copy($new_vec))
        end
    end
    suite
end
This outputs:
"1" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(39.492 ms)
"cgs" => Trial(19.027 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(414.534 ms)
"cgs" => Trial(199.866 ms)
"2" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(35.880 ms)
"cgs" => Trial(16.462 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(393.617 ms)
"cgs" => Trial(168.942 ms)
"4" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(35.084 ms)
"cgs" => Trial(15.886 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"mgs" => Trial(392.145 ms)
"cgs" => Trial(159.393 ms)
which shows that MGS is more than twice as slow. The reason is that BLAS-1 operations are memory bound: every piece of data is touched only once and there is a lot of data to stream from main memory, so the bottleneck is memory bandwidth rather than arithmetic. For the same reason enabling more BLAS threads hardly helps.
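A rough back-of-the-envelope estimate of the memory traffic makes this plausible. The stream counts below are my own assumption (no caching of the 80 MB columns between operations), not a measurement:

n, m = 10_000_000, 20
col = 8 * n                          # bytes in one Float64 column of V (80 MB)

# CGS: each gemv streams all of V once, plus a few passes over w
cgs_bytes = 2 * m * col + 3 * col    # ≈ 3.4 GB

# MGS: every iteration streams one column twice (dot and axpy)
# and the vector w three times (two reads, one write)
mgs_bytes = m * (2 * col + 3 * col)  # ≈ 8.0 GB

At the measured ~200 ms for CGS this corresponds to an effective bandwidth of roughly 17 GB/s, which is about what DDR3-1600 can deliver in dual-channel mode (an assumption about the machine above), and the ratio of the two estimates is consistent with MGS being roughly twice as slow.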
Iterative / refined orthogonalization
Even though MGS is an easy trick to ensure better orthogonality, it does not necessarily generate vectors that are orthogonal up to machine precision. For instance, if we do
srand(1)
V, _ = qr(rand(1_000_000, 10))
w = rand(1_000_000)
modified_gram_schmidt!(V, w)
@show norm(V' * w)
modified_gram_schmidt!(V, w)
@show norm(V' * w)
it outputs
norm(V' * w) = 1.0289082309190917e-15
norm(V' * w) = 4.513770868406318e-17
which basically means that the projection is not idempotent. After the first pass of MGS there are still components in the direction of the columns of $V$, and only in the second pass are these removed almost down to machine precision.
If we really need to maintain orthogonality, one idea is to repeat MGS twice or even more often. In Krylov subspace methods there is the rule of thumb that “twice is enough”, due to Kahan and Parlett, but we can just as well decide whether to repeat the orthogonalization using the Daniel-Gragg-Kaufman-Stewart (DGKS) criterion. This criterion states that we should repeat the orthogonalization if the tangent of the angle between $w$ and its projection $Vh$ is smaller than $\eta$. In ARPACK it seems this parameter is chosen as $\eta = 1/\sqrt{2}$.
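In formulas: write $\tilde w = (I - VV^T)w$ for the orthogonalized vector and $h = V^Tw$. Assuming $V$ has orthonormal columns, the classical variant below checks

$$\tan \angle\bigl(w, \operatorname{span}(V)\bigr) = \frac{\|\tilde w\|}{\|h\|} < \eta,$$

while the modified variant below uses the closely related test $\|\tilde w\| \le \eta \, \|w\|$; with $\eta = 1/\sqrt{2}$ both trigger precisely when $w$ lost a large part of its norm during orthogonalization.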
I’m going to assume reorthogonalization is only necessary once, but the if can be replaced by a while.
function repeated_classical_gram_schmidt!{T}(V::StridedMatrix{T}, w::StridedVector{T}; η = one(T) / √2)
    # Orthogonalize
    h = gemv('T', one(T), V, w)
    gemv!('N', -one(T), V, h, one(T), w)
    # Repeat
    nrm = norm(w)
    if nrm < η * norm(h)
        increment = gemv('T', one(T), V, w)
        gemv!('N', -one(T), V, increment, one(T), w)
        axpy!(one(T), increment, h)
        nrm = norm(w)
    end
    # Normalize
    scale!(w, one(T) / nrm)
    h, nrm
end
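As mentioned above, the if can be replaced by a while if we want to keep orthogonalizing until the criterion is satisfied. A sketch of what that could look like (the function name and the max_passes safeguard are mine; this version was not benchmarked):

function while_classical_gram_schmidt!{T}(V::StridedMatrix{T}, w::StridedVector{T}; η = one(T) / √2, max_passes = 3)
    # First pass, identical to repeated_classical_gram_schmidt!
    h = gemv('T', one(T), V, w)
    gemv!('N', -one(T), V, h, one(T), w)
    nrm = norm(w)
    passes = 1
    # Keep projecting while the remainder is small compared to what was
    # projected away; cap the number of passes in case w is numerically
    # in the span of V.
    while nrm < η * norm(h) && passes < max_passes
        increment = gemv('T', one(T), V, w)
        gemv!('N', -one(T), V, increment, one(T), w)
        axpy!(one(T), increment, h)
        nrm = norm(w)
        passes += 1
    end
    # Normalize
    scale!(w, one(T) / nrm)
    h, nrm
end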
Similarly, the repeated version of modified Gram-Schmidt:
function repeated_modified_gram_schmidt!{T}(V::StridedMatrix{T}, w::StridedVector{T}; η = one(T) / √2)
    k = size(V, 2)
    h = zeros(T, k)
    nrm_before = norm(w)
    # Orthogonalize
    for i = 1 : k
        column = @view(V[:, i])
        h[i] = dot(column, w)
        axpy!(-h[i], column, w)
    end
    # Repeat if w lost too much of its norm during the sweep
    nrm = norm(w)
    if nrm ≤ η * nrm_before
        for i = 1 : k
            column = @view(V[:, i])
            correction = dot(column, w)
            h[i] += correction
            axpy!(-correction, column, w)
        end
        nrm = norm(w)
    end
    # Normalize
    scale!(w, one(T) / nrm)
    h, nrm
end
We benchmark these methods as well:
function bench(; m = 20)
    srand(1)
    suite = BenchmarkGroup()
    for threads = [1, 2, 4]
        BLAS.set_num_threads(threads)
        thread_key = string(threads)
        suite[thread_key] = BenchmarkGroup()
        for n = [1_000_000, 10_000_000]
            V, _ = qr(rand(n, m))
            new_vec = rand(n)
            key = string(n)
            suite[thread_key][key] = BenchmarkGroup()
            suite[thread_key][key]["rcgs"] = @benchmark repeated_classical_gram_schmidt!($V, a) setup=(a = copy($new_vec))
            suite[thread_key][key]["rmgs"] = @benchmark repeated_modified_gram_schmidt!($V, a) setup=(a = copy($new_vec))
        end
    end
    suite
end
and their results are:
"1" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(37.474 ms)
"rmgs" => Trial(79.268 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(393.802 ms)
"rmgs" => Trial(829.219 ms)
"2" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(32.237 ms)
"rmgs" => Trial(72.151 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(331.603 ms)
"rmgs" => Trial(787.374 ms)
"4" => 2-element BenchmarkTools.BenchmarkGroup:
"1000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(31.030 ms)
"rmgs" => Trial(69.855 ms)
"10000000" => 2-element BenchmarkTools.BenchmarkGroup:
"rcgs" => Trial(323.154 ms)
"rmgs" => Trial(785.251 ms)
The good news is that repeated classical Gram-Schmidt is still faster than modified Gram-Schmidt without repetition.
Conclusion
If orthogonality is so important that repeated orthogonalization is necessary, then repeated classical Gram-Schmidt seems a good candidate. Even when repeated orthogonalization is not strictly necessary, repeated classical Gram-Schmidt can still be faster than modified Gram-Schmidt without repetition.
Note that this implies that the orthonormal basis is indeed best represented as a Matrix rather than a Vector of Vectors.
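To see why: with a Vector of Vectors there is no contiguous matrix to pass to gemv, so we are forced into exactly the BLAS-1 loop that made MGS slow. A sketch (the function name is mine; it mirrors modified_gram_schmidt! from above):

function vec_of_vec_gram_schmidt!{T}(V::Vector{Vector{T}}, w::Vector{T})
    h = zeros(T, length(V))
    # One dot and one axpy per basis vector: BLAS-1 only, no gemv possible
    for (i, v) = enumerate(V)
        h[i] = dot(v, w)
        axpy!(-h[i], v, w)
    end
    nrm = norm(w)
    scale!(w, one(T) / nrm)
    h, nrm
end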
Remark
In the benchmarks above I normalize a vector by first computing t = 1 / norm(w) and only then computing w *= t. The upside is that a multiplication is much cheaper than a division, but the downside is that we round twice: once when forming the reciprocal and once in each multiplication. I don’t know whether the latter has significant consequences for orthogonality.