
Mir based implementation #11

Open · 9il opened this issue Sep 27, 2017 · 3 comments

Comments

9il commented Sep 27, 2017

Hi,

You may not want to add additional dependencies. On the other hand, Mir Algorithm is stable enough and is de facto for Dlang what NumPy is for Python. Would you accept PRs with optimizations based on Mir Algorithm?
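
For context, pulling Mir in would be a single dub dependency; a minimal dub.sdl line (the version shown just matches the one benchmarked below):

    dependency "mir-algorithm" version="~>0.6.16"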

Mir Algorithm's benefits are:

  • Awesome matrix / tensor types (dense for now; sparse ones live in another repo and will be added too). They are cache friendly because all rows are laid out contiguously in memory.
  • Multidimensional each, reduce, zip (and other) implementations, created with @fastmath in mind.
  • Mir Algorithm code does not require LDC specialisation to be fast. All LDC-specific magic is done internally.
  • Clever automatic compile-time optimisations. For example, each over contiguous matrices is lowered to a single 1D loop because they can be flattened (see the sketch after this list).
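
A minimal sketch of that flattening (illustrative, not from the issue):

    import mir.ndslice.allocation: slice;
    import mir.ndslice.algorithm: each;

    void main()
    {
        auto m = slice!float(3, 4);   // contiguous 3x4 matrix
        // rows are adjacent in memory, so `each` visits all 12 elements
        // in a single flat 1D loop instead of a nested 3x4 loop
        m.each!((ref x) { x = 0; });
    }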

The complete Mir example below (under "With Mir Algorithm") generates code that is a few times faster (when the matrices fit in cache).

BTW, the current vectorflow code is not properly vectorized for LDC anyway, because 1.0 - beta forces the floats to be converted to doubles. It should be 1f - beta instead.
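
A standalone sketch of that promotion (the static asserts compile away, so this costs nothing):

    float beta = 0.9f;
    static assert(is(typeof(1.0 - beta) == double)); // the 1.0 double literal widens the expression
    static assert(is(typeof(1f - beta) == float));   // stays float, keeps the SIMD lanes full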

The current high-level API can be preserved for backward compatibility if necessary. Let me know what you think.
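
For reference, both snippets below implement the standard Adam update; judging from the code, beta1 and beta2 appear to hold the accumulated powers β1^t and β2^t used for bias correction, while beta1_0 and beta2_0 are the raw decay rates:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2
    w_t = w_{t-1} - lr \cdot \frac{m_t / (1 - \beta_1^t)}{\sqrt{s_t / (1 - \beta_2^t)} + \epsilon}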

With Mir Algorithm (v0.6.16, reviewed for AVX2)

// ldmd2 ttt.d -I../mir-algorithm/source -output-s -O -inline -release -mcpu=native -linkonce-templates
/// class ADAM : SGDOptimizer

    import mir.ndslice.slice;

    // compiles both for LDC and DMD
    import mir.internal.utility: fastmath;

    ContiguousMatrix!float W;
    ContiguousMatrix!float grad;
    ContiguousMatrix!float M;
    ContiguousMatrix!float S;
    float beta1_0;
    float beta2_0;
    float beta1;
    float beta2;
    float eps;
    float lr;

    static struct Kernel
    {
        float beta1_0;
        float beta2_0;
        float c1;
        float c2;
        float eps;
        float lr;
        @fastmath void opCall()(float g, ref float w, ref float m, ref float s)
        {
            import mir.math.common: sqrt; // vectorized for LDC
            // mc path
            m = beta1_0 * m + (1 - beta1_0) * g;
            g *= g;
            auto gt = m * c1;
            auto mc = lr * gt;
            // sc path
            s = beta2_0 * s + (1 - beta2_0) * g;
            auto st = s * c2;
            auto sc = sqrt(st) + eps;
            // w path
            w -= mc / sc;
        }
    }

    @fastmath final void update_matrix()
    {
        pragma(inline, false); // keep a standalone symbol so the generated asm is easy to inspect
        import mir.ndslice.algorithm: each; // vectorized for LDC

        Kernel kms;
        kms.beta1_0 = beta1_0;
        kms.beta2_0 = beta2_0;
        kms.c1 = 1 / (1 - beta1);
        kms.c2 = 1 / (1 - beta2);
        kms.eps = eps;
        kms.lr = lr;
        each!kms(grad, W, M, S);
    }
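
A hypothetical setup sketch for the fields above (nout/nin are illustrative names, not vectorflow's; `slice` is mir-algorithm's allocator of contiguous matrices):

    import mir.ndslice.allocation: slice;

    // hypothetical constructor body, given a layer of shape nout x nin
    W    = slice!float(nout, nin);
    grad = slice!float(nout, nin);
    M    = slice!float(nout, nin);
    S    = slice!float(nout, nin);
    M[] = 0; // zero the first/second moment estimates
    S[] = 0;
    // each training step: fill `grad`, then call update_matrix();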

Current code

/// class ADAM : SGDOptimizer
    // references
    float[][] W;
    float[][] grad;

    // local variables
    float eps;
    float lr;

    float beta1_0;
    float beta2_0;
    float beta1;
    float beta2;

    float[][] M;
    float[][] S;


    version(LDC)
    {
        import ldc.attributes;
        pragma(inline, true)
        @fastmath static void adam_op(float[] row, float beta, float[] g) pure
        {
            for(int i = 0; i < row.length; ++i)
                row[i] = beta * row[i] + (1.0 - beta) * g[i];
        }

        pragma(inline, true)
        @fastmath static void adam_op2(float[] row, float beta, float[] g) pure
        {
            for(int i = 0; i < row.length; ++i)
                row[i] = beta * row[i] + (1.0 - beta) * g[i] * g[i];
        }

        pragma(inline, true)
        @fastmath static void adam_op3(
                float[] row_W, float[] row_S, float[] row_M,
                float beta1_, float beta2_, float eps_, float lr_) pure
        {
            float k1 = 1.0/(1.0 - beta1_);
            float k2 = 1.0/(1.0 - beta2_);
            for(int i = 0; i < row_W.length; ++i)
                row_W[i] -= lr_ * k1 * row_M[i] / (sqrt(k2 * row_S[i]) + eps_);
        }
    }

    final void update_matrix()
    {
        foreach(k; 0..W.length)
        {
            auto row_grad = grad[k];
            auto row_W = W[k];
            auto row_M = M[k];
            auto row_S = S[k];

            version(LDC)
            {
                adam_op(row_M, beta1_0, row_grad);
                adam_op2(row_S, beta2_0, row_grad);
                adam_op3(row_W, row_S, row_M, beta1, beta2, eps, lr);
            }
            else
            {
                foreach(i; 0..row_W.length)
                {
                    auto g = row_grad[i];

                    row_M[i] = beta1_0 * row_M[i] + (1.0 - beta1_0) * g;
                    row_S[i] = beta2_0 * row_S[i] + (1.0 - beta2_0) * g * g;

                    auto gt = row_M[i] / (1.0 - beta1);
                    auto st = row_S[i] / (1.0 - beta2);

                    row_W[i] -= lr * gt / (sqrt(st) + eps);
                }
            }
        }
    }
9il (author) commented Sep 27, 2017

The main loops of the Mir kernels are auto-vectorised and unrolled by LDC: AVX2, two 8-float vectors per loop iteration.

; all *ps operations are operations on SIMD vectors
LBB2_14:
	vmovups	(%rsi,%rdi,4), %ymm8
	vmulps	%ymm15, %ymm8, %ymm9
	vfmadd231ps	(%rcx,%rdi,4), %ymm2, %ymm9     ; vectorized FMA
	vmovups	%ymm9, (%rcx,%rdi,4)
	vmulps	%ymm8, %ymm8, %ymm8
	vmulps	%ymm13, %ymm8, %ymm8
	vfmadd231ps	(%rax,%rdi,4), %ymm7, %ymm8
	vmulps	%ymm1, %ymm8, %ymm10
	vrsqrtps	%ymm10, %ymm11                     ; vectorized SQRT
	vmulps	%ymm9, %ymm12, %ymm9
	vmovaps	%ymm12, %ymm14
	vmulps	%ymm11, %ymm10, %ymm12
	vfmadd213ps	%ymm0, %ymm12, %ymm11
	vmulps	%ymm6, %ymm12, %ymm12
	vmulps	%ymm11, %ymm12, %ymm11
	vcmpneqps	%ymm5, %ymm10, %ymm10
	vandps	%ymm11, %ymm10, %ymm10
	vaddps	%ymm4, %ymm10, %ymm10
	vrcpps	%ymm10, %ymm11
	vmovups	%ymm8, (%rax,%rdi,4)
	vfmadd213ps	%ymm3, %ymm11, %ymm10
	vfmsub132ps	%ymm11, %ymm11, %ymm10
	vfmadd213ps	(%rdx,%rdi,4), %ymm9, %ymm10
	vmovups	%ymm10, (%rdx,%rdi,4)
	vmovups	32(%rsi,%rdi,4), %ymm8
	vmulps	%ymm15, %ymm8, %ymm9
	vfmadd231ps	32(%rcx,%rdi,4), %ymm2, %ymm9
	vmovups	%ymm9, 32(%rcx,%rdi,4)
	vmulps	%ymm8, %ymm8, %ymm8
	vmulps	%ymm13, %ymm8, %ymm8
	vfmadd231ps	32(%rax,%rdi,4), %ymm7, %ymm8
	vmulps	%ymm1, %ymm8, %ymm10
	vrsqrtps	%ymm10, %ymm11
	vmulps	%ymm11, %ymm10, %ymm12
	vfmadd213ps	%ymm0, %ymm12, %ymm11
	vmulps	%ymm6, %ymm12, %ymm12
	vmulps	%ymm11, %ymm12, %ymm11
	vmovaps	%ymm14, %ymm12
	vcmpneqps	%ymm5, %ymm10, %ymm10
	vandps	%ymm11, %ymm10, %ymm10
	vaddps	%ymm4, %ymm10, %ymm10
	vrcpps	%ymm10, %ymm11
	vmulps	%ymm9, %ymm12, %ymm9
	vmovups	%ymm8, 32(%rax,%rdi,4)
	vfmadd213ps	%ymm3, %ymm11, %ymm10
	vfmsub132ps	%ymm11, %ymm11, %ymm10
	vfmadd213ps	32(%rdx,%rdi,4), %ymm9, %ymm10
	vmovups	%ymm10, 32(%rdx,%rdi,4)
	addq	$16, %rdi                  ; 16 floats (2 x float8) per iteration
	cmpq	%rdi, %rbp
	jne	LBB2_14

yvdriess commented Oct 9, 2017

Seconded, I am very interested in having a Mir implementation. Do you already have a git repo that I can clone?

9il (author) commented Oct 9, 2017

Just created https://github.com/libmir/vectorflow.
It is identical to the original for now. PRs are welcome! Feel free to ask me on GitHub or Gitter if you need any help with Mir.
