Use an aligned allocator for NumPy? #5312

Closed
sturlamolden opened this issue Nov 26, 2014 · 70 comments

Comments

@sturlamolden
Contributor

Regarding the f2py regression in NumPy 1.9 with failures on 32-bit Windows, the question is whether NumPy should start to use an allocator which gives guaranteed alignment.

scipy/scipy#4168

@sturlamolden
Contributor Author

Here is one example of an allocator that should work on all platforms. It is shamelessly based on this:

https://sites.google.com/site/ruslancray/lab/bookshelf/interview/ci/low-level/write-an-aligned-malloc-free-function

There are not many ways to do this and similar code is floating around on the net, so extending it in this way is probably OK. (And besides, it does not implement realloc.)

Dropping this code into numpy/core/include/numpy/ndarraytypes.h should ensure that freshly allocated ndarrays are properly aligned on all platforms.

This platform-independent code could possibly be replaced with posix_memalign() on POSIX and _aligned_malloc() on Windows. However, combining posix_memalign() with realloc() is not possible, so implementing it ourselves is probably better.

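#define NPY_MEMALIGN 32   /* 16 for SSE2, 32 for AVX, 64 for Xeon Phi; must be a power of two */

/*
 * Each block is over-allocated by NPY_MEMALIGN - 1 + sizeof(void*) bytes.
 * The returned pointer is rounded to NPY_MEMALIGN, and the pointer obtained
 * from PyMem_Malloc/PyMem_Realloc is stored in the slot just before it.
 */
static NPY_INLINE
void *PyArray_realloc(void *p, size_t n)
{
    void *p1, **p2, *base;
    size_t old_offs, offs = NPY_MEMALIGN - 1 + sizeof(void*);
    if (NPY_UNLIKELY(n > (size_t)-1 - offs)) return NULL;   /* n + offs would overflow */
    if (NPY_UNLIKELY(p != NULL)) {
        base = *(((void**)p)-1);
        /* compute the old offset before base is invalidated by the realloc */
        old_offs = (size_t)((Py_uintptr_t)p - (Py_uintptr_t)base);
        if (NPY_UNLIKELY((p1 = PyMem_Realloc(base,n+offs)) == NULL)) return NULL;
        if (NPY_LIKELY(p1 == base)) return p;   /* block did not move, p is still aligned */
        p2 = (void**)(((Py_uintptr_t)(p1)+offs) & ~(Py_uintptr_t)(NPY_MEMALIGN-1));
        /* the data was copied at its old offset; shift it to the new aligned address */
        memmove(p2,(char*)p1+old_offs,n);
    } else {
        if (NPY_UNLIKELY((p1 = PyMem_Malloc(n + offs)) == NULL)) return NULL;
        p2 = (void**)(((Py_uintptr_t)(p1)+offs) & ~(Py_uintptr_t)(NPY_MEMALIGN-1));
    }
    *(p2-1) = p1;
    return (void*)p2;
}

static NPY_INLINE
void *PyArray_malloc(size_t n)
{
    return PyArray_realloc(NULL, n);
}

static NPY_INLINE
void *PyArray_calloc(size_t n, size_t s)
{
    void *p;
    if (NPY_UNLIKELY(s != 0 && n > (size_t)-1 / s)) return NULL;   /* n*s would overflow */
    if (NPY_UNLIKELY((p = PyArray_realloc(NULL,n*s)) == NULL)) return NULL;
    memset(p, 0, n*s);
    return p;
}

static NPY_INLINE
void PyArray_free(void *p)
{
    void *base;
    if (p == NULL) return;   /* like free(), accept NULL */
    base = *(((void**)p)-1);
    PyMem_Free(base);
}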
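For comparison, the OS-level alternative mentioned before the code (posix_memalign() on POSIX) might look roughly like the sketch below. It is standalone C using libc rather than the PyMem API, and the helper names `os_aligned_malloc` and `os_aligned_realloc` are hypothetical. It illustrates why realloc is the awkward part: POSIX has no aligned realloc, so resizing means allocating a new block, copying, and freeing, and the caller has to remember the old size.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate n bytes aligned to `alignment`
 * (a power of two that is a multiple of sizeof(void*)). */
void *os_aligned_malloc(size_t n, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, n) != 0) return NULL;
    return p;
}

/* Hypothetical helper: emulate realloc by allocate + copy + free.
 * Unlike realloc(), the caller must pass the old size. */
void *os_aligned_realloc(void *p, size_t old_n, size_t new_n, size_t alignment)
{
    void *q = os_aligned_malloc(new_n, alignment);
    if (q == NULL) return NULL;      /* old block stays valid, as with realloc */
    if (p != NULL) {
        memcpy(q, p, old_n < new_n ? old_n : new_n);
        free(p);                     /* posix_memalign memory is released with free() */
    }
    return q;
}
```

This version always copies on resize, whereas the pointer-stashing code above can return early when PyMem_Realloc grows the block in place.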

@sturlamolden sturlamolden changed the title Use an aligned allocator for NumPy Use an aligned allocator for NumPy? Nov 26, 2014
@juliantaylor
Contributor

I already have a branch which adds an aligned allocator, I'll dig it out.

By using something like this we throw away the option to use Python's tracemalloc framework and sparse memory (there is no aligned_calloc).
@njsmith would you be willing to engage with the Python devs again to add yet another allocator to their slot before 3.5 is released? They already added calloc only for us, so it would be a shame if we now couldn't use it.

@sturlamolden
Contributor Author

Presumably one could pass in alignment in the context data of PyMemAllocatorEx? But NumPy has to support Python versions from 2.6 and up, so doing this in Python 3.5 might not solve the problem.

@njsmith
Member

njsmith commented Nov 26, 2014

I do think engaging with the python devs on this before 3.5 is a good idea,
but I still am not convinced we have a good reason to use an aligned
allocator in the near term. It cannot possibly be the case that struct {
double, double } actually requires better-than-malloc alignment on win32 or
SPARC, because if that were true then nothing would work.
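The point about `struct { double, double }` can be checked directly. The C standard requires malloc to return storage suitably aligned for any object type with a fundamental alignment, so the C-level requirement of such a struct is always met. The following is a minimal standalone sketch (the struct and function names are made up for illustration); it says nothing about the stricter alignment that SIMD code or a particular Fortran compiler might prefer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A complex double as two adjacent doubles (illustrative name). */
struct complex_double { double re, im; };

/* Returns 1 if a fresh malloc block satisfies the C alignment requirement
 * of struct complex_double, 0 otherwise (or if the allocation fails). */
int malloc_satisfies_struct_alignment(void)
{
    void *p = malloc(sizeof(struct complex_double));
    int ok;
    if (p == NULL) return 0;
    ok = ((uintptr_t)p % _Alignof(struct complex_double)) == 0;
    free(p);
    return ok;
}
```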

@sturlamolden
Contributor Author

The question with regard to f2py was what alignment Fortran would need, not the minimum requirement of C. Speed is also an issue. Both indexing and SIMD work better if the data is properly aligned.

@pv
Member

pv commented Nov 26, 2014

A reason for using aligned allocator could indeed be speed, and ensuring SSE/AVX
compatibility would remove the numerical jitter that comes from taking different
code paths for differently aligned data.

f2py is older than the ISO C binding standard in Fortran, and the way it
works is essentially the de facto standard way of interfacing Fortran
with C, used extensively by everyone. In light of this experience, it's
clear that alignment provided by system malloc is sufficient for the
Fortran compilers that matter in practice for us.
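The "numerical jitter" mentioned at the start of this comment can be illustrated with a scalar model of a vectorized sum. This is only a sketch under assumptions: the function name is hypothetical, and it imitates a 16-byte (four-float) SIMD loop with plain C rather than using real intrinsics. The number of peeled elements depends on the address of the buffer, so the order of the additions, and with it the floating-point rounding, can differ between two buffers holding identical values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n floats in three phases: a scalar "peel" loop until the pointer is
 * 16-byte aligned, a body with four independent accumulators (one per
 * SIMD lane), and a scalar tail for the leftovers. */
float sum_with_peel(const float *x, size_t n)
{
    float head = 0.0f, lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0, k;

    /* peel: scalar adds until x + i is 16-byte aligned */
    while (i < n && ((uintptr_t)(x + i) % 16) != 0) {
        head += x[i++];
    }
    /* body: four partial sums, as four SSE lanes would keep */
    for (; i + 4 <= n; i += 4) {
        for (k = 0; k < 4; k++) lanes[k] += x[i + k];
    }
    /* tail: leftover elements */
    for (; i < n; i++) head += x[i];

    return head + ((lanes[0] + lanes[1]) + (lanes[2] + lanes[3]));
}
```

With a guaranteed alignment the peel loop never runs, so every buffer takes the same path and the summation order is fixed.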

@pitrou
Member

pitrou commented Dec 5, 2014

@sturlamolden
Contributor Author

@pitrou And 64 byte alignment is recommended for Xeon Phi. Take a look at the comment behind the definition of NPY_MEMALIGN in my code example.

@njsmith
Member

njsmith commented Dec 5, 2014

The main complication on providing aligned allocation is that ATM we can
either hook into the tracemalloc infrastructure xor do aligned allocation,
and fixing this will require some coordination with CPython upstream (see
#4663).

@pitrou
Member

pitrou commented Dec 5, 2014

So the CPython issue is at http://bugs.python.org/issue18835.

@pitrou
Member

pitrou commented Jan 15, 2015

Given the complications with realloc(), it might not be realistic to expect CPython to solve this in the 3.5 timeframe. Numpy should perhaps use its own aligned allocator wrapper instead (which should be able to defer to the PyMem API, and take advantage of tracemalloc, anyway).

@sturlamolden
Contributor Author

Code for such an allocator is included above. I don't understand @juliantaylor 's argument, but he probably understands this better than me.

I can understand what he meant about calloc though. A calloc is not simply a malloc and a memset to zero. A memset will require the OS to fetch the pages before they are needed. AFAIK there is no PyMem_Calloc.
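One way to keep the calloc behaviour described here while still aligning the result is to over-allocate with calloc itself and only write the stashed pointer. The sketch below is standalone C on top of libc, not NumPy's or CPython's API; the names `aligned_zeroed_alloc`, `aligned_zeroed_free`, and the `ALIGNMENT` value are assumptions for illustration. Whether large calloc requests really come from untouched zero pages depends on the platform's allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* assumed value; must be a power of two */

/* Zeroed, aligned allocation built on calloc() instead of malloc()+memset().
 * The block is over-allocated, the returned pointer is rounded to ALIGNMENT,
 * and the raw pointer is stashed just before it so it can be freed later.
 * Only the stashed pointer is written; no memset touches the payload. */
void *aligned_zeroed_alloc(size_t nelem, size_t elsize)
{
    const size_t offs = ALIGNMENT - 1 + sizeof(void *);
    size_t nbytes;
    void *raw, **aligned;

    /* reject requests where nelem * elsize + offs would overflow */
    if (elsize != 0 && nelem > (SIZE_MAX - offs) / elsize) return NULL;
    nbytes = nelem * elsize;
    raw = calloc(1, nbytes + offs);
    if (raw == NULL) return NULL;
    aligned = (void **)(((uintptr_t)raw + offs) & ~(uintptr_t)(ALIGNMENT - 1));
    aligned[-1] = raw;
    return aligned;
}

void aligned_zeroed_free(void *p)
{
    if (p != NULL) free(((void **)p)[-1]);
}
```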

@pitrou
Member

pitrou commented Jan 15, 2015

Actually CPython 3.5 has PyMem_Calloc and friends.
I think @juliantaylor was considering the implementation case of using OS functions (posix_memalign, etc.). But that doesn't sound necessary.

By the way @sturlamolden, your snippet redefines PyArray_malloc and friends, but array allocation seems to use PyDataMem_NEW. Am I misunderstanding something?

@pitrou
Member

pitrou commented Jan 15, 2015

Another thought is that aligned allocation may be wasteful for small arrays. Perhaps there should be a threshold below which standard allocation is used?
Also, should the alignment be configurable?
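To put rough numbers on the "wasteful for small arrays" concern: with the over-allocate-and-round scheme from earlier in the thread, each allocation pays up to `alignment - 1` bytes of padding plus one stashed pointer, which is 39 bytes for 32-byte alignment on a 64-bit build. The sketch below shows that arithmetic together with a validity check a configurable alignment would need. All three function names and the 1/8 threshold are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a is usable as an alignment for the pointer-stashing scheme:
 * a power of two and at least sizeof(void*). */
int is_valid_alignment(size_t a)
{
    return a >= sizeof(void *) && (a & (a - 1)) == 0;
}

/* Worst-case extra bytes per allocation: alignment - 1 bytes of padding
 * plus one stashed pointer. */
size_t aligned_overhead(size_t alignment)
{
    return alignment - 1 + sizeof(void *);
}

/* Hypothetical policy: request alignment only when the overhead is at most
 * 1/8 of the payload; smaller requests would use the plain allocator. */
int should_align(size_t nbytes, size_t alignment)
{
    return nbytes / 8 >= aligned_overhead(alignment);
}
```

A mixed policy like this has a cost of its own: the free routine must be able to tell which allocator produced a given pointer.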

@sturlamolden
Contributor Author

The allocators are called PyArray_malloc and PyArray_free in NumPy 1.9. A lot has changed in NumPy 1.10.

@pitrou
Member

pitrou commented Jan 15, 2015

Are you sure? PyArray_NewFromDescr_int() calls npy_alloc_cache() and npy_alloc_cache() calls PyDataMem_NEW().

@njsmith
Member

njsmith commented Jan 15, 2015

Numpy has multiple allocation interfaces, and they don't have very obvious
names. PyArray_malloc/free are used for "regular" allocations (e.g. object
structs). Data buffers (ndarray ->data pointers, temporary buffers inside
ufuncs, etc.), however, are allocated via PyDataMem_NEW.


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@sturlamolden
Contributor Author

@charris
Member

charris commented Jan 15, 2015

@njsmith Yeah, we should rationalize the allocation macros some day... I'd start with the one used to allocate dimensions for ndarray (IIRC).

pitrou added a commit to pitrou/numpy that referenced this issue Jan 16, 2015
Introduce two new functions, get_data_alignment() and set_data_alignment() which allow
setting the guaranteed alignment at runtime.
@pitrou
Member

pitrou commented Jan 16, 2015

I created PR #5457 with a patch. Feedback on the approach would be nice.

@njsmith
Member

njsmith commented Jan 16, 2015

As far as I know there is currently no benefit to using an aligned
allocator in numpy?


@pitrou
Member

pitrou commented Jan 16, 2015

With Numba we determined that AVX vector instructions required a 32-byte alignment for optimal performance. If you compile Numpy with AVX enabled (requires specific compiler options, I guess), alignment should make a difference too.

@njsmith
Member

njsmith commented Jan 16, 2015

Out of curiosity, do you have any real-world measurements? I ask b/c there
are so many factors that play into these things (different overhead/speed
trade-offs at different array sizes, details of memory allocators -- which
also act differently at different array sizes -- etc.) that I find it hard
to guess whether one ends up with like a 0.5% end-to-end speedup or a 50%
end-to-end speedup or what.


@juliantaylor
Contributor

FWIW, on my i5-4210U I see no significant difference between 16-byte and 32-byte aligned data in a simple load-add-store test. The minimum cycle count seems lower by 5%, but the median and 10th percentile are identical to within 1%.

@sturlamolden
Contributor Author

Is that with AVX?

@eamartin
Contributor

eamartin commented May 6, 2017

The "Python aligned allocator" solution I suggested is a hack. I think offering alignment in the Python interfaces would be nice, but the right way to do that would be to handle alignment at the C level.

@vellamike

vellamike commented Jul 14, 2017

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte aligned data to be used; doing so speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using, but it is a bit of a hack.

@mborgerding

I definitely encourage 64 byte alignment:

  1. that is the cache line size
  2. it is suitable for any SIMD alignment up to AVX512
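Outside NumPy, a 64-byte-aligned buffer can be obtained with C11 `aligned_alloc()`. The following is a small sketch, assuming a C11 libc that provides it (MSVC does not; `_aligned_malloc()` is the usual substitute there). The helper name and the `CACHE_LINE` constant are illustrative, and 64 bytes is an assumption about the target CPU's cache line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache-line size; also covers AVX-512 vectors */

/* Allocate at least n bytes on a 64-byte boundary with C11 aligned_alloc().
 * C11 asks for the size to be a multiple of the alignment, so round n up. */
void *alloc_cache_aligned(size_t n)
{
    size_t rounded;
    if (n > SIZE_MAX - (CACHE_LINE - 1)) return NULL;  /* rounding would overflow */
    rounded = (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    if (rounded == 0) rounded = CACHE_LINE;            /* avoid a zero-size request */
    return aligned_alloc(CACHE_LINE, rounded);         /* release with free() */
}
```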

@aldanor

aldanor commented Jun 5, 2019

Here we are almost 5 years later.

Any thoughts on making this (64-byte alignment in particular) a standard feature?..

@bashtage
Contributor

bashtage commented Jun 5, 2019

This cython code is now in NumPy. Of course, this doesn't change the default.

@hmaarrfk
Contributor

My 2 cents: an aligned allocator would help when interfacing with hardware devices and kernel-level calls. These interfaces might benefit from aligning the buffers to pages.

@mattip
Member

mattip commented Sep 27, 2019

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

@jakirkham
Contributor

That would certainly be useful, @mattip. Would it be possible to access this functionality from Python as well?

@sturlamolden
Contributor Author

> In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

Ripped off my code, did ya? 😂

@sturlamolden
Contributor Author

@bashtage
Contributor

bashtage commented Nov 17, 2019 via email

@mattip
Member

mattip commented Nov 17, 2019

what was the original source of the code?

@sturlamolden
Contributor Author

The link is dead, but it was adapted from an aligned malloc that looked like this:

https://tianrunhe.wordpress.com/2012/04/23/aligned-malloc-in-c/

@bashtage
Contributor

bashtage commented Nov 17, 2019 via email

@sturlamolden
Contributor Author

#5312 (comment)

@bashtage
Contributor

bashtage commented Nov 17, 2019 via email

@pitrou
Member

pitrou commented Nov 17, 2019

Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any applies :-)

@rgommers
Member

> It looks like I am no longer contributor for the code I have written for NumPy 🧐:

You are and will always be:)

> Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any applies :-)

Nope. We try to avoid encoding such things inside the source code, since that will always be wildly incomplete and hard to maintain. We do ask people to list themselves in THANKS.txt; I'm looking at a better alternative to that, because that file often gives merge conflicts.

@seberg
Member

seberg commented Nov 5, 2021

I am going to close the issue. Happy about a new one though! (It seems most of the discussion is simply outdated and should be re-evaluated based on Matti's work in gh-17582)

Note that it is now – with the next release – possible to write a context manager outside NumPy to swap in an aligned allocator. This alleviates the need to push it directly into NumPy and gives a chance for much clearer testing of the benefits.

@seberg seberg closed this as completed Nov 5, 2021
@jakirkham
Contributor

Also there is a tracking issue with follow up items ( #20193 ). One of them is allocators with specific alignment ( #20193 (comment) )
