
Add some more flexibility to coadd: output array specification #351

Open · wants to merge 20 commits into main
Conversation

@keflavich (Contributor) commented Mar 13, 2023

This enables memmap'd output, which will be needed for very large (greater-than-memory) output cubes.

EDIT: it also extends reproject_and_coadd to three dimensions.
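
For illustration, a minimal sketch of the intended usage, assuming the `output_array`/`output_footprint` keywords this PR introduces (the shapes, file names, and input variables here are hypothetical, not taken from the PR itself):

```python
import numpy as np
from reproject import reproject_interp
from reproject.mosaicking import reproject_and_coadd

# Disk-backed output buffers, so the mosaic never has to fit in memory.
shape_out = (500, 2048, 2048)  # hypothetical (nchan, ny, nx) target cube
output = np.memmap("mosaic.dat", dtype=np.float32, mode="w+", shape=shape_out)
footprint = np.memmap("footprint.dat", dtype=np.float32, mode="w+", shape=shape_out)

array, fp = reproject_and_coadd(
    input_data,                        # list of (array, WCS) pairs, assumed defined
    target_wcs,                        # target WCS, assumed defined
    shape_out=shape_out,
    reproject_function=reproject_interp,
    output_array=output,
    output_footprint=footprint,
)
```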

codecov bot commented Mar 13, 2023

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison: base (1a91216) at 93.60% vs. head (0054941) at 90.60%.

Files Patch % Lines
reproject/mosaicking/coadd.py 73.33% 20 Missing ⚠️
reproject/mosaicking/subset_array.py 63.33% 11 Missing ⚠️
reproject/utils.py 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
- Coverage   93.60%   90.60%   -3.01%     
==========================================
  Files          25       25              
  Lines         892      947      +55     
==========================================
+ Hits          835      858      +23     
- Misses         57       89      +32     


@astrofrog (Member) left a comment


This is useful, thanks! Just some comments below; also, be sure to include some tests.

Four review comments on reproject/mosaicking/coadd.py (outdated, resolved).
@keflavich (Contributor Author)

I'm a little stymied on this:
ValueError: Chunks do not add up to shape. Got chunks=((4,), (4, 100, 101), (4, 100, 101)), shape=(4, 174, 173)
It looks like block_size isn't playing nicely with the different-shaped output; I think fixing that relies on your WIP PR, @astrofrog?
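
For reference, dask requires the chunk sizes along each axis to sum to that axis's length, which these don't (along axis 1, 4 + 100 + 101 = 205 ≠ 174). A minimal reproduction of the error:

```python
import dask.array as da
import numpy as np

# The per-axis chunk sizes must sum to the axis length; along axis 1,
# 4 + 100 + 101 = 205 != 174, so this raises the ValueError quoted above.
da.from_array(
    np.empty((4, 174, 173)),
    chunks=((4,), (4, 100, 101), (4, 100, 101)),
)
```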

@keflavich (Contributor Author)

@astrofrog I have some questions now:

The loop in reproject_and_coadd is presently not lazy, and not possible to make lazy, correct? We need your daskified PR to make it lazy?

I think I'm going to move the write-to-big-array step into the parent loop, because otherwise everything still has to be held in memory.

@keflavich (Contributor Author)

Ahh, there's a good reason for that - background matching! Yikes.

@keflavich (Contributor Author)

So, as is, this does coadding on a dataset-by-dataset basis - which, for cubes, means a cube-by-cube basis. It may be more efficient to do it on a plane-by-plane basis, but that seems like a lot of additional refactoring work.

@keflavich (Contributor Author)

There is a huge efficiency boost to be gained if we can reproject directly into the output array, but that results in overwriting the individual values instead of adding to them.

@keflavich (Contributor Author)

@astrofrog I don't see any way to do this, but do you have ideas? Basically, instead of map_coordinates(..., output=array) writing directly into memory, it would do output += map_coordinates(...). That looks problematic at the lowest levels, and I don't know whether it would work at all for other reprojection methods.
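
A toy illustration of the two behaviours (the coordinates and shapes are made up; only the contrast between `output=` and `+=` matters):

```python
import numpy as np
from scipy.ndimage import map_coordinates

data = np.random.random((200, 200))            # one input image
coords = np.indices((174, 173)).astype(float)  # toy target->input pixel mapping
output = np.zeros((174, 173))

# (a) Write straight into the buffer: no temporary array, but each input
# simply overwrites whatever a previous input left there.
map_coordinates(data, coords, output=output)

# (b) Accumulate: earlier contributions survive, but each input needs a
# temporary array the size of the output (or of the current chunk).
output += map_coordinates(data, coords)
```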

This sacrifices flexibility even further in order to improve speed and robustness and reduce memory use. In the case where you don't want to match backgrounds, this approach just loads the unweighted data into one array and the footprint into another, and divides them at the end, as sketched below. The robustness benefit is that we can flush to disk granularly.
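
A sketch of that no-background-matching path, with illustrative names only (`inputs`, `reproject_function`, `target_wcs`, and `shape_out` are assumed defined; the two accumulators could equally be memmaps):

```python
import numpy as np

mosaic = np.zeros(shape_out)    # running sum of footprint-weighted data
weights = np.zeros(shape_out)   # running sum of footprints

for array_in, wcs_in in inputs:
    reprojected, fp = reproject_function(
        (array_in, wcs_in), target_wcs, shape_out=shape_out
    )
    mosaic += np.where(fp > 0, reprojected * fp, 0)
    weights += fp
    # If mosaic/weights are memmaps, this is the point where a partial
    # result can be flushed to disk after each input.

with np.errstate(invalid="ignore"):
    final = mosaic / weights    # NaN wherever no input contributed
```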

I'm not sure this is worth pursuing further, though, because of the depth of refactoring that would be needed. Maybe this is something dask already solves well enough.

@astrofrog (Member)

I'll have a think about all this! Should be able to work on it a bit this week.

@keflavich (Contributor Author)

@astrofrog I've been using this PR in practice for a while now, and I just had a look into rebasing, but the codebase has diverged a ton, making this a rather challenging rebase. I'd like to go ahead with it, but this time make sure the changes can get merged. What's your feeling - is there anything blocking this if it's rebased?

@keflavich (Contributor Author)

OK, the rebase wasn't as bad as I thought, but there were some confusing items that I'm not sure I've resolved yet.

@astrofrog (Member)

I can try and prioritise this!

@keflavich (Contributor Author)

With the new modes (first, last, min, max), we again face duplicating code, because we need one version that tries to background-match before the final combination step and another that does not. The current approach simply breaks if match_background is off and one tries first/last/min/max. Is there a more elegant approach?
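
To make the duplication concrete, here is a hedged sketch of a streaming combine step for the non-background-matched path (all names are illustrative, not the PR's actual API):

```python
import numpy as np

def combine_inplace(mosaic, seen, new, fp, mode):
    # Fold one reprojected input into the running mosaic, in place.
    valid = fp > 0
    fresh = valid & ~seen             # pixels no earlier input touched
    mosaic[fresh] = new[fresh]        # every mode keeps the first value it sees
    both = valid & seen               # pixels needing a pairwise combine
    if mode == "last":
        mosaic[both] = new[both]
    elif mode == "min":
        mosaic[both] = np.minimum(mosaic[both], new[both])
    elif mode == "max":
        mosaic[both] = np.maximum(mosaic[both], new[both])
    # mode == "first": keep the existing value, nothing to do
    seen |= valid
```

The background-matched path cannot stream like this, because the offsets are only known after every input has been reprojected and the overlaps compared; hence the two diverging code paths.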

@astrofrog (Member)

Will investigate!

@astrofrog (Member)

To try and keep things a bit simpler I've extracted out the changes related to specifying output arrays/footprints into a separate PR: #387

@astrofrog (Member)

In terms of supporting 3D arrays (and also the median combine mode), I think things are getting complicated enough that it's worth thinking about a different approach for doing the co-adding. I'm working on an experimental re-write of reproject_and_coadd which uses dask internally and will open an alternative PR soon so that we can compare and see what works best in practice.

@astrofrog (Member) commented Sep 11, 2023

See #388 for an alternative approach.

@keflavich (Contributor Author)

@astrofrog I'm still using this branch in production. It's apparently working (verification in progress). I clearly ran out of time to test #388. Any further thoughts on which direction to take? Both PRs are big, and I don't immediately remember what the reasons are for choosing one or the other.

@astrofrog (Member)

Sorry for dropping the ball on this, I'll try and see if I can wrap things up in the next week or so.

@keflavich (Contributor Author)

My last commit adds a hack to solve this issue: radio-astro-tools/spectral-cube#900 (I should perhaps have posted that here, but it isn't obvious to me where it came from).
