Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Azure Blob Storage API #378

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nkiraly
Copy link
Contributor

@nkiraly nkiraly commented Feb 7, 2018

There have been several breaking changes and improvments to azure-storage APIs across the Azure SDK releases from 1.0.3 to 2.0.0 to 3.0.0.

Notable items affecting WAL-E:

  • Version 2 fixed blob listing paging issues seen in version 1.0
  • Version 2 added upload chunking, so WAL-E WABS chunking implementation was removed
  • Version 3 fixed descriptor leaking

Changes

  • Update WAL-E minimum Azure SDK version 3.0.0 to leverage blob listing paging issues and use SDK-provided get / put segmentation

  • Update extras to explicitly specify azure 3.0.0

  • Update azure-storage-blob minimum version to 1.1.0

  • migrate uri_put_file from put_blob / put_block / put_block_list to create_blob_from_bytes which has built in segmenting and parallelization

  • migrate uri_get_file from get_blob with chunking to get_blob_to_stream which has built in segmenting

    $ tox -e py35
    137 passed, 115 skipped in 22.41 seconds

References:
https://github.com/Azure/azure-storage-python/blob/master/BreakingChanges.md#version-0300
https://azure-sdk-for-python.readthedocs.io/en/v0.11.1/ref/azure.storage.blobservice.html
https://docs.microsoft.com/en-us/python/api/azure.storage.blob.blockblobservice.blockblobservice?view=azure-python#azure_storage_blob_blockblobservice_BlockBlobService_create_blob_from_bytes
https://docs.microsoft.com/en-us/python/api/azure.storage.blob.baseblobservice.baseblobservice?view=azure-python#azure_storage_blob_baseblobservice_BaseBlobService_list_blobs
https://github.com/Azure/azure-storage-python/blob/master/samples/blob/block_blob_usage.py

@nkiraly nkiraly changed the title Update WABS azure-storage blob API to 0.34.2 Update azure-storage blob API to 0.34.2 Feb 8, 2018
@titanous
Copy link

This also has a side effect of fixing a bug where any backups past the first 14 were inaccessible due to the list page size. The new default list_blobs size appears to be 5000, and it might make sense to specify that explicitly.

@titanous
Copy link

This branch appears to leak file descriptors during a backup-fetch, eventually resulting in EMFILE errors while downloading new segments.

This is the list of fds with a pool size of 10 after about 20 segments: https://gist.github.com/titanous/3808052e8c7e0cefa4009adebe4d6a4c

The number of open file descriptors grew with each new segment that started downloading.

@bearrito
Copy link

bearrito commented Apr 2, 2018

Does a specific test show those leaks?

@titanous
Copy link

titanous commented Apr 2, 2018

It happens consistently when downloading a large base backup. You could make a large dataset with pgtune to test.

@nkiraly
Copy link
Contributor Author

nkiraly commented Apr 18, 2018

I am looking at making further changes to make sure this works reliably.

@nkiraly nkiraly changed the title Update azure-storage blob API to 0.34.2 Update Azure Blob Storage API Apr 18, 2018
@nkiraly
Copy link
Contributor Author

nkiraly commented Apr 20, 2018

Using a package built with 5ae3deb I was able to backup-push a PostgreSQL database server with 20GB of databases to Azure blob storage, and restore that backup-fetch that basebackup to a new server.

I advocate for this pull request to be reviewed for acceptance. Any feedback is appreciated.

@bearrito
Copy link

bearrito commented Jun 4, 2018

@titanous Is there anything else required for this?

@deverant
Copy link
Member

deverant commented Jun 6, 2018

Hey! I assume you have tested this in production so the change is good for Azure and works in production? Really like how the never libraries allows for some code deletion. Two things come to mind right now:

  1. Can you rebase the branch and squash some of these commits?
  2. Have you been able to test this with another storage? Patching threads makes me slightly nervous since it can affect the other libraries used for other storages. I can see if I have time to test this branch out against GCS at least but we should also test it against AWS for sure.

Changes
* Update WAL-E minimum Azure SDK version 3.0.0 to leverage blob listing paging issues and use SDK-provided get / put segmentation
* Update extras to explicitly specify azure 3.0.0
* Update azure-storage-blob minimum version to 1.1.0
* migrate uri_put_file from put_blob / put_block / put_block_list to create_blob_from_bytes which has built in segmenting and parallelization
* migrate uri_get_file from get_blob with chunking to get_blob_to_stream which has built in segmenting

    $ tox -e py35
    137 passed, 115 skipped in 22.41 seconds

References:
https://github.com/Azure/azure-storage-python/blob/master/BreakingChanges.md#version-0300
https://azure-sdk-for-python.readthedocs.io/en/v0.11.1/ref/azure.storage.blobservice.html
https://docs.microsoft.com/en-us/python/api/azure.storage.blob.blockblobservice.blockblobservice?view=azure-python#azure_storage_blob_blockblobservice_BlockBlobService_create_blob_from_bytes
https://docs.microsoft.com/en-us/python/api/azure.storage.blob.baseblobservice.baseblobservice?view=azure-python#azure_storage_blob_baseblobservice_BaseBlobService_list_blobs
https://github.com/Azure/azure-storage-python/blob/master/samples/blob/block_blob_usage.py
this required moving monkey patches to if __main__ to allow pytests to run
@nkiraly
Copy link
Contributor Author

nkiraly commented Jun 6, 2018

  1. Yes I have been using this in production for some time. Base backups, WAL archiving, and restoration working as expected.

  2. I have not tested these changes with other storage systems.

  3. I have rebased the branch and squashed the SDK update iterations. Tests still passing locally with 5e0a415, and Travis CI agrees.

@deverant
Copy link
Member

deverant commented Jun 6, 2018

Verified that integration tests for GCS work even after the threads are patched.

@deverant
Copy link
Member

deverant commented Jun 6, 2018

@fdr Merging this probably means we need a version bump. Any thoughts on this? We should probably release a new version anyway now that the python 3.6 errors are fixed.

@vijayrajah
Copy link

vijayrajah commented Jul 11, 2018

I have used @nkiraly branch, with python 3.6 on Centos7. It is working without any issues. (yet!). I still need to test the restore part.

Can you please merge this and create a new version.

@fdr
Copy link
Member

fdr commented Jul 11, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants