Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import pyarrow-hotfix #669

Merged
merged 1 commit into from Mar 21, 2024
Merged

Conversation

jdblischak
Copy link
Collaborator

Until we can install pyarrow >=14.0.2 in the cloud conda environments, we should ensure that pyarrow-hotfix is installed and imported. From its PyPI Usage section:

pyarrow_hotfix must be imported in your application or library code for it to take effect

xref: TileDB-Inc/tiledb-vcf-feedstock#115

Also, I noticed that pyarrow isn't included in setup_requires

setup_requires=[
"setuptools>=18.0",
"setuptools_scm>=1.5.4",
"wheel>=0.30",
"pybind11>=2.3.0",
"setuptools_scm_git_archive",
],
install_requires=[],
tests_require=[],

Which seems unlikely to be true given that pyarrow is imported in setup.py:

import pyarrow

In general, is there a reason that setup_requires, install_requires, and test_requires aren't overly utilized in this repo? I expected to see at least pyarrow in install_requires and dask in test_requires.

@jdblischak
Copy link
Collaborator Author

Frustratingly linux-libtiledb-dev failed with a segmentation fault during the Python tests both here and on the run on my fork. I restarted both

@jdblischak
Copy link
Collaborator Author

linux-libtiledb-dev is consistently failing. Are any of the recent commits to libtiledb potentially affecting this?

Also I manually triggered a nightly build off of the main branch on my fork to see if these failures are at all related to my PR

@jdblischak
Copy link
Collaborator Author

Also I manually triggered a nightly build off of the main branch on my fork to see if these failures are at all related to my PR

Unfortunately the nightly on main passed 😭

@jdblischak
Copy link
Collaborator Author

Maybe it's the version of dask. The nightly on main installed dask 2023.9.3 from its cache. Whereas my PR is installing dask 2024.2.1

@jdblischak
Copy link
Collaborator Author

Restricting the upper bound for dask didn't fix it. Any ideas?

@gspowley
Copy link
Member

Based on the test failures:

Current thread 0x000078a9fd4f9740 (most recent call first):
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/ctx.py", line 345 in __init__
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/ctx.py", line 545 in default_ctx
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/highlevel.py", line 337 in _get_ctx
  File "/home/runner/micromamba/envs/py4vcf/lib/python3.9/site-packages/tiledb/highlevel.py", line 28 in open
  File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 975 in test_disable_ingestion_tasks

The failing test is trying to open an array (which doesn't exist) with tiledb-py:

with tiledb.open(ac_uri) as A:
df = A.query(attrs=["alt", "count"], dims=["pos"]).df[contig, region]

I don't know why this causes a segfault, but the use of tiledb-py is complicating the test. We only need to check that the array path does not exist.

This patch simplifies the test:

diff --git a/apis/python/tests/test_tiledbvcf.py b/apis/python/tests/test_tiledbvcf.py
index 2e4fd3bd..87fd41b6 100755
--- a/apis/python/tests/test_tiledbvcf.py
+++ b/apis/python/tests/test_tiledbvcf.py
@@ -966,23 +966,12 @@ def test_disable_ingestion_tasks(tmp_path):
     if platform.system() != "Linux":
         return
 
-    # query allele_count array with TileDB
+    # ensure the allele_count and variant_stats arrays are not created
     ac_uri = os.path.join(tmp_path, "dataset", "allele_count")
+    assert not os.path.exists(ac_uri)
 
-    contig = "1"
-    region = slice(69896)
-    with pytest.raises(Exception):
-        with tiledb.open(ac_uri) as A:
-            df = A.query(attrs=["alt", "count"], dims=["pos"]).df[contig, region]
-
-    # query variant_stats array with TileDB
     vs_uri = os.path.join(tmp_path, "dataset", "variant_stats")
-
-    contig = "1"
-    region = slice(12140)
-    with pytest.raises(Exception):
-        with tiledb.open(vs_uri) as A:
-            df = A.query(attrs=["allele", "ac"], dims=["pos"]).df[contig, region]
+    assert not os.path.exists(vs_uri)
 
 
 def test_ingestion_tasks(tmp_path):

@gspowley
Copy link
Member

The next failure is also related to using tiledb-py. There must be an unexpected interaction somewhere.

@jdblischak
Copy link
Collaborator Author

The next failure is also related to using tiledb-py. There must be an unexpected interaction somewhere.

Doesn't that suggest there is a compatibility problem between the release version of tiledb-py and the dev release of libtiledb?

Also, I've started getting errors about missing the package dask-expr when importing tiledbvcf-py. It's bizarre though because only some the runs failed. Most bizarre is the macOS workflow. Both the python-standalone job and the Python+external libtiledbvcf jobs installed dask 2024.3.0, but the former failed due to missing dask-expr while the latter passed.

https://github.com/TileDB-Inc/TileDB-VCF/actions/runs/8250806005/job/22566300642?pr=669#step:4:13
https://github.com/TileDB-Inc/TileDB-VCF/actions/runs/8250806005/job/22567028956?pr=669#step:5:14

@gspowley
Copy link
Member

Doesn't that suggest there is a compatibility problem between the release version of tiledb-py and the dev release of libtiledb?

Yes, I believe so. I couldn't explain why the regular nightly is passing, but the regular nightly may be configured differently.

@jdblischak
Copy link
Collaborator Author

Quick update. I fixed the dask import errors in #673. However, the linux-libtiledb-dev build is still segfault'ing during the tiledbvcf-py tests, just as before. Also, a new error while running the libtiledbvcf unit tests on macOS.

-------------------------------------------------------------------------------
TileDB-VCF: Test Resume Ingest and Export With Contig Merge
-------------------------------------------------------------------------------
/Users/runner/work/TileDB-VCF/TileDB-VCF/libtiledbvcf/test/src/unit-vcf-export.cc:1744
...............................................................................

/Users/runner/work/TileDB-VCF/TileDB-VCF/libtiledbvcf/test/src/unit-vcf-export.cc:1744: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  Error loading metadata; 'version' field has invalid value.

===============================================================================
test cases:   78 |   77 passed | 1 failed
assertions: 6790 | 6789 passed | 1 failed

This PR doesn't touch the libtiledbvcf source code, so I'm hoping this is spurious. I restarted it

@jdblischak
Copy link
Collaborator Author

This PR doesn't touch the libtiledbvcf source code, so I'm hoping this is spurious. I restarted it

Confirmed. It was spurious

@jdblischak
Copy link
Collaborator Author

Still failing the tiledbvcf-py tests:

File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 40 in check_if_compatible
File "/home/runner/work/TileDB-VCF/TileDB-VCF/TileDB-VCF/apis/python/tests/test_tiledbvcf.py", line 992 in test_ingestion_tasks

def check_if_compatible(uri):
try:
with tiledb.open(uri):

check_if_compatible(ac_uri)

Currently tiledb-py 0.26.0 is getting installed because the conda env is cached. My next idea is to delete the cache to install a more recent tiledb-py

@jdblischak
Copy link
Collaborator Author

Note that the nightly linux-libtiledb-dev failed on my fork because I deleted the GitHub Actions cache

https://github.com/jdblischak/TileDB-VCF/actions/runs/8353245587/job/22864662362

In other words, I think the only reason it is passing on the main repo is because the conda env is cached

@jdblischak
Copy link
Collaborator Author

Rebased onto #679

@jdblischak jdblischak self-assigned this Mar 21, 2024
@jdblischak
Copy link
Collaborator Author

I fixed the segfault (caused by running tiledb.open() with tiledb-py 0.26.0 and libtiledb 2.22 from its dev branch) by building TileDB-Py from source (#679).

To minimize this PR as much as possible (since we can remove pyarrow-hotfix once we can install pyarrow 14+ in TileDB Cloud), I only import pyarrow-hotfix once during the initialization of pyarrow when tiledbvcf-py is imported. I assume this is sufficient to apply the hotfix.

@jdblischak
Copy link
Collaborator Author

No Azure build was triggered for my last commit. Closing and reopening

@jdblischak jdblischak closed this Mar 21, 2024
@jdblischak jdblischak reopened this Mar 21, 2024
@jdblischak
Copy link
Collaborator Author

Reopening successfully triggered a new Azure build https://dev.azure.com/TileDB-Inc/CI/_build/results?buildId=38579&view=results

@jdblischak jdblischak merged commit 8197381 into TileDB-Inc:main Mar 21, 2024
31 checks passed
@jdblischak jdblischak deleted the pyarrow-hotfix branch March 21, 2024 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants