
Investigate Celery SegFaults on Staging #3089

Closed
rajadain opened this issue Feb 21, 2019 · 4 comments

@rajadain
Member

On staging we're seeing segfaults like this:

[2019-02-20 13:58:35,634: ERROR/MainProcess] Task a8dc7519-21e4-42db-b3aa-33ff69240367 run from job 73134 raised exception: Worker exited prematurely: signal 11 (SIGSEGV).

particularly around GWLF-E execution for MapShed Stage 2. These errors do not occur locally or in production. Investigate and resolve.
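
As an aside, a cheap way to get a Python-level traceback out of a SIGSEGV is faulthandler (a minimal sketch, not from the original report; faulthandler is stdlib in Python 3 and available as a backport package for the Python 2 our workers run):

# segfault_trace.py -- hedged sketch, assuming the faulthandler
# backport is installed (pip install faulthandler on Python 2).
import faulthandler

# Install C-level handlers that dump every thread's Python traceback
# to stderr when the process receives SIGSEGV, SIGFPE, SIGABRT, or SIGBUS.
faulthandler.enable()

# ... invoke the suspect GWLF-E task here; if the worker segfaults,
# the last Python frames are printed to stderr before the process dies.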

@rajadain added the NSF Funding Source: National Science Foundation and in progress labels Feb 21, 2019
@rajadain self-assigned this Feb 21, 2019
@rajadain added the + label Feb 21, 2019
@rajadain
Member Author

I downloaded the Lower Schuylkill HUC-10 for testing, and then created this test script:

# test.py

import os
import json

from cStringIO import StringIO

from gwlfe import gwlfe, Parser

filename = 'huc10__1341.json'
filepath = os.path.abspath(filename)

with open(filepath, 'r') as input_json:
    mapshed_data = json.load(input_json)

    # Round Areas
    mapshed_areas = [round(a, 1) for a in mapshed_data['Area']]
    mapshed_data['Area'] = mapshed_areas

    # Prepare Input GMS
    pre_z = Parser.DataModel(mapshed_data)
    output = StringIO()
    writer = Parser.GmsWriter(output)
    writer.write(pre_z)
    output.seek(0)

    # Read Input GMS
    reader = Parser.GmsReader(output)
    z = reader.read()

    # Run the Model
    result, _ = gwlfe.run(z)

    # Write to file
    outpath = os.path.abspath('output.json')

    with open(outpath, 'w') as outfile:
        json.dump(result, outfile)

which runs fine on my local Worker VM:

vagrant@worker:/vagrant/scratch/celery-segfault$ python test.py 
vagrant@worker:/vagrant/scratch/celery-segfault$ ll
total 296
drwxr-xr-x 1 vagrant vagrant    160 Feb 21 17:03 ./
drwxr-xr-x 1 vagrant vagrant   3104 Feb 21 16:54 ../
-rw-r--r-- 1 vagrant vagrant 290782 Feb 21 16:50 huc10__1341.json
-rw-r--r-- 1 vagrant vagrant   5506 Feb 21 17:03 output.json
-rw-r--r-- 1 vagrant vagrant    808 Feb 21 17:03 test.py

But when I run it on Staging I get this:

ubuntu@ip-10-0-5-21:~/celery-segfault$ python test.py
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
Segmentation fault (core dumped)

So the suspicion that this is related to the latest GWLF-E is confirmed: the same script that runs cleanly on my local Worker segfaults on Staging. Proceeding to investigate _multiarray_umath now.
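
For context: numpy.core._multiarray_umath was introduced in numpy 1.16, when the multiarray and umath C extensions were merged, so a compiled dependency built against numpy >= 1.16 will look for it and fail against our pinned 1.14.5. A minimal probe of which numpy, and which layout, an environment actually has:

# numpy_probe.py -- minimal check of which numpy is imported and
# whether it has the post-1.16 _multiarray_umath layout.
import numpy

print(numpy.__version__)  # expect the pinned 1.14.5
print(numpy.__file__)     # confirm which install is on sys.path

try:
    from numpy.core import _multiarray_umath  # noqa: F401
    print('numpy >= 1.16 layout (_multiarray_umath present)')
except ImportError:
    print('pre-1.16 layout: extensions built against numpy >= 1.16 '
          'will fail to import here')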

@rajadain
Member Author

I've destroyed my Worker VM locally and am re-creating it from scratch to see if I can reproduce this.

@rajadain
Member Author

🎉 After I destroyed and recreated my Worker locally, I can now reproduce this:

$ vagrant ssh worker -c 'cd /vagrant/scratch/celery-segfault/ && python test.py'
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
bash: line 1: 14222 Segmentation fault      (core dumped) python test.py
Connection to 127.0.0.1 closed.
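
To compare the freshly provisioned Worker against the previous one, dumping the versions each environment actually resolved is a quick check (a sketch using pkg_resources; the names are the distributions from our requirements file):

# resolved_versions.py -- hedged sketch: print the versions this
# environment resolved for the suspect distributions.
import pkg_resources

for name in ('numpy', 'pandas', 'gwlf-e'):
    try:
        dist = pkg_resources.get_distribution(name)
        print('%s %s' % (name, dist.version))
    except pkg_resources.DistributionNotFound:
        print('%s not installed' % name)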

@rajadain
Member Author

Some threads indicated that the problem is in one of the underlying libraries, which builds against a more modern version of numpy that is later replaced by the one we specify. I tried changing the order of installation:

diff --git a/src/mmw/requirements/base.txt b/src/mmw/requirements/base.txt
index 115013e3..c69ffcb5 100644
--- a/src/mmw/requirements/base.txt
+++ b/src/mmw/requirements/base.txt
@@ -14,6 +14,7 @@ django-cors-headers==2.1.0
 cryptography==2.1.4
 pyOpenSSL==17.4.0
 markdown==2.6.9
+numpy==1.14.5
 tr55==1.3.0
 gwlf-e==2.0.0
 requests[security]==2.9.1
@@ -24,7 +25,6 @@ https://bitbucket.org/jurko/suds/get/94664ddd46a6.tar.gz#egg=suds-jurko
 django_celery_results==1.0.1
 pandas==0.22.0
 git+git://github.com/emiliom/ulmo@wml_values_md#egg=ulmo
-numpy==1.14.5
 hs_restclient==1.2.10
 six==1.11.0
 fiona==1.7.11

so that whichever package uses the more recent numpy during its installation (gwlf-e, pandas, or something else) builds against the correct version.

But it did not help.
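
To narrow down which dependency drags in the mismatched extension, importing each suspect in isolation is a cheap next step (a sketch; it surfaces the ImportError shown above, though it cannot catch the segfault itself):

# isolate_imports.py -- hedged sketch: import each suspect package on
# its own to see which one trips the _multiarray_umath ImportError.
for name in ('numpy', 'pandas', 'gwlfe'):
    try:
        __import__(name)
        print('%s imported cleanly' % name)
    except ImportError as exc:
        print('%s failed: %s' % (name, exc))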

@rajadain removed the + label Feb 21, 2019
rajadain added a commit that referenced this issue Feb 26, 2019

Circumvent GWLF-E SegFaults due to NumPy

Connects #3089
@rajadain closed this as completed Mar 7, 2019