
Investigate Celery SegFaults on Staging #3089

Closed
rajadain opened this issue Feb 21, 2019 · 4 comments

@rajadain
Member

On staging we're seeing segfaults like this:

[2019-02-20 13:58:35,634: ERROR/MainProcess] Task a8dc7519-21e4-42db-b3aa-33ff69240367 run from job 73134 raised exception: Worker exited prematurely: signal 11 (SIGSEGV).

particularly around GWLF-E execution for MapShed Stage 2. These errors do not occur locally or in production. Investigate and resolve.
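
As an aside, a cheap way to get a Python-level traceback out of a SIGSEGV is faulthandler (a minimal sketch, not from the original report; faulthandler is stdlib in Python 3 and available as a backport package for the Python 2 our workers run):

# segfault_trace.py -- hedged sketch, assuming the faulthandler
# backport is installed (pip install faulthandler on Python 2).
import faulthandler

# Install C-level handlers that dump every thread's Python traceback
# to stderr when the process receives SIGSEGV, SIGFPE, SIGABRT, or SIGBUS.
faulthandler.enable()

# ... invoke the suspect GWLF-E task here; if the worker segfaults,
# the last Python frames are printed to stderr before the process dies.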

@rajadain added the NSF Funding Source: National Science Foundation and in progress labels Feb 21, 2019
@rajadain self-assigned this Feb 21, 2019
@rajadain added the + label Feb 21, 2019
@rajadain
Member Author

I downloaded the Lower Schuylkill HUC-10 for testing, and then created this test script:

# test.py

import os
import json

from cStringIO import StringIO

from gwlfe import gwlfe, Parser

filename = 'huc10__1341.json'
filepath = os.path.abspath(filename)

with open(filepath, 'r') as input_json:
    mapshed_data = json.load(input_json)

    # Round Areas
    mapshed_areas = [round(a, 1) for a in mapshed_data['Area']]
    mapshed_data['Area'] = mapshed_areas

    # Prepare Input GMS
    pre_z = Parser.DataModel(mapshed_data)
    output = StringIO()
    writer = Parser.GmsWriter(output)
    writer.write(pre_z)
    output.seek(0)

    # Read Input GMS
    reader = Parser.GmsReader(output)
    z = reader.read()

    # Run the Model
    result, _ = gwlfe.run(z)

    # Write to file
    outpath = os.path.abspath('output.json')

    with open(outpath, 'w') as outfile:
        json.dump(result, outfile)

which runs fine on my local Worker VM:

vagrant@worker:/vagrant/scratch/celery-segfault$ python test.py 
vagrant@worker:/vagrant/scratch/celery-segfault$ ll
total 296
drwxr-xr-x 1 vagrant vagrant    160 Feb 21 17:03 ./
drwxr-xr-x 1 vagrant vagrant   3104 Feb 21 16:54 ../
-rw-r--r-- 1 vagrant vagrant 290782 Feb 21 16:50 huc10__1341.json
-rw-r--r-- 1 vagrant vagrant   5506 Feb 21 17:03 output.json
-rw-r--r-- 1 vagrant vagrant    808 Feb 21 17:03 test.py

But when I run it on Staging I get this:

ubuntu@ip-10-0-5-21:~/celery-segfault$ python test.py
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
Segmentation fault (core dumped)

So the suspicion that this is related to the latest GWLF-E is confirmed: the same script that runs cleanly on my local Worker segfaults on Staging. Proceeding to investigate _multiarray_umath now.
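
For context: numpy.core._multiarray_umath was introduced in numpy 1.16, when the multiarray and umath C extensions were merged, so a compiled dependency built against numpy >= 1.16 will look for it and fail against our pinned 1.14.5. A minimal probe of which numpy, and which layout, an environment actually has:

# numpy_probe.py -- minimal check of which numpy is imported and
# whether it has the post-1.16 _multiarray_umath layout.
import numpy

print(numpy.__version__)  # expect the pinned 1.14.5
print(numpy.__file__)     # confirm which install is on sys.path

try:
    from numpy.core import _multiarray_umath  # noqa: F401
    print('numpy >= 1.16 layout (_multiarray_umath present)')
except ImportError:
    print('pre-1.16 layout: extensions built against numpy >= 1.16 '
          'will fail to import here')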

@rajadain
Member Author

I've destroyed my Worker VM locally and am re-creating it from scratch to see if I can reproduce this.

@rajadain
Member Author

🎉 After I destroyed and recreated my Worker locally, I can now reproduce this:

$ vagrant ssh worker -c 'cd /vagrant/scratch/celery-segfault/ && python test.py'
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
ImportError: No module named _multiarray_umath
bash: line 1: 14222 Segmentation fault      (core dumped) python test.py
Connection to 127.0.0.1 closed.
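
To compare the freshly provisioned Worker against the previous one, dumping the versions each environment actually resolved is a quick check (a sketch using pkg_resources; the names are the distributions from our requirements file):

# resolved_versions.py -- hedged sketch: print the versions this
# environment resolved for the suspect distributions.
import pkg_resources

for name in ('numpy', 'pandas', 'gwlf-e'):
    try:
        dist = pkg_resources.get_distribution(name)
        print('%s %s' % (name, dist.version))
    except pkg_resources.DistributionNotFound:
        print('%s not installed' % name)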

@rajadain
Member Author

Some threads indicated that the problem is in one of the underlying libraries, which builds against a more modern version of numpy that is later replaced by the one we specify. I tried changing the order of installation:

diff --git a/src/mmw/requirements/base.txt b/src/mmw/requirements/base.txt
index 115013e3..c69ffcb5 100644
--- a/src/mmw/requirements/base.txt
+++ b/src/mmw/requirements/base.txt
@@ -14,6 +14,7 @@ django-cors-headers==2.1.0
 cryptography==2.1.4
 pyOpenSSL==17.4.0
 markdown==2.6.9
+numpy==1.14.5
 tr55==1.3.0
 gwlf-e==2.0.0
 requests[security]==2.9.1
@@ -24,7 +25,6 @@ https://bitbucket.org/jurko/suds/get/94664ddd46a6.tar.gz#egg=suds-jurko
 django_celery_results==1.0.1
 pandas==0.22.0
 git+git://github.com/emiliom/ulmo@wml_values_md#egg=ulmo
-numpy==1.14.5
 hs_restclient==1.2.10
 six==1.11.0
 fiona==1.7.11

so that whichever package uses the more recent numpy during its installation (gwlf-e, pandas, or something else) builds against the correct version.

But it did not help.
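
To narrow down which dependency drags in the mismatched extension, importing each suspect in isolation is a cheap next step (a sketch; it surfaces the ImportError shown above, though it cannot catch the segfault itself):

# isolate_imports.py -- hedged sketch: import each suspect package on
# its own to see which one trips the _multiarray_umath ImportError.
for name in ('numpy', 'pandas', 'gwlfe'):
    try:
        __import__(name)
        print('%s imported cleanly' % name)
    except ImportError as exc:
        print('%s failed: %s' % (name, exc))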

@rajadain removed the + label Feb 21, 2019
rajadain added a commit that referenced this issue Feb 26, 2019

Circumvent GWLF-E SegFaults due to NumPy

Connects #3089
@rajadain closed this as completed Mar 7, 2019