Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential NGINX static file permission errors #150

Merged
merged 3 commits into from Jun 11, 2019

Conversation

djanderson
Copy link
Contributor

@djanderson djanderson commented Jun 7, 2019

This PR fixes 2 issues:

  • The NGINX container mounts static files as a volume with the user:group permissions of the user on the host system, then tries to read them with user "nginx" inside the container. This PR adds a Dockerfile-nginx file to give more flexibility over the nginx container and then copies static files into the container instead of mounting. This solves the permissions issue.
  • I found that when API is running a very intensive task (like a long database write), the api container could be marked "unhealthy" and restarted even though the healthcheck test test ! -e /sdr_unhealthy && test ! -e /scheduler_dead was passing. The issue was that a timeout of 1s was used, meaning that if the healthcheck didn't complete in 1s, the container was marked unhealthy. Since the test isn't actually running a time-sensitive command, this PR bumps the timeout to 3s. This fixed the cycling of otherwise healthy api container in my tests.

A few static files were also updated.

@jhazentia
Copy link
Member

Thanks for looking at this. I have also been trying to solve these issues, and will continue to look at it next week. I tried your master branch, but still seem to be experiencing the same problems. Whenever I run the survey_700_band_iq action (on my system with MOCK_RADIO=1 and MOCK_RADIO_RANDOM=1), I can see the log messages for 705, 709, 731.5, and 739 MHz acquisitions. Then appears that the worker dies and a new worker is created. I also tried disabling the healthcheck and removing the autoheal container, but the same problem still occurs. Log messages are below:

[2019-06-07 20:56:43,022] [DEBUG] no times remaining in test7_survey_700_band_iq, removing
api_1    | [2019-06-07 20:56:43,043] [DEBUG] running task test7_survey_700_band_iq/1
api_1    | [2019-06-07 20:56:43,046] [WARNING] Using mock USRP.
api_1    | [2019-06-07 20:56:43,047] [DEBUG] set USRP clock rate: 30.72 MHz
api_1    | [2019-06-07 20:56:43,048] [DEBUG] set USRP sample rate: 7.68 MS/s
api_1    | [2019-06-07 20:56:43,048] [DEBUG] MockTuneResult(actual_dsp_freq=0, actual_rf_freq=700500000.0)
api_1    | [2019-06-07 20:56:43,048] [DEBUG] set USRP gain: 40.0 dB
api_1    | [2019-06-07 20:56:43,048] [DEBUG] Using power scale factor: -43.99347697320405
api_1    | [2019-06-07 20:56:43,048] [DEBUG] Using linear scale factor: 0.006314313658723966
api_1    | [2019-06-07 20:56:43,657] [DEBUG] Successfully acquired 7680000 samples.
api_1    | [2019-06-07 20:56:43,708] [DEBUG] Storing acquisition in database
api_1    | [2019-06-07 20:56:48,830] [DEBUG] set USRP clock rate: 15.36 MHz
api_1    | [2019-06-07 20:56:48,830] [DEBUG] set USRP sample rate: 15.36 MS/s
api_1    | [2019-06-07 20:56:48,830] [DEBUG] MockTuneResult(actual_dsp_freq=0, actual_rf_freq=709000000.0)
api_1    | [2019-06-07 20:56:48,831] [DEBUG] set USRP gain: 40.0 dB
api_1    | [2019-06-07 20:56:48,831] [DEBUG] Using power scale factor: -44.01636882259697
api_1    | [2019-06-07 20:56:48,831] [DEBUG] Using linear scale factor: 0.0062976940592277385
api_1    | [2019-06-07 20:56:50,012] [DEBUG] Successfully acquired 15360000 samples.
api_1    | [2019-06-07 20:56:50,119] [DEBUG] Storing acquisition in database
api_1    | [2019-06-07 20:57:04,608] [DEBUG] set USRP clock rate: 29.44 MHz
api_1    | [2019-06-07 20:57:04,649] [DEBUG] set USRP sample rate: 7.36 MS/s
api_1    | [2019-06-07 20:57:04,651] [DEBUG] MockTuneResult(actual_dsp_freq=0, actual_rf_freq=731500000.0)
api_1    | [2019-06-07 20:57:04,653] [DEBUG] set USRP gain: 40.0 dB
api_1    | [2019-06-07 20:57:04,658] [DEBUG] Using power scale factor: -44.01537558834967
api_1    | [2019-06-07 20:57:04,659] [DEBUG] Using linear scale factor: 0.00629841424372563
api_1    | [2019-06-07 20:57:05,344] [DEBUG] Successfully acquired 7360000 samples.
api_1    | [2019-06-07 20:57:05,431] [DEBUG] Storing acquisition in database
api_1    | [2019-06-07 20:57:13,142] [DEBUG] set USRP clock rate: 15.36 MHz
api_1    | [2019-06-07 20:57:13,142] [DEBUG] set USRP sample rate: 15.36 MS/s
api_1    | [2019-06-07 20:57:13,142] [DEBUG] MockTuneResult(actual_dsp_freq=0, actual_rf_freq=739000000.0)
api_1    | [2019-06-07 20:57:13,143] [DEBUG] set USRP gain: 40.0 dB
api_1    | [2019-06-07 20:57:13,143] [DEBUG] Using power scale factor: -43.985365605482556
api_1    | [2019-06-07 20:57:13,143] [DEBUG] Using linear scale factor: 0.00632021307082295
api_1    | [2019-06-07 20:57:14,404] [DEBUG] Successfully acquired 15360000 samples.
api_1    | [2019-06-07 20:57:14,518] [DEBUG] Storing acquisition in database
api_1    | [2019-06-07 20:57:20 +0000] [35] [INFO] Booting worker with pid: 35

@jhazentia
Copy link
Member

I also continue to get the below error with the current scos-sensor master branch and your master branch.
[Errno 2] No such file or directory: '/src/static/js'
However, if I comment out the STATICFILES_DIRS, it seems to work:

STATICFILES_DIRS = [
    # ("js", path.join(STATIC_ROOT, "js")),
    # ("css", path.join(STATIC_ROOT, "css")),
    # ("images", path.join(STATIC_ROOT, "images")),
    # ("fonts", path.join(STATIC_ROOT, "fonts")),
]

@djanderson
Copy link
Contributor Author

Regarding static files, I think it's actually preferable that the api container throws an error when trying to serve static files, because we really don't want it doing that job. The NGINX container should definitely be handling that, and what I've found is that if NGINX is unable to serve the static files it passes the request through to the API container. We should definitely fix the static file serve issue at NGINX, not at the API container.

First, after you switched branches, did you remove and rebuild the nginx container? docker-compose down -> docker-compose up --build.

Then, if that still doesn't seem to work, can you let me know what the issue is according to the nginx container logs? docker-compose logs -f nginx.

@jhazentia
Copy link
Member

Removing and rebuilding the container didn't work for me. I don't get that error for the first http request, it is occurs on the second http request. For example https://localhost/api/v1/ will work, then I click to go to https://localhost/api/v1/capabilities/, then the error occurs. Here is the nginx log.

nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /api/v1/ HTTP/1.1" 200 10471 "https://localhost/api/v1/schedule/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/css/bootstrap.min.css HTTP/1.1" 200 121457 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/ajax-form.js HTTP/1.1" 200 3597 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/css/bootstrap-tweaks.css HTTP/1.1" 200 3385 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/css/default.css HTTP/1.1" 200 1131 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/css/prettify.css HTTP/1.1" 200 817 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/jquery-3.3.1.min.js HTTP/1.1" 200 86927 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/csrf.js HTTP/1.1" 200 1719 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/bootstrap.min.js HTTP/1.1" 200 37608 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/prettify-min.js HTTP/1.1" 200 13632 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/js/default.js HTTP/1.1" 200 1268 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:31 +0000] "GET /static/rest_framework/img/grid.png HTTP/1.1" 200 1458 "https://localhost/static/rest_framework/css/bootstrap-tweaks.css" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1  | 192.168.32.1 - - [07/Jun/2019:21:45:39 +0000] "GET /api/v1/capabilities/ HTTP/1.1" 500 93110 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"

@dboulware
Copy link
Contributor

@jhazentia When I ran the iq survey it was dying because postgres was running out of memory. It seems I am able to avoid the issue if I reduce the number of frequencies in the survey action.

@djanderson
Copy link
Contributor Author

https://docs.gunicorn.org/en/stable/settings.html#timeout

Workers silent for more than this many seconds are killed and restarted.

We're using a gthread worker in a single process. Some research needs to be done to see what gunicorn considers "silent". It would be interesting to see if the worker is being killed when the iq survey has been running for about 30 seconds.

@djanderson
Copy link
Contributor Author

djanderson commented Jun 7, 2019

From the gunicorn repo:

Most probably one of your worker is doing a blocking operation preventing the gunicorn worker to notify the arbiter. If you have a long operation, make sure to trigger the geven scheduler from time to time by sleeping or such.

Looks like during any task that runs more than 30 seconds (the default timeout) we'll need to explicitly cede (time.sleep(0)?) to make sure the gthread arbiter runs.

First step is to optimize database operations, because iq_survey should nominally not take 30 seconds to complete. See #149.

@MichaelCotton
Copy link
Contributor

Would we avoid error if individual acquire_xxx tasks were scheduled separately, e.g., acquire_700MHz_ATT_DL_iq , acquire_700MHz_ATT_UL_iq, ... instead of survey_700MHz_band_iq?

@djanderson
Copy link
Contributor Author

@MichaelCotton I wouldn't recommend that approach because it defeats the idea that an action can basically do anything. Instead I'd recommend (in order)

  1. Increase timeout - In the short term, so that you can actually get the data for testing while someone works on the fixing the issue, simply increase the gunicorn timeout settings (add it to https://github.com/NTIA/scos-sensor/blob/master/gunicorn/config.py)
  2. Optimize database transactions - as I've outlined above, there's definitely an issue with either the way the database is laid out, or the queries, or both. This should be done regardless of whether it ultimately fixes the gunicorn worker issue or not, but it's also likely to help that issue a lot.
  3. Add time.sleep(0) at the end of each loop in the survey task - sleeping for "0" seconds seems useless but what this essentially does is signals the python interpreter that now is a good time to let any other waiting threads run. This was suggested by the author or gunicorn.

@jhazentia
Copy link
Member

@djanderson I am still getting error for missing src/static/js with the latest change. It seems the debug toolbar is always looking for the STATICFILES_DIRS.
Here is the api log:

api_1       | [2019-06-10 19:37:30 +0000] [21] [INFO] Listening at: http://0.0.0.0:8000 (21)
api_1       | [2019-06-10 19:37:30 +0000] [21] [INFO] Using worker: gthread
api_1       | [2019-06-10 19:37:30 +0000] [24] [INFO] Booting worker with pid: 24
api_1       | [2019-06-10 19:37:30 +0000] [21] [DEBUG] 1 workers
api_1       | [2019-06-10 19:39:22 +0000] [24] [DEBUG] GET /api/v1
api_1       | [2019-06-10 19:39:22 +0000] [24] [DEBUG] Closing connection.
api_1       | [2019-06-10 19:39:22 +0000] [24] [DEBUG] GET /api/v1/
api_1       | [2019-06-10 19:39:22 +0000] [24] [DEBUG] Closing connection.
api_1       | [2019-06-10 19:39:28 +0000] [24] [DEBUG] GET /api/v1/capabilities/
api_1       | Internal Server Error: /api/v1/capabilities/
api_1       | Traceback (most recent call last):
api_1       |   File "/usr/local/lib/python3.6/dist-packages/django/core/handlers/exception.py", line 34, in inner
api_1       |     response = get_response(request)
api_1       |   File "/usr/local/lib/python3.6/dist-packages/django/utils/deprecation.py", line 96, in __call__
api_1       |     response = self.process_response(request, response)
api_1       |   File "/usr/local/lib/python3.6/dist-packages/debug_toolbar/middleware.py", line 133, in process_response
api_1       |     panel.generate_stats(request, response)
api_1       |   File "/usr/local/lib/python3.6/dist-packages/debug_toolbar/panels/staticfiles.py", line 136, in generate_stats
api_1       |     "staticfiles_finders": self.get_staticfiles_finders(),
api_1       |   File "/usr/local/lib/python3.6/dist-packages/debug_toolbar/panels/staticfiles.py", line 148, in get_staticfiles_finders
api_1       |     for path, finder_storage in finder.list([]):
api_1       |   File "/usr/local/lib/python3.6/dist-packages/django/contrib/staticfiles/finders.py", line 131, in list
api_1       |     for path in utils.get_files(storage, ignore_patterns):
api_1       |   File "/usr/local/lib/python3.6/dist-packages/django/contrib/staticfiles/utils.py", line 23, in get_files
api_1       |     directories, files = storage.listdir(location)
api_1       |   File "/usr/local/lib/python3.6/dist-packages/django/core/files/storage.py", line 315, in listdir
api_1       |     for entry in os.scandir(path):
api_1       | FileNotFoundError: [Errno 2] No such file or directory: '/src/static/js'
api_1       | [2019-06-10 19:39:28 +0000] [24] [DEBUG] Closing connection.

Here is the nginx log:

Attaching to scos-sensor_nginx_1
nginx_1     | 2019/06/10 19:37:12 [emerg] 1#1: host not found in upstream "api:8000" in /etc/nginx/conf.d/default.conf:7
nginx_1     | nginx: [emerg] host not found in upstream "api:8000" in /etc/nginx/conf.d/default.conf:7
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /api/v1 HTTP/1.1" 301 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /api/v1/ HTTP/1.1" 200 10471 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/css/bootstrap.min.css HTTP/1.1" 200 121457 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/csrf.js HTTP/1.1" 200 1719 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/default.js HTTP/1.1" 200 1268 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/css/bootstrap-tweaks.css HTTP/1.1" 200 3385 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/css/prettify.css HTTP/1.1" 200 817 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/css/default.css HTTP/1.1" 200 1131 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/jquery-3.3.1.min.js HTTP/1.1" 200 86927 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/ajax-form.js HTTP/1.1" 200 3597 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/bootstrap.min.js HTTP/1.1" 200 37608 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/js/prettify-min.js HTTP/1.1" 200 13632 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/img/grid.png HTTP/1.1" 200 1458 "https://localhost/static/rest_framework/css/bootstrap-tweaks.css" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:22 +0000] "GET /static/rest_framework/css/bootstrap.min.css.map HTTP/1.1" 200 540433 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"
nginx_1     | 172.30.0.1 - - [10/Jun/2019:19:39:28 +0000] "GET /api/v1/capabilities/ HTTP/1.1" 500 93109 "https://localhost/api/v1/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36" "-"

Note that the debug toolbar does work if I comment out the entries in STATICFILES_DIRS and run python manage.py collectstatic.

@jhazentia
Copy link
Member

Merging these changes into master

@jhazentia jhazentia merged commit 1686a29 into NTIA:master Jun 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants