Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gdal_config option doesn't work for threaded readers #186

Open
vincentsarago opened this issue Jan 5, 2021 · 4 comments
Open

gdal_config option doesn't work for threaded readers #186

vincentsarago opened this issue Jan 5, 2021 · 4 comments
Labels
bug Something isn't working GDAL/Rasterio help wanted Extra attention is needed

Comments

@vincentsarago
Copy link
Member

vincentsarago commented Jan 5, 2021

We already tried to solve this using apiroute_factory
https://developmentseed.org/titiler/concepts/APIRoute_and_environment_variables/

from fastapi import FastAPI, APIRouter
from rasterio._env import get_gdal_config
from titiler.custom.routing import apiroute_factory
from titiler.endpoints.factory import TilerFactory

app = FastAPI()
route_class = apiroute_factory({"GDAL_DISABLE_READDIR_ON_OPEN": "FALSE"})
router = APIRouter(route_class=route_class)

tiler = TilerFactory(router=router)

While this approach worked for a while, it didn't seems to work in python 3.8 and felt quite of a hack.

We then switched and added gdal_config options in the factory (ref #170). I though this worked ... but I was wrong.

In FastAPI, when you define a function with a simple def myfunction(..., fastapi will use starlette's run_in_threadpool to run the function (so the API isn't blocked). This makes the function to be run in a thread ... which isn't the MainThread, and this is important. In Rasterio, when using with rasterio.Env( block, rasterio will check if we are running it in the MainThread or not (https://github.com/mapbox/rasterio/blob/master/rasterio/_env.pyx#L158-L161) and then use different function to set the GDAL config.

The problem here seems that because we use simple def we run the endpoint function in a sub Thread (ThreadPoolExecutor-X_X) and then the env cannot be forwarded to threads within the threads... (yes threads within threads seems bad anyway).

It's kinda hard for me to explain so here is an example

from concurrent import futures
import rasterio
from rasterio._env import get_gdal_config
import threading

from starlette.concurrency import run_in_threadpool

def f(r=None):
    return get_gdal_config("GDAL_DISABLE_READDIR_ON_OPEN"), str(threading.current_thread())


print()
print("1 - simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
    print(f())

print()
print("2 - async simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
    print(await run_in_threadpool(f))

def g():
    print("Where am I: " + str(threading.current_thread())) # print what thread is used when calling rasterio.Env
    with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
        with futures.ThreadPoolExecutor() as executor:
            return list(executor.map(f, range(1)))[0]

print()
print("3 - simple multi threads - OK")
print(g())

print()
print("4 - async multithread - NOK") 
print(await run_in_threadpool(g))
1 - simple - OK
('FALSE', '<_MainThread(MainThread, started 4487112128)>')

2 - async simple - OK
('FALSE', '<Thread(ThreadPoolExecutor-0_13, started daemon 123145455685632)>')

3 - simple multi threads - OK
Where am I: <_MainThread(MainThread, started 4487112128)>
('FALSE', '<Thread(ThreadPoolExecutor-23_0, started daemon 123145492471808)>')

4 - async multithread - NOK
Where am I: <Thread(ThreadPoolExecutor-0_4, started daemon 123145408389120)>
('EMPTY_DIR', '<Thread(ThreadPoolExecutor-24_0, started daemon 123145492471808)>')

☝️ where we use run_in_threadpool we simulate the actual setting in titiler (def), fastAPI doesn't use run_in_threadpool when using async def.

Fix ?

  1. use async def for functions definition

If you are using a third party library that communicates with something (a database, an API, the file system, etc) and doesn't have support for using await, (this is currently the case for most database libraries), then declare your path operation functions as normally, with just def, like:

FastAPI docs seems to say we shouldn't (and I think @geospatial-jeff told the same in the past)

  1. see if we can change rasterio (🙅)

Not an expert here, but I guess there is a good reason to use CPLSetConfigOption only in MainThread (cc @sgillies, sorry for the ping but it's only a FYI)

ref: rasterio/rasterio#1012 & rasterio/rasterio#997

  1. change rio-tiler reader to forward gdal config 🤷‍♂️

Seems the safest choice 😭

@vincentsarago vincentsarago added the bug Something isn't working label Jan 5, 2021
@vincentsarago
Copy link
Member Author

possible solution:

do not use rasterio.Env in titiler but use something like

from contextlib import contextmanager

@contextmanager
def Env(**env):
    """Temporarily set environment variables inside the context manager and
    fully restore previous environment afterwards

    from: https://gist.github.com/igniteflow/7267431#gistcomment-2553451
    """
    env = env or {}

    original_env = {key: os.getenv(key) for key in env}
    os.environ.update(env)
    try:
        yield
    finally:
        for key, value in original_env.items():
            if value is None:
                del os.environ[key]
            else:
                os.environ[key] = value

taken from https://gist.github.com/igniteflow/7267431#gistcomment-2553451#185

when entering with Env(self.gdal_config), this will update the environment variable and restore on exist. Is this safe? 🤷‍♂️

@vincentsarago
Copy link
Member Author

Just to be clear the problem is because we are setting env withing threads (unintentionally).

def f():
    return get_gdal_config("ENV")

def g():
    with rasterio.Env(ENV="FALSE"):
        with futures.ThreadPoolExecutor() as executor:
            r = executor.submit(f)
            return r.result()

print(g())
>> "FALSE"

with futures.ThreadPoolExecutor() as executor:
    r = executor.submit(g)
    print(r.result())

>> None

@geospatial-jeff
Copy link
Contributor

geospatial-jeff commented Jan 6, 2021

(yes threads within threads seems bad anyway)

I don't think you can do this. Really what's happening is there are two thread pool executors. The first is used by FastAPI to execute the def route handlers (this is the event loop's default executor). The second is instantiated by rio-tiler in the tasks module. It's not quite thread within a thread but I think you are still right that the problem is because the thread is not the main thread which is true in this case regardless.

I think another good solution is to stop using rasterio.Env in favor of standard environment variables (I'm pretty sure this works). I know this is less koshur but I think @vincentsarago you are correct, if the problem is because rasterio.Env doesn't span threads with how the code is executed by FastAPI, we are left with either a hacky and potentially unsafe solution, hamstringing the performance of the app by switching to async def, or removing rasterio.Env entirely (unless we can change rasterio). Injecting rasterio.Env into rio-tiler seems like the best of the worst options but its quite contrived.

@vincentsarago
Copy link
Member Author

I think another good solution is to stop using rasterio.Env in favor of standard environment variables (I'm pretty sure this works).

Yes this will work but won't allow us to deploy application with different supports, e.g. :

  • NAIP: we need to set {"AWS_REQUEST_PAYER": "requester"},
  • Landsat 8 (aws) we need to set {"GDAL_DISABLE_READDIR_ON_OPEN": "FALSE"} so we can use external overviews.

Ideally, yes global env variable would be enought but won't give flexibility 🤷‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working GDAL/Rasterio help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants