Spark Esri

Project to demonstrate the usage of Apache Spark within a Jupyter notebook within ArcGIS Pro.

Notes

Oct 25, 2022 - Updated to support upcoming Pro 3.1. See SparkGeo2 notebook for integration with Apache Arrow :-)

Apr 12, 2022 - Running PySpark in Pro 2.9 requires the PYSPARK_PYTHON environment variable to be set. It should point to the python.exe executable of your active conda environment, e.g., C:\Users\%USERNAME%\AppData\Local\ESRI\conda\envs\spark_esri\python.exe. Defining CONDA_DEFAULT_ENV is neither sufficient and nor necesary.

Dec 16, 2021 - Added check for env var SPARK_HOME to override built-in spark. See instructions below.

Oct 30, 2021 - Pro 2.8 relies on the Windows registry to find the active conda environment. The registry key is HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv. The value of this key is used to set the required os environment variable PYSPARK_PYTHON for PySpark to work correctly in a Pro notebook.

As of this writing, the order to detect the active conda environment is as follows:

look for env var CONDA_DEFAULT_ENV.
look for %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt, in case of an older Pro version.
look for HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv.

Oct 27, 2021 - Pro 2.8.3 removed the reliance and existence of the file %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt. It now depend on env var CONDA_DEFAULT_ENV to determine the activate conda env.

~~Sep 16, 2021 - Perform the following as a patch for Pro 2.8.3~~

cd c:\
git clone https://github.com/kontext-tech/winutils

~~Define a system environment variable HADOOP_HOME with value C:\winutils\hadoop-3.3.0 and add to system variable PATH the %HADOOP_HOME%/bin value.~~

~~NOTE: This works in Pro 2.6 ONLY. There is a small "issue" with Pro 2.7 and pyarrow. The folks in Redlands have a fix that will be in 2.8 :-(~~

Installation

Install Spark (Optional).

If you do not wish to use Pro's built-in Spark, you can download and install Spark 3.x separately. For example, download spark-3.2.1-bin-hadoop3.2.tgz and set the environment variable SPARK_HOME to the folder where you extracted the archive. It's best to avoid spaces in the folder path.

Create a new Pro Conda Environment.

Start a Python Command Prompt:

Note: You might need to add proxy settings to .condarc located in C:\Program Files\ArcGIS\Pro\bin\Python.

conda config --set proxy_servers.http http://username:password@host:port
conda config --set proxy_servers.https https://username:password@host:port

The above will produce something like the below:

ssl_verify: true
proxy_servers:
  http: http://domainname\username:password@host:port
  https: http://domainname\username:password@host:port

Create a new conda environment:

proswap arcgispro-py3
conda remove --yes --all --name spark_esri
conda create --yes --name spark_esri --clone arcgispro-py3
proswap spark_esri

Optional:

pip install fsspec==2021.8.1 boto3==1.18.35 s3fs==0.4.2 pyarrow==1.0.1

conda install --yes -c esri -c conda-forge -c default^
    "numba=0.53.*"^
    "pandas=1.2.*"^
    "pyodbc=4.0.*"^
    "gcsfs=0.7.*"

Install the Esri Spark module.

Note: You might need to install Git for Windows.

git clone https://github.com/mraad/spark-esri.git
cd spark-esri
python setup.py install

Spatial Binning Notebook

MicroPathing Notebook

Please note the usage of the range slider on the map to filter the micropaths between a user defined hour of day.

Name		Name	Last commit message	Last commit date
Latest commit History 93 Commits
.idea		.idea
media		media
python		python
tools		tools
.gitignore		.gitignore
Bins.lyrx		Bins.lyrx
H3_Pandas_UDF.ipynb		H3_Pandas_UDF.ipynb
LICENSE		LICENSE
README.md		README.md
SparkGeo1.ipynb		SparkGeo1.ipynb
SparkGeo2.ipynb		SparkGeo2.ipynb
dbconnect.yml		dbconnect.yml
egg_wheel.sh		egg_wheel.sh
jn.sh		jn.sh
micro_path.ipynb		micro_path.ipynb
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py
spark_dbconnect.ipynb		spark_dbconnect.ipynb
spark_esri.ipynb		spark_esri.ipynb
spark_intel.md		spark_intel.md
spark_version.ipynb		spark_version.ipynb
taxi_trips_benford.ipynb		taxi_trips_benford.ipynb
taxi_trips_duration_error.ipynb		taxi_trips_duration_error.ipynb
taxi_trips_duration_train.ipynb		taxi_trips_duration_train.ipynb
virtual_gates.ipynb		virtual_gates.ipynb

License

mraad/spark-esri

Folders and files

Latest commit

History

Repository files navigation

Spark Esri

Notes

Installation

Install Spark (Optional).

Create a new Pro Conda Environment.

Spatial Binning Notebook

MicroPathing Notebook

Virtual Gate Crossings Notebook

Remote Execution on MS Azure Databricks Notebook

Predict Taxi Trip Durations, Map Taxi Trip Duration Errors Notebooks

TODO

References

About

Resources

License

Stars

Watchers

Forks

Languages