preston-scripts

Import and conversion scripts related to Preston data. The scripts are intended to provide examples of how to use Preston in combination with GUODA's cluster (e.g., HDFS, Apache Spark, Mesos).

Includes scripts to:

  1. HDFS Import - import Preston archive into HDFS
  2. Preston to DwC-A - convert Preston DwC-A archives into sequence files and Parquet files.
  3. Create Taxonomic Checklist - use converted Preston DwC-A archives to generate taxonomic checklists given specified taxon and geospatial constraints.

Please submit any issues you may have at https://github.com/bio-guoda/guoda-services/issues/.

HDFS Import

The Hadoop Distributed File System (HDFS) is a widely used distributed filesystem designed for parallel processing. Originally built for Hadoop MapReduce, it is now also used with processing engines such as Apache Spark.

preston2hdfs.sh is a script to help migrate a Preston instance into HDFS. It is a work in progress, so please read the script before you use it.

To use:

  1. Start a terminal via https://jupyter.idigbio.org
  2. Clone this repository git clone https://github.com/bio-guoda/preston-scripts
  3. cd preston-scripts
  4. Inspect ./preston2hdfs.sh and change settings when needed.
  5. By default, the preston2hdfs.sh script uses an example Preston instance, https://github.com/bio-guoda/preston-amazon , as the Preston remote and /user/[your username]/guoda/data/source=preston-amazon/ as the HDFS target.
  6. Run ./preston2hdfs.sh to migrate the Preston remote to the specified HDFS target.
  7. Inspect the HDFS target and the work directory preston2hdfs.tmp for results.
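As a quick sanity check after the import, you can list the HDFS target. A minimal sketch, assuming the default target path from the script and that the hdfs client is available in your Jupyter terminal:

$ hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/
$ hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/data

The import should have created data and prov sub-directories under the target.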

Preston to DwC-A

Now that the Preston data has been moved into HDFS, we can use idigbio-spark to convert DwC-A files in the Preston data to formats like Parquet and sequence files. This can be done using an interactive Spark shell (spark-shell or pyspark), or by using spark-submit.

Preston to DwC-A using dwca2parquet.sh

  1. Repeat the first three steps of the previous recipe (start a terminal, clone this repository, and cd into it).
  2. Type hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/
  3. Confirm that the data and prov folders exist and have sub-directories.
  4. Inspect ./dwca2parquet.sh
  5. Run ./dwca2parquet.sh with appropriate settings. By default, it uses /user/[your username]/guoda/data/source=preston-amazon/data as the input and /user/[your username]/guoda/data/source=preston-amazon/dwca as the output.
  6. Once the job is done, inspect the HDFS output directory at /user/[your username]/guoda/data/source=preston-amazon/dwca for results, as sketched below.
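A minimal sketch of such an inspection, assuming the default output path; core.parquet is the output directory referenced in the spark-shell recipe below:

$ hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/dwca
$ hdfs dfs -ls /user/[your username]/guoda/data/source=preston-amazon/dwca/core.parquet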

Preston to DwC-A using Spark Shell

This is similar to the previous recipe, only instead of using the spark-job-submit.sh script, do the following:

  1. Start a jupyter terminal via https://jupyter.idigbio.org
  2. Download the assembly jar from https://github.com/bio-guoda/idigbio-spark/releases/download/0.0.1/iDigBio-LD-assembly-1.5.9.jar
  3. Start a spark-shell using spark-shell --conf spark.sql.caseSensitive=true --jars iDigBio-LD-assembly-1.5.9.jar
  4. Now, run the following in the spark-shell:
import bio.guoda.preston.spark.PrestonUtil
implicit val sparky = spark // expose the active Spark session as an implicit value
// convert the DwC-A content under the data directory into the dwca output directory
PrestonUtil.main(Array("hdfs:///guoda/data/source=preston-amazon/data", "hdfs:///guoda/data/source=preston-amazon/dwca"))
  5. After the job is done, confirm that
val data = spark.read.parquet("/guoda/data/source=preston-amazon/dwca/core.parquet") // replace with suitable target directory
data.count

results in a non-zero count after replacing the HDFS paths with your desired input and output paths.

Note that you can also run the spark-shell locally on your machine and point the paths at a local filesystem using file:/// URLs.
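For example, a minimal sketch of such a local run, using a spark-shell started with the same --jars option as above; the file:/// paths are hypothetical placeholders for a local copy of a Preston instance and a local output directory:

import bio.guoda.preston.spark.PrestonUtil
implicit val sparky = spark
// hypothetical local input (a Preston data directory) and local output location
PrestonUtil.main(Array("file:///tmp/preston-amazon/data", "file:///tmp/preston-amazon/dwca"))
val data = spark.read.parquet("file:///tmp/preston-amazon/dwca/core.parquet")
data.count // expect a non-zero count if the conversion produced records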

Also note that a similar approach can be taken using pyspark (Python) or a spark-shell that runs its executors on the cluster. See the Apache Spark documentation for more information.

Create Taxonomic Checklist

Taxonomic checklists can be generated after converting Preston DwC-A archives to Parquet files.

To generate a taxonomic checklist:

  1. Inspect ./create-checklist.sh
  2. Run ./create-checklist.sh in a jupyter.idigbio.org terminal using appropriate parameters. By default, a checklist for birds and frogs in an area covering the Amazon rainforest is created.
  3. Inspect the results in hdfs:///user/[your user name]/guoda/checklist and hdfs:///user/[your user name]/guoda/checklist-summary, or in the non-default location that you used to calculate the checklist, using:
$ hdfs dfs -ls /user/[your user name]/guoda/checklist
$ hdfs dfs -ls /user/[your user name]/guoda/checklist-summary
  4. To use checklists in Spark, start a spark-shell (or pyspark) and run commands like:
$ spark-shell
...
scala> val checklists = spark.read.parquet("/user/[your user]/guoda/checklist")
...
scala> checklists.show(10) // to show first 10 items in checklist
...

Use path /user/[your user]/guoda/checklist-summary to discover summaries of generated checklists.
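A minimal sketch of reading such a summary in the spark-shell, assuming the summary is stored as Parquet like the checklist itself:

scala> val summaries = spark.read.parquet("/user/[your user]/guoda/checklist-summary")
scala> summaries.show(10) // show the first 10 summary rows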

  5. To export checklists to CSV files, use:
$ spark-shell
scala> val checklists = spark.read.parquet("/user/[your user]/guoda/checklist")
...
scala> checklists.write.csv("/user/[your user]/my-checklist.csv")
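Note that Spark writes CSV output as a directory of part files rather than a single file. If a single CSV file with a header row is preferred, a minimal sketch (the output path is illustrative, and coalescing to one partition can be slow for large checklists):

scala> checklists.coalesce(1).write.option("header", "true").csv("/user/[your user]/my-checklist-with-header.csv")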

Funding

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
