cmr-nlp

A service for converting natural language queries into CMR search parameters

About

This project aims to provide basic natural language processing (NLP) support for NASA Earthdata Common Metadata Repository (CMR) clients that want to offer a better user experience when making queries against the CMR Search endpoints. The initial focus is on NLP support for spatio-temporal queries.

Future focus will be on supporting collection, granule, and variable identification from natural language queries.

Dependencies

  • Java
  • lein
  • curl (used to download English language models)
  • docker and docker-compose (used to run a local Elasticsearch cluster)

Supported versions:

cmr-nlp         Elasticsearch   Status
0.1.0-SNAPSHOT  6.5.2           In development

Usage

There are several ways in which this project may be used:

  • the NLP portion of the codebase as a library (in-memory NLP models are required)
  • the Geolocation functionality as a service (an Elasticsearch cluster, local or otherwise, is required)
  • both NLP and Geolocation running as a service (no in-memory models; an Elasticsearch cluster is required)

Each approach requires slightly different setup.

Setup

In-Memory Models

If you are running just the NLP portion of the code as a library, the required OpenNLP models must be available to the JVM on the classpath. In a cloned cmr-nlp directory, this is as easy as running:

$ lein download-models

This executes the script resources/scripts/download-models, which may be adapted for use in your own project.
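
For reference, here is a minimal sketch of loading one of the downloaded OpenNLP models directly from the classpath via Java interop. The resource path "models/en-token.bin" is illustrative only; adjust it to wherever the download script places the models in your project:

(ns example.models
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.tokenize TokenizerME TokenizerModel)))

;; Build a tokenizer from a model file found on the classpath.
;; NOTE: the resource name below is an assumption -- use whatever
;; path your copy of the download script writes the model to.
(defn classpath-tokenizer
  []
  (with-open [in (io/input-stream (io/resource "models/en-token.bin"))]
    (TokenizerME. (TokenizerModel. in))))

;; Example usage:
;; (seq (.tokenize (classpath-tokenizer) "Lake Superior last week"))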

Elasticsearch

Starting up a local Elasticsearch+Kibana cluster is as simple as:

$ lein start-es

Note that this utilizes docker-compose under the hood.

Once started, the Kibana interface for Elasticsearch will be available locally (by default at http://localhost:5601).
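
To confirm the cluster itself is responding, a quick sanity check from any Clojure REPL (this assumes the compose setup exposes Elasticsearch on the default port 9200):

;; Returns a JSON string describing cluster health (e.g. status "yellow").
(slurp "http://localhost:9200/_cluster/health")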

OpenNLP Elasticsearch Ingest

TBD

Geonames Elasticsearch Ingest

Before ingesting Geonames data, you need to:

  1. Start your Elasticsearch cluster (see above), and
  2. Download the Geonames gazetteer files locally:
$ lein download-geonames

Note that this will also unzip the two compressed files that get downloaded:

  • allCountries.zip (340MB) uncompresses to 1.4GB
  • shapes_all_low.zip (1MB) uncompresses to 3.1MB

With that done, you're ready to ingest the Geonames files into Elasticsearch:

$ lein ingest
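
Once the ingest finishes, you can verify that the Geonames data actually landed in Elasticsearch by listing the indices and their document counts (again assuming port 9200; the index names are whatever the ingest code creates):

;; Prints a table of indices with document counts and sizes.
(println (slurp "http://localhost:9200/_cat/indices?v"))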

NLP Library

Start up a REPL, require the core namespace, and define a test query:

$ lein repl
(require '[cmr.nlp.core :as nlp])
(def query "What was the average surface temperature of Lake Superior last week?")

Tokenize:

[cmr.nlp.repl] λ=> (def tokens (nlp/tokenize query))
[cmr.nlp.repl] λ=> tokens
["What"
 "was"
 "the"
 "average"
 "surface"
 "temperature"
 "of"
 "Lake"
 "Superior"
 "last"
 "week"
 "?"]

Tag the parts of speech:

[cmr.nlp.repl] λ=> (def pos (nlp/tag-pos tokens))
[cmr.nlp.repl] λ=> pos
(["What" "WP"]
 ["was" "VBD"]
 ["the" "DT"]
 ["average" "JJ"]
 ["surface" "NN"]
 ["temperature" "NN"]
 ["of" "IN"]
 ["Lake" "NNP"]
 ["Superior" "NNP"]
 ["last" "JJ"]
 ["week" "NN"]
 ["?" "."])

Get chunked phrases:

[cmr.nlp.repl] λ=> (nlp/chunk pos)
({:phrase ["What"] :tag "NP"}
 {:phrase ["was"] :tag "VP"}
 {:phrase ["the" "average" "surface" "temperature"] :tag "NP"}
 {:phrase ["of"] :tag "PP"}
 {:phrase ["Lake" "Superior"] :tag "NP"}
 {:phrase ["last" "week"] :tag "NP"})

Find locations:

[cmr.nlp.repl] λ=> (nlp/find-locations tokens)
("Lake Superior")

Find dates:

[cmr.nlp.repl] λ=> (nlp/find-dates tokens)
("last week")

Get actual dates from English sentences:

[cmr.nlp.repl] λ=> (nlp/extract-dates query)
(#inst "2018-11-27T21:40:12.946-00:00")

The result is a collection because a query may contain more than one date (e.g., when indicating a range):

[cmr.nlp.repl] λ=> (def query2 "What was the average high temp between last year and two years ago?")
[cmr.nlp.repl] λ=> (nlp/extract-dates query2)
(#inst "2017-12-04T21:42:42.874-00:00"
 #inst "2016-12-04T21:42:42.878-00:00")

Create a CMR temporal parameter query string from a natural language sentence:

[cmr.nlp.repl] λ=> (require '[cmr.nlp.query :as query])
[cmr.nlp.repl] λ=> (query/->cmr-temporal {:query query2})
{:query "What was the average high temp between last year and two years ago?"
 :temporal "temporal%5B%5D=2016-12-12T13%3A58%3A05Z%2C2017-12-12T13%3A58%3A05Z"}

Which, when URL-decoded, gives us:

"temporal[]=2016-12-05T12:21:32Z,2017-12-05T12:21:32Z"

NLP via Elasticsearch

TBD

Geolocation via Elasticsearch

TBD

License

Copyright © 2018 NASA

Distributed under the Apache License, Version 2.0.
