cmr-nlp

A service for converting natural language queries into CMR search parameters

About

This project aims to provide basic natural language processing (NLP) support for NASA Earthdata Common Metadata Repository (CMR) clients that want to offer a better user experience when making queries against the CMR Search endpoints. The initial focus is on NLP support for spatio-temporal queries.

Future focus will be on supporting collection, granule, and variable identification from natural language queries.

Dependencies

  • Java
  • lein
  • curl (used to download English language models)
  • docker and docker-compose (used to run a local Elasticsearch cluster)

Supported versions:

cmr-nlp         Elasticsearch   Status
0.1.0-SNAPSHOT  6.5.2           In development

Usage

There are several ways in which this project may be used:

  • the NLP portion of the codebase as a library (in-memory NLP models are required)
  • the Geolocation functionality as a service (an Elasticsearch cluster, local or otherwise, is required)
  • both NLP and Geolocation running as a service (no in-memory models; an Elasticsearch cluster is required)

Each approach requires slightly different setup.

Setup

In-Memory Models

If you are running just the NLP portion of the code as a library, the required OpenNLP models must be available to the JVM on the classpath. In a cloned cmr-nlp directory, this is as easy as running:

$ lein download-models

This executes the script resources/scripts/download-models, which may be adapted for use in your own project.
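
For reference, here is a minimal sketch of loading one of the downloaded OpenNLP models directly from the classpath via Java interop. The resource path "models/en-token.bin" is illustrative only; adjust it to wherever the download script places the models in your project:

(ns example.models
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.tokenize TokenizerME TokenizerModel)))

;; Build a tokenizer from a model file found on the classpath.
;; NOTE: the resource name below is an assumption -- use whatever
;; path your copy of the download script writes the model to.
(defn classpath-tokenizer
  []
  (with-open [in (io/input-stream (io/resource "models/en-token.bin"))]
    (TokenizerME. (TokenizerModel. in))))

;; Example usage:
;; (seq (.tokenize (classpath-tokenizer) "Lake Superior last week"))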

Elasticsearch

Starting up a local Elasticsearch+Kibana cluster is as simple as:

$ lein start-es

Note that this utilizes docker-compose under the hood.

Once started, the Kibana interface for Elasticsearch will be available locally (by default at http://localhost:5601).
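
To confirm the cluster itself is responding, a quick sanity check from any Clojure REPL (this assumes the compose setup exposes Elasticsearch on the default port 9200):

;; Returns a JSON string describing cluster health (e.g. status "yellow").
(slurp "http://localhost:9200/_cluster/health")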

OpenNLP Elasticsearch Ingest

TBD

Geonames Elasticsearch Ingest

Before ingesting Geonames data, you need to:

  1. Start your Elasticsearch cluster (see above), and
  2. Download the Geonames gazetteer files locally:
$ lein download-geonames

Note that this will also unzip the two compressed files that get downloaded:

  • allCountries.zip (340MB) uncompresses to 1.4GB
  • shapes_all_low.zip (1MB) uncompresses to 3.1MB

With that done, you're ready to ingest the Geonames files into Elasticsearch:

$ lein ingest
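
Once the ingest finishes, you can verify that the Geonames data actually landed in Elasticsearch by listing the indices and their document counts (again assuming port 9200; the index names are whatever the ingest code creates):

;; Prints a table of indices with document counts and sizes.
(println (slurp "http://localhost:9200/_cat/indices?v"))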

NLP Library

Start up a REPL, require the core namespace, and define a test query:

$ lein repl
(require '[cmr.nlp.core :as nlp])
(def query "What was the average surface temperature of Lake Superior last week?")

Tokenize:

[cmr.nlp.repl] λ=> (def tokens (nlp/tokenize query))
[cmr.nlp.repl] λ=> tokens
["What"
 "was"
 "the"
 "average"
 "surface"
 "temperature"
 "of"
 "Lake"
 "Superior"
 "last"
 "week"
 "?"]

Tag the parts of speech:

[cmr.nlp.repl] λ=> (def pos (nlp/tag-pos tokens))
[cmr.nlp.repl] λ=> pos
(["What" "WP"]
 ["was" "VBD"]
 ["the" "DT"]
 ["average" "JJ"]
 ["surface" "NN"]
 ["temperature" "NN"]
 ["of" "IN"]
 ["Lake" "NNP"]
 ["Superior" "NNP"]
 ["last" "JJ"]
 ["week" "NN"]
 ["?" "."])

Get chunked phrases:

[cmr.nlp.repl] λ=> (nlp/chunk pos)
({:phrase ["What"] :tag "NP"}
 {:phrase ["was"] :tag "VP"}
 {:phrase ["the" "average" "surface" "temperature"] :tag "NP"}
 {:phrase ["of"] :tag "PP"}
 {:phrase ["Lake" "Superior"] :tag "NP"}
 {:phrase ["last" "week"] :tag "NP"})

Find locations:

[cmr.nlp.repl] λ=> (nlp/find-locations tokens)
("Lake Superior")

Find dates:

[cmr.nlp.repl] λ=> (nlp/find-dates tokens)
("last week")

Get actual dates from English sentences:

[cmr.nlp.repl] λ=> (nlp/extract-dates query)
(#inst "2018-11-27T21:40:12.946-00:00")

The result is a collection because a query may contain more than one date (e.g., when indicating a range):

[cmr.nlp.repl] λ=> (def query2 "What was the average high temp between last year and two years ago?")
[cmr.nlp.repl] λ=> (nlp/extract-dates query2)
(#inst "2017-12-04T21:42:42.874-00:00"
 #inst "2016-12-04T21:42:42.878-00:00")

Create a CMR temporal parameter query string from a natural language sentence:

[cmr.nlp.repl] λ=> (require '[cmr.nlp.query :as query])
[cmr.nlp.repl] λ=> (query/->cmr-temporal {:query query2})
{:query "What was the average high temp between last year and two years ago?"
 :temporal "temporal%5B%5D=2016-12-12T13%3A58%3A05Z%2C2017-12-12T13%3A58%3A05Z"}

Which, when URL-decoded, gives us:

"temporal[]=2016-12-05T12:21:32Z,2017-12-05T12:21:32Z"

NLP via Elasticsearch

TBD

Geolocation via Elasticsearch

TBD

License

Copyright © 2018 NASA

Distributed under the Apache License, Version 2.0.
