Merge pull request #16 from openeventdata/joss
Improve documentation, usability, and examples for JOSS
ahalterman committed Sep 9, 2016
2 parents 79bacb2 + d1264f2 commit 3ebceba
Showing 10 changed files with 736 additions and 51 deletions.
136 changes: 103 additions & 33 deletions README.md
@@ -3,67 +3,123 @@
mordecai
=========

Custom-built full text geocoding.
Custom-built full text geoparsing. Extract all the place names from a piece of
text, resolve them to the correct place, and return their coordinates and
structured geographic information.

This software was donated to the Open Event Data Alliance by Caerus Associates.
See [Releases](https://github.com/openeventdata/mordecai/releases) for the
2015-2016 production version of Mordecai.

Why Mordecai?
------------

Mordecai was developed to address several specific needs that previous text
geoparsing software did not. These specific requirements include:

- Overcoming a strong preference for US locations in existing geoparsing
software. Mordecai makes determining the country focus of the text a
separate and accurate step in the geoparsing process.
- Ease of setup and use. The system should be installable and usable by people
with only basic programming skills. Mordecai does this by running as a Docker
+ REST service, hiding the complexity of installation from end users.
- Ease of modification. This software was developed to be used primarily by
social science researchers, who tend to be much more familiar with Python
than Java. Mordecai makes the key steps in the geoparsing process (named entity
extraction, place name resolution, gazetteer lookup) exposed and easily
changed.
- Language-agnostic architecture. The only language-specific components of
Mordecai are the named entity extraction model and the word2vec model. Both
of these can be easily swapped out, giving researchers the ability to
geoparse non-English text, which is a capability that has not existed in open
source software until now.

How does it work?
-----------------

`Mordecai` accepts text and returns structured geographic information extracted
from it. It does this in several ways:

- It uses [MITIE](https://github.com/mit-nlp/MITIE) to extract placenames from
the text. In the default configuration, it uses the out-of-the-box MITIE
models, but these can be changed out for custom models when needed.
- It uses [MITIE](https://github.com/mit-nlp/MITIE) named entity recognition to
extract placenames from the text. In the default configuration, it uses the
out-of-the-box MITIE models, but these can be changed out for custom models
when needed.

- It uses [word2vec](https://code.google.com/p/word2vec/)'s models, with
[gensim](https://radimrehurek.com/gensim/)'s awesome Python wrapper, to infer
the country focus of an article given the word vectors of the article's placenames.
[gensim](https://radimrehurek.com/gensim/)'s Python implementation, to infer
the country focus of an article given the word vectors of the article's
placenames. The word2vec vectors of all the place names extracted from the
text are averaged, and this average vector is compared to the vectors for all
country names. The closest country is used as the focus country of the piece of
text.

- It uses a country-filtered search of the [geonames](http://www.geonames.org/)
gazetteer in [Elasticsearch](https://www.elastic.co/products/elasticsearch)
(with some custom logic) to find the lat/lon for each place mentioned in the
text.
(with some custom logic) to find the latitude and longitude for each place
mentioned in the text.

It runs as a Flask-RESTful service.
It runs as a Flask-RESTful service inside a Docker container.
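
The country-focus step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made up (real word2vec vectors have hundreds of dimensions), cosine similarity is assumed as the measure of "closest", and the function names are hypothetical rather than Mordecai's own.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed closeness measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def infer_country(place_vectors, country_vectors):
    """Return the country code whose vector is closest to the mean place vector."""
    mean_vec = average_vector(place_vectors)
    return max(
        country_vectors,
        key=lambda code: cosine_similarity(mean_vec, country_vectors[code]),
    )


# Toy vectors, invented for illustration only.
TOY_PLACE_VECTORS = {
    "Tikrit": [0.9, 0.1, 0.0],
    "Baghdad": [0.8, 0.2, 0.1],
}
TOY_COUNTRY_VECTORS = {
    "IRQ": [1.0, 0.0, 0.0],
    "BOL": [0.0, 1.0, 0.0],
    "USA": [0.0, 0.0, 1.0],
}

focus = infer_country(list(TOY_PLACE_VECTORS.values()), TOY_COUNTRY_VECTORS)
print(focus)
```

In the real service the vectors would come from the loaded word2vec model rather than from hand-written dictionaries.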

Installation
Simple Installation
------------

Mordecai is built as a series of [Docker](https://www.docker.com/) containers. You'll need to install Docker and
[docker-compose](https://docs.docker.com/compose/) to be able to use it. If you're using Ubuntu,
[this gist](https://gist.github.com/wdullaer/f1af16bd7e970389bad3) is a good
place to start.
Mordecai is built as a series of [Docker](https://www.docker.com/) containers,
which means that you won't need to install any software except Docker to use
it. You can find instructions for installing Docker on your operating system
[here](https://docs.docker.com/engine/installation/).

`Mordecai`'s Geonames gazetteer can either be run locally alongside Mordecai or on a remote server.
Elasticsearch/Geonames requires a large amount of memory, so running it
locally may be okay for small projects (if your machine has enough RAM), but is
not recommended for production. The config file's default settings assume it is
running locally. Uncomment and change those lines if your index is elsewhere on
the network. To download and start the Geonames Elasticsearch container
locally, run
To start Mordecai locally, run these four commands:

```
sudo docker pull openeventdata/es-geonames
sudo docker run -d -p 9200:9200 --name=elastic openeventdata/es-geonames
sudo docker build -t mordecai .
sudo docker run -d -p 5000:5000 --link elastic:elastic mordecai
```

This pulls a pre-built image and starts it running with an open port and defined name.
### Explanation:

To start `Mordecai` itself, from inside this directory, run
The first two commands download (if you're running it for the first time) and
start a pre-built image of a Geonames Elasticsearch container. This container
holds the geographic gazetteer that Mordecai uses to associate place names with
latitudes and longitudes. It will be accessible on port 9200 with the name
`elastic`.

```
sudo docker build -t mordecai .
sudo docker run -d -p 5000:5000 --link elastic:elastic mordecai
```
The `docker build` command builds the main Mordecai image using the commands in
the `Dockerfile`. This can take up to 20 minutes.

The final `docker run` command starts the Mordecai container and tells it to
connect to our already running `elastic` container with the
`--link elastic:elastic` option. Mordecai will be accessible on port 5000. By
default, Docker publishes the port on all network interfaces (0.0.0.0), so any
machine on your network will be able to access it.

The `--link` flag connects `Mordecai` to the elastic image running locally.
Leave off if it's running on a different server.
**Note on resources**: Many of the required components for `mordecai`,
including the word2vec and MITIE models, are very large, so downloading and
starting the service takes a while. After the service starts, it will not be
responsive for several minutes while the models are loaded into memory. You
should also ensure that you have approximately 16 GB of RAM available.
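
Because the service is unresponsive while the models load, a script that talks to Mordecai may want to wait until it answers. The following is a minimal sketch, assuming that a GET request to `http://localhost:5000/country` returns HTTP 200 once the service is ready; the function names, attempt count, and delay are illustrative choices, not part of Mordecai.

```python
import time
import urllib.request


def service_is_up(url, timeout=5):
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(url, probe=service_is_up, attempts=60, delay=10.0):
    """Call `probe(url)` up to `attempts` times, sleeping `delay` seconds
    after each failed attempt. Return True as soon as a probe succeeds,
    False if every attempt fails."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:5000/country")` would then block for up to ten minutes with these defaults. The `probe` parameter exists so the loop can be exercised without a running container.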

Please note that many of the required components for `mordecai`, such as the
word2vec and MITIE models, are rather large so downloading and starting the
service takes a while.

Advanced Configuration
-----------------------

`Mordecai`'s Geonames gazetteer can either be run locally alongside Mordecai or
on a remote server. Elasticsearch/Geonames requires a large amount of memory,
so running it locally may be okay for small projects (if your machine has
enough RAM), but is not recommended for production.

If you're running Elasticsearch/Geonames on a different server, you'll need to
make two changes:

First, the config file's default settings assume that `es-geonames` is running
locally. If you're running it on a separate server, uncomment the `Server`
section of the config file and update it with the IP address and port of your
running Geonames/Elasticsearch index.

Second, leave out the `--link elastic:elastic` portion when you call `docker
run` on Mordecai.

If you make any modifications to the Python files, you'll need to rebuild the
Mordecai container, which should only take a couple of seconds, and then
relaunch it.
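
To make the role of the `Server` section concrete, here is a hedged sketch of how such a section could be read. It assumes the config file is INI-style with `host` and `port` keys, and that a commented-out section means "fall back to the linked `elastic` container on port 9200"; the function name and fallback values are assumptions for illustration, not Mordecai's actual loading code.

```python
import configparser


def elasticsearch_address(config_text, default_host="elastic", default_port=9200):
    """Return (host, port) from an INI-style [Server] section.

    Falls back to the defaults when the section or a key is missing,
    for example because the section is still commented out.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    host = parser.get("Server", "host", fallback=default_host)
    port = parser.getint("Server", "port", fallback=default_port)
    return host, port


# Hypothetical config contents: one with the section commented out,
# one pointing at a separate server (the IP address is a placeholder).
COMMENTED_OUT = "# [Server]\n# host = localhost\n# port = 9200\n"
REMOTE = "[Server]\nhost = 192.0.2.10\nport = 9200\n"

print(elasticsearch_address(COMMENTED_OUT))
print(elasticsearch_address(REMOTE))
```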

Endpoints
---------
@@ -103,6 +159,12 @@ curl -XPOST -H "Content-Type: application/json" --data '{"text":"(Reuters) - Th
Returns:
`[{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", "lon": 43.67861, "countrycode": "IRQ"}, {"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", "lon": 43.67861, "countrycode": "IRQ"}, {"lat": 33.32475, "placename": "Baghdad", "seachterm": "Baghdad", "lon": 44.42129, "countrycode": "IRQ"}]`
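
As a rough illustration of consuming this output, the sketch below builds (but does not send) an equivalent JSON POST request with Python's standard library and counts how often each place appears in the sample response above. The `/places` path and port are taken from the R example in this commit; the helper names are hypothetical, and the `seachterm` key is spelled as the service returns it.

```python
import json
import urllib.request
from collections import Counter


def build_places_request(text, base_url="http://localhost:5000"):
    """Build a JSON POST request for the places endpoint without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/places",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def mentions_per_place(response_body):
    """Count how many times each placename occurs in a JSON response body."""
    places = json.loads(response_body)
    return Counter(place["placename"] for place in places)


# The sample response shown above, copied verbatim.
SAMPLE_RESPONSE = (
    '[{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", '
    '"lon": 43.67861, "countrycode": "IRQ"}, '
    '{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", '
    '"lon": 43.67861, "countrycode": "IRQ"}, '
    '{"lat": 33.32475, "placename": "Baghdad", "seachterm": "Baghdad", '
    '"lon": 44.42129, "countrycode": "IRQ"}]'
)

request = build_places_request("Fighting continued near Tikrit.")
counts = mentions_per_place(SAMPLE_RESPONSE)
print(request.get_method(), request.full_url)
print(dict(counts))
```

Passing `request` to `urllib.request.urlopen` would send it to a running Mordecai instance; the response body could then be handed to `mentions_per_place` in the same way.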

### R

See the `examples` directory for an example in R, demonstrating how to read in
text, send it to Mordecai, format the returned JSON, and plot it on an
interactive map.

### Python

```
@@ -140,3 +202,11 @@ The tests currently require access to a running Elastic/Geonames service to
complete. If this service is running locally in a Docker container, uncomment
the `Server` section in the config file so host = `localhost` and port =
`9200`.

Contributing
------------

Contributions via pull requests are welcome. Please make sure that changes
pass the unit tests. Any bugs and problems can be reported
on the repo's [issues page](https://github.com/openeventdata/mordecai/issues).

30 changes: 30 additions & 0 deletions examples/BOL_2009_Amnesty_International.txt
@@ -0,0 +1,30 @@
A number of initiatives in the area of economic, social and cultural rights resulted in improvements in education and health services and in the recognition of the land rights of Indigenous Peoples and campesinos (peasant farmers). Further weakening of the judicial system undermined fair trial guarantees.
Background

In December, President Evo Morales won a second term in office, gaining a two-thirds majority for his party in the legislature. A new Constitution was approved by voters in January and promulgated in February following more than two years of political negotiation. The Constitution asserts the centrality of Bolivia’s “plurinational” Indigenous majority and contains provisions to advance economic, social and cultural rights.
Political violence diminished, but political polarization continued to affect public life. In April, an elite police unit killed three men suspected of organizing an armed plot against the central government in the city of Santa Cruz, an opposition stronghold. Concerns were subsequently raised about the way in which the investigations were conducted.
Investigations into some 140 cases of reported rapes in Manitoba Mennonite communities were initiated. Young girls were alleged to be among the victims.
Justice system

There were continuing concerns about the independence of the judiciary. Political tensions undermined the ability of key institutions to discuss proposals for reform of the judiciary in a co-ordinated manner.
The last remaining Constitutional Court judge resigned in June, leaving a backlog of over 4,000 cases and no mechanism for oversight of constitutional guarantees
There were concerns that the continuing instability and politicization in the justice system could weaken the application of international fair trial standards. In 2009, many judges and law officers, including several Supreme Court judges, were disqualified and charged with procedural irregularities. Among them was Supreme Court President Eddy Fernández who was suspended in May on the grounds that he had allegedly intentionally delayed the “Black October” case (see below) with intent.
Legal challenges hindered progress in several high-profile cases, leading to allegations of political interference. For example, challenges over jurisdiction slowed progress in the case relating to the outbreak of violence in September 2008 in Pando department which left 19 people, mostly campesinos, dead. Allegations that judges assigned to some cases failed to act with impartiality resulted in further procedural challenges.
Two special commissions established by the Chamber of Deputies in 2008 presented their findings on both the racist violence that occurred in Sucre in May 2008 and the Pando massacre. At the end of the year, a number of local officials and leaders were on trial charged with torture and public order offences in Sucre. The Deputies recommended that over 70 people, including former Pando Prefect Leopoldo Fernández, be charged for their role in the Pando massacre. A trial was expected to start in early 2010.
Impunity

In May, the trial began of 17 senior officials, including former President Gonzalo Sánchez de Lozada, in connection with the “Black October” events of October 2003 in which at least 67 people were killed and more than 400 injured in clashes between the security forces and demonstrators protesting against government proposals to sell off national gas resources. At the end of the year, Gonzalo Sánchez de Lozada remained in the USA awaiting the outcome of an extradition request. Several former ministers charged in the case left Bolivia during 2009, thus evading prosecution.
In November, a US court ruled that sufficient grounds existed to try Gonzalo Sánchez de Lozada and former Defence Minister Carlos Sánchez Berzaín in the USA in a civil suit for damages in relation to charges of crimes against humanity and carrying out extrajudicial executions.
Former Interior Minister Luis Arce Gómez was extradited from the USA to Bolivia. On arrival he was given a 30-year prison sentence. He had been convicted in 1993 of enforced disappearance, torture, genocide and murder committed in 1980 and 1981.
Forensic work to locate the remains of members of an armed opposition movement who were forcibly disappeared in 1970 began in July in Teoponte, a rural area 300km from La Paz. By the end of the year, nine bodies had been found. The search for the remains of around 50 others believed to have died in the area was continuing at the end of the year.
The Ministry of Defence approved a procedure allowing documentation relating to past human rights violations to be requested from the armed forces. President Morales initially insisted that no files existed relating to people who were forcibly disappeared under previous governments.
Indigenous Peoples’ rights

In May, the UN Permanent Forum on Indigenous Issues published a report which acknowledged the steps taken by the Bolivian authorities to identify servitude, forced labour, bonded labour and enslavement of captive families. The report criticized entrenched interests prevalent in lowland prefectures and civic committees that allowed such abuses to continue.
In July, the Vice-Minister for Land announced a new programme to settle approximately 2,000 families from Cochabamba and La Paz departments to 200,000 hectares of lands identified as federal land in Pando department. In August, the first families were moved to these lands. However, there were concerns about the lack of infrastructure and services available to them and the programme was cancelled.
Women’s rights

A government initiative to reduce maternal mortality began in May, granting mothers a cash incentive to attend free pre- and post-natal check-ups. Take-up was high, but there were reports that women who did not have birth certificates encountered obstacles in accessing this health care. Health professionals reported an increase in the number of clandestine abortions and teenage pregnancies during the year, but there were no comprehensive reliable figures to confirm this.
Amnesty International visit

Amnesty International delegates visited Bolivia in August.
117 changes: 117 additions & 0 deletions examples/R_Mordecai_Example.Rmd
@@ -0,0 +1,117 @@
---
title: "Using Mordecai Geoparsing In R"
author: "Andy Halterman"
output: html_document
references:
- id: hrtexts
  title: "Human Rights Texts: Converting Human Rights Primary Source Documents into Data"
  author:
  - family: Fariss
    given: Christopher J.
  - family: Linder
    given: Fridolin J.
  - family: Crabtree
    given: Charles D.
  - family: Biek
    given: Megan A.
  - family: Ross
    given: Ana-Sophia M.
  - family: Kaur
    given: Taranamol
  - family: Tsai
    given: Michael
  year: 2015
  DOI: 10.7910/DVN/IAH8OY
  url: "http://dx.doi.org/10.7910/DVN/IAH8OY"
  publisher: Harvard Dataverse
---

One of the advantages of Mordecai's HTTP-based interface is that any language
that can make HTTP POST requests can interact with it without needing special
Mordecai packages or code. This example demonstrates how to read in a text file,
have Mordecai geolocate it to the country level, and then do a full geoparse
with Mordecai. It then shows how to format the returned data and easily plot it
on a map.

For this demonstration, we need `httr` for handing the request to Mordecai,
`dplyr` for formatting the result, and `leaflet` for making a quick interactive
map of the results.

```{r message = FALSE, warning=FALSE}
library(httr)
library(dplyr)
library(leaflet)
```

Set the URLs for the `country` and `places` endpoints. Here, Mordecai is
running locally.

```{r}
country_url <- "http://localhost:5000/country"
places_url <- "http://localhost:5000/places"
```

We can then make a GET request to Mordecai to make sure it's up and running and
that we can talk to it.

```{r}
# `as = "parsed"` is an argument of content(), not GET()
resp <- GET(url = country_url)
content(resp, as = "parsed")
```

This response lets us know that it is and gives us some guidance on what data
format it expects.

First, let's test Mordecai's country coding capability. We can read in one of
the human rights texts prepared by @hrtexts...

```{r}
bol <- paste(readLines("BOL_2009_Amnesty_International.txt"), collapse = " ")
```

...and then POST it to the `country` endpoint.

```{r}
# `as = "parsed"` belongs to content(), which parses the JSON response
bol_country <- POST(url = country_url,
                    body = list("text" = bol),
                    encode = "json")
content(bol_country, as = "parsed")
```

Thankfully, since this is indeed a text about Bolivia, Mordecai codes it as `BOL`.

Now let's do a full geoparsing, extracting all the place names in the text and
finding their correct entries in the gazetteer. The final line formats the
response as a dataframe.

```{r}
bol_places <- POST(url = places_url,
                   body = list("text" = bol),
                   encode = "json")
# Parse the JSON response and stack the per-place records into a data frame
bol_places_df <- bind_rows(content(bol_places, as = "parsed"))
bol_places_df
```

These locations pass an eyeball test: no placename was located to a completely
different looking place. Now, for fun, we can plot these locations on an
interactive leaflet map, sized according to their mentions in the text.

```{r message = FALSE}
bol_places_df %>%
  group_by(placename) %>%
  mutate(count = n()) %>%
  distinct() %>%
  leaflet(.) %>%
  addTiles() %>%
  addCircleMarkers(popup = ~placename, radius = ~3*(count + 2))
```

A more serious example would use many more texts than this one and would
probably wrap the raw POST requests into a function. But hopefully this example
will get R users started with Mordecai.

# References
218 changes: 218 additions & 0 deletions examples/R_Mordecai_Example.html

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions paper.md
@@ -0,0 +1,19 @@
---
title: "Mordecai: Full Text Geoparsing and Event Geocoding"
tags:
- Python
- geocoding
- geoparsing
- natural language processing
- word embeddings
authors:
 - name: Andrew Halterman
   orcid: 0000-0001-9716-9555
   affiliation: MIT
date: 4 September 2016
bibliography: paper.bib
---
# Summary
# References
15 changes: 15 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,15 @@
@article{mikolov2013efficient,
title={Efficient estimation of word representations in vector space},
author={Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
journal={arXiv preprint arXiv:1301.3781},
year={2013}
}


@online{geonames,
author = {Geonames},
title = {Geonames},
year = 2016,
url = {http://geonames.org},
urldate = {2016-09-08}
}
28 changes: 28 additions & 0 deletions paper/paper.md
@@ -0,0 +1,28 @@
---
title: "Mordecai: Full Text Geoparsing and Event Geocoding"
tags:
- Python
- geocoding
- geoparsing
- natural language processing
- word embeddings
authors:
 - name: Andrew Halterman
   orcid: 0000-0001-9716-9555
   affiliation: MIT
date: 4 September 2016
bibliography: paper.bib
---

# Summary

Mordecai is a new full-text geoparsing system that extracts place names from
text, resolves them to their correct entries in a gazetteer, and returns
structured geographic information for the resolved place name. Mordecai's key
innovations are in a language-agnostic architecture that uses word2vec
[@mikolov2013efficient] for inferring the correct country for a set of
locations in a piece of text and easily changed named entity recognition
models. As a gazetteer, it uses Geonames [@geonames] in a custom-built
Elasticsearch database.

# References
