Saffron 4.0 - Text Analysis and Insight Tool

Saffron is a tool for multi-stage analysis of text corpora using state-of-the-art natural language processing technology. It consists of a set of self-contained, independent modules, each of which provides a distinct analysis of the text. These modules are as follows:

Corpus Indexing: Analyses raw text documents in various formats and indexes them for later components
Term Extraction: Extracts keyphrases that are the terms of each single document in a collection
Concept Consolidation: Detects and removes variations from the list of terms of each document
Author Consolidation: Detects and removes name variations from the list of authors of each document
DBpedia Lookup: Links terms extracted from a document to URLs on the Semantic Web
Author Connection: Associates authors with terms from the documents and identifies the importance of the term to each author
Term Similarity: Measures the relevance of each term to each other term
Author Similarity: Measures the relevance of each author to each other author
Taxonomy Extraction: Organizes the terms into a single hierarchical graph that allows for easy browsing of the corpus and deep insights.
RDF Extraction: Creates a knowledge graph (note that this process can take some time)

More detailed information on the configuration of Saffron can be found here.

Prerequisites

Java JDK 1.7 or above

Check that Java is installed:

java -version
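
The check can also be scripted. The following is a minimal bash sketch, assuming that `java -version` prints a quoted version string such as "1.8.0_292" or "11.0.2"; the helper name `java_major` is hypothetical and not part of Saffron.

```shell
# java_major is a hypothetical helper, not part of Saffron.
# It maps a Java version string to its major number:
# legacy strings such as 1.7.0_80 give 7, newer ones such as 11.0.2 give 11.
java_major() {
  local v="$1"
  v="${v#1.}"         # drop the legacy "1." prefix if present
  v="${v%%[!0-9]*}"   # keep only the leading digits
  printf '%s\n' "$v"
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted token.
  version="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  major="$(java_major "$version")"
  if [ -n "$major" ] && [ "$major" -ge 7 ]; then
    echo "Java $version satisfies the 1.7 minimum"
  else
    echo "Java 1.7 or above is required (found: ${version:-unknown})" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```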

Maven

Saffron uses Apache Maven to build and run, so Maven must be installed (the recommended version is Maven 3.5.4).

Maven can be obtained through package managers such as APT or may be installed as follows:

Download Maven

wget -O- https://archive.apache.org/dist/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz | tar -xzv -f - -C "$HOME"

Locate and add Maven's bin directory path to the PATH variable in your ~/.bash_profile

export PATH="$HOME/apache-maven-3.5.4/bin:$PATH"

source ~/.bash_profile

Check that Maven is installed

mvn -version

MongoDB (optional)

If you use the Web Interface, MongoDB can be used to store the data. If so, install MongoDB using the default settings.

3GB Memory

Saffron uses deep learning models for some of its modules, and these files can be quite big. You will need about 3 GB of free disk space to install Saffron and its models.
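
The space requirement can be checked before installing. The following is a minimal bash sketch, assuming that Saffron is installed under the home directory (as in the clone command in the Installation section); the helper name `free_kib` is hypothetical.

```shell
# free_kib is a hypothetical helper, not part of Saffron: it prints the free
# space, in KiB, on the filesystem that holds the given path.
free_kib() {
  # -P forces the one-line-per-filesystem POSIX layout, -k reports 1024-byte
  # blocks; the fourth column of the second line is the available space.
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

required_kib=$((3 * 1024 * 1024))   # 3 GB expressed in KiB
target="${HOME:-.}"                 # assumed install location
available_kib="$(free_kib "$target")"

if [ "$available_kib" -ge "$required_kib" ]; then
  echo "Enough free space in $target"
else
  echo "Only $((available_kib / 1024)) MiB free in $target; about 3 GB is needed" >&2
fi
```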

Installation

To install Saffron:

Clone the GitHub repository

git clone https://github.com/insight-centre/saffron.git ~/saffron-os

Move to the project directory and install the Maven dependencies

cd ~/saffron-os
mvn clean install

Run the whole pipeline of Saffron using the Command Line method below.

Note 1: The first run of the pipeline downloads all the models Saffron needs, so it takes longer than later runs.

Note 2: After the last step, you may see the following text in the logs. It can be ignored and does not affect the analysis.

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Running

Using the Command Line

All steps of Saffron can be executed by running the saffron.sh script, without using the Web Interface. This script takes three arguments:

The corpus, which may be one of:
1. A folder containing files in TXT, DOC or PDF format
2. A zip, tar.gz or .tgz file containing files in TXT, DOC or PDF format
3. A JSON metadata file describing the corpus (see Saffron Formats for more details on the format of the file)
4. A URL (to crawl the corpus from)
The output folder to which the results are written
The configuration file (as described in Saffron Formats)

In addition, some optional arguments can be specified:

-c <RunConfiguration$CorpusMethod> : The type of corpus to be used. One of CRAWL, JSON or ZIP (ZIP is for a corpus supplied as a zip, tar.gz or .tgz file containing files in TXT, DOC or PDF format). Defaults to JSON

-i <File> : The inclusion list of terms and relations (in JSON)
-k <RunConfiguration$KGMethod> : The method for knowledge graph construction, i.e. whether to generate a taxonomy or a knowledge graph. Choose between TAXO and KG. Defaults to KG
--domain : Limit the crawl to the domain of the seed URL (if using the CRAWL option for the corpus)

--max-pages <Integer> : The maximum number of pages to extract when crawling (if using the CRAWL option for the corpus)
--name <String> : The name of the run
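
As an illustration of how the optional arguments combine, the bash sketch below assembles a crawl run from the flags listed above. The seed URL, output folder and run name are placeholders chosen for this example, and the command is only executed when saffron.sh is present in the current directory.

```shell
# Arguments for a crawl run. Only the flags come from the option list above;
# the URL, output folder and run name are placeholders.
crawl_cmd=(
  ./saffron.sh
  "https://example.org/"      # corpus: the seed URL to crawl
  ./web/data/output_crawl     # output folder
  ./examples/config.json      # configuration file
  -c CRAWL                    # corpus type
  --domain                    # stay on the domain of the seed URL
  --max-pages 50              # stop after 50 pages
  --name example_crawl        # name of the run
  -k KG                       # build a knowledge graph (the default)
)

# Show the command that would be run, then run it if the script is available.
printf '%s ' "${crawl_cmd[@]}"
echo
if [ -x ./saffron.sh ]; then
  "${crawl_cmd[@]}"
fi
```

Keeping the arguments in an array makes it easy to comment each one and to swap `-c CRAWL` for `-c ZIP` or `-c JSON` with a matching corpus argument.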

For example, try this test command:

./saffron.sh ./examples/presidential_speech_dataset/corpus_with_authors.json ./web/data/output_KG ./examples/config.json -k TAXO

and verify that you obtain the output JSON files in the ./web/data/output_KG folder.
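
That verification can be sketched in bash as follows. The list of expected files is an assumption taken from the Results section of this README for a run with `-k TAXO`; the helper name `missing_outputs` is hypothetical.

```shell
# missing_outputs is a hypothetical helper, not part of Saffron: it prints
# every expected file that is absent or empty in the given folder.
missing_outputs() {
  local dir="$1" f
  shift
  for f in "$@"; do
    [ -s "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# Assumed file names, taken from the Results section for a taxonomy run.
expected=(terms.json doc-terms.json author-terms.json author-sim.json term-sim.json taxonomy.json)

missing="$(missing_outputs ./web/data/output_KG "${expected[@]}")"
if [ -z "$missing" ]; then
  echo "All expected output files are present"
else
  echo "Missing or empty output files:"
  echo "$missing"
fi
```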

More detail on Saffron, i.e. how to install it, how to configure the different features, and the approaches it is based on, can be found in the Wiki (https://github.com/insight-centre/saffron/wiki).

Using the Web Interface

(optional) If you choose to use Mongo, install MongoDB (use the default settings) and start it by typing 'mongod' in a terminal. MongoDB must be running while the Web Interface is in use.

The file saffron-web.sh contains settings such as the name given to the database and the host and port it will run on. If you are using Mongo and need to change the database name (defaults to saffron_test), edit saffron-web.sh and change the line: export MONGO_DB_NAME=saffron_test

To change the Mongo host and port, edit the following lines in the same file:
```
 export MONGO_URL=localhost
 export MONGO_PORT=27017
```
All results (output JSON files) will be generated in ./web/data/. However, you can store them in the MongoDB database only, by setting the following line to false:
```
export STORE_LOCAL_COPY=true
```
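
The settings above can also be changed from a script rather than by hand. The following is a minimal bash sketch, assuming each setting sits on its own `export NAME=value` line in saffron-web.sh as shown above; the helper name `set_export` and the database name `my_corpus` are hypothetical.

```shell
# set_export is a hypothetical helper, not part of Saffron: it rewrites the
# line "export NAME=..." in a file so that it reads "export NAME=VALUE".
set_export() {
  local file="$1" name="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v n="$name" -v v="$value" '
    $0 ~ ("^[[:space:]]*export " n "=") { print "export " n "=" v; next }
    { print }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file permissions
  rm -f "$tmp"
}

# Example: use a different database and keep results in MongoDB only.
if [ -f saffron-web.sh ]; then
  set_export saffron-web.sh MONGO_DB_NAME my_corpus
  set_export saffron-web.sh STORE_LOCAL_COPY false
fi
```

Writing the result back with `cat` instead of moving the temporary file over the original keeps the executable bit on saffron-web.sh.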
To start the Saffron Web server, choose a directory in which Saffron will create the models and run the following command:

./saffron-web.sh

Then open the following URL in a browser to access the Web Interface:

http://localhost:8080/

See the Wiki for more details on how to use the Web Interface.

FORMATS.md describes the input files needed to run Saffron and the output files it generates.

Using Docker (one module at a time or as a pipeline)

It is possible to run each module of Saffron using Docker (note that some modules depend on other modules).

Comprehensive documentation on how to do this is available in ./docs/Saffron_Docker_Documentation.pdf

Results

If the Web Interface is used and STORE_LOCAL_COPY is set to true, the output files are generated and stored in ./web/data/. Saffron generates the following files (see Saffron Formats for more details on each file):

terms.json: The terms with weights
doc-terms.json: The document term map with weights
author-terms.json: The connection between authors and terms
author-sim.json: The author-author similarity graph
term-sim.json: The term-term similarity graph
taxonomy.json: The final taxonomy over the corpus as JSON (if option chosen)
taxonomy.json: The final taxonomy over the corpus as RDF (if option chosen)
rdf.json: The final knowledge graph over the corpus as JSON (if option chosen)
rdf.json: The final knowledge graph over the corpus as RDF (if option chosen)
config.json: The configuration file for the run

To create a .dot file for the generated taxonomy, you can use the following command:

python taxonomy-to-dot.py taxonomy.json > taxonomy.dot
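
The resulting .dot file can then be rendered as an image. The following is a minimal bash sketch, assuming Graphviz is installed separately (it is not listed among Saffron's prerequisites); the helper name `render_dot` and the output file name are hypothetical.

```shell
# render_dot is a hypothetical helper, not part of Saffron: it renders a .dot
# file to SVG with the Graphviz "dot" command, if that command is installed.
render_dot() {
  local dot_file="$1" svg_file="$2"
  if ! command -v dot >/dev/null 2>&1; then
    echo "Graphviz 'dot' not found on PATH" >&2
    return 1
  fi
  dot -Tsvg "$dot_file" -o "$svg_file"
}

# Render the taxonomy produced by taxonomy-to-dot.py, if it exists.
if [ -f taxonomy.dot ]; then
  render_dot taxonomy.dot taxonomy.svg
fi
```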

Developer Guide

Check here to see how you can contribute to Saffron

Important:

If you make any change that affects the format of the input files, the format of the output files, the format of the configuration file, or the command used to run Saffron, please update the following files accordingly:

README.md
Files within the examples folder (and sub-folders)
FORMATS.md

and inform the development team of Saffron.

Java configuration

The Java classes describing the configuration are documented in the JavaDoc.

API Documentation

For the API documentation, see Saffron API Documentation
