
NERD and wiKIData (NERD KID) is a machine learning application for classifying Wikidata items into 27 classes (as defined by the Grobid-NER project).


nerdKid 🤡

This project is inspired by the work of entity-fishing and grobid-ner. Entity-fishing is a tool for automating entity recognition and disambiguation, while grobid-ner is a named-entity recogniser based on GROBID, a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents, with a particular focus on technical and scientific publications.

The nerdKid project focuses on classifying entities into their types (e.g. Person, Location), that is, into the grobid-ner classes, using Wikidata as an online knowledge base.

Goal

According to Wikidata's statistics, Wikidata contains more than one hundred million items. Given such a rich and open knowledge base, it is interesting to learn how those items can be classified into 27 classes. These classes are based on the results of the Grobid-NER project.

The idea of this project is to teach computers how to group the millions of items in Wikidata into specific classes based on the characteristics of their data.

Take, for example, the Wikidata item Albert Einstein, which has the identifier 'Q937'. This item has a number of properties (e.g. 'instance of-P31', 'sex or gender-P21') as well as a number of values for each property (e.g. 'human-Q5' as a value of property 'P31', 'male-Q6581097' as a value of property 'P21'). Based on a trained model, the computer predicts a class for the Albert Einstein item, for instance the Person class. This project also takes the ambiguity of items into account. For instance, the Marshall Plan will not be classified into the Person class, because it is not the name of a person but an American initiative to aid Western Europe.
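
As a rough illustration of the data behind such a prediction, the statements of an item can be viewed as property-value pairs. The sketch below is only an assumption about how one might hold them in memory: the ItemStatements class, its methods and the "property_value" feature encoding are hypothetical and are not part of nerdKid. The identifiers are the ones quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the statements of one Wikidata item (not a nerdKid class).
class ItemStatements {
    private final String itemId;
    // property id -> value ids, e.g. "P31" -> ["Q5"]
    private final Map<String, List<String>> statements = new LinkedHashMap<>();

    ItemStatements(String itemId) {
        this.itemId = itemId;
    }

    // Record one statement (property, value) for this item.
    ItemStatements add(String propertyId, String valueId) {
        statements.computeIfAbsent(propertyId, k -> new ArrayList<>()).add(valueId);
        return this;
    }

    // Flatten the statements into "property_value" strings, one possible
    // feature encoding for a classifier (an assumption, not nerdKid's format).
    List<String> toFeatures() {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : statements.entrySet()) {
            for (String value : entry.getValue()) {
                features.add(entry.getKey() + "_" + value);
            }
        }
        return features;
    }

    String getItemId() {
        return itemId;
    }

    public static void main(String[] args) {
        ItemStatements einstein = new ItemStatements("Q937")
                .add("P31", "Q5")         // instance of: human
                .add("P21", "Q6581097");  // sex or gender: male
        System.out.println(einstein.getItemId() + " " + einstein.toFeatures());
    }
}
```

A classifier would then learn which combinations of such features tend to occur with which class, for example that 'P31' with the value 'Q5' points to a Person.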

Albert Einstein

Tools

Developing Tools

Installation-Build-Run

1. Installation

a. Clone this source

$ git clone https://github.com/tantikristanti/NERD_KID.git

b. Download the zip file

NERD_KID

2. Build the project

$ mvn clean install

Training and Evaluation

For training, 9922 Wikidata items were chosen. Of these examples, 80% were used for training and the rest for evaluation. The accuracy obtained with the current model is 92.091%. The FMeasure for each class type is shown below:

[Figure: FMeasure per class type]

Since the examples were sampled randomly, a number of class types did not have enough examples, which is why their FMeasure is 0.
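
To see why a class without examples ends up with an FMeasure of 0, it can help to write the measure out. The following is a minimal sketch, not the evaluation code used by nerdKid: the FMeasure class name and the counts in main are made up, and returning 0 when there are no true positives is an assumption that matches the zeros reported above.

```java
// Per-class F-measure from raw counts; names are illustrative, not nerdKid's API.
class FMeasure {
    // Harmonic mean of precision and recall. When there are no true
    // positives the value is defined here as 0, which also avoids a
    // division by zero for classes that never occur in the evaluation set.
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        if (truePositives == 0) {
            return 0.0;
        }
        double precision = (double) truePositives / (truePositives + falsePositives);
        double recall = (double) truePositives / (truePositives + falseNegatives);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // A class with evaluation examples (made-up counts).
        System.out.println(f1(90, 10, 10));
        // A class with no evaluation examples and no predictions.
        System.out.println(f1(0, 0, 0));
    }
}
```

With random sampling, rare classes may receive no evaluation examples at all, so they fall into the second case regardless of how good the model is.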

Get the prediction results

To predict the class of each Wikidata Id listed in New Elements, run:

$ mvn exec:java -Dexec.mainClass="org.nerd.kid.model.WikidataNERPredictor"

  • The result can be seen in Result Predicted Class

Demo version

For testing purposes, a demo of Nerd-Kid is available here: Nerd-Kid

Users only need to change the Wikidata Id, which starts with 'Q' followed by a number.

Prediction Result

  • The result consists of the Wikidata Id, its properties, and the predicted class.
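
Since the demo expects an identifier of the form 'Q' followed by a number, a client could check the input before sending it. This is a small sketch under that assumption only; the WikidataIds helper is hypothetical and is not provided by nerdKid.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of nerdKid) to check an id before querying.
class WikidataIds {
    // 'Q' followed by one or more digits.
    private static final Pattern ITEM_ID = Pattern.compile("Q[0-9]+");

    static boolean isItemId(String id) {
        return id != null && ITEM_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isItemId("Q937")); // true
        System.out.println(isItemId("P31"));  // false: a property, not an item
        System.out.println(isItemId("Q"));    // false: the number is missing
    }
}
```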

Use nerdKid Service in Other Projects

nerdKid can be used in other projects as follows:

1. Make sure nerdKid is built:

$ mvn clean install

2. Add the nerdKid dependency to the pom file of the other project:

<dependency>
    <groupId>org.nerd.kid</groupId>
    <artifactId>nerd-kid-project</artifactId>
    <version>1.0-SNAPSHOT</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>

To prevent errors caused by overlapping Maven dependencies such as slf4j, follow the steps explained here: slf4j

  • Exclude commons-logging by declaring it in the provided scope, and add jcl-over-slf4j in its place:
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.1.1</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>jcl-over-slf4j</artifactId>
    <version>2.0.0-alpha1</version>
</dependency>
  • Remove any other slf4j dependencies (or simply comment them out)

3. Add the nerdKid library (nerd-kid-project) under lib/org/nerd/kid. The library is produced by the build and is stored under .m2/repository/org/nerd/kid/

4. Call the prediction service:

To predict the NER type, nerdKid needs to collect the statements of each Wikidata element. These statements are collected from the entity-fishing service, and there are several ways to collect them.

To collect them in a custom way, create a new class that implements nerdKid's WikidataFetcherWrapper interface and adapt its WikidataElement getElement(String wikiId) method as needed.
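
The shape of such a custom implementation might look like the sketch below. To keep it compilable on its own, it declares simplified stand-ins for WikidataElement and WikidataFetcherWrapper; in a real project both types come from nerdKid, and their actual members (and any declared exceptions) may differ. InMemoryFetcherWrapper and its put method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in so the sketch compiles on its own. The real
// WikidataElement comes from nerdKid and may have different members.
class WikidataElement {
    private final String id;

    WikidataElement(String id) {
        this.id = id;
    }

    String getId() {
        return id;
    }
}

// Stand-in for nerdKid's interface, reduced to the method named in the text.
interface WikidataFetcherWrapper {
    WikidataElement getElement(String wikiId);
}

// Hypothetical implementation that serves elements from memory,
// for example in tests that should not call a remote service.
class InMemoryFetcherWrapper implements WikidataFetcherWrapper {
    private final Map<String, WikidataElement> store = new HashMap<>();

    void put(WikidataElement element) {
        store.put(element.getId(), element);
    }

    @Override
    public WikidataElement getElement(String wikiId) {
        WikidataElement element = store.get(wikiId);
        if (element == null) {
            throw new IllegalArgumentException("Unknown Wikidata Id: " + wikiId);
        }
        return element;
    }
}

class FetcherDemo {
    public static void main(String[] args) {
        InMemoryFetcherWrapper wrapper = new InMemoryFetcherWrapper();
        wrapper.put(new WikidataElement("Q937"));
        System.out.println(wrapper.getElement("Q937").getId());
    }
}
```

An instance of the custom class would then be passed to the predictor in the same way as the wrappers in the examples that follow.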

a. Example of using the default nerdKid service:

WikidataFetcherWrapper wrapper = new NerdKBFetcherWrapper();
WikidataNERPredictor wikidataNERPredictor = new WikidataNERPredictor(wrapper);
System.out.println(wikidataNERPredictor.predict("Q1077").getPredictedClass());

b. Example of using the nerdKid service with entity-fishing running on localhost (default port 8090):

For this option, entity-fishing must be running ($ mvn clean jetty:run); see entity-fishing-documentation

WikidataFetcherWrapper wrapper = new NerdKBLocalFetcherWrapper();
WikidataNERPredictor wikidataNERPredictor = new WikidataNERPredictor(wrapper);
System.out.println(wikidataNERPredictor.predict("Q1077").getPredictedClass());

Reference

To cite this work, please refer to the GitHub project:

Nerd-Kid (2017-2023) <https://github.com/tantikristanti/NERD_KID>

Contact

Main author and contact: Tanti Kristanti
