DataDoc Analyzer

Extract, in a structured manner, the information that the ML community's general guidelines on dataset documentation ask for from a dataset's scientific documentation. Study and analyze scientific data published in peer-reviewed journals such as Nature's Scientific Data and Data-in-Brief.

📼 Take a look at our short video presenting the tool! 📼 Here is also an example of a study that uses DataDocAnalyzer to extract data from data papers.

Here is a complete list of data journals suitable for analysis with this tool. Test the tool's web UI in the following HuggingFace Space, and the API using our Docker image.

⚒️ Installation

The tool comes with two interfaces: a web app built with Gradio, intended for testing the tool's capabilities and analyzing a single document (you can try it in the HuggingFace Space), and an API built with FastAPI, suited for integration into any ML pipeline.

To use this tool, you need Python 3.10, git, and pip installed on your system. Then run:

git clone https://github.com/SOM-Research/DataDoc-Analyzer.git datadoc

## Enter the created folder
cd datadoc

## Install dependencies (preferably in a virtual environment)
pip install -r requirements.txt
Run the web UI:
python3 app.py
Run the API:
uvicorn api:app 
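
The virtual environment suggested for the dependency step can be set up along these lines. This is a minimal sketch for POSIX shells: the helper name `setup_venv` and the directory name `.venv` are choices made for this example, not part of the project.

```shell
# Hypothetical helper: create a virtual environment, activate it,
# and install the project's dependencies into it.
setup_venv() {
  venv_dir="${1:-.venv}"   # .venv is an arbitrary default for this sketch
  # Create the environment with the standard-library venv module
  python3 -m venv "$venv_dir" || return 1
  # Activate it for the current shell session
  . "$venv_dir/bin/activate"
  # Install the dependencies listed by the project
  pip install -r requirements.txt
}

# Call it from inside the cloned datadoc folder:
# setup_venv
```

After activation, `python3 app.py` and `uvicorn api:app` run against the packages installed in the environment rather than the system-wide ones.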
Run the API using the docker image:

First, you need to install Docker on your system. Then:

docker pull joangi/datadoc_analyzer
docker run --name apidataset -p 80:80 joangi/datadoc_analyzer
docker exec apidataset apt -y install default-jre 

The API will be running on localhost at port 80. (You can change the port in the -p option of the docker run command above.)
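
As an illustration of changing the port, the sketch below builds the same `docker run` command with a different host port. Docker's `-p` option takes the form HOST:CONTAINER, so only the left-hand number changes. The variable names `HOST_PORT` and `DOCKER_CMD` and the value 8080 are assumptions made for this example.

```shell
# Host port to expose the API on (8080 is an arbitrary example value)
HOST_PORT=8080

# -p maps HOST:CONTAINER; the container side stays at 80, as in the command above
DOCKER_CMD="docker run --name apidataset -p ${HOST_PORT}:80 joangi/datadoc_analyzer"

# Print the command so it can be checked before running it
echo "$DOCKER_CMD"
```

Running the printed command would make the API reachable on localhost at port 8080 instead of port 80.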

☑️ Usage

Web UI

To use this tool, you need to provide your own API key from OpenAI.

Once set, you can upload your PDF from one of the scientific journals suited for this tool (see footnote 1). Keep in mind that the tool analyzes “data papers.” Other journal publications, such as meta-analyses or full papers, may not work adequately.

Finally, click “get insights” in any tab to get the results together with the completeness report.

API

The API mirrors the behavior of the web UI's tabs and, in addition, offers an endpoint that retrieves all the dimensions at once. The API's Swagger documentation, which can be tested in place, is served alongside the API. When started with uvicorn, the server listens on port 8000 by default (if that port is not occupied by another application on your system), and the documentation is available at http://127.0.0.1:8000/docs
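
A quick way to confirm that the server is up is to request the documentation page from the command line. The sketch below assumes `curl` is installed. The helper name `api_is_up` and the variable `API_BASE` are made up for this example. It also assumes the app keeps FastAPI's default schema location, `/openapi.json`, which this README does not confirm.

```shell
# Base URL of a locally started server (uvicorn's default port, as described above)
API_BASE="${API_BASE:-http://127.0.0.1:8000}"

# Hypothetical helper: succeed only if the Swagger page answers with HTTP 200
api_is_up() {
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/docs")" || return 1
  [ "$code" = "200" ]
}

if api_is_up "$API_BASE"; then
  echo "API reachable, docs at $API_BASE/docs"
  # Assumption: FastAPI's default schema URL has not been changed by the app
  curl -s "$API_BASE/openapi.json"
else
  echo "API not reachable at $API_BASE"
fi
```

If the schema is served, it lists the available endpoints and their parameters, which is the same information the Swagger page presents interactively.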

📚 Background research

The tool was presented at the 32nd ACM International Conference on Information and Knowledge Management (CIKM) in October 2023 (tool's publication). You can also watch this short video presenting the tool.

⚖️ License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The CC BY-SA license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

Footnotes

  1. Some journals that publish data papers: Nature's Scientific Data, Data-in-Brief, Geoscience Data Journal, etc. Here is a complete list of data journals suitable for analysis with this tool.
