DUNKS

This is the source code and data accompanying the paper "DUNKS: Chunking and Summarizing Large and Heterogeneous Web Data for Dataset Search".

With the vast influx of open datasets published on the Web, dataset search has become an established and increasingly prominent problem. Existing solutions primarily cater to data in a single format, such as tabular or RDF datasets, despite the diverse formats of Web data. In this paper, to address data heterogeneity, we propose to transform major data formats into unified data chunks, each consisting of triples describing an entity. Furthermore, to make data chunks fit the limited input capacity of dense ranking models based on pre-trained language models, we devise a multi-chunk summarization method that extracts representative triples from representative chunks. We conduct experiments on two test collections for ad hoc dataset retrieval, where the results demonstrate the effectiveness of dense ranking over summarized data chunks.
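For illustration only, a hypothetical data chunk (not taken from the repository) might look as follows in Python: a small set of triples that all describe one entity, here one row of an imaginary weather CSV file.

    # Hypothetical example of a unified data chunk: triples describing a
    # single entity (one row of an imaginary weather CSV file).
    chunk = [
        ("row_17", "station",       "Nanjing"),
        ("row_17", "date",          "2020-07-01"),
        ("row_17", "temperature_c", "31.5"),
    ]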


Requirements

This code is based on Python 3.9+. A partial list of the required packages is as follows.

  • beautifulsoup4
  • camelot_py
  • contractions
  • pikepdf
  • python_docx
  • python_magic
  • rdflib
  • tika
  • xmltodict
  • flag-embedding
  • torch
  • transformers
  • ranx
pip install -r requirements.txt

Unified Data Chunking

python ./code/unified-data-chunking/graph_builder.py [-i|-p] <input_file|input_path> -o <output_path>
  • [-i|--input_file]: path to a single file

  • [-p|--input_path]: path to the input folder

  • [-o|--output_path]: path to the output folder

Note: only one of -i and -p can be specified.

The structure of the input folder:

    ./input_folder
    |--dataset1
        |--file1.json
        |--file2.csv
    |--dataset2
        |--file1.json
        |--file2.csv

The input dataset can contain multiple heterogeneous data files. Currently supported data formats include:

  • .txt, .pdf, .html, .doc, .docx
  • .csv, .xls, .xlsx
  • .json, .xml
  • .rdf, .nt, .owl

The generated files in the output folder:

    ./output_folder
    |--term.tsv
    |--text.tsv
    |--triple.tsv

The structure of each output file is as follows (a reading sketch follows the list):

  • term.tsv: dataset_id\tterm_id\tterm_text
  • text.tsv: dataset_id\tpassage_id\tpassage_text
  • triple.tsv: dataset_id\tsubject_id\tpredicate_id\tobject_id
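As a quick sanity check of the chunking output, the following sketch (not part of the repository; it assumes the TSV files have no header rows and that subject_id, predicate_id, and object_id refer to entries in term.tsv) loads term.tsv and triple.tsv and prints the triples of one dataset with identifiers resolved back to their term texts.

    from collections import defaultdict

    def read_tsv(path):
        # Yield the tab-separated fields of each line; no header row assumed.
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n").split("\t")

    # Term texts: dataset_id -> {term_id: term_text}
    terms = defaultdict(dict)
    for dataset_id, term_id, term_text in read_tsv("output_folder/term.tsv"):
        terms[dataset_id][term_id] = term_text

    # Triples: dataset_id -> list of (subject_id, predicate_id, object_id)
    triples = defaultdict(list)
    for dataset_id, s, p, o in read_tsv("output_folder/triple.tsv"):
        triples[dataset_id].append((s, p, o))

    # Print the triples of one dataset with identifiers resolved to term texts
    some_dataset = next(iter(triples))
    lookup = terms[some_dataset]
    for s, p, o in triples[some_dataset]:
        print(lookup.get(s, s), lookup.get(p, p), lookup.get(o, o))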

Multi-Chunk Summarization

python ./code/multi-chunk-summarization/summary_generator.py -i <input_path> -o <output_path> -n <chunk_num> -k <chunk_size>
  • [-i|--input_path]: path to the input folder, usually the output folder of the previous step

  • [-o|--output_path]: path to the output folder

  • [-n|--chunk_num]: the maximum number of chunks retained in the summary

  • [-k|--chunk_size]: the maximum number of triples in a summarized chunk

The structure of the input folder:

    ./input_folder
    |--term.tsv
    |--triple.tsv

The generated files in the output folder:

    ./output_folder
    |--summary.tsv

The structure of the output file is as follows (a reading sketch follows the list):

  • summary.tsv: dataset_id\tchunk_id\tsubject_id\tpredicate_id\tobject_id
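For illustration, the sketch below (not part of the repository; same no-header assumption as above, and the path summary_folder/ is a placeholder for this step's output folder) groups the summary triples back into chunks and reports their sizes.

    from collections import defaultdict

    # Summary triples grouped by (dataset_id, chunk_id); each chunk holds at
    # most chunk_size triples selected from one representative entity.
    chunks = defaultdict(list)
    with open("summary_folder/summary.tsv", encoding="utf-8") as f:
        for line in f:
            dataset_id, chunk_id, s, p, o = line.rstrip("\n").split("\t")
            chunks[(dataset_id, chunk_id)].append((s, p, o))

    for (dataset_id, chunk_id), ts in list(chunks.items())[:5]:
        print(f"dataset {dataset_id}, chunk {chunk_id}: {len(ts)} triples")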

Chunk-based Dataset Reranking

We implement monoBERT, BGE, and BGE-reranker as dense reranking models; see the code in ./code/chunk-based-dataset-reranking/ for details. We use ranx to normalize and fuse the metadata-based and data-based relevance scores.
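As an illustration of the fusion step, the sketch below uses ranx to min-max normalize two TREC-format runs (one metadata-based, one data-based) and combine them with a weighted sum; the file paths and weights are placeholders, not the settings used in the paper.

    from ranx import Run, fuse

    # Hypothetical run files: one from metadata-based retrieval and one from
    # chunk-based (data-based) dense reranking, both in TREC format.
    metadata_run = Run.from_file("runs/metadata.txt", kind="trec")
    data_run = Run.from_file("runs/data_chunks.txt", kind="trec")

    # Min-max normalize both runs, then fuse them with a weighted sum.
    fused = fuse(
        runs=[metadata_run, data_run],
        norm="min-max",
        method="wsum",
        params={"weights": [0.5, 0.5]},  # placeholder weights
    )
    fused.save("runs/fused.txt", kind="trec")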


Evaluation

All results of the reranking experiments on the test collections NTCIR-E and ACORDAR are provided under ./data/results in TREC run format, for example:

    1 Q0 32907 1 1.3371933160530833 mixed
    1 Q0 31665 2 1.2344413177975981 mixed
    1 Q0 1670 3 0.816091260131519 mixed
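These run files can be scored with ranx (already listed in the requirements); the sketch below assumes a TREC-format qrels file whose path, like the run path, is a placeholder.

    from ranx import Qrels, Run, evaluate

    # Placeholder paths; substitute the qrels of NTCIR-E or ACORDAR and one of
    # the run files under ./data/results.
    qrels = Qrels.from_file("qrels/ntcir-e.txt", kind="trec")
    run = Run.from_file("data/results/mixed.txt", kind="trec")

    # Report a few standard ad hoc retrieval metrics.
    print(evaluate(qrels, run, ["ndcg@5", "ndcg@10", "map@5"]))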

Citation
