LDkp

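LDkp is the Long Document Keyphrase dataset developed by the Multimodal Digital Media Analysis Lab (MIDAS), Indraprastha Institute of Information Technology, New Delhi.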

The LDkp (Long Document Keyphrase) dataset is the first benchmark corpus of 1.3M documents for identifying keyphrases from long documents. The LDkp dataset is released in two versions:

LDkp3k: consists of 0.1M keyphrase-tagged long documents, created using keyphrases from KP20k (Meng et al., 2017) and their corresponding long document text from S2ORC (Lo et al., 2020).
LDkp10k: consists of 1.3M long documents with target keyphrases, created using keyphrases from OAGKX (Çano, 2019) and their corresponding long document text from S2ORC (Lo et al., 2020).
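Both versions pair keyphrases from one source with long document text from another. The snippet below is only a minimal sketch of that pairing idea on toy data. The `paper_id` field, the `TaggedDocument` class and the `pair_keyphrases_with_text` function are hypothetical names chosen for illustration; they are not the schema or the matching procedure used to build LDkp.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaggedDocument:
    """One long document paired with its target keyphrases (hypothetical schema)."""
    paper_id: str
    text: str
    keyphrases: List[str]


def pair_keyphrases_with_text(keyphrase_records: List[Dict],
                              fulltext_records: List[Dict]) -> List[TaggedDocument]:
    """Join keyphrase records to full-text records on a shared paper id.

    The "paper_id" key is an assumption made for this sketch, not the
    matching rule used to create the actual corpus.
    """
    text_by_id = {rec["paper_id"]: rec["text"] for rec in fulltext_records}
    paired = []
    for rec in keyphrase_records:
        text = text_by_id.get(rec["paper_id"])
        if text is None:
            continue  # no long document text found for this entry, so skip it
        paired.append(TaggedDocument(rec["paper_id"], text, list(rec["keyphrases"])))
    return paired


# Toy stand-ins for a keyphrase source and a full-text source.
keyphrase_records = [
    {"paper_id": "p1", "keyphrases": ["keyphrase extraction", "long documents"]},
    {"paper_id": "p2", "keyphrases": ["graph ranking"]},
]
fulltext_records = [
    {"paper_id": "p1", "text": "Full text of the first paper ..."},
]

docs = pair_keyphrases_with_text(keyphrase_records, fulltext_records)
print(len(docs), docs[0].keyphrases)
```

In this toy run only `p1` has long document text available, so `p2` is dropped from the paired output.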

Since both datasets contain a large number of examples, we provide three versions of the training split for each dataset, with sizes as shown below:

You can download the dataset from this link

Terms of Use

This corpus can be used freely for research purposes.
The paper listed below provides details of the creation and use of the corpus. If you use the corpus, please cite the paper.
If you are interested in commercial use of the corpus, send an email to research@midas.center.
If you use the corpus in a product or application, then please credit the authors and Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the corpus.
Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi, India disclaims any responsibility for the use of the corpus and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.
Rather than redistributing the corpus, please direct interested parties to this page.

Please feel free to send us an email:

with feedback regarding the corpus.
with information on how you have used the corpus.
if interested in having us analyze your data for emotion, and other affectual information.
if interested in a collaborative research project.

References

Please cite the following paper if you use this dataset:

@misc{mahata2022ldkp,
     title={LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents}, 
     author={Debanjan Mahata and Navneet Agarwal and Dibya Gautam and Amardeep Kumar and Swapnil Parekh and Yaman Kumar Singla and Anish Acharya and Rajiv Ratn Shah},
     year={2022},
     eprint={2203.15349},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}
