This repository contains the code to download and build the HPAC corpus, together with a set of simple baselines.
- Python 2.7
- requests 2.21.0
- bs4 4.7.1
- nltk 3.4
- hashedindex 0.4.4
- numpy 1.16.2
- tensorflow-gpu 1.13.1
- keras 2.2.4
- sklearn 0.20.3
- prettytable 0.7.2
- matplotlib 2.2.4
- tqdm
- stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
- a version of the crawler from https://github.com/smilli/fanfiction (included together with our code)
- python-tk
We recommend creating a virtualenv (e.g. virtualenv $HOME/env/hpac) so these packages do not interfere with other versions you might already have installed on your machine.
After activating the virtualenv, run install.sh to install the dependencies listed above automatically (tested on Ubuntu 18.04, 64 bits).
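For example (assuming a bash-compatible shell):

virtualenv $HOME/env/hpac
source $HOME/env/hpac/bin/activate
bash install.sh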
The file resources/hpac_urls.txt contains the URLs of the fanfiction stories that we used to build HPAC.
NOTE:
- Unfortunately, some stories might be deleted by users or admins after they have been published and completed, so it may not be possible to rebuild 100% of the corpus.
- Some stories might have been modified after the corpus was created. In both cases, the scripts in charge of generating HPAC will be unable to retrieve some samples.
- This corpus is built automatically and we have not censored the content of the stories. Some of them might contain inappropriate (e.g. sexually explicit) content.
First, crawl the fan fiction using the script scraper.py:
python scraper.py --output resources/fanfiction_texts/ --rate_limit 2 --log scraper.log --url resources/hpac_urls.txt
- `--output`: The directory where each fanfiction story will be written (the name of each file will be the ID of the story).
- `--rate_limit`: How fast to crawl fanfiction (in seconds between requests). To respect the ToS, this limit should correspond to the approximate speed at which you could crawl the stories manually. The value used in the example is illustrative.
- `--url`: The text file containing the URLs to crawl (e.g. `resources/hpac_urls.txt`).
- `--log`: The path of the file where URLs that could not be retrieved are logged.
Similar to https://github.com/smilli/fanfiction, the rate limit is set in order to comply with the fanfiction.net terms of service:
> E. You agree not to use or launch any automated system, including without limitation, "robots," "spiders," or "offline readers," that accesses the Website in a manner that sends more request messages to the FanFiction.Net servers in a given period of time than a human can reasonably produce in the same period by using a conventional on-line web browser.
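For illustration only (this is a hypothetical sketch, not the actual implementation of scraper.py), a rate-limited crawl loop in Python could look like this:

```python
import time
import requests

def fetch_urls(url_file, rate_limit=2):
    """Yield (url, html) pairs, pausing rate_limit seconds between requests."""
    with open(url_file) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
                yield url, response.text
            except requests.RequestException as exc:
                print("Could not retrieve %s: %s" % (url, exc))
            time.sleep(rate_limit)  # pause before the next request

# Usage: for url, html in fetch_urls("resources/hpac_urls.txt"): write html to disk
```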
Second, build an index (and a tokenizer) using the script index.py. This makes it possible to quickly create different versions of the corpus with different snippet lengths afterwards.
python index.py --dir resources/fanfiction_texts/ --spells resources/hpac_spells.txt --tools resources/ --stanford_jar resources/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar --dir_tok resources/fanfiction_texts_tok/
- `--dir`: The directory containing the fanfiction stories crawled by the script `scraper.py`.
- `--dir_tok`: The output directory where the tokenized stories are stored.
- `--spells`: The file containing the spells to take into account (e.g. `resources/hpac_spells.txt`).
- `--tools`: The output directory where the `index` and the `tokenizer` needed to create HPAC are stored.
- `--stanford_jar`: The path to `resources/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar`.
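Purely as an illustration of the kind of structure index.py builds, here is how an inverted index from tokens to stories can be sketched with the hashedindex package from the dependency list (the actual contents of resources/ff.index may differ):

```python
import hashedindex

# Illustration only: map each token to the stories (documents) containing it.
index = hashedindex.HashedIndex()
index.add_term_occurrence("expelliarmus", "story_12345")
index.add_term_occurrence("expelliarmus", "story_67890")

# get_documents returns a Counter of the documents where the term occurs
print(index.get_documents("expelliarmus"))
```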
Finally, we can create a version of HPAC with snippets of size x (e.g. 128) using the script create_hpac.py:
python create_hpac.py --dir_stories_tok resources/fanfiction_texts_tok/ --output hpac_corpus/ --window_size 128 --index resources/ff.index --hpac_train resources/hpac_training_labels.tsv --hpac_dev resources/hpac_dev_labels.tsv --hpac_test resources/hpac_test_labels.tsv
- `--dir_stories_tok`: The path to the directory containing the tokenized fanfiction.
- `--output`: The path to the directory where HPAC is stored.
- `--window_size`: An integer with the size of the snippet (number of tokens).
- `--index`: The path to `ff.index` (created in the previous step with `index.py`).
- `--hpac_train`: The file that will contain the IDs of the training samples (`resources/hpac_training_labels.tsv`).
- `--hpac_dev`: The file that will contain the IDs of the dev samples (`resources/hpac_dev_labels.tsv`).
- `--hpac_test`: The file that will contain the IDs of the test samples (`resources/hpac_test_labels.tsv`).
The script generates three files: hpac_training_X.tsv, hpac_dev_X.tsv, and hpac_test_X.tsv, where X is the size of the snippet. This is the HPAC corpus.
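To inspect or reuse a generated split in your own code, it can be read as plain tab-separated lines. This is a minimal sketch; the exact column layout (e.g. sample ID, spell label, snippet) should be checked against the generated files:

```python
def load_split(path):
    """Read one HPAC split as a list of tab-separated fields per line."""
    with open(path) as f:
        return [line.rstrip("\n").split("\t") for line in f]

rows = load_split("hpac_corpus/hpac_dev_128.tsv")
print("%d samples, %d columns per row" % (len(rows), len(rows[0])))
```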
As noted above, some stories might be deleted from the fanfiction website or updated, which makes the IDs for the affected samples invalid. To compare the generated corpus against the one used in the paper, you can use the script checker.py:
python checker.py --input hpac_corpus/hpac_dev_128.tsv --labels resources/hpac_dev_labels.tsv
- `--input`: The path to the generated version of a training, dev, or test set.
- `--labels`: The file containing the IDs of the training/dev/test samples (e.g. `resources/hpac_dev_labels.tsv`).
If you want to create a larger set, or simply use Harry Potter fanfiction (or other fanfiction) for other purposes, you can collect your own fanfiction URLs (users create new stories daily) and then run the previous scripts accordingly.
python get_fanfiction_links.py --base_url https://www.fanfiction.net/book/Harry-Potter/ --lang en --status complete --rating all --page 1 --output new_fanfiction_urls.txt --rate_limit 2
- `--base_url`: The URL from which to download fanfiction (we used https://www.fanfiction.net/book/Harry-Potter/).
- `--lang`: Download stories written in a given language (we used `en`).
- `--status`: Download fanfiction with a certain status (we used `complete`).
- `--rating`: Download fanfiction with a certain rating (we used `all`).
- `--rate_limit`: Make a request every x seconds.
- `--page`: Download links from page x.
- `--output`: The path where the URLs will be written.
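Under the hood, collecting URLs amounts to parsing listing pages. As a rough, hypothetical sketch using the bs4 dependency (fanfiction.net stories are addressed with /s/<id>/... links, but get_fanfiction_links.py may proceed differently):

```python
import re
import requests
from bs4 import BeautifulSoup

def story_links(listing_url, base="https://www.fanfiction.net"):
    """Illustration only: collect /s/<id>/... story links from one listing page."""
    soup = BeautifulSoup(requests.get(listing_url, timeout=30).text, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        if re.match(r"^/s/\d+/", a["href"]):
            links.add(base + a["href"])
    return sorted(links)
```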
You can train your model(s) using the run.py script:
python run.py --training hpac_corpus/hpac_training_128.tsv --test hpac_corpus/hpac_dev_128.tsv --conf resources/configuration.conf --model LSTM --S 2 --gpu 1 --timesteps 128 --dir models/
- `--training`: The path to the training file.
- `--test`: The path to the dev set used during training.
- `--dir`: The path to the directory where the models are stored/loaded.
- `--conf`: The path to the configuration file containing the hyperparameters for the different models (e.g. `resources/configuration.conf`).
- `--model`: The architecture of the model (`MLR`, `MLP`, `CNN`, `LSTM`).
- `--gpu`: The ID of the GPU to be used.
- `--timesteps`: This value should match the snippet window size of the version of HPAC you are using.
- `--S`: The number of models to train (we used 5 in our experiments).
Each trained model will be named HP_[MLR|MLP|CNN|LSTM]_timesteps_X, where timesteps is the value of --timesteps and X is the index of the trained model (e.g. HP_LSTM_128_2).
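Independently of run.py, a quick sanity-check baseline can be trained in a few lines of scikit-learn, in the spirit of the MLR baseline: a bag-of-words multinomial logistic regression. This hypothetical sketch reuses the load_split helper shown earlier and assumes the spell label is in the second column and the snippet in the last one:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = load_split("hpac_corpus/hpac_training_128.tsv")
dev = load_split("hpac_corpus/hpac_dev_128.tsv")

# Assumed column layout: row[1] is the spell label, row[-1] is the snippet.
vec = CountVectorizer()
X_train = vec.fit_transform([row[-1] for row in train])
X_dev = vec.transform([row[-1] for row in dev])

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=500)
clf.fit(X_train, [row[1] for row in train])
print("Macro F1: %.4f" % f1_score([row[1] for row in dev],
                                  clf.predict(X_dev), average="macro"))
```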
You can run your trained model(s) using run.py as well:
python run.py --test hpac_corpus/hpac_test_128.tsv --conf resources/configuration.conf --model LSTM --S 5 --predict --model_params models/HP_LSTM_128.params --model_weights models/HP_LSTM_128.hdf5 --gpu 1 --timesteps 128 --dir models/
- `--predict`: Flag indicating that the script runs in test mode.
- `--test`: The path to the test set.
- `--conf`: The path to the configuration file.
- `--S`: Evaluate the first n models created during training.
- `--model`: The architecture of the model (`MLR`, `MLP`, `CNN`, `LSTM`).
- `--model_params`: The path to the parameters file to be used by the model (omitting the index that identifies the n-th trained model from the previous step).
- `--model_weights`: The path to the weights file to be used by the model (omitting the index, as above).
- `--timesteps`: Number of timesteps (needed for sequential models).
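If you want to score predictions yourself, scikit-learn's metrics can be applied directly. In this minimal sketch, gold and predicted are hypothetical lists of spell labels that you collect from the outputs of run.py:

```python
from sklearn.metrics import accuracy_score, f1_score

def report(gold, predicted):
    """Hypothetical helper: gold/predicted are lists of spell labels."""
    print("Accuracy: %.4f" % accuracy_score(gold, predicted))
    print("Macro F1: %.4f" % f1_score(gold, predicted, average="macro"))
```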
This work has received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150).
David Vilares and Carlos Gómez-Rodríguez. 2019. Harry Potter and the Action Prediction Challenge from Natural Language. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). To appear.
If you have any suggestions, inquiries, or bugs to report, please contact david.vilares@udc.es.