Dev/tools repo for a project that mines scientific papers to construct graphs.
The bibliography has been moved to its own file.
If you want to run the code, you can find the corpus here. It consists of TEI XML files.
    +--------------------+
    |                    |
    |    PDF articles    |
    |                    |
    +---------+----------+
              |
    +---------v----------+
    |                    |
    |       GROBID       |
    |                    |
    +---------+----------+
              |
              |         +----------------------------------+          +------------------------------------+
              |         | generate_html_article_pages.py   |          | HTML pages with                    |
              |    +--->+                                  +--------->+ the text of the articles           |
              |    |    +----------------------------------+          | and important sentences in yellow  |
              |    |                                                  |                                    |
              |    |                                                  +------------------------------------+
              |    |
    +---------v----+-----+          +--------------------------------------+
    |                    |          | utilsperso.edit_idf()                |
    |   TEI XML files    +--------->+ (check the bottom of utilsperso.py:  |
    |                    |          | a few lines there allow              |
    +---------+----------+          | standalone launching of              |
              |                     | edit_idf())                          |
    +---------v-------------+       |                                      |
    |                       |       +-----------------+--------------------+
    | generate_gephi_csv.py |                         |
    |                       |                         |
    +---------+-------------+         +---------------v----------------+
              |                       | idf.pickle                     |
    +---------v----------+            | I've added an idf file in the  |
    | nodes.csv          |            | git repo for convenience, but  |
    | edges.csv          <------------+ a new one should be            |
    |                    |            | generated for each corpus      |
    +---------+----------+            |                                |
              |                       +--------------------------------+
    +---------v------------+
    | Aman's script        |
    | adds coordinates for |
    | similarity view      |
    | and similar nodes    |
    |                      |
    +---------+------------+
              |
    +---------v--------------------------+
    | convert_id_to_tile.py              |
    | Aman's script gives similar nodes  |
    | as IDs; this converts them to      |
    | node labels                        |
    |                                    |
    +---------+--------------------------+
              |
        +-----v-----+
        |   Gephi   |
        +-----+-----+
              |
       +------v------+
       |GEXF XML file|
       +------+------+
              |
    +---------v------------------------+
    | this JavaScript website          |
    | https://github.com/raphv/gexf-js |
    | with small changes               |
    +----------------------------------+
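The "TEI XML files to idf.pickle" step in the diagram can be sketched roughly as follows. This is only an illustration, not the code in utilsperso.py: the helper names `tei_body_text`, `compute_idf` and `save_idf`, the naive tokenization, and the plain log(N / df) formula are all assumptions.

```python
import math
import pickle
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace, in ElementTree's {uri}tag notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_body_text(tei_xml):
    """Return the whitespace-normalized text of the <body> of a TEI document."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())


def compute_idf(documents):
    """Return {term: idf} for a list of document strings.

    Assumption: idf = log(N / df) with lowercase word tokens;
    the real edit_idf() may tokenize or weight differently.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each term at most once per document.
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def save_idf(idf, path):
    """Pickle the idf dictionary (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(idf, handle)
```

A term that appears in every document gets an idf of 0, which is why a new idf file should be generated for each corpus: the weights depend entirely on the document set.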
generate_aman_features.extract_acm()
glove: /home/sam/work/glove
necessary to make anything else run
generate_html_article_pages.py: creates an HTML page for each article, with the main sentences highlighted in yellow.
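A minimal sketch of the highlighting idea, assuming the article is already split into sentences and the important ones are known by index. The function name `render_article` and the inline-style markup are hypothetical; the actual script may build its pages differently.

```python
import html


def render_article(sentences, important):
    """Build a bare HTML page from a list of sentences.

    `important` is a set of sentence indices to highlight in yellow.
    """
    parts = []
    for index, sentence in enumerate(sentences):
        # Escape the text so characters like < and & display correctly.
        text = html.escape(sentence)
        if index in important:
            text = f'<span style="background-color: yellow">{text}</span>'
        parts.append(text)
    return "<html><body><p>" + " ".join(parts) + "</p></body></html>"


page = render_article(["We study graphs.", "Our method is faster."], {1})
```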
convert_id_to_tile.py: for the most similar nodes added by Aman's script, replaces each node ID with its label.
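The ID-to-label replacement amounts to a dictionary lookup. The sketch below assumes nodes.csv uses Gephi's usual `Id` and `Label` column names and that the similar nodes arrive as a list of IDs; both are assumptions, and `load_labels` and `ids_to_labels` are hypothetical names.

```python
import csv
import io


def load_labels(nodes_csv):
    """Map node ID to label from an open nodes.csv file object.

    Assumes `Id` and `Label` header columns.
    """
    reader = csv.DictReader(nodes_csv)
    return {row["Id"]: row["Label"] for row in reader}


def ids_to_labels(similar_ids, labels):
    """Replace each ID by its label; unknown IDs are kept unchanged."""
    return [labels.get(node_id, node_id) for node_id in similar_ids]


# Usage with an in-memory CSV standing in for nodes.csv.
example_nodes = io.StringIO("Id,Label\n1,Paper A\n2,Paper B\n")
example_labels = load_labels(example_nodes)
example_result = ids_to_labels(["2", "1"], example_labels)
```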
For the paper: generates the .dat files that Aman uses to run the experiments.