Paper Graph

Development and tooling repository for a project that mines scientific papers to build graphs.

The bibliography has been moved to its own file, REFERENCES.md.

If you want to run the code, you can find the corpus here. It consists of TEI XML files.
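As a rough illustration of what reading one of these TEI XML files can look like, here is a minimal sketch using only the standard library. It assumes GROBID-style TEI in the standard TEI namespace. The function name `extract_title_and_paragraphs` and the sample document are hypothetical and are not taken from this repository.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, used by GROBID output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_title_and_paragraphs(tei_xml):
    """Return (title, list of paragraph texts) from a TEI XML string.

    Hypothetical helper: the real scripts may read other elements.
    """
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
    ]
    return title, paragraphs


# Tiny made-up document, only here to show the call.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample paper</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><p>First paragraph.</p><p>Second <hi>one</hi>.</p></div></body></text>
</TEI>"""

title, paragraphs = extract_title_and_paragraphs(SAMPLE)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.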

Working pipeline

From canceropole PDF articles to the website showing the graph

 +--------------------+
 |                    |
 |   PDF articles     |
 |                    |
 +---------+----------+
           |
 +---------v----------+
 |                    |
 |     GROBID         |
 |                    |
 +---------+----------+
           |
           |        +----------------------------------+          +------------------------------------+
           |        |  generate_html_article_pages.py  |          | html pages with                    |
           |    +--->                                  +--------->+ the text of the articles           |
           |    |   +----------------------------------+          | and important sentences in yellow  |
           |    |                                                 |                                    |
           |    |                                                 +------------------------------------+
           |    |
 +---------v----+-----+          +--------------------------------------+
 |                    |          | utilsperso.edit_idf()                |
 |      TEI XML files +--------->+ (check the bottom of utilsperso.py   |
 |                    |          | there's a few lines that allow       |
 +--------+-----------+          | standalone launching of              |
          |                      | edit_idf()                           |
+---------v-----------+          |                                      |
|                     |          +-----------------+--------------------+
| generate_gephi_csv.py                            |
|                     |                            |
+----------+----------+            +---------------v----------------+
           |                       | idf.pickle                     |
 +---------v----------+            | I've added an idf file in the  |
 |  nodes.csv         |            | git for convenience, but       |
 |  edges.csv         <------------+ a new one should be            |
 |                    |            | generated for each corpus      |
 +---------+----------+            |                                |
           |                       +--------------------------------+
 +---------v-----------+
 | aman's script       |
 | adds coordinates for|
 | similarity view     |
 | and similar nodes   |
 |                     |
 +----------+----------+
            |
 +----------v-------------------------+
 |  convert_id_to_title.py            |
 |  Aman's script gives similar nodes |
 |  as ID. this converts to           |
 |  node label                        |
 |                                    |
 +---------+--------------------------+
           |
     +-----v-----+
     |   Gephi   |
     +-----+-----+
           |
    +------v------+
    |GEXF XML file|
    +------+------+
           |
+----------v-----------------------+
| this javascript website          |
|https://github.com/raphv/gexf-js  |
| with small changes               |
+----------------------------------+
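
The diagram notes that a new idf.pickle should be generated for each corpus. The sketch below shows one plausible way to compute inverse document frequencies and store them with pickle. It is not the code in utilsperso.py: the whitespace tokenisation, the `log(N / df)` formula, and the names `compute_idf`, `save_idf` and `load_idf` are all assumptions made for illustration.

```python
import math
import pickle
from collections import Counter


def compute_idf(documents):
    """Map each term to log(N / df), where df is its document frequency.

    Assumption: lowercased whitespace tokens; the real pipeline may
    tokenise and weight terms differently.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # set() so that a term counts once per document
        doc_freq.update(set(text.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def save_idf(idf, path):
    """Write the IDF dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(idf, f)


def load_idf(path):
    """Read an IDF dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up three-document corpus.
idf = compute_idf(["cancer cell growth", "cell graph", "graph mining cell"])
```

A term that appears in every document, such as "cell" here, gets an IDF of zero, which is why the values depend on the corpus they were computed from.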

For the paper, from several corpora (GSM, DBLP, ACL anthology) to .dat files

generate_aman_features.extract_acm()
glove : /home/sam/work/glove

Main scripts and useful stuff

generate_gephi_csv.py: main script for canceropole. Takes a folder of TEI XML files generated by GROBID and outputs nodes.csv and edges.csv, ready for Gephi.
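
To make the nodes.csv and edges.csv output more concrete, here is a minimal sketch of writing the two files with the column headers that Gephi's spreadsheet import recognises (`Id`, `Label` for nodes; `Source`, `Target`, `Weight` for edges). The exact columns written by generate_gephi_csv.py may differ, and `write_nodes` and `write_edges` are hypothetical names.

```python
import csv
import io


def write_nodes(nodes, fileobj):
    """Write (id, label) pairs as a Gephi nodes table."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Label"])
    for node_id, label in nodes:
        writer.writerow([node_id, label])


def write_edges(edges, fileobj):
    """Write (source, target, weight) triples as a Gephi edges table."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, weight])


# In-memory buffers stand in for nodes.csv and edges.csv.
nodes_buf, edges_buf = io.StringIO(), io.StringIO()
write_nodes([(1, "Paper A"), (2, "Paper B")], nodes_buf)
write_edges([(1, 2, 0.5)], edges_buf)
```

To write real files, pass `open("nodes.csv", "w", newline="")` instead of a `StringIO` buffer.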

utilsperso.py: required by all the other scripts.

generate_html_article_pages.py: creates an HTML page for each article, with the main sentences highlighted in yellow.
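
The highlighting step can be pictured with the following sketch, which wraps selected sentences in a yellow-background span. How the real script selects the main sentences is not shown here; the set of important sentence indices is simply passed in, and `render_article_html` is a hypothetical name.

```python
import html


def render_article_html(title, sentences, important):
    """Build a minimal HTML page; sentences whose index is in
    `important` get a yellow background."""
    parts = []
    for index, sentence in enumerate(sentences):
        escaped = html.escape(sentence)  # keep <, > and & from breaking the page
        if index in important:
            escaped = f'<span style="background-color: yellow">{escaped}</span>'
        parts.append(escaped)
    body = " ".join(parts)
    return (
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><p>{body}</p></body></html>"
    )


# Made-up article with the second sentence marked as important.
page = render_article_html("Sample", ["A < B.", "Key result."], {1})
```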

convert_id_to_title.py: for the most similar nodes added by Aman's script, replaces the ID of each node with its label.
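
A minimal sketch of this ID-to-label replacement is given below. It assumes a nodes table with `Id` and `Label` columns and a similar-nodes field that stores IDs separated by semicolons. Both the column names and the separator are assumptions, as are the names `load_labels` and `ids_to_labels`.

```python
import csv
import io


def load_labels(nodes_csv):
    """Read a nodes table and return a dict mapping Id to Label."""
    reader = csv.DictReader(nodes_csv)
    return {row["Id"]: row["Label"] for row in reader}


def ids_to_labels(similar_ids, labels, sep=";"):
    """Replace each ID in a separated string with its label.

    Unknown IDs are left unchanged rather than dropped.
    """
    return sep.join(labels.get(i, i) for i in similar_ids.split(sep) if i)


# Made-up nodes table held in memory.
labels = load_labels(io.StringIO("Id,Label\n1,Paper A\n2,Paper B\n"))
converted = ids_to_labels("2;1", labels)
```

Leaving unknown IDs in place makes missing nodes easy to spot in the output.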

generate_aman_features.py: for the paper, generates the .dat files that Aman uses to run the experiments.
