Lakshmi Devi Priya edited this page Jul 9, 2020 · 4 revisions

From Andy Jackson:

I'm trying to understand the data flow of ami-search @Peter Murray-Rust -- am I right in thinking it goes:
  1. Scan text and generate snippets XML per item.
  2. Read snippets XML and generate frequencies/counts XML etc. per item.
  3. Read per-item XML data and generate top-level/summary XML or CSV (latter for co-occurrence data).
  4. Generate HTML versions of XML and CSV for use.
In particular, am I right in thinking that all the outputs are generated from the snippets?

ami-search processes a CProject and iterates over each CTree.

  • It creates scholarly.html. Probably from "pseudo-make" where it skips
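
The iteration over a CProject can be pictured with a short sketch. It assumes, as in getpapers output, that a CProject is a directory whose immediate subdirectories are the CTrees, and that scholarly.html sits at the top level of each CTree. The function name list_ctrees is hypothetical and not part of ami, and any directories that ami itself adds at project level are not filtered out.

```python
from pathlib import Path


def list_ctrees(cproject):
    """Return (ctree_name, has_scholarly_html) pairs for a CProject directory.

    Assumption: every immediate subdirectory of the CProject is a CTree.
    Plain files at project level (logs, result listings) are ignored.
    """
    root = Path(cproject)
    result = []
    for child in sorted(root.iterdir()):
        if child.is_dir():
            has_html = (child / "scholarly.html").is_file()
            result.append((child.name, has_html))
    return result


# Example use (the directory name is taken from the test run described on this page):
# for name, done in list_ctrees("countr_dict"):
#     print(name, "scholarly.html present" if done else "scholarly.html missing")
```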

Running ami search for the "country dictionary"

Tester: Ambreen Hamadani

The ami search tool was used to test the country dictionary.

  1. getpapers was used to create a directory of 1000 papers (including full texts wherever available): getpapers -q "viral epidemics" -o countr_dict -f v_epid/log.txt -x -p -k 1000

  2. This directory was used to run ami search with the country dictionary: ami -p countr_dict search --dictionary country

  3. After a successful run, HTML documents were created that classified the papers by country and gave the frequency of each country.
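
Steps 1 and 2 can be chained from a script. The following is a minimal sketch: it only assembles the argument lists exactly as they appear above and prints them. Actually running them (for example with subprocess.run) assumes getpapers and ami are installed and on the PATH. The function names are hypothetical.

```python
import shlex


def getpapers_command(query, outdir, logfile, limit):
    # Flags copied from step 1: -x and -p request XML and PDF full texts,
    # -k caps the number of papers.
    return ["getpapers", "-q", query, "-o", outdir, "-f", logfile,
            "-x", "-p", "-k", str(limit)]


def ami_search_command(cproject, dictionary):
    # The project option (-p) comes before the search subcommand, as in step 2.
    return ["ami", "-p", cproject, "search", "--dictionary", dictionary]


steps = [
    getpapers_command("viral epidemics", "countr_dict", "v_epid/log.txt", 1000),
    ami_search_command("countr_dict", "country"),
]

# Print each command as a shell-quoted line instead of executing it.
for argv in steps:
    print(shlex.join(argv))
```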

ISSUES:

  • ami search fails unless the project directory (CProject) is specified before the search subcommand. For example, the command ami search --dictionary country -p countr_dict1 throws the following error:
================================
-v to see generic values

Specific values (AMISearchTool)
================================
created COMMAND: word(frequencies)xpath:@count>20~w.stopwords:pmcstop.txt_stopwords.txt search(country) search(-p) search(countr_dict1)
0    [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool  - old style search command); to be changed
0 [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool  - old style search command); to be changed
>ERROR: requires cProject

The correct command, in this case, is: ami -p countr_dict1 search --dictionary country
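
A wrapper script could catch this ordering mistake before ami is launched. The check below is a sketch based only on the behaviour reported above (that -p must precede the search subcommand); project_before_subcommand is a hypothetical helper, not part of ami.

```python
def project_before_subcommand(argv, subcommand="search"):
    """Return True if -p appears before the subcommand in an ami argument list."""
    if "-p" not in argv or subcommand not in argv:
        return False
    return argv.index("-p") < argv.index(subcommand)


# The two command lines from the issue report, split into argument lists.
failing = "ami search --dictionary country -p countr_dict1".split()
working = "ami -p countr_dict1 search --dictionary country".split()

print(project_before_subcommand(failing))
print(project_before_subcommand(working))
```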

Running ami search for the "disease dictionary"

Tester: Lakshmi Devi Priya

  1. A large corpus of 950 articles with XML and PDF files was created (for a mini-project) using the command getpapers -q "viral epidemics AND human NOT COVID NOT corona virus NOT SARS-Cov-2" -o mpc -f mpc/log.txt -x -p -k 950.
  2. The corpus was segmented into 4 subfolders, each containing 200-250 CTree folders.
  3. ami search was run on each subfolder using the disease dictionary. The command used for the first subfolder was ami -p 1-subfolder search --dictionary disease.
  4. The run printed warnings and debug messages. XML documents and HTML DataTables were created in the subfolder, listing the disease dictionary terms with their counts, along with the frequencies of the words that occur in the articles.

The HTML DataTable looked like this: https://drive.google.com/file/d/112nZnbZk-duJGQ88-NvNcIv7ItuA0k_Q/view?usp=sharing

ISSUES:

  1. Initially, ami search was run on the complete 950-article corpus. It created HTML files for some CTree folders, but then failed with the errors below.
Caused by: java.lang.OutOfMemoryError: Java heap space
544001 [main] ERROR org.contentmine.cproject.args.DefaultArgProcessor  - ERR! java.lang.RuntimeException: cannot run [runTransform] in --transform (OutOfMemoryError: Java heap space)
PMC7259790 java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[...]
  1. To rectify the OutOfMemoryError, set the environment variable MAVEN_OPTS using the command set MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m (with no space before the = sign).
  2. The CProject (mpc) was segmented into 4 subfolders, each containing 200-250 CTree folders.
  3. The ami search command above was then run on each subfolder.
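
The segmentation step can be scripted. The sketch below assumes every immediate subdirectory of the CProject is a CTree and moves them into numbered subfolders. The function name split_cproject and the naming pattern 1-subfolder, 2-subfolder, ... (modelled on the 1-subfolder name used above) are illustrative choices, not something ami requires.

```python
import math
import shutil
from pathlib import Path


def split_cproject(cproject, parts):
    """Move the CTree directories of a CProject into at most `parts` subfolders.

    Assumption: every immediate subdirectory of the CProject is a CTree.
    Plain files at project level (for example logs) are left where they are.
    """
    root = Path(cproject)
    # Collect the CTrees before any subfolder is created.
    ctrees = sorted(p for p in root.iterdir() if p.is_dir())
    if not ctrees:
        return []
    size = math.ceil(len(ctrees) / parts)  # CTrees per subfolder
    created = []
    for start in range(0, len(ctrees), size):
        target = root / f"{start // size + 1}-subfolder"
        target.mkdir()
        for ctree in ctrees[start:start + size]:
            shutil.move(str(ctree), str(target / ctree.name))
        created.append(target)
    return created
```

With 950 CTrees and parts=4 this gives three subfolders of 238 CTrees and one of 236, within the 200-250 range used here. Each subfolder can then be passed to ami with -p, as in ami -p 1-subfolder search --dictionary disease.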