Home

Welcome to the CEVOpen wiki! This page outlines the key components of the project. It's intentionally kept short. If you wish to know more, you can browse through the wiki pages of CEVOpen and openVirus.

1. Main components of intern activity:
- 1.1. Interns and Roles
- 1.2. Mini-projects
2. Technology
3. Prerequisites
- 3.1. Install
  - 3.1.1. pygetpapers (https://github.com/petermr/pygetpapers)
  - 3.1.2. ami_gui.py
- 3.2. git clone
4. Overall Goal
5. Meeting Records
- 5.1. Biweekly Meetings
- 5.2. Coding Sessions
6. Outreach
7. Code of Conduct

1. Main components of intern activity:

Currently, our projects are based on building dictionaries. Each intern has one or more dictionaries, usually relevant to essential oils.

1.1. Interns and Roles

| Interns | Dictionary | Mini-Project | Other Roles | Date of Joining | Date of Leaving |
|---|---|---|---|---|---|
| Radhu Kantilal Ladani @Radhu903 | eoPlant, activity | Medicinal Activity | Record keeping | 04-01-2021 | 30-06-2021 |
| Kanishka Parashar @Kanishkaparashar03 | invasive_plant | Invasive Species | | 11-01-2021 | 30-06-2021 |
| Talha Hasan @tlhahsn | plant_compound, eoplant_material_history | Plant Chemical Compounds | | 01-02-2021 | 31-07-2021 |
| Vasant Kumar @950vasant | eoplant_part, gene | Plant Genes | | 01-02-2021 | 31-07-2021 |
| Dr. Sagar Sudam Jadhav @sasujadhav1 | | | Dictionary Manager | 18-05-2021 | 31-10-2021 |
| Chaitanya Kumar @chaitshar | | | Record keeping (coding sessions) | 01-06-2021 | 01-08-2021 |
| Bhavini Malhotra @malhotra-bhavini | | | | 25-05-2021 | 30-09-2021 |
| Ayush Garg (Volunteer) @ayush4921 | | | Developing pygetpapers | | |
| Shweata N. Hegde @ShweataNHegde | organization, plant_genus, ethics_statement | Ethics Statement | Project Manager, Record Keeping, Outreach | 01-06-2021 | 01-09-2021 |

1.2. Mini-projects

pose_research question
research previous work
repeat {
    initial search
    build dictionary
} until useful or not_useful
build minicorpus
analyze search_results in light of hypothesis
run public demo

For technology, it's more like:

outline goal
research previous work
repeat {
    test tool
    document
    add to toolkit if useful
} until useful_collection
run public demo

chemotype
genotype
activities (medicinal)
phenotype - invasive species
integration - how these fit together - an atlas

2. Technology

Tools include:

APIs for repositories such as EPMC, bioRxiv preprints, and thesis collections.
Scrapers for semi-structured sites such as journals
standardised metadata (e.g. JATS)
PDF and HTML readers => XML or JSON
article sectioning (e.g. into JATS categories)
extraction of floats (tables, maps, images, diagrams, chemistry, maths*)
display and navigation of sections in a paper
aggregated statistics and machine learning
multilingual annotation (using Wikidata)
linking to the Wikidata knowledge graph
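
As a rough illustration of what "article sectioning" means for a JATS document, the sketch below lists the nested section titles of a tiny, hand-written JATS-like article using only the Python standard library. The sample XML and the function name `list_sections` are made up for this example; this is not how `ami` does it, just a minimal sketch of the idea.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written JATS-like article, used only for illustration.
SAMPLE_JATS = """<article>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Methods</title><p>Hydrodistillation was used.</p>
      <sec><title>GC-MS analysis</title><p>Details.</p></sec>
    </sec>
    <sec><title>Results</title><p>Main compounds found.</p></sec>
  </body>
</article>"""


def list_sections(jats_xml):
    """Return (depth, title) pairs for every <sec> in the article body."""
    root = ET.fromstring(jats_xml)
    sections = []

    def walk(element, depth):
        # Only direct <sec> children, so nesting depth is tracked correctly.
        for sec in element.findall("sec"):
            title = sec.findtext("title", default="(untitled)")
            sections.append((depth, title.strip()))
            walk(sec, depth + 1)

    body = root.find("body")
    if body is not None:
        walk(body, 1)
    return sections
```

Calling `list_sections(SAMPLE_JATS)` gives one entry per section, with the depth showing that "GC-MS analysis" sits inside "Methods". Real articles will need more care (namespaces, untitled sections, back matter).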

2.1. `(py)getpapers`, `ami`

pygetpapers is a scraper developed in Python by Ayush Garg. It is based on getpapers (https://github.com/ContentMine/getpapers), which was written in Node.js. pygetpapers downloads scientific papers, primarily from the EuropePMC repository. You can read more about it here.
pyami (still a prototype; needs more documentation) is currently being developed by Peter Murray-Rust. It's a new Python-based, open-source, universal reader and analyser for scientific literature. Source code can be found here.

2.2. How are the dictionaries created?

Most dictionaries are created from Wikidata SPARQL queries. You can take a look at individual dictionary wiki pages to know more.
You can also refer to this slide deck to understand the basics.
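
To give a feel for what a dictionary-building query looks like, here is a minimal sketch that builds a Wikidata SPARQL query and the matching request URL with the standard library. Assumptions to check before relying on it: the class identifier `Q11173` (believed to be "chemical compound"), the helper names, and the User-Agent string are all illustrative and do not come from the project's own queries, which are documented on the individual dictionary pages.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(class_qid, languages="en", limit=10):
    """Build a query for items that are an instance of (P31) the given class."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{languages}". }}
}}
LIMIT {int(limit)}
""".strip()


def build_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return WIKIDATA_SPARQL_ENDPOINT + "?" + params


def fetch_terms(query):
    """Run the query (needs network access) and return (item URI, label) pairs."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "cevopen-wiki-example/0.1"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [
        (row["item"]["value"], row["itemLabel"]["value"])
        for row in data["results"]["bindings"]
    ]
```

For example, `fetch_terms(build_query("Q11173", languages="en,hi", limit=5))` would ask for five items with labels in English, falling back to Hindi. The rows returned would still need cleaning before they become dictionary entries.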

2.3. Reporting Errors

When you have problems, please try to be as constructive and informative as possible. Mailing developers (PMR or Ayush) with "It doesn't work, please tell me what to do" is useless.

RULE. No one can fix your bug unless you describe it. So, describe it:

What are you trying to do? In precise terms: "I am searching corpus X using program Y according to instructions Z."
What is your environment? "I am on a Windows 10 machine with 16 GB RAM."
Is this a new problem? "Yes, it has just appeared today. It worked yesterday."
Can you reproduce it? "Yes." DO OTHER PEOPLE REPRODUCE THE SAME ERROR?
What did you do? "I ran myprog on the command line:"

myprog -boo mydata.dat

What happened? "I immediately got this stack trace:"

<output>
<stack trace goes here>

ONLY use screenshots when it's a graphical program. Never use photographs.
Were there any other messages?

"cannot find data file:" myfile.dat

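The "What is your environment?" part of a report can be collected automatically. The following is a minimal sketch using only the Python standard library. The function name `environment_report` is made up for this example, and it does not report RAM, which you would still need to add by hand.

```python
import platform
import sys


def environment_report():
    """Return a short, paste-able description of the current environment."""
    lines = [
        f"Operating system: {platform.system()} {platform.release()}",
        f"Machine: {platform.machine()}",
        f"Python version: {platform.python_version()}",
        f"Python executable: {sys.executable}",
    ]
    return "\n".join(lines)
```

Run `print(environment_report())` and paste the output into your report as text, together with the exact command and the stack trace.
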
3. Prerequisites

Python is essential to run all of our software. Ensure you've installed it before proceeding further.

3.1. Install

3.1.1. `pygetpapers` (https://github.com/petermr/pygetpapers)

Run the following command on your command line to install pygetpapers:

pip install git+https://github.com/petermr/pygetpapers

If you have trouble installing using this method, you can find alternatives here.
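
After installing, a quick check like the sketch below can tell you whether Python can see the package. It assumes the install provides an importable module and a command-line script both named `pygetpapers`; if the names differ in your version, pass the right ones as arguments. The function name `check_install` is made up for this example.

```python
import importlib.util
import shutil


def check_install(module_name="pygetpapers", command_name="pygetpapers"):
    """Report whether a module is importable and a command is on the PATH."""
    return {
        # find_spec returns None when a top-level module cannot be found.
        "module_found": importlib.util.find_spec(module_name) is not None,
        # which returns None when no executable of that name is on the PATH.
        "command_on_path": shutil.which(command_name) is not None,
    }
```

If `check_install()` reports `False` for either entry, include that result in your bug report along with the output of the `pip install` command.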

3.1.2. `ami_gui.py`

git clone https://github.com/petermr/openDiagram.git
Although ami_gui.py runs from the command line, you will have to edit the source code to point the software to where the projects outlined below are located on your local machine. PyCharm is recommended for editing the source code.

3.2. `git clone`

The project has gradually expanded and branched out into different research areas, so our work is spread across several repositories. These repositories hold the latest dictionaries, mini-corpora and software. To run ami_gui.py, you will have to clone (i.e., download to your local machine) the following repositories:

openVirus (https://github.com/petermr/openVirus.git)
dictionary (https://github.com/petermr/dictionary.git)
CEVOpen (https://github.com/petermr/CEVOpen.git)
openDiagram (https://github.com/petermr/openDiagram.git)
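
If you prefer to clone all four repositories in one go, the following sketch builds and runs the `git clone` commands from Python. It assumes `git` is installed and on your PATH. The function names and the workspace directory are made up for this example and are not something `ami_gui.py` requires.

```python
import subprocess
from pathlib import Path

# Repository URLs as listed above.
REPOSITORIES = [
    "https://github.com/petermr/openVirus.git",
    "https://github.com/petermr/dictionary.git",
    "https://github.com/petermr/CEVOpen.git",
    "https://github.com/petermr/openDiagram.git",
]


def clone_commands(workspace):
    """Return one 'git clone <url> <target>' argument list per repository."""
    workspace = Path(workspace)
    commands = []
    for url in REPOSITORIES:
        name = url.rsplit("/", 1)[-1]
        if name.endswith(".git"):
            name = name[: -len(".git")]
        commands.append(["git", "clone", url, str(workspace / name)])
    return commands


def clone_all(workspace):
    """Clone every repository into the workspace, skipping existing folders."""
    Path(workspace).mkdir(parents=True, exist_ok=True)
    for command in clone_commands(workspace):
        target = Path(command[-1])
        if target.exists():
            print(f"skipping {target} (already exists)")
            continue
        subprocess.run(command, check=True)
```

For example, `clone_all("cevopen_workspace")` would place the four repositories side by side in a folder called `cevopen_workspace` (a hypothetical name). Keeping them in one parent folder makes it easier to point `ami_gui.py` at them later.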

4. Overall Goal

To build a multilingual semantic Atlas of Volatile Phytochemistry.[1]

4.1. Subgoals

To build Open Source multiplatform tools which can discover, aggregate, clean, and semantify scholarly documents containing significant amounts of phytochemical VOCs[2]. Documents will describe the extraction and assay of oils, optionally with properties and activities.

4.2. About CEVOpen

Phytochemistry is the key component of this project and, in the main, we will be analysing:

compounds (mainly VOC). Includes synonyms, structures, images
plants that create VOC/essential oils, again many synonyms, includes images
locations where the plant was harvested
activities reported for the oils
organizations involved

We will be analysing corpora for instances of the above, first manually, to validate the process, and then automatically.

[*] not included in CEVOpen but extensible in future
[1] we need an engaging title. "Atlas" is often extended beyond maps (e.g. Atlas of The Human Body). For example, plantPart is an atlas of the plant. It works for me but may confuse others. Here are some ideas:

"Compendium of ..."
"Semantic Essence of phytochemistry". Essence == central meaning, and also volatiles
But please think creatively.

[2] Volatile Organic Compound

4.4. Required actions:

Coordination of EO-related and general dictionaries - conformance to a common standard.
Validation of gold-standard minicorpora (e.g. for training and validating machine learning)
If you are interested in contributing to the project on the Machine Learning front, you can take a look at the Our-Project-and-Machine-Learning page.

4.5. Update (2021-06-06):

We have a new set of interns joining us. Here we are summarizing goals for the next 6 months:

4.5.1. Goals and Objectives

We have been joined by Chaitanya Sharma, Bhavini Malhotra and Sagar Jadhav, and we are hoping to appoint another intern (InternX) shortly. The goal of these 6 months is to consolidate our current dictionaries, corpora, and code, and then to explore how they can be used. We'll think of this as a guide to the phytochemistry of essential oils ("Atlas", "Compendium", etc.). Each of you will be creating a specific part of this and/or coordinating and customising it for a wide range of audiences.

High-level objectives (which will need prioritising):

create
- tools and
- knowledge for phytochemistry of essential oils
carry out initial scoping research
create outreach materials and events

4.5.2. Roles

We have just about enough tools to start semi-automating the process, but they will need refining. That is the role of:

Chaitanya - section and entity recognition in papers (including PDFs). This is critical for scaling up and for extracting paragraphs and sentences related to all our dictionaries.
Bhavini - presentation and analysis of data. This is downstream, so it consumes AMI output and will hopefully reveal patterns. This can lead to a "Phytochemical Atlas" linking plants, chemistry and geography. Both Bhavini and Chaitanya have good software experience and will help with code testing, documentation and maintenance.
Shweata - coordinating this work, writing linking material, etc. Her work on the Ethics Statement will support text analysis and may link up with Chaitanya's. She can also work with Bhavini on presentation. All of these are transferable skills.
Sagar - building a plant_gene/gene_ dictionary. This is more speculative than previous dictionaries since the literature is more dispersed and variable than the EO->compounds->activity papers we have found so far. Also, the nomenclature is more challenging. He will also soak-test (sic, not smoke) the current dictionaries and (manually at first) validate the results.
New intern (X) - we expect a range of skills in both informatics and bioscience, with a multidisciplinary approach rather than a concentration on one area of science. We need someone to help pull the data together, clean it and look for patterns: a disciplined worker with good communication skills, flexible and competent in data handling, rather than someone with specific bio/chemical knowledge.

All will be heavily involved in testing the AMI framework, including pygetpapers. Those experienced in the software will help to choose the systems we use.

4.6. Tasks in 2022

Create simple documentation and tutorials with examples to access the tools for new interns.
Make dictionaries for:
1. Plants
2. Species
3. Terpene synthases
4. Chemicals
Create a word list of chemicals and phytochemicals.
Build a classifier which differentiates between compounds and enzymes (using ML).
Text extraction from images.

5. Meeting Records

We meet biweekly to discuss strategy and work. We also meet informally once a week to review code.

5.1. Biweekly Meetings

Radhu maintained records of all meetings from the beginning of her term: https://github.com/petermr/dictionary/wiki/Meeting-Records

Bhavini has been maintaining the recordings since 2021-06-07.

5.2. Coding Sessions

We now hold regular coding sessions every week. Chaitanya is going to maintain records of these meetings: https://github.com/petermr/CEVOpen/wiki/Coding-Sessions:-Meeting-Record

6. Outreach

We've presented our work (mostly on openVirus) at various venues, including WikiCite, COAR and BarCamp. You can take a look at our Outreach page. If you're new, looking at our presentations is probably the best way to start understanding the pipeline.

New Outreach content from 11 Oct 2021 is here on the CEVOpen Wiki: Outreach

7. Code of Conduct

All interns, volunteers and contributors should adhere to the code of conduct, outlined here. Basically, it says "be respectful and helpful towards everyone".
