
# Machine learning

These pages will include tools, experiences, and tutorials on how to do simple ML on our corpora.

## Scope

openVirus will almost certainly be limited to working with words (i.e. not images, speech, etc.).

There is no magic.

ML requires good data, good conceptual models, good annotation, and constant testing and re-modelling. It may well require new code. The amount that can be done in a month is limited. Nonetheless we can make useful discoveries in technology and hopefully some initial categorization.

## Classification

Most of the operations will be classification. Wikipedia explains it well (https://en.wikipedia.org/wiki/Statistical_classification).

In statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function.

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

We are likely to start with binary classification ("viral-epidemic" true/false) and move on to categorical (e.g. type of section).

## Features

From Wikipedia:

Feature vectors

Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance. Each property is termed a feature, also known in statistics as an explanatory variable (or independent variable, although features may or may not be statistically independent). Features may variously be binary (e.g. "on" or "off"); categorical (e.g. "A", "B", "AB" or "O", for blood type); ordinal (e.g. "large", "medium" or "small"); integer-valued (e.g. the number of occurrences of a particular word in an email); or real-valued (e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).

PMR: My experience is that it pays to have good feature extraction before the ML. That's why dictionaries and sectioning are so important. For example, if you do word-frequency analysis on our current corpora, the commonest words in section titles are likely to be "Introduction", "Methods", etc. These would be useful for distinguishing our corpus from sports reports ("Teams", "Fixtures"), but within science they occur in most papers, so there is a lot of "noise". However, they might help to distinguish articles from reviews.
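As a concrete illustration of the word-frequency point above, here is a minimal sketch; the section titles are invented, and in practice they would come from the sectioned corpus:

```python
from collections import Counter

# Invented example titles; in practice these would be extracted from the
# sectioned corpus (e.g. by ami).
section_titles = [
    "Introduction",
    "Materials and Methods",
    "Methods",
    "Results and Discussion",
    "Introduction",
]

# Lower-case each title, split it into words, and count occurrences.
word_counts = Counter(
    word
    for title in section_titles
    for word in title.lower().split()
)

print(word_counts.most_common(5))
# Words like "introduction" and "methods" dominate in almost every paper,
# so on their own they carry little signal for separating scientific topics.
```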

## Precision and recall

(True/False)(Positive/Negative): see https://en.wikipedia.org/wiki/Precision_and_recall . This is the basis for measuring how well your methods are working. Note that at the start we do not have any negative measure: we cannot tell whether getpapers missed articles. However, once you have classified your corpus (or a section of it) you will know whether your methods have missed documents or sections. That is why you need the human-intensive process of annotation.

There is no magic!

From Wikipedia:

Precision and recall

In pattern recognition, information retrieval and classification (machine learning), precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12.
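The quoted example works out as follows; a few lines of Python make the arithmetic explicit:

```python
# Dog-recognition example from the quoted text:
# 8 photos flagged as dogs; 5 really are dogs, 3 are cats; 12 dogs in total.
true_positives = 5
false_positives = 3
false_negatives = 12 - true_positives   # dogs the program missed

precision = true_positives / (true_positives + false_positives)   # 5/8  = 0.625
recall = true_positives / (true_positives + false_negatives)      # 5/12 ~= 0.417

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```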

## Feature creation tools

AMI itself extracts the features and deliberately leaves it to other tools to do the analysis. The results subtree (e.g. results.xml) can be mined for words and you will probably want to investigate how to extract these.
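As a hedged sketch of what that mining might look like, the snippet below walks a CProject directory and counts terms from every results.xml it finds. It assumes each <result> element stores the matched text in an "exact" attribute; check one of your own results.xml files and adjust the element and attribute names if they differ:

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter

def count_results(cproject_dir):
    """Count matched terms across every results.xml under a CProject tree.

    ASSUMPTION: each <result> element carries the matched text in an
    'exact' attribute; adjust if your results.xml uses different names.
    """
    counts = Counter()
    for root_dir, _dirs, files in os.walk(cproject_dir):
        for name in files:
            if name == "results.xml":
                tree = ET.parse(os.path.join(root_dir, name))
                for result in tree.getroot().iter("result"):
                    term = result.get("exact")
                    if term:
                        counts[term.lower()] += 1
    return counts

# Example use: print(count_results("viral_epidemics").most_common(20))
```

In the spirit of the NOTE below, treat this as a starting point for discussion rather than a finished tool.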

NOTE: do not start writing code without posting your ideas. It is likely that someone has already written tools to do similar jobs, and that KNIME, Jupyter, R, etc. can do this with little effort. So please post your requirements first.

Some of this overlaps with Natural Language Processing (NLP). Again KNIME, Jupyter and other tools can do everything you are likely to need in the next month.

Among the techniques you may need are:

  • regular expressions. There are many good interactive tutorials and you can learn the basics in 30 minutes (at random: https://regexone.com/ , https://regexr.com/). Note that wildcards in filenames are not regular expressions. Regexes are not a golden hammer, though (geek humour if you like that: https://xkcd.com/208/ , https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/ ), and NEVER use regexes to parse HTML or XML (https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1733489#1733489 - first answer).
  • xpath. https://en.wikipedia.org/wiki/XPath#Examples (the rest of that page is a bit heavy), or https://www.baeldung.com/java-xpath . We use xpath in navigating the CProject trees; you'll find it in some ami commands. (You only have to create the expression, not the code.) A short example of both techniques follows after this list.
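A short illustrative snippet covering both techniques; the identifier pattern and the tiny XML document are invented for the example:

```python
import re
import xml.etree.ElementTree as ET

# Regular expression: pull a PMC-style identifier out of a plain string
# (a path string, not the XML content itself).
match = re.search(r"PMC\d+", "downloaded PMC7167749/fulltext.xml")
print(match.group(0) if match else "no match")   # -> PMC7167749

# XPath (the limited subset supported by xml.etree): find section titles
# in a small invented JATS-like document.
xml_text = """<article><body>
  <sec><title>Introduction</title></sec>
  <sec><title>Methods</title></sec>
</body></article>"""
root = ET.fromstring(xml_text)
for title in root.findall(".//sec/title"):
    print(title.text)
```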

## Smoke_test

Tester: Ambreen H

### Data preparation

In order to run the machine learning model, proper data preparation is necessary:

  • The following libraries were used: xml.etree.ElementTree (as ET), string, os, and re.
  • A function was written to locate the XML files and extract the abstract from each.
  • This was done on a small number of papers (11 positives and 11 negatives).
  • Each abstract was cleaned by removing unnecessary characters, converting the text to lowercase, and removing subheadings such as 'abstract'.
  • Finally, a single data file was created in CSV format with three columns: the name of the file, the entire cleaned text of the abstract, and whether the result is a true positive or a false positive. (A sketch of this pipeline follows below.)
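The following is a minimal, illustrative sketch of that pipeline, not the actual code file linked below. It assumes JATS-style XML with an <abstract> element and invented folder names ("positives" and "negatives"):

```python
import csv
import os
import re
import string
import xml.etree.ElementTree as ET

def extract_abstract(xml_path):
    """Return the text of the first <abstract> element, or '' if absent."""
    root = ET.parse(xml_path).getroot()
    abstract = root.find(".//abstract")
    return " ".join(abstract.itertext()) if abstract is not None else ""

def clean(text):
    """Lower-case, drop a leading 'abstract' heading, strip punctuation."""
    text = text.lower()
    text = re.sub(r"^\s*abstract\s*", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def build_csv(folders_and_labels, out_path="smoke_test.csv"):
    """folders_and_labels: e.g. [("positives", 1), ("negatives", 0)]."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "abstract", "label"])
        for folder, label in folders_and_labels:
            for name in sorted(os.listdir(folder)):
                if name.endswith(".xml"):
                    path = os.path.join(folder, name)
                    writer.writerow([name, clean(extract_abstract(path)), label])

# build_csv([("positives", 1), ("negatives", 0)])
```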

Code File

### ML Classification

  • A Jupyter Notebook was used to run a smoke test for binary classification (an illustrative sketch follows after this list).
  • Further improvements to the code are being attempted.
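For illustration only (this is not the tester's notebook), a minimal binary-classification smoke test on the CSV sketched above might look like this, assuming pandas and scikit-learn are installed:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumed CSV columns: file, abstract, label (see the preparation sketch above).
data = pd.read_csv("smoke_test.csv")
data["abstract"] = data["abstract"].fillna("")   # guard against empty abstracts

X_train, X_test, y_train, y_test = train_test_split(
    data["abstract"], data["label"],
    test_size=0.3, random_state=42, stratify=data["label"])

# Turn abstracts into TF-IDF word features, then fit a simple linear model.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_vec, y_train)

# Precision and recall per class, as discussed above.
print(classification_report(y_test, classifier.predict(X_test_vec)))
```

TfidfVectorizer and LogisticRegression are just one reasonable default; with only 22 abstracts any numbers are a smoke test, not an evaluation.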

Code File
