Random sentences with percentage values could be improperly classified as having salinity data #122

Open
carrineblank opened this issue Feb 16, 2018 · 2 comments

Comments

@carrineblank

For the description: "Motile rods, width 0.5–1.4 µm, length 2–8 µm (type strain:0.5–0.8 µm, 3–5 µm). Unstained cells are not granulated. Gram reaction in 12 h cultures is uneven (dappled); after 38 h, cells are Gram-negative. Spores are ellipsoidal and most (>50 %) of the sporangia are not swollen. Colonies on agar media sink into the agar within a few days (see Fig. 3); no liquefaction of agar occurs. Colonies on peptone/ urea agar are whitish and round with entire margins; no pigmentation occurs on mineral/glucose/yeast extract medium. Chemo-organotrophic. Growth is inhibited by peptones; inhibition may be neutralized by urea. Catalase- and oxidase-positive. Mesophilic; maximum temperature for growth is 40 ˚C (type strain: 35 ˚C). Positive for hydrolysis of agar, starch, hippurate and aesculin. Acid is produced from agar and glucose. Shows weak aminopeptidase activity. Negative for anaerobic growth, growth at pH 5.7 and in 5% NaCl, Voges–Proskauer test, urease, nitrate reduction, activities of egg-yolk lecithinase, dextranase, DNase and lysine decarboxylase, hydrolysis of poly-β-hydroxybutyric acid, casein, pectin, Tween 80 and chitin, production of indole, dihydroxyacetone and dextrin crystals, anaerobic gas production from nitrate, alkali or acid production in litmus milk, liquefaction of gelatin and resistance to lysozyme and sodium lauryl sulfate. Variable reactions are observed for deamination of phenylalanine, tyrosine degradation (type strain is positive) and methylene blue reduction (type strain is negative). The G+C content of the DNA is 47–49 mol% (type strain, 47 mol%), as determined by the thermal denaturation method. Type strain is 10T=DSM 1327T=CIP 107437T. Isolated from meadow soil in Gottingen, Germany, in 1972. "

MicroPIE is picking up 50 % for the NaCl min. It should be returning nothing.

(If MicroPIE classifies a sentence as containing salinity data whenever it has digits followed by a percent sign, that could be the source of the problem. In this case the sentence has nothing to do with salinity.)

@carrineblank
Author

A related example.

For the sentences: "In addition to the characteristics described for the genus, the species has the following features. Cells are 3.0–5.0 μm long and 0.8–1.5 μm in diameter. Colonies are circular with entire margins, flat/umbonate elevation, opaque and butyrous in texture and 2–3 mm in diameter after 2 days on NA (pH 7.0) plates at 37 °C. Temperature for growth is 16–45 °C, with optimum growth at 37 °C; there is no growth at 50 °C and little growth at 16 °C after several days. Growth is observed at pH 5.5–9.5, with optimum growth at pH 7.0–8.0 (most rapid initial growth at pH 7.5) but no growth at pH 5.0. Tolerates 0–100 mM boron in agar media, with optimum growth in the absence of boron and some growth at 150 mM boron after 2 days. NaCl is tolerated up to 5 % (w/v), indicating that it is moderately halotolerant. "

MicroPIE is picking up 2 mm, 0 mm for the NaCl min. It should be picking up 0 for the NaCl min.

(In this case, it appears that the phrase "Tolerates 0–100 mM boron in agar media", which contains a concentration value, is being incorrectly interpreted as an NaCl concentration. So some sentences containing mM concentration values seem to be improperly classified as sentences containing salinity information.)

@carrineblank
Author

Some suggestions:
To classify a sentence as containing salinity information, it should not be enough for the sentence to have a concentration value (\d %, \d mM, \d µM, \d M, \d per mil, \d ‰, \d %%). It should also contain one of the words salinity, NaCl, sea salt, sea salts, seasalt, seasalts, sodium chloride, salt, Na+, seawater or sea water.
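
The rule suggested above could be sketched roughly as follows. This is only an illustration in Python, not MicroPIE's actual code; the names `CONCENTRATION`, `SALT_TERMS` and `has_salinity_info` are made up for the example, and the `\d %%` form is assumed to be covered by the plain `%` match.

```python
import re

# A number followed by one of the concentration units listed above.
# Units are matched case-sensitively so that "2 mm" (millimetres) or
# "8 µm" (micrometres) are not mistaken for "mM" / "µM".
CONCENTRATION = re.compile(
    r"\d+(?:\.\d+)?\s*(?:%|‰|per\s+mil\b|(?:mM|µM|μM|M)\b)"
)

# Salinity keywords from the list above, matched case-insensitively.
SALT_TERMS = re.compile(
    r"\b(?:salinity|NaCl|sea\s?salts?|sodium\s+chloride|salt|sea\s?water)\b"
    r"|\bNa\+",
    re.IGNORECASE,
)


def has_salinity_info(sentence: str) -> bool:
    """True only if the sentence has BOTH a concentration value and a salt term."""
    has_value = CONCENTRATION.search(sentence) is not None
    has_term = SALT_TERMS.search(sentence) is not None
    return has_value and has_term


if __name__ == "__main__":
    examples = [
        "Spores are ellipsoidal and most (>50 %) of the sporangia are not swollen.",
        "Tolerates 0–100 mM boron in agar media, with optimum growth in the "
        "absence of boron and some growth at 150 mM boron after 2 days.",
        "NaCl is tolerated up to 5 % (w/v), indicating that it is moderately halotolerant.",
    ]
    for s in examples:
        print(has_salinity_info(s), "|", s)
```

With this check, the sporangia and boron sentences from the two examples above are rejected because they contain no salt term, while the "NaCl is tolerated up to 5 % (w/v)" sentence is still accepted.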
