Is it possible to deactivate the standard Solr tags like currency, phone numbers, money, law clause, ...? #455

Open
Aculo0815 opened this issue Dec 2, 2022 · 5 comments


@Aculo0815

Hi,
I installed the latest OpenSemanticSearch version as a deb package on my Ubuntu 22 LTS Hyper-V machine. I'd like to use OSS for our roughly 1,700 docx documentations of non-standard features of our software. The indexing of the docs worked without any problems.

My problem is:
By default, all docx files are tagged with multiple default tags; I think they come from Apache Solr!?
Here are some examples:
[screenshot: example default tags]

Is it possible to deactivate these tags in Apache Solr? I tried the following, which didn't work:

Does anyone have a hint on how to get rid of the default tags?

@josefkarlkraus

Editing the file /etc/opensemanticsearch/etl will probably solve your problem, especially by changing the lines regarding regex and by commenting out the lines containing:
enhance_extract_email
enhance_extract_phone
enhance_extract_law
enhance_extract_money
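
For reference, here is roughly what those lines look like once disabled (a sketch; it matches the commented-out lines in the full config Aculo0815 posts below, where the file is plain Python evaluated by the ETL):

# Email address and email domain extraction -- disabled by commenting out the append
#config['plugins'].append('enhance_extract_email')

# Phone number extraction -- disabled
#config['plugins'].append('enhance_extract_phone')

# Law clauses extraction -- disabled
#config['plugins'].append('enhance_extract_law')

# Money extraction -- disabled
#config['plugins'].append('enhance_extract_money')

Note that documents indexed before the change keep their tags until they are reindexed.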

@feathered-arch (Contributor)

Y'know, I always wondered if anyone got those to work for them. I also disabled these because while it's a great concept, particularly when indexing things like the Panama Papers, it doesn't seem intelligent enough to properly parse things out without resorting to a lot of regex testing.

@Aculo0815 (Author)

Great, it works. Thanks a lot. Now I'm ready to install it on a production VMware machine for my dev team.

@Aculo0815 (Author) commented Jan 11, 2023

  • It worked once, but I tried it on the new server and the tagging of 'Currency' is still there.
  • The 'phone numbers', 'Money' and 'Law clause' (and I added 'iban') tags are gone, so that worked.
    [screenshot: remaining 'Currency' tag]
  • I've done the following steps:
    • changed /etc/opensemanticsearch/etl
    • maybe a restart of the 'opensemanticetl' service is enough, but I rebooted the whole Ubuntu machine
  • Here are my changes to the etl file; did I miss a config?
# -*- coding: utf-8 -*-

#
# ETL config for connector(s)
#

# print debug messages
#config['verbose'] = True


#
# Languages for language specific index
#
# Each document is analyzed without grammar rules in the index fields like content, additionally it can be added/copied to language specific index fields/analyzers
# Document language is autodetected by default plugin enhance_detect_language_tika_server

# If index support enhanced analytics for specific languages, we can add/copy data to language specific fields/analyzers
# Set which languages are configured and shall be used in index for language specific analysis/stemming/synonyms
# Default / if not set all languages that are supported will be analyzed additionally language specific
#config['languages'] = ['en','de','fr','hu','it','pt','nl','cz','ro','ru','ar','fa']

# force to language specific analysis additional in this language(s) grammar & synonyms, even if language autodetection detects other language
#config['languages_force'] = ['en','de']


# only use language for language specific analysis which are added / uncommented later
#config['languages'] = []

# add English
#config['languages'].append('en')

# add German / Deutsch
#config['languages'].append('de')

# add French / Francais
#config['languages'].append('fr')

# add Hungarian
#config['languages'].append('hu')

# add Spanish
#config['languages'].append('es')

# add Portuguese
#config['languages'].append('pt')

# add Italian
#config['languages'].append('it')

# add Czech
#config['languages'].append('cz')

# add Dutch
#config['languages'].append('nl')

# add Romanian
#config['languages'].append('ro')

# add Russian
#config['languages'].append('ru')



#
# Index/storage
#

#
# Solr URL and port
#

config['export'] = 'export_solr'

# Solr server
config['solr'] = 'http://localhost:8983/solr/'

# Solr core
config['index'] = 'opensemanticsearch'


#
# Elastic Search
#

#config['export'] = 'export_elasticsearch'

# Index
#config['index'] = 'opensemanticsearch'


#
# Tika for text and metadata extraction
#

# Tika server (with tesseract-ocr-cache)
# default: http://localhost:9998

#config['tika_server'] = 'http://localhost:9998'

# Tika server with fake OCR cache of tesseract-ocr-cache used if OCR in later ETL tasks
# default: http://localhost:9999

#config['tika_server_fake_ocr'] = 'http://localhost:9999'


#
# Annotations
#

# add plugin for annotation/tagging/enrichment of documents
config['plugins'].append('enhance_annotations')

# set alternate URL of annotation server
#config['metadata_server'] = 'http://localhost/search-apps/annotate/json'


#
# RDF Knowledge Graph
#

# add RDF Metadata Plugin for granular import of RDF file statements to entities of knowledge graphs
config['plugins'].append('enhance_rdf')


#
# Config for OCR (automatic text recognition of text in images)
#

# Disable OCR for image files (i.e for more performance and/or because you don't need the text within images or have only photos without photographed text)
#config['ocr'] = False

# Option to disable OCR of embedded images in PDF by Tika
# so (if alternate plugin is enabled) OCR will be done only by alternate
# plugin enhance_pdf_ocr (which else works only as fallback, if Tika exceptions)
#config['ocr_pdf_tika'] = False

# Use OCR cache
config['ocr_cache'] = '/var/cache/tesseract'

# Option to disable OCR cache
#config['ocr_cache'] = None

# Do OCR for images embedded in PDF documents (i.e. designed images or scanned or photographed documents)
config['plugins'].append('enhance_pdf_ocr')

#OCR language

#If other than english you have to install package tesseract-XXX (tesseract language support) for your language
#and set ocr_lang to this value (be careful, the tesseract package for english is "eng" (not "en") german is named "deu", not "de"!)

# set OCR language to English/default
#config['ocr_lang'] = 'eng'

# set OCR language to German/Deutsch
#config['ocr_lang'] = 'deu'

# set multiple OCR languages
config['ocr_lang'] = 'eng+deu'


#
# Regex pattern for extraction
#

# Enable Regex plugin
config['plugins'].append('enhance_regex')

# Regex config for IBAN extraction
#config['regex_lists'].append('/etc/opensemanticsearch/regex/iban.tsv')


#
# Email address and email domain extraction
#
#config['plugins'].append('enhance_extract_email')


#
# Phone number extraction
#
#config['plugins'].append('enhance_extract_phone')


#
# Config for Named Entities Recognition (NER) and Named Entity Linking (NEL)
#

# Enable Entity Linking / Normalization and dictionary based Named Entities Extraction from thesaurus and ontologies
config['plugins'].append('enhance_entity_linking')

# Enable SpaCy NER plugin
config['plugins'].append('enhance_ner_spacy')

# Spacy NER Machine learning classifier (for which language and with which/how many classes)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['spacy_ner_classifiers']
config['spacy_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
# config['spacy_ner_classifier_default'] = 'en_core_web_sm'

# Set default classifier to German (only if you are sure, that all documents you index are german)
# config['spacy_ner_classifier_default'] = 'de_core_news_sm'

# Language specific classifiers (mapping to autodetected document language to Spacy classifier / language)
#
# You have to download additional language classifiers for example english (en) or german (de) by
# python3 -m spacy download en
# python3 -m spacy download de
# ...

config['spacy_ner_classifiers'] = {
    'da': 'da_core_news_sm',
    'de': 'de_core_news_sm',
    'en': 'en_core_web_sm',
    'es': 'es_core_news_sm',
    'fr': 'fr_core_news_sm',
    'it': 'it_core_news_sm',
    'lt': 'lt_core_news_sm',
    'nb': 'nb_core_news_sm',
    'nl': 'nl_core_news_sm',
    'pl': 'pl_core_news_sm',
    'pt': 'pt_core_news_sm',
    'ro': 'ro_core_news_sm',
}


# Enable Stanford NER plugin
#config['plugins'].append('enhance_ner_stanford')

# Stanford NER Machine learning classifier (for which language and with how many classes, which need more computing time)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['stanford_ner_classifiers']
config['stanford_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'

# Set default classifier to German (only if you are sure, that all documents you index are german)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz'

# Language specific classifiers (mapping to autodetected document language)
# Before you have to download additional language classifiers to the configured path
config['stanford_ner_classifiers'] = {
    'en': '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'es': '/usr/share/java/stanford-ner/classifiers/spanish.ancora.distsim.s512.crf.ser.gz',
    'de': '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz',
}

# If Stanford NER JAR not in standard path
config['stanford_ner_path_to_jar'] = "/usr/share/java/stanford-ner/stanford-ner.jar"

# Stanford NER Java options like RAM settings
config['stanford_ner_java_options'] = '-mx1000m'


#
# Law clauses extraction
#

#config['plugins'].append('enhance_extract_law')


#
# Money extraction
#

#config['plugins'].append('enhance_extract_money')


#
# Neo4j graph database
#

# exports named entities and relations to Neo4j graph database

# Enable plugin to export entities and connections to Neo4j graph database
#config['plugins'].append('export_neo4j')

# Neo4j server
#config['neo4j_host'] = 'localhost'

# Username & password
#config['neo4j_user'] = 'xxx'
#config['neo4j_password'] = 'xxx'
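
One way to double-check what this particular file enables is to evaluate it the way the ETL does, since it is plain Python (a sketch; the real opensemanticetl merges further config files and built-in defaults, so this only shows what this one file adds):

# check_etl_config.py -- hypothetical helper, not part of OSS
# Pre-seed the keys the config file appends to, then execute it.
config = {'plugins': [], 'regex_lists': []}
exec(open('/etc/opensemanticsearch/etl').read())
print('plugins enabled by this file:', config['plugins'])

Run against the file above, this lists enhance_regex but none of the enhance_extract_* plugins, which suggests the remaining 'Currency' tag comes from somewhere else (see the next comment).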

@josefkarlkraus

I just realized that I had to go a bit further to deactivate those as well: you can simply deactivate the Django facets for e.g. "phone", "currency" and so on.
(The procedure is maybe a bit hacky, but it works.)

  1. Create a Django admin account
    cd /var/lib/opensemanticsearch
    python3 manage.py createsuperuser

  2. Access the Django web interface
    http://xxx.xxx.xxx.xxx/search-apps/admin/ >> Thesaurus >> Facets

  3. Deactivate the facets in the web interface

Deactivate all facets that you don't need by clicking on them and setting:
Enabled: "No"
Snippets enabled: "No"
Graph enabled: "No"
SAVE
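
If there are many facets, the same can probably be scripted from the Django shell. A minimal sketch, assuming the "Thesaurus >> Facets" admin page maps to a Facet model in an app named thesaurus, with boolean fields mirroring the admin labels (enabled, snippets_enabled, graph_enabled) and a facet name field — all of these names are assumptions, so verify them against your installation first:

# Run via: python3 manage.py shell (from /var/lib/opensemanticsearch)
from thesaurus.models import Facet  # app/model name assumed from the admin path

# Facet identifiers are assumed; use whatever names your Facets page shows
unwanted = ['phone', 'currency', 'money', 'law']

for facet in Facet.objects.all():
    if facet.facet in unwanted:          # 'facet' field name is an assumption
        facet.enabled = False            # Enabled: "No"
        facet.snippets_enabled = False   # Snippets enabled: "No"
        facet.graph_enabled = False      # Graph enabled: "No"
        facet.save()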
