Automatic construction of a semantic knowledge-graph based on content of technical documentation
Explore the docs »
View Demo
·
Report Bug
·
Request Feature
Technical Documentation Analyzer is a standalone application for building a knowledge graph from the content of technical documentation.
Large IT projects rest on translating a system description into successive modeling languages: from the requirements model written in natural language, through the architecture model, to the description of the system in a programming language. The aim of the project is to automatically produce a description of the system in OWL/RDF from the available technical documentation, using NLP techniques for Named Entity Recognition and Relation Detection. The result is an ontology for describing software solutions from a functional and non-functional perspective, together with an automatic mechanism that uses it to build a Semantic Knowledge Graph describing the system from the knowledge contained in the technical documentation.
The system was designed to handle technical documentation from the autonomous-car industry, so a custom Named Entity Recognition (NER) neural network was developed. The model had to be compatible with the spaCy library, so spaCy was used as the training interface. Training was performed on a dataset related to the Formula Student series and on documentation publicly available on the Internet, annotated with the NER Annotator for SpaCy tool.
The NER model is available for download here: ner_latest.zip, and is also provided in the release files.
The manual below targets Linux; running on Windows is possible but not recommended.
If you are installing from source, you will need:
- Python 3.8 or later, excluding 3.11 and 3.12
Optional:
- Java 17 for additional information extraction, and the 4.5.4 release of CoreNLP Server, available here: CoreNLP Server
If you want to compile with CUDA support, install the following (note that CUDA is not supported on macOS):
- NVIDIA CUDA 11.0 or above, according to TensorFlow and PyTorch support
- NVIDIA cuDNN v7 or above
- A compiler compatible with CUDA
Note: refer to the cuDNN Support Matrix for the cuDNN versions supported by each CUDA release, CUDA driver, and NVIDIA hardware
If you want to disable CUDA support, export the environment variable USE_CUDA=0. Other potentially useful environment variables can be found in main.py.
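For example, a CPU-only run with the smaller language model can be configured by exporting these variables before launching the application (USE_CUDA and MODEL are the variables documented in this README; the exact values shown here are a sketch):

```shell
# Run on CPU only and pick the smaller, faster language model (lower accuracy).
export USE_CUDA=0
export MODEL=en_core_web_sm
```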
The latest tag for TDA is 0.1.0-alpha, released for the validation phase. Go to the releases section, download the archive, and follow the instructions below; alternatively, build the system from source. For more information about releases, check the RELEASE.md file.
git clone https://github.com/lukaszmichalskii/technical-documentation-analyzer.git
cd technical-documentation-analyzer
# **** OPTIONAL: virtual environment for Python setup ****
python3 -m virtualenv venv
source venv/bin/activate
# **** END OPTIONAL ****
python3 -m pip install -r build_requirements.txt
# Default model. For faster execution use 'en_core_web_sm' instead (lower accuracy).
python3 -m spacy download en_core_web_lg
Navigate to the downloaded CoreNLP Server archive and extract the packages.
unzip <archive_name.zip> # extract archive
Navigate to the directory with the extracted packages and run the command below:
# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
Aside: the server should listen on port 9000:
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
        (Note: unspecified annotator properties are English defaults)
        inputFormat = text
        outputFormat = json
        prettyPrint = false
[main] INFO CoreNLP - Threads: 24
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /[0:0:0:0:0:0:0:0]:9000
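Once the server is listening, it can be queried over HTTP. The sketch below uses only the Python standard library; the annotator names (tokenize, pos, ner) are standard CoreNLP annotators, but adjust them to whatever your pipeline needs:

```python
import json
import urllib.parse
import urllib.request

def build_corenlp_url(base="http://localhost:9000", annotators="tokenize,pos,ner"):
    """Build the CoreNLP server URL with a JSON 'properties' query parameter."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    return base + "/?" + urllib.parse.urlencode({"properties": props})

def annotate(text, url=None):
    """POST raw text to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url or build_corenlp_url(),
        data=text.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a running server on port 9000):
# result = annotate("The LiDAR sensor feeds the perception module.")
# print([tok["ner"] for s in result["sentences"] for tok in s["tokens"]])
```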
Named Entity Recognition uses a pre-trained model available here: ner_latest.zip. The NER model is topic-specific: the network above detects information for the autonomous-car industry. To load your own model, matched to your documentation, follow the steps below and replace the mentioned model.
Navigate to the src/nlp/models directory, create a ner directory, then place all extracted files into it:
# navigate to models dir (from project root) and create ner
cd src/nlp/models && mkdir ner
# move all files from extracted model directory to ner directory
mv <extracted_model_absolute_path> ner
If all steps were completed correctly and the paths are intact, the system will compile and add the NER model to the information extraction pipeline on startup.
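The copy step above can also be scripted. This is a small sketch using only the standard library (the target path mirrors the README; run it from the project root):

```python
import shutil
from pathlib import Path

def install_ner_model(extracted_model_dir, project_root="."):
    """Copy an extracted spaCy NER model into src/nlp/models/ner."""
    target = Path(project_root) / "src" / "nlp" / "models" / "ner"
    target.mkdir(parents=True, exist_ok=True)
    for item in Path(extracted_model_dir).iterdir():
        dest = target / item.name
        if item.is_dir():
            shutil.copytree(item, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(item, dest)
    return target
```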
Aside: some packages and models might be missing; the system will automatically resolve missing dependencies and configure the environment on the first run.
Technical Documentation Analyzer (TDA) runs with the standard pipeline configuration by default.
python3 src/skg_app.py --techdoc_path <path_to_documentation>
Execution can be configured with the --only argument, which specifies the jobs to run. The command below runs only the decompress, decode, and information_extraction steps.
python3 src/skg_app.py --techdoc_path <path_to_documentation> --only decompress decode information_extraction
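The job-selection semantics of --only can be illustrated with a short argparse sketch (the job names come from this README; the real CLI lives in src/skg_app.py and may differ in details):

```python
import argparse

# Pipeline jobs in execution order, as listed in the argument table.
ALL_JOBS = ["decompress", "decode", "information_extraction", "make_graph", "upload_graph"]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="TDA pipeline sketch")
    parser.add_argument("--techdoc_path", required=True)
    parser.add_argument("--only", nargs="+", choices=ALL_JOBS, default=ALL_JOBS,
                        help="subset of pipeline jobs to run")
    return parser.parse_args(argv)

def jobs_to_run(args):
    # Preserve pipeline order regardless of the order given on the command line.
    return [job for job in ALL_JOBS if job in args.only]
```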
By default, results are stored in a results directory created in the current working directory. To specify where intermediate results from each module execution should be stored, use the --output argument. If the destination does not exist, the system will automatically create the provided directory tree.
python3 src/skg_app.py --techdoc_path <path_to_documentation> --output <path_to_output>
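The automatic creation of the output directory tree can be sketched with the standard library (an illustration of the behavior described above, not the project's actual code; the "has to be empty" rule from the argument table is enforced here too):

```python
from pathlib import Path

def ensure_output_dir(path):
    """Create the output directory and any missing parents; require it to be empty."""
    out = Path(path)
    if out.exists() and any(out.iterdir()):
        raise ValueError(f"output directory {out} must be empty")
    out.mkdir(parents=True, exist_ok=True)
    return out
```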
To serialize results to a Stardog database, provide the database name with the --db_name argument.
python3 src/skg_app.py --techdoc_path <path_to_documentation> --db_name <database_name>
By default, the application uses src/plugins/default_plugin.py as the text processing plugin. A custom plugin can be supplied with the --plugin argument.
python3 src/skg_app.py --techdoc_path <path_to_documentation> --plugin <path_to_plugin>
Arguments for adjusting running options:
Argument | Description | Default | Required |
---|---|---|---|
--techdoc_path | Path to the compressed documentation file(s) (.zip and .tar.xz only), a directory with already decompressed files, or a single file (supported document formats: .pdf, .docx) | None | YES |
--plugin | Path to the text parsing plugin | src/plugins/default_plugin.py | NO |
--only | Actions to perform on the input package | decompress decode information_extraction make_graph upload_graph | NO |
--pipeline | Actions to perform on the preprocessed text in the NLP step | clean cross_coref tfidf tokenize content_filtering batch svo spo ner | NO |
--output | Directory where results should be saved; has to be empty | results | NO |
--tfidf | Number of words to pick from TF-IDF results for topic modeling | 5 | NO |
--db_name | Name of the database to upload the graph to | None | NO |
Other options can be set via environment variables:
Variable | Description | Default |
---|---|---|
MODEL | Language model used for Natural Language Processing tasks | en_core_web_lg |
USE_CUDA | If set to 1, the system uses the CUDA platform during execution; otherwise CPU cores handle the computation. Requires CUDA configuration; gives much better performance, especially on large language models | 0 |
IN_MEMORY_FILE_SIZE | Maximum file size, in bytes, that can be loaded into program memory. If a file exceeds this limit, its content is broken down into smaller pieces | 1MB |
STARDOG_ENDPOINT | Stardog database endpoint URL | None |
STARDOG_USERNAME | Stardog database username | None |
STARDOG_PASSWORD | Stardog database password | None |
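The IN_MEMORY_FILE_SIZE behavior, splitting oversized files into smaller pieces, can be sketched as follows (illustrative only; the actual chunking logic lives in the application):

```python
import os

def read_in_pieces(path, limit=1_000_000):
    """Yield the whole file if it fits within `limit` bytes, else fixed-size chunks."""
    if os.path.getsize(path) <= limit:
        with open(path, "rb") as f:
            yield f.read()
        return
    with open(path, "rb") as f:
        while True:
            chunk = f.read(limit)
            if not chunk:
                break
            yield chunk
```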
See the Jira for a list of proposed features (and known issues).
Project: https://github.com/lukaszmichalskii/technical-documentation-analyzer
Author | |
---|---|
Katarzyna Hajduk | 259189@student.pwr.edu.pl |
Hubert Kustosz | 259119@student.pwr.edu.pl |
Damian Łukasiewicz | 259186@student.pwr.edu.pl |
Łukasz Michalski | 261118@student.pwr.edu.pl |
Technical Documentation Analyzer (TDA) is released under a GNU license, as found in the LICENSE file.