BC Data Catalogue search engine improvement proof-of-concept project. This work was completed by a student team as a UBC MDS Capstone project (June 2023).

bcgov/bcdc-search-improvement-capstone

Lifecycle: Experimental

bcdc-search-improvement-capstone

A UBC MDS Capstone project: a proof of concept that applies natural language processing (NLP) models to improve the user search experience of the B.C. Data Catalogue.

Project Status

Work In Progress

Problem Statement

The BC Data Catalogue contains over 4,000 datasets that are available for anyone to use. Despite this wealth of information, users often have difficulty locating the datasets they need, which lowers engagement with the platform.
To address this issue, we used SBERT to implement semantic search on top of the Solr search engine. Semantic search is an NLP technique that matches on the meaning of a query and the user's intent rather than relying solely on specific keywords.
The new search engine shows a significant improvement in search performance and can recognize synonyms and phrases. It also tolerates typing errors, which makes it more intuitive and user-friendly.
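
The ranking idea behind semantic search can be illustrated with a minimal sketch. It assumes that the query and each dataset description have already been turned into embedding vectors (in this project that would be SBERT's job) and that documents are ranked by cosine similarity. The dataset titles, the 3-dimensional vectors, and the function names below are hypothetical and chosen only for illustration; they are not taken from the project code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document titles ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Hypothetical toy embeddings; real SBERT vectors have hundreds of dimensions.
DOC_VECS = {
    "Forest fire perimeters": [0.9, 0.1, 0.0],
    "School enrolment by district": [0.0, 0.2, 0.9],
    "Wildfire locations": [0.8, 0.3, 0.1],
}
# Stand-in for the embedding of a query that shares no keywords with the titles.
QUERY_VEC = [1.0, 0.2, 0.0]

print(rank_by_similarity(QUERY_VEC, DOC_VECS))
```

Because the comparison happens in vector space, two texts can score as similar without sharing any keyword, which is what lets this kind of search cope with synonyms.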

How to run?

  1. Install Java 8.
  2. Install the dependencies using conda:
     conda env create -f environment.yml
  3. Start Solr (the command below uses the Windows script):
     bin/solr.cmd start
  4. Run the Streamlit app:
     streamlit run search_engine.py
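
Before launching the Streamlit app, it can help to confirm that Solr actually started. The sketch below is not part of the repository. It assumes Solr is listening on its default address, http://localhost:8983/solr, and queries the CoreAdmin STATUS endpoint; the helper names `core_status_url` and `solr_is_up` are hypothetical.

```python
import json
import urllib.request


def core_status_url(base_url="http://localhost:8983/solr"):
    """Build the CoreAdmin STATUS URL for a Solr base URL (default port assumed)."""
    return base_url.rstrip("/") + "/admin/cores?action=STATUS&wt=json"


def solr_is_up(base_url="http://localhost:8983/solr", timeout=3.0):
    """Return True if Solr answers the STATUS request with HTTP 200 and valid JSON."""
    try:
        with urllib.request.urlopen(core_status_url(base_url), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    print("Solr reachable:", solr_is_up())
```

If this prints `Solr reachable: False`, check that the start command in step 3 completed and that Solr was not started on a different port.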

Data Sources

B.C. Data Catalogue text data: sourced directly from the B.C. Data Catalogue available under the Open Government Licence - British Columbia.

Software

Getting Help or Reporting an Issue

To report bugs/issues/feature requests, please file an issue.

How to Contribute

If you would like to contribute, please see our CONTRIBUTING guidelines.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Dependencies

Solr is an open-source search platform built on Apache Lucene. We used Solr 6.6.6 in this project. The license for Solr can be found at solr/LICENSE.txt.
We also used the Vector Scoring Plugin for Solr to calculate the distance between the query vector and the document vectors.

License

Copyright 2023 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
