Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional data to OpenSearch index. #766

Merged
merged 36 commits into from May 1, 2024

Conversation

ItIsJordan
Copy link
Collaborator

@ItIsJordan ItIsJordan commented Feb 27, 2024

Closes #557. Adds the additional_resources of a submission object, and the data_abstract (comment from submission.yaml) to the OpenSearch index as mentioned in #557. Also updates a number of tests which broke upon adding new test data.

Adds the data_abstract (comment in yaml) to the opensearch indexer function.
Fixes test_submissions_csv in dashboard_test.py to use regex when checking against URLs in text as the URL on the webpage does not always match what is contained in the Flask config.
Adds testing for data abstract searching
Adds table data to the OpenSearch index for submission searching in the OpenSearch document enhancers.
Updates test_search in search_test.py to include testing of table description text
Updates index to add file_description from additional resource objects within HEPSubmission/DataSubmission objects to the OpenSearch index.
Updates search_test OpenSearch testing to properly include some testing for searching of additional resources.
Updates the hard-coded/assumed variables used as expected data within the tests to accommodate the existence of a new test submission in the identifiers dictionary within conftest.py
Updates a broken test in test_dashboard.py, this test has issues running on my local system.
Now triggers rendering of LaTeX in record table descriptions as this was bugged and only occurred when the full table is loaded. Mentioned in issue #756
Fixes a bug where the table_options_region element ("Showing X of X values" and "Show All") on the records page would sometimes have the data of another table if loaded later.
Refactors the route to get data tables at /data/ to separate the data table (yaml file) and the headers query code. Implements a second route to select only the table data.
Updates the table renderer JS to properly use new route to select only data table.
Updates tests in response to previous commit.
Fixes an issue as a part of the large file loading bugs in #756 where the filesize threshold loading button would execute multiple queries. This occurred by multiple events being passed to the button. Now clears the buton events before.
Reverts a number of bugfixes/changes for the large file loading functionality that was intended for a separate branch.

Reverts commits:

12214c7
4afcb2f
a9bd63b
4dc662a
6445bd9
882d6a9
@coveralls
Copy link

coveralls commented Feb 27, 2024

Coverage Status

coverage: 83.41% (+0.09%) from 83.321%
when pulling 2846603 on add-additional-desc-to-osindex
into 31fa3ca on main.

Copy link
Member

@GraemeWatt GraemeWatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great! I tested and it works well, but I had a few suggestions for improvements going a bit beyond the scope of the original issue:

  1. In addition to the free-text search, specific fields can now be searched with queries like resources:hepdata_lib or data_abstract:"CERN-LHC". These examples could be given in the "Advanced Search Tips" (search_help.html).
  2. The original issue search: index additional_resources and data_abstract #557 mentions that the goal is "these additional resources become searchable and findable in automatic queries". In practice, this means that enough information should be given in the JSON format of the search results and record pages to enable (e.g.) a Python script to download the resource files. Currently, the resources in the JSON format contains only a list of description fields of the resources. It should be expanded to give also a url and type so that resources is now a list of dictionaries (description, url, type). The type can be set to file_type of the DataResource object, while url is the file_location if it starts with http, otherwise (for a resource file not an external link) construct the url from the id similar to this line.
  3. The resources of an individual data table should be added to the data_tables when viewing the JSON for a whole record and also to the JSON for an individual data table. This is not currently done.
  4. You didn't modify the mapping in record_mapping.py, so the new fields data_abstract and resources take default values (check http://localhost:9200/hepdata-main/_mapping ). This seems to work with the current code, but it might be necessary to give an explicit mapping after addressing point 2.

Fixes and issue where the get_table_details route was not properly routing when load_all flag is not set. This resulted in failure to find the correct route.

Also removes method, as it likely does not matter.
Adds a function to generate/return the correct web link for a DataResource object. Adds associated test.
Update record_mapping to include resources/data_abstract at the record-level.
Adds function to generate a DataResource dictionary data from a DataResource object.
Also adds testing.
Adds the full resource data to the OpenSeach index for both records and table resource objects.
Adds full resource data to the output of the process_data_tables function so the full resource data shows on an individual record's JSON output
Add search testing for new search functionality
Updates the "Advanced" search help tooltip in search_help.html to include new data_abstract searching, and searching by resource (type, description, url)
Changes use of SITE_URL in the tests to retrieve from app.config.get as this is safer.
Adds a shorthand for "resource.description" searching as "resources" in the OpenSearch query builder
Updates search help text in search_help.html to standardise the text for data abstract searching. Also updated to mention "resources" shorthand for description searching.
Add "app" fixture to broken tests that now seem to require it. (where app is used to get site url)
Adds resource informaton to the table_contents in get_table_details, which will be seen in the json output of an individual data table.
Replaces more use of SITE_URL importing directly from the config, which can be unsafe, to use the current_app value provided by flask.

(adding missing imports)
Updates the search help widget in search_help.html to remove reference to non-shorthand "resources.description". Now just references "resources" as the shorthand term for this.
Copy link
Member

@GraemeWatt GraemeWatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks a lot. I'll merge now.

@GraemeWatt GraemeWatt merged commit 1dcf84f into main May 1, 2024
9 checks passed
@GraemeWatt GraemeWatt deleted the add-additional-desc-to-osindex branch May 1, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

search: index additional_resources and data_abstract
3 participants