Goodbooks10K dataset search tool

A simple React and Elasticsearch tool for exploring the Goodbooks10k dataset, with filtering by book title and user tags, built using light-bootstrap-dashboard-react. It is meant as a small showcase of how to use Elasticsearch to search tagged documents.

Installation

Run npm install

Data loading

Currently, the way to load the data into an Elasticsearch instance is to decompress the data/data.7z file and then run the nested_create_es_index.py script. The script requires the elasticsearch-py package. If you want to build the new_nested_tags.csv file yourself from the raw goodbooks-10k files, you will also need pandas installed in your Python environment.
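As a rough illustration of what the loading step involves, the sketch below groups CSV rows into one document per book and emits bulk actions. It is not the contents of nested_create_es_index.py: the column names book_id, title and tag_name and the sample rows are hypothetical, and the real layout of new_nested_tags.csv may differ. Only the index name goodbooks10k, the type books and the tag_nested.name field come from the aggregation query used in this README.

```python
import csv
import io
from collections import OrderedDict


def build_bulk_actions(csv_file, index="goodbooks10k", doc_type="books"):
    """Yield one bulk action per book, grouping tag rows by book id.

    Assumes one CSV row per (book, tag) pair with the hypothetical
    columns book_id, title and tag_name.
    """
    books = OrderedDict()
    for row in csv.DictReader(csv_file):
        book = books.setdefault(
            row["book_id"], {"title": row["title"], "tag_nested": []}
        )
        book["tag_nested"].append({"name": row["tag_name"]})

    for book_id, source in books.items():
        yield {
            "_index": index,
            "_type": doc_type,  # mapping types are still used in Elasticsearch 6
            "_id": book_id,
            "_source": source,
        }


# Made-up sample rows, only to show the shape of the output.
sample = io.StringIO(
    "book_id,title,tag_name\n"
    "1,The Hunger Games,dystopia\n"
    "1,The Hunger Games,young-adult\n"
    "2,Dune,science-fiction\n"
)
actions = list(build_bulk_actions(sample))
```

With elasticsearch-py installed, a generator like this could be passed to elasticsearch.helpers.bulk(client, actions) to index the documents in batches.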

After running the index creation script, it is recommended to explore the index interactively in Kibana. The basic query used by this project is an aggregation of the form:

GET /goodbooks10k/books/_search
{
  "aggs": {
    "byTag": {
      "terms": {
        "field": "tag_nested.name.keyword",
        "size": 40000
      }
    }
  },
  "size": 0
}

Here, the size of 40000 inside the terms aggregation is large enough for it to return every tag in the dataset, and the size of 0 in the outer query tells Elasticsearch to omit search hits, since we are only interested in the aggregated data.
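The same request can be expressed from Python. The following is a minimal sketch, not code from this project: the helper names build_tag_aggregation and extract_tag_counts are made up for illustration, and the sample response is hand-written to mirror the standard shape of a terms aggregation result (a list of buckets with key and doc_count).

```python
def build_tag_aggregation(max_tags=40000):
    """Build the request body for the tag aggregation shown above."""
    return {
        "size": 0,  # omit search hits, only aggregations are needed
        "aggs": {
            "byTag": {
                "terms": {
                    "field": "tag_nested.name.keyword",
                    "size": max_tags,  # upper bound on returned buckets
                }
            }
        },
    }


def extract_tag_counts(response):
    """Map each tag to the number of documents that carry it."""
    buckets = response["aggregations"]["byTag"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "byTag": {
            "buckets": [
                {"key": "fiction", "doc_count": 3},
                {"key": "fantasy", "doc_count": 2},
            ]
        }
    }
}
body = build_tag_aggregation()
tag_counts = extract_tag_counts(sample_response)
```

Assuming an elasticsearch-py 6.x client, the body could be sent with client.search(index="goodbooks10k", body=body) and the result passed to extract_tag_counts.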

Gotchas

The react-select package is pinned at version 1.2.1 in package.json because of this issue, identified from reading this other issue.
Elasticsearch is prone to backwards-incompatible changes, so keep in mind that the aggregation currently used in this project might not work as expected in versions of Elasticsearch later than 6. When running the main aggregation in Kibana with size 40000, the response area also shows the following warning message:

#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

If you get this message, try raising search.max_buckets as indicated. If it still does not work or another error comes up, please open an issue.
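The warning also mentions the composite aggregation as an alternative. The sketch below shows how paging through all tags with it might look. It is not part of this project: iter_tag_buckets, the source name "tag" and the page size are assumptions, and fake_search is a stand-in for a real Elasticsearch call so that the example runs on its own.

```python
def iter_tag_buckets(search, page_size=1000):
    """Yield (tag, doc_count) pairs, fetching one composite page at a time.

    `search` is any callable that takes a request body and returns the
    Elasticsearch response as a dict.
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [
                {"tag": {"terms": {"field": "tag_nested.name.keyword"}}}
            ],
        }
        if after is not None:
            composite["after"] = after  # resume after the last seen key
        response = search({"size": 0, "aggs": {"byTag": {"composite": composite}}})
        agg = response["aggregations"]["byTag"]
        if not agg["buckets"]:
            break
        for bucket in agg["buckets"]:
            yield bucket["key"]["tag"], bucket["doc_count"]
        # Fall back to the last bucket key if after_key is not returned.
        after = agg.get("after_key", agg["buckets"][-1]["key"])


def fake_search(body):
    """Stand-in for a real search call, serving three made-up tags."""
    tags = [("a", 3), ("b", 2), ("c", 1)]
    composite = body["aggs"]["byTag"]["composite"]
    after = composite.get("after", {}).get("tag")
    remaining = [t for t in tags if after is None or t[0] > after]
    page = remaining[: composite["size"]]
    buckets = [{"key": {"tag": k}, "doc_count": n} for k, n in page]
    result = {"buckets": buckets}
    if buckets:
        result["after_key"] = buckets[-1]["key"]
    return {"aggregations": {"byTag": result}}


all_tags = list(iter_tag_buckets(fake_search, page_size=2))
```

Against a real cluster, search could be something like lambda body: client.search(index="goodbooks10k", body=body), assuming an elasticsearch-py client and an Elasticsearch version that supports composite aggregations.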


jobliz/goodbooks10k-dataset-explorer
