Use the compiler directly:
$ g++ -Wall -std=c++17 -Isrc -O3 -c *.cpp src/*.cpp
$ g++ -Wall -std=c++17 -O3 -o data-explorer *.o
OR
use CMake + GCC/Clang to compile the project and tests (from an IDE or the command line).
$ data-explorer sample.txt
<operation> <aggregation> <grouping>
Usage Example:
$ avg score movie_name
Example output:
avg score GROUPED BY movie_name
ender's_game 8
pulp_fiction 6
inception 8
Operation time = 0.000009s
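A minimal sketch of how a query such as "avg score movie_name" could be evaluated. It is not the project's actual code: the function name averageGroupedBy and its signature are hypothetical, and integer (truncating) averages are assumed only because the example output above shows whole numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of the project's API): averages `values`
// per distinct key in `groups`. Both vectors describe the same rows.
std::unordered_map<std::string, long long> averageGroupedBy(
    const std::vector<long long>& values,
    const std::vector<std::string>& groups) {
    assert(values.size() == groups.size());

    // Running sum and row count for each group.
    struct Acc {
        long long sum = 0;
        long long count = 0;
    };
    std::unordered_map<std::string, Acc> acc;
    for (std::size_t row = 0; row < values.size(); ++row) {
        Acc& a = acc[groups[row]];
        a.sum += values[row];
        ++a.count;
    }

    std::unordered_map<std::string, long long> result;
    result.reserve(acc.size());
    for (const auto& [key, a] : acc) {
        result[key] = a.sum / a.count;  // assumed: integer division, truncates
    }
    return result;
}
```

Calling averageGroupedBy with the score column and the movie_name column would return one averaged value per distinct movie name. The iteration order of the result is unspecified, as with any std::unordered_map.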
main.cpp - Main file (program entry point).
Column.[h|cpp] - Abstract base class for the column inheritance hierarchy.
IntegerColumn.[h|cpp] - Class for storing data and performing operations on integer type columns.
StringColumn.[h|cpp] - Class for storing data and performing operations on string type columns.
DataLoader.h - Abstract base class for data loaders.
FileDataLoader.[h|cpp] - File data loader. Reads the headers, then the column types, and finally the data from a file.
Dataset.[h|cpp] - Representation of the data. Contains the headers and column types, and stores the Column class objects.
Operation.[h|cpp] - Defines the OperationType enum. The "math" is also done here, using templates.
Query.h - Trivial structure storing the query the user requested.
UserInterface.[h|cpp] - Functionality related to interaction with the user.
As speed was the most important requirement of the task, some optimization was performed. The changes with the biggest impact:
- Used std::unordered_map instead of std::map.
- Used std::vector to store data, passed by const reference.
- Stored strings as mapped values (std::string <-> unsigned int) and used the indexes for operations (a performance gain and a significant memory optimization).
- Minimized copying.
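The string mapping mentioned above can be sketched as follows. This is an illustration under assumptions, not the code from StringColumn: the class name StringDictionary and its member functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: maps each distinct string to a small integer id, so a
// column can store one unsigned int per row instead of one std::string.
class StringDictionary {
public:
    // Returns the id of `value`, assigning the next free id on first sight.
    unsigned int idOf(const std::string& value) {
        auto [it, inserted] = ids_.try_emplace(
            value, static_cast<unsigned int>(values_.size()));
        if (inserted) {
            values_.push_back(value);
        }
        return it->second;
    }

    // Reverse lookup, used only when results are printed.
    const std::string& valueOf(unsigned int id) const {
        assert(id < values_.size());
        return values_[id];
    }

    std::size_t size() const { return values_.size(); }

private:
    std::unordered_map<std::string, unsigned int> ids_;  // string -> id
    std::vector<std::string> values_;                    // id -> string
};
```

With such a mapping, grouping compares and hashes integers rather than strings, and each repeated string is stored only once.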
Potential further optimizations:
- Use dynamic C-style arrays for storage. This would require reading the input file twice (the first pass only to count the rows).
- Use a C-style array + index instead of maps (if applicable and worth doing).
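A rough sketch of the "array + index instead of maps" idea, assuming every group has already been given a dense id in the range [0, groupCount). The function averageByGroupId is hypothetical, and std::vector stands in here for the raw C-style arrays as the contiguous, index-addressed storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: groupIds[row] is a dense id in [0, groupCount).
// Returns the truncated integer average per group id; a group with no
// rows yields 0.
std::vector<long long> averageByGroupId(
    const std::vector<long long>& values,
    const std::vector<unsigned int>& groupIds,
    std::size_t groupCount) {
    assert(values.size() == groupIds.size());

    std::vector<long long> sums(groupCount, 0);
    std::vector<long long> counts(groupCount, 0);
    for (std::size_t row = 0; row < values.size(); ++row) {
        assert(groupIds[row] < groupCount);
        sums[groupIds[row]] += values[row];  // direct indexing, no hashing
        ++counts[groupIds[row]];
    }
    for (std::size_t g = 0; g < groupCount; ++g) {
        if (counts[g] != 0) {
            sums[g] /= counts[g];
        }
    }
    return sums;
}
```

Whether this beats a hash map depends on how dense the ids are. It only pays off when the number of groups is small relative to the number of rows.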
Potential options for scalability:
- Multithreading.
- MPI (makes sense with more sophisticated calculations).
- GPU calculations (for more complex calculations).
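As an illustration of the multithreading option only, the sketch below splits a column into chunks and sums them concurrently with std::async. The function parallelSum and the workerCount parameter are hypothetical. Nothing like this exists in the project, and for inputs as small as the sample file the thread start-up cost would likely outweigh any gain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums `values` using up to `workerCount` tasks.
long long parallelSum(const std::vector<long long>& values,
                      std::size_t workerCount) {
    assert(workerCount > 0);

    // Ceiling division, so at most workerCount chunks are created.
    const std::size_t chunk = (values.size() + workerCount - 1) / workerCount;

    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, values.size());
        parts.push_back(std::async(std::launch::async, [&values, begin, end] {
            const auto first = values.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }

    // Combine the partial sums on the calling thread.
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();
    }
    return total;
}
```

Each task reads a disjoint slice of the same const vector, so no locking is needed. On some toolchains the program must be linked with -pthread.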
I'm not fully happy about:
- Template usage.
- Allowing access to the private data fields of Column subclasses from outside (done for performance reasons).