GitHub - przemek83/data-explorer: Small tool for aggregating and grouping data. Focused on simplicity, speed and memory efficiency. Written in C++. Created as offline interview task.

Building:

Use compiler directly:

$ g++ -Wall -std=c++17 -Isrc -O3 -c *.cpp src/*.cpp  
$ g++ -Wall -std=c++17 -O3 -o data-explorer *.o

OR
use CMake with GCC or Clang to build the project and tests (from an IDE or the command line).

Execution:

$ data-explorer sample.txt

Usage:

Usage Example:

$ avg score movie_name

Example output:

avg score GROUPED BY movie_name  
ender's_game 8  
pulp_fiction 6  
inception 8  
Operation time = 0.000009s
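
The query above boils down to a grouped aggregation. Below is a minimal sketch of that idea, not the code from this repository: the function name `averageGroupedBy` and the integer (truncating) average are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the repository): average of `values`
// grouped by the matching entry in `keys`. Both vectors are assumed to
// have the same length, one entry per row.
std::unordered_map<std::string, int> averageGroupedBy(
    const std::vector<int>& values, const std::vector<std::string>& keys)
{
    // For each group keep the running sum and the number of rows seen.
    std::unordered_map<std::string, std::pair<long long, std::size_t>> partial;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        auto& [sum, count] = partial[keys[i]];
        sum += values[i];
        ++count;
    }

    // Integer division: the average is truncated, which is an assumption here.
    std::unordered_map<std::string, int> result;
    for (const auto& [key, sumAndCount] : partial)
    {
        const auto count = static_cast<long long>(sumAndCount.second);
        result[key] = static_cast<int>(sumAndCount.first / count);
    }
    return result;
}
```

With made-up rows such as scores {8, 7, 9, 6, 8} for {inception, pulp_fiction, inception, pulp_fiction, ender's_game}, this returns 8 for inception, 6 for pulp_fiction and 8 for ender's_game.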

Files:

main.cpp - main file :)
Column.[h|cpp] - Abstract base class of the column inheritance hierarchy.
IntegerColumn.[h|cpp] - Class for storing data and performing operations on integer type columns.
StringColumn.[h|cpp] - Class for storing data and performing operations on string type columns.
DataLoader.h - Abstract base class for data loaders.
FileDataLoader.[h|cpp] - File data loader. Loads headers, then column types and finally data from a file.
Dataset.[h|cpp] - Representation of data. Contains info about headers, column types and stores Column class objects.
Operation.[h|cpp] - Defines the OperationType enum. The "math" is also done here, using some templates.
Query.h - Trivial structure storing the query requested by the user.
UserInterface.[h|cpp] - Functionality related to interaction with the user.

Additional info

As speed was the most important requirement of the task, some optimizations were performed. The ones with the biggest impact:

Used std::unordered_map instead of std::map.
Used std::vector to store data and passed it by const reference.
Stored strings as mapped values (std::string <-> unsigned int) and used the indexes for operations (a performance and significant memory optimisation).
Minimized copying.
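
The string mapping mentioned above can be sketched as follows. This is an illustration only: the class name `EncodedStringColumn` and its members are hypothetical and are not taken from StringColumn in this repository.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string column: each distinct string gets one index and
// rows store only that index instead of a full std::string.
class EncodedStringColumn
{
public:
    void add(const std::string& value)
    {
        // try_emplace inserts only if the string was not seen before;
        // the next free index equals the current dictionary size.
        const auto [it, inserted] = indexOf_.try_emplace(
            value, static_cast<unsigned int>(dictionary_.size()));
        if (inserted)
            dictionary_.push_back(value);
        rows_.push_back(it->second);
    }

    // Passed out by const reference to avoid copying the data.
    const std::vector<unsigned int>& rows() const { return rows_; }

    const std::string& decode(unsigned int index) const
    {
        return dictionary_[index];
    }

    std::size_t distinctCount() const { return dictionary_.size(); }

private:
    std::unordered_map<std::string, unsigned int> indexOf_;  // string -> index
    std::vector<std::string> dictionary_;                    // index -> string
    std::vector<unsigned int> rows_;                         // one index per row
};
```

Grouping can then work on the `unsigned int` indexes, so comparing or hashing rows does not touch the strings at all; `decode` is needed only when printing results.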

Potential further optimizations:

Usage of dynamic C-style arrays for storage. To introduce those, the input file needs to be read twice (the first pass only counts the rows).
Usage of C-style array + index instead of maps (if applicable and worth doing).
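
A rough sketch of the two-pass idea, under simplifying assumptions: the input is a seekable stream with one well-formed integer per line, and the names `IntArray` and `loadTwoPass` are invented for this example. The real input format of data-explorer (headers, types, multiple columns) is not handled here.

```cpp
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical result type: a fixed-size dynamic array plus its length.
struct IntArray
{
    std::unique_ptr<int[]> data;
    std::size_t size{0};
};

// Reads one integer per non-empty line, allocating the storage exactly once.
IntArray loadTwoPass(std::istream& input)
{
    // Pass 1: only count the rows.
    std::size_t rowCount{0};
    std::string line;
    while (std::getline(input, line))
        if (!line.empty())
            ++rowCount;

    // Rewind: the failed getline left eofbit/failbit set, and seekg
    // fails on a stream in the fail state, so clear the flags first.
    input.clear();
    input.seekg(0);

    // Pass 2: allocate once with the known size and fill the array.
    IntArray result{std::make_unique<int[]>(rowCount), rowCount};
    std::size_t row{0};
    while (row < rowCount && std::getline(input, line))
        if (!line.empty())
            result.data[row++] = std::stoi(line);
    return result;
}
```

The trade-off is reading the file twice in exchange for a single exact allocation; whether that beats a growing std::vector would need measuring on real data.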

Potential options for scalability:

Multithreading, by introducing threads.
Usage of MPI (makes sense with more sophisticated calculations).
GPU calculations (in case of more complex calculations).

I'm not fully happy about:

Template usage.
Allowing access to the private data field of Column subclasses from outside (for performance reasons).
