Skip to content

Processing data using different R libraries and comparing their performance.

License

Notifications You must be signed in to change notification settings

IgorKolodziej/R_packages_benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Execute SQL queries using different R libraries, benchmark them and plot the results

The aim of this project is to create and compare the performance of code that returns the exact same dataframes as SQL queries given below, using the following R libraries:

  • sqldf
  • base R
  • dplyr
  • data.table

It was made as part of structured data processing course at my university and all credits for inventing these excercises go to my professor, Anna Cena, Ph. D.

Additionaly, I challanged myself to write dplyr and data.table functions with only one command.

Furthermore, I decided to play around with ggplot2 plotting library to visualize the benchmark results, definitely check it out!

How to use

If you want to run the script, you need to unzip the data from and into the data directory

Folder structure

  • main.R loads and executes all the tasks, checks if the results are equal, benchmarks the functions and saves results
  • data: the data we are working with
  • tasks: each R file contains functions written with mentioned libraries for one SQL query
  • queries_results: tables resulting from given queries in csv format
  • benchmark_results: tables comparing performance of the libraries
  • plots: plots showing benchmark results

Dataset

We are working with a travel forum data from anonymized dump of all user-contributed content on the Stack Exchange network.

Queries

Query 1

SELECT Location, SUM(UpVotes) as TotalUpVotes
FROM Users
WHERE Location != ''
GROUP BY Location
ORDER BY TotalUpVotes DESC
LIMIT 10 

Query 2

SELECT STRFTIME('%Y', CreationDate) AS Year, STRFTIME('%m', CreationDate) AS Month,
COUNT(*) AS PostsNumber, MAX(Score) AS MaxScore
FROM Posts
WHERE PostTypeId IN (1, 2)
GROUP BY Year, Month
HAVING PostsNumber > 1000 

Query 3

SELECT Id, DisplayName, TotalViews
FROM (
    SELECT OwnerUserId, SUM(ViewCount) as TotalViews
    FROM Posts
    WHERE PostTypeId = 1
    GROUP BY OwnerUserId
) AS Questions
JOIN Users
ON Users.Id = Questions.OwnerUserId
ORDER BY TotalViews DESC
LIMIT 10

Query 4

SELECT DisplayName, QuestionsNumber, AnswersNumber, Location, Reputation, UpVotes, DownVotes
FROM (
    SELECT *
    FROM (
        SELECT COUNT(*) as AnswersNumber, OwnerUserId
        FROM Posts
        WHERE PostTypeId = 2
        GROUP BY OwnerUserId
    ) AS Answers
    JOIN
    (
        SELECT COUNT(*) as QuestionsNumber, OwnerUserId
        FROM Posts
        WHERE PostTypeId = 1
        GROUP BY OwnerUserId
    ) AS Questions
    ON Answers.OwnerUserId = Questions.OwnerUserId
    WHERE AnswersNumber > QuestionsNumber
    ORDER BY AnswersNumber DESC
    LIMIT 5
) AS PostsCounts
JOIN Users
ON PostsCounts.OwnerUserId = Users.Id

Query 5

SELECT Title, CommentCount, ViewCount, CommentsTotalScore, DisplayName, Reputation, Location
FROM (
    SELECT Posts.OwnerUserId, Posts.Title, Posts.CommentCount, Posts.ViewCount,
    CmtTotScr.CommentsTotalScore
        FROM (
            SELECT PostId, SUM(Score) AS CommentsTotalScore
            FROM Comments
            GROUP BY PostId
        ) AS CmtTotScr
    JOIN Posts ON Posts.Id = CmtTotScr.PostId
    WHERE Posts.PostTypeId=1
) AS PostsBestComments
JOIN Users ON PostsBestComments.OwnerUserId = Users.Id
ORDER BY CommentsTotalScore DESC
LIMIT 10

About

Processing data using different R libraries and comparing their performance.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages