GitHub - sunnysideprodcorp/Cpp_R_PlottingNonRectangularData: A set of C++/R code to extract non-rectangular data from tsv files and plot it with a variety of alignment options

The code in this repository is designed to deal with non-rectangular data, such as time series collected as part of a web-scraping exercise. Below you see a sample of web-scraping that recorded the ranks of the top-25 Reddit posts every hour. For each post, the rank and time the rank was recorded were collected until the post fell out of the top 25. This resulted in a .tsv file, with example lines below, where each row represents a single post and its trajectory through the front page of Reddit.

comp	id	subreddit	created	ranks	recorded_at	rank_length
0	3ckf7b	pics	1436373189.0	[1, 1, 3, 5]	[1436392502, 1436396101, 1436399701, 1436403301]	4.0
0	3csv79	UpliftingNews	1436538581.0	[16, 24]	[1436558101, 1436594101]	2.0
0	3ccx4y	gifs	1436223351.0	[6, 7, 7, 10]	[1436259301, 1436262901, 1436266501, 1436273701]	4.0
0	3bldf2	todayilearned	1435636909.0	[4, 3]	[1435665301, 1435668901]	2.0
0	3acrl2	pics	1434677487.0	[0, 0, 0, 4]	[1434686101, 1434689701, 1434693301, 1434714901]	4.0
0	3cosrl	space	1436457300.0	[22, 16, 15, 14, 15, 17, 15, 18, 18]	[1436489702, 1436493301, 1436496901, 1436500501, 1436504101, 1436507701, 1436511301, 1436518501, 1436522101]	9.0
0	3d2m5l	pics	1436748860.0	[6]	[1436781302]	1.0
0	3b5ll4	nottheonion	1435291130.0	[14, 14, 17]	[1435326901, 1435330501, 1435334101]	3.0
0	3a7l67	Showerthoughts	1434575878.0	[16, 13]	[1434617702, 1434628502]	2.0

The goal of the code is to convert information like the above into plots that look like what we see below.

These two-dimensional histogram plots are an instructive way to examine non-rectangular data because these plots do not impose any assumptions about the shape or length of a time series such as the ones we collected when scraping Reddit's front page. We can get a sense of how different subreddits perform by watching how individual posts perform when they first enter the top 25 posts, and how long they stay there. For example, in the figure below we see that videos tend to shoot to the top of the rankings but not last a long time. In contrast TwoXChromosomes tends not to rank as highly as videos but does tend to stay around a bit longer in the top 25.

The code includes have different alignment options to learn different things about the data. In the above plot, where all time series are aligned together at their first point, we learn about how a threads in a certain subreddit tend to over time. But we might also ask how different subreddits behave at different times of the day. We can see that information below, in an example plot. In particular, we see that personalfinance is largely popular in the mid to late evening hours, whereas nottheonion is popular almost every hour of the day. nottheonion is overall also more popular, achieving higher ranks and a higher overall count compared to personalfinance.

Anyone using or modifying the C++ code should be aware of a few particulars:

There is no error checking in the Hist2D class. You must be sure that the max and min values you use to initialize a Hist2D will cover all your data. If not, you will have a run-time error that could waste a lot of time. The decision was made to avoid error checking in favor of speed.
In NumericVector, summary statistics are all computed as doubles. If you are using especially troublesome arithmetic types, such as long ints, these summary statistics may not cast correctly. You must check these values or insert error checking and appropriate casting. Again the decision was made to keep the code light weight and speedy.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
C++_processing_files		C++_processing_files
R_plotting_files		R_plotting_files
data		data
.gitignore		.gitignore
README.md		README.md
compare1.jpg		compare1.jpg
compare2.jpg		compare2.jpg
lines		lines

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C++_processing_files

C++_processing_files

R_plotting_files

R_plotting_files

data

data

.gitignore

.gitignore

README.md

README.md

compare1.jpg

compare1.jpg

compare2.jpg

compare2.jpg

lines

lines

Repository files navigation

About

Releases

Packages

Languages

sunnysideprodcorp/Cpp_R_PlottingNonRectangularData

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages