Visualizing data with R and ggplot2 #606

Open
hawc2 opened this issue Mar 19, 2024 · 10 comments

@hawc2
Collaborator

hawc2 commented Mar 19, 2024

Programming Historian in English has received a proposal for a lesson, 'Visualizing data with R and ggplot2,' by @rogorido and @nabsiddiqui.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

  • Openness: we advocate for use of open source software, open programming languages and open datasets
  • Global access: we serve a readership working with different operating systems and varying computational resources
  • Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research-contexts
  • Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @rogorido and @nabsiddiqui to develop this Proposal into a Submission under the guidance of @semanticnoodles as editor.

The Submission package should include:

  • Lesson text (written in Markdown)
  • Figures: images / plots / graphs (if using)
  • Data assets: codebooks, sample dataset (if using)

We ask @rogorido and @nabsiddiqui to share their Submission package with our Publishing team by email, copying in @semanticnoodles.

We've agreed a submission date of April. We ask @rogorido and @nabsiddiqui to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by April, @semanticnoodles will attempt to contact @rogorido and @nabsiddiqui. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

@semanticnoodles

I confirm @rogorido and @nabsiddiqui shared with me access to their repository containing all the required files, and that I handed them over to @anisa-hawes to allow the publishing team to generate the preview, thanks.

@anisa-hawes
Contributor

Hello Giulia @semanticnoodles, Igor @rogorido and Nabeel @nabsiddiqui,

Many thanks for sharing the lesson submission materials with me. I've now checked the Markdown file and added some key elements of metadata. I've also checked the accompanying images and assets, ensuring each element meets our requirements.

You can find the key files here:

You can review a Preview of the lesson here:

--

A few initial notes:

  • I've made a slight adjustment to the Header sizes used in the lesson. Our typesetting convention is that ## Header 2 is the largest.
  • I've added placeholder alt_text + captions for each of your images. We have committed to providing alt-text for all figure images, plots and graphs included in our lessons, so you'll need to add this as part of your revisions. These notes on Descriptive Alt text may be useful to you.
  • I've checked to ensure that you both have the Write access you'll need to edit your draft directly. We ask authors to work on their own files with direct commits (we prefer that you don't fork our repo or use the Pull Request system in ph-submissions).
  • I imagine Giulia @semanticnoodles may have noted this too, but I noticed that you include both a .tsv and a .csv version of the dataset, although only the .csv appears to be used in the lesson. Is the .tsv alternative required too?

@anisa-hawes
Contributor

anisa-hawes commented Mar 20, 2024

Hello again Igor @rogorido and Nabeel @nabsiddiqui.

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit.

In this Phase, your editor Giulia @semanticnoodles will read your lesson, and provide some initial feedback. Giulia will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 1 <br> Submission
Who worked on this? : Publishing Manager (@anisa-hawes) 
All  Phase 1 tasks completed? : Yes
Section Phase 2 <br> Initial Edit
Who's working on this? : Editor (@semanticnoodles)  
Expected completion date? : April 20
Section Phase 3 <br> Revision 1
Who's responsible? : Authors (@rogorido + @nabsiddiqui) 
Expected timeframe? : ~30 days after feedback is received

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

@rogorido
Collaborator

@anisa-hawes Thanks for your comments. As for the tsv file: no, it is not required. It can be deleted.

I'll add the alternative captions. Thanks.

@rogorido
Collaborator

rogorido commented Apr 8, 2024

I added captions and alt texts (10a6a9e), but Nabeel should check whether they sound 'English' enough...

@semanticnoodles

Hello @rogorido and @nabsiddiqui,

here follows my preliminary feedback; I am aware it is quite extensive, but I believe these indications could help you strengthen your tutorial. If you need any clarification, please do not hesitate to ask!

Overall feedback

In general, your tutorial provides valuable guidance on navigating and producing a wide range of visualisations, effectively walking through the various features of ggplot2. The piece meets the accessibility and inclusivity goals of the Programming Historian fairly well, and in most cases the language is easy to understand and straightforward. However, some elements need further work, mostly falling under two intertwined aspects discussed in the following paragraphs.

Usability: Enhancing the logical structure of the lesson

In my opinion, this is the most critical point to consider. The tutorial lacks a cohesive element to tie its components together, and the organisation of the content could benefit from a more linear and less convoluted approach. The case study you propose (sister cities) seems to be just a tool to obtain a series of visualisations. This is fair enough, but it could benefit from further methodological contextualisation and unpacking: the people following your tutorial may not be historians, nor have a clear understanding of the methods you are using -- although they may be familiar with R.

In terms of improving the overall content, I think there are two possible directions for you to consider: either revising the content to follow a visualisation task-based narrative or placing more emphasis on the structure of the case study. The first option would privilege the visualisation tasks (but still require some methodological support for the case study), while the second would require you to generate stronger and sharper research questions from the case study, to be answered (at least in part) by the visualisation tasks. I think @nabsiddiqui did a very good job of structuring the content in the lesson Data Wrangling and Management in R, so I would recommend keeping that in mind as a reference.

The title of the proposal could benefit from being more specific - or at least mentioning the context of application. The table of contents looks unbalanced: the headings and their actual wording could be better aligned with the content they cover, and the nesting could be more linear.

You give very clear information about the concept of the grammar of graphics - this is really the cornerstone of understanding how ggplot2 is designed. I really appreciate you explaining this and including many useful resources, although I think they could be arranged more organically, instead of including relatively short hints throughout the tutorial, as they tend to overshadow the walkthrough steps on several occasions.

Sustainability: Critically reviewing the data analysis narrative

The dataset looks more than adequate for the visualisation tasks you have set as objectives, but the data narrative and its wording could benefit from further tuning. What you offer in this lesson is mostly visualisation of data distributions and there is little statistical testing involved. As your topic is sister cities, it makes perfect sense to talk about relationships, although what you observe are mostly trends or tendencies that you could try to explain through further research; sometimes you clearly point that out and sometimes it looks rather implicit. I think this is just a matter of fine-tuning the language, nothing more.

Section-specific feedback

Para stands for paragraph number; please refer to the preview generated by @anisa-hawes

Introduction, Lesson Goals and Data

  • Para 1, line 2: there is an extra )
  • The lesson’s goals could be more specific (you could pick outcomes that have greater resonance than adding meaningful labels to plots)
  • No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a few words about it here.
  • Review the heading accordingly with the edits.

ggplot2: General Overview

  • This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
  • A couple of words about the Tidyverse here would better contextualise the workflow.
  • Para 7 could be added to the Additional Resources section.
  • Para 8 could mention the arguments more strategically – review it for better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting them to match the elements you explain thoroughly.
  • Review the heading accordingly with the edits.

Sister cities in Europe

  • Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
  • The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
  • The research questions listed here are somewhat aligned with the steps you propose. I recommend reviewing them for greater consistency.
  • Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with readr

  • If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
  • Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
  • Para 16 would fit better in the previous section.
  • Consider raising the level of this heading and review it accordingly.
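
To illustrate the head(eudata) suggestion, here is a minimal sketch. It assumes the readr package is installed; the column names and values below are invented stand-ins, not the lesson's actual dataset.

```r
library(readr)

# Hypothetical stand-in for the lesson's CSV: every name and value is invented.
csv_text <- "origincity,destinationcity,dist,eu
City A,City B,620.5,EU
City C,City D,850.2,Non-EU
City E,City F,780.0,EU
City G,City H,1210.7,EU"

# I() tells read_csv() to treat the string as literal data, not as a file path.
eudata <- read_csv(I(csv_text), show_col_types = FALSE)

# Show the first rows so readers can see the columns and what one observation looks like.
head(eudata, n = 3)
```

In the lesson itself, the read_csv() call would point at the real CSV file, and head(eudata) would follow immediately after loading.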

Creating a bar graph

  • IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
  • Paras 20-23 could be more focused on the walkthrough; moving para 23 to directly after the bar plot is produced could improve clarity.
  • Para 30 could use a bit more details about the interpretation of the results. If you plan
  • Review the heading accordingly with the edits.

Other Geoms: Histograms, Distribution Plots and Boxplots

  • Para 31, penultimate line: comma missing space afterwards.

  • Para 33, please review this for clarity (here you should explain once and for all why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

    This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic

  • Para 36, please review it for clarity (the reason you employed ECDF is only implicit).

  • Para 41, same issue: you refer to ANOVA without explaining why you see it as a viable statistical test, which cuts the paragraph short.

  • Review the heading accordingly with the edits.
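
As a rough sketch of the points on paras 33 and 36, the snippet below contrasts a raw histogram, a histogram on a log10 axis, and an ECDF. The data are simulated right-skewed values standing in for the dist column; they are an assumption for illustration, not the lesson's data.

```r
library(ggplot2)

set.seed(42)
# Simulated, strongly right-skewed "distances" (hypothetical stand-in for dist).
toy <- data.frame(dist = rlnorm(500, meanlog = 6, sdlog = 1))

# Raw scale: a few very large values stretch the axis and squash most bars together.
p_raw <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 30)

# log10 axis: bins are equal-width in log space, so the bulk of the data becomes
# readable, at the cost of an axis that is harder to interpret in plain kilometres.
p_log <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# ECDF: no binning choice at all; the curve shows the share of values at or below
# each distance.
p_ecdf <- ggplot(toy, aes(x = dist)) +
  stat_ecdf()

# Base R equivalent of reading one point off the ECDF curve.
share_below_5000 <- ecdf(toy$dist)(5000)
```

Filtering out values above a threshold would be the other option the paragraph mentions; it keeps the axis linear but silently discards observations, which is one reason neither method is ideal.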

Manipulating the Look of Graphs

  • This section would more logically follow the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
  • Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
  • Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
  • Para 55, review for conciseness (sometimes less is more).
  • Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

  • Para 65, please review for directness: what is the advantage of using a continuous scale? There is also a repetition in the last line (“represent the distance”).
  • Para 68, review for accuracy: as phrased, it suggests that ggplot2 does not use discrete colour scales at all.
  • Para 70, would better fit in the Additional Resources section.
  • Para 74, review for accuracy.
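
To make the point on para 68 concrete, here is a small sketch showing that ggplot2 offers both continuous and discrete colour scales, chosen according to the type of the mapped variable. The data frame and its column names are hypothetical.

```r
library(ggplot2)

# Invented toy data: one numeric variable (dist) and one categorical variable (eu).
toy <- data.frame(
  x    = 1:6,
  y    = c(2, 4, 3, 6, 5, 7),
  dist = c(50, 200, 800, 1500, 3000, 6000),
  eu   = rep(c("EU", "Non-EU"), times = 3)
)

# Numeric variable mapped to colour: a continuous scale (here viridis) is appropriate.
p_cont <- ggplot(toy, aes(x = x, y = y, colour = dist)) +
  geom_point() +
  scale_colour_viridis_c()

# Categorical variable mapped to colour: a discrete scale (here a ColorBrewer palette).
p_disc <- ggplot(toy, aes(x = x, y = y, colour = eu)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")
```

Applying a continuous scale function to a categorical variable (or the reverse) raises an error, which may be a useful detail for readers.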

Faceting a Graph

  • This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
  • Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very clear. Consider explaining plainly what faceting is.)
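
A plain explanation of faceting could be paired with something like the sketch below: the same plot is drawn once per category, in side-by-side panels. The data and the origincountry column are invented for illustration.

```r
library(ggplot2)

# Invented toy data: distances for two hypothetical countries of origin.
toy <- data.frame(
  dist          = c(120, 340, 560, 900, 1500, 2300, 410, 760),
  origincountry = rep(c("Country A", "Country B"), each = 4)
)

# facet_wrap() splits the data by origincountry and draws one histogram per value,
# using shared axes by default so the panels can be compared directly.
p_facet <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ origincountry)
```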

Themes: Changing Static Elements

  • Like the previous one, this section would more logically follow the Other Geoms section.

Extending ggplot2 with Other Packages

  • Para 84, an extra comma prevents the link for Ridgeline plots from rendering.
  • Like the previous one, this section would more logically follow the Other Geoms section.

Additional Resources

  • Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

  • Please homogenise the use of capitalisation in the headings (with the exception of ggplot2, which is always lowercase, but you know that 😄)
  • Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Thank you for the great work done so far!

@rogorido
Collaborator

@semanticnoodles thanks for your extensive comments. I will have a look at the enhancements you're proposing in the coming days.

@anisa-hawes
Contributor

What's happening now?

Hello Igor @rogorido and Nabeel @nabsiddiqui. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @semanticnoodles's initial feedback. You can make direct commits to your file here: /en/drafts/originals/visualizing-data-with-r-and-ggplot2.md. @charlottejmc or I are here to help if you encounter any practical problems!

When both of you + Giulia are happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 2 <br> Initial Edit
Who worked on this? : Editor (@semanticnoodles) 
All Phase 2 tasks completed? : Yes
Section Phase 3 <br> Revision 1
Who's working on this? : Authors (@rogorido + @nabsiddiqui)  
Expected completion date? : May 17
Section Phase 4 <br> Open Peer Review
Who's responsible? : Reviewers (TBC) 
Expected timeframe? : ~60 days after request is accepted

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

@semanticnoodles

Hello Igor @rogorido and Nabeel @nabsiddiqui, I hope you are doing well!

Just checking in with you about the draft revision (Phase 3 / Revision 1) as the deadline of the 17th of May has passed. If you need some extra time let me know approximately how much, so we can set up a new deadline -- and @anisa-hawes or @charlottejmc can update the Mermaid timeframe.

If you have doubts or need any clarification, please do not hesitate to keep in touch.

@nabsiddiqui
Collaborator

Hello @semanticnoodles,

I have tried to rework a lot of the tutorial. I feel that changing some of the headings will make the flow more obvious. Let me know if it makes sense the way I have done it or if there should be additional changes. Here is some of what I reviewed based on your timeline. The rest I will leave to @rogorido unless he has an objection:

Introduction, Lesson Goals and Data

  • Para 1, line 2: there is an extra )
  • The lesson’s goals could be more specific (you could pick outcomes that have greater resonance than adding meaningful labels to plots)
  • No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a few words about it here.
  • Review the heading accordingly with the edits.

ggplot2: General Overview

  • This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
  • A couple of words about the Tidyverse here would better contextualise the workflow.
  • Para 7 could be added to the Additional Resources section.
  • Para 8 could mention the arguments more strategically – review it for better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting them to match the elements you explain thoroughly.
  • Review the heading accordingly with the edits.

Sister cities in Europe

  • Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
  • The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
  • The research questions listed here are somewhat aligned with the steps you propose. I recommend reviewing them for greater consistency.
  • Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with readr

  • If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
  • Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
  • Para 16 would fit better in the previous section.
  • Consider raising the level of this heading and review it accordingly. (Felt it was better at this level)

Creating a bar graph

  • IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
  • Paras 20-23 could be more focused on the walkthrough; moving para 23 to directly after the bar plot is produced could improve clarity.
  • Para 30 could use a bit more details about the interpretation of the results. If you plan
  • Review the heading accordingly with the edits.

Other Geoms: Histograms, Distribution Plots and Boxplots

  • Para 31, penultimate line: comma missing space afterwards.

  • Para 33, please review this for clarity (here you should explain once and for all why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

    This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic

  • Para 36, please review it for clarity (the reason you employed ECDF is only implicit).

  • Para 41, same issue: you refer to ANOVA without explaining why you see it as a viable statistical test, which cuts the paragraph short.

  • Review the heading accordingly with the edits.

Manipulating the Look of Graphs

  • This section would more logically follow the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
  • Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
  • Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
  • Para 55, review for conciseness (sometimes less is more).
  • Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

  • Para 65, please review for directness: what is the advantage of using a continuous scale? There is also a repetition in the last line (“represent the distance”).
  • Para 68, review for accuracy: as phrased, it suggests that ggplot2 does not use discrete colour scales at all.
  • Para 70, would better fit in the Additional Resources section.
  • Para 74, review for accuracy.

Faceting a Graph

  • This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
  • Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very clear. Consider explaining plainly what faceting is.)

Themes: Changing Static Elements

  • Like the previous one, this section would more logically follow the Other Geoms section.

Extending ggplot2 with Other Packages

  • Para 84, an extra comma prevents the link for Ridgeline plots from rendering.
  • Like the previous one, this section would more logically follow the Other Geoms section.

Additional Resources

  • Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

  • Please homogenise the use of capitalisation in the headings (with the exception of ggplot2, which is always lowercase, but you know that 😄)
  • Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Other

  • Change Title to be More Descriptive

Projects
Status: 3 Revision 1