--- 
title: "Cobey Lab Handbook"
#author: "Sarah Cobey"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: cobeylab/lab_handbook
description: "Handbook of the Cobey Lab at the University of Chicago"
---

# Preface {-}

This handbook is partly for prospective or new lab members who want to know how we do things. The first chapters, where we lay out our principles and general style of work, are most relevant here. 

The main purpose of this handbook, however, is to help current lab members share knowledge, develop skills, and get things done. Contributions welcome.

***Contributors:** 
By Sarah Cobey. The Midway section was originally written by Ed Baskerville, with updates by Rachel Oidtman. The coding section is by Alex Byrnes.*

<!--chapter:end:index.Rmd-->

# Why we work {#why}

As a lab, we have two equally important goals: 

1. to perform high-quality research and
2. to create an environment that accelerates the growth of scientists and improves the practice of science. 

Performing high-quality research means that we undertake meaningful problems, investigate them in a rigorous way, and promptly publish our results so that others can easily pick up where we leave off.

Creating an environment that promotes the growth of scientists means that we support the expansion of our own and one another's research capacities, especially through persistent effort, curiosity, and constructive criticism.

These practices in themselves should improve the way science is done, but we also have broader responsibilities to promote equity, limit unnecessary struggle and wasted resources, and perform research with the public good in mind.

This handbook outlines practices to help us achieve these goals. We'll always be improving our methods! If you work here, you can help.

<!--chapter:end:01-why.Rmd-->

# How we work

## Principles

**Do ambitious research.**
Research always seems to take more time than it should, so spend your time on important questions.
Think hard about what *should* be done, not only what can be done.
Try not to let others define subfields and questions for you.
Be deeply practical in evaluating your progress and choosing your next steps, but work toward lofty aims.

**Learn fast and change direction when necessary.**
Research involves making mistakes, or at least doing things that seem really dumb in retrospect.
Learn as much as possible from your failures. 
(Could you have found that bug earlier? Learned about that other technique earlier?)
Do not shame yourself for them.
Instead, admit them, and document this learning for yourself and others by talking about it or potentially adding some advice to the handbook.
Note that failing is not the same thing as not having the result you wanted---it's a good day when your hypothesis is not supported and you get to learn something about how the world works.
Frequently reevaluate your approach and the direction of each project, and take initiative in doing this.
Take initiative as a collaborator (middle author) too.

**Know your corner of the literature.**
It makes you much smarter and can save enormous time in the long run.
It also makes it easier to spot good opportunities and unanswered questions.
Knowing the history of work on your problem inside and out is a requirement for first authors.
Develop a scientific reading habit if you don't have one yet.
As a general guideline, on average, graduate researchers and postdocs should be reading five papers a week, and skimming more.

**Be open to collaboration, and respect your collaborators.**
Getting anything worthwhile done in research requires learning from others, including through papers, talks, and whiteboard time.
Be proactive in thinking about who might have relevant expertise.
Ask for help, give help, and carefully acknowledge the contributions of others.
Clarify expectations when you start on a project: make agreements explicit (and for important things, in writing), with expectations and timelines, and be reliable.
For instance, let your collaborators know when they can expect to hear from you with new results, drafts, etc.
These principles hold for interactions inside and outside the lab.
By default, you should think of yourself as a collaborator on every project in the lab, and remain engaged.

**Communicate assertively.**
It's nice to hear from other people that they've benefited from your analysis, timeliness, criticism, etc.
Tell people whenever you can that you like what they're doing.
Consider emailing strangers when you like their paper or talk, and explain why.
On the flip side, it's frustrating to learn from a third party that someone is unhappy with something we said or did.
Assertive communication means we give each other direct, constructive feedback if we think something isn't right. 
You can trust me not to negatively describe your behavior to others without speaking to you first.
More broadly, if you think something can be improved, speak up to the right people before complaining to others.
Be constructive by criticizing the idea, analysis, or behavior, rather than the person.
Communicating assertively and kindly usually takes practice.

**Don't be too narrow.**
Take time to play intellectually.
Participate in departmental seminars, go to talks in other departments, meet with people who seem to be doing interesting things, and read exciting papers that might not be obviously related to your current projects.
(Graduate researchers should aim to attend a departmental seminar and journal club each week.)
Start journal clubs and groups that you wish existed and would make time for.
Try to balance exploration with work on existing projects.
I don't know the right balance here, but it's worth trying to figure it out.

**Work hard but sustainably.**
Figure out sustainable habits for effective research. 
What is sustainable is personal: avoid blindly adopting others' criteria.
Focus on habits, such as working set hours each day, before benchmarks (e.g., "Publish", "Get speaking invitation", "Be famous!!!11!!" etc.).
Resist the temptation to run from one deadline to the next, and think instead about how to make regular progress.
(Do this especially if you (i) are fresh out of undergrad or (ii) have never done it before because you think you work best under pressure.)
Aim for 40 hours of focused work per week.
If you're not happy with your progress or productivity, *there's no shame in asking for help or ideas from others*, including your advisor.
For what it's worth, I do think it's possible to do great research while having a life. (I've seen others do it... lolz)
A great resource for sustainable habits is the [National Center for Faculty Development and Diversity](https://www.facultydiversity.org/). 
They have online classes, weekly newsletters, and writing support groups. 
The University of Chicago has a subscription, so you should be able to get free access.

**Be accountable.**
All work in the lab is collaborative, involving the blood, sweat, and tears (and hopes and dreams!) of multiple people, and often indirectly taxpayers.
Respect all these contributions by being a reliable, involved scientist.
Let collaborators know if your contributions will be delayed, if you think something should be done differently, or if you have concerns about quality.
Always look for better hypotheses and approaches.
Speak up if you see something that's not right.
Remember we'll all be dead soon enough, and this is our opportunity to help others.

```{r danse-macabre, out.width='50%', echo=FALSE, fig.show='hold', fig.align="center", fig.cap='A contrasting time scale of research'}
knitr::include_graphics("images/danse-macabre.png")
```

## Work hours

I will not judge your progress based on the hours you keep in the lab: what matters most is that you make substantive progress on the lab's goals.
I encourage you to figure out how you work best.
You thus have broad freedom to choose how you work, provided you communicate your plans to others and see the projects through.
But because we're human, and it's nice to see other, non-digitized human faces regularly, aim to spend at least three days a week (for most of the workday) in lab. 
You don't have to communicate your schedule with me in advance, as long as you're around roughly this much and show up for meetings.
Undergrads and out-of-state lab members will have different arrangements, of course.

Take vacations! 
Take good ones, show us the photos, and bring back tea, chocolate, and/or fine liquor.
Try to use all your vacation time. 
You don't need approval from anyone before you select dates, but please try to give collaborators advance notice and consider their schedules.
Mark the days you plan to be away on the lab calendar.

Please stay home if you're sick. You will be judged for this one.

Non-hourly, benefits-eligible employees (including graduate students) are entitled to various forms of parental, personal, and family leave, and I encourage you to take them if you need them.
If you need other accommodations, please let me know.

## Workspace

The lab space is supposed to help you work efficiently and happily: I want it to be a place where people can reliably go to get stuff done.

* Keep the main room quiet. 
If others are present, have meetings (in person or via Skype) in an adjacent conference room.
* Please feel free to customize your space. Adjust the location and height of your desk and file cabinet as you see fit. Let me know if you'd like a privacy screen, a fan or space heater, etc.
* Check with others before bringing pets to work.
* Consult others before making dramatic changes to the lighting or temperature.
* Keep things clean. Wipe your desk. Wipe the kitchen counter. Do not wipe crumbs on the floor. Clean up spills. Gently encourage others to do the same. Bad habits kill mice.
* To make the room easier to clean, avoid letting your stuff overflow past your desk. Don't leave piles of stuff on the floor for more than a few days.
* Please let me know if there are ways the space can be more comfortable, or if there are particular things (e.g., new computer or software) that would improve your work.
* Lock the door if you leave and no one is in the lab.

## Communication
Please limit the use of email for research questions and discussions.
Use Asana instead---it makes things much easier to find in the long run, and it has no overeager spam filter.
Asana is also the best tool for lab announcements and discussions.

I do not expect you to check email or Asana on weekends, vacations, sick days, or holidays. 
I'll always try to respond to your communications within 24 h, excluding weekends, vacations, sick days, and holidays.
In general, you should check email and Asana a few times a day and try to respond to urgent requests within an hour or two (M-F), but I expect urgent requests to be few and far between. 
They will probably have some warning (e.g., an impending paper or grant deadline).

If I'm in my office and the door is open, you're welcome to come in to talk.
If the door is closed, it means I'm working, and it's best to communicate via Asana.

## Weekly check-ins

Most weeks, you'll meet individually with me to discuss research and occasionally other topics.
The agenda of these meetings is largely up to you.
However, I ask you to come prepared with slides, an updated summary, or a notebook (Rmd, Jupyter, etc.) concisely describing progress since the last meeting. 
You can choose whichever format works best for you, but somehow, your notes should clearly state (i) what goals you had set last week for this week, (ii) your progress on each goal, and (iii) what you think comes next.
(This structure is really helpful for me.)
You should also be prepared to show system and unit tests, or some kind of validation, to convince me your research results are correct.
The meeting is a time both to dig into the weeds and to think about the big picture.
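
As one illustration of the kind of validation mentioned above, here is a minimal sketch in Python. The function `mean_weekly_cases` and its inputs are hypothetical stand-ins for a real analysis step. What matters is the pattern: a known-answer check, an edge-case check, and a check that the code recovers a known truth from simulated data.

```python
import random


def mean_weekly_cases(counts):
    """Hypothetical analysis step: average number of cases per week."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(counts) / len(counts)


def test_known_answer():
    # Unit test: a tiny input whose answer can be checked by hand.
    assert mean_weekly_cases([2, 4, 6]) == 4


def test_empty_input_is_rejected():
    # Unit test: bad input should fail loudly rather than return nonsense.
    try:
        mean_weekly_cases([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


def test_recovers_known_truth():
    # System-style check: simulate data with a known mean (assumed here to
    # be 10) and confirm the estimate lands close to it.
    rng = random.Random(1)  # fixed seed so the check is repeatable
    true_mean = 10.0
    simulated = [rng.gauss(true_mean, 2.0) for _ in range(10_000)]
    assert abs(mean_weekly_cases(simulated) - true_mean) < 0.1
```

A test runner such as pytest would discover the `test_` functions automatically; they can also be called directly from a notebook.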

I'll try to remind you of this, but one of my main goals for each meeting is to learn how I can best support you, during the meeting and after. 
If our relationship needs adjusting, please let me know.
If you want to talk about career goals, that's fine too.
This is your time, but please be organized about it.

When I'm traveling, these weekly meetings may need to be rescheduled or occur over the phone.
Please always feel free to request a meeting when I am traveling, unless I'm on vacation.

## Lab meetings

We'll meet weekly as a lab.
Meetings start with various announcements of abstract deadlines, cool upcoming talks, new papers, etc., and then we briefly update one another on our research (roughly 1-5 min per person).
The point of these updates is to practice describing our research and especially to keep each other involved in our work, which includes providing helpful suggestions.
The rest of the meeting is usually dedicated to an in-depth presentation and discussion of one of our research projects or a discussion of a paper.
Plan to present your research once per quarter and to lead at least one paper discussion per quarter.
Pick papers at least a week in advance so people have time to read them.
Everyone is expected to show up having read and critically thought about the paper.
If someone is presenting, everyone else is expected to make helpful suggestions.

## Daily "Standup"

We use a Slack app to do quick check-ins on a daily basis. In the morning, everyone receives a prompt to list the work they're planning for the day. Everyone's activities then appear at 10 AM Central for others to read. Checking in is optional and skipping the prompt is acceptable with or without explanation. We use this to start conversations, and to build the habit of formulating our intentions in advance.

## Reproducible research

You have broad freedom in most aspects of how you work, but there are certain protocols we follow to keep our work reproducible, accessible, and organized. 

**Reproducible** means that other researchers could use your notes and code to reconstruct your results precisely without guesswork or manual labor. All of the figures and results in any manuscript must be fully reproducible by executing one or a few scripts in a public GitHub repository. It should also be easy to reproduce intermediate results during development. In practice, this means we use version control, specifically [git](https://git-scm.com/book/en/v1/Getting-Started-About-Version-Control).

**Accessible** means that (i) all of your code, including small scripts, is maintained in a git repository that is regularly (e.g., daily) synced to the lab's GitHub account; (ii) all raw data and major results are stored on Midway projects/cobey (unless other arrangements are required by IRB); (iii) project management is visible to all lab members on Asana; and (iv) you regularly back up your laptop using an external hard drive *and* CrashPlan Pro.

**Organized** means that you keep your project files organized, use version control, document your code, include unit and system tests, use Asana and/or notebooks to record all decisions in your analysis from day to day and week to week, and refactor code when it stinks. It also means you communicate progress promptly to collaborators in meetings and (for external collaborators) emails.
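
To make "executing one or a few scripts" concrete, here is a minimal Python sketch of a single entry point that regenerates results from raw data. Everything named here (`run_analysis`, the plain-text input format, the `results.json` manifest, the default seed) is an assumption for illustration, not a lab convention. The ideas it shows are a fixed random seed, no manual steps, and a record of exactly which input produced which output.

```python
import hashlib
import json
import random
from pathlib import Path


def run_analysis(raw_path, out_dir, seed=1):
    """Regenerate all results from the raw data file in one call."""
    raw_path, out_dir = Path(raw_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumed input format: whitespace-separated numbers, e.g. one per line.
    values = [float(tok) for tok in raw_path.read_text().split()]

    # Seeded bootstrap of the mean, so reruns give identical numbers.
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(200)
    )

    manifest = {
        "raw_sha256": hashlib.sha256(raw_path.read_bytes()).hexdigest(),
        "seed": seed,
        "mean": sum(values) / len(values),
        # Rough 95% interval from the 200 sorted bootstrap means.
        "interval": [boot_means[5], boot_means[194]],
    }
    (out_dir / "results.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

Running the function twice on the same file should give identical manifests. In a real project the manifest could also record the git commit hash, so each result can be traced to the code that produced it.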

Specific suggestions are in Section \@ref(so-you-wanna).


<!--chapter:end:02-how.Rmd-->

# Performance

## Reviews
There should be informal reviews at every weekly meeting.
The point is mutual feedback, i.e., we can discuss your progress and develop achievable goals for the next few days to years, and you can tell me if there are areas in which I can provide better help.
If you would like more formal progress reviews, let me know.
Please also let me know if you are ever worried about your progress.
Postdocs and salaried researchers will get formal annual reviews as required by the Biological Sciences Division, but they are really secondary to the regular meetings (i.e., there should be no surprises).

## My commitments

* Meet with you at least biweekly (usually weekly) to discuss research and professional opportunities.
* Help give you an accelerated introduction to the field.
* Provide rapid (within days) critical feedback on research ideas and drafts---*with advance notice!*
* Help you establish relationships with other scientists in the field.
* Promote your work in conferences and meetings.
* Help you attend conferences and meetings.
* Help identify areas of professional growth.
* Provide teaching and mentoring opportunities, if desired.
* Fund you for at least $n$ years (as agreed), assuming steady progress.
* Be a trustworthy, reliable, honest, hard-working, constructive, respectful, and communicative colleague.
* (For postdocs) Help you identify a line of research to continue when you leave the lab.

## Basic expectations

* Follow the lab's principles (Section \@ref(principles)) and all our described work practices, including the project management and programming techniques described in Section \@ref(so-you-wanna).
* Take full intellectual ownership of your research, i.e., think hard about whether you and your collaborators are doing the right thing, search for relevant papers, and push your projects forward at a good pace.
* Develop annual and long-term professional goals as soon as you join the lab, and discuss them with me then and regularly thereafter. Let me know whenever your goals change. (It's okay if your long-term goals are amorphous; just let me know.)
* Work steadily, understand how you work, and let me know how I can help you work better. (See Philip Guo's [list of performance bounds](http://www.pgbovine.net/human-bounds.htm) for examples.) 
Please let me know especially if my availability, the environment, software, or hardware are slowing you down.
* Learn from your mistakes. Programming bugs, bad writing, awkward slides, and undiscovered papers are all an unfortunate part of research. Forgive yourself *and* take corrective action to reduce the error rate in the future. Of course, the optimal error rate is usually not zero... The only *real* mistakes are blowing off or ignoring what people (reviewers, coauthors, committee members, me, etc.) are telling you, ignoring data related to your research or performance, and being a jerk.
* Perform lab service, as agreed upon (e.g., maintain the lab calendar, order office supplies, water the plants).

## Graduate researchers
I've listed below the skills I think graduate researchers should have by the time they defend their PhD. 
Your advisor and committee will help steer you, but you are in control.
(I kind of dislike the "student" convention, tbh. You're scientists, just less experienced ones.)

* *Intellectual independence and mastery*
    * Be able to define a coherent field of study, including the progress that has been made in it and the problems that remain. 
    This requires following the literature by regular, self-directed reading (at least five papers a week, conservatively, on average).
    * Have enough statistical and general knowledge to assess the strength of evidence of (almost) any study or general claim in this field
    * Propose and carry out tractable, meaningful studies
    * Identify new questions you want to answer and have an idea of how to address them
    * Have a demonstrated history of acquiring skills through self-driven instruction and self-initiated collaborations
* *Intellectual contributions*
    * Publish at least two papers on which you're first author.
    These papers should be submitted by the time you defend. Note that this is not a requirement of the UChicago MSTP or E&E programs, but I think it is an important minimal target. 
    * Give talks outside the department and handle questions about your work.
    * Collaborate on projects on which you're not the first author.
    * Ask public questions during conference talks and seminars.
* *Toughness*
    * Practice feeling clueless regularly and getting over it, especially through learning.
    * Adapt projects to deal with unexpected outcomes.
    * Learn how to handle diverse forms of criticism and professional conflict.
* *Service*
    * Be able to criticize constructively in any situation.
    * After publishing, start reviewing manuscripts for journals.
    * Understand social and political context for scientific research.
    * Practice sharing your work with broader audiences, e.g., via blog posts, talks to the public, and interviews.
* Seek funding opportunities and apply for grants.
* Establish your reliability in communicating with committees, collaborators, and administrators in a timely, respectful way, and follow through on your commitments.

Please note that you are ultimately responsible for ensuring you are meeting the requirements for your degree. The Student Advisory Committee (SAC) and later your thesis committee will help you with the planning, but you should take initiative in scheduling and planning ahead.

## Postdoctoral researchers

Generally, postdocs should have facility with all of the skills listed above and should also:

* Develop new research projects and manage external collaborations.
* Drive projects forward in consideration of existing and ongoing research in the field.
* As negotiated, potentially take a major role in managing research performed under federal contracts, including the completion of monthly progress reports.
* Mentor junior researchers. This can be formal, but postdocs should also be providing more constructive comments across the board to other researchers, including junior ones.
* Teach selectively, if interested in a teaching position.

<!--chapter:end:03-performance.Rmd-->

# So you wanna...

## Join the lab

We welcome applications from skilled, ambitious, and independent researchers at all levels, as long as they are burning to do good research promptly.

**Undergraduate students** interested in joining the lab generally need to be proficient (not brilliant) in at least one programming language, such as Python, R, Matlab, C(++), or Java, and have some biological background or curiosity in at least one research area. You should also be proficient in basic statistics.

**Prospective graduate students** are encouraged to review the details of the graduate program and the research described here. They may wish to consider working as research assistants in the lab to ensure a good fit before applying to the graduate program.

**Rotation students** must start with strong quantitative and some programming skills.

**Senior researchers, research programmers, and postdoctoral fellows** are also welcome to contact Sarah about opportunities for support and collaboration. We are especially looking for more postdocs to study the evolutionary dynamics of adaptive immunity.

All who are interested in joining the lab should explain in their initial communication what skills they could bring to the lab and what they hope to obtain from collaborating. It is essential to have read recent papers in the relevant research area, including some from our group, and to have an idea of the kind of questions or problems that excite you.

## Do some research

```{r owl-plot, out.width='50%', echo=FALSE, fig.show='hold', fig.align="center", fig.cap='Basically, except less sequential'}
knitr::include_graphics("images/owl.png")
```

**Identify a good question:** This can take some time. Talk with other people, read, keep talking, study patterns, reason from first principles, and keep talking. What phenomena are you trying to explain? Generate many questions. Consider the next step, if necessary, in picking one. 

**Develop a game plan:** Posit some answers to your question. What do those answers imply? What patterns or processes are (in)consistent with them? How can you test them? (And how can you make sure you're testing them correctly, i.e., that your analysis is correct?) Draft some approaches. Prioritize a few. Add the actionable tasks to Asana. (You can keep backup ideas there too in a separate section.) Read up on [project management](http://thenewpi.blogspot.com/2018/03/why-you-should-care-about-project.html).

**Set up a notebook:** Create a version-controlled "lab notebook" in which to record your progress, which includes your thinking, notes from papers, and your analyses. There are many fine ways to do this: what's most important is that the notebook is organized and that you use it. You could use Asana and Overleaf (LaTeX), an R markdown file, or a Jupyter notebook. The latter two are probably most seamless, but it doesn't matter too much. Be sure to keep files (notes, data, scripts) in a repository, synced to the lab's GitHub account. Everything you do should be traceable and reproducible in some way---no quick "one off" figures that exist only on your laptop.

**Understand context and constraints:** If you're working with data, there might be IRB restrictions on how it can be used, stored, and shared. Find out and comply. Also ask how the work is funded, if you do not yet know, and what kinds of reporting requirements and deadlines we may have. Contracts often require monthly progress reports; those for grants are less frequent. Identify any collaborators and make a plan for working with them.

**Have the right attitude:** As long as you're reasoning based on evidence, you're making progress. See [Schwartz (2015)](http://jcs.biologists.org/content/128/15/2745). Not all projects should move ahead. This is why it's useful to step back, reassess, and discuss your work with others. Revisit and revise your previous questions.


## Code well and efficiently

See the [coding handbook](coding-handbook.html#coding-handbook).

## Write good

### General advice

* One of the best ways to write good papers is to read lots of good papers. 
This is more comfortable than learning incrementally from rejections. Also, there are useful books and essays on the subject (see below). 
* For grants and papers, the central challenge is to articulate an interesting question and show how you have helped or will help answer it. Practice doing this from the beginning of your research project, as you sketch ideas and results in your notebook.
* **Be clear.** Use topic sentences. Assume your reader is an intelligent first-year graduate student, but with less time on her hands. Try to state things vividly and directly. Your writing will almost always improve if you try to explain your reasoning as transparently as possible.
* Focus on ideas, not people or studies. Avoid "Many studies have shown X." Just state what has been shown and give references.
* Be consistent. If you define a parameter, refer to it the same way throughout your paper. This holds for all sorts of annoying punctuation and formatting conventions. Channel the reader's attention into one clear, fascinating story, and let nothing distract from it.
* I like Claus Wilke's advice about knowing, when you sit down to write, if you are drafting or revising/editing. If you are drafting, don't worry so much about the flow. Just get the ideas down. Under no circumstances should you show me that draft, however.
* Recommendations: ["Why Academics Stink at Writing"](https://stevenpinker.com/files/pinker/files/why_academics_stink_at_writing.pdf), [*The Elements of Style*](https://www.bartleby.com/141/index.html) (which I really like, contrary to [Claus](https://serialmentor.com/blog/2017/11/12/move-over-Strunk-White)), ["Ten Simple Rules for Writing Research Papers"](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003453), [Sarnecka Lab's "Writing Workshop"](https://sarneckalab.blogspot.com/2018/07/writing-workshop-table-of-contents.html)

### Initial submissions

The following workflow seems to work well, but if you prefer another one, let me know.

1. Start yesterday by writing summaries of the research in your lab notebook as you go. You can probably write most of the methods and many results this way. You'll also accumulate the major points for the intro and discussion in your notes.
2. Identify a target journal in consultation with me and coauthors. Check the journal's instructions for authors so you know how to structure your manuscript.
3. When it is time to write the manuscript, set up a repository synced to Overleaf. Draft the figures and abstract first. Discuss the arc with me and potentially other coauthors. Sign up for a future lab meeting or theory meeting, in which we'll discuss the complete draft (which you'll need to distribute a week in advance). Propose an ordered author list and discuss with me. 
4. Next write an outline---with the main steps in the argument as topic sentences, so it's really possible to trace the whole story---with each result in its own subsection and figures inline.
5. Draft the results section. Ask another person in the lab for feedback and then revise, and then show me. *N.B. I have a lamentably hard time looking past poor construction and focusing on the ideas, so I appreciate it when the writing is clear, organized, and not too laden with typos, even at these early stages.*
6. Next write the methods, introduction, and discussion, and revise the abstract. Do not forget to acknowledge Midway (assuming you used it) and funders. Funders often require specific language. Investigate.
7. Ask for more feedback from a labmate. Revise. Share with me.
8. We'll then discuss the manuscript in a lab meeting, and revise. Now is a good time to propose to the coauthors a schedule for your sending and their reviewing a draft.
9. We only send drafts to coauthors and "friendly reviewers" when the writing is coherent and flows well. We do not want to waste their time. 
10. Ensure your repository is well documented and up to date, and that all the analysis---including the figures---can be run from the included code with minimal effort. We make the repository public when we submit the manuscript to a preprint server. Often it's useful to start a new repository rather than publish the one in which you developed the project. Double-check that you are not sharing protected data.
11. Ask a colleague to try to run your code, given the manuscript and access to the repository, and no other help from you.
12. When the final manuscript is ready for coauthors' approval, confirm with them (if you've not yet) their funding and preferred and non-preferred reviewers. Make sure they don't have a COI with any of the preferred reviewers.
13. Draft the cover letter, or whatever introductory text the journal requires.
14. When all coauthors have approved the manuscript, we submit it to a preprint server (without journal-specific typesetting), publish the repository, and submit to a journal.
15. Celebrate.

### "Mature" papers

1. Ideally at least 20 min have passed before we receive a decision from the journal. (That's the fastest rejection I've heard of someone receiving from *Nature*.) If we receive a rejection without review (a "desk rejection"), we probably need to improve our abstract, introduction, and/or cover letter. If we're rejected after review, we'll take 2-3 days to consider the reviews and make a plan for revisions. Be sure to communicate decisions to coauthors, if the journal does not email them automatically, and to let the coauthors know the plan for responding to the decision.
2. If the journal requests a revision, we track changes (using colored text) and write a polite and succinct reply. I like replies that are as self-contained as possible, so that months after having done the initial review, the reviewer can read over the reply and be sufficiently reminded of the context for each criticism that she or he doesn't need to reread the whole paper. In the reply, we describe exactly what has been changed in the paper and quote the changed text (with as much context as necessary, including corresponding line numbers). Ask other lab members for sample replies to get a sense of the tone and format.
3. It is to our advantage to revise quickly. With greater distance from the paper, reviewers can easily take issue with new parts. Novelty tends to decrease with time, even if nothing has been published in the interim.
4. Ask coauthors in advance if they could inspect revisions and the reply during some time window---they should know when to expect our draft.
5. When we resubmit to the journal, we *immediately* upload the revised manuscript (without journal-specific typesetting) to the preprint server. Once the manuscript is accepted, it's usually too late to submit a revision. We would then have to decide between paying exorbitant OA fees to the journal or leaving the article behind the journal's paywall, thereby encouraging people to access the outdated version. It's best if the accepted version is already on the server.
6. Remember that rejections and revisions are [part of the game](https://twitter.com/dsquintana/status/1053898526667739136).

## Email like a pro

* **Be concise.** 
Be clear if you are asking for something, or if you are simply giving information.
Try to minimize the number of back-and-forths required: instead of asking if someone is free to meet next week, list blocks of time you are available and propose a location. 
Make it easy to reply quickly.
* Rather than send large attachments, **send a link to a Box or Dropbox file**.
* **Be polite**. Being concise is part of being polite, but being polite also means using professional titles and spell-checking your email.
Striking the right tone can be hard sometimes. 
One error that very junior scientists sometimes make is being excessively deferential ("I was wondering maybe if you might consider...").


## Make nice figures

Mostly, see Claus Wilke's excellent online guide, [Data Visualization](https://serialmentor.com/dataviz/). 
Some immediate suggestions:

* Label your axes. All parameters should be spelled out and accompanied by their symbol (e.g., "Transmission rate, $\beta$").
* Save figures in vector formats.
* If many points are plotted over one another, consider semi-transparency or plotting densities.
* Minimize wasted space in figures while ensuring your axis limits are appropriate (e.g., that fractions ranging from 0 to 1 have y-axis ranges of [0,1]).
* Titles are frequently a waste of space, but be sure to include key information somewhere nearby, e.g., what the shaded area represents, how you assessed significance, etc.
* To increase accessibility, avoid relying on red/green contrasts.
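
As a minimal sketch of several of these suggestions, here is how they might look in Python with matplotlib (an assumption; the same ideas carry over to ggplot2 or any other plotting library). The data are simulated purely for illustration, and the point color is the blue from the Okabe-Ito colorblind-safe palette.

```python
import io
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

random.seed(1)

# Simulated data for illustration only: 2,000 heavily overlapping points.
beta = [random.uniform(0.1, 0.9) for _ in range(2000)]
fraction_infected = [min(1.0, max(0.0, b + random.gauss(0, 0.1))) for b in beta]

fig, ax = plt.subplots(figsize=(4, 3))

# Semi-transparency keeps dense regions readable.
ax.scatter(beta, fraction_infected, s=8, alpha=0.2, color="#0072B2")

# Spell out each parameter and give its symbol.
ax.set_xlabel(r"Transmission rate, $\beta$")
ax.set_ylabel("Fraction infected")

# A fraction ranges from 0 to 1, so the axis should too.
ax.set_ylim(0, 1)

fig.tight_layout()  # trim wasted space around the axes

# Save in a vector format. An in-memory buffer is used here to avoid writing
# to disk; passing a filename such as "figure.pdf" works the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")
svg_bytes = buffer.getvalue()
```

There is no title; the information a title would carry belongs in the axis labels and the caption.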

## Keep up with the literature

Keeping up with the literature involves two challenges: finding papers and reading them. I've described what I do, but there could be better solutions.

**Finding papers**: 

1. Use an RSS reader. Subscribe to major journals and bioRxiv and arXiv topics. Skim the titles and abstracts when you have a few minutes here and there. This will probably identify 90% of the *new* papers you might care about.

2. Set up Google alerts so you can get emails when particular papers and people are cited.

3. Do a good search in any area of interest so you can identify relevant papers published decades ago. It is amazing how rapidly phenomenal work can be forgotten by a field.

**Reading papers**:

1. Just block off the time on your calendar and do it.

2. Consider setting up a small reading group to discuss more challenging papers. (For easier papers, this can double the time it takes to read them.) Also take advantage of theory group and lab meetings to force these discussions.

## Get funded

1. Aggressively search for opportunities. Ask the grad student and postdoc coordinators, ask your peers, do searches, etc. Periodically recheck.

2. Work backwards from deadlines, giving yourself much more time than seems necessary, and establish a schedule. Considerations: (i) Ask Linda if the grant will need to be reviewed by the **URA** and find out what their deadline is. This is the effective deadline (it's usually about two weeks before the submission deadline). (ii) **Letter writers** usually need at least three weeks before then, and it is best if you can give them a good copy of your research proposal by then too. (iii) Depending on the complexity of the application, we may need four or more weeks to **bounce drafts back and forth**. (iv) For applications with mentoring plans, it's especially useful to get started several months ahead, so we can identify if **another sponsor** should be brought on board. (v) It's also important to start early if we're unsure the research will be a **good fit** for the funder: we need the time to revise our pitch in coordination with the program officer. (vi) Many grants require **preliminary data**, and it's good to figure out what that should be early.

3. Identify several people who will read a draft of your proposal. It's best if some of them have been successfully funded and if they are not in your subfield. (Ideally, they'd be just like the review panel.) Ask how long they'll need with the proposal to give you comments, and work backwards to figure out when your draft needs to be ready.

4. Obtain copies of as many successfully funded applications of this type as you can, ideally with their summary sheets or reviewer comments. 
Promise confidentiality. 
You can search for funded federal grants on the [NIH RePORTER website](https://federalreporter.nih.gov/). 
It's also worthwhile just asking around.

5. Read *[4 Steps to Funding](https://www.amazon.com/Funding-Rejection-Funded-Simple-Formula/dp/0615505589)* in its entirety before drafting anything. 
It should only take a few hours. 
We have a copy in the lab somewhere.

6. Study the call/grant description carefully and study the funded applications. 
What consistencies appear? 
Potentially consult with program officers and other applicants to make sure you understand what reviewers are looking for. 

7. Write the proposal, and get funded! No, seriously, we'll discuss proposal-specific details in person. But as a mentor once told me, people generally like things in proportion to how well they understand them, so you want to make sure the proposal is really exciting---see *4 Steps to Funding*---and really, really clear. This is why we ask people outside our subfield to give us comments.

## See our funding

We keep copies of funded and unfunded grants in the "Grant proposals" project on Asana.
*Assume these grants are confidential; do not share them outside the lab.*
Feel free to ask me about them if you have questions, and if you write a fellowship proposal, please add your submitted proposal (excluding the budget) to the project.

## Review your peers

I'm assuming you've already been invited to review a paper. 
If you've not, there's not too much you can do, aside from publishing.
If you make positive comments about unpublished work at a meeting, there's a chance the authors will suggest you as a reviewer. 
If you make smart comments, there's a chance editors will notice.
Rest assured I'm always on the lookout for papers I can invite lab members to review with me or in my place.

If you've been asked to review a paper,

* Ensure you do not have a conflict of interest. Check the journal's policy, but regardless, look in your heart of hearts, and do not overestimate your impartiality. You should decline to review papers by friends and recent collaborators. I also decline to read manuscripts that seem to be directly "competing" with my own, i.e., they are tackling the same question using similar methods. They're probably not really competing, but if I feel they are, that's enough to disqualify me.
* Confirm you can make the deadline. If you need more time, it's better to ask the editor for extra time now.
* It might be a good idea to ask if they want your review as a backup. It seems rude to ask and probably usually is, but I've twice agreed to review manuscripts for journals only to be told *after I'd agreed but before the deadline* that they had received a sufficient number of reviews and no longer needed mine. This doesn't seem polite except under extraordinary circumstances and wasted hours of my time.
* Read over the journal's criteria for judging manuscripts. For some journals, novelty is unimportant, or there's no requirement to work with empirical data. It's really annoying to be held to standards that the journal itself does not endorse; the editors often don't appear to recognize when this is happening. (Nope, no baggage here!) 
* Start your reviews with a succinct summary of the manuscript, placing it in the context of other work in the field. This helps the editor, who might not understand the paper so well, and also shows that you understand the paper. Directly discuss what the paper contributes or could contribute and the extent to which the paper satisfies the criteria important to the journal.
* Next review the major strengths (sic) and weaknesses of the paper. Be very clear about what makes and/or doesn't make the paper convincing.
* Give evidence for your views. Especially regarding claims of novelty, cite! One of the most maddening things is to get a review saying, "Yawn, this has basically been done before," with no references. Citations also help the authors improve their work quickly, especially if you're suggesting relevant papers they've missed.
* In general, do not punish the authors for not doing the study you would've done or think should be done. Focus on what the paper *does* contribute or could contribute with minor or moderate changes. 
* Do not recommend acceptance, major revisions, minor revisions, etc., directly in the body of the review. That recommendation is for the editor to make. Your job is to help the editor make a decision and the authors to understand your impression of their paper---both what's good about it and what can be improved.
* Be constructive. *Never* be snarky or sarcastic. Imagine this is the first review the first author is receiving, or that the authors have feelings.
* Let the editor know in your review or confidential comments if you do not feel qualified to judge certain parts of the manuscript. It is okay to state this in the main body of your review too. Just remember you have a positive duty to disclose.
* Especially for papers that need a lot of work, it's not a good use of time to note every small mistake. You do not need to be the copy editor. If there are small technical mistakes, e.g., the lines on a graph are switched or the notation is messed up, put them in a section for minor comments.
* For papers involving code, try in 10 min to run the code and check its documentation, but do not reimplement the analysis unless you want to. You also do not need to check complex mathematical derivations. However, the methods should be clear and completely reproducible from the content of the manuscript. (I am not a fan of "See previous papers $X_{t_1}$, $X_{t_2}$,...", though a bit is okay.)
* It's fine, even good, to comment on other reviewers' comments in later rounds of review, especially if you disagree with them. If you think a reviewer has made a major error, email the editor.
* Try to limit your likely biases in peer review. People often favor manuscripts by authors of the same gender and nationality ([Murray et al. 2018](https://www.biorxiv.org/content/biorxiv/early/2018/08/29/400515.full.pdf)). (There are also biases in the selection of reviewers; [Helmer et al. 2017](https://elifesciences.org/articles/21718).)

## Have productive meetings

### Research meetings
**Before the meeting:**

1. Make sure every meeting has a purpose that everyone understands. It is good if you can send an agenda beforehand. Some people also like to review materials, such as summaries, beforehand. Ask whether they want this. 

2. If proposing an ad hoc meeting, give an estimate of how much time it should take in your invitation. This will help people focus during the meeting.

3. If the meeting is routine, let the other participants know in advance if you expect it to be especially short or long.

4. I dislike meeting reminders, but some people need them. Find out.

**At the meeting:**

1. Quickly review the agenda and the meeting's purpose.

2. If discussing research, ensure you give appropriate background information and context, and ensure your figures and numbers are clear (even if they're not "pretty").

3. Take notes rather than forcing yourself to recall things later. 

4. End by summarizing the next steps, responsibilities, and timeline.

5. Keep the meeting on track: If a less relevant discussion takes off, flag this as a topic to address later.

**After the meeting:**

1. For committee meetings, meetings with collaborators, etc., send a short follow-up email summarizing what was decided and what will happen next. 
For meetings with me, tasks should be updated in Asana, and additional notes can go there or in your lab notebook.

2. Follow up on those commitments.

### Seminar speakers

Try to meet with seminar speakers who do relevant work.
This is just fun, and it also helps people get to know you and your research.
Think of their visits like an intermittent conference without the annoying travel.
Prepare for the meetings by reading at least one of their papers, skimming their other works, and writing a list of questions that would be fun to discuss.
If you're having trouble getting on the schedule, let me know.

## Book travel

The general idea is to reduce costs as much as possible while remaining comfortable and productive.
(These savings will go toward more travel, research, and fun lab things.)
The grants that fund travel have different allowable expenses and documentation requirements, so please check flights and your total budget with me before booking.

Guidelines:

* Imagine the money as your own. Please plan your travel far enough in advance that we are not paying through the nose for registration or flights. 
Please book flights at least six weeks in advance, unless you're really confident the price is dropping.
For conferences, book flights earlier.
* UChicago has discounts with various airlines, hotels, etc. [Check them](http://finserv.uchicago.edu/purchasing/travel/index.shtml). 
You may need to use the University's travel agency or use a special website (e.g., swabiz.com for Southwest). 
(Some of these "deals" should probably be checked against Hotwire or Priceline.)
* Bonnie can book the flight for you so you do not have to pay and then be reimbursed. If you pay upfront yourself, you will have to wait until after the travel is complete to be reimbursed.
* If you book an atypical flight, such as something arriving a few days early or leaving a few days late, or that includes personal travel, funding groups generally require that you also include a quote, *obtained at the time of booking*, of the cost of the flight for typical (business-only) travel.
They'll only reimburse up to this amount. But it's otherwise totally fine to include personal travel with business, as long as you document carefully.
* If your travel is funded by the federal government, you generally have to fly with a U.S. carrier or book your ticket through that carrier.
* You are not required to share hotel rooms or use Airbnb, but if you do, it's appreciated.
* It's also great if you can share rides/taxis and take public transit, but you're not expected to go to great inconvenience to save money. 
* The University will reimburse only original itemized receipts; it will not give you a per diem. Food costs are reimbursable up to federal rates if covered by a federal grant, or $100 if from a non-grant account, per University policy. I think the principled thing to do is only submit receipts for food costs above what you'd normally spend (and not to go crazy with spending). Note the receipts must be itemized, and alcohol cannot appear on the receipts of NIH-funded travel.
* Internet charges cannot be expensed to federal grants.
* Submit receipts to Bonnie within one month of travel. 
* I suggest you sign up for airline loyalty programs if you haven't yet. 
Southwest is pretty good: Any flight can be paid for in points (miles), in that there are no annoying blackout dates or hoops to jump through, and you can change flights without a fee. 
American Airlines is basically the worst. 

## Be happy doing research

> But I am very poorly today & very stupid & hate everybody & everything. One lives only to make blunders.— I am going to write a little Book for Murray on orchids & today I hate them worse than everything so farewell & in a sweet frame of mind, I am  
> 
> Ever yours
>
> C. Darwin

If you're excited to solve the problems you're working on and to communicate them to the world, research is great.
Sometimes things can get in the way. 
Major obstacles and tips to avoid them are below.
If you think something is missing from this section, please let me know or add it.

### Time management 
A critical skill is to identify your priorities, understand how you work, and learn how to allocate your effort to get the most important things done and avoid overburdening yourself.
The [National Center for Faculty Development and Diversity](https://www.facultydiversity.org/) has excellent materials, designed for faculty but relevant for everyone, on helping you use your time well. 
You should be able to get free access to the videos and tools through the University.
If you're regularly feeling overwhelmed by tasks or unhappy with your progress, this is also something we can discuss at weekly meetings.
(Full disclosure: I'm far from perfect and perennially trying to improve in this area.)
As mentioned, I think 40 h of carefully chosen, focused work per week is enough to get things done.

Practical suggestions:

* I use the [Freedom](https://freedom.to/) app and [Pomodoro Technique](https://en.wikipedia.org/wiki/Pomodoro_Technique) when I'm having trouble focusing or really resisting some task. Often when I'm resisting a task there's some emotion behind it (e.g., not wanting to be bored), and recognizing that emotion and setting a timer (I can handle 20 min of boredom) helps me avoid procrastination.
* Every important task or task >5 min goes into Asana and is immediately placed on my calendar. This helps things get real. It's harder to be deluded about how much time I have.
* I find it useful to compare my scheduled day to how I actually spent my time. It has made me realize the necessity of adding a bit of extra time for spontaneous meetings, scheduling brainless tasks after teaching, etc.
* I also give myself weekly and quarterly goals, and do the same comparisons.
* Working or accountability groups/buddies can be great. If you know others who are struggling to read, write, etc., regularly, consider setting up formal work sessions or accountability reports. The buddies don't have to be local.

### Imposter syndrome
It's really common and can't completely be cured. 
My best advice is to practice acknowledging doubt and then moving on to whatever you want to do.
(I like [this post](https://psycgirl.wordpress.com/2016/07/22/the-tale-of-the-unwritten-manuscript/) from psycgirl.)
Because research involves working on unsolved problems with an ever-expanding set of tools, and the world is complicated, we have to be comfortable pushing through the discomfort of ignorance and mediocrity ([Schwartz 2008](http://jcs.biologists.org/content/121/11/1771)).
Once you accept this, it can mostly be fun to work on interesting problems with great people.

In general, you should not conflate your *or anyone else's* confidence and competence (see the [Dunning-Kruger effect](https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect)).
This is important in making science more equitable.


### Mental health and medical problems
Please take a "mental health day" if you need one, and if you feel stuck, I encourage you to consider talking to a therapist. 
If you are new to therapy, keep in mind that there's enormous variation in quality and style between therapists.
If you don't feel like you're making progress with a therapist, find a new one. 
Remember that many University health plans cover therapists who are not on campus.
Of course, there are a bazillion ways to maintain our mental health, and I encourage you to develop [multiple strategies](http://drtregoning.blogspot.com/2015/05/using-pop-songs-to-maintain-good-mental.html) if you're feeling strained.

If you're not getting good medical care or you have a condition that interferes with your work, please let me know so we can find better care and/or accommodate your needs. 

### Unwelcome environment
If your environment is creating difficulty, e.g., the University or lab does not feel like a welcoming place, or you are under great financial strain, please let me know.
Your workplace should obviously be supporting you.

A note to students: Under [Title IX](https://voices.uchicago.edu/equity/title-ix/), if you speak with me about sexual misconduct, I am required to talk to the Title IX Coordinator about it. The University may proceed with an investigation (potentially despite your wishes). If you wish to maintain confidentiality, there are a [variety of people](https://voices.uchicago.edu/equity/title-ix/confidential-resources/) you can talk to.

## Give a good talk

I like this [advice from Jonathan Shewchuk](https://people.eecs.berkeley.edu/~jrs/speaking.html).
If you're scared to get started, read [Tim Urban's](https://waitbutwhy.com/2016/03/doing-a-ted-talk-the-full-story.html) essay about his TED talk for inspiration.
If it's the first time you've given the talk, sign up on the lab calendar to give a practice presentation at least a week before the talk itself.

## Interview someone

We have frequent opportunities to meet with potential hires, especially prospective graduate students, postdocs, and faculty.
These are usually great opportunities to talk research with someone new, but they're also critically important from an institutional perspective.
You play an essential role in helping identify top performers.
We want colleagues who will not only do great work but who will make the lab, department, etc., a more invigorating place to be.
Please take advantage of opportunities to meet with people and learn about what they do.
You can take it as read that whoever is doing the hiring wants feedback promptly: try to provide feedback (in person, over the phone, or in writing) quickly, e.g., <24 h.
Please also let me know if you think particularly great people are on the market!

It's vitally important to decide *before* we meet a potential hire what criteria we should use to judge them: studies show that we often subconsciously rationalize biases by identifying criteria post hoc.
I think academics in particular are prone to discrimination because their self-identity is so predicated on objectivity.
It is useful to have a core, fixed set of questions or topics to discuss with different candidates.
This doesn't mean the conversation can't wander, but it promotes fair comparisons.
I'm happy to talk about the criteria I use for different positions.

Please keep in mind that many laws exist to protect people from discrimination, and they affect what potential employers can and cannot ask interviewees.
Even though you might not have hiring authority, as a representative of the University, you should avoid asking these questions too, even indirectly. 
(You might not have any intent to discriminate, but the questions could rattle the candidate, and others who hear might be inappropriately influenced.)
Do not ask questions about race, color, national identity, or citizenship; religion; sex, gender identity, or sexual orientation; pregnancy status, marital status, or parenthood status; disability; or age.
For instance, do not ask about what languages someone speaks (unless it is somehow relevant to the position, which it usually isn't), their accent, where their parents were born, or their partner's job, or the existence of a partner.
It is especially inappropriate to bring up two-body issues when discussing candidates until an offer has been made.
Questions related to economic status (e.g., car or home ownership, debt, etc.) are also unwise.

The basic principle is equity.
Equity is a moral imperative. 
It also has the handy feature of broadening the talent pool for any position and probably accelerating the pace of science.
From this principle, it follows we should not discriminate or draw conclusions about scientific and professional merit based on a huge class of dumb things, like whether someone wears makeup, seems really excited about sports, programmed in Fortran at age 3, knows your friends, drinks socially, etc.
We should make an effort to work well with people who are different from us.

I have heard almost every type of inappropriate interview question in academia.
It's pretty sad.
If you hear someone asking one of these questions, do your part by telling the candidate they don't need to answer and/or immediately changing the topic of conversation.
Candidates will often answer these questions anyway or even volunteer protected information on their own.
Do not follow up, and attempt not to be influenced by the information.

## Contribute to the handbook

I'd love to make the handbook as useful as possible.
Please contribute if you see ways to improve it (especially if you have CSS skills!).
The handbook repository is in the lab's GitHub account. 
Submit your changes as a pull request.
If you'd like to make many contributions, let me know, and I'll add you as an owner.

## Win at conferences

There are a bazillion resources on this. I think they boil down to

* Try to develop a list of people you'd like to talk to before you go, and have an idea of what you'd like to discuss with them. It can sometimes help to send an email in advance if there's someone you really want to connect with. You can set up a time and place to meet.
* Pace yourself. Get sleep. Go to talks, but not necessarily all of them. Make time for dinner and socializing. Ask people to join you for dinner.
* Practice asking speakers questions.
* Get in the habit of introducing yourself, asking people about their research, etc. 
* Avoid spending much time with lab members. Really, the opportunity cost of hanging out with lab members is big. Meeting new people might not always feel like it amounts to much, but it will pay big dividends, I promise. You'll see many of them again. You'll probably collaborate with a few. 

## Negotiate authorship

We try to follow the [APA guidelines](https://www.apa.org/research/responsible/publication/) for determining authorship:

> Authorship credit should reflect the individual's contribution to the study. An author is considered anyone involved with initial research design, data collection and analysis, manuscript drafting, and final approval. However, the following do not necessarily qualify for authorship: providing funding or resources, mentorship, or contributing research but not helping with the publication itself. **The primary author assumes responsibility for the publication, making sure that the data are accurate, that all deserving authors have been credited, that all authors have given their approval to the final draft; and handles responses to inquiries after the manuscript is published.**  

(Emphasis mine.) 
Not everyone we work with follows these guidelines, and they can differ from journals' policies.
We'll talk about it.
Authorship frequently needs to be [renegotiated](https://www.apa.org/science/about/psa/2015/06/determining-authorship.aspx).
It's better not to postpone this.
Please talk to me if you are unclear about authorship.
In general, I expect first authors to be corresponding authors, unless they want to pass responsibility for future communication to me (or whoever's the senior author).

## Engage with the public

**Locally:** The University has established relationships with local schools through the [Neighborhood Schools Program](https://nsp.uchicago.edu/) and with the community through the [Office of Civic Engagement](https://civicengagement.uchicago.edu/education/tutoring-enrichment/).
We also sometimes talk with local journalists and radio hosts (e.g., on WBEZ). 
Let me know if you think there's something we should share.

**And beyond:** If you're interested in educational outreach, consider [Skype a Scientist](https://www.skypeascientist.com/). If writing is more your thing, check out the [OpEd Project](https://www.theopedproject.org/).

<!--chapter:end:04-so_you_wanna.Rmd-->

# Coding Handbook

## Justification

The practice of science requires special care to ensure integrity. Not only do we want to know our results are correct, we need to show this to outside collaborators, institutions, publishers, and funders. The standards for these groups are also rising, particularly in the areas of data and code. Excerpts from the Fostering Integrity in Research (2018) checklists for researchers, journals, and research sponsors:

Researchers:

* Develop data management and sharing plan at the outset of a project.
* Incorporate appropriate data management expertise in the project team.
* Understand and follow data collection, management, and sharing standards, policies, and regulations of the discipline, institution, funder, journal, and relevant government agencies.

Journals:

* Provide a link to data and code that support articles, and facilitate long-term access.
* Require full descriptions of methods in method sections or electronic supplements.

Research sponsors:

* Develop data and code access policies for extramural grants appropriate to the research being funded, and make fulfillment of these policies a condition of future funding.
* Cover the costs borne by researchers and institutions to make data and code available.
* Practice transparency of data and code for intramural programs.
* Promote responsible sharing of data in areas such as clinical trials.

One of the main reasons research has changed with respect to ensuring integrity in the last few decades is the increasing role of data and computer software. New norms, standards, and training are required, and new opportunities for communication and reuse are available.


## Coding Culture

As with any high-stakes endeavor, it's important to think about how our treatment of each other contributes to success. Software development depends on accuracy and interdependence, and it requires human judgment. Culture can therefore greatly impact productivity and resiliency.

### Cultural practices

1. Be open with your code and your understanding. Coding productivity depends on information, and poor communication is a common bottleneck. Passing information as quickly and openly as possible helps overcome this bottleneck and facilitates progress. The problem has been known since the [70s](https://en.wikipedia.org/wiki/The_Mythical_Man-Month), yet it is easy to forget.

2. Be charitable with your feedback. No error is so obvious that even the most experienced programmer won't make it from time to time. Break your PR reviews into demonstrable chunks that you can prove with a code snippet. Don't make sweeping or vague judgements.

3. Be thick-skinned. Code has a way of seeming absolute and damning when you get it wrong. On the other hand, that is its nature. Any error will feel that way, and everyone makes [mistakes](https://en.wikipedia.org/wiki/List_of_software_bugs).
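
For example, rather than writing "this function looks fragile," a review comment might include a snippet that reproduces one specific problem. The sketch below is hypothetical: `attack_rate` stands in for whatever function is under review.

```python
def attack_rate(cases, population):
    """Function under review (hypothetical): fraction of the population infected."""
    return cases / population


# Reviewer's snippet: one concrete input that demonstrates the concern.
try:
    attack_rate(0, 0)  # e.g., an empty age group in the data
    outcome = "no error"
except ZeroDivisionError:
    outcome = "ZeroDivisionError"

print(outcome)  # ZeroDivisionError
```

A comment like this is specific, easy to verify, and easy to act on.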


## Complexity


> The art of programming is the art of organizing complexity.
  --- Edsger W. Dijkstra

Programs tend to be difficult to understand. They are written by someone with roughly the same capacity for complexity as the reader, but with the advantage of having written them. This person will usually write to the limit of *their own* understanding because we write competitive programs that are as sophisticated and full-featured as possible.

There are strong incentives in all of computing to write complex code. There are also a limitless number of ways to write code that are functionally identical to each other.

It's also important to remember that complexity is a force of nature. Once enough possible states of a program have been achieved that are difficult to characterize – which is easy to do – it becomes impossible for any human brain to understand completely. This doesn't happen for all programs, but due to combinatorics, the point at which an application becomes complex can easily sneak up on the author.
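As a rough illustration of the combinatorics, assume a program whose behavior depends on a handful of independent on/off settings. The flag names below are hypothetical and only serve to make the count concrete.

```python
from itertools import product

def count_states(flags):
    """Enumerate every on/off combination of the given flags and count them."""
    return sum(1 for _ in product([False, True], repeat=len(flags)))

# Hypothetical settings, purely for illustration.
flags = ["verbose", "use_cache", "normalize", "resample"]

print(count_states(flags))  # 4 flags -> 16 combinations
print(2 ** 40)              # 40 independent flags -> about 1.1 trillion combinations
```

Each added flag doubles the number of configurations, which is why the jump from "easy to hold in your head" to "impossible to enumerate" can happen within a few commits.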


## Why writing code is easier than reading it


> The process of understanding a code practically involves redoing it.
  --- John von Neumann

Take a function in a large codebase. The person who wrote it understands:

1. The expected -- or possible -- range of inputs for this particular application. (Number of possible arguments -- values -- can easily be in the trillions for a simple function.) Possible range will often depend on the entire rest of the codebase and will usually be implicit.
2. How often the function is called at runtime. (Note: This is different from the number of times it is referenced in the codebase.)
3. The intentions of the code. This can be different from what it *does* and is simpler to understand. (Usually intention vs reality is cleared up with comments.)
4. The narrative of the code. The history of a codebase is a powerful mnemonic device. "We wrote this because there was an issue in January. There are three other places this functionality is handled."

All of this asymmetry between the author and the reader is in addition to the raw size of the source code. In other words, these depend on the combinatorial explosion of interconnected components.
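One way to narrow this gap is to write down what the author knows. The sketch below is a hypothetical function (`attack_rate` and its parameters are invented for illustration) that states its intention and expected input range in the docstring and checks the range explicitly, rather than leaving both implicit in the rest of the codebase.

```python
def attack_rate(infected, population):
    """Return the fraction of the population that was infected.

    Intention: summarize one season for one group.
    Expected inputs: population > 0 and 0 <= infected <= population.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= infected <= population:
        raise ValueError("infected must be between 0 and population")
    return infected / population

print(attack_rate(25, 100))  # 0.25
```

Whether explicit checks like these are worth the extra lines is a judgment call. The docstring alone already gives a reader items 1 and 3 from the list above.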


### Science and Software

Some coding principles are *less important* in science because:

* Scientific applications are more mathematical. You can reason about the range of values more easily.
* They can be very short.
* They are often meant to be run one way. For example, a Jupyter notebook that is intended to run in the order the cells are in on the page.

Some coding principles are *more important* in science because:

* Scientific applications are meant to be read. They are intended to teach and be verified.
* They are meant to be open-source, and reused.
* They are at the forefront of human understanding. Extra complexity is detrimental.
* They are sensitive to error. The stakes are high.


## Indirection, Abstraction, and Generalization

Indirection, abstraction, and generalization are three closely-related concepts in programming.

Indirection is the most general in that it refers to any symbolic representation of a process in the place of the process itself. A function calling another function, for instance.

Indirection can reduce complexity, and multiply the number of cases a piece of code can handle, *and be a source of complexity itself*.

The so-called [Fundamental Theorem of Software Engineering](https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering), attributed to David J. Wheeler, is: "We can solve any problem by introducing an extra level of indirection. (Except for the problem of too many levels of indirection.)"

Abstraction is also very general but refers to the process of removing details that are not relevant to some concept one is trying to model.

Finally, generalization is very similar to abstraction with the connotation of combining the functionality of several similar pieces of code into one, usually parameterized, copy.

Thinking about how accurately, simply, and powerfully your code represents what is being modeled is important because it makes your code more useful, more understandable, and more mathematical: you've distilled a model to its essence.
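A minimal sketch of generalization, using hypothetical function names: two nearly identical functions are combined into one parameterized copy.

```python
# Before: two near-identical functions, one per threshold.
def count_above_40(titers):
    return sum(1 for t in titers if t > 40)

def count_above_80(titers):
    return sum(1 for t in titers if t > 80)

# After: one parameterized function covers both cases, and any other threshold.
def count_above(titers, threshold):
    """Count the values strictly greater than threshold."""
    return sum(1 for t in titers if t > threshold)

titers = [10, 40, 80, 160, 320]
assert count_above(titers, 40) == count_above_40(titers)
assert count_above(titers, 80) == count_above_80(titers)
```

The generalized version is also a small indirection: a reader now has to look at the call site to learn which threshold is in use.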


### Example: [The Weasel Algorithm](https://en.wikipedia.org/wiki/Weasel_program#Weasel_algorithm)

``` python
from random import choice, random

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
target  = list("METHINKS IT IS LIKE A WEASEL")

# create a random parent
parent  = [choice(charset) for _ in range(28)]
while parent != target:

    # calculate how fast to mutate depending on how close we are to the target.
    rate = 1-((28 - (sum(t == h for t, h in zip(parent, target))
)) / 28 * 0.9)

    # initialize ten copies of the parent, randomly mutated.
    mut1 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut2 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut3 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut4 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut5 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut6 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut7 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut8 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut9 = [(ch if random() <= rate else choice(charset)) for ch in parent]
    mut10 = [(ch if random() <= rate else choice(charset)) for ch in parent]

    # put mutants in an array
    copies = [mut1, mut2, mut3, mut4, mut5, mut6, mut7, mut8, mut9, parent]

    # pick most fit parent from the beginning of the list
    parent1 = max(copies[:4], key=lambda trial: sum(t == h for t, h in zip(trial, target)))

    # pick most fit parent from the end of the list
    parent2 = max(copies[4:], key=lambda trial: sum(t == h for t, h in zip(trial, target)))

    # choose a place in the "genome" to split between the two parents.
    place = choice(range(28))
    mated = [parent1, parent2, parent1[:place] + parent2[place:], parent2[:place] + parent1[place:]]

    # choose most fit amongst the parents and two progeny
    parent = max(mated, key=lambda trial: sum(t == h for t, h in zip(trial, target)))

    print(''.join(parent))
print('Success! \n', ''.join(parent))

```

This code has several problems.

1. There's a lot of duplication. Initializing the array requires as many lines as there are elements in the array. `sum(t == h for t, h in zip(trial, target))` is copied anywhere it is needed, etc.
2. It only works for specific lengths of `target` and `copies` because a user would need to modify code instead of changing a parameter. Literal values like 4 and 28 are used instead of variables (hardcoding).
3. It's not conceptual. If it weren't for the comments, it would be difficult to tell what the author is getting at. What does 28 mean in this context? Is it the same as other 28s?
4. It would be difficult to maintain (particularly if this style was used in a large program). Will the program still work if we change some of it? Is there an error in some of the duplicated code? How do we add functionality without adding more duplication?

All of these are problems that can be solved by generalization.
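As one small step in that direction, the ten duplicated `mut` lines could be replaced by a single parameterized function. This is only a sketch. The names `mutated_copies`, `keep_probability`, and `POPULATION_SIZE` are assumptions introduced here, and `keep_probability` plays the role of `rate` in the code above.

```python
from random import choice, random

POPULATION_SIZE = 10  # number of mutated copies per generation

def mutated_copies(parent, keep_probability, charset, n=POPULATION_SIZE):
    """Return n copies of parent; each character is kept with keep_probability."""
    return [
        [(ch if random() <= keep_probability else choice(charset)) for ch in parent]
        for _ in range(n)
    ]

# With keep_probability of 1.0 nothing mutates, which makes the function easy to check.
copies = mutated_copies(list("abc"), 1.0, "abc")
print(len(copies))  # 10
```

Changing the population size is now a matter of passing a different `n` instead of adding or deleting lines.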

This code was adapted from a much more general version at [Rosetta Code](https://www.rosettacode.org/wiki/Evolutionary_algorithm#Python). In some ways the less general version is easier to understand. Your eye doesn't need to jump around as much to see what is going on. Usually, though, the more general code is preferred.

How a particular program should be written is a judgment call based on its size, predicted longevity, and what is most clear to readers. If you're starting to lose productivity or bugs are difficult to fix, it might be time to generalize. As with almost any engineering topic, generalization is a tradeoff and can be taken too far.


## Debugging

Debugging and experimentation are fundamentally the same. Debugging is done by isolating variables to identify the cause of a problem. It is comparing the results of two runs of a program with one variable changed. If one run reproduces a bug and the other doesn't, you can usually conclude the value of the variable is the cause. ("Variable" may not be a literal variable. It may be a short section of code, or input.)
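A minimal sketch of this idea, assuming a hypothetical function `mean_titer` that sometimes fails: the two runs use the same data and differ only in `skip_missing`.

```python
def mean_titer(values, skip_missing):
    """Average the values, optionally dropping missing (None) entries first."""
    if skip_missing:
        values = [v for v in values if v is not None]
    return sum(values) / len(values)

def reproduces_bug(values, skip_missing):
    """Return True if this run raises the TypeError under investigation."""
    try:
        mean_titer(values, skip_missing)
    except TypeError:
        return True
    return False

data = [40, 80, None, 160]

# Two runs that differ in exactly one variable.
print(reproduces_bug(data, skip_missing=True))   # False
print(reproduces_bug(data, skip_missing=False))  # True
```

Because only one thing changed between the runs, the difference in outcome points at the handling of missing values rather than at the data or the averaging.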

Modular, functional code is easier to debug than the opposite: arbitrarily interconnected code. The reason for this is it's easier to understand, and it's easier to “change one thing” and understand the outcome. Code that is easy to debug is also generally [modular](#modules-and-modularity), [functional](#functional-programming), and [testable](#testability).


## Functional Programming

The term "functional" comes from Functional Programming, which is a discipline in which:

1. Functions always return the same output for an input. For instance:

``` python
def f(x):
    return x + 2

```

Always returns the corresponding x + 2 for every x no matter the context and across time.

``` python
from random import randint

def f(x):
    return x + randint(0, 10)

```

Does not.

2. Functions have no side-effects. The function can't modify any variables outside itself.

``` python
g = 0

def f(x):
    global g
    g = g + 1
    return x + 2

```

Modifies `g`, which is outside the scope of the function `f`. State applies to external systems as well. Modifying a database, for instance, counts as out-of-scope, and can affect future runs of a function with the same input.

These properties guarantee that a program is [referentially transparent](https://en.wikipedia.org/wiki/Referential_transparency). You can easily modify it because you can replace any instance of a function with a value, and you can move functions without concern that their behavior will change. Additionally, variables that can be changed by many different functions, in the worst case global mutable variables, add complexity to programs. This is analogous to running an experiment where variables can't be controlled because the state of the program generally involves variables that could be at any state at any time and may radically change the behavior of the program. A function in functional programming (also known as a pure function) can be tested completely by changing its arguments.
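One possible way to keep a call counter without a global variable is to pass the state in and return the updated state. This is a sketch of that approach, not the only option; the `count` parameter is an addition for illustration.

```python
def f(x, count):
    """Pure version: the call count is passed in and the updated count is returned."""
    return x + 2, count + 1

count = 0
result, count = f(3, count)
assert (result, count) == (5, 1)

# Same input, same output, regardless of what ran before.
assert f(3, 0) == f(3, 0) == (5, 1)
```

Because `f` no longer touches anything outside itself, any call such as `f(3, 0)` can be replaced by its value `(5, 1)` without changing the program.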

The benefits of functional programming are closely related to [modularity](#modules-and-modularity). Functional programs are modular in that every function encapsulates some functionality and has a well-defined interface, the function signature.


## Unit Testing

Unit testing is a technique for verifying a codebase by writing sample input and expected output for a number of its functions. These tests -- usually structured as functions themselves -- are binary. They either run to completion, or they throw an exception. Failure to run represents failure of the test.

Unit tests are generally run without input. The input to _the function being tested_ is written as literals (`3`, `"blue"`, `[6, 7, 8]`) within the test function, or globally for several tests to use. Input to a test, or the variables and environment necessary for the function to run, is called a *fixture*.

A _test suite_ is the set of all tests of a codebase. Usually test suites are run in their entirety, giving a simple output with the percentage of tests that passed. The codebase (usually called a "build" in this context) is said to be *passing* or *failing*.
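The sketch below shows these terms with Python's built-in `unittest` module. The function `mean` and the test names are hypothetical; `setUp` holds the fixture, and the last three lines build and run a small suite.

```python
import unittest

def mean(values):
    return sum(values) / len(values)

class TestMean(unittest.TestCase):
    def setUp(self):
        # Fixture: input shared by the tests below, rebuilt before each test.
        self.values = [6, 7, 8]

    def test_mean(self):
        self.assertEqual(mean(self.values), 7)

    def test_mean_of_one_value(self):
        self.assertEqual(mean(self.values[:1]), 6)

# Build and run the suite, then summarize it as passing or failing.
suite = unittest.TestLoader().loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("passing" if result.wasSuccessful() else "failing")
```

In day-to-day use the suite is more often run from the command line (for example with `python -m unittest`) than assembled by hand like this.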

Unit testing serves two major purposes.

1. A unit test is a declaration by the author that an input/output pair is "correct." It represents what the function *should* do. A programmer could write a unit test `test_add` of the function `add`:

``` python

def add(a, b):
    """One of the four basic arithmetic operations. Takes two numbers -- a and b -- and returns a + b."""
    return a + b + 1

def test_add():
    assert add(1, 1) == 3

```

The test verifies that the input/output pair (a = 1, b = 1)/3 is consistent with the function `add`, but does not validate that it is functioning properly (at least according to the way `add` has been described).

A valid test:

``` python

def test_add():
    assert add(1, 1) == 2

```

would uncover the fact that `add` was either written incorrectly, or was recently broken.

When writing tests, it's good to check yourself by avoiding the output of the function as it is written. Take input/output pairs from another, reliable, source or think about the problem and write what you believe is the correct output. A programmer who runs `add` and then writes a test with (a = 1, b = 1)/3 would be perpetuating the error instead of correcting it.


2. Unit tests verify that changes to a codebase don't have unexpected effects. In other words, they help compare two versions of the code to show the functionality (set of input/output pairs) is the same.

This is helpful for refactoring or rewriting where many changes are being made and the application needs to be verified repeatedly.
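A minimal sketch of this second purpose, assuming a hypothetical refactor of a character-matching function: the test compares the old and new implementations over the same inputs.

```python
def fitness_original(trial, target):
    """Original implementation: count matching positions with an explicit loop."""
    matches = 0
    for i in range(min(len(trial), len(target))):
        if trial[i] == target[i]:
            matches += 1
    return matches

def fitness_refactored(trial, target):
    """Refactored implementation: same behavior, written with zip."""
    return sum(t == h for t, h in zip(trial, target))

def test_refactor_preserves_behavior():
    cases = [("abcd", "axxd"), ("abcd", "abcd"), ("abcd", "wxyz"), ("", ""), ("ab", "abcd")]
    for trial, target in cases:
        assert fitness_refactored(trial, target) == fitness_original(trial, target)

test_refactor_preserves_behavior()
```

Once the old implementation is deleted, the same cases can be kept with their literal expected values so the test keeps guarding against later changes.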


### Limitations to Unit Tests

Unit tests are not proofs. They test one input/output pair that stands for many pairs in the input/output space. It is always possible that one pair is not tested that will be critical to the functioning of the program, and the program does not handle it as intended.

It is important, therefore, that unit tests are written for representative pairs. For instance:


``` python

def add(a, b):
    if a == 3:
        return a + b + 1

    return a + b

def test_add_ones():
    assert add(1, 1) == 2

```

The test `test_add_ones` is representative of most pairs of integers, but it ignores the conditional that modifies the behavior in the case of a = 3. Real-world examples will be much less obvious, so gaps like this are common. The path may be one of thousands, buried in many tens of thousands of lines of code.

Because all of the paths that characterize the behavior of the function are not tested, this codebase could be said to have insufficient "[coverage](https://en.wikipedia.org/wiki/Code_coverage)." Unit testing libraries will usually be able to measure coverage, which is useful for finding these cases.
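As a sketch of what better coverage looks like for the `add` function above, the cases below are chosen so that each branch is exercised at least once. The `cases` and `failures` names are introduced here for illustration.

```python
def add(a, b):
    if a == 3:
        return a + b + 1
    return a + b

# (a, b, expected) triples chosen so that each branch of add is exercised.
cases = [
    (1, 1, 2),  # ordinary path
    (3, 1, 4),  # path through the a == 3 conditional
]

failures = [(a, b) for a, b, expected in cases if add(a, b) != expected]
print(failures)  # [(3, 1)]
```

A coverage tool such as coverage.py reports which lines a test run never executed, which is how a missing case like `a == 3` is usually found in a larger codebase.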


### Testability

A function is easier to test when there's a simple way to characterize its behavior with some inputs and expected outputs. This generally means small, easy-to-write input/output pairs. For example:

``` python
def f(i, j, s):
    if i > j:
        return s + " is above the line"
    else:
        return s + " is below the line"

def test_low_f():
    assert f(2, 3, "test") == "test is below the line"

def test_high_f():
    assert f(3, -1, "test") == "test is above the line"

```

``` python
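import random
import urllib.request
from datetime import date

is_thursday = date.today().weekday() == 3  # global state; Monday is 0

def g(i, j, s):
    if i > j and is_thursday and urllib.request.urlopen("http://line.status.org") and random.randrange(1, 10) > 2:
        return "It's thursday and the line status is good and you're lucky."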

```

Function f:

1. Is purely functional. It doesn't modify or depend on values outside its scope, and always returns the same values for the corresponding input.

`f` is easy to test. `test_low_f` and `test_high_f` cover both branches (the if and else) and characterize the behavior of `f` well. Note: It's a judgment call whether or not enough of the input space has been tested. It's easy to see in this case that all integers will behave predictably. This also doesn't cover exceptions, which should generally be tested.

Function g:

1. Works differently depending on the day, the status of an external web site, and a random number. The behavior of g depends on a lot of values, and values that are outside the scope of the function.
2. Does not handle all values of its parameters.

`g` is not purely functional and very difficult to test. The state that would be needed to get a predictable output is difficult to prepare. (Functions should also almost never silently fail.)

This most commonly occurs in practice with large global, mutable variables and large objects in general, and the results of connections to external services. Sometimes this can't be avoided, but structuring your program strategically can minimize the effects of state.
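One way to structure this, sketched under some assumptions: the decision logic of `g` becomes a pure function whose state arrives as arguments, and the date lookup, network call, and random draw are left to the top level of the program. The parameter names and the fallback return value are additions for illustration, and the unused parameter `s` is dropped.

```python
def g(i, j, is_thursday, line_is_up, lucky):
    """Pure core: every value the decision depends on arrives as an argument."""
    if i > j and is_thursday and line_is_up and lucky:
        return "It's thursday and the line status is good and you're lucky."
    return "No luck today."

def test_g_all_conditions_met():
    assert g(3, 1, True, True, True).startswith("It's thursday")

def test_g_wrong_day():
    assert g(3, 1, False, True, True) == "No luck today."

test_g_all_conditions_met()
test_g_wrong_day()
```

The state has not gone away, but it is now prepared in one place and the tests no longer depend on the calendar, the network, or chance.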


## Modules and Modularity

The "messiness" of code is hard to quantify. Messy, difficult-to-understand code is sometimes called "spaghetti code" because connections between components are made from anywhere, to anywhere without much planning or structure. However:

1. A complex codebase *should* have many connections such as function calls, imports, variable mutation etc.
2. Where and when to make connections can depend on the “lens” through which you're viewing the code. There isn't an absolute objective standard.

Modularity is one path to clean, simple, maintainable code that is considered distinct from "spaghetti" and can be put in objective terms.

The word "module" has a specific meaning in certain programming languages. As a general term, it means a section of code that acts as a unit, usually on a particular topic or domain, that has a well-defined interface.

An [interface](https://en.wikipedia.org/wiki/Interface_(computing)) is a boundary between components over which information is exchanged. The simplest well-defined interface is a function signature.

``` python
def f(a, b, c):
  product = a * b
  return product + c

```

The "module" `f` has an interface that is the parameters `a`, `b`, and `c`. Any code that calls `f` needs to provide these parameters. They are defined explicitly in the codebase and enforced by the language: in Python, a call with missing arguments raises a `TypeError`. Variables within `f` can't be accessed from outside `f`. In other words, a connection, or metaphorical spaghetti strand, can't be made to anything inside `f`.

Once a codebase is organized into modules, it becomes much simpler and easier to maintain. Modularity also contributes to "[separation of concerns](https://en.wikipedia.org/wiki/Separation_of_concerns)," one of the most important, if not *the* most important, software principles. Software organized into concerns with interfaces in between is easier to reason about because the modules can be reasoned about, and modified, independently.

Returning to the evolutionary algorithm example, a more modular version might look like this:

``` python
"""Module for simulating evolution of strings."""

from random import choice, random
from functools import partial

def fitness(trial, target):
    """Compare a string with a target. The more matching characters, the higher the fitness."""
    return sum(t == h for t, h in zip(trial, target))

def mutaterate(parent, target):
    """Calculate a mutation rate that shrinks as the target approaches."""
    perfectfitness = float(len(target))
    return (perfectfitness - fitness(parent, target)) / perfectfitness

def mutate(parent, rate, alphabet):
    """Randomly change characters in parent to random characters from alphabet."""
    return [(ch if random() <= 1 - rate else choice(alphabet)) for ch in parent]

def mate(a, b):
    """
    Split two strings in the center and return 4 combinations of the two:
        1. a
        2. b
        3. beginning of a, then the end of b
        4. beginning of b, then the end of a
    """
    place = int(len(a)/2)
    return a, b, a[:place] + b[place:], b[:place] + a[place:]


def evolve(seed, target_string, alphabet, population_size=100):
    """Randomly mutate populations of seed until it "evolves" into target_string."""

    assert all([l in alphabet for l in target_string]), \
            "Error: Target must only contain characters from alphabet."
    assert len(seed) == len(target_string), \
            "Error: Target and Seed must be the same length."

    # Work with lists of characters so candidates compare and combine consistently.
    target = list(target_string)
    parent = list(seed)
    generations = 0

    while parent != target:
        rate = mutaterate(parent, target)
        mutations = [mutate(parent, rate, alphabet) for _ in range(population_size)] + [parent]

        center = int(population_size/2)
        parent1 = max(mutations[:center], key=partial(fitness, target=target))
        parent2 = max(mutations[center:], key=partial(fitness, target=target))
        parent = max(mate(parent1, parent2), key=partial(fitness, target=target))
        generations += 1
    return generations

# Tests
import unittest

class TestEvolveMethods(unittest.TestCase):

    def test_fitness(self):
        self.assertEqual(fitness("abcd", "axxd"), 2)
        self.assertEqual(fitness("abcd", "abcd"), 4)
        self.assertEqual(fitness("abcd", "wxyz"), 0)

    def test_mutaterate(self):
        self.assertEqual(mutaterate("abcd", "wxyz"), 1)
        self.assertEqual(mutaterate("abcd", "abce"), 0.25)
        self.assertEqual(mutaterate("abcd", "abcd"), 0)

    def test_mutate(self):
        """mutate with no mutation rate returns the parent."""
        alphabet = "abcdefgh"
        self.assertEqual(mutate("abcd", 0, alphabet), ['a', 'b', 'c', 'd'])

        """mutate returns only characters from the alphabet."""
        self.assertTrue(all([l in alphabet for l in mutate("abcd", 0.25, "abcdefgh")]))

    def test_mate(self):
        self.assertEqual(mate("abcd", "efgh"), ('abcd', 'efgh', 'abgh', 'efcd'))

    def test_evolve(self):
        """evolve raises error if target is not composed of the alphabet."""
        with self.assertRaises(AssertionError):
            evolve("abcd", "abcx", "abcd")

        """evolve raises error if seed and target are different lengths."""
        with self.assertRaises(AssertionError):
            evolve("abc", "abcd", "abcd")


print("Evolve some strings.")

"""

Classic example. From random seed to english sentence.

"""
seed = "RHBpoxYLCGjNpUgLYnMfiKskRHmk"
target = "METHINKS IT IS LIKE A WEASEL"
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ "

generations = evolve(seed, target, alphabet)
print(f"Success in {generations} generations!")


"""

DNA example.

"""
seed = "CGATGATGTATACTGTACGTATCTACTAC"
dna_target = "AATCCGCTAGGTATCAGACTAGTAGCAGT"
dna_alphabet = "ATCG"

dna_generations = evolve(seed, dna_target, dna_alphabet)
print(f"Success in {dna_generations} generations!")


print("Run tests")
unittest.main()

```


In this version:

1. Code is organized into functions with well-defined domains (concerns), and well-defined interfaces.
2. Functions are purely functional unless there's a good reason. Good reasons include needing to throw exceptions, write to logs, generate random numbers, and limited use of global variables.
3. The "module" has a public interface (`evolve`). If this were a true module in the Python sense, or a class, all other methods could be private.
4. `evolve` returns a useful value to the caller. Functions should, in general, return values instead of modifying variables in the body, printing to the screen, writing to files etc. There's nothing wrong with these things but they are easier to manage at the top level of a program.
5. All parameters are exposed to the caller. The module doesn't need to be modified to use the whole space of possible seeds, targets, and population sizes.


## Statistics and other Modeling

Statistics and other forms of modeling and simulation are the most common forms of software produced in the practice of science. They are also an area that affects reproducibility and interpretation of research. In a Nature survey assessing the effect of implementing a publication checklist, respondents strongly associated statistical reporting with research quality: "Of those survey respondents who thought the checklist had improved the quality of research at Nature journals, 83% put this down to better reporting of statistics as a result of the checklist." [Nature 556, 273-274 (2018)](https://www.nature.com/articles/d41586-018-04590-7)

[TBD]


## Guidelines for Writing Code

1. Write comments. Cover the intention, and how this code fits in with the rest of the codebase, and any meta-information such as [deprecation](https://en.wikipedia.org/wiki/Deprecation).
2. Lint. Use a linter before committing. Git hooks can be set up to run a linter every time you commit. This helps avoid committing a large number of linter corrections at once.
3. Write unit tests. Unit tests written early will give you confidence to refactor code, and they help to verify a user is getting the same results.
4. Use version control regularly.
5. Write code that's meant to be read and understood.


## Standards

See also [Reproducible research](how-we-work.html#reproducible-research)

The key qualities that make a codebase reproducible and reusable are:

* Clarity: How simply and understandably the concepts are presented.
* Determinism: How well the starting state of the code, data, and environment are controlled so the user produces the same result as the researcher.
* Interoperability: How well the code interfaces with other tools, and whether it offers a standard method for interacting with it.

1. Document your code, preferably as you go. Even if these are rough notes at first, they are very valuable for reuse.
   a. Documentation should be tied to a version. The README in a repo should describe the same version of the code that it sits alongside.
   b. Document for superficial users who will only run the runbook and leave, and users who might want to contribute to, or reuse your code.

2. Do versioned releases. Often there will be only one release (when a paper is published, for instance), but it pins your code and documentation to a canonical point in time.

3. For analysis, include a runbook. It should start with data in the same format the user will receive and include any transformations. It should run tests, or give some indication that the run was successful.

4. Include your data if possible. If your data is not publicly available, clearly direct users who will have access to the data, using the same format and file name they will receive. If the data comes from a database or API, connect to it as simply as possible to avoid confusion.

5. Include any output data (charts, files) in the repository if possible. This gives the reader a quick reference without running the program, and will alert them if their results are slightly different.

6. Include dependencies such as libraries, languages, and applications with versions.

7. Treat the repository as a published artifact. Respect publishing norms such as attribution and long-term access.
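For item 6, one lightweight option is to have the runbook record the versions it actually ran with. The sketch below uses only the standard library (`importlib.metadata` requires Python 3.8 or later). The function name `environment_report` and the package list are assumptions for illustration.

```python
import json
import platform
from importlib import metadata

def environment_report(packages):
    """Return the Python version and the installed version of each named package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report

# Hypothetical dependency list; replace with the packages your analysis imports.
print(json.dumps(environment_report(["numpy", "pandas"]), indent=2))
```

Saving this output next to the results gives a reader something concrete to compare against when their numbers differ slightly.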


## Lab Systems

### Version Control

Version control systems such as git are nearly indispensable for maintaining the accuracy and integrity of a codebase. Services like GitHub also facilitate publishing code in a verifiable, archivable, and contributor-friendly format.

There are eight git commands that are the backbone of git usage. These commands can be memorized in one sitting and provide most of what is needed for the vast majority of routine use.


``` bash
git clone

# Clone (copy) a repo to the current directory.

git add

# Begin tracking a file or directory. "Add" it to the local repo.

git checkout <branch name>

# Make <branch name> the current branch.

git checkout -b <new branch name>

# Create and check out <new branch name>.

git pull

# Fetch the latest commits for the current branch from the remote repository and merge them into the local copy.

git status -s

# Get a list of files that have been modified locally.

git diff

# Get the differences between the local files and the current branch.

git commit -am "<new commit message>"

# Commit (record the changes to all tracked files in the local repository, ready to be sent to the remote repository).

git push

# Push new commit(s) from the local to the remote repository.


```

The key to understanding git is (1) understanding why it is used and (2) having a good mental model of repositories, branches, and commits.

Repository: A copy of all the code and history for a particular project. The remote repository is usually on a remote server or service such as github.com.

Branch: A copy of the files in a repository that can be modified independently of other branches. Reconciling the differences between two branches and consolidating them into one branch is *merging* them.

Commit: A bundle of changes to the files in a repository. The history of a repository consists of a list of commits to various branches, and merges of branches into one another. (Usually there's one *main* branch where all the changes eventually end up. The canonical current state of the repo is the state of the main branch, and versioned releases are usually of the main branch.)


### Midway

Note: This section was updated in August 2021. The information includes references to Midway3, but at this time the majority of computing takes place on Midway2.

Midway2 and Midway3 are the main computing clusters used by the lab, run by the Research Computing Center. Midway3 was launched in March 2021. The official documentation for Midway2 and Midway3 is maintained here by RCC:

**Midway2**: [https://rcc.uchicago.edu/docs/](https://rcc.uchicago.edu/docs/)

**Midway3**: [https://mdw3.rcc.uchicago.edu/](https://mdw3.rcc.uchicago.edu/)

This page contains some tasks that are likely to be useful to you.

#### Storage on Midway

There are three places you can store things on Midway: (1) your home directory; (2) your scratch space; and (3) a shared project directory for the lab.

Your home directory is in `/home/<CNetID>` and has a quota of 30 GB. This is a good place to check out your code, install software you want to use for runs, etc. You should NOT, in general, store simulation output in your home directory.

Your scratch space is `/scratch/midway3/<CNetID>/` and has a quota of 100 GB. This is where you should dump simulation output. I recommend organizing your simulation output using a predictable system: I like to put each experiment in a directory tagged by the date and a name, e.g., `2014-12-08-antigen-vaccine`; you may prefer a hierarchy.

The lab project directory is `/project2/cobey` on Midway2 (`/project/cobey` on Midway3, where as of 8/2021 there is no storage quota) and has a quota of 4.49 TB, with a hard limit of 4.94 TB, for everyone. If you want to share simulation output with other people in the lab, put it here. Keep things organized into separate project directories; I would recommend tagging directories with dates as above, e.g. `/project2/cobey/bcellproject-storage/2014-12-08-test`. The lab directory has a file quota of 1,885,693 files, with a hard limit of 2,074,262 files.

To check your disk usage, use the `quota` command. These quotas are current as of August 2021.
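If you generate output directories from a script, the date-and-name convention described above can be built programmatically. This is only a sketch; the helper name `experiment_dir` is hypothetical, and it constructs the path without creating anything on disk.

```python
from datetime import date
from pathlib import PurePosixPath

def experiment_dir(base, name, day=None):
    """Build a path like <base>/2014-12-08-antigen-vaccine (does not create it)."""
    day = day or date.today()
    return PurePosixPath(base) / f"{day.isoformat()}-{name}"

path = experiment_dir("/scratch/midway3/<CNetID>", "antigen-vaccine", date(2014, 12, 8))
print(path)  # /scratch/midway3/<CNetID>/2014-12-08-antigen-vaccine
```

ISO-formatted dates sort chronologically when listed alphabetically, which is the main reason to put the date first.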

#### Connecting to Midway

Details here:

[https://rcc.uchicago.edu/docs/connecting/](https://rcc.uchicago.edu/docs/connecting/)

##### SSH (terminal)

To connect via ssh, use your CNetID:

```
$ ssh <CNetID>@midway2.rcc.uchicago.edu
```

Passwordless login is no longer available, and two-factor authentication is required.

##### SCP

You can copy individual files or directories back and forth using the `scp` command, e.g.,

```
$ scp <local-path> <CNetID>@midway2.rcc.uchicago.edu:<remote-path>
$ scp <CNetID>@midway2.rcc.uchicago.edu:<remote-path> <local-path>
```

There are also graphical SSH/SCP browsers for Mac OS X, Linux, and Windows:

* [WinSCP](http://winscp.net/eng/index.php) (Windows)
* [FileZilla](https://filezilla-project.org) (all platforms)
* **Mac OS X specific to be updated**

##### SAMBA (SMB) (connect as a disk)

You can make Midway look like a local disk on your computer using [SAMBA](https://en.wikipedia.org/wiki/Samba_(software)). This is convenient for things like editing job submission scripts using your favorite editor directly on the server, without having to copy things back and forth.

Midway's SAMBA (SMB) hostname is `midwaysmb.rcc.uchicago.edu`.

If you're off campus, you'll need to be connected to the U of C VPN to access SMB. It will also be pretty slow, especially if you're on a crappy cafe connection like the one I'm on now--you may find a GUI SCP client to be a better choice offsite.

On Mac OS X, go to the Finder, choose Go > Connect to Server... (Command-K), and then type in the URL for the directory you want to access. The URLs are currently confusing (you need to specify your CNetID in the scratch URL but not in the home URL):

```
   Home:  smb://midwaysmb.rcc.uchicago.edu/homes
Scratch:  smb://midwaysmb.rcc.uchicago.edu/midway-scratch/<CNetID>
Project:  smb://midwaysmb.rcc.uchicago.edu/project/cobey
```

##### ssh -X (terminal with graphical forwarding)

If you connect to Midway via `ssh -X`, graphical windows will get forwarded to your local machine. On Linux, this should work out of the box; on Mac OS X you'll need to install [XQuartz](https://xquartz.macosforge.org/) first. (There's probably a way to make this work on Windows too; if you figure this out, please add instructions here.)

It's then as simple as doing this when logging in:

```
$ ssh -X <CNetID>@midway2.rcc.uchicago.edu
```

This is especially useful for the `sview` command (see below); it also will forward the graphics of a job running on a cluster node if you use `ssh -X` and then `sinteractive`.

##### VNC (graphical interface via ThinLinc)

You can also get a full Linux desktop GUI on a Midway connection using a program called ThinLinc:

[https://rcc.uchicago.edu/docs/connecting/#connecting-with-thinlinc](https://rcc.uchicago.edu/docs/connecting/#connecting-with-thinlinc)

WARNING: the first time you use ThinLinc, before you click Connect go to Options... > Screen and turn off full-screen mode. Otherwise ThinLinc will take over all your screens, making it rather hard to use your computer.

##### Connecting to Eduroam internationally

See [UChicago ServiceNow](https://uchicago.service-now.com/it?id=kb_article&kb=KB00015370).

#### Configuring Software on Midway

Before you run anything on Midway, you'll need to load the necessary software modules using the `module` command:

[https://rcc.uchicago.edu/docs/tutorials/intro-to-software-modules/](https://rcc.uchicago.edu/docs/tutorials/intro-to-software-modules/)

If you're not sure what the name of your module is, use, e.g.,

```
$ module avail intel
```

and you'll be presented with a list of options. You can then load/unload modules using, e.g.,

```
$ module load intel/15.0
$ module unload intel/15.0
```

There are multiple versions of many modules, so you'll generally want to check `module avail` before trying `module load`.

If you want to automatically load modules every time you log in, you can add `module load` commands to the end of your `~/.bash_profile` file (before `# User specific environment and startup programs`).
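As a rough sketch of what those lines might look like, the snippet below loads a list of modules only when the `module` command actually exists, so the same profile can be reused on a machine without the module system. The `MIDWAY_MODULES` variable and the function name are hypothetical, and `intel/15.0` is just the example version used above; check `module avail` for what is current.

```shell
# Hypothetical list of modules to load at login (space-separated).
MIDWAY_MODULES="intel/15.0"

load_default_modules() {
    # Only try to load modules where the `module` command is defined (i.e., on Midway).
    if command -v module >/dev/null 2>&1; then
        for m in $MIDWAY_MODULES; do
            module load "$m"
        done
    fi
}

load_default_modules
```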

#### Git

The easiest way to get your code onto Midway is to check it out from a Git repository. To see which versions of Git are available:

```
$ module avail git
```

#### Java

```
$ module avail java
```

As of writing, Midway only has one version of Java (1.7), so be sure not to use JDK 1.8 features in your Java code.

#### C/C++

```
$ module avail intel
$ module avail gcc
```

There are three C/C++ compiler modules available on Midway: `gcc`, `intel`, and `pgi`. The Intel and PGI compilers are high-performance compilers that should produce faster machine code than GCC, but only `intel` seems to be kept up to date on Midway, so I'd recommend using that one.

The Intel compilers should work essentially the same as GCC, except due to ambiguities in the C++ language specification you may sometimes find that code that worked on GCC needs adjustment for Intel. To use the Intel compiler, just load the module and compile C code using `icc` and C++ code using `icpc`.

Unless you have a good reason not to, you should use `module avail` to make sure you're using the latest GCC and Intel compilers, especially since adoption of the [C++11 language standard](http://wikipedia.org/wiki/C++11) by compilers has been relatively recent at the time of writing.

Also, it's worth knowing that Mac OS X, by default, uses a different compiler entirely: [LLVM/Clang](http://llvm.org). (Currently, only an old version of LLVM/Clang is available on Midway.) So you might find yourself making sure your code can compile using three different implementations of C++, each with its own quirks.

Here's how you compile C++11 code using the three compilers:

```
$ g++ -std=c++11 -O3 my_program.cpp -o my_program
$ icpc -std=c++11 -O3 my_program.cpp -o my_program
$ clang++ -std=c++11 -stdlib=libc++ -O3 my_program.cpp -o my_program
```

NOTE: the `-O3` flag means "optimization level 3", which means, "make really fast code." If you're using a debugger, you'll want to leave this flag off so the debugger can figure out where it is in your code. If you're trying to generate results quickly, you'll want to include this flag. (If you're using Apple Xcode, having the flag off and on roughly correspond to "Debug" and "Release" configurations that you can choose in "Edit Scheme".)

If you want to make it easy to compile your code with different compilers on different systems, you can use the `make.py` script in the [bcellmodel](https://bitbucket.org/cobeylab/bcellmodel) project as a starting point. It tries Intel, then Clang, then GCC until one is available. (This kind of thing is possible, but a bit annoying to get working, using traditional Makefiles, so I've switched to using simple Python build scripts for simple code.)
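The "try Intel, then Clang, then GCC" idea can also be sketched in a few lines of shell. This is not the code in `make.py`, just an illustration of the same fallback order; the function name `pick_cxx` is hypothetical.

```shell
# Print the first C++ compiler found on PATH, trying Intel, then Clang, then GCC.
pick_cxx() {
    for cxx in icpc clang++ g++; do
        if command -v "$cxx" >/dev/null 2>&1; then
            echo "$cxx"
            return 0
        fi
    done
    echo "no C++ compiler found" >&2
    return 1
}

# Example use (clang++ may additionally need -stdlib=libc++, as in the commands above):
# CXX=$(pick_cxx) && "$CXX" -std=c++11 -O3 my_program.cpp -o my_program
```

On Midway, the result depends on which compiler modules you have loaded, since `module load` is what puts `icpc` or a newer `g++` on your `PATH`.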

#### R

```
$ module avail R
```

shows several versions of R. The default version (as of August 2021) is `R/3.6.1`. The best version will probably be the latest one alongside the Intel compiler, e.g., `R/3.3+intel-16.0` at the time of writing. You can choose which version of `R` to use:

```
$ module load R/4.0.1
```
If you have already loaded the default version of `R` (`$ module load R`), you will need to first unload `R` (`$ module unload R`) before you can specify a version.

#### Python

In order to keep a consistent Python environment between your personal machine and Midway, we are maintaining our own Python installations in `/project/cobey`. Skip the Midway Python modules entirely, and instead include this in your `~/.bash_profile`:

```
export PATH=/project/cobey/anaconda/bin:/project/cobey/pypy/bin:$PATH
```

To check that things are set up properly, make sure that `which python` finds `/project/cobey/anaconda/bin/python` and `which pypy` finds `/project/cobey/pypy/bin/pypy`. See the Python page on this wiki for more information.
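If you'd rather script that check than eyeball it, something like the following would do. It is a sketch that uses the shell built-in `command -v` in place of `which`, and the function name `check_tool` is hypothetical.

```shell
# check_tool NAME EXPECTED_PATH
# Succeeds if NAME resolves to EXPECTED_PATH on the current PATH, fails otherwise.
check_tool() {
    found=$(command -v "$1" 2>/dev/null || true)
    if [ "$found" = "$2" ]; then
        echo "ok: $1 -> $found"
    else
        echo "MISMATCH: $1 -> ${found:-not found} (expected $2)"
        return 1
    fi
}

# Example use on Midway:
# check_tool python /project/cobey/anaconda/bin/python
# check_tool pypy   /project/cobey/pypy/bin/pypy
```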

#### Matlab and Mathematica

You shouldn't use Matlab or Mathematica if possible, because if you publish your code your results will only be reproducible to people that want to pay for Matlab or Mathematica.

But if you must...

```
$ module avail matlab
$ module avail mathematica
```

shows that they are available on Midway2. (Getting a graphical Mathematica notebook to run on the cluster is a pain in the ass, though, so you're probably better off just running it locally if you can get away with it.)

#### Installing R and Python libraries

##### R

We usually need specific libraries to run R scripts. It is important to check which version of `R` each library requires and to load a suitable version before installing. I find it helpful to create an `installation.R` script to install libraries into the correct directory, for example:

```
#!/usr/bin/Rscript

## Create the personal library if it doesn't exist. Ignore a warning if the directory already exists.
dir.create(Sys.getenv("R_LIBS_USER"), showWarnings = FALSE, recursive = TRUE)

## Install multiple packages
install.packages(c('dplyr',
                   'optparse',
                   'doParallel'),
                 lib = Sys.getenv("R_LIBS_USER"),
                 dependencies = TRUE,
                 repos = 'http://cran.us.r-project.org'
)
```

Other libraries (such as [`panelPomp`](https://rdrr.io/github/cbreto/panelPomp/)) are not yet in the CRAN repository. These libraries often need to be installed from GitHub, which is possible using `devtools`.

```
## panelPomp from GitHub
install.packages(c("devtools"),
                 lib = Sys.getenv("R_LIBS_USER"),
                 dependencies = TRUE,
                 repos = 'http://cran.us.r-project.org'
)

devtools::install_github('cbreto/panelPomp')
```

##### Python


### Running Jobs on Midway

I'll leave most of the details to the official documentation:

[https://rcc.uchicago.edu/docs/using-midway/#using-midway](https://rcc.uchicago.edu/docs/using-midway/#using-midway)

but a summary of important stuff follows.

#### Overview

Midway consists of a large number of multi-core nodes, and uses a system called [SLURM](http://slurm.schedmd.com) to allocate jobs to cores on nodes.

Some details on the terminology: a "node" is what you normally think of as an individual computer: a box with a motherboard, a hard drive, etc., running a copy of Linux. Each node's motherboard contains several [processors](http://en.wikipedia.org/wiki/Microprocessor) (a physical chip that plugs into the motherboard), each of which may contain several cores. A "core" is a collection of circuits inside the processor that can, conceptually speaking, perform one series of instructions at a time. (Until a few years ago, processors only consisted of one core and people talked about "processors" the way they now talk about "cores," so you might hear people confusing these from time to time.)

If you are writing code that does only one thing at a time (serial code), then all you really need to know is that a single run of your code requires a single core.

#### Job structure

Note that Midway counts service units (SUs), or core-hours, in increments of 0.01. To minimize waste, we're best off designing jobs to run for at least several minutes each. (It makes sense that we wouldn't want to bog down the scheduler anyway.)
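To get a feel for the numbers: one SU is one core for one hour, so a job's cost is roughly cores times hours. A small sketch of that arithmetic follows. The function name is hypothetical, and it assumes usage is simply cores multiplied by wall-clock time, which may not match exactly how the RCC bills.

```shell
# core_hours CORES MINUTES -> service units (core-hours), to two decimals
core_hours() {
    awk -v cores="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", cores * minutes / 60 }'
}

core_hours 1 5     # prints 0.08
core_hours 16 90   # prints 24.00
```

A single-core job that runs for only a few seconds costs far less than 0.01 SU, so at that scale the accounting increment dominates the real usage.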

#### Priority

As described below, if your jobs aren't running right away, you can use

```
$ squeue -u <CNetID>
```
to see what's going on.

If your jobs are queued with `(Priority)` status, it means other jobs are taking priority. Job priorities are determined by the size of the job, its time in the queue, the requested wall time (so it pays to be precise and know your jobs well), and group-level prior usage. Groups that have consumed fewer resources get higher priority than those using more. This usage has a half-life of approximately 14 days, which means it's less awkward to spread jobs out over time. You can view the priority of queued jobs using

```
$ sprio
```

#### Useful commands

##### sinteractive

To get a dedicated job that you can interact with just like any login session--e.g., if you want to manually type commands at the command line to try some code out, make changes, do some analysis, etc.--you can use the `sinteractive` command:

[https://rcc.uchicago.edu/docs/using-midway/#sinteractive](https://rcc.uchicago.edu/docs/using-midway/#sinteractive)

If you connected via `ssh -X`, then graphical windows will also get forwarded from the cluster to your local machine.

##### sbatch

To submit a single non-interactive job to the cluster, use the `sbatch` command:

[https://rcc.uchicago.edu/docs/using-midway/#sbatch](https://rcc.uchicago.edu/docs/using-midway/#sbatch)

This involves preparing a script with special indications to SLURM regarding how much memory you need, how many cores, how long you want the job to run, etc.
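As an illustration, the snippet below writes a minimal batch script to a file. Everything specific in it is an assumption made for the example: the file names `example_job.sbatch` and `my_script.R` are hypothetical, the resource values are arbitrary, and the `cobey` partition and `R/4.0.1` module are simply the ones mentioned elsewhere in this section. See the official documentation for the options that apply to your job.

```shell
# Write a minimal single-core job script (quoted 'EOF' keeps the text literal).
cat > example_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=cobey
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:30:00
#SBATCH --output=example_%j.out

# Load the software the job needs, then run it.
module load R/4.0.1
Rscript my_script.R
EOF

# Submit it on Midway with:
# sbatch example_job.sbatch
```

The `#SBATCH` lines are the "special indications" to SLURM: here, one core, 2000 MB of memory per core, and a 30-minute time limit. `%j` in the output file name is replaced by the job ID.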

##### squeue

View jobs you are currently running:

```
$ squeue -u <CNetID>
```

##### scancel

Cancel jobs by job ID:

```
$ scancel <job-ID>
```

Cancel all of your jobs:

```
$ scancel -u <CNetID>
```
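A middle ground between cancelling one job and cancelling everything is to cancel only the jobs that have not started yet. `scancel` accepts a `--state` filter for this. A minimal sketch (the wrapper function name is hypothetical):

```shell
# Cancel only the PENDING jobs of the given user; running jobs are left alone.
cancel_pending() {
    scancel -u "$1" --state=PENDING
}

# Example: cancel_pending <CNetID>
```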

##### scontrol update

Useful for changing the resources requested for PENDING jobs.

Move a PENDING job to another partition:

```
$ scontrol update partition=<partition_name> qos=<partition_name> jobid=<jobid>
```

Change the time limit for a PENDING job:

```
$ scontrol update TimeLimit=<HH:MM:SS> jobid=<jobid>
```

##### accounts

Display number of used/available CPU-hours for the lab:

```
$ accounts balance
```

Display number of CPU hours used by you:

```
$ accounts usage
```

Display account usage (SUs) for a previous period (update `--cycle` to indicate the academic cycle of interest):

```
$ /software/systool/bin/accounts balance --account=pi-cobey --cycle=2019-2020
```

##### sview

If you're connected graphically to Midway (either via `ssh -X` or using ThinLinc), you can get a graphical view of the SLURM cluster, which makes it easy to do things like selectively cancel a bunch of jobs at once:

```
$ sview
```

The most useful command: Actions > Search > Specific User's Job(s)

##### quota

Display storage usage:

```
$ quota
```
The standard quota on an individual account is 30GB. When you exceed this, Midway will notify you the next time you log in. There is a grace period of 7 days to adjust your usage. There is also a hard limit of 35GB.

To check the file and storage breakdown on `/project2/cobey` (or `/project/cobey`), add the `-F` flag:

```
$ quota -F /project2/cobey
```
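`quota` tells you how much you are using, but not where. When you need to get back under the limit, a quick way to find the culprits is to rank the entries in a directory by size. This is a sketch using the standard `du`, `sort`, and `head` tools; the function name is hypothetical, and hidden files and directories (names starting with `.`) are not included by the `*` glob.

```shell
# biggest_dirs DIR [N]
# List the N largest entries directly under DIR (default 5), sizes in KB, largest first.
biggest_dirs() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example: biggest_dirs "$HOME" 10
```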

### SLURM tricks

`squeue -o` lets you specify additional information for squeue using a format string. These are annoying to type every time you want to query things. You can create an alias in your `.bash_profile` script:

```{sh}
alias sq='squeue -o "%.18i %a %.9P %.8j %.8u %.8T %.10M %.9l %R %B %C %D %m"'
```

which includes standard info plus some extra stuff (including time limit, # nodes, # CPUs, and memory). Then you can just type `sq`, `sq -p cobey`, `sq -u edbaskerville`, etc. to perform queries with your customized format string.

See `man squeue` for a list of format string options.

`sinfo -o` has similar options, including the ability to see how many processors are available/in use:

```{sh}
alias si='sinfo -o "%P %.5a %.10l %.6D %.6t %C"'
```

A full list of aliases for user edbaskerville might look like:

```{sh}
alias sq='squeue -o "%.18i %a %.9P %.8j %.8u %.8T %.10M %.9l %R %B %C %D %m"'
alias qcobey='sq -p cobey'
alias qsandyb='sq -p sandyb -u edbaskerville'
alias si='sinfo -o "%P %.5a %.10l %.6D %.6t %C"'
alias icobey='si -p cobey'
alias isandyb='si -p sandyb'
```

### Other Midway items to keep track of

#### Protected health information (PHI) and MidwayR
There are instances where we work with data that qualifies as [PHI](https://en.wikipedia.org/wiki/Protected_health_information). Most often, this is individual-level data that includes a specific date related to an individual (e.g., birth date, admission date, vaccination date). These types of data cannot be stored or used in jobs running on Midway2/Midway3. For PHI and other protected data types, MidwayR may be useful. MidwayR is similar to the Midway computing environment but is also equipped with tools and software needed to meet the highest levels of secure data protection. Even when using MidwayR, it may be necessary to obscure dates that qualify as PHI.

#### Allocation requests
Each September we have to submit an allocation request to the RCC to request Service Units (SUs). SUs are the basic unit of computational resources; each represents the use of one core for one hour on the Midway cluster. Allocation requests from 2019-2020 and 2020-2021 are housed on `/project2/cobey/allocation_requests_midway_resources`.

#### Node life
For each node purchased by the lab, the warranty is 5 years. Once the warranty runs out, the nodes can still be active on Midway2, but if they break or go down, the RCC will not fix them. For any long-duration jobs, please ensure you are using a node that is under warranty to reduce the likelihood of data loss. A spreadsheet with nodes, purchase date, warranty duration, and node label is housed in `/project2/cobey/allocation_requests_midway_resources`.

For the current status of private nodes, use:


```{sh}
sinfo -p cobey
```

Or for nodes with "cobey" in the name:

```{sh}
sinfo | grep cobey
```
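To turn that listing into a single number, e.g., to see whether any private node is free before submitting a long job, you can ask `sinfo` for just the state and node count and add up the idle ones. A sketch, with a hypothetical function name:

```shell
# idle_nodes PARTITION -> number of nodes in that partition whose state is exactly "idle"
# -h drops the header; %t is the compact node state, %D the number of nodes in that state.
# States with a suffix (e.g. "idle*" for non-responding nodes) are deliberately not counted.
idle_nodes() {
    sinfo -p "$1" -h -o "%t %D" | awk '$1 == "idle" { n += $2 } END { print n + 0 }'
}

# Example: idle_nodes cobey
```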


```{r private-nodes, out.width='100%', echo=FALSE, fig.show='hold', fig.align="center", fig.cap='Node list with status September 2022'}
knitr::include_graphics(rep("images/private-nodes.png"))
```

#### Transitioning from Midway2 to Midway3
Any nodes purchased after March 2021 are housed on Midway3. The RCC will mount storage from Midway2 onto Midway3, meaning it will be accessible when running jobs from Midway3. As of August 2021, this has not yet occurred (due to delays related to the pandemic), but they expect this will occur by October 2021. After this point, it will make sense for all new projects to be housed on Midway3 instead of Midway2.


## PHI and PII

Protected Health Information (PHI) and Personally Identifiable Information (PII) must be handled carefully according to the IRB protocol of the project, legal restrictions, and UChicago policy. Policies are covered in the Human Subjects and HIPAA trainings.


### Links to Policies

(Available only within UCM network or VPN)

[Access Control Policy](https://services.uchospitals.edu/sites/PoliciesAndProcedures/HIPAA%20Security/1%20-%20IT%20Security%20Policies/03_POL-AC%20Access%20Control%20Policy.pdf)

[Responsibilities and Oversight Policy](https://services.uchospitals.edu/sites/PoliciesAndProcedures/HIPAA%20Security/1%20-%20IT%20Security%20Policies/01_POL-RO%20Responsibility%20and%20Oversight%20Policy.pdf)

[Personal Computing Device Policy](https://services.uchospitals.edu/sites/PoliciesAndProcedures/HIPAA%20Security/1%20-%20IT%20Security%20Policies/06_POL-BD%20Personal%20Computing%20Device%20Policy.pdf)

[Data Classification Policy and Handling Procedures](https://services.uchospitals.edu/sites/PoliciesAndProcedures/HIPAA%20Security/1%20-%20IT%20Security%20Policies/02_POL-DC%20Data%20Classification.pdf)

[HIPAA Security](https://services.uchospitals.edu/sites/PoliciesAndProcedures/Pages/HIPAA-Security.aspx)

[UCMC Remote Work Policy](https://services.uchospitals.edu/sites/PoliciesAndProcedures/UCH%20Administrative/A09-23%20_Remote%20Work.pdf)

[Biological Sciences Division Security Office](https://security.bsd.uchicago.edu/)

[HIPAA Privacy](https://services.uchospitals.edu/sites/PoliciesAndProcedures/Pages/HIPAA-Privacy.aspx)

### Handling PHI

Data is vulnerable "at rest" (on a local machine or in cloud storage) and "in transit" between machines over the internet or another network. It can also be recovered after it has been deleted from certain types of drives, and copies are often retained on systems where data was stored. Online services (Slack, email systems, GitHub, messaging apps) -- even when attachments appear to be limited to two parties -- also expose data to the company that owns the service, and to their internal logs, backups, third-party services, et cetera.

It is important, therefore, to use secure systems even for temporary storage, or when transferring files to MidwayR. PHI generally needs to be deleted after the conclusion of a study according to a Data Use Agreement, or IRB protocol.

CrashPlan Pro should be set up to exclude any PHI data directories, and backups should be included when deleting data. Copies can also be in recycle bins, logs, and swap files.

There's no official secure deletion application for individual files. However, [CCleaner](https://www.ccleaner.com/) and similar programs can help with the process on local machines. Whole-disk deletion should be coordinated with the [BSD Information Security Office](https://security.bsd.uchicago.edu/).

If data is exposed, lost or stolen, it should be reported according to the corresponding protocol or agreement.

Methods for transferring files are available in the [MidwayR User Guide](https://midwayr-docs.rcc.uchicago.edu/data-transfer/).


<!--chapter:end:05-coding.Rmd-->

# Onboarding

Welcome to the lab! This is an incomplete checklist of items to cover in your first week. We'll review all of this in a meeting the first day.

The most important things are to read the whole handbook and not to hesitate to ask questions.

**Before you arrive**, if possible, let me know if you'll need a laptop, so we can have one waiting for you.
Connie can work with you if you need a visa.
I can also help link you up with people who can help with housing.

**Workspace**

* Identify hardware to purchase: monitor, laptop, external hard drive, keyboard, mouse, etc. 
* Any other accommodations needed for comfortable work station?
* Key to lab (from E&E office, $20 deposit required)
* Get UChicago ID card from library. This gives you access to the building.
* Consider asking for access now to other spaces you may visit, e.g., immunology (4th floor of KCBD).

**Computing and admin**

* [Request an account](https://rcc.uchicago.edu/accounts-allocations/general-user-account-request) to use the Cobey partition on Midway.
* Request access to the lab's Asana workspace.
* Request access to the [cobeylab github account](https://github.com/cobeylab). (Create a github account if you've not yet.)
* Get access to our [Instagram account](https://www.instagram.com/cobeylab/). You're free to contribute whenever you want, but please check with subjects that they don't mind their picture being posted to the internet.
* Request access to the theory group calendar and the lab calendar
* Download and install the [University VPN](https://uchicago.service-now.com/it?id=kb_article&kb=kb00015292) so you can access resources off campus.
* Consider signing up for emails on additional seminar series and groups. (Note that a list of almost all BSD seminars will automatically be emailed to you every day.)
* Ask Mike Guerra to help you set up CrashPlan Pro and the external hard drive for backups.
* Mike can also help you with ongoing IT support requests such as adding a printer. *Please CC bsdis@bsd.uchicago.edu on all requests so a ticket is generated.*
* Figure out a lab service task. These tasks are basically chores, e.g., managing office supplies and the lab environment, maintaining the lab calendar and getting people signed up for meetings, etc. On Asana, they go by the euphemism "Areas of Responsibility". They're nonetheless critical for allowing us to work efficiently and happily. You'll also be assigned to a week of kitchen duty (basically, making sure other people clean up their messes, and picking up the slack if they don't).
* Write your profile for the lab website, and upload your CV (we could review it together first, if you like).

**Research**

* Scour the active research projects on Asana. (If you can see them, you're welcome to investigate the contents.) Ask people what's going on with each.
* Read more recent papers from the lab so you are familiar with lab members' areas of expertise.
* Discuss one-year goals and your long-term plans. 
* Identify short-term research goals (i.e., your initial project(s)), deadlines for initial research outputs, and deadlines for funding and conferences. 
Although we should focus on outputs, let's discuss if there are any immediate skills you need to develop to complete the work (Figure).
* Let me know how you like to interact, how you like to work, and if there are practices or principles described in the handbook that you would like to see changed, or at least don't work well for you. We can negotiate policies.
* Learn who/what is funding your research. Read at least the corresponding grant proposal, if we have one, and potentially other recent proposals (all on Asana).
* Identify a time for weekly meetings.
* Figure out with me if you need training in human subjects research.

```{r data-plot, out.width='40%', echo=FALSE, fig.show='hold', fig.align="center", fig.cap='No'}
knitr::include_graphics(rep("images/data_science.png"))
```

**Introductions**

* Lab members! We'll have a big lab lunch or dinner soon after you arrive, but I encourage you to meet with everyone individually in the first week or two.
* E&E administrators and IT
* Building administrators and custodians
* Neighbors in the Erman Building
* Other faculty
* Postdocs should consider joining the postdoc happy hour on Fridays

**Life and around town**

* Check out University [perks & discounts](https://humanresources.uchicago.edu/benefits/retirefinancial/perks/index.shtml). Note deals for phones, housing, etc.
* If you're commuting from outside Hyde Park, consider signing up for commuter benefits (if you are eligible). Get a [Ventra](https://www.ventrachicago.com/) card, which works on all buses and the El/CTA.
* You can download the [Metra](https://metrarail.com/) app if you'll be taking the Metra Electric line to/from downtown.
* Ask around about which health plans are en vogue.


<!--chapter:end:06-onboarding.Rmd-->

`r if (knitr:::is_html_output()) '
# References {-}
'`

<!--chapter:end:08-references.Rmd-->