Finish manuscript and submit to journal #174

Closed
AlexanderTitus opened this issue Apr 2, 2019 · 21 comments

Comments

@AlexanderTitus
Collaborator

Hi All,

Can we wrap this up and get the manuscript submitted? I think we are ready for the final read and edits, then submission.

@Benjamin-Lee Can we aim to get this done by April 15th?

@rasbt
Collaborator

rasbt commented Apr 2, 2019

I think it would be good to wrap this up at some point. I started making edits to tips 1-4 and 6, but I also noticed that there's still a lot of work in front of us as everything is currently a bit rough. To keep this project from going stale (which it seems it already has), we may want to think about how to delegate the editing tasks and also attach some deadline; otherwise, it will probably never get done.

@agitter
Collaborator

agitter commented Apr 5, 2019

> there's still a lot of work in front of us as everything is currently a bit rough

I agree with that assessment. It is great to have a complete draft, but @rasbt's latest pull requests show there is a lot of refinement needed and some further decisions about the main message of each tip. #173 also raised good points about how to make the manuscript more actionable and concise.

Rather than fix a due date, I suggest we determine which 2-5 contributors can devote 2-3 weeks of focused effort to finalize the messaging and polish the text. Unfortunately, I can't volunteer for that due to other pressing manuscript deadlines. I'm happy to watch pull requests and provide assistance or feedback when I can.

@Benjamin-Lee
Owner

Hi all,

So sorry for being gone for the last three weeks. I was out of the country for spring break followed immediately by another international trip for a conference. I am still 100% committed to getting this done, although the next ~10 days will be rough for me (manuscript revisions due 4/7 followed by a midterm on 4/10 followed by a class's final paper due 4/15). Let's aim to get #171 merged ASAP followed by #173, which has some good suggestions for tips 1-3.

I can take on making the changes @michaelmhoffman suggested in #173. @michaelmhoffman, can you take a look at the remaining tips since your suggestions have been really good?

@rasbt and @agitter, can you point to some areas that you think are rough? No need to propose any specific changes; I just want to know which areas you think need some focus so that I can work on them.

@rasbt
Collaborator

rasbt commented Apr 5, 2019

> @rasbt and @agitter, can you point to some areas that you think are rough? No need to propose any specific changes; I just want to know which areas you think need some focus so that I can work on them.

I think it's a bit heterogeneous overall: very detailed in some aspects, very unspecific in others. I think this is a consequence of the different author styles and of patching things here and there. It might help if a single person took a look at the manuscript as a whole and imposed a consistent style. The main ideas are all there, but it nonetheless feels a bit disjointed. This could all be easily addressed if someone took the lead and smoothed things out. After that, we can have another round of feedback and editing that involves all contributors.

@michaelmhoffman
Collaborator

I've added more handwritten comments through the beginning of tip 7 to #173.

@rasbt
Collaborator

rasbt commented Apr 6, 2019

> I've added more handwritten comments through the beginning of tip 7 to #173.

I haven't read those completely, but from a quick glance, your first comment ("what is bio-specific about it?") matches something I also noticed. I think we are currently very focused on discussing DL in general (and sometimes even too focused on trying to cover general ML as well). In other words, it feels like this was written as a general "DL tips" document with some bio-literature citations added later to make it appear bio-specific. It's not bad, but maybe we need to move the "DL bio" aspects more into the center given the type of journal we are submitting this to.

Like I mentioned in another PR, my personal opinion is that we may be trying too hard to cover everything, i.e., general ML, general DL, and bio-DL. Of course, some of the more general ML concepts are relevant for DL, but on the other hand, we also have to keep in mind that this is not a textbook. I wonder if it might even be helpful to just assume that readers have some basic ML, pattern classification, or predictive modeling knowledge so that we can focus (given the limited space) more on the DL-bio specific topics so that this article would add unique value (considering that there are already many ML and DL intro articles out there).

@Benjamin-Lee
Owner

> maybe we need to move the "DL bio" aspects more into the center given the type of journal we are submitting this to

100% agree on this. The paper currently reads more as "Ten quick tips for deep learning for biologists" rather than "Ten quick tips for deep learning in biology". Although the target audience is (and should be) biologists, a greater emphasis on the specifics of DL in bio is warranted.

> I wonder if it might even be helpful to just assume that readers have some basic ML, pattern classification, or predictive modeling knowledge so that we can focus (given the limited space) more on the DL-bio specific topics so that this article would add unique value (considering that there are already many ML and DL intro articles out there)

I think that assuming some basic knowledge would make writing the paper easier, although I worry that, given the current hype around DL, basic knowledge cannot be assumed.

@agitter
Collaborator

agitter commented Apr 16, 2019

One option is to make Tip 1's message "These are a few essential things you must know about ML before diving into DL in biology. If you are not familiar with these concepts, review a general ML guide (e.g. Chicco 2017) first."

That isolates all of the general ML discussion to a single tip and the rest would pertain to DL in biology.

@evancofer
Collaborator

I agree with @Benjamin-Lee that it is a bad idea to assume that readers are extremely familiar with ML principles. Correct me if I am wrong, but I don't think we are close to the content limit? As such, I don't get why there is such urgency to remove introductory material. It's probably a better idea to identify more actionable advice that we can add to existing tips. With a large set of authors, it's easier to add more content now and take things away later (doing so in a uniform manner that brings about a consistent style, etc.). I am concerned we will lose steam (e.g. contributors and interest) during a lengthy revision of what is essentially background/introductory material, when we really need to be adding more advice (i.e. "tips") to what we already have.

@Benjamin-Lee
Owner

@evancofer That's pretty reasonable. I do agree that increasing the actionability of the paper is a worthwhile goal. With that in mind, what if we make sure that each section ends with a sentence summarizing an easily actionable version of the tip?

E.g. for tip 1 (rephrasing @agitter's comment): "If you are not familiar with machine learning, review a general ML guide before diving into DL." and for tip 2: "Create and fully tune several traditional models such as logistic regressors or random forests before implementing a DL model"

@evancofer
Collaborator

evancofer commented Apr 18, 2019 via email

@Benjamin-Lee Benjamin-Lee mentioned this issue Apr 22, 2019
1 task
@AlexanderTitus
Collaborator Author

@Benjamin-Lee is this project still active? What's our status?

@Benjamin-Lee
Owner

So sorry for taking a while to get on this. I chatted with @evancofer and @agitter at RECOMB a few weeks ago and the project is definitely still active. I personally just finished finals and moved to NYC and am starting a summer position at MSKCC so I have been a bit distracted.

That being said, I think the single most important thing we need to do is to make the paper more concrete in terms of examples (of positive and negative uses of DL) as well as actionable recommendations. Additionally, I am thinking about merging the current tips 8 ("Your DL models can be more transparent") and 9 ("Don’t over-interpret predictions") since they both cover model interpretation, but I'm not entirely sure.

My immediate priority is addressing @agitter's comments in #181 and #182 so we can get that merged. I'd love to hear everyone else's thoughts on what needs to get done before we submit.

@agitter
Collaborator

agitter commented Jun 3, 2019

My main priority after our RECOMB conversation was to finalize who the audience is, what we assume they already know, and what we want them to know after reading this. We have discussed this before, but it's important to get right so I'll raise it again.

I am assuming that our audience does not have much background in ML but has heard about deep learning applications in biology. They are curious whether deep learning is appropriate for their problem and where to turn to get started. If that is the case, I think we need to place even more emphasis on ML fundamentals than we have now. Tip 1 and parts of other tips may not be enough. We may even want to add a "Decide whether deep learning is right for your problem" tip.

If the audience already has a background in ML and biology applications, there are many existing technical resources that they can use to get started with deep learning.

The audience could also be readers who have ML and deep learning backgrounds but not much in biology who want to learn what is special about biology for deep learning applications. Our current rules do not seem aimed at that audience. They focus on teaching deep learning concepts, not biological concepts.

I also agree with adding more examples and making the tips less abstract.

@rasbt
Collaborator

rasbt commented Jun 3, 2019

> We may even want to add a "Decide whether deep learning is right for your problem" tip.

That's a good point, and we should definitely include something like that!

Last time I read the first section(s), it came across as a bit too negative/preachy. There is nothing wrong with it from the technical side, but I could imagine that the order and style may be discouraging for newcomers. So, I wonder if it might also be a good idea to restructure the flow of the article a little bit. I.e., we could:

  1. Briefly describe what deep learning is and select a handful of fascinating problems in biology where DL made a substantial difference (this would be the introduction). I think we want to start with something exciting and positive here, not "DL is just ML" and "using DL is not a good idea in the majority of research problems."

  2. Go over the general tips for applying DL, but for each section, have a biology-related example/case study. This would be, to a large extent, what we already have, but smoothed out so that it reads like a homogeneous text (right now, due to the many edits we made individually, it is a bit like a patchwork).

  3. End the article with a "Decide whether deep learning is right for your problem" tip. I think this should come last because readers will have a better idea of how DL can be applied to biological problems and are also better aware of the challenges after reading the article, not before :).


Timeline

Also, to really get back to making good progress on this article, I think we should nail down the target audience now (and also put it prominently in the README.md of this repo as well as in the introduction of the article) so that everyone is clear about it when we continue contributing content and making edits.

Right now is really a good time, because for most, it's "summer", which doesn't mean it's not busy, but it's probably less busy than the typical semester schedules. It would actually be great if we could get this submitted before September.


Target Audience

My preference regarding the target audience would be

> If the audience already has a background in ML and biology applications...

I think that there are already many ML resources / articles for ML in bio out there. Also, I think it is unlikely that someone who is not familiar with biology/biological applications would be very interested in applying DL to it. There are maybe some exceptions in collaborative settings, but it will probably only be a small fraction of people for whom this will be useful.

Also, I think for people who don't have any experience with ML/DL, our article would be the wrong medium. Those probably need to learn about scientific computing and ML in general, which cannot be taught in an 8-10 page article but would require a more fundamental tutorial or course.

Personally, the target audience I picture is a computationally oriented biologist who ideally even has some basic bioinformatics background (maybe has taken a traditional intro class). The person has used some basic regression or classification algorithms (linear regression, maybe RF or SVMs) on some problem sets. Now, the person would like to see whether more recent (or "sophisticated") predictive modeling approaches could solve the problem better. Having used some basic ML, the person knows that a dataset should be partitioned into training and test folds and is aware of the issues of overfitting. However, when looking at DL frameworks and their tutorials, it seems to the person that using DL is far more complicated than using an SVM in, e.g., scikit-learn. Things the person may wonder about are:

  • Where to start?
    • What architecture should I choose?
    • How should I prepare my data? In scikit-learn, I used a design matrix representation, but when I look at DL examples, they all work on text or pixels, so is there something fundamentally different about DL?
  • Also, is my dataset big enough?
  • My model is not training, is this a problem with my dataset, or is there an issue with the model I chose?
  • My dataset has so many features, and neural networks are so complicated; how do I prevent the model from overfitting and/or exploiting noise in the data?
  • How can I find out what the neural network actually learned? How can I interpret that model?

@agitter
Collaborator

agitter commented Jun 4, 2019

@rasbt has persuaded me that a target audience that knows ML and biological applications is most appropriate, mainly because of the below comment:

> Also, I think for people who don't have any experience with ML/DL, our article would be the wrong medium. Those probably need to learn about scientific computing and ML in general, which cannot be taught in an 8-10 page article but would require a more fundamental tutorial or course.

I strongly agree with this and would like to see this point clearly stated in the introduction (after the enticing points about DL in bio, as @rasbt suggested). We could also briefly refer to some existing resources for learning ML fundamentals. I like the "Ten quick tips for machine learning in computational biology" paper, but it is too short to help someone who doesn't have an ML background to learn ML.

> Having used some basic ML, the person knows that a dataset should be partitioned into training and test folds and is aware of the issues of overfitting.

Even if the reader is aware of these best practices, some of them are so important and so frequently violated that we may still want to keep them in the tips here.

@Benjamin-Lee do you agree regarding the target audience? If so, we can update the readme with this more focused direction and make plans to reorient the article.

@Benjamin-Lee
Owner

> @Benjamin-Lee do you agree regarding the target audience? If so, we can update the readme with this more focused direction and make plans to reorient the article.

I think @rasbt is right on the target audience. By narrowing the audience from biologists in general to computationally oriented biologists we can make the article less general which, to my mind, is the biggest issue.

> Even if the reader is aware of these best practices, some of them are so important and so frequently violated that we may still want to keep them in the tips here.

+1 on this. When taken with #174 (comment) ("I think we want to start with something exciting and positive here, not "DL is just ML" and "using DL is not a good idea in the majority of research problems.""), my inclination would be to condense tips 1 and 2 into a paragraph at the end of the introduction.

Unless anyone has an objection to it (if so, please speak up!), let's reorient the article.

@fmaguire
Collaborator

fmaguire commented Jun 5, 2019 via email

@agitter
Collaborator

agitter commented Jun 5, 2019

> my inclination would be to condense tips 1 and 2 into a paragraph at the end of the introduction.

I like them as separate tips because I find test set reuse, evaluation metrics, and baselines to be critical, although these tips may need major rephrasing or restructuring. They also don't need to be tips 1 and 2.
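
To make the test set reuse and baseline points concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the data are made up, and the helper names (`split_once`, `majority_baseline`, `accuracy`) are hypothetical rather than taken from the manuscript or any library. It splits the data a single time, leaves the test rows untouched until the end, and reports what a trivial majority-class baseline scores on them.

```python
import random
from collections import Counter


def split_once(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it a single time into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common


def accuracy(predict, rows):
    """Fraction of (features, label) rows that the predictor gets right."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


# Toy, made-up data: 80 negatives and 20 positives (imbalanced on purpose).
data = [((i,), 0) for i in range(80)] + [((i,), 1) for i in range(20)]

# Split once; all model selection and tuning should use only `train`.
train, test = split_once(data)

# The baseline is fit on training labels only and scored on the held-out rows.
baseline = majority_baseline([y for _, y in train])
baseline_accuracy = accuracy(baseline, test)
print(f"majority-class baseline accuracy on held-out test set: {baseline_accuracy:.2f}")
```

On imbalanced data like this, accuracy alone can look high even for a model that has learned nothing, which is why the choice of evaluation metric and a baseline number belong next to any DL result.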

@agitter
Collaborator

agitter commented Jun 8, 2020

Following the discussion in #205, I'm bumping this issue from last summer. It contains some crucial points that I suggest need to be addressed before this is ready to submit:

@raghuyennamalli

I think the first paragraph or two should make it crystal clear who the audience is and how this resource will help them.

I see that this has been addressed in response to what I posted elsewhere.

8 participants