Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp Markdown bindings for overhauled website #3699

Merged
merged 27 commits into from
May 26, 2024

Conversation

rcurtin
Copy link
Member

@rcurtin rcurtin commented May 2, 2024

This is a "paired PR" along with #61.

In recent months, I have been writing Markdown documentation (like this) for each individual C++ class that mlpack provides. This is easy to compile into HTML, and there is even a script to do it, in scripts/build-docs.sh. However, this handwritten Markdown documentation did not integrate with the automatically-built Markdown documentation for our bindings.

I spent a few weeks trying to figure out how to put it all together, and I was able to do it, although it required a significant refactoring of how the Markdown bindings work. Here is a working version of the new mlpack website (also with changes from the PR referenced above):

https://www.ratml.org/misc/mlpack_new/

and, the documentation for Python is relatively unchanged, but now fits in with the themeing and formatting of the rest of the documentation:

https://www.ratml.org/misc/mlpack_new/doc/user/bindings/python.html

There are a number of important changes to the functionality here that I'll list below:

  • Instead of generating one gigantic mlpack.md file that contains documentation for all languages, the generate_markdown program now takes two parameters: the first specifies the language to print documentation for, and the second specifies whether to generate the documentation as a Markdown file, or the sidebar for the documentation as HTML. This means that all the crazy sed/awk crap that the website build script had to do to split mlpack.md into one file for each language is no longer needed (phew!).

  • The generated Markdown documentation is actually committed to the repository, in doc/user/bindings/. This allows someone to browse the documentation without needing to do make markdown---in fact, they can browse it directly on Github (although Github's Markdown reader doesn't render everything perfectly). It also allows website building to be done without actually compiling mlpack, which is significant time savings: building the website from scratch before would take several hours, because we had to build each version of the library, and then build the Markdown bindings, etc., etc.

  • To ensure that the Markdown bindings that are committed in the repository don't go out of date, I added scripts/check-markdown-docs.sh, and I'll set up a CI job on Jenkins that just runs it. The job will fail if the bindings have been updated in a way that the generated Markdown changes, but those changes were not committed to doc/user/bindings/.

  • build-docs.sh now supports custom sidebars for documentation pages. I wanted the bindings to display a different sidebar---if the user is browsing Python documentation, the sidebar should not be full of links to C++-specific documentation. Example Python page, example regular documentation page

  • All of the URLs in the SEE_ALSO()s for each binding are now checked by the build-docs.sh script. In fact, all links in the generated Markdown bindings are now ensured to be valid.

  • HISTORY.md is now also compiled into HISTORY.html by scripts/build-docs.sh, so that it can be referenced from the website. I had to make a handful of changes there so it would render reasonably.

I know the diff is large... any feedback is much appreciated. 😄

@conradsnicta
Copy link
Contributor

conradsnicta commented May 3, 2024

@rcurtin It's really nice to see this come together. A few comments and suggestions based on the current state at https://www.ratml.org/misc/mlpack_new/

Main landing page:

  • In the short description under the mlpack name in top center, suggest to omit "C++"; maybe change the description to "extensive library of flexible machine learning algorithms"; the idea is to avoid being pigeonholed into C++-only land right off the bat, as mlpack has bindings to other languages
  • Suggest to change the two blurbs under the logo (separated by a vertical line) into a simple bullet point list without the vertical divider
  • Suggest to add a third bullet point, explicitly stating that there are bindings to Python etc
  • For the citation section, suggest to use only one paper (the latest JOSS article from 2023), and remove the older ones; they're distracting and redundant
  • Suggest to remove or simplify the Benchmarks section; the table is too dense and has too many details; at the very least reduce the table down to 3 or 4 rows maximum; suggest to also remove the Shogun and WEKA columns -- these projects are pretty much dead and including them is just distracting
  • Suggest to change the "Download mlpack" button to link to the "Quickstart" page, instead of directly pointing to the .tar.gz archive; also remove mlpack version details, in order to reduce the maintenance burden; more on that below

Quickstart page:

  • Suggest to rename this section to "Installation" or "Download"
  • Suggest to have a separate section with "Latest Stable Release", linking to mlpack-4.3.0.tar.gz. This is instead of stating "mlpack is available in most package managers, or the source can be downloaded directly"
  • Suggest to add a note along these lines: "Pre-built mlpack packages are provided for various operating systems; these packages may not be the latest version; if you're encountering problems, use the official stable release provided here"

Documentation page:

  • Several sections appear empty, so assuming these are under construction: "Clustering algorithms", "Geometric algorithms", "Preprocessing utilities"

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went a bit faster on the markdown part, but spent more time on the others
Technically, I agree with @conradsnicta on the following points, as I have felt the same thing:

C++ should not be the centre, as in some cases could be pushing people away, even though I agree that in my opinion, the ultimate objective of mlpack is to focus on places where other library fails to deliver, especially when it comes to embedded system C/C++

    In the short description under the mlpack name in the top centre, suggest omitting "C++"; maybe change the description to "extensive library of flexible machine learning algorithms"; the idea is to avoid being pigeonholed into C++-only land right off the bat, as mlpack has bindings to other languages

I am not sure about this, yes it does make sense and we should add it, but I do not know where, and talking about Python Immediately without C++ can give the impression to people that we are a Python library, which should never be the case because we can not compete with other libraries when it comes to Python.

    Suggest to add a third bullet point, explicitly stating that there are bindings to Python etc

Totally agree on this one, it does not make too much sense to cite the older one unless it was for a specific reason.

    For the citation section, suggest using only one paper (the latest JOSS article from 2023), and remove the older ones; they're distracting and redundant

I think we should move this to a different place, and Yes I agree on removing Weka and Shogun, I never heard of mlpy or used it. We should replace these with TensorFlow or TensorFlow Lite, Pytorch, or any other modern maintained framework.

Also, we should mention some benchmarks regarding the binary footprint.

    Suggest to remove or simplify the Benchmarks section; the table is too dense and has too many details; at the very least reduce the table down to 3 or 4 rows maximum; suggest to also remove the Shogun and WEKA columns -- these projects are pretty much dead and including them is just distracting

This is exactly what I thought about the quick start page as well, it should be named installation.

    Suggest to change the "Download mlpack" button to link to the "Quickstart" page, instead of directly pointing to the .tar.gz archive; also remove mlpack version details, in order to reduce the maintenance burden; more on that below

Quickstart page:

    Suggest to rename this section to "Installation" or "Download"
    Suggest to have a separate section with "Latest Stable Release", linking to mlpack-4.3.0.tar.gz. This is instead of stating "mlpack is available in most package managers, or the source can be downloaded directly"
    Suggest to add a note along these lines: "Pre-built mlpack packages are provided for various operating systems; these packages may not be the latest version; if you're encountering problems, use the official stable release provided here"

A couple of things that I had in mind:

  1. I would remove the FAQ page because everything on that page has been already said before, and it does not add any value to the library, only maintenance burden.

  2. If It is up to me I would remove the entire contributor section from the community, it is not well-updated, and there are so many names, that I doubt anyone would come to see what this is about.

  3. I would simplify the page community, there is so much information, that it is hard to find the needed one

3.1) I would remove the word Developer from in front of the names, it feels too much academic, even all of us have left academia long time ago 😄

  1. We should remove the tutorial button from the front page because it is leading to some very old thing. We can bring it back with modern tutorials.

  2. We should add a page that talks about the Ecosystem of mlpack, (for example adding link to ensmallen, bandicoot, armadillo, and other libraries, that are using mlpack. Not only this, we can add old and current users, and what are they doing with the library, something like use cases, which is more relevant to people and attractive as well.

  3. I really like Pytorch website, I think we should get some inspiration from there, regarding Edge dedicated page, and adding a Blog page, since we are going to add these in the future.

  4. I really like the About page in Pytorch as well, including the installation

  5. Finally, we do not need a link to the Home page with the home button, everyone is clicking on the logo, it is the easiest and the fastest way

@@ -92,7 +92,7 @@ BINDING_SEE_ALSO("@random_forest", "#random_forest");
BINDING_SEE_ALSO("Naive Bayes classifier on Wikipedia",
"https://en.wikipedia.org/wiki/Naive_Bayes_classifier");
BINDING_SEE_ALSO("NaiveBayesClassifier C++ class documentation",
"@src/mlpack/methods/naive_bayes/naive_bayes_classifier.cpp");
"@src/mlpack/methods/naive_bayes/naive_bayes_classifier.hpp");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was really well hidden

Comment on lines +102 to +104
// I wonder how long this full text PDF will remain available...
"https://www.microsoft.com/en-us/research/uploads/prod/2006/01/"
"Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it was up to me I would remove it from the binding for two reasons.

  1. This is an unnecessary piece of code that does not do anything and we have to maintain it in the longer term, and in some places host the paper.
  2. We have really good documentation right now, this is where users and developers are intended to land and not on the main.cpp of the binding. This means that this will never be read or used.

If it was up to me I would remove all the papers from here and mention them directly in the docs, the smartest way actually is to put a link in the bindings to our documentation instead of one to the paper. There, in the doc, we can take care of the paper and the details of the method.

Copy link
Member Author

@rcurtin rcurtin May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this does get read and used, these SEE_ALSO() links get compiled directly into the "see also" section for the binding documentation. For instance, these links right here end up in the Bayesian linear regression documentation for each binding language: http://ratml.org/misc/mlpack_new/doc/user/bindings/python.html#see-also-1

As far as maintenance goes, the CI job that builds and tests documentation should tell us if the URLs ever go out of date.

@rcurtin
Copy link
Member Author

rcurtin commented May 3, 2024

Actually I will mark this as a draft for now so that I can handle all the changes, and then mark it as ready.

@rcurtin rcurtin marked this pull request as draft May 3, 2024 12:41
@rcurtin
Copy link
Member Author

rcurtin commented May 13, 2024

I was traveling this past week, so it took me a while to get back to, but I much appreciate the comments @shrit and @conradsnicta. The example website has been updated with all the changes below.

I also made some significant changes to the sidebar---instead of having different sidebars for binding documentation and for C++ documentation, now the same sidebar is used for both. I think that it's a lot more consistent and easy to navigate now.

Some responses to each point and changes that I made. (Sorry that it got really long...)

In the short description under the mlpack name in top center, suggest to omit "C++"; maybe change the description to "extensive library of flexible machine learning algorithms"; the idea is to avoid being pigeonholed into C++-only land right off the bat, as mlpack has bindings to other languages

C++ should not be the centre, as in some cases could be pushing people away, even though I agree that in my opinion, the ultimate objective of mlpack is to focus on places where other library fails to deliver, especially when it comes to embedded system C/C++

Yes, I agree with this. I thought about it for a little while---I want to keep it clear that we are header-only, but if you remove the C++ then the term doesn't make sense... so I instead added a second line to the subtitle: with bindings to Python, Julia, R, Go, and the command-line. Hopefully that should cover the bases here.

Suggest to change the two blurbs under the logo (separated by a vertical line) into a simple bullet point list without the vertical divider

Yeah, maybe less visually interesting but certainly easier to read. Done.

For the citation section, suggest to use only one paper (the latest JOSS article from 2023), and remove the older ones; they're distracting and redundant

Yeah, fair point, removed.

Suggest to remove or simplify the Benchmarks section; the table is too dense and has too many details; at the very least reduce the table down to 3 or 4 rows maximum; suggest to also remove the Shogun and WEKA columns -- these projects are pretty much dead and including them is just distracting

I think we should move this to a different place, and Yes I agree on removing Weka and Shogun, I never heard of mlpy or used it. We should replace these with TensorFlow or TensorFlow Lite, Pytorch, or any other modern maintained framework.
Also, we should mention some benchmarks regarding the binary footprint.

Well, I agree with both of those things, but I don't have any other benchmark numbers handy and want to avoid diving into getting benchmarking set up again... I also want to have some kind of proof point for "fast" on the homepage, even with the clear problems that have been pointed out and even though the numbers are almost old enough that they could get a driver's license in the US. Anyway, I want to update these for sure, but I'd prefer to leave the cleanup for a follow-up and come back to it.

Suggest to change the "Download mlpack" button to link to the "Quickstart" page, instead of directly pointing to the .tar.gz archive; also remove mlpack version details, in order to reduce the maintenance burden; more on that below

Ah, thanks! I had meant to do that but apparently forgot to actually make the change. Done now.

Quickstart page:
Suggest to rename this section to "Installation" or "Download"

Yes, good point, that will help prevent confusion between the "quickstarts" which are documentation and this page. I went with "Download".

Suggest to have a separate section with "Latest Stable Release", linking to mlpack-4.3.0.tar.gz. This is instead of stating "mlpack is available in most package managers, or the source can be downloaded directly"
Suggest to add a note along these lines: "Pre-built mlpack packages are provided for various operating systems; these packages may not be the latest version; if you're encountering problems, use the official stable release provided here"

Thanks, I did my best to refactor the download page to mention these.

I would remove the FAQ page because everything on that page has been already said before, and it does not add any value to the library, only maintenance burden.

All of the questions are obvious to us who know the project already, yeah, but a lot of people may be unfamiliar with these questions. I do get a decent number of emails from people who do not appear to have much expertise and would benefit from these pages. It doesn't hurt to keep the page there; I think none of the answers have been updated in a very long time anyway. ensmallen, Armadillo, and Bandicoot have similar pages.

If It is up to me I would remove the entire contributor section from the community, it is not well-updated, and there are so many names, that I doubt anyone would come to see what this is about.

The only thing that kept me from doing this is that some contributors from very long ago never had their contributions linked with a Github account. But I guess that maybe they are called out in COPYRIGHT.txt or something like this. Anyway, I removed it.

I would simplify the page community, there is so much information, that it is hard to find the needed one

With the contributors list gone, it's a lot shorter now. Really the thing that most people will be interested in is the chat channels and maybe video call; the rest of the stuff is not particularly important but does need to go somewhere. That contribution guide is really helpful too---I point lots of people towards it and it saves me time writing the same thing over and over again. :)

I would remove the word Developer from in front of the names, it feels too much academic, even all of us have left academia long time ago 😄

Yeah, I think the NSF wanted it written like that many years ago. I dropped the "Developer" part, even though I do still dip my toes in academia from time to time. :)

We should remove the tutorial button from the front page because it is leading to some very old thing. We can bring it back with modern tutorials.

Where do you see the link to the tutorials? Maybe I am looking at the wrong page? I don't see one on the new homepage. I thought I had removed all links to the old tutorials.

We should add a page that talks about the Ecosystem of mlpack, (for example adding link to ensmallen, bandicoot, armadillo, and other libraries, that are using mlpack. Not only this, we can add old and current users, and what are they doing with the library, something like use cases, which is more relevant to people and attractive as well.

Not a bad idea, but I would prefer to save it for later, or if you want to do it, feel free.

I really like Pytorch website, I think we should get some inspiration from there, regarding Edge dedicated page, and adding a Blog page, since we are going to add these in the future.

Definitely we will need more direct links to the application notes and tutorials we will write, but I'm hesitant about adding a blog. Unless you have a team of people dedicated to writing blog content, it will always go out of date and then users who come look at the blog say "huh, it hasn't been updated in 3 years, I guess the project must have no momentum". I finally removed the old GSoC blog we used to use for this reason.

I really like the About page in Pytorch as well, including the installation

Actually the current version of the mlpack homepage has an installation selector kind of like PyTorch does, but I have found it hard to maintain and a little bit complicated. So I just went with something simple. If someone wanted to replace it with something better or more elaborate in the future, I'm not opposed, but I would prefer not to (I enjoy writing C++ not HTML/JS :)).

Finally, we do not need a link to the Home page with the home button, everyone is clicking on the logo, it is the easiest and the fastest way

I see what you mean, but it doesn't hurt anything to leave it there. I personally prefer the explicit callout to the home page.

@conradsnicta
Copy link
Contributor

further comments in mlpack/mlpack.org#61 (comment)

@rcurtin rcurtin marked this pull request as ready for review May 17, 2024 02:14
@conradsnicta
Copy link
Contributor

@shrit - may need to approve this PR as well, since it's part of mlpack/mlpack.org#61

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@rcurtin
Copy link
Member Author

rcurtin commented May 26, 2024

The documentation build fails, but only because the new website is not yet deployed. Otherwise all the links seem ok.

@rcurtin rcurtin merged commit 9107c55 into mlpack:master May 26, 2024
19 of 20 checks passed
@rcurtin rcurtin deleted the website-revamp branch May 26, 2024 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants