Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a marshalled output for index generation #1084

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

panglesd
Copy link
Collaborator

This PR adds a new marshalled output for the generation of index files.

This is desirable for several reasons:

  • It allows to combine several pre-built indexes, making the index generation more incremental
  • It can be read by other OCaml tools using odoc as library, for instance when building a search engine index (cc @EmileTrotignon). Provide a Complete Module Index #1022 is also relevant.

The first commit fixes a bug in the computation of the ID of the page containing a standalone comment. The second commit is the actual change.

There are some design to discuss:

  • This PR adds a --marshall flag to the compile-index: by default the output is json, but it is marshalled with this flag.
  • The files containing marshelled indexes have to be prefixed by index- and have to have the .odoc extension.
  • Currently, the format is a hashtable. I think it would be better to have a better datastructure, similar to the one to contain occurrences.
  • The complexity when merging indexes many times is not ideal. For instance the following commands would be in O(n^2)
$ odoc compile-index --marshall -o index-all1.odoc a1.odocl
$ odoc compile-index --marshall -o index-all2.odoc a2.odocl index-a1.odoc
$ odoc compile-index --marshall -o index-all3.odoc a3.odocl index-a2.odoc
$ odoc compile-index --marshall -o index-all4.odoc a4.odocl index-a3.odoc
$ odoc compile-index --marshall -o index-all5.odoc a5.odocl index-a4.odoc
...
$ odoc compile-index --marshall -o index-all<n>.odoc a<n>.odocl index-a<n-1>.odoc
$

But I think that is ok.

@EmileTrotignon
Copy link
Collaborator

I do not like the index-<name>.odoc syntax. It is unusual to express file type with a prefix, and it might break scripts/makefiles that use *.odoc. .odoci or .odoc_index would be better.

@Drup
Copy link
Contributor

Drup commented Mar 18, 2024

I must admit I'm quite confused at the difference between *.odoc and *.odocl (are there others) ?

@EmileTrotignon
Copy link
Collaborator

@Drup
.odoc represent the data of a single page. .odocl also, but the page has been linked with others so that it is possible to resolve references.

In term of command line, you need a single module (cmti ?) to create a .odoc, but you need every .odoc to create a .odocl.
There are no others except the one in this proposal.

@panglesd
Copy link
Collaborator Author

I think the idea is that .odoc files are the result of the compile phase (the odoc compile command) while .odocl files are the result of the link phase (odoc link). Those phases and the dependencies inside them is not completely defined in odoc, but we are working on this.

In addition to the compile command (which create files corresponding to modules and standalone pages), several other commands generate .odoc files:

  • odoc compile-src produces a .odoc file (containing the required implementation info to render source code)
  • odoc count-occurrences produces a .odoc file (containing a marshalled table mapping ids to number of occurrences)
  • odoc source-tree produces a .odoc file (containing a source tree used to generate html navigation of directories when rendering source code)

Implementation and source trees are produced during the compile phase, and are meant to be linked. Contrary to this, occurrences and indexes are generated after the link phase, so I definitely agree that their extension should not be odoc nor odocl.

About the index-<name>.odoc syntax: it is not that unusal, at least in odoc:

  • pages must be named page-<name>.odoc
  • implementations must be named src-<name>.odoc
  • occurrence tables must be named occurrences-<name>.odoc
  • source trees must be named srctree-<name>.odoc

So, I think we have to options:

  1. Define a single extension for "post-link" files such as occurrences and indexes, and use prefix to distinguish them. For instance occurrences-<name>.odoca, index-<name>.odoca and toc-<name>.odoca (table of contents that will be introduced in the future might need a post-link file)
  2. Define one extension per "post-link" files: <name>.odoc-occurrences, <name>.odoc-index, <name>.odoc-toc.

I would prefer the first option.

@EmileTrotignon
Copy link
Collaborator

About the index-.odoc syntax: it is not that unusal, at least in odoc:

I know it is quite common in odoc, but at least for files names I really do not like it because it is a weird convention that fails to communicate its intent, so this requires knowledge of odoc to understand. It might also restrict names available for users, in a way that will be unexpected : no one will try to name a page .index (or at least they will understand that it might have meaning), but index- might happen.

I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca have completely different (binary) content and they cannot be used interchangeably at all.

Regarding implementation and sources, it seems to me that they are used in the same context (linking) and as such can share an extension. Their content could even be of the same ocaml type if we wanted to (maybe they are ?).

I deeply favor (2), even though I understand that it requires changes out of the scope of this PR.

The only benefits for (1) I found in your message is the fact it is done like that in other places in odoc, but as it seems insufficient with regard to the drawbacks expressed above. Maybe there are other benefits I did not consider ?

@EmileTrotignon
Copy link
Collaborator

The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page

@panglesd
Copy link
Collaborator Author

I'm not decided on which is better, but let me still answer to some of you comments

this requires knowledge of odoc

Using odoc requires knowledge of odoc. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!

no one will try to name a page .index (or at least they will understand that it might have meaning), but index- might happen

This is not a problem at all. I don't think there can be any issue:

  • a compilation unit named <name> cannot have - in it
  • Other odoc produces artifact are prefixed: for instance, the compilation of index-bli.mld will generate page-index-bli.mld and that would not cause problems

I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca have completely different (binary) content and they cannot be used interchangeably at all.

odoc files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents.
Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.


So:

  1. index-<name>.odoca, occurrences-<name>.odoca:
    • Makes it explicit which phase those files belong to: .odoca files belong to the "post-link" phase, which takes .odocl files as input.
    • Consistent with what exists in the compile (.odoc) and link (.odocl) phase
  2. <name>.odoc-index, <name>.odoc-occurrences
    • @EmileTrotignon could you tell what what are the main benefits, considering what I wrote above?

@panglesd
Copy link
Collaborator Author

The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page

(sherlodoc uses entries_of_item, which is also changed by the bugfix.)

@EmileTrotignon
Copy link
Collaborator

(sherlodoc uses entries_of_item, which is also changed by the bugfix.)
I had missed that, the change still looks fine.

@EmileTrotignon
Copy link
Collaborator

EmileTrotignon commented Mar 18, 2024

Using odoc requires knowledge of odoc. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!

Yes, but it that case, we make up our own convention when a perfectly fine one already exists, so even though it's not unreasonable to ask people to learn things, my impression is that here, we are making it harder when its not necessary.

Other odoc produces artifact are prefixed: for instance, the compilation of index-bli.mld will generate page-index-bli.mld and that would not cause problems

You are right that it will not cause conflicts, but it could be confusing for people who might be trying to understand what is happening looking at _build, for instance if you also had bli.mld. Having a bli.odoc-index would be immediately understood for what it is.

odoc files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents.
Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.

Yes, I do not mean that they should be used completely interchangeably, this is not possible if two file have different content, but that there exists a context where they are all accepted. This is the case for json (you can always pass a json file to jq), and I hope it is for .odoc (a .odoc is always valid as something to be linked against ? It can always be used to generate and odocl ?).
There is no such context for occurrences and index files, the only thing they have in common is being produced at link time, which is not the usual meaning of a file extension.

Regarding consistency, I think that this convention is more appropriate for src- and page- because they have shared uses, which is not the case for index- and occurrences-. (This could reformulated into : the meaning .odoc is not "the output of the compile phase" but "the input of the link phase", and there would be no such meaning for ".odoca")

To me, the main benefit of option (2) is that it makes it clear what you can do with the file, in a way that is understandable with minimal friction. This feel like a way more important information about a file than when it was produced. This is also more important because these files will used by tools external to odoc.

I think I have expressed myself fully, maybe a third person could take the final decision if you still disagree after reading this. .odoca is already a lot better than just .odoc.

in
Arg.(
value & opt_all convert_fpath []
& info ~doc ~docv:"FILE" [ "file-list" ])
in
let marshall =
let doc = "whether to output a json file, or an .odoc file" in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick : I think it should be "a .odoc file" (pronounced "a dot odoc file")

@EmileTrotignon
Copy link
Collaborator

I have read the diff of the second commit and it looks good to me aside from a very small nitpick.

panglesd and others added 4 commits June 5, 2024 12:12
Standalone documentation comments currently do not have an id. This id was
carried as the accumulator of the field, which yielded wrong results!

Signed-off-by: Paul-Elliot <peada@free.fr>
Signed-off-by: Paul-Elliot <peada@free.fr>
@EmileTrotignon EmileTrotignon force-pushed the search-index-buid-system-friendly branch from 205afd9 to 4d28c05 Compare June 5, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants