
Add docs #328

Open

wants to merge 41 commits into master
Conversation


@lahdjirayhan commented Dec 12, 2021

I think I can try writing some docs (docstrings, examples). I may not be able to document all scrapers and all the machinery in it at the moment, just the ones I've used or can fairly understand. That being said, I think having some sort of documentation on this library is still a good thing to do.

Should I continue working on this, @JustAnotherArchivist? I apologize in advance if this PR feels like it comes out of nowhere.

Sidenote: I'm not sure of the difference between #6 and #7, or which of them this PR is solving.

The docs currently reside in index. It is mostly the result of sphinx-quickstart. Will move later.

The docstrings are still in a very initial phase. I wasn't experienced enough with the library to warrant thorough docstrings.

Also move/restructure the rst.
@TheTechRobo
Contributor

I think #6 is more about the limitations and workings of each individual scraper, while #7 is about documenting the Python API.

@JustAnotherArchivist
Owner

Thank you for your attempt at this! The lack of documentation is definitely a major issue, so I do appreciate this!

One thing that immediately sticks out for me is something I'm deeply allergic to, though I'm not sure I spelt it out anywhere before, so apologies for that. It's duplication of information. It makes future maintenance harder because one has to remember to update things in multiple places. This is inevitably forgotten eventually, which then leads to inconsistencies and confusion.

There are three primary examples here:

  • The twitter-reference.rst file. I'd like to avoid listing the scrapers and even their modules in the docs at all. Instead, they should be discovered automatically through snscrape.modules. Apparently sphinx.ext.autosummary might be able to do this (with hacks). (Basically, adding or removing scrapers or modules should not require any changes to the docs directory.)
  • Function signatures, i.e. argument types, default values, and return value type. This should all happen through type hints, and it appears that Sphinx supports that with autodoc_typehints = "description" (which is even enabled in your docs/conf.py). Default values appear to require an extension.
    • In an ideal world, the argument and return value descriptions would also live in annotations (PEP 593); this is what I mentioned in Document the individual modules better #6 and doesn't seem to be supported anywhere yet. But at least there's only minimal duplication there (the argument names), so that's acceptable for now.
  • The package metadata and version in docs/conf.py. Note that the version is not recorded anywhere in the code; it comes from setuptools_scm via git tags. This should instead require an installation of snscrape at doc build time and then fetch that data using importlib.metadata.
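The importlib.metadata approach mentioned in the last point could look roughly like this in docs/conf.py (a sketch, assuming snscrape is installed at doc build time; the fallback branch is only there so the snippet runs anywhere, and the short-version split assumes the setuptools_scm scheme described above):

```python
# docs/conf.py (sketch): pull metadata from the installed package instead of
# hardcoding it, so nothing in docs/ needs updating on a release.
from importlib.metadata import metadata, PackageNotFoundError

try:
	_meta = metadata('snscrape')
	project = _meta['Name']
	release = _meta['Version']  # full setuptools_scm version, e.g. '0.4.3.20220106'
except PackageNotFoundError:
	# Placeholder values for when snscrape is not installed (hypothetical fallback).
	project, release = 'snscrape', '0.0.0'

version = '.'.join(release.split('.')[:3])  # short X.Y.Z form for Sphinx
```

This way the only source of truth for the version remains the git tags via setuptools_scm.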

It probably makes sense to split the documentation writing into three parts: type hints and docstrings for all public API, separate documentation (like a general introduction, CLI vs Python layer, and the Twitter example), and everything directly related to Sphinx (configuration etc.). Most of my points above fall into the last part. Perhaps you'd like to focus on just one for now?


A couple comments on style. For docstrings, it should be this, in line with the few existing ones:

def func(foo: str) -> str:
	'''Basic description

	Args:
		foo: a description of foo's significance
	Returns:
		a description of the return value
	'''

	return foo

That is: single quotes, no empty line at the end of the docstring, and an empty line between the docstring and the code. Further, I think the class description should go into a class-level docstring, and only __init__ argument description into the method docstring.
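Putting those points together, a class following this convention might look like the following (a sketch with a hypothetical ExampleScraper, not part of snscrape): class description in the class-level docstring, only the argument descriptions in __init__.

```python
class ExampleScraper:
	'''Scrapes items from a hypothetical example service.'''

	def __init__(self, username: str, maxItems: int = 100):
		'''
		Args:
			username: the account whose items should be scraped
			maxItems: upper bound on the number of items to return
		'''

		self._username = username
		self._maxItems = maxItems
```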

For the rST files, the same style as for my Python code should be used: tabs for indentation, spaces for alignment. Lines should be broken only where it makes sense (i.e. one paragraph = one line); I'm not a fan of breaking text into lines at random points based on 1980s-era screen widths. :-)

That's all I can think of right now. Looking forward to seeing where this is heading! :-)

@JustAnotherArchivist
Owner

And @TheTechRobo's comment is correct. #6 is about high-level documentation of the existing scrapers, what they target, returned item fields, etc., while #7 is about the usage of snscrape from Python. There's some overlap though, naturally.

@lahdjirayhan
Author

Thank you for your feedback. This is one of my first attempts to contribute to a project that's not my own, so I understand that I need to adjust some style accordingly.

I also agree with the points you brought up about (avoiding) code duplication. I dislike unnecessary duplication myself. It's just that I wasn't (and still am not) experienced with setting up Sphinx, and I think that showed in my previous attempts. I need to look further and try to configure Sphinx better so that docs/ directory (and especially, the "API reference"-like section) will not need to be separately updated and maintained.


I have some questions, though:

  • Instagram: currently, each scraper class still needs the mode parameter to be specified, despite the already specific class names. What should be done here?
    1. Recommend that users use InstagramCommonScraper and set the mode themselves. Or,
    2. Set up each Instagram scraper so that the mode is set in each class's __init__. For example, something like this:
class InstagramUserScraper(InstagramCommonScraper):
    def __init__(self, name, **kwargs):
        super().__init__(mode="User", name=name, **kwargs)
    ...
  • Reddit: technically I can import the specific scraper classes, but they are not defined literally in the source code; instead, they are created through _make_scraper. I understand this is done to vastly reduce code duplication, since each scraper class would contain mostly the same code anyway. But how should the docstrings be written in this case?
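One possible answer for the Reddit case (a simplified sketch, not snscrape's actual _make_scraper): since classes are ordinary objects, the factory can attach the docstring after creating the class, and Sphinx autodoc will read it from __doc__ like any other docstring.

```python
def _make_scraper_sketch(kind: str, doc: str):
	# Simplified stand-in for a factory like Reddit's _make_scraper.
	class _Scraper:
		def __init__(self, name):
			self._name = name

	_Scraper.__name__ = f'Reddit{kind}Scraper'
	_Scraper.__qualname__ = _Scraper.__name__
	_Scraper.__doc__ = doc  # docstring set dynamically; autodoc picks this up
	return _Scraper

RedditUserScraper = _make_scraper_sketch('User', 'Scrapes the submissions and comments of a Reddit user.')
```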

@JustAnotherArchivist
Owner

Yeah, I have little experience with Sphinx myself as well. Incidentally, that's part of why I put off working on the docs for such a long time. But I'm sure we can find a good solution.

Good points on the Instagram and Reddit scrapers. Yes, both of those are a mess and should be refactored. I'll look into that.

On that note, I should also revisit the public vs private parts of the code. For example, I don't consider InstagramCommonScraper part of the public API; people should only use the subclasses like InstagramUserScraper. Documentation is much less important for the private ones, naturally. I'll do some cleanup there as well. Expect a decent number of renames everywhere. Hopefully it won't be too messy to rebase your branch afterwards.

Add initial edits into conf.py. I'm trying things out right now.

The previous commit (copy-pasted template from SO) has the following behavior when tested on my local device:

1. It generated all necessary stub .rst files under /_autosummary. (good)
2. It generated all HTMLs for each module, submodule, and class. (good)
3. (the bug) The snscrape.modules page lists -- but does not link to any of -- its submodules. The individual HTMLs for each submodule exist; they are just not linked in the toctree. I can open each resulting HTML manually in my browser.

I did some trial and error; this commit reflects the minimal change that makes the snscrape.modules page link to its submodules. At least it works.
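For reference, the usual shape of that SO autosummary template is an index.rst entry like the following (a sketch; the exact entry and template files under docs/ may differ):

```rst
.. autosummary::
	:toctree: _autosummary
	:recursive:

	snscrape.modules
```

The :recursive: option (Sphinx ≥ 3.1) is what makes autosummary descend into submodules; whether the generated page actually links them in its toctree depends on the module template, which is the part that typically needs the hack.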
@lahdjirayhan
Author

lahdjirayhan commented Jan 11, 2022

Contributor

@TheTechRobo left a comment


Just something that confused me for a second

snscrape/modules/twitter.py (outdated, resolved)
@lahdjirayhan marked this pull request as ready for review February 17, 2022 15:20
Contributor

@TheTechRobo left a comment


A quick skim found this

docs/conf.py (outdated, resolved)
docs/index.rst (outdated, resolved)
@TheTechRobo
Contributor

I'm getting this:

themandownstairs:~/sndocs git:add-docs ❯❯❯ sphinx-build -b html docs out
Running Sphinx v4.4.0

Configuration error:
There is a programmable error in your configuration file:

Traceback (most recent call last):
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/sphinx/config.py", line 340, in eval_config_file
    exec(code, namespace)
  File "/home/thetechrobo/sndocs/docs/conf.py", line 35, in <module>
    _major, _minor, _patch, _yyyymmdd = release.split('.')
ValueError: too many values to unpack (expected 4)

@TheTechRobo
Contributor

TheTechRobo commented Feb 20, 2022

Oh, I think it's because I'm on the dev version.

You should add a check for that
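A tolerant split along those lines could treat the fourth component as a build date only when it actually is one (a sketch; the variable names follow the conf.py traceback above, and the exact handling is an assumption):

```python
def split_release(release: str):
	# Handles release versions like '0.4.3.20220106' as well as
	# setuptools_scm dev versions like '0.3.5.dev231+g0832e95'.
	parts = release.split('.')
	_major, _minor, _patch = parts[:3]
	_yyyymmdd = parts[3] if len(parts) > 3 and parts[3].isdigit() else None
	return _major, _minor, _patch, _yyyymmdd
```

Unlike the fixed four-way unpack, this no longer raises ValueError when extra or missing components appear.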

@lahdjirayhan
Author

lahdjirayhan commented Feb 21, 2022

@TheTechRobo,

It does not happen on my end. The following is the snscrape version:

MINGW64 /e/Git-LINE/justanotherarchivist-snscrape/snscrape (add-docs)
$ snscrape --version
snscrape 0.3.5.dev231+g0832e95
>>> from importlib.metadata import metadata
>>> M = metadata('snscrape')
>>> M['version']
'0.3.5.dev231+g0832e95'

dev231+g0832e95 isn't exactly a YYYYMMDD date, and I probably should check for that, but it shouldn't produce the error you mentioned either. Maybe you get different output for snscrape --version?


@JustAnotherArchivist,

I'm not sure why my version is still at 0.3.5.dev ... surely it should be at 0.4.x by now? I've merged master to my branch every now and then, too. Is this okay? I installed snscrape in a virtual environment using pip install -e ., if it matters.

@JustAnotherArchivist
Owner

JustAnotherArchivist commented Feb 21, 2022

@lahdjirayhan The version is only updated when you run pip because it's stored in the egg-info. Try rerunning pip install -e . (no uninstall needed) and then checking again. I currently have a local version number like 0.4.3.20220107.dev29+g77bbb9f.d20220220 for example. (That date part at the end signifies that there are uncommitted changes.)

@JustAnotherArchivist
Owner

Also, apologies, I didn't notice that you marked this as ready for review. I'll take a look soon!

Contributor

@TheTechRobo left a comment


I really don't see anything other than these minor style questions. This is great!

Comment on lines +168 to +169
'''
Args:
Contributor


To match up with the rest of the docstrings, shouldn't there not be a newline there? (very minor i know but...)

Author


Args, Kwargs, Returns, Yields, etc. are usually positioned unindented at the left according to Google style. In addition, I feel uncomfortable putting Args: right after ''' because that spot is usually reserved for the docstring summary.

On the other hand, I don't think the __init__ docstring needs more explanation. The relevant information for each class is already given in the class docstring.

Comment on lines +165 to +166
'''
Args:
Contributor


Same question here.

@lahdjirayhan
Author

lahdjirayhan commented Feb 21, 2022

@JustAnotherArchivist Thanks for the info on the version. I'll try it out as soon as I can.

Update: Nope, I've tried rerunning pip install -e . again and it still shows the same 0.3.5.dev version. In addition to that, running git tag shows only up to v.0.3.4.

Contributor

@TheTechRobo left a comment


A very quick skim leaves me saying LGTM.

@TheTechRobo
Contributor

@lahdjirayhan I'm not sure that merging the branch in also gets the tags. That could be the problem.


#. **Instantiate a scraper object.**
``snscrape`` provides various object classes that implement their own specific ways. For example, :class:`TwitterSearchScraper` gathers tweets via search query, and :class:`TwitterUserScraper` gathers tweets from a specified user.
#. **Call the scraper's** ``get_item()`` **method.**
Contributor


get_item()?

#. **Instantiate a scraper object.**
``snscrape`` provides various object classes that implement their own specific ways. For example, :class:`TwitterSearchScraper` gathers tweets via search query, and :class:`TwitterUserScraper` gathers tweets from a specified user.
#. **Call the scraper's** ``get_item()`` **method.**
``get_item()`` is an iterator and yields one item at a time.
Contributor


Same here
