Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scanner uses a lot of memory #144

Open
anarcat opened this issue Apr 4, 2019 · 8 comments
Open

scanner uses a lot of memory #144

anarcat opened this issue Apr 4, 2019 · 8 comments

Comments

@anarcat
Copy link

anarcat commented Apr 4, 2019

Hi!

I've finally found the time to give supysonic a try, tired of dealing with the other Java monstrosities that populate the Subsonic ecosystem. ;) So, first: thanks for providing such an alternative, it's a breath of fresh air.

I have found that the initial scan of a large-ish music database (30 000+ songs) takes a long time (30 minutes with sqlite, 60 minutes with mysql) and a lot of memory (700MB). If I read the code correctly, entries are all recorded in memory before being flushed out to disk all at once. Is that correct?

I think it would be more efficient to write those entries directly to the database instead. It would be faster and less memory hungry.

As an aside, is there any reason why os.walk is not used to walk the file hierachy? I understand it used to be much slower than listdir in previous versions, but since we're stat'ing all files anyways, the performance impact shouldn't be that bad. And besides, newer versions (py 3.5) use os.scandir so it's actually much faster now... All I can see from the git history is 954c75b which says something about file permissions....

If you're open to the idea, I would like to reimplement parts of the scanner so that it's more efficient and flexible. I would like, for example, to follow symlinks (#48) and ignore certain directories as well.

@spl0k
Copy link
Owner

spl0k commented Apr 6, 2019

Hello!

I'm surprised with your scanning times 😮. I have a library with around 29 000 tracks and it 8 minutes on my computer. But I know the scanner has this issue of opening media files twice (once for metadata and another to check if there's an embedded cover art) that most likely affects its performance.
There's no explicit flush nor commit to the database. I'll have to check but I think Pony flushes automatically before an artist/album lookup. It might keep the data cached to rollback the transaction in case of an error though.

os.walk was used before but replaced by os.listdir because it was causing errors with paths that contained invalid unicode characters (see #85).
IIRC two errors were raised:

  • one in os.walk code, when concatenating path elements
  • another later when inserting in the database (hence the try/except in
    try: # test for badly encoded filenames
    f.encode('utf-8')
    except UnicodeError:
    self.__stats.errors.append(path)
    continue
    )
    If I'm not mistaken the first one only happened with Python 2, Python 3 replacing invalid characters by surrogates which were rejected by Pony or the database (causing the second error)

Go ahead if you have your idea on how to improve the whole thing. I'm not closed to symlinks, but that could lead to potential issues if they create loops on the file system, or point to folders that are already tracked by supysonic.

@anarcat
Copy link
Author

anarcat commented Apr 6, 2019

I'm surprised with your scanning times open_mouth. I have a library with around 29 000 tracks and it 8 minutes on my computer. But I know the scanner has this issue of opening media files twice (once for metadata and another to check if there's an embedded cover art) that most likely affects its performance.

Performance might be different because my music database is on spinning rust, are you using a SSD?

There's no explicit flush nor commit to the database. I'll have to check but I think Pony flushes automatically before an artist/album lookup. It might keep the data cached to rollback the transaction in case of an error though.

Yeah, I think it might be worth flushing the good stuff once in a while. As things stand now, it looks like all the entries are committed to the database at once, at the very end, and it takes about a minute to do so (gross estimate). I don't have the feeling it can recover from an interruption either, which is a bummer when you're running a process that takes half an hour... ;)

os.walk was used before but replaced by os.listdir because it was causing errors with paths that contained invalid unicode characters (see #85). IIRC two errors were raised: [concat, insert]

So I think you're correct in that concatenating entries will cause Python to crash because it tries to decode the bytes into whatever the environment says it should be. And yes, a similar will occur on insert as well.

The thing with pathnames in UNIX is they are not UTF-8. They can be encoded in any shape or form. In my filesystem, I have files encoded as ASCII, UTF-8, UTF-16 and ISO-8859-1. I know, it's bad, and I should clean that up, but it's not actually breaking any standard whatsoever. And the problem is that people are using various weird encodings for their filenames still. And because we exchange those files around, this is bound to cause problems again.

So my position on filenames is they should always be treated as binary, until they are displayed to the user, at which point an educated guess should be made to try and show something meaningful. There are various ways to do that, including the surrogates techniques, but in any case, broken display shouldn't hurt scanning.

I should also mention that I think Python 2 support is now overrated. ;) I think it's now perfectly acceptable for a "leaf" package like Supysonic to switch to Python 3 completely, even even require support for Python 3.5 for example. Py3 is well distributed and established in all major distributions and keeping Python 2 compatibility shims will just make your work harder for no good reason. ;)

I'll see if I can come up with a better scanner. And preferably one that is verifiably better - do you have any advice on how I could hook performance tests to the test suite?

Oh, and regarding symlinks I'd say: garbage in, garbage out. If people create loops in their filesystem, they get to see supysonic loop forever. Maybe they like stuff like that, I don't know why people would create such atrocities. ;) But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following. It would certainly be sufficient to my use case anyways. :)

Thanks so much for your feedback, it's very valuable!

@anarcat
Copy link
Author

anarcat commented Apr 6, 2019

Oh, and about encoding, there are things like chardet that can be used to make an educated guess about character encoding when utf-8 fails as well.. It's slow, but the idea is it's a fallback when all else fails.

@spl0k
Copy link
Owner

spl0k commented Apr 6, 2019

are you using a SSD?

Yes.

So my position on filenames is they should always be treated as binary, until they are displayed to the user

So you're suggesting we store paths as binary? There are times where filenames are indeed presented to the user:

  • if a media file is missing metadata, the filename is used in place of a track title
  • the Subsonic API allows browsing using the folder hierarchy. I think this a bit stupid nowadays and considered replacing this with the "browse via tags" version, but I guess some users might still find some use to this strategy.

I should also mention that I think Python 2 support is now overrated. ;) I think it's now perfectly acceptable for a "leaf" package like Supysonic to switch to Python 3 completely

I'm patiently waiting for Python 2's EOL to remove all that compatibility layer. Doing it earlier crossed my mind but it seems some people still use Py2.7 for Supysonic (eg. #137).

do you have any advice on how I could hook performance tests to the test suite?

err sorry no.

But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following.

That's so simple I didn't even think about it :D

@anarcat
Copy link
Author

anarcat commented Apr 6, 2019

So my position on filenames is they should always be treated as binary, until they are displayed to the user

So you're suggesting we store paths as binary?

Yes, actually. Or at least something (e.g. ISO-8859-1) that is 8bit doesn't need to be encoded/decoded.

There are times where filenames are indeed presented to the user:

  • if a media file is missing metadata, the filename is used in place of a track title
  • the Subsonic API allows browsing using the folder hierarchy. I think this a bit stupid nowadays and considered replacing this with the "browse via tags" version, but I guess some users might still find some use to this strategy.

Yeah that's fine. You can do the decode stuff then, but it's important to keep a handle on the binary in the back.

I'm patiently waiting for Python 2's EOL to remove all that compatibility layer. Doing it earlier crossed my mind but it seems some people still use Py2.7 for Supysonic (eg. #137).

I honestly have very little patience for this nowadays. :) For libs I understand, because sometimes there's this one library that hasn't been ported so the entire project still needs to work with python 2. But Supysonic is not a library, really... It's a end-user program, so I don't get why someone would run it as Python 2.

(Well, that's actually a lie: I had to run it as Python 2 on Debian stretch + Apache + WSGI because the WSGI bindings are py2 only. But then I could have set things up with fcgi as well and it should just have worked...)

But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following.

That's so simple I didn't even think about it :D

Hehe.. Glad to be of service. ;)

@spl0k
Copy link
Owner

spl0k commented Apr 7, 2019

because the WSGI bindings are py2 only

Are you sure?

@anarcat
Copy link
Author

anarcat commented Apr 7, 2019

gah, didn't find that one when i looked i guess. weird. it was pretty late. :p

spl0k added a commit that referenced this issue May 10, 2019
Should prevent 'database is locked' issues
Also helps reducing memory usage (#144)
@spl0k
Copy link
Owner

spl0k commented Jun 19, 2019

Oh btw, the full scan no longer use a single database session (as Pony calls it) for the whole process. Please tell me if that helps with the original issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants