Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big! #189

Open
ra00l opened this issue Jun 1, 2015 · 2 comments
Open

CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big! #189

ra00l opened this issue Jun 1, 2015 · 2 comments

Comments

@ra00l
Copy link

ra00l commented Jun 1, 2015

Hey! First of all, GREAT library. From what I noticed so far, it's pretty fast.

I discovered a problem while using CsQuery on multiple html files, from a WARC archive containing 4gb of data. The property CsQuery.HtmlParser.HtmlData.TokenIDs it getting too big for its own good. In my case, after a few minutes, it reaches over 10mb in size, and because of it, parsing is getting slower and slower.

Since you don't know when to clear it, I would suggest just adding a cleanup method, like:

        public static void ClearCachedTokens()
        {
            TokenIDs = new Dictionary<string, ushort>();
        }
@jamietre
Copy link
Owner

jamietre commented Jun 1, 2015

There's actually an open issue related to this (more specifically, the dictionary overflowing). Several solutions have been discussed such as making the token cache per-instance and using two dictionaries, since the dictionary is intialized with "hardcoded" tokens and it would be expensive to repopulate it each time. Or alternatively using a smart cache that GC's things that are infrequently accessed, or something. I just haven't had any time to do it because I don't use CsQuery during my day job any more.

In the meantime your idea is actually not bad, it would at least make the bug manageable by end users. It's not quite that trivial since it would have to be reinitialized with the static data, but still pretty simple compared to the permanent fix.

@ra00l
Copy link
Author

ra00l commented Jun 2, 2015

Jamie, thanks for your quick answer. I realised the code I had wasn's working this morning :) So after a bit of trial and error, this seems to work:

public static void ClearCachedTokens()
        {
            nextID = 2;
            TokenIDs = new Dictionary<string, ushort>(); //not necessary actually, it's in the constructor
            Tokens = new List<string>();
            InitInternal();
        }

InitInternal is actually the contents of the static constructor method, moved outside to be reused.

Maybe this helps anyone interested.

lukasbob added a commit to Siteimprove/CsQuery that referenced this issue May 3, 2016
…mplemented in HtmlData.cs.

This addresses issues where the `nextID` counter overflows ushort.MaxValue and wraps around to return the wrong index value, resulting in wrong strings output from the `Render` method.

Fix jamietre#204, jamietre#205, jamietre#189, jamietre#164
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants