Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace A_TIMESTAMP with A_FETCH_BEGAN_TIME #292

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hennekey
Copy link
Contributor

@hennekey hennekey commented Jan 14, 2020

It seems that A_TIMESTAMP went out of favor quite a long time ago.
A_FETCH_BEGAN_TIME is used within FetchHistoryProcessor and throws an
exception as is because of it.

It seems that A_TIMESTAMP went out of favor quite a long time ago.
A_FETCH_BEGAN_TIME is used within FetchHistoryProcessor and throws an
exception as is because of it.
@ato
Copy link
Collaborator

ato commented Jan 16, 2020

This seems reasonable but I'm not familiar enough with this module to confidently review this. I'm hoping someone more knowledgeable about this comments, otherwise I'll merge it in a few days if we don't hear anything.

@nlevitt nlevitt requested a review from kngenie January 16, 2020 19:42
@nlevitt
Copy link
Contributor

nlevitt commented Jan 16, 2020

This is @kngenie's code. I don't know if it's still used.

@hennekey
Copy link
Contributor Author

hennekey commented Jan 16, 2020

I was trying to use it in conjunction with the HBase recrawl modules but everything besides these class seems to use A_FETCH_BEGAN_TIME and so they're writing to the wrong attribute and FetchHistoryProcessor throws an exception here:

revisit.setRefersToDate((Long)history[1].get(A_FETCH_BEGAN_TIME));

@anjackson
Copy link
Collaborator

We use A_TIMESTAMP but can switch if that makes sense. However, I'm having trouble seeing what exactly you are referring to @hennekey because your links go to your private GitHub Enterprise instance. Any chance you can open it up? Or refer to this repo instead?

@hennekey
Copy link
Contributor Author

@anjackson Sorry about that, I've updated the link to the code.

Any explanation about how you make use of A_TIMESTAMP would be helpful so I can do the right thing.

@anjackson
Copy link
Collaborator

anjackson commented Jan 23, 2020

Thanks, but I think your original link refers to:

// A_TIMESTAMP has been used for sorting history long before A_FETCH_BEGAN_TIME
// field was introduced. Now FetchHistoryProcessor fails if A_FETCH_BEGAN_TIME is
// not set. We could stop storing A_TIMESTAMP and sort by A_FETCH_BEGAN_TIME.

?

We use it in much the same way, I think. We use it for deduplication and for checking how recently we visited to see if we need to re-crawl yet. But it looks like I hit similar issues because I populate both fields.

I'm not sure how important these semantics are -- the timestamp in question may not necessarily be the time the fetch began, but in practice I think it always is, so I think it's fine to reduce it all to that name.

@hennekey
Copy link
Contributor Author

Yeah, I did reference that comment.
It occurs to me now that perhaps semantically A_TIMESTAMP is meant to be the time the history map entry was created for that CrawlURI and A_FETCH_BEGAN_TIME is more specifically the point in time that the fetch action started.
I defer to you to the semantics; you've obviously got more experience with it. I'd be happy to make whatever changes we deem necessary as a result.

@hennekey
Copy link
Contributor Author

Your use of this constant also makes it clear that it's a more dangerous change than I thought. It is public afterall so there may be other users.

Perhaps the proper solution is to clarify the semantics and use the proper value in FetchHistoryProcessor instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants