Merge ait-qa branch to re-synchronised development #415

Draft
wants to merge 150 commits into
base: master

anjackson
Collaborator

This is a merge intended to re-sync with the ait-qa branch.

galgeek and others added 30 commits June 24, 2016 13:33
* aitfive-1176:
  hacked up tarball with heritrix, contrib and dependencies
  batch trough writes
  initial version of TroughCrawlLogFeed.java
* aitfive-1176:
  oops, String.join is a java 8 feature
* aitfive-1176:
  missed the maven tarball assembly file
* master:
  ughhh appease java 8 javadoc rules
  fix for race-condition when first using the WARC writers internetarchive#167
* aitfive-1176:
  include http response payload when logging errors from trough
  update for new data model queued_url->uncrawled_url and some new fields
* aitfive-1176:
  missed a comma
* aitfive-1176:
  missed another comma
* aitfive-1176:
  realized this stuff needs to be thread safe
* trough-esc-sql:
  escape strings in sql posted to trough
* trough-logging-tweak:
  reduce batch size to 400 and avoid ridiculously long log lines
adam-miller and others added 24 commits February 25, 2021 22:31
…ugh-crawl-log-feed-synchronization

Fix error log reporting of batch size for trough crawl logs
…discrepency between crawled and uncrawled lists
…7-fixes-trough-crawl-log-feed-synchronization

Revert "Fix error log reporting of batch size for trough crawl logs"
…gh-dedup-performance-rework

Adds trough dedup performance rework
…h dedup batches. Fix posting of trough dedup at end of crawl.
…gh-dedup-performance-rework

Refactor crawl log batch posting. Add configuration options for troug…
…gh-dedup-performance-rework

Ensure we set dedup schema when writing dedup shards
…gh-dedup-performance-rework

Add dedup load from in-memory dedup cache db
…gh-dedup-performance-rework

Refactor Trough client URL cache put. Limit TroughContentDigestHistor…
…odule-max-file-size-option

Adds bdbmodule max file size option
@ato
Collaborator

ato commented Jul 19, 2021

OK by me. Although this does include the removal of the hbase module which someone objected to last year (see #313).

@anjackson
Collaborator Author

Okay, I think that's synced up, but with HBase modules restored from the master branch.

7 participants