Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop anemone and use Spidr for Repo discovery #10947

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sjha4
Copy link
Member

@sjha4 sjha4 commented Mar 22, 2024

To Do:

  • Support basic auth
  • Support proxy

What are the changes introduced in this pull request?

Considerations taken when implementing this change?

What are the testing steps for this pull request?

bundle install
Go to Content > Product > Repo discovery
Run repo discovery.

@sjha4
Copy link
Member Author

sjha4 commented Mar 22, 2024

@evgeni : Thoughts on spidr gem as replacement for anemone?

@evgeni
Copy link
Member

evgeni commented Mar 22, 2024

Doesn't look crazy? Only dep is nokogiri, which we have anyway, tested on modern rubies. Why not.

@sjha4
Copy link
Member Author

sjha4 commented May 21, 2024

I am seeing significant performance difference with what's on the PR vs the existing workflow..Looking at ways to speed this up..Will push updates when I get the performance sorted.

Update: Should be good to go with latest commit.
Able to see chunked output in repo discovery page and the task finishes in about the same time as earlier.

@sjha4 sjha4 changed the title Early Draft - Drop anemone and use Spidr Drop anemone and use Spidr for Repo discovery May 21, 2024
@sjha4 sjha4 marked this pull request as ready for review May 21, 2024 16:59
app/lib/katello/repo_discovery.rb Outdated Show resolved Hide resolved
katello.gemspec Outdated Show resolved Hide resolved
Copy link
Member

@ekohl ekohl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation wise I think you should separate the crawler and Docker search into separate classes. Perhaps even the file crawl as well. Right now it's confusing.


def process_page_urls(urls)
urls.each do |url|
url = url.to_s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bad idea. The URL object is way more valuable and I'd only add it to @to_follow with to_s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't necessarily use the URL object properties once we have the list of URLs in the page. It's just matching strings after that point..Becomes easier to pass that around and work with it IMO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it in latest push to pass URL object.

app/lib/katello/repo_discovery.rb Outdated Show resolved Hide resolved
app/lib/katello/repo_discovery.rb Outdated Show resolved Hide resolved
katello.gemspec Outdated Show resolved Hide resolved
katello.gemspec Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants