Filter output field list #20

Open
adrianshort opened this issue Oct 2, 2018 · 4 comments
Labels
enhancement New feature or request
Milestone

Comments

@adrianshort
Owner

adrianshort commented Oct 2, 2018

Allow users to specify which fields they do or don't want included in the output.

Add only and except to the params hash in Authority#scrape.

Both of these should be a comma-separated list of field names.

Passing both only and except in the same call should raise an error.
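
A minimal sketch of how that filtering might look, assuming each scraped application is a Hash with symbol keys. The helper names parse_field_list and filter_fields are hypothetical and are not part of the existing code:

```ruby
# Hypothetical helper: turn "address, description" into [:address, :description].
def parse_field_list(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?).map(&:to_sym)
end

# Hypothetical helper: apply the only/except params to one scraped record.
def filter_fields(record, params)
  if params.key?(:only) && params.key?(:except)
    raise ArgumentError, "use either only or except, not both"
  end

  if params.key?(:only)
    wanted = parse_field_list(params[:only])
    record.select { |field, _| wanted.include?(field) }
  elsif params.key?(:except)
    unwanted = parse_field_list(params[:except])
    record.reject { |field, _| unwanted.include?(field) }
  else
    record
  end
end

record = { council_reference: "18/01234", address: "1 High St", description: "Loft conversion" }
filter_fields(record, only: "address, description")  # keeps address and description
filter_fields(record, except: "description")         # drops description only
```

Filtering after the scrape is the simplest approach, but it does not save any requests on its own.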

We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like documents: true which scrapes the contents of documents pages.

One specific use case is including or excluding personal data eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like personal_data: false.
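
As a rough sketch, a personal_data option could be a shorthand for an except list of known personal fields. The field names below and the default of true are assumptions for illustration only; the real list would depend on what each system exposes:

```ruby
# Hypothetical field names, not taken from the existing scraper.
PERSONAL_FIELDS = %i[
  applicant_name applicant_email applicant_phone
  agent_name agent_email agent_phone
].freeze

# Drop personal fields when personal_data: false is passed.
# Assumes personal data is included by default.
def apply_personal_data_option(record, params)
  return record if params.fetch(:personal_data, true)

  record.reject { |field, _| PERSONAL_FIELDS.include?(field) }
end

record = { address: "1 High St", applicant_name: "A. Person", agent_email: "agent@example.com" }
apply_personal_data_option(record, personal_data: false)  # keeps only address
```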

@adrianshort adrianshort added the enhancement New feature or request label Oct 2, 2018
@adrianshort adrianshort mentioned this issue Oct 2, 2018
4 tasks
@adrianshort
Owner Author

For Authority#scrape:

def scrape(params, options = {})

Note:

  • params is for what to scrape (the search terms sent to the site and the output desired)
  • options is for how to scrape (configuring the scraper's speed, user agent, etc).
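
To illustrate the split, here is a stand-in class rather than the real implementation. The option keys (delay, user_agent), their defaults and the validated_days search term are hypothetical:

```ruby
# Stand-in for the real class; only the params/options split is illustrated.
class Authority
  # Hypothetical option keys and defaults.
  DEFAULT_OPTIONS = { delay: 10, user_agent: nil }.freeze

  def scrape(params, options = {})
    options = DEFAULT_OPTIONS.merge(options)
    # A real implementation would run the search here; this stub just echoes
    # what it was asked to scrape and how.
    { what: params, how: options }
  end
end

result = Authority.new.scrape(
  { validated_days: 7, only: "address,description" }, # what to scrape
  { delay: 2 }                                        # how to scrape
)
```

Keeping only and except in params means the output selection travels with the search terms, while options stays limited to scraper behaviour.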

@KeithP
Contributor

KeithP commented Oct 2, 2018

If the fields a user excludes cover a whole tab, then we should skip scraping that tab entirely.

@adrianshort
Owner Author

True. And that's going to be a bunch of fun to code because different systems put their fields on different pages, so you'd need a data structure breaking down which systems, pages and fields correspond.
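
One possible shape for that data structure, as a sketch only. The system names, page names and field lists here are made up; the real mapping would have to be worked out per system:

```ruby
# Hypothetical layout: which fields each system shows on which page (tab).
FIELDS_BY_PAGE = {
  system_a: {
    summary:  %i[council_reference address description],
    dates:    %i[date_received date_validated],
    contacts: %i[applicant_name agent_name]
  },
  system_b: {
    details: %i[council_reference address description applicant_name agent_name]
  }
}.freeze

# Return the pages that still hold at least one field the user wants.
# A page is skipped only when every one of its fields is excluded.
def pages_to_scrape(system, excluded_fields)
  pages = FIELDS_BY_PAGE.fetch(system)
  pages.reject { |_page, fields| (fields - excluded_fields).empty? }.keys
end

pages_to_scrape(:system_a, %i[applicant_name agent_name]) # => [:summary, :dates]
pages_to_scrape(:system_b, %i[applicant_name agent_name]) # => [:details]
```

In the second call the page still has to be fetched because it carries other fields, so the excluded fields would be dropped from the output afterwards.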

@KeithP
Contributor

KeithP commented Oct 2, 2018

Mapping between differing data structures can be done like this:

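        # Rename each scraped application's keys to the target field names.
        # Assumes `app` is an array of hashes keyed by the symbols on the left.
        key_map = {
          council_reference: :application_number,
          date_validated:    :date_validated,
          scraped_at:        :fetched_at,
          info_url:          :detail_page_link,
          address:           :site_address,
          description:       :description_of_development,
          documents_count:   :documents_count,
          documents_url:     :documents_page_link
        }
        # Keys missing from key_map are kept as-is instead of collapsing to nil.
        ret = app.map do |app_hash|
          app_hash.map { |k, v| [key_map.fetch(k, k), v] }.to_h
        end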

@adrianshort adrianshort added this to the 1.0.0 milestone Oct 9, 2018

2 participants