Needed: Scraper that automatically extracts questions and answers from a URL #81

Open
Timoeller opened this issue Mar 26, 2020 · 3 comments
Labels: Data, enhancement (New feature or request)

Comments

@Timoeller
Contributor

For now, we have individual scrapers for each site. Adding more sites is a slow, manual process, and existing scrapers break when a site changes even slightly. See the individual scrapers here.

Automatic Scraper
We need a scraper that takes the URL of an FAQ page and automatically extracts its questions and answers in a structured way. The scraper might need some NLP-based question detection to identify which parts to extract. For some pseudocode, see here.

Datasources
We can curate a sheet of official FAQ pages so that relevant information can be crawled more quickly.
That way, the community can check both the validity of each source FAQ and whether the extraction worked.

@Timoeller added the enhancement (New feature or request) and Data labels on Mar 26, 2020
@DataWorm
Contributor

I think discovering and extracting the questions is easy enough: look for sentences that end with a question mark, and maybe also search those sentences for frequently used keywords like "corona, protect, safety, cure, home office...". Then compare the XPath structures of the matches to build an XPath that can identify all the questions, even if they do not contain any of those keywords.
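The heuristic described above could be sketched roughly as follows. This is only an illustration, not code from the project: it uses the standard library's `html.parser` and a simplified tag path (for example `/html/body/div/h3`) as a stand-in for a real XPath, and the names `TextPathParser`, `find_questions`, and `SAMPLE_HTML` are hypothetical. Text nodes that end with a question mark and contain a keyword act as seeds; the most common tag path among the seeds is then used to collect every question, including those without a keyword.

```python
from collections import Counter
from html.parser import HTMLParser

# Keywords quoted in the comment above; a real list would be longer.
KEYWORDS = ("corona", "protect", "safety", "cure", "home office")
# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextPathParser(HTMLParser):
    """Collect (tag_path, text) pairs; tag_path is a simplified XPath-like string."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def find_questions(html):
    """Return all text nodes that share the tag path of the seed questions."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    # Seeds: text ends with "?" and mentions at least one keyword.
    seeds = [
        path
        for path, text in parser.nodes
        if text.endswith("?") and any(k in text.lower() for k in KEYWORDS)
    ]
    if not seeds:
        return []
    # Generalize: assume the most common seed path is where questions live.
    question_path = Counter(seeds).most_common(1)[0][0]
    return [text for path, text in parser.nodes if path == question_path]


SAMPLE_HTML = """
<html><body>
<div><h3>How can I protect myself?</h3><p>Wash your hands.</p></div>
<div><h3>Is there a cure?</h3><p>Not yet.</p></div>
<div><h3>When will schools reopen?</h3><p>Unknown.</p></div>
</body></html>
"""

for question in find_questions(SAMPLE_HTML):
    print(question)
```

In the sample, the third question contains none of the keywords but is still found because it shares the tag path of the two seeds. A question whose text is split by inline markup (such as a `<b>` inside the heading) would be missed by this sketch, and a real implementation would probably also compare `class` or `id` attributes, not just tag names.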

Detecting answers might be trickier. Usually you would expect an answer to come after the question it belongs to and before the next question. However, sometimes there is also an overview of the questions with anchor links to the actual Q&A entries, so looking for anchor links around a question could be one way to avoid scraping failures. The content between two questions probably contains the answer text, but it might also contain a lot of other fragments that we want to avoid. Cleaning out those unwanted fragments might be the hardest challenge, I guess.
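A minimal sketch of the "answer is whatever sits between two questions" idea, again using only the standard library. It is an assumption-laden illustration rather than a proposed implementation: questions are detected simply by a trailing question mark, any text inside an in-page anchor link (`href` starting with `#`) is treated as overview or navigation and skipped, and the names `FlatTextParser`, `pair_questions_with_answers`, and `FAQ_HTML` are hypothetical.

```python
from html.parser import HTMLParser


class FlatTextParser(HTMLParser):
    """Flatten a page into (text, in_anchor_link) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.open_links = []  # one bool per open <a>: True if href starts with "#"
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.open_links.append(href.startswith("#"))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.items.append((text, any(self.open_links)))


def pair_questions_with_answers(html, is_question=lambda t: t.endswith("?")):
    """Return (question, answer) tuples; the answer is the text up to the next question."""
    parser = FlatTextParser()
    parser.feed(html)
    parser.close()
    pairs = []
    current = None
    for text, in_anchor_link in parser.items:
        if in_anchor_link:
            # Overview entry or "back to top" link: not part of any Q&A entry.
            continue
        if is_question(text):
            current = {"question": text, "answer": []}
            pairs.append(current)
        elif current is not None:
            current["answer"].append(text)
    return [(p["question"], " ".join(p["answer"])) for p in pairs]


FAQ_HTML = """
<ul>
  <li><a href="#q1">Is there a cure?</a></li>
  <li><a href="#q2">Should I work from home?</a></li>
</ul>
<h3 id="q1">Is there a cure?</h3>
<p>Not yet.</p><p>Research is ongoing.</p>
<a href="#top">Back to top</a>
<h3 id="q2">Should I work from home?</h3>
<p>If your employer allows it.</p>
"""

for question, answer in pair_questions_with_answers(FAQ_HTML):
    print(question, "->", answer)
```

This only handles the overview-with-anchor-links case. It does not address the harder cleaning problem mentioned above: footers, share widgets, or rhetorical questions inside an answer would still end up in, or split, the extracted text.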

@swapna-intel

I will start working on this. I just got all the pieces working with PyCharm (I have been an Emacs person until now). If someone has experience with web crawling and would like to partner, do reach out!

@DRMALEK

DRMALEK commented Apr 12, 2020

I thought about using a computer vision model such as YOLOv3 to segment the FAQ section of the page into questions and their answers, but I'm not sure whether it is worth it or whether it has drawbacks. Any suggestions are welcome.
