automate and scrap rotating domains of the disposable providers #450

Open
martenson opened this issue Feb 20, 2024 · 10 comments

Comments

@martenson
Member

Some disposable services publish their domain lists at fairly stable URLs, so we could implement a CI check that automatically merges the domains from those lists into our blocklist or opens PRs. Providers that do not offer such a resource could still be scraped in CI with a headless browser or similar.
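
A minimal sketch of the comparison step such a CI check would need, assuming the provider's list and our blocklist are both already loaded as lists of strings. The function names `find_new_domains` and `merge_into_blocklist` are hypothetical, and fetching the list or opening the PR is left out:

```python
def find_new_domains(scraped, blocklist):
    """Return scraped domains that are missing from the blocklist.

    Entries are trimmed and lowercased before comparison; blank entries
    are ignored. The result is sorted so that diffs stay stable.
    """
    known = {d.strip().lower() for d in blocklist if d.strip()}
    fresh = {d.strip().lower() for d in scraped if d.strip()}
    return sorted(fresh - known)


def merge_into_blocklist(scraped, blocklist):
    """Return the blocklist plus any new domains: lowercase, deduplicated, sorted."""
    known = {d.strip().lower() for d in blocklist if d.strip()}
    return sorted(known | set(find_new_domains(scraped, blocklist)))
```

If `find_new_domains` returns an empty list, the CI job could simply exit without opening a PR.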

@haumacher
Contributor

Is there already a list of disposable e-mail provider web pages that could serve as input for such automation?

@martenson
Member Author

martenson commented Feb 21, 2024

@haumacher I'd start with some of the bigger providers like

https://10minutemail.com/
https://m.kuku.lu/
https://temp-mail.org/

But each would be different and unique to scrape/parse, so I wouldn't spend much time on any specific one; just try to pick the low-hanging fruit first.

@maciejstromich

another one is https://www.disposablemail.com/

@haumacher
Contributor

haumacher commented Feb 23, 2024

@icyavocado

Should the script open a headless browser, use a regex to search the page for new email domains, add any new domain to the list, and then commit the diff as a new PR?

@haumacher
Contributor

@icyavocado I think this would be a huge effort and only possible for dumb fake-mail providers. "Professional" ones have strong protection against automated querying, e.g. CAPTCHAs or even headless browser detection.

Here are some more:

https://www.spamgourmet.com
https://www.fakemailgenerator.com
https://www.easytrashmail.com
https://www.txen.de
https://www.throwawaymail.com
https://hi2.in
https://temp-mail.io
http://ese.kr
https://ulm-dsl.de/

@icyavocado

icyavocado commented Feb 26, 2024

Agreed. From what I'm seeing, this may be a high-effort, low-reward scenario unless there is a way to get past bot checks. Unless we can show with certainty that requests from GitHub to these domains won't get blocked, finding an easy way to carry out this change doesn't seem likely.

I wonder if we should write each of these scrapers as a Cypress/Selenium test.

@martenson
Member Author

martenson commented Feb 26, 2024

I agree that some are protected well enough against headless browsers checking for domains. However, I believe there is still some low-hanging fruit to be found. E.g. https://www.fakemailgenerator.com/ directly gives you a dropdown with the domain selection. Writing a Selenium case that loads this page, checks the list against the blocklist, and opens a PR if there is a diff would be a nice first step.

edit: the example above is actually likely static, but e.g. https://www.fakemail.net/ does not seem to be protected against such an approach
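
For the static case, a browser may not even be needed. A rough sketch using only the Python standard library, assuming the page HTML has already been downloaded and the domains appear as the text of `<option>` elements. The real markup of any given provider is an assumption here, and `OptionCollector` and `dropdown_domains` are hypothetical names:

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the text content of every <option> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # Start a new entry; its text arrives via handle_data.
            self._in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1] += data


def dropdown_domains(html):
    """Return the option texts as lowercase domains without a leading '@'."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return sorted(
        {o.strip().lstrip("@").lower() for o in parser.options if o.strip()}
    )
```

The returned list could then be compared against the blocklist; pages that build the dropdown with JavaScript would still need a headless browser.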

@icyavocado

I can look into this and try to create a basic test that might work for most situations. The aim is to make a simple version to show that the idea can work with minimum effort.

@icyavocado

icyavocado commented Mar 4, 2024

Here is my proposal for the script, using Puppeteer. Here is the proposed change: icyavocado@384f7a8

TLDR: this script reads from the disposable_providers.txt file and attempts to identify disposable email addresses. It ensures that any domains already on our blocklist or allowlist are excluded from this process.
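
The filtering step can be illustrated in isolation. This is a simplified Python sketch under stated assumptions, not the Puppeteer script from the commit: the page text is passed in as a plain string, the pattern is the domain regex quoted elsewhere in this thread, and `candidate_domains` is a hypothetical name:

```python
import re

# Domain-shaped tokens: one or more dot-terminated labels followed by a
# final label of at least two characters.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9][a-z0-9-]*[a-z0-9]\b",
    re.IGNORECASE,
)


def candidate_domains(page_text, blocklist, allowlist):
    """Return domains found in page_text that are on neither list."""
    known = {d.lower() for d in blocklist} | {d.lower() for d in allowlist}
    found = {m.group(0).lower() for m in DOMAIN_RE.finditer(page_text)}
    return sorted(found - known)
```

Anything domain-shaped on the page is matched, so the allowlist is what keeps legitimate domains out of the result.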

This is just the first step of the task. As we continue our discussion about how to implement this correctly, I'll be working on the automation aspect.

P.S.: I was able to get the automation to work. Here is a run of the automation that uses the script above to find domains and then create a new branch: https://github.com/icyavocado/disposable-email-domains/actions/runs/8142541958/job/22252322009

Here are some potential challenges we might face, along with possible solutions. Your insights and suggestions are welcome:

  • The current regex also matches gmail.com, which is problematic. We could maintain a list of domains to ignore, but this approach requires ongoing upkeep. If you have alternative solutions, I'd love to hear them.
  • We could allow custom regex for each disposable provider. This could offer some advantages. For instance, we could pull input from disposable_providers.txt like so: https://10minemail.com \b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9][a-z0-9-]*[a-z0-9]\b.
  • Speeding up the process might be achievable by allowing class/id targeting. Again, we could use disposable_providers.txt for input, for example: https://10minemail.com #email.
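
To make the last two ideas concrete, here is one possible way to read such lines, sketched in Python. The line format (URL, whitespace, optional regex or selector) follows the examples above, but the rule for telling the two apart is an assumption: anything starting with `#` or `.` is treated as a CSS selector, so a regex that begins with `.` would be misread. `Provider` and `parse_provider_line` are hypothetical names:

```python
import re
from typing import NamedTuple, Optional


class Provider(NamedTuple):
    url: str
    selector: Optional[str]         # CSS selector such as "#email", if given
    pattern: Optional[re.Pattern]   # compiled per-provider regex, if given


def parse_provider_line(line):
    """Parse one line of disposable_providers.txt; return None for blank lines."""
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    url = parts[0]
    extra = parts[1].strip() if len(parts) > 1 else ""
    if not extra:
        # No override: the caller falls back to the default regex.
        return Provider(url, None, None)
    if extra[0] in "#.":
        # Assumed convention: id/class selectors start with '#' or '.'.
        return Provider(url, extra, None)
    return Provider(url, None, re.compile(extra, re.IGNORECASE))
```

For example, `parse_provider_line("https://10minemail.com #email")` yields a provider with `selector="#email"` and no pattern.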
