automate and scrape rotating domains of the disposable providers #450

martenson · 2024-02-20T21:53:48Z

Some disposable services publish their domain lists at fairly stable URLs, so we could implement a CI check that automatically merges the domains in those lists into our blocklist or opens PRs. Services that do not have such a resource can still be scraped in CI with a headless browser or similar.
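
To make the CI idea concrete, here is a minimal sketch of the diff-and-merge step. It assumes the provider's list and our lists are plain text with one domain per line; the function names are hypothetical, and fetching the list and opening the PR are left out.

```python
from typing import Iterable


def parse_domains(lines: Iterable[str]) -> set[str]:
    """Normalize domains: trim, lowercase, skip blank lines and # comments."""
    domains = set()
    for line in lines:
        domain = line.strip().lower()
        if domain and not domain.startswith("#"):
            domains.add(domain)
    return domains


def find_new_domains(provider_text: str, blocklist_text: str,
                     allowlist_text: str = "") -> list[str]:
    """Return provider domains that are on neither the blocklist nor the allowlist."""
    known = (parse_domains(blocklist_text.splitlines())
             | parse_domains(allowlist_text.splitlines()))
    return sorted(parse_domains(provider_text.splitlines()) - known)


def merge_into_blocklist(blocklist_text: str, additions: Iterable[str]) -> str:
    """Return the blocklist with the additions merged in, sorted and de-duplicated."""
    merged = parse_domains(blocklist_text.splitlines()) | parse_domains(additions)
    return "\n".join(sorted(merged)) + "\n"
```

A CI job could call `find_new_domains` on the downloaded list and only open a PR when the result is non-empty.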

haumacher · 2024-02-20T22:32:15Z

Is there already a list of disposable e-mail provider web pages that could serve as input for such automation?

martenson · 2024-02-21T16:43:56Z

@haumacher I'd start with some of the bigger providers like

https://10minutemail.com/
https://m.kuku.lu/
https://temp-mail.org/

But each would be different and unique to scrape/parse, so I wouldn't spend much time on any specific one; just try to find the low-hanging fruit first.

maciejstromich · 2024-02-22T13:54:54Z

Another one is https://www.disposablemail.com/

haumacher · 2024-02-23T19:56:44Z

I did some research: I analyzed the commit log of disposable_email_blocklist.conf and extracted all URLs of fake mail service providers mentioned in the commit messages. This gave the following fake mail services:

https://10minemail.com
https://10minutemail.com
https://10minutemail.net
https://10minutemail.org
https://10minutesmail.net
https://1secmail.com
https://2chmail.net
https://developermail.com
https://disposemymail.com
https://docs.webhook.site
https://dropmail.me
https://email-fake.com
https://emailfake.com
https://emailtemporal.org
https://etempmail.com
https://fakermail.com
https://linshiyou.com
https://linshiyouxiang.net
https://m.kuku.lu
https://mail.gw
https://mail.td
https://mail.cx
https://mail.tm
https://mohmal.com
https://muellmail.com
https://nospam.today
https://receivemail.org
https://ruu.kr
https://snapmail.cc
https://sute.jp
https://temp-mail.org
https://temp-mail.us
https://temp-mail.winsub.kr
https://tempmail.click
https://tempmail.lol
https://tempmail.plus
https://tempmailer.net
https://tempmailo.com
https://temporary-mail.net
https://tempr.email
https://trashmail.com
https://trashmailgenerator.de
https://www.disposablemail.com
https://www.emailnator.com
https://www.emailondeck.com
https://www.fakemail.net
https://www.linshi-email.com
https://www.minuteinbox.com
https://www.onetime-mail.com
https://www.trash-mail.com
https://youxiang.dev
https://ワンタイムメール.総合サービス.com
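
For anyone who wants to repeat the commit-log extraction, here is a rough sketch of one way to do it. It is not necessarily how the list above was produced: the URL regex and the function names are my own assumptions, and the result still needs manual filtering, since commit messages also mention URLs that are not mail services.

```python
import re
import subprocess
from urllib.parse import urlsplit

# Assumed pattern: an http(s) URL up to the next whitespace or closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s,;)>\]\"']+")


def read_commit_log(path: str = "disposable_email_blocklist.conf") -> str:
    """Return the full commit messages of all commits touching `path`."""
    result = subprocess.run(
        ["git", "log", "--format=%B", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_service_urls(log_text: str) -> list[str]:
    """Collect the unique scheme://host origins mentioned in commit messages."""
    origins = set()
    for match in URL_PATTERN.finditer(log_text):
        try:
            parts = urlsplit(match.group(0).rstrip("."))
        except ValueError:
            continue  # skip malformed URLs
        if parts.hostname:
            origins.add(f"{parts.scheme}://{parts.hostname}")
    return sorted(origins)
```

Running `extract_service_urls(read_commit_log())` inside a clone of the repository would print the candidate origins.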

icyavocado · 2024-02-25T18:14:14Z

Should the script open a headless web browser, use a regex to search the page for new email domains, add any new domains, and then commit the diff as a new PR?

haumacher · 2024-02-25T21:30:38Z

@icyavocado I think this would be a huge effort and only possible for dumb fake-mail providers. "Professional" ones have strong protection against automated querying, e.g. CAPTCHAs or even headless browser detection.

Here are some more:

https://www.spamgourmet.com
https://www.fakemailgenerator.com
https://www.easytrashmail.com
https://www.txen.de
https://www.throwawaymail.com
https://hi2.in
https://temp-mail.io
http://ese.kr
https://ulm-dsl.de/

icyavocado · 2024-02-26T01:57:06Z

Agreed. From what I'm seeing, this may be a high-effort, low-reward scenario unless there is a way to get past bot checks. Unless we can show with certainty that requests from GitHub to these domains won't get blocked, finding an easy way to carry out this change doesn't seem likely.

I wonder if we should write each of these scrapers like a Cypress/Selenium test.

martenson · 2024-02-26T19:46:28Z

I agree that some are protected well enough against a headless browser checking for domains. However, I believe there is still some low-hanging fruit to be found. E.g. https://www.fakemailgenerator.com/ directly gives you a dropdown with the domain selection. Writing a Selenium case that loads this page, checks the list against the blocklist, and opens a PR if there is a diff would be a nice step.

edit: the example above is actually likely static, but e.g. https://www.fakemail.net/ does not seem to be protected against such an approach
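
A minimal sketch of the dropdown check, using only the standard library. I have not verified the real markup of any of these sites, so this assumes the domains appear as the text of `<option>` elements (optionally prefixed with `@`). The HTML string would come from a plain HTTP fetch for static pages, or from a headless browser (for example Selenium's `driver.page_source`) for dynamic ones. All names here are hypothetical.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the visible text of every <option> element on a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_option = False
        self.options: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        # Closing the <select> also ends an unclosed <option>.
        if tag in ("option", "select"):
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1] += data


def dropdown_domains(html: str) -> set[str]:
    """Return the option texts that look like domains, without a leading '@'."""
    collector = OptionCollector()
    collector.feed(html)
    collector.close()
    return {text.strip().lstrip("@").lower()
            for text in collector.options if "." in text}


def missing_from_blocklist(html: str, blocklist: set[str]) -> list[str]:
    """Return dropdown domains that are not yet on the blocklist."""
    return sorted(dropdown_domains(html) - blocklist)
```

If `missing_from_blocklist` returns anything, that is the diff the PR would contain.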

icyavocado · 2024-02-27T17:11:33Z

I can look into this and try to create a basic test that might work for most situations. The aim is to make a simple version to show that the idea can work with minimum effort.

icyavocado · 2024-03-04T07:43:08Z

Here is my proposal for the script, using Puppeteer. The proposed change is here: icyavocado@384f7a8

TLDR: this script reads from the disposable_providers.txt file and attempts to identify disposable email addresses. It ensures that any domains already on our blocklist or allowlist are excluded from this process.

This is just the first step of the task. As we continue our discussion about how to implement this correctly, I'll be working on the automation aspect.

P.S.: I was able to get the automation to work. Here is a run of the automation that uses the script above to find new domains and then create a new branch: https://github.com/icyavocado/disposable-email-domains/actions/runs/8142541958/job/22252322009

Here are some potential challenges we might face, along with possible solutions. Your insights and suggestions are welcome:

The current regex also matches gmail.com, which is problematic. We could maintain a list of domains to ignore, but this approach requires ongoing upkeep. If you have alternative solutions, I'd love to hear them.
We could allow custom regex for each disposable provider. This could offer some advantages. For instance, we could pull input from disposable_providers.txt like so: https://10minemail.com \b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9][a-z0-9-]*[a-z0-9]\b.
Speeding up the process might be achievable by allowing class/id targeting. Again, we could use disposable_providers.txt for input, for example: https://10minemail.com #email.
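
To illustrate the first two points, here is a rough Python sketch (not the Puppeteer script from the commit above). It assumes each line of disposable_providers.txt is a URL optionally followed by a space and a custom regex, falls back to the generic regex quoted above, and drops a small hand-maintained ignore list. The `Provider` type, the function names, and the ignore list contents are hypothetical; the class/id targeting variant is not covered.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Generic domain regex quoted in the proposal above.
DEFAULT_PATTERN = (r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
                   r"[a-z0-9][a-z0-9-]*[a-z0-9]\b")

# Assumed ignore list: legitimate domains the generic regex also matches.
IGNORED_DOMAINS = frozenset({"gmail.com"})


@dataclass(frozen=True)
class Provider:
    url: str
    pattern: re.Pattern


def parse_provider_line(line: str) -> Optional[Provider]:
    """Parse 'URL [regex]'; return None for blank lines and # comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    url, _, custom = line.partition(" ")
    pattern = re.compile(custom.strip() or DEFAULT_PATTERN, re.IGNORECASE)
    return Provider(url, pattern)


def candidate_domains(provider: Provider, page_text: str,
                      known: set[str]) -> list[str]:
    """Return matched domains that are neither known nor on the ignore list."""
    found = {m.group(0).lower() for m in provider.pattern.finditer(page_text)}
    return sorted(found - known - IGNORED_DOMAINS)
```

Here `known` would be the union of the blocklist and the allowlist, and `page_text` the text scraped from `provider.url`.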

martenson mentioned this issue Feb 20, 2024

fight the disposable automation with an anti-automation #273

Closed

haumacher mentioned this issue Feb 25, 2024

added emails from throwawaymail.com #437

Open
