Support for non-US phone numbers #13
Comments
Yup, that's another feature that's on my agenda. Thanks for opening this issue -- it's something I need to address.
The port of CommonRegex to Java (CommonRegexJava) now supports multiple languages (although only English is implemented currently); maybe it could give you some ideas.
So I'm intending to add support for international phone numbers formatted according to the E.123 specs: http://en.wikipedia.org/wiki/E.123. It appears there's significant variation country to country, though.
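For illustration, here is a minimal sketch of what matching E.123 international notation could look like. It assumes a simplified reading of the recommendation (a leading '+', a country code of one to three digits, then digit groups separated by single spaces). The name `E123_INTL` is hypothetical and is not part of CommonRegex.

```python
import re

# Assumed simplification of E.123 international notation:
# '+', a 1-3 digit country code, then one or more space-separated digit groups.
# This is an illustrative pattern, not the regex used by CommonRegex.
E123_INTL = re.compile(r"(?<![\d+])\+\d{1,3}(?: \d+)+(?!\d)")

if __name__ == "__main__":
    sample = "Ring +61 2 9876 5432 or +1 555 123 4567."
    print(E123_INTL.findall(sample))
```

The sketch does not cap the total number of digits and ignores national notation entirely, so it covers only part of the country-to-country variation mentioned above.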
That last commit added nominal support for international phone numbers -- it's certainly not complete, but it's a move in the right direction. Any chance you could give me a few test cases that you would like to see supported, @jackhooper? I'll take a look at the Java port, talyssonoc. Thanks!
@madisonmay Being Australian, I'd obviously like to see Australian phone numbers supported. Our phone numbers begin with a two-digit area code (optional, and rarely used, if the caller and receiver are in the same area code), followed by an eight-digit phone number (from memory I'm pretty sure the first four digits represent the exchange, and the latter four represent the individual line). So it goes (XX) XXXX XXXX. The formatting varies - the most common formats would be XXXX XXXX, XXXXXXXX, and maybe XXXX-XXXX, though I've also seen/heard XXX XXX XX.

Australian mobile/cellular phone numbers are ten digits long, they start with 04, and are usually formatted 04XX XXX XXX, but I have seen other formattings. There are other types of phone numbers, such as 1800 (1800 XXX XXX), 1300 (1300 XXX XXX), 13 (13 XX XX), 1902 (1902 XXX XXX), 1900 (1900 XXX XXX), as well as the rarely used 1802 (1802 XXX - also sometimes formatted 180 2XXX).

As of right now, CommonRegex can detect the US international dialing code (1), so long as it isn't prefixed with a '+', which international calling codes often are. It doesn't seem to be working for other calling codes. It also doesn't yet recognize international access numbers, which are prefixed onto international numbers, as they are necessary for a telephone exchange to know that you're calling an overseas number. In Australia this code is 0011, in the US it's 011. Last but certainly not least, it does not yet recognize three-digit numbers, such as those for emergency services (911 in the US, 999 in the UK, 000 in Australia, etc.).

As far as I can tell, phone number formatting in the US is pretty standardized - all the numbers I've seen have a given number of digits per block, separated by hyphens. This isn't really the case here. In Australia the use of hyphens is much less common, and the use of spaces is much more popular. Additionally, here (as in many other countries), the number of digits per block is not quite as standardized. There is definitely a most common way of doing it, but some people don't follow it.

Given all of this, adding full international support is likely to be quite a task. Heck, even adding support for one other country, such as Australia, would likely be far from trivial. I certainly don't envy you in having to do so. At present, my Regex skills are rudimentary at best, so I can't help you out much with that side of things. I would be more than happy to test things out, though. Happy coding,
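To make the formats described in this comment concrete, here is a rough sketch of a pattern covering a few of them: landlines with an optional area code, 04 mobiles, 1300/1800 numbers, and six-digit 13 numbers. The name `AU_PHONE` is hypothetical, the pattern is not taken from CommonRegex, and it assumes that area codes start with 0 (e.g. 02), which the comment does not state.

```python
import re

# Hypothetical pattern, not part of CommonRegex. It covers only some of the
# Australian formats described above and assumes area codes begin with 0.
AU_PHONE = re.compile(
    r"""
    (?<!\d)                          # do not start inside a longer digit run
    (?:
        04\d{2}[ ]?\d{3}[ ]?\d{3}    # mobile: 04XX XXX XXX
      | 1[38]00[ ]?\d{3}[ ]?\d{3}    # 1300 XXX XXX and 1800 XXX XXX
      | 13[ ]?\d{2}[ ]?\d{2}         # six-digit 13 XX XX numbers
      | (?:\(0\d\)[ ]?|0\d[ ]?)?     # optional two-digit area code
        \d{4}[ -]?\d{4}              # landline: XXXX XXXX, XXXXXXXX, XXXX-XXXX
    )
    (?!\d)                           # do not stop inside a longer digit run
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    sample = "Call (02) 9876 5432 or 0412 345 678, or 1800 123 456, or 13 12 34."
    print(AU_PHONE.findall(sample))
```

Even this small subset accepts any bare eight-digit run, which illustrates the false-positive concern: the looser the grouping rules, the more ordinary numbers look like phone numbers.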
Thanks so much for the explanation -- I've got a bit of API design thinking to do before I make any serious changes. For the time being, I've added support for international calling codes -- although I haven't pushed that change to pip as of yet. I'm afraid that adding support for all formats by default would be detrimental because of the increase in complexity and the increase in false positives. Not adding support for international formats is likewise a poor choice. However, I think I might have found a good middle ground. I'm toying with the idea of adding an initialization argument to the CommonRegex class that controls how strict the regular expressions used are. I could maintain two separate sets of regexes -- one designed for low false positive rates, and another designed for low false negative rates. The set of regexes with low false negative rates could use a much less strict phone number regex to ensure that all phone numbers are captured. What are your thoughts, JH? Thanks again, Madison
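A minimal sketch of how such a strictness switch might look is below. Everything in it is an assumption made for illustration: the class name `PhoneFinder`, the `strict` argument, and both patterns are hypothetical and do not reflect the actual CommonRegex API.

```python
import re

# Hypothetical pattern sets; neither is taken from CommonRegex itself.
STRICT = {
    # Low false positives: US-style XXX-XXX-XXXX only.
    "phones": re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)"),
}
LENIENT = {
    # Low false negatives: optional '+', then 7 to 15 digits, with a single
    # space or hyphen allowed between any two digits.
    "phones": re.compile(r"(?<![\d+])\+?\d(?:[ -]?\d){6,14}(?!\d)"),
}


class PhoneFinder:
    """Toy stand-in for a CommonRegex-like class with a strictness switch."""

    def __init__(self, text="", strict=True):
        self.text = text
        # Choose one whole set of regexes at construction time.
        self.patterns = STRICT if strict else LENIENT

    def phones(self, text=None):
        target = self.text if text is None else text
        return self.patterns["phones"].findall(target)


if __name__ == "__main__":
    sample = "US: 555-123-4567, AU: +61 2 9876 5432"
    print(PhoneFinder(sample).phones())
    print(PhoneFinder(sample, strict=False).phones())
```

Selecting a whole pattern set in the constructor keeps the two regex families separate, so the strict defaults stay unchanged for existing users while the lenient set can grow to cover more formats.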
@madisonmay That sounds like a reasonable way of doing it. You're more than welcome for the explanation, too - any excuse for me to waffle on about something ;-) Kind regards,
It occurred to me that there are a couple of other types of phone numbers which don't currently work. Both are of the (partially) non-numeric variety.
Supporting these types of numbers would almost certainly increase the chance of false positives, so your proposed approach of having two sets of regular expressions - one for low false positives, the other for low false negatives - is looking very good right now. JH
Hi @jackhooper. I've been trying to think how I should expose the two different sets of regular expressions through the commonregex API. Currently, Let me know your thoughts.
@madisonmay I've thought about these questions, and I'm afraid I don't really have any definitive answers for you. Certainly, on the matter of the first question, I honestly have no idea. I'm probably closer to having an answer on the second one, though: perhaps there could be an optional parameter (which would default to
Cool as this already is, it would be even cooler if it supported non-US phone numbers. I'd try and do it myself, but given how little I currently know about regular expressions I'd probably be more of a hindrance than a help.