Support for non-US phone numbers #13

Open
jackhooper opened this issue Jan 19, 2014 · 10 comments

Comments

@jackhooper

Cool as this already is, it would be even cooler if it supported non-US phone numbers. I'd try and do it myself, but given how little I currently know about regular expressions I'd probably be more of a hindrance than a help.

@madisonmay
Owner

Yup that's another feature that's on my agenda. Thanks for opening this issue -- it's something I need to address.

@talyssonoc
Contributor

The port of CommonRegex to Java (CommonRegexJava) now supports multiple languages (although only English is implemented currently); maybe it could give you some ideas.

@madisonmay
Owner

So I'm intending on adding support for international phone numbers formatted according to the E.123 specs: http://en.wikipedia.org/wiki/E.123. It appears there's significant variation country to country, though.
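For reference, a minimal sketch of what a check for the E.123 international notation might look like. E.123 writes international numbers as a '+' followed by the country code and digit groups separated only by spaces; the 15-digit cap comes from E.164. The names E123_INTERNATIONAL and is_e123_international are hypothetical and not part of CommonRegex.

```python
import re

# Hypothetical pattern, not CommonRegex code: '+', a 1-3 digit country code,
# then one or more space-separated digit groups (E.123 international notation).
E123_INTERNATIONAL = re.compile(r"\+\d{1,3}(?: \d+)+")


def is_e123_international(text):
    """Return True if text looks like an E.123 international number."""
    if not E123_INTERNATIONAL.fullmatch(text):
        return False
    # E.164 limits a full international number to 15 digits.
    return len(re.sub(r"\D", "", text)) <= 15
```

This only covers the international notation; national notations are where the country-to-country variation shows up.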

@madisonmay
Owner

That last commit added nominal support for international phone numbers -- it's certainly not complete, but it's a move in the right direction. Any chance you could give me a few test cases that you would like to see supported, @jackhooper?

I'll take a look at the Java port, talyssonoc. Thanks!

@jackhooper
Author

@madisonmay Being Australian I'd obviously like to see Australian phone numbers supported. Our phone numbers begin with a two-digit area code (optional - and rarely used - if the caller and receiver are in the same area code), followed by an eight-digit phone number (from memory I'm pretty sure the first four represent the exchange, and the latter four represent the individual line). So it goes (XX) XXXX XXXX. The formatting varies - the most common formats would be XXXX XXXX, XXXXXXXX, and maybe XXXX-XXXX, though I've also seen/heard XXX XXX XX.

Australian mobile/cellular phone numbers are ten digits long, they start with 04, and are usually formatted 04XX XXX XXX, but I have seen other formattings.
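To make the landline and mobile formats above concrete, here is a rough sketch in Python. The names AU_LANDLINE, AU_MOBILE and classify_au_number are hypothetical, and restricting area codes to 02, 03, 07 and 08 is an assumption added here so that unspaced mobile numbers aren't mistaken for landlines. The rarer XXX XXX XX grouping is not covered.

```python
import re

# Hypothetical patterns, not part of CommonRegex.
# Landline: optional area code (assumed to be 02, 03, 07 or 08), with or
# without parentheses, then eight digits split 4-4 by an optional space or hyphen.
AU_LANDLINE = re.compile(r"(?:(?:\(0[2378]\)|0[2378])[ -]?)?\d{4}[ -]?\d{4}")

# Mobile: ten digits starting with 04, usually grouped 4-3-3.
AU_MOBILE = re.compile(r"04\d{2}[ -]?\d{3}[ -]?\d{3}")


def classify_au_number(text):
    """Return 'mobile', 'landline', or None for a single candidate string."""
    candidate = text.strip()
    if AU_MOBILE.fullmatch(candidate):
        return "mobile"
    if AU_LANDLINE.fullmatch(candidate):
        return "landline"
    return None
```

Checking the mobile pattern first matters for input like 0412345678, which would otherwise be ambiguous.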

There are other types of phone numbers, such as 1800 (1800 XXX XXX), 1300 (1300 XXX XXX), 13 (13 XX XX), 1902 (1902 XXX XXX), 1900 (1900 XXX XXX), as well as the rarely used 1802 (1802 XXX - also sometimes formatted 180 2XXX).
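A sketch of how those special prefixes might be matched, using only the groupings listed above. AU_SPECIAL and is_au_special are hypothetical names, not CommonRegex code.

```python
import re

# Hypothetical pattern covering the special Australian numbers listed above.
AU_SPECIAL = re.compile(
    r"""
      (?:1800|1300|1900|1902)[ ]?\d{3}[ ]?\d{3}   # e.g. 1800 XXX XXX
    | 13[ ]?\d{2}[ ]?\d{2}                        # e.g. 13 XX XX
    | 1802[ ]?\d{3}                               # e.g. 1802 XXX
    | 180[ ]2\d{3}                                # e.g. 180 2XXX
    """,
    re.VERBOSE,
)


def is_au_special(text):
    """Return True if the whole string is one of the special formats."""
    return AU_SPECIAL.fullmatch(text.strip()) is not None
```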

As of right now, CommonRegex can detect the US international dialing code (1), so long as it isn't prefixed with a '+', which international calling codes often are. It doesn't seem to be working for other calling codes. It also doesn't yet recognize international access codes, which are prefixed to international numbers so that the telephone exchange knows you're calling an overseas number. In Australia this code is 0011, in the US it's 011.
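As an illustration, a small sketch that recognizes either a leading '+' or one of the two access codes mentioned here (0011 and 011) and returns the remaining digits. INTL_PREFIX and split_international_prefix are hypothetical names, and the sketch deliberately does not try to split the country code from the rest, since that needs a lookup table.

```python
import re

# Hypothetical pattern: a '+' or an access code (0011 for Australia,
# 011 for the US), an optional separator, then the rest of the number.
INTL_PREFIX = re.compile(r"(?P<prefix>\+|0011|011)[ -]?(?P<rest>\d[\d -]*)")


def split_international_prefix(number):
    """Return (prefix, digits) for an international number, else None."""
    match = INTL_PREFIX.fullmatch(number.strip())
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group("rest"))
    return match.group("prefix"), digits
```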

Last but certainly not least, it does not yet recognize three-digit numbers, such as those for emergency services (911 in the US, 999 in the UK, 000 in Australia, etc.).
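A quick sketch of matching the three emergency numbers named above. EMERGENCY and find_emergency_numbers are hypothetical names. Note how easily such short numbers produce false positives.

```python
import re

# Hypothetical pattern: 911 (US), 999 (UK), 000 (Australia), not embedded
# in a longer run of digits.
EMERGENCY = re.compile(r"(?<!\d)(?:911|999|000)(?!\d)")


def find_emergency_numbers(text):
    """Return every standalone three-digit emergency number in text."""
    return EMERGENCY.findall(text)
```

For example, "10,000" contains a standalone "000" after the comma, so a pattern like this would flag it.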

Insofar as I can tell, phone number formatting in the US is pretty standardized - all the numbers I've seen have a given number of digits per block, separated by hyphens. This isn't really the case here. In Australia the use of hyphens is much less common, and the use of spaces is much more popular. Additionally, here (as in many other countries), the number of digits per block is not quite as standardized. There is definitely a most common way of doing it, but some people don't follow it.

Given all of this, adding full international support is likely to be quite a task. Heck, even adding support for one other country, such as Australia, would likely be less than trivial. I certainly don't envy you in having to do so. At present, my Regex skills are rudimentary at best, so I can't help you out much with that side of things. I would be more than happy to test things out, though.

Happy coding,
JH

@madisonmay
Owner

Thanks so much for the explanation -- I've got a bit of API design thinking to do before I make any serious changes. For the time being, I've added support for international calling codes -- although I haven't pushed that change to pip as of yet.

I'm afraid that adding support for all formats by default would be detrimental because of the increase in complexity and the increase in false positives. Not adding support for international formats is likewise a poor choice. However, I think I might have found a good middle ground. I'm toying with the idea of adding an initialization argument to the CommonRegex class that controls how strict the regular expressions used are. I could maintain two separate sets of regexes -- one designed for low false positive rates, and another designed for low false negative rates. The set of regexes with low false negative rates could use a much less strict phone number regex to ensure that all phone numbers are captured.
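Roughly, the idea could look like the sketch below. This is only an illustration: the class name PhoneFinder, the strict argument, and both patterns are hypothetical stand-ins, not the actual CommonRegex API or its regexes.

```python
import re

# Strict: US-style 3-3-4 blocks with separators (few false positives).
STRICT_PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}(?!\d)")

# Lenient: optional '+', then at least seven characters of digits and common
# separators, starting and ending on a digit (few false negatives).
LENIENT_PHONE = re.compile(r"(?<![\d+])\+?\d[\d ().-]{5,}\d(?!\d)")


class PhoneFinder:
    """Hypothetical wrapper that picks a regex set at initialization."""

    def __init__(self, strict=True):
        self.pattern = STRICT_PHONE if strict else LENIENT_PHONE

    def phones(self, text):
        return self.pattern.findall(text)


text = "Call 555-123-4567 or +61 2 9876 5432 before 2014-01-19."
strict_hits = PhoneFinder(strict=True).phones(text)
lenient_hits = PhoneFinder(strict=False).phones(text)
```

On that sample the strict set finds only the US number, while the lenient set also finds the Australian number and, as a false positive, the date.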

What are your thoughts, JH?

Thanks again,

Madison

@jackhooper
Author

@madisonmay That sounds like a reasonable way of doing it. You're more than welcome for the explanation, too - any excuse for me to waffle on about something ;-)

Kind regards,
JH

@jackhooper
Author

It occurred to me that there are a couple of other types of phone numbers which don't currently work. Both are of the (partially) non-numeric variety.

  1. Numbers with * and/or # in them. In Australia, there is a number *10# (pronounced "star-ten-hash"), for example.
  2. Numbers with words in them. A US example might be 1-800-PHONE-THX (the only reason I know that one is because it appears towards the end of the Star Wars credits).
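A rough sketch of both cases, with hypothetical names throughout (STAR_HASH, VANITY, KEYPAD, vanity_to_digits). The vanity pattern only handles the 1-800 prefix from the example, and the letter-to-digit table is the standard telephone keypad layout.

```python
import re

# Hypothetical pattern for codes such as *10#: a leading * or #, digits,
# and an optional trailing * or #.
STAR_HASH = re.compile(r"(?<![\w*#])[*#]\d+[*#]?(?![\w*#])")

# Hypothetical pattern for 1-800 vanity numbers containing at least one letter.
VANITY = re.compile(r"\b1-800-(?=[A-Z0-9-]*[A-Z])[A-Z0-9]+(?:-[A-Z0-9]+)*\b")

# Standard keypad mapping: ABC=2, DEF=3, GHI=4, JKL=5, MNO=6, PQRS=7, TUV=8, WXYZ=9.
KEYPAD = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "22233344455566677778889999",
)


def vanity_to_digits(number):
    """Translate the letters in a vanity number to keypad digits."""
    return number.upper().translate(KEYPAD)
```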

Supporting these types of numbers would almost certainly increase the chance of false positives, so your proposed approach of having two sets of regular expressions - one for low false positives, the other for low false negatives - is looking very good right now.

JH

@madisonmay
Owner

Hi @jackhooper.

I've been trying to think how I should expose the two different sets of regular expressions through the commonregex API. Currently, from commonregex import email gets you the compiled regular expression to manipulate to your heart's content. What do we call the second set of regular expressions that is publicly exposed? Also, should there be a flag that switches all regular expressions from low false positives to low false negatives, or should that be handled on a case by case basis (each regular expression has a different setting)?

Let me know your thoughts.

@jackhooper
Author

@madisonmay I've thought about these questions, and I'm afraid I don't really have any definitive answers for you.

Certainly, on the matter of the first question, I honestly have no idea.

I'm probably closer to having an answer on the second one, though. Perhaps there could be an optional parameter (defaulting to False) when you initialise the CommonRegex class that switches all regular expressions in that instance to favour low false negatives, plus an optional parameter in each method to override the instance default? So, if you have an instance of CommonRegex which is set to low false negatives, there would be an optional parameter in each method to return low false positives, and vice versa if the CommonRegex object is set to low false positives.
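Something like the following sketch, perhaps. The class CommonRegexSketch, the parameter name low_false_negatives, and both patterns are made up for illustration and are not the real CommonRegex code.

```python
import re

# Placeholder patterns, assumed for illustration only.
STRICT_PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}(?!\d)")
LENIENT_PHONE = re.compile(r"(?<![\d+])\+?\d[\d ().-]{5,}\d(?!\d)")


class CommonRegexSketch:
    """Hypothetical class: instance-wide default plus per-call override."""

    def __init__(self, low_false_negatives=False):
        self.low_false_negatives = low_false_negatives

    def phones(self, text, low_false_negatives=None):
        # None means "use whatever the instance was initialised with".
        if low_false_negatives is None:
            low_false_negatives = self.low_false_negatives
        pattern = LENIENT_PHONE if low_false_negatives else STRICT_PHONE
        return pattern.findall(text)


parser = CommonRegexSketch()  # defaults to low false positives
text = "Ring 555-123-4567 or 9876 5432."
default_hits = parser.phones(text)
override_hits = parser.phones(text, low_false_negatives=True)
```

Using None as the per-method default is what lets a call override the instance setting in either direction.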


3 participants