
Domain names treated as sentences #24

Open
quantoid opened this issue Jul 9, 2018 · 4 comments

Comments


quantoid commented Jul 9, 2018

If the text contains a domain name like www.google.com, then the parts of that name are extracted as words, e.g. the word "com".
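For illustration, a minimal sketch of the behaviour (assuming the package splits sentences into words with NLTK's wordpunct_tokenize, as a later comment in this thread suggests; the sample text is made up):

```python
from nltk.tokenize import wordpunct_tokenize

# A domain name gets split at the dots, so its parts surface as separate words.
print(wordpunct_tokenize("Visit www.google.com for details."))
# ['Visit', 'www', '.', 'google', '.', 'com', 'for', 'details', '.']
```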


ghost commented Aug 8, 2018

Hi, for this issue, and also for real-world text, which is often cluttered with punctuation marks, I tried various tokenizers and was satisfied with the way NLTK's TweetTokenizer works. I implemented it as follows:

```python
from nltk.tokenize import TweetTokenizer, sent_tokenize

tokenizer_words = TweetTokenizer()

def _generate_phrases(self, sentences):
    phrase_list = set()
    for sentence in sentences:
        word_list = [word.lower() for word in tokenizer_words.tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list
```
Not only does this keep www.google.com intact as a single token, it also preserves important marks such as #hashtag, @person, etc.
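A quick sanity check of that behaviour (the sample text is an assumption, and the exact output shown is approximate, but this is how TweetTokenizer typically handles URLs, hashtags and mentions):

```python
from nltk.tokenize import TweetTokenizer

tokenizer_words = TweetTokenizer()
print(tokenizer_words.tokenize("Visit www.google.com #news via @someone"))
# Roughly: ['Visit', 'www.google.com', '#news', 'via', '@someone']
```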


csurfer (Owner) commented Aug 9, 2018

@nsehwan: I am open to any extension to the package as long as the following are met:

  1. It is a problem for the vast majority.
  2. The solution to the problem can be made generic enough.

Even though it meets requirement (1), I think we should first generalize your simple solution so that it can be used by everyone before implementing it.


ghost commented Aug 14, 2018

Thanks @csurfer for the information; I'm working on your suggestions.


ghost commented Nov 15, 2018

Sorry for my absence!
After trying various tokenizers, I thought it better to build a sanitizer/tokenizer based on your suggestions. That really did turn out to be the better approach, i.e. more general.

get_sanitized_word_list is a function that takes an individual sentence (as produced by sent_tokenize) and returns a list of words, similar to what wordpunct_tokenize(sentence) returned previously, but better sanitized.

```python
import string

def get_sanitized_word_list(data):
    result = []
    word = ''

    for char in data:
        if char not in string.whitespace:
            # Characters that may appear within or at the start/end of a word.
            if char not in string.ascii_letters + string.digits + "'.~`^:<>/-_%&@*#$":
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word = ''.join([word, char])
        else:
            if word:
                result.append(word)
                word = ''
    if word != '':
        result.append(word)
    return result
```

It works on most of the general cases I have tried so far, and yes, it performs better than TweetTokenizer as well. Please let me know what you think about this.
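A small usage sketch against the sanitizer above (the sample sentence is made up; the output follows from the character rules in get_sanitized_word_list):

```python
sentence = "Visit www.google.com and tag @someone #soon"
print(get_sanitized_word_list(sentence))
# ['Visit', 'www.google.com', 'and', 'tag', '@someone', '#soon']
```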
