Commit

Split sentences on emojis
bendangelo committed Jan 26, 2024
1 parent bec8483 commit ae9f85b
Showing 3 changed files with 8 additions and 2 deletions.
2 changes: 1 addition & 1 deletion lib/keyphrase.rb
@@ -9,7 +9,7 @@ class Keyphrase
CLEAN_REGEX = /([^\p{L}a-zA-Z0-9\'\- \.]|(?<!\w)\.)/ # don't remove ' because it might be part of a stop word
BLACKLIST_REGEX = /(?:^|\s)[^a-zA-Z\p{L}0-9]+\b|\'|\-/ # remove words with no letters, ie 123.23.12. And last chance to remove ' and -
CLEAN_SPACES_REGEX = /^[0-9\s\.]+$|\s+/ # last phase. Remove extra whitespace and lone numbers
-SENTENCES_REGEX = /[+!?,;:&\[\]\{\}\<\>\=\/\n\t\\"\\(\\)\u2019\u2013\|]|-(?!\w)|'(?=s)|(?<!\s)\.(?![a-zA-Z0-9])|(?<!\w)\#(?=\w)/u
+SENTENCES_REGEX = /[+!?,;:&\[\]\{\}\<\>\=\/\n\t\\"\\(\\)\u2019\u2013\|]|-(?!\w)|'(?=s)|(?<!\s)\.(?![a-zA-Z0-9])|(?<!\w)\#(?=\w)|\p{Extended_Pictographic}+/u

def self.analyse text, options={}
@@keyphrase ||= Keyphrase.new
2 changes: 1 addition & 1 deletion lib/keyphrase/version.rb
@@ -1,5 +1,5 @@
# frozen_string_literal: true

class Keyphrase
-VERSION = "0.2.2"
+VERSION = "0.2.3"
end
6 changes: 6 additions & 0 deletions spec/keyphrase_spec.rb
@@ -109,6 +109,12 @@
expect(result.keys).to eq ["你好 我一方面什麼 這麼好"]
end

+it 'should split on emoji' do
+result = Keyphrase.analyse "Study Ghibli Music For 2 Hours 🍀 Heal Your Body With Ghibli, Relax, Work, Sleep Deeply"
+
+expect(result.keys).to eq ["Study Ghibli Music", "Sleep Deeply", "Ghibli", "2 Hours", "Heal", "Body", "Relax"]
+end
+
end

end
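
Note on the regex change: the only new alternative in SENTENCES_REGEX is `\p{Extended_Pictographic}+`, which treats a run of one or more emoji as a sentence boundary. The sketch below isolates that alternative to show its effect on the spec's input. `EMOJI_SPLIT` and `split_on_emoji` are hypothetical names used for illustration only; they are not part of the gem, and the real `Keyphrase.analyse` applies further cleaning and scoring steps that are not reproduced here.

```ruby
# Hypothetical stand-alone pattern: only the emoji alternative added in this
# commit, not the gem's full SENTENCES_REGEX.
EMOJI_SPLIT = /\p{Extended_Pictographic}+/u

# Hypothetical helper: split text wherever a run of emoji occurs, then trim
# surrounding whitespace and drop empty fragments.
def split_on_emoji(text)
  text.split(EMOJI_SPLIT).map(&:strip).reject(&:empty?)
end

parts = split_on_emoji("Study Ghibli Music For 2 Hours 🍀 Heal Your Body")
p parts
```

Because the emoji now ends a sentence, "2 Hours" and "Heal" can no longer be joined into one candidate phrase, which is consistent with the separate "2 Hours" and "Heal" keys the new spec expects.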
