Extracting Features

What should we use as features for our data set? What did we use as features for our fruit example before?

[Figure: table of fruit with the features height, width, color, mass, and round; one row in the table is a set of features with an unknown label]

Now that we are using sentences, how can we best represent each sentence as a series of values?

One idea is to count how often particular parts of speech occur in each sentence.

  • Nouns: Most basically described as a person, place, or thing. Counting nouns can help determine how many topics are being discussed in a sentence.
  • Adjectives: Descriptors of nouns (e.g., "yellow", "angry", "charming"). Counting adjectives shows how often descriptive words are attached to nouns, which can reflect writing style.

We will now tag the part of speech of every word in each sentence (row) of our DataFrame.

from nltk import pos_tag_sents

# tag the parts of speech in each sentence (row)
pos_all = pos_tag_sents(df['sentence'])
# show the tagged output for the first five sentences
print(pos_all[:5])
[[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NNP'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'IN'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'JJ'), ('presentments', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('Executive', 'NNP'), ('Committee', 'NNP'), (',', ','), ('which', 'WDT'), ('had', 'VBD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'DT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'DT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('of', 'IN'), ('Atlanta', 'NNP'), ("''", "''"), ('for', 'IN'), ('the', 'DT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'DT'), ('election', 'NN'), ('was', 'VBD'), ('conducted', 'VBN'), ('.', '.')], [('The', 'DT'), ('September-October', 'NNP'), ('term', 'NN'), ('jury', 'NN'), ('had', 'VBD'), ('been', 'VBN'), ('charged', 'VBN'), ('by', 'IN'), ('Fulton', 'NNP'), ('Superior', 'NNP'), ('Court', 'NNP'), ('Judge', 'NNP'), ('Durwood', 'NNP'), ('Pye', 'NNP'), ('to', 'TO'), ('investigate', 'VB'), ('reports', 'NNS'), ('of', 'IN'), ('possible', 'JJ'), ('``', '``'), ('irregularities', 'NNS'), ("''", "''"), ('in', 'IN'), ('the', 'DT'), ('hard-fought', 'JJ'), ('primary', 'NN'), ('which', 'WDT'), ('was', 'VBD'), ('won', 'VBN'), ('by', 'IN'), ('Mayor-nominate', 'NNP'), ('Ivan', 'NNP'), ('Allen', 'NNP'), ('Jr.', 'NNP'), ('.', '.')], [('``', '``'), ('Only', 'RB'), ('a', 'DT'), ('relative', 'JJ'), ('handful', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('reports', 'NNS'), ('was', 'VBD'), ('received', 'VBN'), ("''", "''"), (',', ','), 
('the', 'DT'), ('jury', 'NN'), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', 'VBG'), ('the', 'DT'), ('widespread', 'JJ'), ('interest', 'NN'), ('in', 'IN'), ('the', 'DT'), ('election', 'NN'), (',', ','), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('voters', 'NNS'), ('and', 'CC'), ('the', 'DT'), ('size', 'NN'), ('of', 'IN'), ('this', 'DT'), ('city', 'NN'), ("''", "''"), ('.', '.')], [('The', 'DT'), ('jury', 'NN'), ('said', 'VBD'), ('it', 'PRP'), ('did', 'VBD'), ('find', 'VB'), ('that', 'IN'), ('many', 'JJ'), ('of', 'IN'), ("Georgia's", 'NNP'), ('registration', 'NN'), ('and', 'CC'), ('election', 'NN'), ('laws', 'NNS'), ('``', '``'), ('are', 'VBP'), ('outmoded', 'VBN'), ('or', 'CC'), ('inadequate', 'JJ'), ('and', 'CC'), ('often', 'RB'), ('ambiguous', 'JJ'), ("''", "''"), ('.', '.')]]

What's with those part of speech labels? They aren't helpful at all!

The Penn Treebank tagset, which NLTK uses for its part-of-speech tagger, is not particularly intuitive. Fortunately, NLTK provides a help function that shows what each tag stands for.

# troubleshooting: https://github.com/nltk/nltk/issues/919
nltk.help.upenn_tagset("NN")
nltk.help.upenn_tagset("JJ")
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
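
To get a feel for how these tags are distributed, here is a minimal sketch that tallies the tags in one sentence. It uses the fifth tagged sentence from the output above, copied by hand, so it runs without NLTK. The variable names are illustrative and not part of the lesson's code.

```python
from collections import Counter

# Fifth tagged sentence from the pos_all[:5] output above, copied by hand.
tagged_sentence = [
    ('The', 'DT'), ('jury', 'NN'), ('said', 'VBD'), ('it', 'PRP'),
    ('did', 'VBD'), ('find', 'VB'), ('that', 'IN'), ('many', 'JJ'),
    ('of', 'IN'), ("Georgia's", 'NNP'), ('registration', 'NN'),
    ('and', 'CC'), ('election', 'NN'), ('laws', 'NNS'), ('``', '``'),
    ('are', 'VBP'), ('outmoded', 'VBN'), ('or', 'CC'), ('inadequate', 'JJ'),
    ('and', 'CC'), ('often', 'RB'), ('ambiguous', 'JJ'), ("''", "''"),
    ('.', '.'),
]

# Tally the full tags, then only the first two characters of each tag.
full_tags = Counter(tag for _, tag in tagged_sentence)
tag_families = Counter(tag[:2] for _, tag in tagged_sentence)

print(full_tags['NN'], full_tags['NNS'], full_tags['NNP'])  # 3 1 1
print(tag_families['NN'], tag_families['JJ'])               # 5 3
```

Several noun tags (NN, NNS, NNP) share the prefix "NN", so grouping tags by their first two characters gives one count per broad part of speech.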

Write a function that calculates our features for us

(In this case, numbers of nouns and adjectives that appear in the sentence)

Now we know the tags for the parts of speech we want to count in each sentence. Let's write a function that counts them for us, given a list of part-of-speech-tagged sentences (such as pos_all, which we computed above) and the part of speech we want to count (for example, "NN" to count the nouns in each sentence). The function compares only the first two characters of each tag, so "NN" also counts plural nouns (NNS) and proper nouns (NNP, NNPS), and "JJ" also counts comparative and superlative adjectives (JJR, JJS).

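def countPOS(pos_tag_sent, POS):
    """Count, for each tagged sentence, the tags whose first two characters equal POS."""
    all_pos_counts = []
    for sentence in pos_tag_sent:
        # each item is a (word, tag) pair; compare only the first two
        # characters so that "NN" also matches NNS, NNP, and NNPS
        pos_count = sum(1 for _, tag in sentence if tag[:2] == POS)
        all_pos_counts.append(pos_count)
    return all_pos_counts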
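
As an alternative sketch (not the lesson's function), the same counts can be collected for several parts of speech in a single pass over the data. The function name `count_tag_prefixes` and the toy sentences are hypothetical, and the tags in the toy data are assigned by hand rather than produced by the tagger.

```python
from collections import Counter


def count_tag_prefixes(tagged_sents, prefixes):
    """Return {prefix: [count per sentence]} for two-character tag prefixes."""
    counts = {prefix: [] for prefix in prefixes}
    for sentence in tagged_sents:
        # Tally every two-character tag prefix in this sentence once.
        tally = Counter(tag[:2] for _, tag in sentence)
        for prefix in prefixes:
            counts[prefix].append(tally[prefix])
    return counts


# Toy input: hand-tagged sentences, for illustration only.
toy = [
    [("The", "DT"), ("jury", "NN"), ("said", "VBD")],
    [("Nobody", "NN"), ("else", "RB"), ("showed", "VBD"),
     ("pleasure", "NN"), (".", ".")],
]

features = count_tag_prefixes(toy, ["NN", "JJ"])
print(features)  # {'NN': [1, 2], 'JJ': [0, 0]}
```

Each list has one entry per sentence, so it can be assigned to a DataFrame column in the same way as the output of countPOS.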

We will now call this function twice, once for each part of speech we are counting. We put each result straight into the DataFrame, saving us the trouble of having to do so later.

df['NN'] = countPOS(pos_all, 'NN')
df['JJ'] = countPOS(pos_all, "JJ")
df.head()
   label  sentence                                           NN  JJ
0  news   [The, Fulton, County, Grand, Jury, said, Frida...  11   2
1  news   [The, jury, further, said, in, term-end, prese...  13   2
2  news   [The, September-October, term, jury, had, been...  16   2
3  news   [``, Only, a, relative, handful, of, such, rep...   9   3
4  news   [The, jury, said, it, did, find, that, many, o...   5   3
df.tail()
      label    sentence                                           NN  JJ
4426  romance  [Nobody, else, showed, pleasure, .]                 2   0
4427  romance  [Spike-haired, ,, burly, ,, red-faced, ,, deck...   9   3
4428  romance  [``, Hello, ,, boss, '', ,, he, said, ,, and, ...   2   0
4429  romance  [``, I, suppose, I, can, never, expect, to, ca...   3   0
4430  romance  [``, I'm, afraid, not, '', .]                       1   0

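How many features do we have?

Summing the NN and JJ columns for each label gives the total number of nouns and adjectives counted in the news and romance sentences.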

# sum only the numeric feature columns, not the sentence column
df.groupby('label')[['NN', 'JJ']].sum()
            NN    JJ
label
news     31593  6678
romance  13821  4022
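
To make the grouping step concrete, here is a small sketch in plain Python that computes the same kind of per-label totals. It uses only the ten rows shown by df.head() and df.tail() above, so its totals are much smaller than the full-dataset numbers. The variable names are illustrative.

```python
from collections import defaultdict

# (label, NN, JJ) for the ten rows shown by df.head() and df.tail() above.
rows = [
    ("news", 11, 2), ("news", 13, 2), ("news", 16, 2),
    ("news", 9, 3), ("news", 5, 3),
    ("romance", 2, 0), ("romance", 9, 3), ("romance", 2, 0),
    ("romance", 3, 0), ("romance", 1, 0),
]

# Accumulate one running total per label and feature.
totals = defaultdict(lambda: {"NN": 0, "JJ": 0})
for label, nn, jj in rows:
    totals[label]["NN"] += nn
    totals[label]["JJ"] += jj

print(dict(totals))
# {'news': {'NN': 54, 'JJ': 12}, 'romance': {'NN': 17, 'JJ': 3}}
```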

Practice 3: Save the DataFrame to your computer as a CSV (comma-separated values) file

Hint: .to_csv()

df.to_csv("df_news_romance.csv", index=False)
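
One thing to keep in mind when loading the file again: the sentence column holds Python lists, and a CSV file stores each list as text. The sketch below shows this round trip on a one-row stand-in DataFrame with made-up values, written to an in-memory buffer rather than to disk. Using `ast.literal_eval` to rebuild the list is one possible approach, not something the lesson prescribes.

```python
import ast
import io

import pandas as pd

# A one-row stand-in for df; the values are made up for illustration.
df_demo = pd.DataFrame(
    {"label": ["news"], "sentence": [["The", "jury", "said"]], "NN": [1], "JJ": [0]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# After reading back, the sentence cell is a string, not a list.
restored = pd.read_csv(buffer)
raw = restored.loc[0, "sentence"]
print(type(raw).__name__)  # str

# Parse the string back into a list of tokens.
tokens = ast.literal_eval(raw)
print(tokens)  # ['The', 'jury', 'said']
```

The numeric NN and JJ columns come back as numbers, so only the sentence column needs this extra step.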
