Extracting Features

What should we use as features for our data set? What did we use as features for our fruit example before?

Now that we are using sentences, how can we best represent each sentence as a series of values?

One idea is to count how many particular parts of speech the sentence contains.

Nouns: Most simply, a person, place, or thing. Counting nouns can indicate how many topics a sentence discusses.
Adjectives: Descriptors of nouns (e.g. "yellow", "angry", "charming"). Counting adjectives shows how often descriptive words are attached to nouns, which can reflect writing style.
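
To make the idea concrete, here is a minimal sketch that turns one tagged sentence into two numbers. The sentence and its tags are written by hand for illustration (they are not tagger output), and the variable names are hypothetical.

```python
from collections import Counter

# Hand-tagged example sentence (an assumption for illustration only).
tagged_sentence = [
    ("The", "DT"), ("angry", "JJ"), ("dog", "NN"), ("chased", "VBD"),
    ("a", "DT"), ("yellow", "JJ"), ("ball", "NN"), (".", "."),
]

# Tally how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged_sentence)

# The noun and adjective counts become this sentence's feature values.
features = [tag_counts["NN"], tag_counts["JJ"]]
print(features)  # [2, 2]
```

Each sentence is reduced to a short list of numbers, which is the form a classifier expects.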

We will now tag every word in each sentence (row) of our DataFrame with its part of speech.

import nltk
from nltk import pos_tag_sents

# tag the parts of speech in each sentence (row)
pos_all = pos_tag_sents(df['sentence'])

print(pos_all[:5])

[[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NNP'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'IN'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'JJ'), ('presentments', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('Executive', 'NNP'), ('Committee', 'NNP'), (',', ','), ('which', 'WDT'), ('had', 'VBD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'DT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'DT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('of', 'IN'), ('Atlanta', 'NNP'), ("''", "''"), ('for', 'IN'), ('the', 'DT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'DT'), ('election', 'NN'), ('was', 'VBD'), ('conducted', 'VBN'), ('.', '.')], [('The', 'DT'), ('September-October', 'NNP'), ('term', 'NN'), ('jury', 'NN'), ('had', 'VBD'), ('been', 'VBN'), ('charged', 'VBN'), ('by', 'IN'), ('Fulton', 'NNP'), ('Superior', 'NNP'), ('Court', 'NNP'), ('Judge', 'NNP'), ('Durwood', 'NNP'), ('Pye', 'NNP'), ('to', 'TO'), ('investigate', 'VB'), ('reports', 'NNS'), ('of', 'IN'), ('possible', 'JJ'), ('``', '``'), ('irregularities', 'NNS'), ("''", "''"), ('in', 'IN'), ('the', 'DT'), ('hard-fought', 'JJ'), ('primary', 'NN'), ('which', 'WDT'), ('was', 'VBD'), ('won', 'VBN'), ('by', 'IN'), ('Mayor-nominate', 'NNP'), ('Ivan', 'NNP'), ('Allen', 'NNP'), ('Jr.', 'NNP'), ('.', '.')], [('``', '``'), ('Only', 'RB'), ('a', 'DT'), ('relative', 'JJ'), ('handful', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('reports', 'NNS'), ('was', 'VBD'), ('received', 'VBN'), ("''", "''"), (',', ','), 
('the', 'DT'), ('jury', 'NN'), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', 'VBG'), ('the', 'DT'), ('widespread', 'JJ'), ('interest', 'NN'), ('in', 'IN'), ('the', 'DT'), ('election', 'NN'), (',', ','), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('voters', 'NNS'), ('and', 'CC'), ('the', 'DT'), ('size', 'NN'), ('of', 'IN'), ('this', 'DT'), ('city', 'NN'), ("''", "''"), ('.', '.')], [('The', 'DT'), ('jury', 'NN'), ('said', 'VBD'), ('it', 'PRP'), ('did', 'VBD'), ('find', 'VB'), ('that', 'IN'), ('many', 'JJ'), ('of', 'IN'), ("Georgia's", 'NNP'), ('registration', 'NN'), ('and', 'CC'), ('election', 'NN'), ('laws', 'NNS'), ('``', '``'), ('are', 'VBP'), ('outmoded', 'VBN'), ('or', 'CC'), ('inadequate', 'JJ'), ('and', 'CC'), ('often', 'RB'), ('ambiguous', 'JJ'), ("''", "''"), ('.', '.')]]

What's with those part of speech labels? They aren't helpful at all!

The Penn Treebank tagset, which NLTK uses for its part-of-speech tagger, is not particularly intuitive. Fortunately, NLTK provides a helper that shows what each tag stands for.

# troubleshooting: https://github.com/nltk/nltk/issues/919
nltk.help.upenn_tagset("NN")
nltk.help.upenn_tagset("JJ")

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
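
The tagged output earlier also contains tags such as NNS and NNP. In the Penn Treebank tagset these are finer-grained noun tags that share the NN prefix, and JJR and JJS are the comparative and superlative adjective tags. The sketch below groups a few tags by their first two characters. The dictionary is a small hand-picked subset with paraphrased descriptions, not the full tagset.

```python
from collections import defaultdict

# A hand-picked subset of Penn Treebank tags with paraphrased descriptions.
penn_subset = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "VBD": "verb, past tense",
    "DT": "determiner",
}

# Group the tags by their two-character prefix.
families = defaultdict(list)
for tag in penn_subset:
    families[tag[:2]].append(tag)

print(families["NN"])  # ['NN', 'NNS', 'NNP', 'NNPS']
print(families["JJ"])  # ['JJ', 'JJR', 'JJS']
```

Comparing only the first two characters of a tag therefore treats every noun tag as a noun and every adjective tag as an adjective.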

Write a function that calculates our features for us

(In this case, numbers of nouns and adjectives that appear in the sentence)

Now we know the tags for the parts of speech we want to count in each sentence. Let's write a function that counts them for us, given a list of part-of-speech-tagged sentences (such as pos_all, which we computed above) and the part of speech we want to count (for example, "NN" to count the nouns in each sentence). Because the function compares only the first two characters of each tag, "NN" also matches NNS, NNP, and NNPS.

def countPOS(pos_tag_sent, POS):
    # Return one count per sentence: the number of tokens whose tag
    # starts with the two-character code POS (e.g. "NN" or "JJ").
    all_pos_counts = []
    for sentence in pos_tag_sent:
        pos_count = 0
        for word, tag in sentence:
            if tag[:2] == POS:
                pos_count += 1
        all_pos_counts.append(pos_count)
    return all_pos_counts
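
As a quick check of the input and output shapes, the same per-sentence counting can be sketched with str.startswith and a list comprehension. The function name count_pos_prefix and the two hand-tagged sentences are hypothetical, chosen only for illustration.

```python
def count_pos_prefix(tagged_sents, prefix):
    # One count per sentence: tokens whose tag starts with the given prefix.
    return [
        sum(1 for _, tag in sentence if tag.startswith(prefix))
        for sentence in tagged_sents
    ]

# Two hand-tagged sentences (illustrative, not tagger output).
sample = [
    [("Dogs", "NNS"), ("chase", "VBP"), ("yellow", "JJ"), ("balls", "NNS"), (".", ".")],
    [("Atlanta", "NNP"), ("is", "VBZ"), ("big", "JJ"), ("and", "CC"), ("busy", "JJ"), (".", ".")],
]

print(count_pos_prefix(sample, "NN"))  # [2, 1]
print(count_pos_prefix(sample, "JJ"))  # [1, 2]
```

For a two-character prefix this gives the same counts as comparing tag[:2]. Unlike that comparison, startswith also works with longer prefixes such as "NNP".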

We will now call this function twice, once for each part of speech we are counting. As we finish counting, we put the results straight into the DataFrame, which saves us from having to do so later.

df['NN'] = countPOS(pos_all, 'NN')
df['JJ'] = countPOS(pos_all, 'JJ')

df.head()

	label	sentence	NN	JJ
0	news	[The, Fulton, County, Grand, Jury, said, Frida...	11	2
1	news	[The, jury, further, said, in, term-end, prese...	13	2
2	news	[The, September-October, term, jury, had, been...	16	2
3	news	[``, Only, a, relative, handful, of, such, rep...	9	3
4	news	[The, jury, said, it, did, find, that, many, o...	5	3

df.tail()

	label	sentence	NN	JJ
4426	romance	[Nobody, else, showed, pleasure, .]	2	0
4427	romance	[Spike-haired, ,, burly, ,, red-faced, ,, deck...	9	3
4428	romance	[``, Hello, ,, boss, '', ,, he, said, ,, and, ...	2	0
4429	romance	[``, I, suppose, I, can, never, expect, to, ca...	3	0
4430	romance	[``, I'm, afraid, not, '', .]	1	0

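How many features do we have?

Summing each count column by label gives the total number of nouns and adjectives in each category.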

df.groupby('label')[['NN', 'JJ']].sum()

	NN	JJ
label
news	31593	6678
romance	13821	4022
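
For readers new to groupby, the sketch below reproduces the same group-then-sum step in plain Python. It uses only the ten rows displayed by df.head() and df.tail() above, so its totals are much smaller than the full-dataset totals.

```python
from collections import defaultdict

# (label, NN count, JJ count) for the ten rows displayed above.
rows = [
    ("news", 11, 2), ("news", 13, 2), ("news", 16, 2), ("news", 9, 3), ("news", 5, 3),
    ("romance", 2, 0), ("romance", 9, 3), ("romance", 2, 0), ("romance", 3, 0), ("romance", 1, 0),
]

# Accumulate the counts separately for each label.
totals = defaultdict(lambda: {"NN": 0, "JJ": 0})
for label, nn, jj in rows:
    totals[label]["NN"] += nn
    totals[label]["JJ"] += jj

print(dict(totals))
# {'news': {'NN': 54, 'JJ': 12}, 'romance': {'NN': 17, 'JJ': 3}}
```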

Practice 3: Save the DataFrame to your computer as a CSV (comma-separated values) file

Hint: .to_csv()

df.to_csv("df_news_romance.csv", index=False)

