How to apply function .preprocess and others to Pandas df? #8

Open
fatihbozdag opened this issue Jan 14, 2023 · 3 comments

@fatihbozdag

Greetings all,

I have a large corpus loaded into a pandas DataFrame, and I'd like to iterate over the text column and record the results of the individual feature functions in separate columns. As far as I can tell, the extractor only accepts str. I am trying to merge the scores with the metadata included in the DataFrame.

For instance, my DataFrame is as follows.

df.head()
  docid_field  ...                                         text_field
0    BGSU1001  ...   <ICLE-BG-SUN-0001.1> \nIt is time, that our s...
1    BGSU1002  ...   <ICLE-BG-SUN-0002.1> \nNowadays there is a gr...
2    BGSU1003  ...   <ICLE-BG-SUN-0003.1> \nOnce upon a time there...
3    BGSU1004  ...   <ICLE-BG-SUN-0004.1> \nOur educational system...
4    BGSU1005  ...   <ICLE-BG-SUN-0005.1> \nScience, technology an...

Is there a way to apply a LingFeat function to df['text_field'] and record the scores (let's say from LingFeat.EnDF_()) as tuples in another column?
I tried

df['LingFeat'] = df['text_field'].apply(lambda x: extractor.pass_text(x))

and the result is

0      <lingfeat.extractor.pass_text object at 0x0000...
1      <lingfeat.extractor.pass_text object at 0x0000...
2      <lingfeat.extractor.pass_text object at 0x0000...
3      <lingfeat.extractor.pass_text object at 0x0000...
4      <lingfeat.extractor.pass_text object at 0x0000...
                       
923    <lingfeat.extractor.pass_text object at 0x0000...
924    <lingfeat.extractor.pass_text object at 0x0000...
925    <lingfeat.extractor.pass_text object at 0x0000...
926    <lingfeat.extractor.pass_text object at 0x0000...
927    <lingfeat.extractor.pass_text object at 0x0000...
Name: LingFeat, Length: 928, dtype: object

I couldn't go on any further. How should I do it, if it is possible?
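For anyone reading along: the column holds objects because `apply` stores whatever the lambda returns, and `extractor.pass_text(x)` returns an extractor object rather than scores. Below is a minimal sketch of one way around that, assuming the feature methods return a dict of scores per text. `ToyFeatures` and its made-up scores are hypothetical stand-ins so the example runs without LingFeat installed; with the real library you would presumably pass `extractor.pass_text` as `make_extractor` instead.

```python
import pandas as pd


class ToyFeatures:
    """Hypothetical stand-in for the object returned by extractor.pass_text."""

    def __init__(self, text):
        self.text = text
        self.tokens = None

    def preprocess(self):
        # Real LingFeat does much more here; this only splits on whitespace.
        self.tokens = self.text.split()

    def EnDF_(self):
        # Made-up scores; the real method returns different keys and values.
        return {"n_tokens": len(self.tokens), "n_chars": len(self.text)}


def score_text(text, make_extractor=ToyFeatures):
    """Build the extractor, preprocess, and return the scores (not the object)."""
    feats = make_extractor(text)
    feats.preprocess()
    return feats.EnDF_()


df = pd.DataFrame(
    {
        "docid_field": ["A1", "A2"],
        "text_field": ["It is time", "Nowadays there is a great deal"],
    }
)

# One dict of scores per row.
scores = df["text_field"].apply(score_text)

# Option 1: keep the scores as a tuple in a single column.
as_tuples = scores.apply(lambda d: tuple(d.values()))

# Option 2: expand the dicts into one column per score, next to the metadata.
score_cols = pd.DataFrame(scores.tolist(), index=df.index)
result = df.join(score_cols)
print(result)
```

Passing `index=df.index` when building `score_cols` keeps the scores lined up with the metadata rows even if the DataFrame does not use the default 0..n-1 index.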

@fatihbozdag
Author

Another, related question:

Is it possible to add LingFeat to spaCy's nlp.pipe?

@brucewlee
Owner

Actually, this is a very interesting idea. I'll try to implement this in the next version of this project: LFTK.

@fatihbozdag
Author

fatihbozdag commented Mar 2, 2023

Here is what I ended up doing, for anyone who wants to apply something similar.

import pandas as pd
from lingfeat import extractor

a1 = "DocID"

a2 = "I won't say that committing suicide is good or bad, what I want to emphasize here is I think none should accuse such people of something that is only and only up to the person himself. It's his choice and the end is his own end not ours, everyone should be responsible for his rights and wrongs. First we should consider why a person intends to give an end to his life and how he finds the enough courage to kill himself. If somebody is to ready to do such a terrible thing, there should be incredibly huge reasons to force him to this"

df = pd.DataFrame({"DocID": a1, "text_field": a2}, index = [0])

# Advanced Semantic (AdSem) Features

WoKF = [] # Wikipedia Knowledge Features
WBKF = [] # WeeBit Corpus Knowledge Features
OSKF = [] # OneStopEng Corpus Knowledge Features

  # Discourse (Disco) Features
EnDF = [] # Entity Density Features
EnGF = [] # Entity Grid Features

  # Syntactic (Synta) Features
PhrF = [] # Noun/Verb/Adj/Adv/... Phrasal Features
TrSF = [] # (Parse) Tree Structural Features
POSF = [] # Noun/Verb/Adj/Adv/... Part-of-Speech Features

  # Lexico Semantic (LxSem) Features
TTRF = [] # Type Token Ratio Features
VarF = [] # Noun/Verb/Adj/Adv Variation Features 
PsyF = [] # Psycholinguistic Difficulty of Words (AoA Kuperman)
WoLF = [] # Word Familiarity from Frequency Count (SubtlexUS)

  # Shallow Traditional (ShTra) Features
ShaF = [] # Shallow Features (e.g. avg number of tokens)
TraF = [] # Traditional Formulas 

for x in df["text_field"]:
    # Build one extractor per document and preprocess it once
    a = extractor.pass_text(x)
    a.preprocess()
    # Each feature method returns the scores for this document
    WoKF.append(a.WoKF_())
    WBKF.append(a.WBKF_())
    OSKF.append(a.OSKF_())
    EnDF.append(a.EnDF_())
    EnGF.append(a.EnGF_())
    PhrF.append(a.PhrF_())
    TrSF.append(a.TrSF_())
    POSF.append(a.POSF_())
    TTRF.append(a.TTRF_())
    VarF.append(a.VarF_())
    PsyF.append(a.PsyF_())
    WoLF.append(a.WorF_())
    ShaF.append(a.ShaF_())
    TraF.append(a.TraF_())

 ##Advanced Semantic Scores##
    
WoKF_score = pd.DataFrame.from_dict(WoKF, orient = "columns")
WBKF_score = pd.DataFrame.from_dict(WBKF, orient = "columns")
OSKF_score = pd.DataFrame.from_dict(OSKF, orient = "columns")

Adsem_Scores = pd.concat([WoKF_score,WBKF_score,OSKF_score], axis = 1)
Adsem_Scores.insert(0, "DocID", df["DocID"])

    ##Discourse Scores##

EnDF_score =  pd.DataFrame.from_dict(EnDF, orient = "columns")
EnGF_score =  pd.DataFrame.from_dict(EnGF, orient = "columns")

Disco_Scores = pd.concat([EnDF_score, EnGF_score], axis = 1)
Disco_Scores.insert(0, "DocID", df["DocID"])

    ##Syntactic Scores##

PhrF_score = pd.DataFrame.from_dict(PhrF, orient="columns")
TrSF_score = pd.DataFrame.from_dict(TrSF, orient = "columns")
POSF_score = pd.DataFrame.from_dict(POSF, orient = "columns")

Syntactic_Scores = pd.concat([PhrF_score, TrSF_score, POSF_score], axis = 1)
Syntactic_Scores.insert(0, "DocID", df["DocID"])


    ###Lexico-Semantic Scores###

TTRF_score = pd.DataFrame.from_dict(TTRF, orient="columns")
VarF_score = pd.DataFrame.from_dict(VarF, orient="columns")
PsyF_score =  pd.DataFrame.from_dict(PsyF, orient="columns")
WoLF_score =  pd.DataFrame.from_dict(WoLF, orient="columns")

LexicoSemantic_Scores = pd.concat([TTRF_score, VarF_score, PsyF_score, WoLF_score], axis = 1)
LexicoSemantic_Scores.insert(0, "DocID", df["DocID"])
    
    ###Shallow Traditional Features###

ShaF_score = pd.DataFrame.from_dict(ShaF, orient="columns")
TraF_score = pd.DataFrame.from_dict(TraF, orient="columns")

ShTra_Scores = pd.concat([ShaF_score, TraF_score], axis = 1)
ShTra_Scores.insert(0, "DocID", df["DocID"])
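
One thing to watch when reusing this on a real corpus: `insert(0, "DocID", df["DocID"])` matches rows by index label, not by position. The score frames get a fresh 0..n-1 index, so if `df` was filtered or otherwise has a different index, the DocID column can come out as NaN. A small sketch of the pitfall and one possible fix follows; the index values and the score dicts are made up for illustration and are not LingFeat output.

```python
import pandas as pd

# Metadata frame whose index is NOT 0..n-1 (e.g. after filtering rows).
df = pd.DataFrame(
    {"DocID": ["BGSU1001", "BGSU1002"], "text_field": ["first text", "second text"]},
    index=[10, 11],
)

# Hypothetical per-document score dicts, standing in for a list such as EnDF.
EnDF = [{"score_a": 0.1, "score_b": 1.0}, {"score_a": 0.2, "score_b": 2.0}]

# Pitfall: the score frame gets index 0, 1, so no label matches 10, 11.
misaligned = pd.DataFrame(EnDF)
misaligned.insert(0, "DocID", df["DocID"])

# Fix: give the score frame the same index as the metadata frame.
aligned = pd.DataFrame(EnDF, index=df.index)
aligned.insert(0, "DocID", df["DocID"])

print(misaligned)
print(aligned)
```

Calling `df.reset_index(drop=True)` before the loop would be another way to get the same effect.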
