Integrating fcm value weighting #2149

Open
eisioriginal opened this issue Nov 3, 2021 · 10 comments

@eisioriginal

Requested feature

I want to integrate significance calculations based on log-likelihood, PMI, Dice and Poisson into the fcm object.

Use case

Co-occurrences can be weighted by statistical significance, which gives more semantically meaningful representations.

@eisioriginal
Author

I have the methods and they work efficiently. I think they should be integrated instead of me creating a new package!

@kbenoit
Collaborator

kbenoit commented Nov 3, 2021

Can you provide an example so we have a clearer idea of what these do and what sort of output is generated? We have some efficient association methods already used in quanteda.textstats::textstat_keyness() and might be able to adapt these if we knew exactly what sort of association statistics you are interested in generating.

@eisioriginal
Author

eisioriginal commented Nov 3, 2021

Hi, basically I'm talking about optimized implementations of these methods: https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html (I'm one of the authors). The visualization is pretty much the same as in quanteda, and I recently started to use quanteda's method, so that part can be ignored.

You can also read about them in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.5863&rep=rep1&type=pdf

What you get is an association between words based on log-likelihood, PMI, Dice and Poisson significance or weighting schemes. They are not all strict significance measures, but they are very helpful for finding relevant associations between words. Additionally, they work well in situations where Chi^2 is problematic (rare cases etc.).
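
For readers unfamiliar with the simpler two of these measures, here are their usual textbook definitions. The notation is an assumption made for illustration and is not taken from the linked sources: n_ab is the co-occurrence count of words a and b, n_a and n_b are their individual frequencies, and N is the total count. Log-likelihood and Poisson significance are left to the references above.

```latex
\mathrm{PMI}(a,b) = \log \frac{n_{ab}\, N}{n_a\, n_b},
\qquad
\mathrm{Dice}(a,b) = \frac{2\, n_{ab}}{n_a + n_b}
```

PMI is unbounded and tends to favour rare pairs, while Dice is bounded between 0 and 1, which is one reason the measures suit different research questions.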

@koheiw
Collaborator

koheiw commented Nov 4, 2021

I wrote a small function to compute PMI from an FCM a while ago. Do you want to add something like this?

> toks <- tokens(c("a b c", "a b d e"))
> fcmt <- fcm(toks)
> 
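> # PMI weighting: log(n_ij * N / (n_i * n_j)), negative values clipped to 0
> fcm_pmi <- function(x) {
+   m <- x@meta$object$margin   # per-feature frequencies stored with the fcm
+   x <- as(x, "dgTMatrix")     # triplet form: x@i and x@j are 0-based indices
+   x@x <- log(x@x / (m[x@i + 1] * m[x@j + 1]) * sum(m))
+   x@x[x@x < 0] <- 0           # keep only positive associations
+   as.fcm(x)
+ }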
> 
> fcmt
Feature co-occurrence matrix of: 5 by 5 features.
        features
features a b c d e
       a 0 2 1 1 1
       b 0 0 1 1 1
       c 0 0 0 0 0
       d 0 0 0 0 1
       e 0 0 0 0 0
> fcm_pmi(fcmt)
Feature co-occurrence matrix of: 5 by 5 features.
        features
features a        b        c        d        e
       a 0 1.252763 1.252763 1.252763 1.252763
       b 0 0.000000 1.252763 1.252763 1.252763
       c 0 0        0.000000 0        0       
       d 0 0        0        0.000000 1.945910
       e 0 0        0        0        0.000000
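
As a check on the output above, the two distinct values can be reproduced by hand. This assumes the stored margin holds the feature frequencies of the two example documents (a = 2, b = 2, c = d = e = 1, total 7), which is consistent with the printed numbers but is inferred rather than stated in the thread.

```latex
\mathrm{PMI}(a,b) = \log \frac{2 \cdot 7}{2 \cdot 2} = \log 3.5 \approx 1.2528,
\qquad
\mathrm{PMI}(d,e) = \log \frac{1 \cdot 7}{1 \cdot 1} = \log 7 \approx 1.9459
```

The pairs a-c, a-d, a-e, b-c, b-d and b-e each give log(1 * 7 / (2 * 1)) = log 3.5 as well, which is why 1.252763 fills most of the upper triangle.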

You are welcome to issue a pull request!

@eisioriginal
Author

eisioriginal commented Nov 4, 2021

Yes, this is exactly what I'm proposing, but I want to add more association measures, since they all have different properties w.r.t. research questions and researcher requirements. Since I'm using them all the time, I thought integrating them into quanteda would be nice for the whole community.

I have a background in CSS and a PhD in Computer Science. The proposed measures are our standard repertoire when it comes to semantic interpretation of text resources. I work in the Computational Humanities group at Leipzig University.

@koheiw
Collaborator

koheiw commented Nov 4, 2021

Why don't you start a branch to add a new function called fcm_weight() with additional measures? I am happy to assist.
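
A minimal sketch of what such a weighting function might look like, to make the idea concrete. It is not the function that was proposed or merged: the name weight_cooc() and its arguments are hypothetical, it covers only PMI and Dice, and it works on a plain sparse matrix from the Matrix package with the feature frequencies passed in explicitly, rather than on an fcm object. Using token frequencies as the Dice margins is also an assumption.

```r
library(Matrix)

# Hypothetical helper, not part of quanteda: weight the non-zero cells of a
# sparse co-occurrence matrix `x`, given per-feature frequencies `margin`.
weight_cooc <- function(x, margin, scheme = c("pmi", "dice")) {
  scheme <- match.arg(scheme)
  trip <- summary(x)        # non-zero cells as columns i, j, x (1-based)
  fi <- margin[trip$i]      # frequency of the row feature
  fj <- margin[trip$j]      # frequency of the column feature
  w <- switch(scheme,
    # PMI with negative values clipped to 0, as in fcm_pmi()
    pmi = pmax(log(trip$x * sum(margin) / (fi * fj)), 0),
    # Dice coefficient of the two features
    dice = 2 * trip$x / (fi + fj)
  )
  sparseMatrix(i = trip$i, j = trip$j, x = w,
               dims = dim(x), dimnames = dimnames(x))
}

# Counts and margins copied from the five-feature example in this thread
feats <- c("a", "b", "c", "d", "e")
counts <- sparseMatrix(
  i = c(1, 1, 1, 1, 2, 2, 2, 4),
  j = c(2, 3, 4, 5, 3, 4, 5, 5),
  x = c(2, 1, 1, 1, 1, 1, 1, 1),
  dims = c(5, 5), dimnames = list(feats, feats)
)
margin <- c(2, 2, 1, 1, 1)

pmi  <- weight_cooc(counts, margin, "pmi")
dice <- weight_cooc(counts, margin, "dice")
print(round(as.matrix(pmi), 4))
print(round(as.matrix(dice), 4))
```

With scheme = "pmi" the a-b cell is log(3.5), matching the fcm_pmi() output earlier in the thread. Log-likelihood and Poisson significance would slot in as further branches of the switch().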

@eisioriginal
Author

Nice, will do!

@kbenoit
Collaborator

kbenoit commented Nov 4, 2021

Probably better in quanteda.textstats since that's where the association statistics code already lives, and since this is a textual statistic.

@koheiw
Collaborator

koheiw commented Nov 4, 2021

I wrote fcm_pmi() as pre-processing for SVD, so I thought it should be in the main package. If it is for network analysis, textstats would be a better place. @eisioriginal how do you want to use the output?

@eisioriginal
Author

Basically, I analyse networks, search for synonyms, mine for semantic changes and that sort of thing.
