
Feature Request: Significant Text aggregation #33

PeterWalchhofer opened this issue Mar 13, 2023 · 3 comments

@PeterWalchhofer

Dear maintainers,

I currently have the following problem that I would like to address: in a set of documents (here: parliamentary speeches), I want to list the most important terms across all documents returned by a query, measured with respect to the overall index.

For instance, when querying for the term "Russland" I get a subset of parliamentary speeches. From those documents (possibly also filtered by a certain date) I want to return the 20 terms most prominently mentioned alongside "Russland" (e.g. "NATO", "Krim", ...). In my Python prototype (with a small Flask backend) I could achieve this, but I soon realized that this would run into scalability problems. I came up with two approaches that could fix this.

1. Persist TF-IDF scores for each term of a single document, like this:

   ```json
   {
     "text": "Dear ladies and gentlemen, [...] Russland, ... in der NATO",
     "terms": [
       {"term": "ladies", "tfidf": 0.01},
       ...,
       {"term": "NATO", "tfidf": 0.2}
     ]
   }
   ```

   and aggregate the result terms by summing up their values. This comes in handy, as I need the TF-IDF scores on a document level anyway. Again, I think this could lead to scalability issues, as I would need a join operation over many distinct values. While I could do this with map-reduce, I am not sure if Elasticsearch can (see the sketch after this list), so I stepped back from this approach.

2. Elasticsearch provides a significant-text aggregation, which seems to fit quite nicely. Is there already a way to pass such queries to Amcat4, or is an extension needed? Currently, I am struggling a bit with mapping Elasticsearch queries to Amcat4 queries due to their (slightly) different JSON structures.
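
Regarding approach 1, here is a rough, untested sketch of how the summing could look as a nested aggregation via the Python client. The index name "speeches" and the mapping of "terms" as a nested field (keyword "term", float "tfidf") are assumptions on my side:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Sum the persisted per-document TF-IDF scores across all matching speeches
# and keep the 20 terms with the highest total.
body = {
    "size": 0,
    "query": {"match": {"text": "Russland"}},
    "aggs": {
        "per_term": {
            "nested": {"path": "terms"},
            "aggs": {
                "top_terms": {
                    "terms": {
                        "field": "terms.term",
                        "size": 20,
                        "order": {"tfidf_sum": "desc"},
                    },
                    # sub-aggregation used both as the metric and the sort key
                    "aggs": {"tfidf_sum": {"sum": {"field": "terms.tfidf"}}},
                }
            },
        }
    },
}
resp = es.search(index="speeches", body=body)
```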

I would be super happy if somebody could help :)

@vanatteveldt
Member

Dear Peter,

Interesting case. The elastic API is indeed not directly exposed (which, it seems, is actually not allowed), but of course if you control the instance you're free to talk to the elastic backend.

My first reaction would be that this is something you'd normally want to do client-side, as that gives you full control over the text analysis being performed, rather than relying on whatever the elastic devs chose to implement. So, intuitively I would also prefer to explore option (1), but I'd need to understand a bit better what algorithm you're using to determine "importance".

That said, I hadn't heard of the significant text / term options in Elasticsearch, so it might be good to see if we can expose that easily. However, the docs also warn of potential performance issues, so it might still be better to explore client-side processing options?
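
For reference, if we did expose it, such a query could look roughly like this (untested sketch; the index and field names are just placeholders, and the docs suggest wrapping significant_text in a sampler aggregation to keep the cost down):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,
    "query": {"match": {"text": "Russland"}},
    "aggs": {
        # sampler restricts the analysis to the best-matching docs per shard,
        # which the significant_text docs recommend for performance
        "sample": {
            "sampler": {"shard_size": 200},
            "aggs": {
                "keywords": {
                    "significant_text": {"field": "text", "size": 20}
                }
            },
        }
    },
}
resp = es.search(index="speeches", body=body)
for bucket in resp["aggregations"]["sample"]["keywords"]["buckets"]:
    print(bucket["key"], bucket["score"])
```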

-- Wouter

@PeterWalchhofer
Author

Dear Wouter,

Thank you for the fast response!

You are right that client-side processing may be easier here. However, it would require moving the custom aggregation close to the Elasticsearch instance in order to reduce network traffic, e.g. some sort of Python backend that sits before the JS client, which I initially wanted to avoid. Term importance is simply measured by TF-IDF: in my prototypical implementation I concatenated all speeches of a query result and called transform() on the scikit-learn TfidfVectorizer. However, this is of course not ideal. My current idea would be to aggregate the top-20 per-document term-score pairs to compute the global top 20.
This, however, seems to be similar to the significant text feature of Elasticsearch. Although there is a performance risk, I actually think it will be faster than the client-side solution, as it performs the aggregation in a distributed fashion. The docs mention that the top k are computed per shard before the global aggregation, which seems much better than doing all of this on one node.
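
A minimal sketch of the client-side idea (assuming scikit-learn >= 1.0 and that the query result fits in memory; note that the IDF here is computed on the query result only, not the whole index, and I sum per-term scores rather than concatenating the texts):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(docs, k=20):
    """Return the k terms with the highest summed TF-IDF across docs."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)                 # (n_docs, n_terms), sparse
    scores = np.asarray(X.sum(axis=0)).ravel()  # total TF-IDF per term
    terms = vec.get_feature_names_out()
    top = scores.argsort()[::-1][:k]
    return [(terms[i], float(scores[i])) for i in top]
```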

What do you think?

Peter

@PeterWalchhofer
Author

PeterWalchhofer commented Mar 17, 2023

Dear Wouter,

I tried the significant text feature by exposing Elasticsearch via Docker, and it works pretty well (and fast). However, due to differences in how the scores are computed, I get slightly different terms. For now I have decided to create a separate Flask application that talks to Elasticsearch directly. This way I can also use the term vectors API, which is quite handy.
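
For anyone curious, the term vectors API can be called like this from the Python client (the index name and document id are made up; term_statistics=True adds the document frequencies needed to compute TF-IDF yourself):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch per-term statistics for one document; with term_statistics=True the
# response also includes doc_freq, which is enough to compute TF-IDF.
tv = es.termvectors(index="speeches", id="42", fields=["text"],
                    term_statistics=True)
for term, stats in tv["term_vectors"]["text"]["terms"].items():
    print(term, stats["term_freq"], stats.get("doc_freq"))
```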

If you plan on exposing significant text or termvectors, please let me know :)

Edit: I came to the conclusion that it is not fast enough. You were right! I now do it on the client side.
