Regarding the preprocessed dataset doubt #6

Joey0538 opened this issue Jan 12, 2021 · 18 comments
Joey0538 commented Jan 12, 2021

There are "index: count" pairs in each row of the preprocessed data file. What do they signify? Is it that the tweet text has been tokenized, a vocabulary/dictionary was built from all the tokens, and each row lists the "token index: count" pairs from that vocabulary? Could you please provide more details on how you preprocessed the dataset?

pansy33 commented Jan 30, 2021

I have the same problem: how do you get data.TD_RvNN.vol_5000.txt from the original Twitter15 and Twitter16 datasets?

@CurryTang

They are TF-IDF vectors computed from the Twitter15 & Twitter16 datasets. However, I find that the entries in the RvNN file and the ones in the original datasets don't match. For example, the post with id 624298742162845696 is not included in the Twitter15 & Twitter16 datasets.

@CynthiaLaura6gf

The paper uses TF-IDF values to represent node features, but I don’t know how to extract the retweet or response node features from the original Twitter 15&16 dataset.

@CurryTang
You first have to use the Twitter API to crawl the original tweet text via the tweet IDs provided by the original Twitter15&16 datasets. However, the Twitter API sets a rate limit, and many original posts are missing, so it's really complicated to get these data. Moreover, as I mentioned above, the entries in data.TD_RvNN.vol_5000.txt and the ones in the original Twitter15&16 don't match exactly. I found two ways around this problem:
you can use the TF-IDF vectors as node features directly, just like this paper, or you can have a look at https://github.com/serenaklm/rumor_detection. The authors of that paper released the texts of retweets and responses.

@CynthiaLaura6gf
Thank you so much, I will look at https://github.com/serenaklm/rumor_detection.

@CynthiaLaura6gf
Excuse me, in https://github.com/serenaklm/rumor_detection I can't find '../data/controversy/raw_data/'. In addition, I have trouble constructing the propagation structure of fake news. Like most papers, I use the retweet or response nodes as propagation nodes. But if the textual content of a retweet node is used as its feature, all retweet node features would be identical, which would affect the detection result, wouldn't it? I checked the author's data.TD_RvNN.vol_5000.txt file, and the vector of each node is basically different, so maybe they only used the response nodes. I am still confused, though: I don't know how to represent the retweet nodes. Do you have any suggestions?

@CurryTang
I just use this RvNN file directly. If there's no meaningful text for most retweets, you may need to crawl other metadata, such as user information, and use that as the node feature.

@youran521

2111105031607336 None 1 1:1 2:2 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:3 15:1 16:1 17:1 18:1
2111105031607336 1 2 32:1 33:1 34:1 9:1 19:2 20:1 21:2 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1 31:1
2111105031607336 1 3 35:1 36:1 37:1
2111105031607336 1 4 40:1 34:4 38:1 39:1
2111105031607336 1 5 1:1 34:1 9:1 42:1 43:1 44:1 45:1 46:1 47:1 48:1 21:2 41:1
In the Weibo dataset, how are these "index: count" values calculated using TF-IDF? Thank you very much.
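(For reference, the sample rows above can be read with a short sketch. This is my own interpretation of the layout based on this thread, not the authors' released code.)

```python
# Each row appears to be:
#   "<event id> <parent index> <node index> <word_index>:<count> ..."
# where the parent of the root node is the literal string "None".
def parse_row(row):
    fields = row.split()
    eid, parent, node = fields[0], fields[1], int(fields[2])
    parent = None if parent == "None" else int(parent)
    # The remaining fields are sparse bag-of-words entries: vocab index -> count.
    vec = {int(i): int(c) for i, c in (f.split(":") for f in fields[3:])}
    return eid, parent, node, vec

eid, parent, node, vec = parse_row("2111105031607336 None 1 1:1 2:2 3:1")
```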

@CurryTang
These TF-IDF vectors are computed from the text contents of the Twitter posts and user responses, with the vocabulary size set to 5000. If you'd like to compute these vectors yourself, you need to crawl those text contents.
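(As an illustration of how such "index:count" rows could be produced with a vocabulary capped at 5000 terms, as the "vol_5000" filename suggests. The real preprocessing script was not released, so the tokenization and term selection below are assumptions.)

```python
from collections import Counter

def build_vocab(posts, size=5000):
    df = Counter()  # document frequency per term
    for text in posts:
        df.update(dict.fromkeys(text.lower().split(), 1))  # each term once per post
    # 1-based vocabulary indices, most frequent terms first
    return {term: i + 1 for i, (term, _) in enumerate(df.most_common(size))}

def encode(text, vocab):
    # Emit "vocab_index:count" pairs for the in-vocabulary terms of one post.
    counts = Counter(text.lower().split())
    return " ".join(f"{vocab[t]}:{n}" for t, n in counts.items() if t in vocab)

posts = ["breaking news about the storm", "the storm is fake news"]
vocab = build_vocab(posts)
row = encode(posts[1], vocab)
```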

@luckyonetwo

Hello, I'd like to ask what the concrete steps are. First crawl the text contents and the user responses; what information do the user responses include? What comes after that? Can this file be generated just by downloading TF-IDF code from GitHub?

pansy33 commented Apr 26, 2022 via email

@flowingrain
No. You don't have their vocabulary, and the root nodes here also differ somewhat from those in Twitter15/16, so it would amount to constructing a new dataset. You could consider using the preprocessed dataset they provide; you just lose the word-order information.

@luckyonetwo
Got it. One more question: if I want to reproduce this on my own dataset, how can I process my data into this format?

@flowingrain
Crawl the retweet/comment tweets, score the terms by TF-IDF yourself, and build the vocabulary. The node object is defined in BiGCN's "get twitter graph" script; you can follow it to read the data and write out the file.
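(A rough sketch of the write-out step: emitting one event's nodes in the data.TD_RvNN.vol_5000.txt row format discussed in this thread. Function and variable names here are my own, not from the BiGCN code.)

```python
import io

def write_event(f, eid, nodes):
    # nodes: list of (parent_index_or_None, {vocab_index: count}) in node order;
    # node indices are 1-based, and the root's parent is written as "None".
    for node_idx, (parent, vec) in enumerate(nodes, start=1):
        bow = " ".join(f"{i}:{c}" for i, c in sorted(vec.items()))
        f.write(f"{eid} {'None' if parent is None else parent} {node_idx} {bow}\n")

buf = io.StringIO()
write_event(buf, "2111105031607336", [(None, {1: 1, 2: 2}), (1, {3: 1})])
```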

@luckyonetwo
Thank you very much.

@youran521
Hello, may I ask whether you managed to process your own dataset into this format?

pansy33 commented Jun 6, 2022 via email

pansy33 commented Oct 11, 2022 via email
