Regarding the preprocessed dataset doubt #6

Joey0538 opened this issue Jan 12, 2021 · 18 comments
Joey0538 commented Jan 12, 2021

There are "index: count" pairs in each row of the preprocessed data file. What do they signify? Is it that the tweet text has been tokenized, a vocabulary/dictionary was built from all the tokens, and each row lists the "token index: count" pairs from that vocabulary? Could you please provide more details on how you preprocessed the dataset?

pansy33 commented Jan 30, 2021

I have the same problem: how do you get data.TD_RvNN.vol_5000.txt from the original Twitter15 and Twitter16 datasets?

@CurryTang

They are TF-IDF vectors computed from the Twitter15 & Twitter16 datasets. However, I find that the entries in the RvNN file and the ones in the original datasets don't match. For example, the post with id 624298742162845696 is not included in the Twitter15 & Twitter16 datasets.

@CynthiaLaura6gf

The paper uses TF-IDF values to represent node features, but I don’t know how to extract the retweet or response node features from the original Twitter 15&16 dataset.

@CurryTang
You first have to use the Twitter API to crawl the original tweet text via the tweet IDs provided by the original Twitter15&16 datasets. However, the Twitter API sets a rate limit, and many original posts are missing, so it's really complicated to get these data. Moreover, as I mentioned above, the entries in data.TD_RvNN.vol_5000.txt and the ones in the original Twitter15&16 don't match exactly. I found two ways around this problem:
you can use the TF-IDF vectors as node features directly, just like this paper, or you can have a look at https://github.com/serenaklm/rumor_detection. The authors of that paper released the texts of retweets and responses.

@CynthiaLaura6gf
Thank you so much, I will look at https://github.com/serenaklm/rumor_detection.

@CynthiaLaura6gf
Excuse me, in https://github.com/serenaklm/rumor_detection I can't find '../data/controversy/raw_data/'. In addition, I have trouble constructing the propagation structure of fake news. Like most papers, I use the retweet or response nodes as propagation nodes. But if the textual content of a retweet node is used as its feature, all retweet node features would be identical, which would affect the detection result, wouldn't it? I checked the author's data.TD_RvNN.vol_5000.txt file, and the vector of each node is basically different, so maybe they only used the response nodes. I am still confused, though: I don't know how to represent the retweet nodes. Do you have any suggestions?

@CurryTang
I just use this RvNN file directly. If there's no meaningful text for most retweets, you may need to crawl other metadata, such as user information, and use that as the node feature.

@youran521

2111105031607336 None 1 1:1 2:2 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:3 15:1 16:1 17:1 18:1
2111105031607336 1 2 32:1 33:1 34:1 9:1 19:2 20:1 21:2 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1 31:1
2111105031607336 1 3 35:1 36:1 37:1
2111105031607336 1 4 40:1 34:4 38:1 39:1
2111105031607336 1 5 1:1 34:1 9:1 42:1 43:1 44:1 45:1 46:1 47:1 48:1 21:2 41:1
In the Weibo dataset, how are these "index: count" values calculated using TF-IDF? Thank you very much.
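(For reference, the sample rows above can be read with a short sketch. This is my own interpretation of the layout based on this thread, not the authors' released code.)

```python
# Each row appears to be:
#   "<event id> <parent index> <node index> <word_index>:<count> ..."
# where the parent of the root node is the literal string "None".
def parse_row(row):
    fields = row.split()
    eid, parent, node = fields[0], fields[1], int(fields[2])
    parent = None if parent == "None" else int(parent)
    # The remaining fields are sparse bag-of-words entries: vocab index -> count.
    vec = {int(i): int(c) for i, c in (f.split(":") for f in fields[3:])}
    return eid, parent, node, vec

eid, parent, node, vec = parse_row("2111105031607336 None 1 1:1 2:2 3:1")
```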

@CurryTang
These TF-IDF vectors are computed from the text contents of the Twitter posts and user responses, with the vocabulary size set to 5000. If you'd like to compute these vectors yourself, you need to crawl those text contents.
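(As an illustration of how such "index:count" rows could be produced with a vocabulary capped at 5000 terms, as the "vol_5000" filename suggests. The real preprocessing script was not released, so the tokenization and term selection below are assumptions.)

```python
from collections import Counter

def build_vocab(posts, size=5000):
    df = Counter()  # document frequency per term
    for text in posts:
        df.update(dict.fromkeys(text.lower().split(), 1))  # each term once per post
    # 1-based vocabulary indices, most frequent terms first
    return {term: i + 1 for i, (term, _) in enumerate(df.most_common(size))}

def encode(text, vocab):
    # Emit "vocab_index:count" pairs for the in-vocabulary terms of one post.
    counts = Counter(text.lower().split())
    return " ".join(f"{vocab[t]}:{n}" for t, n in counts.items() if t in vocab)

posts = ["breaking news about the storm", "the storm is fake news"]
vocab = build_vocab(posts)
row = encode(posts[1], vocab)
```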

@luckyonetwo

Hello, I'd like to ask what the concrete steps are. First crawl the text contents and the user responses; what information do the user responses include? What comes after that? Can this file be generated just by downloading TF-IDF code from GitHub?

pansy33 commented Apr 26, 2022 via email

@flowingrain
No. You don't have their vocabulary, and the root nodes here also differ somewhat from those in Twitter15/16, so it would amount to constructing a new dataset. You could consider using the preprocessed dataset they provide; you just lose the word-order information.

@luckyonetwo
Got it. One more question: if I want to reproduce this on my own dataset, how can I process my data into this format?

@flowingrain
Crawl the retweet/comment tweets, score the terms by TF-IDF yourself, and build the vocabulary. The node object is defined in BiGCN's "get twitter graph" script; you can follow it to read the data and write out the file.
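(A rough sketch of the write-out step: emitting one event's nodes in the data.TD_RvNN.vol_5000.txt row format discussed in this thread. Function and variable names here are my own, not from the BiGCN code.)

```python
import io

def write_event(f, eid, nodes):
    # nodes: list of (parent_index_or_None, {vocab_index: count}) in node order;
    # node indices are 1-based, and the root's parent is written as "None".
    for node_idx, (parent, vec) in enumerate(nodes, start=1):
        bow = " ".join(f"{i}:{c}" for i, c in sorted(vec.items()))
        f.write(f"{eid} {'None' if parent is None else parent} {node_idx} {bow}\n")

buf = io.StringIO()
write_event(buf, "2111105031607336", [(None, {1: 1, 2: 2}), (1, {3: 1})])
```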

@luckyonetwo
Thank you very much.

@youran521
Hello, may I ask whether you managed to process your own dataset into this format?

pansy33 commented Jun 6, 2022 via email

pansy33 commented Oct 11, 2022 via email
