The dataset is available on the Kaggle platform and can be downloaded or used from here. It is in JSON format and takes around 2.6 GB on disk. In general, it contains the following features:
- Fetched data from YouTube API
- Fetched data from Media Bias / Fact check
- Features generated with the help of BERT-as-a-Service, the NELA Toolkit, and openSMILE
The dataset consists of 421 ChannelObjects, each with the following signature:
```
{
  media: MediaObject,
  youtube_id: string,
  snippet: ChannelSnippetObject,
  statistics: ChannelStatisticsObject,
  topicDetails: [string],
  videos_information: ChannelVideosInformationObject,
  language_information: ChannelLanguageInformationObject,
  bias: string,
  videos: [VideoObject]
}
```
- media - check MediaObject
- youtube_id - corresponds to the channel_id assigned by YouTube
- snippet - check ChannelSnippetObject
- statistics - check ChannelStatisticsObject
- topicDetails - list of strings representing the topic categories generated by YouTube. More info here
- videos_information - check ChannelVideosInformationObject
- language_information - check ChannelLanguageInformationObject
- bias - possible values: extremeleft, left, leastbiased, right, extremeright
- videos - list of VideoObject. Check it for more info.
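A ChannelObject can be consumed directly with Python's standard json module. The sketch below uses a toy two-channel sample (the ids and labels are invented for illustration); the real Kaggle file holds 421 such objects:

```python
import json
from collections import Counter

# Toy sample mimicking the ChannelObject signature above; the ids and bias
# labels are invented for illustration.
raw = """
[
  {"youtube_id": "UCabc", "bias": "left", "videos": []},
  {"youtube_id": "UCdef", "bias": "leastbiased", "videos": []}
]
"""
channels = json.loads(raw)

# Tally channels per bias label.
bias_counts = Counter(ch["bias"] for ch in channels)
```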
**ChannelSnippetObject**

```
{
  title: string,
  description: string,
  publishedAt: date
}
```
Check here
**ChannelStatisticsObject**

```
{
  viewCount: number,
  subscriberCount: number,
  videoCount: number
}
```
Check here
**ChannelVideosInformationObject**

```
{
  videos_count: number,
  video_ids: [string]
}
```
Since the dataset does not include every video from a given channel (CNN, for example, has more than 140K videos), videos_count is the number of this channel's videos that are actually present in the dataset, and video_ids holds their corresponding YouTube ids.
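For example, to see what fraction of a channel's videos the dataset covers, compare videos_count with the channel's overall videoCount. The numbers below are invented, loosely modeled on the CNN example:

```python
# Toy ChannelObject fragment with invented numbers, loosely modeled on the
# CNN example above (140K+ videos on the channel, only a few in the dataset).
channel = {
    "statistics": {"viewCount": 1_000_000, "subscriberCount": 5_000,
                   "videoCount": 140_000},
    "videos_information": {"videos_count": 3,
                           "video_ids": ["vid1", "vid2", "vid3"]},
}

# videos_count matches len(video_ids), not the channel's total videoCount.
included = channel["videos_information"]["videos_count"]
coverage = included / channel["statistics"]["videoCount"]
```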
**MediaObject**

```
{
  factual_reporting_label: string,
  bias_label: string,
  mediabiasfactcheck_url: url,
  youtube_references: [url],
  site: url,
  accessible: boolean,
  manually_checked: boolean,
  bias: string
}
```
Represents the data fetched from https://mediabiasfactcheck.com.
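The accessible and manually_checked flags make it easy to restrict analysis to verified records. A minimal sketch, with invented label values (not an exhaustive list of the labels Media Bias / Fact Check uses):

```python
# Toy MediaObjects; the label strings are invented examples.
media_records = [
    {"bias_label": "LEFT", "factual_reporting_label": "HIGH",
     "accessible": True, "manually_checked": True},
    {"bias_label": "RIGHT", "factual_reporting_label": "MIXED",
     "accessible": False, "manually_checked": False},
]

# Keep only records that were both reachable and verified by hand.
verified = [m for m in media_records
            if m["accessible"] and m["manually_checked"]]
```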
**VideoObject**

```
{
  youtube_id: string,
  snippet: VideoSnippetObject,
  contentDetails: VideoContentObject,
  status: VideoStatusObject,
  statistics: VideoStatisticsObject,
  topicDetails: VideoTopicDetailsObject,
  localizations: VideoLocalizationObject,
  background_sounds: [BackgroundSoundObject],
  processed: boolean,
  nela: VideoNelaObject,
  captions: VideoCaptionsObject,
  open_smile: VideoOpenSmileObject,
  speech_embeddings: VideoSpeechEmbeddingsObject,
  bert: VideoBertObject
}
```
- youtube_id - id given by YouTube
- snippet - check here
- contentDetails - check here
- status - check here
- statistics - check here
- topicDetails - check here
- localizations - check here
- nela - check VideoNelaObject
- captions - check VideoCaptionsObject
- open_smile - check VideoOpenSmileObject
- speech_embeddings - check VideoSpeechEmbeddingsObject
- bert - check VideoBertObject
**VideoNelaObject**

```
{
  title_subs: [float],
  title_description: [float]
}
```
- title_subs - 260 features generated with the NELA Toolkit from the video's title and subtitles
- title_description - 260 features generated with the NELA Toolkit from the video's title and description
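One straightforward way to use both variants is to concatenate them into a single feature vector. This is an illustration, not a prescribed pipeline step, and the vectors below are shortened stand-ins for the real 260-float lists:

```python
# Toy VideoNelaObject with shortened vectors (the real ones hold 260 floats
# each, per the description above).
nela = {
    "title_subs": [0.1, 0.2, 0.3],
    "title_description": [0.4, 0.5, 0.6],
}

# Concatenate both NELA variants into one feature vector.
features = nela["title_subs"] + nela["title_description"]
```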
**VideoCaptionsObject**

```
{
  'background': [CaptionObject]
}
```
Present when the video contains background sounds (e.g. "applause", "music").
**CaptionObject**

```
{
  start: string,
  end: string,
  text: string,
  only_sound: boolean
}
```
- start - caption start time in 'HH:MM:SS' format
- end - caption end time in 'HH:MM:SS' format
- text - caption text
- only_sound - whether this caption consists only of the background-sound annotation, with no additional text
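Since start and end are 'HH:MM:SS' strings, a small helper is needed to compute caption durations. The helper name and the sample caption below are illustrative:

```python
def to_seconds(ts: str) -> int:
    """Convert an 'HH:MM:SS' timestamp to a number of seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

# Toy caption entry following the schema above.
caption = {"start": "00:01:30", "end": "00:01:45",
           "text": "(applause)", "only_sound": True}
duration = to_seconds(caption["end"]) - to_seconds(caption["start"])
```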
**VideoOpenSmileObject**

```
{
  'IS09_emotion': {
    '1': [float],
    '2': [float],
    '3': [float],
    '4': [float],
    '5': [float]
  },
  'IS12_speaker_trait': {
    '1': [float],
    '2': [float],
    '3': [float],
    '4': [float],
    '5': [float]
  }
}
```
The keys are the openSMILE configs used for feature extraction. Each config has at least one sub-key ('1') and up to five ('1' through '5'), each representing a speech episode of the video from which the features were extracted.
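Because the number of speech episodes varies per video (one to five keys), downstream code should iterate over whatever keys are present. A sketch that pools episodes by element-wise averaging; the pooling choice and the shortened vectors are assumptions, not part of the dataset:

```python
# Toy VideoOpenSmileObject with shortened vectors; a real video exposes
# between one ('1') and five ('5') episode keys per config.
open_smile = {
    "IS09_emotion": {"1": [0.0, 2.0], "2": [4.0, 6.0]},
    "IS12_speaker_trait": {"1": [1.0, 1.0]},
}

def mean_over_episodes(config: dict) -> list:
    """Element-wise average of the per-episode feature vectors of one config."""
    episodes = list(config.values())
    return [sum(col) / len(col) for col in zip(*episodes)]

emotion_mean = mean_over_episodes(open_smile["IS09_emotion"])
```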
**VideoSpeechEmbeddingsObject**

```
{
  '1': [float],
  '2': [float],
  '3': [float],
  '4': [float],
  '5': [float]
}
```
For the i-vector features there is at least one key ('1') and up to five ('1' through '5'), each representing a speech episode of the video from which the features were extracted.
**VideoBertObject**

```
{
  'subs': BertObject,
  'title': BertObject,
  'description': BertObject,
  'tags': BertObject,
  'fulltext': BertObject
}
```
The keys indicate the text source used to generate the BERT features. Check BertObject for more info.
**BertObject**

```
{
  'REDUCE_MEAN': [float],
  'REDUCE_MAX': [float],
  'REDUCE_MEAN_MAX': [float],
  'CLS_TOKEN': [float],
  'SEP_TOKEN': [float]
}
```
- Check here
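Putting VideoBertObject and BertObject together, selecting an embedding means picking a text source and then a pooling strategy. The shortened vectors below are placeholders for the real BERT embeddings:

```python
# Toy VideoBertObject: each text source maps to a BertObject whose keys are
# bert-as-service pooling strategies (vectors shortened for illustration).
video_bert = {
    "title": {"REDUCE_MEAN": [0.1, 0.2], "CLS_TOKEN": [0.3, 0.4]},
    "subs": {"REDUCE_MEAN": [0.5, 0.6], "CLS_TOKEN": [0.7, 0.8]},
}

# Pick one text source and one pooling strategy as the video embedding.
embedding = video_bert["title"]["REDUCE_MEAN"]
```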