
Access to the dataset

The dataset is available on the Kaggle platform and can be downloaded or used from here. It is stored in JSON format and takes around 2.6 GB on disk. In general, it contains the following features:
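Once downloaded, the file can be loaded with the standard library. A minimal sketch (the filename and the sample field values below are placeholders, not taken from the real data; the real file holds 421 ChannelObjects):

```python
import json

# A tiny stand-in for the real Kaggle file: one ChannelObject with a subset
# of the fields from the dataset signature (values are hypothetical).
sample = [
    {
        "youtube_id": "UC_example_channel_id",
        "bias": "center",
        "statistics": {"viewCount": 1, "subscriberCount": 2, "videoCount": 3},
        "videos": [],
    }
]

with open("channels_sample.json", "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Loading works the same way for the full ~2.6 GB dataset file.
with open("channels_sample.json", encoding="utf-8") as f:
    channels = json.load(f)

bias_of_first = channels[0]["bias"]
print(len(channels), bias_of_first)  # 1 center
```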

Dataset signature

The dataset consists of 421 ChannelObjects with the following signature:

ChannelObject

Resource representation

{
    media: MediaObject,
    youtube_id: string,
    snippet: ChannelSnippetObject,
    statistics: ChannelStatisticsObject,
    topicDetails:[string],
    videos_information: ChannelVideosInformationObject,
    language_information: ChannelLanguageInformationObject,
    bias:string,
    videos:[VideoObject]
}

Properties

ChannelSnippetObject

Resource representation

{
    title:string,
    description: string,
    publishedAt: date
}

Properties

Check here

ChannelStatisticsObject

Resource representation

{
    viewCount: number,
    subscriberCount: number,
    videoCount: number
}

Properties

Check here

ChannelVideosInformationObject

Resource representation

{
    videos_count: number,
    video_ids: [string]
}

Properties

As this dataset doesn't include all videos of a given channel (for example, CNN has more than 140K videos), this object holds the actual count of videos present in the dataset for this particular channel, along with their corresponding YouTube ids.
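The relationship described above can be checked directly: videos_count should equal the number of ids listed. A sketch with hypothetical values:

```python
# Sketch of the invariant described above: videos_count matches the number
# of video ids actually included for the channel (field names from the schema,
# ids are hypothetical placeholders).
videos_information = {
    "videos_count": 3,
    "video_ids": ["abc123", "def456", "ghi789"],
}

consistent = videos_information["videos_count"] == len(videos_information["video_ids"])
print(consistent)  # True
```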

MediaObject

Resource representation

{
    factual_reporting_label: string, 
    bias_label: string,
    mediabiasfactcheck_url: url,
    youtube_references:[url],
    site: url,
    accessible:boolean,
    manually_checked:boolean,
    bias:string
}

Properties

Represents data fetched from https://mediabiasfactcheck.com.

VideoObject

Resource representation

{
    youtube_id: string,
    snippet: VideoSnippetObject
    contentDetails: VideoContentObject
    status: VideoStatusObject
    statistics: VideoStatisticsObject
    topicDetails: VideoTopicDetailsObject
    localizations: VideoLocalizationObject
    background_sounds: [BackgroundSoundObject] 
    processed: boolean
    nela: VideoNelaObject
    captions: VideoCaptionsObject
    open_smile: VideoOpenSmileObject
    speech_embeddings: VideoSpeechEmbeddingsObject
    bert: VideoBertObject
}

Properties

VideoNelaObject

Resource representation

{
    title_subs: [float],
    title_description: [float],
}

Properties

  • title_subs - 260 features generated with the NELA Toolkit from the video's title and subtitles
  • title_description - 260 features generated with the NELA Toolkit from the video's title and description
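The two NELA vectors can be concatenated into a single feature vector for a classifier. A sketch (each list has 260 floats in the real data; shortened hypothetical values here):

```python
# Concatenate the two NELA feature vectors from a VideoNelaObject into one
# input vector (values are hypothetical, shortened from the real 260 features).
nela = {"title_subs": [0.1, 0.2], "title_description": [0.3, 0.4]}

features = nela["title_subs"] + nela["title_description"]
print(len(features))  # 4
```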

VideoCaptionsObject

Resource representation

{
    'background': [CaptionObject]
}

Properties

Present if the video contains background sounds (e.g. "applause", "music").

CaptionObject

Resource representation

{
    start: string,
    end: string,
    text: string,
    only_sound: boolean
}

Properties

  • start - start time of the caption in 'HH:MM:SS' format
  • end - end time of the caption in 'HH:MM:SS' format
  • text - caption text
  • only_sound - indicates whether this particular caption contains only the background-sound annotation, with no additional text
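The 'HH:MM:SS' timestamps above can be converted to seconds for computing caption durations. A sketch with a hypothetical CaptionObject:

```python
def hms_to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS' caption timestamp to seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Hypothetical CaptionObject following the schema above.
caption = {"start": "00:01:30", "end": "00:01:45",
           "text": "[applause]", "only_sound": True}

duration = hms_to_seconds(caption["end"]) - hms_to_seconds(caption["start"])
print(duration)  # 15
```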

VideoOpenSmileObject

Resource representation

{
    'IS09_emotion': {
        '1': [float],
        '2': [float],
        '3': [float],
        '4': [float],
        '5': [float],
    },
    'IS12_speaker_trait': {
        '1': [float],
        '2': [float],
        '3': [float],
        '4': [float],
        '5': [float],
    }
}

Properties

The keys represent the openSMILE configurations used for feature extraction. Each configuration has at least one sub-key ('1') and can contain up to five ('1' through '5'), each representing a speech episode of the video from which the features were extracted.
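Because not every video has all five episodes, code consuming these features should iterate over whichever episode keys are present. A sketch with hypothetical values:

```python
# Collect per-episode feature vectors for one openSMILE config.
# Episode keys are the strings '1'..'5'; not every video has all five
# (feature values here are hypothetical placeholders).
open_smile = {
    "IS09_emotion": {"1": [0.1, 0.2], "2": [0.3, 0.4]},
    "IS12_speaker_trait": {"1": [0.5, 0.6]},
}

episodes = [open_smile["IS09_emotion"][key]
            for key in sorted(open_smile["IS09_emotion"], key=int)]
print(len(episodes))  # 2
```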

VideoSpeechEmbeddingsObject

Resource representation

{
    '1': [float],
    '2': [float],
    '3': [float],
    '4': [float],
    '5': [float],
}

Properties

Holds i-vector features keyed by speech episode: there is at least one key ('1') and up to five ('1' through '5'), each representing a speech episode of the video from which the features were extracted.

VideoBertObject

Resource representation

{
    'subs': BertObject,
    'title': BertObject,
    'description': BertObject,
    'tags': BertObject,
    'fulltext': BertObject
}

Properties

The keys represent the text sources used for generating BERT features. Check BertObject for more info.

BertObject

Resource representation

{
    'REDUCE_MEAN': [float],
    'REDUCE_MAX': [float],
    'REDUCE_MEAN_MAX': [float],
    'CLS_TOKEN': [float],
    'SEP_TOKEN': [float],
}

Properties
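A BertObject can be consumed by selecting one pooling strategy's embedding. A sketch (vector values and lengths are hypothetical placeholders, not the real embedding dimensionality):

```python
# Pick one pooling strategy's embedding from a hypothetical BertObject
# (keys follow the schema above; float values are placeholders).
bert_object = {
    "REDUCE_MEAN": [0.01, -0.02, 0.03],
    "REDUCE_MAX": [0.10, 0.20, 0.30],
    "CLS_TOKEN": [0.0, 0.0, 0.1],
}

embedding = bert_object["REDUCE_MEAN"]
print(len(embedding))  # 3
```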