nansunsun/Cybersecurity-incident-prediction-and-discovery-data

Cybersecurity incident prediction and discovery data

Everything about the datasets and data sources mentioned in the survey paper "Data-driven cyber security incident prediction and discovery".


Table of contents

  • Research articles by areas
  • Dataset types
    • Organization reports and datasets
    • Executables
    • Network datasets
    • Synthetic datasets
    • Webpage data
    • Social media data

Research articles by areas


Dataset types


Organization reports and datasets

| Related paper | Dataset | Introduction |
|---|---|---|
| [1][2] | VERIS community database | The vocabulary for event recording and incident sharing |
| [1] | Hackmageddon | Information security timelines and statistics |
| [1] | Web Hacking Incidents Database | Records web hacking incidents |
| [3] | VirusTotal | Analyzes suspicious files and URLs to detect types of malware |
| [3] | National Software Reference Library (NSRL) | Provides a reference data set (RDS) of benign software |
| [3][10] | Symantec’s Worldwide Intelligence Network Environment (WINE) | Security-related data sets, including malware samples and exploited vulnerabilities |
| [17] | KEIO, WIDE-08 and WIDE-09 traces | Public traffic data repository |
| [10] | ExploitDB | Offensive Security’s Exploit Database Archive |
| [10] | Microsoft’s Exploitability Index | Records exploitability information |

Executables

| Related paper | Dataset | Introduction |
|---|---|---|
| [14][18] | VirusTotal | Provides executable samples, such as Windows 7 executables and Android apps |
| [14] | Softonic | App news and reviews, software downloads and discovery |
| [14] | PortableApps | Offers free, commonly used Windows applications that have been specially packaged for portability |
| [14] | SourceForge | Open source applications and software directory |
| [18] | Google Play | Official app store for the Android operating system |

Network datasets

| Related paper | Dataset | Introduction |
|---|---|---|
| [6] | Open recursive projects | Open resolvers pose a significant threat to the global network infrastructure by answering recursive queries for hosts outside their own domain. They are used in DNS amplification attacks and pose a threat similar to the Smurf attacks commonly seen in the late 1990s. This project collects a list of 32 million resolvers that respond to queries in some fashion. |
| [6] | Verisign, Inc. | Verisign, Inc. is an American company based in Reston, Virginia, that operates a diverse array of network infrastructure, including two of the Internet's thirteen root nameservers, the authoritative registry for the .com, .net, and .name generic top-level domains and the .cc and .tv country-code top-level domains, and the back-end systems for the .jobs, .gov, and .edu top-level domains. Verisign also offers a range of security services, including managed DNS, distributed denial-of-service (DDoS) attack mitigation, and cyber-threat reporting. |
| [6] | Alexa Web Information Service | The Alexa Web Information Service (AWIS) offers a platform for building web solutions and services on top of Alexa's information about web sites, accessible through a web services API. |
| [6][14] | University of Oregon Route Views Project | The Route Views project was originally conceived as a tool for Internet operators to obtain real-time BGP information about the global routing system from the perspectives of several different backbones and locations around the Internet. |
| [6] | Spoofer project | The team develops and supports open-source software tools to assess and report on the deployment of source address validation (SAV) best anti-spoofing practices. |
| [6] | ZMap | ZMap is a modular, open-source network scanner specifically architected to perform Internet-wide scans. It can survey the entire IPv4 address space in under 45 minutes from user space on a single machine, approaching the theoretical maximum speed of gigabit Ethernet. |

Synthetic datasets

| Related paper | Dataset | Introduction |
|---|---|---|
| [5] | Synthetic obfuscated C code | Five obfuscating transformations are applied to each of 4,608 synthetic C programs that contain a security check, yielding 23,040 obfuscated C programs in total. |
| [14] | Synthetic network graph | A simple graph built from four main node patterns: “center of a star”, “edge of a star”, “bridge nodes” (connecting stars/cliques), and “clique nodes”. |
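
The four node patterns of the synthetic network graph can be illustrated with a small sketch. This is not the generator used in [14]; the function name `build_synthetic_graph`, the node names, and the sizes are assumptions chosen for illustration, and "edge of a star" is read here as a peripheral (leaf) node of a star.

```python
from itertools import combinations


def build_synthetic_graph(n_leaves=4, clique_size=4):
    """Return (edges, roles) for one star and one clique joined by a bridge node.

    Hypothetical helper: sizes and node names are illustrative assumptions.
    """
    edges = set()
    roles = {}

    # Star: one center connected to n_leaves peripheral nodes.
    center = "star_center"
    roles[center] = "center of a star"
    for i in range(n_leaves):
        leaf = f"star_leaf_{i}"
        roles[leaf] = "edge of a star"
        edges.add(frozenset((center, leaf)))

    # Clique: every pair of members is connected.
    members = [f"clique_{i}" for i in range(clique_size)]
    for member in members:
        roles[member] = "clique node"
    for a, b in combinations(members, 2):
        edges.add(frozenset((a, b)))

    # Bridge: a node that links the star to the clique.
    bridge = "bridge_0"
    roles[bridge] = "bridge node"
    edges.add(frozenset((bridge, center)))
    edges.add(frozenset((bridge, members[0])))

    return edges, roles


def degree(edges, node):
    """Count the undirected edges that touch the given node."""
    return sum(1 for edge in edges if node in edge)


edges, roles = build_synthetic_graph()
print(len(edges), degree(edges, "star_center"), degree(edges, "bridge_0"))
```

Storing each undirected edge as a `frozenset` keeps the edge set free of duplicates regardless of the order in which endpoints are given.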

Webpage data

| Related paper | Dataset | Introduction |
|---|---|---|
| [7] | SEO, porn and gambling webpages | Webpages marked as “evil” by Baidu |
| [8] | Malicious and benign websites | Malicious websites are collected from PhishTank blacklists and the “search-redirection attacks” list. Benign websites are gathered from the entire .com zone file and validated against multiple reputation blacklists, including PhishTank blacklists, the “search-redirection attacks” list, DNS-BH, Google SafeBrowsing, and hpHosts blacklists. |
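
The benign-site validation step described for [8] amounts to keeping only candidate domains that appear on none of the blacklists. The following is a minimal sketch of that filtering, not the procedure from the paper: the function name `select_benign`, the blacklist keys, and the `example.com` domains are all hypothetical, and real blacklists would be loaded from their respective feeds.

```python
def select_benign(candidates, blacklists):
    """Return the sorted candidate domains that appear on none of the blacklists.

    `blacklists` maps a list name to an iterable of domains (assumed format).
    """
    blocked = set()
    for entries in blacklists.values():
        # Normalize case and whitespace so lookups are consistent.
        blocked.update(domain.strip().lower() for domain in entries)
    normalized = {domain.strip().lower() for domain in candidates}
    return sorted(normalized - blocked)


# Hypothetical data for illustration only.
blacklists = {
    "phishtank": {"phish.example.com"},
    "dns_bh": {"Malware.example.com"},
}
candidates = [
    "shop.example.com",
    "phish.example.com",
    "malware.example.com",
    "blog.example.com",
]
benign = select_benign(candidates, blacklists)
print(benign)
```

A domain is dropped as soon as any single list flags it, which is the conservative choice when the goal is a clean benign set.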

Social media data

| Related paper | Dataset | Introduction |
|---|---|---|
| [10][15] | Tweets crawled from Twitter | Twitter is a social media platform whose content ranges from breaking news and entertainment to sports and politics. |
| [13] | User reviews from Google Play | Each review is manually labeled with one or more security-related behaviors (spamming, financial issues, over-privileged permissions, and data leakage). |
| [11] | 71,000 articles from leading technical blogs | Technical blogs: (1) Dancho Danchev, (2) Naked Security, (3) The Hacker News, (4) Webroot, (5) Threat Post, (6) TaoSecurity, (7) Sucuri, (8) PaloAlto, (9) Malwarebytes, (10) Hexacorn |
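
Because each Google Play review in [13] can carry one or more of the four security-related behaviors, the labels form a multi-label target. One common way to represent such labels is a multi-hot vector. The sketch below assumes that representation; the actual file format of the dataset is not described here, and the names `LABELS` and `encode` are hypothetical.

```python
# The four behaviors named in the dataset description, in a fixed order.
LABELS = (
    "spamming",
    "financial issues",
    "over-privileged permissions",
    "data leakage",
)


def encode(review_labels):
    """Turn a collection of behavior names into a multi-hot list ordered like LABELS."""
    review_labels = set(review_labels)
    unknown = review_labels - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in review_labels else 0 for name in LABELS]


vector = encode({"spamming", "data leakage"})
print(vector)
```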
