Cybersecurity incident prediction and discovery data

Everything about the datasets and data sources mentioned in the survey paper "Data-driven cyber security incident prediction and discovery".

[1] Proactively predict an organization’s breach incidents: Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents (Usenix, 2015)
[2] Predict risk distributions of different data breach incidents: Prioritizing security spending: A quantitative analysis of risk distributions for different business profiles (WEIS, 2015)
[3] Discover previously unknown malware: The dropper effect: Insights into malware distribution with downloader graph analytics (ACM SIGSAC, 2015)
[4] Predict whether a file is malicious or not based on the first 5 seconds of execution: Early Stage Malware Prediction Using Recurrent Neural Networks (Computers & Security, 2018)
[5] Predict the resilience of different software protection transformations against automated attacks: Predicting the Resilience of Obfuscated Code Against Symbolic Execution Attacks via Machine Learning (Usenix, 2017)
[6] Discover the correlation between mismanaged networks and maliciousness of the networks: On the Mismanagement and Maliciousness of Networks (NDSS, 2014)
[7] Discover black keywords used by the underground economy: How to Learn Klingon without a Dictionary: Detection and Measurement of Black Keywords Used by the Underground Economy (S&P, 2017)
[8] Predict whether a currently benign website has a high risk of becoming malicious in the future: Automatically detecting vulnerable websites before they turn malicious (Usenix, 2014)
[9] Predict malicious websites that have not yet surfaced by identifying infection campaigns: Delta: automatic identification of unknown web-based infection campaigns (ACM SIGSAC, 2013)
[10] Exploit Twitter to predict real-world vulnerability exploits: Vulnerability disclosure in the age of social media: Exploiting twitter for predicting real-world exploits (Usenix, 2015)
[11] Discover and generate Indicators of Compromise (IOCs): Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence (ACM SIGSAC, 2016)
[12] Discover, identify and encode cyberattack events: Crowdsourcing cybersecurity: Cyber attack detection using social media (ACM CIKM, 2017)
[13] Predict security-related behaviors of mobile apps: AUTOREB: Automatically Understanding the Review-to-Behavior Fidelity in Android Applications (ACM SIGSAC, 2015)
[14] Predict the future structural changes of the network: Modeling dynamic behavior in large evolving graphs (WSDM, 2013)
[15] Discover security events in a specific event category: Weakly Supervised Extraction of Computer Security Events from Twitter (WWW, 2015)
[16] Discover vulnerable code: Cross-Project Transfer Representation Learning for Vulnerable Function Discovery (TII, 2018)
[17] Discover zero-day applications in traffic classification systems: Robust Network Traffic Classification (TON, 2015)
[18] Discover hidden sensitive operations: Dark Hazard: Learning-based, Large-scale Discovery of Hidden Sensitive Operations in Android Apps (NDSS, 2017)

Dataset types

Organization reports and datasets

Related paper	Dataset	Introduction
[1][2]	VERIS community database	The vocabulary for event recording and incident sharing
[1]	Hackmageddon	Information security timelines and statistics
[1]	Web Hacking Incidents Database	Records of web hacking incidents
[3]	VirusTotal	Analyzing suspicious files and URLs to detect types of malware
[3]	National Software Reference Library (NSRL)	Providing a reference data set (RDS) of benign software
[3][10]	Symantec’s Worldwide Intelligence Network Environment (WINE)	Security-related data, including malware and exploited vulnerabilities, among others
[17]	KEIO, WIDE-08 and WIDE-09 traces	Public traffic data repository
[10]	ExploitDB	Offensive Security’s Exploit Database Archive
[10]	Microsoft’s Exploitability Index	Recording exploitability information

Executables

Related paper	Dataset	Introduction
[4][18]	VirusTotal	Providing executable samples, such as Windows 7 executable samples and Android apps
[4]	Softonic	App news and reviews, best software downloads and discovery
[4]	PortableApps	Offering free, commonly used Windows applications that have been specially packaged for portability
[4]	SourceForge	Open Source applications and software directory
[18]	Google Play	Official app store for the Android operating system

Network datasets

Related paper	Dataset	Introduction
[6]	Open Resolver Project	Open resolvers pose a significant threat to the global network infrastructure by answering recursive queries for hosts outside their domain. They are used in DNS amplification attacks and pose a threat similar to the Smurf attacks common in the late 1990s. The project collected a list of 32 million resolvers that respond to queries in some fashion.
[6]	Verisign, Inc.	Verisign, Inc. is an American company based in Reston, Virginia, that operates a diverse array of network infrastructure, including two of the Internet's thirteen root nameservers; the authoritative registry for the .com, .net, and .name generic top-level domains and the .cc and .tv country-code top-level domains; and the back-end systems for the .jobs, .gov, and .edu top-level domains. Verisign also offers a range of security services, including managed DNS, distributed denial-of-service (DDoS) attack mitigation, and cyber-threat reporting.
[6]	Alexa Web Information Service	The Alexa Web Information Service (AWIS) offers a platform for creating innovative Web solutions and services based on Alexa's vast information about web sites, accessible with a web services API.
[6][14]	University of Oregon Route Views Project	The University's Route Views project was originally conceived as a tool for Internet operators to obtain real-time BGP information about the global routing system from the perspectives of several different backbones and locations around the Internet.
[6]	Spoofer project	The project develops and supports open-source software tools to assess and report on the deployment of source address validation (SAV), an anti-spoofing best practice.
[6]	ZMap	ZMap is a modular, open-source network scanner designed for Internet-wide scans. It can survey the entire IPv4 address space in under 45 minutes from user space on a single machine, approaching the theoretical maximum speed of gigabit Ethernet.
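
As a rough sanity check on the ZMap figures above, the following Python sketch estimates how many minimum-size probes a gigabit link can carry in 45 minutes. The 64-byte frame, 8-byte preamble, and 12-byte inter-frame gap are standard Ethernet figures, but treating every probe as a minimum-size frame is an assumption, and the function names are illustrative only.

```python
# Back-of-the-envelope estimate; assumes every probe is a minimum-size Ethernet frame.
IPV4_SPACE = 2 ** 32                   # total number of IPv4 addresses
SCAN_SECONDS = 45 * 60                 # the 45-minute window quoted above
LINK_BITS_PER_SECOND = 1_000_000_000   # gigabit Ethernet

# On-wire cost of one minimum-size frame:
# 64-byte frame + 8-byte preamble + 12-byte inter-frame gap.
WIRE_BYTES_PER_PROBE = 64 + 8 + 12


def max_probes_per_second(link_bps: int = LINK_BITS_PER_SECOND,
                          wire_bytes: int = WIRE_BYTES_PER_PROBE) -> float:
    """Upper bound on frames per second the link can carry."""
    return link_bps / (wire_bytes * 8)


def coverage_fraction(seconds: int = SCAN_SECONDS) -> float:
    """Fraction of the IPv4 space that can be probed at line rate in the window."""
    return max_probes_per_second() * seconds / IPV4_SPACE


if __name__ == "__main__":
    print(f"line rate: {max_probes_per_second():,.0f} probes/s")
    print(f"coverage in 45 min: {coverage_fraction():.1%}")
```

Under these assumptions the link tops out at roughly 1.49 million probes per second, which covers a little under 94% of the 2^32 address space in 45 minutes, so the quoted scan time sits close to the line-rate limit.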

Synthetic datasets

Related paper	Dataset	Introduction
[5]	Synthetic obfuscation C code	Five obfuscating transformations are applied to each of 4,608 synthetic C programs that contain a security check, giving 23,040 obfuscated C programs in total.
[14]	Synthetic network graph	A simple graph represented by four main node patterns: “center of a star”, “edge of a star”, “bridge nodes” (connecting stars/cliques), and “clique nodes”.
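
To make the four node patterns of the synthetic network graph concrete, here is a minimal Python sketch that builds a toy graph containing each of them. It is not the generator used in [14]: the function name, parameters, node labels, and the choice to place one bridge node between each consecutive pair of structures are all assumptions made for illustration.

```python
from itertools import combinations


def build_synthetic_graph(num_stars=2, leaves_per_star=3, clique_size=3):
    """Return (edges, roles) for a toy graph with the four node patterns.

    Hypothetical layout: `num_stars` stars and one clique, chained together
    by bridge nodes. Requires clique_size >= 1.
    """
    edges = set()
    roles = {}
    anchors = []  # one representative node per structure, used for bridging

    # Stars: a center connected to each of its leaves.
    for star_index in range(num_stars):
        center = f"star{star_index}_center"
        roles[center] = "center of a star"
        anchors.append(center)
        for leaf_index in range(leaves_per_star):
            leaf = f"star{star_index}_leaf{leaf_index}"
            roles[leaf] = "edge of a star"
            edges.add((center, leaf))

    # Clique: every pair of clique nodes is connected.
    clique = [f"clique_{i}" for i in range(clique_size)]
    for node in clique:
        roles[node] = "clique node"
    edges.update(combinations(clique, 2))
    anchors.append(clique[0])

    # Bridge nodes: one between each consecutive pair of structures.
    for i, (left, right) in enumerate(zip(anchors, anchors[1:])):
        bridge = f"bridge_{i}"
        roles[bridge] = "bridge node"
        edges.add((left, bridge))
        edges.add((bridge, right))

    return edges, roles


edges, roles = build_synthetic_graph()
```

The `roles` mapping gives a ground-truth label for every node, which is what makes a synthetic graph of this kind convenient for checking whether a model recovers the intended patterns.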

Webpage data

Related paper	Dataset	Introduction
[7]	SEO, porn and gambling webpages	Webpages marked as “evil” by Baidu
[8]	Malicious and benign websites	Malicious websites are collected from PhishTank blacklists and the “search-redirection attacks” list; benign websites are gathered from the entire .com zone file and validated against multiple reputation blacklists, including PhishTank blacklists, the “search-redirection attacks” list, DNS-BH, Google SafeBrowsing, and hpHosts blacklists
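
The validation step for benign websites described above can be sketched as a set-membership filter: a candidate domain is kept only if none of the blacklists contains it. This is a simplified illustration rather than the procedure from [8]; the function name and the sample domains are hypothetical, and real blacklists would be loaded from their respective feeds.

```python
def filter_benign(candidates, blacklists):
    """Return candidate domains that appear on none of the given blacklists.

    `blacklists` maps a list name to a collection of domains.
    Matching is case-insensitive.
    """
    flagged = set()
    for entries in blacklists.values():
        flagged.update(domain.lower() for domain in entries)
    return [domain for domain in candidates if domain.lower() not in flagged]


# Placeholder data, not real blacklist contents.
candidates = ["alpha.example.com", "beta.example.com", "gamma.example.com"]
blacklists = {
    "PhishTank": {"beta.example.com"},
    "DNS-BH": {"GAMMA.example.com"},
}
benign = filter_benign(candidates, blacklists)
```

Requiring a domain to be absent from every list is a conservative choice: a single listing is enough to exclude a candidate from the benign set.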

Social media data

Related paper	Dataset	Introduction
[10][15]	Tweets crawled from Twitter	Twitter is a social media platform whose content ranges from breaking news and entertainment to sports and politics.
[13]	User reviews from Google Play	Each review is manually labeled with one or more security-related behaviors (spamming, financial issues, over-privileged permissions, and data leakage)
[11]	71,000 articles from leading technical blogs	Technical blogs: (1) Dancho Danchev, (2) Naked Security, (3) The Hacker News, (4) Webroot, (5) Threatpost, (6) TaoSecurity, (7) Sucuri, (8) Palo Alto, (9) Malwarebytes, (10) Hexacorn
