For easy experimentation, Cornac offers access to a number of popular recommendation benchmark datasets. These are listed below along with their basic characteristics, followed by a usage example. In addition to preference feedback, some of these datasets come with item and/or user auxiliary information, which is grouped into three main categories:
- Text refers to textual information associated with items or users. The usual format of this data is `(item_id, text)` or `(user_id, text)`. Concrete examples of such information are item textual descriptions, product reviews, movie plots, and user reviews, just to name a few.
- Graph, for items, corresponds to a network where nodes (or vertices) are items, and links (or edges) represent relations among items. This information is typically represented by an adjacency matrix in the sparse triplet format: `(item_id, item_id, weight)`, or simply `(item_id, item_id)` in the case of unweighted edges. Relations between users (e.g., a social network) are represented similarly.
- Image consists of visual information paired with either users or items. The common format for this type of auxiliary data is `(object_id, ndarray)`, where `object_id` could be either `user_id` or `item_id`, and the `ndarray` may contain the raw images (pixel intensities) or visual feature vectors extracted from the images, e.g., using deep neural nets. For instance, the Amazon Clothing dataset includes CNN visual features for products.
How to cite. If you are using one of the datasets listed below in your research, please follow the citation guidelines provided by the authors of each respective dataset (see the "source" link in the table below).
| Dataset | #Users | #Items | #Interactions | Type | Item Text | Item Graph | Item Image | User Graph |
|---|---|---|---|---|---|---|---|---|
| Amazon Clothing (source) | 5,377 | 3,393 | 13,689 | INT [1,5] | ✔ | ✔ | ✔ | |
| Amazon Digital Music (source) | 5,541 | 3,568 | 64,706 | INT [1,5] | ✔ | | | |
| Amazon Office (source) | 3,703 | 6,523 | 53,282 | INT [1,5] | | ✔ | | |
| Amazon Toy (source) | 19,412 | 11,924 | 167,597 | INT [1,5] | | | | |
| Citeulike (source) | 5,551 | 16,980 | 210,537 | BIN {0,1} | ✔ | | | |
| Epinions (source) | 40,163 | 139,738 | 664,824 | INT [1,5] | | | | ✔ |
| FilmTrust (source) | 1,508 | 2,071 | 35,497 | REAL [0.5,4] | | | | ✔ |
| MovieLens 100k (source) | 943 | 1,682 | 100,000 | INT [1,5] | ✔ | | | |
| MovieLens 1M (source) | 6,040 | 3,706 | 1,000,209 | INT [1,5] | ✔ | | | |
| MovieLens 10M (source) | 69,878 | 10,677 | 10,000,054 | INT [1,5] | ✔ | | | |
| MovieLens 20M (source) | 138,493 | 26,744 | 20,000,263 | INT [1,5] | ✔ | | | |
| Netflix Small (source) | 10,000 | 5,000 | 607,803 | INT [1,5] | | | | |
| Netflix Original (source) | 480,189 | 17,770 | 100,480,507 | INT [1,5] | | | | |
| Tradesy (source) | 19,243 | 165,906 | 394,421 | BIN {0,1} | | | ✔ | |
Assume that we are interested in the FilmTrust dataset, which comes with both user-item ratings and user-user trust information. We can load these two pieces of information as follows:
```python
from cornac.datasets import filmtrust

ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()
```
The rating values are in the range [0.5, 4], and the trust network is undirected. Here are samples from our dataset:
```
Samples from ratings: [('1', '1', 2.0), ('1', '2', 4.0), ('1', '3', 3.5)]
Samples from trust: [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]
```
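Since the trust network is undirected, each `(user_id, user_id, weight)` triplet describes an edge in both directions. A minimal sketch, using the sample triplets above, of indexing the network as an adjacency dictionary:

```python
from collections import defaultdict

# Sample (user_id, user_id, weight) triplets as shown above.
trust = [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]

# Store each undirected edge in both directions for fast neighbor lookup.
adj = defaultdict(dict)
for u, v, w in trust:
    adj[u][v] = w
    adj[v][u] = w
```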
Our dataset is now ready to use for model training and evaluation. A concrete example is sorec_filmtrust, which illustrates how to perform an experiment with the SoRec model on FilmTrust. More details regarding the other datasets are available in the documentation.
| Dataset | #Users | #Items | #Baskets | #Interactions | Extra Info. |
|---|---|---|---|---|---|
| Ta Feng (source) | 28,297 | 22,542 | 86,403 | 817,741 | price, quantity |
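Basket-oriented feedback such as Ta Feng groups interactions into shopping baskets, with price and quantity attached to each purchase. A small sketch of this structure; the row format and all identifiers below are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hypothetical rows in the spirit of basket data:
# (user_id, basket_id, item_id, price, quantity)
rows = [
    ("u1", "b1", "i1", 35.0, 2),
    ("u1", "b1", "i2", 12.5, 1),
    ("u1", "b2", "i3", 99.0, 1),
]

# Group interactions into baskets per user:
# user_id -> basket_id -> list of (item_id, price, quantity).
baskets = defaultdict(lambda: defaultdict(list))
for user, basket, item, price, qty in rows:
    baskets[user][basket].append((item, price, qty))
```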