For easy experimentation, Cornac offers access to a number of popular recommendation benchmark datasets. These are listed below along with their basic characteristics, followed by a usage example. In addition to preference feedback, some of these datasets come with item and/or user auxiliary information, which is grouped into three main categories:
- Text refers to textual information associated with items or users. The usual format of this data is `(item_id, text)` or `(user_id, text)`. Concrete examples of such information are item textual descriptions, product reviews, movie plots, and user reviews, just to name a few.
- Graph, for items, corresponds to a network where nodes (or vertices) are items, and links (or edges) represent relations among items. This information is typically represented by an adjacency matrix in the sparse triplet format: `(item_id, item_id, weight)`, or simply `(item_id, item_id)` in the case of unweighted edges. Relations between users (e.g., a social network) are represented similarly.
- Image consists of visual information paired with either users or items. The common format for this type of auxiliary data is `(object_id, ndarray)`, where `object_id` could be either `user_id` or `item_id`, and the `ndarray` may contain the raw images (pixel intensities) or visual feature vectors extracted from the images, e.g., using deep neural nets. For instance, the Amazon Clothing dataset includes CNN visual features for products.
How to cite. If you are using one of the datasets listed below in your research, please follow the citation guidelines provided by the authors of each respective dataset (see the "source" link in the table below).
| Dataset | #Users | #Items | #Interactions | Type | Item Text | Item Graph | Item Image | User Graph |
|---|---|---|---|---|---|---|---|---|
| Amazon Clothing (source) | 5,377 | 3,393 | 13,689 | INT [1,5] | ✔ | ✔ | ✔ | |
| Amazon Digital Music (source) | 5,541 | 3,568 | 64,706 | INT [1,5] | ✔ | | | |
| Amazon Office (source) | 3,703 | 6,523 | 53,282 | INT [1,5] | | ✔ | | |
| Amazon Toy (source) | 19,412 | 11,924 | 167,597 | INT [1,5] | | | | |
| Citeulike (source) | 5,551 | 16,980 | 210,537 | BIN {0,1} | ✔ | | | |
| Epinions (source) | 40,163 | 139,738 | 664,824 | INT [1,5] | | | | ✔ |
| FilmTrust (source) | 1,508 | 2,071 | 35,497 | REAL [0.5,4] | | | | ✔ |
| MovieLens 100k (source) | 943 | 1,682 | 100,000 | INT [1,5] | ✔ | | | |
| MovieLens 1M (source) | 6,040 | 3,706 | 1,000,209 | INT [1,5] | ✔ | | | |
| MovieLens 10M (source) | 69,878 | 10,677 | 10,000,054 | INT [1,5] | ✔ | | | |
| MovieLens 20M (source) | 138,493 | 26,744 | 20,000,263 | INT [1,5] | ✔ | | | |
| Netflix Small (source) | 10,000 | 5,000 | 607,803 | INT [1,5] | | | | |
| Netflix Original (source) | 480,189 | 17,770 | 100,480,507 | INT [1,5] | | | | |
| Tradesy (source) | 19,243 | 165,906 | 394,421 | BIN {0,1} | | | ✔ | |
Assume that we are interested in the FilmTrust dataset, which comes with both user-item ratings and user-user trust information. We can load these two pieces of information as follows:
```python
from cornac.datasets import filmtrust

ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()
```
The rating values are in the range [0.5, 4], and the trust network is undirected. Here are samples from our dataset:
```
Samples from ratings: [('1', '1', 2.0), ('1', '2', 4.0), ('1', '3', 3.5)]
Samples from trust: [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]
```
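Since the trust network is undirected, each `(user_id, user_id, weight)` triplet describes an edge in both directions. A minimal sketch, using the sample triplets above, of indexing the network as an adjacency dictionary:

```python
from collections import defaultdict

# Sample (user_id, user_id, weight) triplets as shown above.
trust = [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]

# Store each undirected edge in both directions for fast neighbor lookup.
adj = defaultdict(dict)
for u, v, w in trust:
    adj[u][v] = w
    adj[v][u] = w
```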
Our dataset is now ready to use for model training and evaluation. A concrete example is sorec_filmtrust, which illustrates how to perform an experiment with the SoRec model on FilmTrust. More details regarding the other datasets are available in the documentation.
| Dataset | #Users | #Items | #Baskets | #Interactions | Extra Info. |
|---|---|---|---|---|---|
| Ta Feng (source) | 28,297 | 22,542 | 86,403 | 817,741 | price, quantity |
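Basket-oriented feedback such as Ta Feng groups interactions into shopping baskets, with price and quantity attached to each purchase. A small sketch of this structure; the row format and all identifiers below are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hypothetical rows in the spirit of basket data:
# (user_id, basket_id, item_id, price, quantity)
rows = [
    ("u1", "b1", "i1", 35.0, 2),
    ("u1", "b1", "i2", 12.5, 1),
    ("u1", "b2", "i3", 99.0, 1),
]

# Group interactions into baskets per user:
# user_id -> basket_id -> list of (item_id, price, quantity).
baskets = defaultdict(lambda: defaultdict(list))
for user, basket, item, price, qty in rows:
    baskets[user][basket].append((item, price, qty))
```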