Authors: F. Abel, D. Kohlsdorf, R. Pálovics

ACM RecSys Challenge 2016: Training Data

About

In the challenge, the task of the participants will be the following: given a XING user, the recommender should predict those job postings (items) that the user will interact with in the next week.

The training dataset is intended for experimenting and training your models. You can split the interaction data into training and test data for the purpose of evaluating your algorithms during development. For example, you can leave out the last complete week (of the year) from the interaction data and then try to predict whether a given user will positively interact with an item within that week. Relevant items are those items that a user clicked on, bookmarked or replied to (interaction_type = 1, 2 or 3). The easiest way to test how your algorithm performs is to submit your solution via the submission system.
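The leave-one-week-out split described above can be sketched as follows. This is a minimal illustration, not part of the official tooling: it assumes interactions are available as (user_id, item_id, interaction_type, created_at) tuples, that created_at is in seconds, and that weeks follow ISO numbering in UTC. The helper names (iso_week, split_by_week, ts) are hypothetical, and the held-out week is passed in explicitly rather than detected.

```python
from datetime import datetime, timezone

POSITIVE_TYPES = {1, 2, 3}  # clicked, bookmarked, replied


def iso_week(created_at):
    """Map a Unix timestamp (seconds) to an (ISO year, ISO week) pair in UTC."""
    cal = datetime.fromtimestamp(created_at, tz=timezone.utc).isocalendar()
    return (cal[0], cal[1])


def split_by_week(interactions, test_week):
    """Split (user_id, item_id, interaction_type, created_at) tuples.

    Rows before test_week form the training set; positive rows inside
    test_week become the relevant items per user. Later rows are dropped.
    """
    train, relevant = [], {}
    for row in interactions:
        user_id, item_id, interaction_type, created_at = row
        week = iso_week(created_at)
        if week < test_week:
            train.append(row)
        elif week == test_week and interaction_type in POSITIVE_TYPES:
            relevant.setdefault(user_id, set()).add(item_id)
    return train, relevant


def ts(year, month, day):
    """Helper for the toy data below: midnight UTC as a Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Toy data (invented for illustration, not taken from the dataset)
toy = [
    (1, 100, 1, ts(2016, 1, 5)),   # ISO week 1: goes to training
    (1, 101, 2, ts(2016, 1, 12)),  # ISO week 2: held out, positive
    (1, 102, 4, ts(2016, 1, 13)),  # ISO week 2: held out, deletion (not relevant)
]
train, relevant = split_by_week(toy, test_week=(2016, 2))
```

Passing the held-out week as a parameter avoids guessing which week is the last complete one; you would choose it after inspecting the range of timestamps in your copy of the data.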

Anonymization, pseudonymization, noise

The training dataset is a semi-synthetic sample of XING data. The dataset is designed to retain information that is useful for you in creating effective algorithms that address the challenge, while at the same time protecting the privacy of XING users. The dataset is "semi-synthetic" in that it is enriched with artificial users whose presence contributes to the anonymization.

  • the dataset contains artificial users
  • the dataset contains only a fraction of XING users and job postings
  • IDs are used instead of raw text for almost all attribute values (pseudonymization)
  • some attributes of the users may have been removed or flipped to NULL / unknown.
  • not all interactions of a user are contained in the dataset
  • some of the interactions are artificial (= have actually not been performed by the user)
  • timestamps have been shifted (but the order of interactions is kept)

Attempting to identify users, or to reveal any private information about the users or information about the business from which the data comes, is forbidden (cf. Rules).

Your algorithm should not attempt to identify artificial users or reconstruct flipped values. The training set and the test methodology are designed so that such approaches would not offer you an advantage. In fact, artificial users and interactions are also part of the ground truth against which your solution will be evaluated.

Dataset Description

Impressions

This dataset records which items were shown by the existing XING job recommender to which user in which week of the year. Only a subset of the impressions that were generated by XING's job recommender are considered: a fraction of the impressions on the Web (start-page and xing.com/jobs), some for mobile, none for emails. For those impressions there is no guarantee that the item was in the viewport of the user. Fields:

  • user_id ID of the user (points to users.id)
  • year
  • week of the year
  • items is a comma-separated list (not a set, so an item may appear more than once) of items that were displayed to the user (point to items.id)
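Since the items field is a list rather than a set, repeated item IDs carry information (the item was shown more than once in that week). The sketch below keeps those repeats and counts them per user and item. It assumes each impression row is already available as a (user_id, year, week, items) tuple with items as a string; the function names are hypothetical.

```python
from collections import Counter


def parse_items(items_field):
    """Turn a comma-separated string of item IDs into a list, keeping repeats."""
    return [int(tok) for tok in items_field.split(",") if tok.strip()]


def impression_counts(rows):
    """Count how often each item was shown to each user across all rows."""
    counts = Counter()
    for user_id, _year, _week, items_field in rows:
        for item_id in parse_items(items_field):
            counts[(user_id, item_id)] += 1
    return counts


# Toy rows (invented for illustration)
rows = [
    (7, 2015, 45, "10,11,10"),
    (7, 2015, 46, "11"),
]
counts = impression_counts(rows)
```

Items that were shown often but never clicked could serve as weak negative evidence, bearing in mind that an impression does not guarantee the item was in the user's viewport.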

Interactions

Interactions that the user performed on the job posting items. Fields:

  • user_id ID of the user who performed the interaction (points to users.id)
  • item_id ID of the item on which the interaction was performed (points to items.id)
  • interaction_type the type of interaction that was performed on the item:
    • 1 = the user clicked on the item
    • 2 = the user bookmarked the item on XING
    • 3 = the user clicked on the reply button or application form button that is shown on some job postings
    • 4 = the user deleted a recommendation from his/her list of recommendations (clicking on "x"), which has the effect that the recommendation will no longer be shown to the user and that a new recommendation item will be loaded and displayed to the user
  • created_at a Unix timestamp representing the time when the interaction was created
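To make the four interaction codes easier to work with, they can be wrapped in an enumeration and grouped per user. The sketch below assumes interactions come as (user_id, item_id, interaction_type, created_at) tuples. Collecting deletions (type 4) separately is a modelling choice made here for illustration, not something the challenge prescribes.

```python
from collections import defaultdict
from enum import IntEnum


class InteractionType(IntEnum):
    CLICK = 1
    BOOKMARK = 2
    REPLY = 3
    DELETE = 4


def items_by_signal(interactions):
    """Group item IDs per user into positive interactions and deletions."""
    positive = defaultdict(set)
    deleted = defaultdict(set)
    for user_id, item_id, interaction_type, _created_at in interactions:
        # Raises ValueError for codes outside 1..4
        kind = InteractionType(interaction_type)
        if kind is InteractionType.DELETE:
            deleted[user_id].add(item_id)
        else:
            positive[user_id].add(item_id)
    return positive, deleted


# Toy rows (invented for illustration)
rows = [
    (1, 100, 1, 1446000000),
    (1, 100, 3, 1446000600),
    (1, 200, 4, 1446001200),
]
positive, deleted = items_by_signal(rows)
```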

Users

Details about those users who appear in the above datasets. Fields:

  • id anonymized ID of the user (referenced as user_id in the other datasets above)
  • jobroles comma-separated list of job role terms (numeric IDs) that were extracted from the user's current job title. 0 means that there was no known jobrole detected for the user.
  • career_level career level ID (e.g. beginner, experienced, manager):
    • 0 = unknown
    • 1 = Student/Intern
    • 2 = Entry Level (Beginner)
    • 3 = Professional/Experienced
    • 4 = Manager (Manager/Supervisor)
    • 5 = Executive (VP, SVP, etc.)
    • 6 = Senior Executive (CEO, CFO, President)
  • discipline_id anonymized IDs represent disciplines such as "Consulting", "HR", etc.
  • industry_id anonymized IDs represent industries such as "Internet", "Automotive", "Finance", etc.
  • country describes the country in which the user is currently working:
    • de = Germany
    • at = Austria
    • ch = Switzerland
    • non_dach = none of the above countries
  • region is specified for some users who have as country de. Meaning of the regions: see below.
  • experience_n_entries_class identifies the number of CV entries that the user has listed as work experiences:
    • 0 = no entries
    • 1 = 1-2 entries
    • 2 = 3-4 entries
    • 3 = 5 or more entries
  • experience_years_experience is the estimated number of years of work experience that the user has:
    • 0 = unknown
    • 1 = less than 1 year
    • 2 = 1-3 years
    • 3 = 3-5 years
    • 4 = 5-10 years
    • 5 = 10-15 years
    • 6 = 16-20 years
    • 7 = more than 20 years
  • experience_years_in_current is the estimated number of years that the user has been working in his/her current job. Meaning of numbers: same as experience_years_experience
  • edu_degree estimated university degree of the user:
    • 0 or NULL = unknown
    • 1 = bachelor
    • 2 = master
    • 3 = phd
  • edu_fieldofstudies comma-separated list of the fields of study of the user. 0 means "unknown" and entries > 0 refer to broad fields of study such as Engineering, Economics and Legal, ...
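Several user fields encode "unknown" in-band (0 in jobroles and edu_fieldofstudies, 0 or NULL in edu_degree). A small normalization step like the one below keeps those placeholders from being treated as real categories. How NULL is written in the raw files is not stated here, so the accepted spellings (None, an empty string, the literal text "NULL") are assumptions, as are the function names.

```python
CAREER_LEVELS = {
    0: "unknown",
    1: "Student/Intern",
    2: "Entry Level (Beginner)",
    3: "Professional/Experienced",
    4: "Manager (Manager/Supervisor)",
    5: "Executive (VP, SVP, etc.)",
    6: "Senior Executive (CEO, CFO, President)",
}


def parse_id_list(field):
    """Parse a comma-separated ID field into a set, dropping the 0 placeholder."""
    if field is None:
        return set()
    ids = {int(tok) for tok in field.split(",") if tok.strip()}
    ids.discard(0)  # 0 means "unknown" / "no jobrole detected"
    return ids


def normalize_degree(value):
    """Map edu_degree to an int, folding NULL-like values into 0 (unknown)."""
    # Assumption: NULL may appear as None, an empty string, or the text "NULL"
    if value is None or str(value).strip() in ("", "NULL"):
        return 0
    return int(value)
```

For example, parse_id_list("0") yields an empty set, so a user without a detected job role contributes no job role features.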

Items

Details about the job postings that were and should be recommended to the users.

  • id anonymized ID of the item (referenced as item_id in the other datasets above)
  • title concepts that have been extracted from the job title of the job posting (numeric IDs)
  • career_level career level ID (e.g. beginner, experienced, manager):
    • 0 = unknown
    • 1 = Student/Intern
    • 2 = Entry Level (Beginner)
    • 3 = Professional/Experienced
    • 4 = Manager (Manager/Supervisor)
    • 5 = Executive (VP, SVP, etc.)
    • 6 = Senior Executive (CEO, CFO, President)
  • discipline_id anonymized IDs represent disciplines such as "Consulting", "HR", etc.
  • industry_id anonymized IDs represent industries such as "Internet", "Automotive", "Finance", etc.
  • country code of the country in which the job is offered
  • region is specified for some items that have de as country. Meaning of the regions: see below.
  • latitude latitude information (rounded to ca. 10km)
  • longitude longitude information (rounded to ca. 10km)
  • employment the type of employment:
    • 0 = unknown
    • 1 = full-time
    • 2 = part-time
    • 3 = freelancer
    • 4 = intern
    • 5 = voluntary
  • tags concepts that have been extracted from the tags, skills or company name
  • created_at a Unix timestamp representing the time when the item was created
  • active_during_test is 1 if the item is still active (= recommendable) during the test period and 0 if the item is not active anymore in the test period (= not recommendable)
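Users and items share several fields (career_level, discipline_id, industry_id, country, region), which makes a simple attribute-matching ranker easy to try. The sketch below is an illustrative baseline only and is not the organizers' content-based algorithm. It assumes users and items are dicts keyed by the field names above, treats 0 as "unknown" for every shared numeric field (documented only for career_level and region), and uses an arbitrary cutoff k.

```python
SHARED_FIELDS = ("career_level", "discipline_id", "industry_id", "country", "region")


def match_score(user, item):
    """Count the shared fields on which user and item agree."""
    score = 0
    for field in SHARED_FIELDS:
        u, i = user.get(field), item.get(field)
        # Assumption: None or 0 means "unknown" and should never count as a match
        if u in (None, 0) or i in (None, 0):
            continue
        if u == i:
            score += 1
    return score


def recommend(user, items, k=10):
    """Return up to k item IDs, restricted to items active during the test period."""
    active = [it for it in items if it.get("active_during_test") == 1]
    # Sort by descending score; break ties by item ID for determinism
    ranked = sorted(active, key=lambda it: (-match_score(user, it), it["id"]))
    return [it["id"] for it in ranked[:k]]


# Toy records (invented for illustration)
user = {"career_level": 3, "discipline_id": 7, "industry_id": 0,
        "country": "de", "region": 2}
items = [
    {"id": 1, "career_level": 3, "discipline_id": 7, "industry_id": 5,
     "country": "de", "region": 2, "active_during_test": 1},
    {"id": 2, "career_level": 3, "discipline_id": 7, "industry_id": 5,
     "country": "de", "region": 2, "active_during_test": 0},
    {"id": 3, "career_level": 4, "discipline_id": 7, "industry_id": 5,
     "country": "at", "region": 0, "active_during_test": 1},
]
top = recommend(user, items, k=2)
```

Filtering on active_during_test before ranking matters because inactive items are not recommendable in the test period, however well they match.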

Regions

ID Name
0 not specified
1 Baden-Württemberg
2 Bavaria
3 Berlin
4 Brandenburg
5 Bremen
6 Hamburg
7 Hesse
8 Mecklenburg-Vorpommern
9 Lower Saxony
10 North Rhine-Westphalia
11 Rhineland-Palatinate
12 Saarland
13 Saxony
14 Saxony-Anhalt
15 Schleswig-Holstein
16 Thuringia
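For readable reports or feature names, the table above can be held as a plain lookup. In this sketch, treating a missing region value as 0 ("not specified") is an assumption, and region_name is a hypothetical helper.

```python
REGIONS = {
    0: "not specified",
    1: "Baden-Württemberg",
    2: "Bavaria",
    3: "Berlin",
    4: "Brandenburg",
    5: "Bremen",
    6: "Hamburg",
    7: "Hesse",
    8: "Mecklenburg-Vorpommern",
    9: "Lower Saxony",
    10: "North Rhine-Westphalia",
    11: "Rhineland-Palatinate",
    12: "Saarland",
    13: "Saxony",
    14: "Saxony-Anhalt",
    15: "Schleswig-Holstein",
    16: "Thuringia",
}


def region_name(region_id):
    """Return the region name; a missing value is treated as 0 (assumption)."""
    return REGIONS[0 if region_id is None else int(region_id)]
```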

Target Users

The file target_users.csv contains the user IDs for which you ultimately need to submit solutions. The file lists one user ID per line (150,000 user IDs in total). All of these target users are also contained in the training data (see Users).
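A quick sanity check is to load the target user IDs and confirm that each one also appears in the users data. The sketch below reads one ID per line from any iterable of lines. Whether the file has a header row is not stated, so non-numeric lines are simply skipped (an assumption), and the function names are hypothetical.

```python
import io


def read_target_users(lines):
    """Read one user ID per line, skipping blank or non-numeric lines."""
    ids = []
    for line in lines:
        token = line.strip()
        if token.isdigit():
            ids.append(int(token))
    return ids


def missing_targets(target_ids, known_user_ids):
    """Return target IDs that do not appear among the known users."""
    return sorted(set(target_ids) - set(known_user_ids))


# Toy input standing in for the real file (invented for illustration)
sample = io.StringIO("user_id\n11\n12\n\n13\n")
targets = read_target_users(sample)
missing = missing_targets(targets, known_user_ids={11, 13, 99})
```

With the real file you would pass an open file handle instead of the StringIO object, for example `with open("target_users.csv") as f: targets = read_target_users(f)`.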

Example Solution File

The file solution_file_example.tgz is an example solution file that was generated by a simple content-based baseline algorithm.