You do need a configuration file

Before loading a dataset, write a YAML file that gives the parameters needed to process it. The file must be named after the value you pass as the dataset parameter, with a .yaml suffix, and stored in the directory recstudio/data/config. For example, suppose your code is:

from recstudio import quickstart

quickstart.run(model='LR', dataset='ml-1m', gpu=[2])

Then the YAML file must be named ml-1m.yaml. The configuration file must set the parameters required to load and parse the dataset.
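The naming rule can be sketched in a few lines of Python. This is only an illustration of the convention described above; the helper config_path_for is hypothetical and is not part of the RecStudio API.

```python
from pathlib import Path

# Directory named in the text above; the helper below is hypothetical.
CONFIG_DIR = Path("recstudio/data/config")


def config_path_for(dataset: str) -> Path:
    """Return the path where the config file for `dataset` is expected."""
    return CONFIG_DIR / f"{dataset}.yaml"


# The dataset name passed to quickstart.run decides the file name.
print(config_path_for("ml-1m").as_posix())  # recstudio/data/config/ml-1m.yaml
```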

How to configure the .yaml file

Example

url: https://files.grouplens.org/datasets/movielens/ml-1m.zip
##
user_id_field: &u UserID:token 
item_id_field: &i MovieID:token
rating_field: &r Rating:float
time_field: &t Timestamp:float
time_format: ~
encoding_method: ISO-8859-1
save_cache: True

inter_feat_name: ratings.dat 
inter_feat_field: [*u, *i, *r, *t]
inter_feat_header: ~

user_feat_name: [users.dat]
user_feat_field: [[*u, Gender:token, Age:token, Occupation:token, Zip-code:token]]
user_feat_header: ~

item_feat_name: [movies.dat]
item_feat_field: [[*i, Title:token_seq:" ", Genres:token_seq:"|"]]
item_feat_header: ~


field_separator: "::"
min_user_inter: 0
min_item_inter: 0
field_max_len: ~
rating_threshold: 3
ranker_rating_threshold: 3
drop_low_rating: True
max_seq_len: 20

network_feat_name: [[social.txt], [ml-100k.kg, ml-100k.link]]
mapped_feat_field: [*u, *i]
network_feat_field: [[[source_id:token, target_id:token]], [[head_id:token, tail_id:token, relation_id:token], [*i, entity_id:token]]]
network_feat_header: [~, ~]

The parameters are explained below.

url

To use one of the demo datasets shipped with RecStudio, set url to a path of the form recstudio/dataset_demo/**your_dataset**. Example

url: recstudio/dataset_demo/ml-100k

If you want to download and use a dataset from a web page, then your url should be the web page URL. Example

url: https://files.grouplens.org/datasets/movielens/ml-1m.zip

To use a local dataset, put the dataset folder inside the RecStudio directory and set url to the relative path of that folder. Example

url: recstudio/my_dataset/ml-20m
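The three kinds of url value can be told apart with a small check. The sketch below is an assumption about how such a check could look, based only on the three cases described above; classify_url is a hypothetical helper, not a RecStudio function.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Classify a `url` value as 'web', 'demo', or 'local' (hypothetical helper)."""
    # A web download has an http(s) scheme.
    if urlparse(url).scheme in ("http", "https"):
        return "web"
    # Demo datasets live under recstudio/dataset_demo/.
    if url.startswith("recstudio/dataset_demo/"):
        return "demo"
    # Anything else is treated as a relative path to a local folder.
    return "local"


for value in (
    "recstudio/dataset_demo/ml-100k",
    "https://files.grouplens.org/datasets/movielens/ml-1m.zip",
    "recstudio/my_dataset/ml-20m",
):
    print(value, "->", classify_url(value))
```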

user_id_field

Feature name and type of user IDs in the dataset. Example

user_id_field: &u user_id:token

item_id_field

Feature name and type of item IDs in the dataset. Example

item_id_field: &i item_id:token

rating_field

Feature name and type of the user-item interaction ratings in the dataset. Example

rating_field: &r rating:float

time_field

Feature name and type of the timestamp at which each user-item interaction occurred. Example

time_field: &t timestamp:float
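The field parameters above all use a name:type form, and the full example earlier also shows sequence fields with a third part, as in Genres:token_seq:"|". A minimal sketch of how such a string could be split is given below. FieldSpec and parse_field are hypothetical names introduced for illustration; how RecStudio parses these strings internally is not shown here.

```python
from typing import NamedTuple, Optional


class FieldSpec(NamedTuple):
    name: str                 # column name, e.g. "user_id"
    dtype: str                # declared type, e.g. "token", "float", "token_seq"
    separator: Optional[str]  # delimiter for sequence types, if one is given


def parse_field(spec: str) -> FieldSpec:
    """Split 'name:type' or 'name:type:"sep"' into its parts (hypothetical helper)."""
    # Split at most twice so that a separator containing ':' stays intact.
    parts = spec.split(":", 2)
    if len(parts) < 2:
        raise ValueError(f"expected 'name:type', got {spec!r}")
    separator = None
    if len(parts) == 3:
        separator = parts[2]
        # Remove one pair of matching surrounding quotes, if present.
        if len(separator) >= 2 and separator[0] == separator[-1] and separator[0] in "\"'":
            separator = separator[1:-1]
    return FieldSpec(parts[0], parts[1], separator)


print(parse_field("user_id:token"))
print(parse_field('Genres:token_seq:"|"'))
```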

time_format

The format of the timestamps of user-item interactions in the dataset. Example

time_format: "%Y-%m-%dT%H:%M:%Sz"
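The example format string uses the same % directives as Python's datetime.strptime. Assuming the format is interpreted that way, the sketch below converts a formatted timestamp into seconds since the epoch. The helper to_epoch_seconds is hypothetical, and treating the parsed time as UTC is an assumption made for the sake of a deterministic example.

```python
from datetime import datetime, timezone

# Format string from the example above; the trailing "z" is a literal character.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%Sz"


def to_epoch_seconds(raw: str, time_format: str = TIME_FORMAT) -> float:
    """Parse `raw` with `time_format` and return seconds since the epoch.

    Assumption: the parsed time is treated as UTC.
    """
    parsed = datetime.strptime(raw, time_format)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


print(to_epoch_seconds("1970-01-02T00:00:00z"))  # 86400.0
```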

encoding_method

The encoding format of the dataset. Example

encoding_method: utf-8

save_cache

Whether to save processed dataset to cache. Example

save_cache: True

inter_feat_name

Name of the file that stores the user-item interaction features. Example

inter_feat_name: ml-100k.inter

inter_feat_field

The column names and types of the user-item interaction features in the above file. Example

inter_feat_field: [*u, *i, *r, *t]

inter_feat_header

Row number to use as the column names; the data starts on the row after it. With inter_feat_header: 0, the column names are read from the first line of the interaction feature file above. If the file has no header line, set inter_feat_header to ~ (None), in which case every line is treated as data. Example

inter_feat_header: 0
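The difference between a header of 0 and a header of None can be sketched as follows. This is an illustration of the behavior described above rather than RecStudio's loader; split_header and its parameters are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple


def split_header(
    lines: Sequence[str],
    separator: str,
    header: Optional[int],
    field_names: Sequence[str] = (),
) -> Tuple[List[str], List[List[str]]]:
    """Return (column_names, data_rows) for a delimited file (hypothetical helper)."""
    rows = [line.rstrip("\n").split(separator) for line in lines if line.strip()]
    if header is None:
        # No header line: every row is data and the names come from the config.
        return list(field_names), rows
    # The header row supplies the names; data starts on the next row.
    return rows[header], rows[header + 1:]


with_header = ["user_id\titem_id", "1\t10", "2\t20"]
print(split_header(with_header, "\t", header=0))

without_header = ["1::10", "2::20"]
print(split_header(without_header, "::", header=None, field_names=["UserID", "MovieID"]))
```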

user_feat_name

Names of the files that store user features. Example

user_feat_name: [ml-100k.user]

user_feat_field

The column names and types of the user features in each of the above files. Example

user_feat_field: [[*u, age:token, gender:token, occupation:token, zip_code:token]]

user_feat_header

Row number to use as the column names; the data starts on the row after it. With user_feat_header: 0, the column names are read from the first line of the user feature file above. If the file has no header line, set user_feat_header to ~ (None), in which case every line is treated as data. Example

user_feat_header: 0

item_feat_name

Names of the files that store item features. Example

item_feat_name: [ml-100k.item]

item_feat_field

The column names and types of the item features in each of the above files. Example

item_feat_field: [[*i, title:token_seq:" ", genres:token_seq:"|"]]

item_feat_header

Row number to use as the column names; the data starts on the row after it. With item_feat_header: 0, the column names are read from the first line of the item feature file above. If the file has no header line, set item_feat_header to ~ (None), in which case every line is treated as data. Example

item_feat_header: 0

field_separator

The delimiter that separates fields within a line of the dataset files. Example

field_separator: "\t"

min_user_inter

Lower bound on the number of items a user has interacted with. If the number of items a user has interacted with is less than this value, the user will be ignored. Example

min_user_inter: 0

min_item_inter

Lower bound on the number of users an item has interacted with. If the number of users an item has interacted with is less than this value, the item will be ignored. Example

min_item_inter: 0
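The two lower bounds can be illustrated with a single filtering pass over (user, item) pairs. The sketch below is an assumption about the semantics described above. filter_by_min_inter is a hypothetical helper, and whether RecStudio repeats the filtering until no more users or items drop out is not stated here.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Interaction = Tuple[str, str]  # (user_id, item_id)


def filter_by_min_inter(
    interactions: Iterable[Interaction],
    min_user_inter: int = 0,
    min_item_inter: int = 0,
) -> List[Interaction]:
    """Keep interactions whose user and item both meet the lower bounds.

    Single pass only: counts are computed once on the unfiltered data.
    """
    interactions = list(interactions)
    user_items = defaultdict(set)
    item_users = defaultdict(set)
    for user, item in interactions:
        user_items[user].add(item)
        item_users[item].add(user)
    return [
        (user, item)
        for user, item in interactions
        if len(user_items[user]) >= min_user_inter
        and len(item_users[item]) >= min_item_inter
    ]


data = [("u1", "a"), ("u1", "b"), ("u2", "a")]
print(filter_by_min_inter(data, min_user_inter=2))  # drops u2
print(filter_by_min_inter(data, min_item_inter=2))  # drops item b
```

With both bounds left at 0, as in the examples above, nothing is filtered out.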

field_max_len

Mapping from features to their maximum lengths. Example

field_max_len: [[rating:1]]

max_seq_len

Maximum length of sequential features. Example

max_seq_len: 20

rating_threshold

Threshold for filtering and binarizing interactions by rating. If rating_threshold is not None, the rating of each interaction is replaced by a boolean value that indicates whether the original rating is at least rating_threshold. For example, if rating_threshold is 3, interaction records rated 3 or higher get a rating of 1, and all other records get a rating of 0. Example

rating_threshold: 0

drop_low_rating

If drop_low_rating is set to True, interaction records with ratings below rating_threshold are deleted outright instead of being kept with their rating set to 0. Example

drop_low_rating: False
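How rating_threshold and drop_low_rating interact can be sketched on a few (user, item, rating) records. This follows the description above and is not taken from the RecStudio source; apply_rating_threshold is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, str, float]  # (user_id, item_id, rating)


def apply_rating_threshold(
    records: List[Record],
    rating_threshold: Optional[float],
    drop_low_rating: bool,
) -> List[Record]:
    """Binarize ratings and optionally drop the low-rated records."""
    if rating_threshold is None:
        # No threshold: ratings are left untouched.
        return list(records)
    result = []
    for user, item, rating in records:
        label = 1.0 if rating >= rating_threshold else 0.0
        if label == 0.0 and drop_low_rating:
            continue  # deleted instead of kept with rating 0
        result.append((user, item, label))
    return result


records = [("u1", "a", 5.0), ("u1", "b", 2.0), ("u2", "a", 3.0)]
print(apply_rating_threshold(records, 3, drop_low_rating=False))
print(apply_rating_threshold(records, 3, drop_low_rating=True))
```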

ranker_rating_threshold

When a ranker model is used, interaction records with a rating greater than ranker_rating_threshold are treated as positive samples (labeled 1), and all others as negative samples (labeled 0). Example

ranker_rating_threshold: 3
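As worded above, the ranker comparison is strict, so a rating equal to the threshold counts as negative. The sketch below illustrates that reading. ranker_labels is a hypothetical helper, and the strict comparison is an assumption taken from the wording of this section.

```python
from typing import List, Sequence


def ranker_labels(ratings: Sequence[float], ranker_rating_threshold: float) -> List[int]:
    """Label ratings strictly above the threshold as 1 and all others as 0."""
    return [1 if rating > ranker_rating_threshold else 0 for rating in ratings]


# With a threshold of 3, a rating of exactly 3 is labeled 0 under this reading.
print(ranker_labels([2, 3, 4], ranker_rating_threshold=3))  # [0, 0, 1]
```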

The remaining parameters describe network features, such as social networks and knowledge graphs.

network_feat_name

Filenames where network features are stored. Example

network_feat_name: [[social.txt], [ml-100k.kg, ml-100k.link]]

mapped_feat_field

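The dataset fields (for example user or item IDs) that the columns of the network feature files map to; in the example below, ~ marks a column with no mapping. Example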

mapped_feat_field: [[*u, *u], [*i, ~, *i]]

network_feat_field

Feature names and types of the columns in each network feature file. Example

network_feat_field: [[[source_id:token, target_id:token]], [[head_id:token, relation_id:token, tail_id:token], [*i, entity_id:token]]]

network_feat_header

Row number to use as the column names; the data starts on the row after it. With a header value of 0, the column names are read from the first line of the corresponding network feature file above. If a file has no header line, set its header to ~ (None), in which case every line is treated as data. Example

network_feat_header: [[0], [0, 0]]