You need a configuration file
When loading a dataset, you need to write a YAML file in advance that gives the parameters needed to process the dataset.
The YAML file must be named after the value you pass as the dataset parameter, with a .yaml suffix, and stored in the directory recstudio/data/config.
Example
Your code:
from recstudio import quickstart
quickstart.run(model='LR', dataset='ml-1m', gpu=[2])
Then the name of your YAML file should be ml-1m.yaml.
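The naming rule can be sketched as a small path computation. The helper name config_path_for is hypothetical and not part of RecStudio; only the directory recstudio/data/config and the .yaml suffix come from the text above.

```python
import posixpath

CONFIG_DIR = "recstudio/data/config"  # directory named in the text above


def config_path_for(dataset: str) -> str:
    """Hypothetical helper: expected location of the config for `dataset`."""
    return posixpath.join(CONFIG_DIR, dataset + ".yaml")


print(config_path_for("ml-1m"))  # recstudio/data/config/ml-1m.yaml
```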
Your yaml configuration file needs to set the necessary parameters to load and parse the dataset.
How to configure the .yaml file
Example
url: https://files.grouplens.org/datasets/movielens/ml-1m.zip
##
user_id_field: &u UserID:token
item_id_field: &i MovieID:token
rating_field: &r Rating:float
time_field: &t Timestamp:float
time_format: ~
encoding_method: ISO-8859-1
save_cache: True
inter_feat_name: ratings.dat
inter_feat_field: [*u, *i, *r, *t]
inter_feat_header: ~
user_feat_name: [users.dat]
user_feat_field: [[*u, Gender:token, Age:token, Occupation:token, Zip-code:token]]
user_feat_header: ~
item_feat_name: [movies.dat]
item_feat_field: [[*i, Title:token_seq:" ", Genres:token_seq:"|"]]
item_feat_header: ~
field_separator: "::"
min_user_inter: 0
min_item_inter: 0
field_max_len: ~
rating_threshold: 3
ranker_rating_threshold: 3
drop_low_rating: True
max_seq_len: 20
network_feat_name: [[social.txt], [ml-100k.kg, ml-100k.link]]
mapped_feat_field: [*u, *i]
network_feat_field: [[[source_id:token, target_id:token]], [[head_id:token, tail_id:token, relation_id:token], [*i, entity_id:token]]]
network_feat_header: [~, ~]
The parameters are explained below.
url
If you choose a dataset from the demos shipped with RecStudio, your url should look like recstudio/dataset_demo/**your_dataset**.
Example
url: recstudio/dataset_demo/ml-100k
If you want to download and use a dataset from a web page, then your url should be the web page URL.
Example
url: https://files.grouplens.org/datasets/movielens/ml-1m.zip
If you want to use a local dataset, put the dataset folder into the RecStudio directory and set url to the relative path of that folder.
Example
url: recstudio/my_dataset/ml-20m
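The three cases above differ only in whether url is a web address or a relative path. A minimal sketch of that distinction follows; classify_url is a hypothetical helper used for illustration, not a RecStudio function, and RecStudio's own check may work differently.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Hypothetical helper: 'web' for http(s) URLs, 'local' for relative paths."""
    scheme = urlparse(url).scheme
    return "web" if scheme in ("http", "https") else "local"


for url in (
    "recstudio/dataset_demo/ml-100k",
    "https://files.grouplens.org/datasets/movielens/ml-1m.zip",
    "recstudio/my_dataset/ml-20m",
):
    print(url, "->", classify_url(url))
```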
user_id_field
Feature name and type of user IDs in the dataset. Example
user_id_field: &u user_id:token
item_id_field
Feature name and type of item IDs in the dataset. Example
item_id_field: &i item_id:token
rating_field
Feature name and corresponding type of user and item interaction scores in the dataset. Example
rating_field: &r rating:float
time_field
Feature name and type of the timestamp at which the user-item interaction occurred. Example
time_field: &t timestamp:float
time_format
The format in which the time of a user-item interaction is written in the dataset. Example
time_format: "%Y-%m-%dT%H:%M:%Sz"
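To see what such a format string matches, here is a minimal sketch using Python's datetime.strptime. It assumes that time_format follows strptime-style directives; the sample timestamp is made up for illustration and does not come from any dataset.

```python
from datetime import datetime

time_format = "%Y-%m-%dT%H:%M:%Sz"  # value from the example above
sample = "2021-03-04T05:06:07z"     # made-up timestamp string

# The trailing "z" is a literal character here, not a time zone directive.
parsed = datetime.strptime(sample, time_format)
print(parsed.isoformat())
```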
encoding_method
The encoding format of the dataset. Example
encoding_method: utf-8
save_cache
Whether to save processed dataset to cache. Example
save_cache: True
inter_feat_name
Filename to store user-item interaction features. Example
inter_feat_name: ml-100k.inter
inter_feat_field
The column name corresponding to the user-item interaction feature in the above file. Example
inter_feat_field: [*u, *i, *r, *t]
inter_feat_header
Row number to use as the column names; the data starts after that row. With inter_feat_header=0, the column names are inferred from the first line of the interaction feature file above. If the file has no header row, use inter_feat_header=None (written as ~ in YAML).
Example
inter_feat_header: 0
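The difference between a header value of 0 and None can be sketched with a tiny reader. read_table is a hypothetical function written for this illustration. The description resembles the header argument of pandas.read_csv, but whether RecStudio forwards the value there is an assumption.

```python
def read_table(text, sep, header, names=None):
    """Hypothetical reader.

    header=0    -> column names come from the first line, data starts after it.
    header=None -> every line is data, column names come from `names`.
    """
    rows = [line.split(sep) for line in text.splitlines() if line]
    if header is None:
        columns = list(names)
    else:
        columns = rows[header]
        rows = rows[header + 1:]
    return columns, rows


with_header = "user_id\titem_id\n1\t10\n2\t20"
without_header = "1::10\n2::20"

cols_a, rows_a = read_table(with_header, "\t", header=0)
cols_b, rows_b = read_table(without_header, "::", header=None,
                            names=["user_id", "item_id"])
print(cols_a, rows_a)
print(cols_b, rows_b)
```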
user_feat_name
Name of the file that stores the user features. Example
user_feat_name: ml-100k.user
user_feat_field
The column name corresponding to the user feature in the above file. Example
user_feat_field: [[*u, age:token, gender:token, occupation:token, zip_code:token]]
user_feat_header
Row number to use as the column names; the data starts after that row. With user_feat_header=0, the column names are inferred from the first line of the user feature file above. If the file has no header row, use user_feat_header=None (written as ~ in YAML).
Example
user_feat_header: 0
item_feat_name
Name of the file that stores the item features. Example
item_feat_name: ml-100k.item
item_feat_field
The column names and types of the item features in the above file. Example
item_feat_field: [[*i, Title:token_seq:" ", Genres:token_seq:"|"]]
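Field entries in this file follow a name:type pattern, and token_seq entries in the full example carry a third part with the separator (for instance Genres:token_seq:"|"). A minimal sketch of splitting such an entry follows; parse_field is a hypothetical helper, and RecStudio's own parsing may differ.

```python
def parse_field(spec: str):
    """Hypothetical parser for 'name:type' or 'name:token_seq:"sep"' entries."""
    parts = spec.split(":", 2)  # at most three parts: name, type, separator
    name, ftype = parts[0], parts[1]
    sep = parts[2].strip('"') if len(parts) == 3 else None
    return name, ftype, sep


print(parse_field("UserID:token"))
print(parse_field('Genres:token_seq:"|"'))
```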
item_feat_header
Row number to use as the column names; the data starts after that row. With item_feat_header=0, the column names are inferred from the first line of the item feature file above. If the file has no header row, use item_feat_header=None (written as ~ in YAML).
Example
item_feat_header: 0
field_separator
Delimiter that separates the fields within each line of the dataset files. Example
field_separator: "\t"
min_user_inter
Lower bound on the number of items a user has interacted with. If the number of items a user has interacted with is less than this value, the user will be ignored. Example
min_user_inter: 0
min_item_inter
Lower bound on the number of users an item has interacted with. If the number of users an item has interacted with is less than this value, the item will be ignored. Example
min_item_inter: 0
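The two lower bounds can be sketched as a filter over (user, item) pairs. This is a single-pass illustration under stated assumptions: counts are taken once before filtering, and each pair appears at most once. Whether RecStudio repeats the filter until both bounds hold is not stated here. filter_min_inter is a hypothetical name.

```python
from collections import Counter


def filter_min_inter(interactions, min_user_inter=0, min_item_inter=0):
    """Keep pairs whose user and item both reach their lower bound (single pass)."""
    user_counts = Counter(u for u, _ in interactions)
    item_counts = Counter(i for _, i in interactions)
    return [
        (u, i)
        for u, i in interactions
        if user_counts[u] >= min_user_inter and item_counts[i] >= min_item_inter
    ]


pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
print(filter_min_inter(pairs, min_user_inter=2))
print(filter_min_inter(pairs, min_item_inter=2))
```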
field_max_len
Mapping from features to their maximum length. Example
field_max_len: [[rating:1]]
max_seq_len
Maximum length of sequential features. Example
max_seq_len: 20
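As a rough illustration of a length cap on a sequential feature, the sketch below keeps only the most recent entries of a user's history. Keeping the tail rather than the head is an assumption made for this example, and truncate_history is a hypothetical name.

```python
def truncate_history(items, max_seq_len=20):
    """Keep at most max_seq_len most recent items (items assumed chronological)."""
    if max_seq_len <= 0:
        return []
    return items[-max_seq_len:]


history = list(range(25))  # 25 made-up item ids, oldest first
print(truncate_history(history))
```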
rating_threshold
Threshold for filtering out interactions whose rating is below rating_threshold. If rating_threshold is not None, the rating of each interaction is replaced by a boolean value indicating whether the original rating is not less than rating_threshold. For example, if rating_threshold is 3, interaction records with a rating greater than or equal to 3 get the rating 1, and all other interaction records get the rating 0.
Example
rating_threshold: 0
drop_low_rating
If drop_low_rating is set to True, interaction records with ratings below rating_threshold are deleted directly, rather than being kept with their rating set to 0.
Example
drop_low_rating: False
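The interplay of rating_threshold and drop_low_rating can be sketched on a list of (user, item, rating) records. This is an illustration of the behavior described in the text, not RecStudio's implementation; apply_rating_threshold is a hypothetical name and the records are made up.

```python
def apply_rating_threshold(records, rating_threshold=None, drop_low_rating=False):
    """Binarize ratings at rating_threshold; optionally drop the low-rated records."""
    if rating_threshold is None:
        return list(records)  # no threshold: ratings stay untouched
    result = []
    for user, item, rating in records:
        label = 1 if rating >= rating_threshold else 0
        if label == 0 and drop_low_rating:
            continue  # deleted instead of being kept with rating 0
        result.append((user, item, label))
    return result


records = [("u1", "a", 5.0), ("u1", "b", 3.0), ("u2", "a", 2.0)]
print(apply_rating_threshold(records, rating_threshold=3))
print(apply_rating_threshold(records, rating_threshold=3, drop_low_rating=True))
```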
ranker_rating_threshold
For ranker models, interaction records with a rating greater than ranker_rating_threshold are used as positive samples (labeled as 1), and all others as negative samples (labeled as 0).
Example
ranker_rating_threshold: 3
The following parameters describe network features, such as social networks and knowledge graphs.
network_feat_name
Filenames where network features are stored. Example
network_feat_name: [[social.txt], [ml-100k.kg, ml-100k.link]]
mapped_feat_field
Features mapped in network features. Example
mapped_feat_field: [[*u, *u], [*i, ~, *i]]
network_feat_field
Feature names and types in the network feature files. Example
network_feat_field: [[[source_id:token, target_id:token]], [[head_id:token, relation_id:token, tail_id:token], [*i, entity_id:token]]]
network_feat_header
Row number to use as the column names; the data starts after that row. With network_feat_header=0, the column names are inferred from the first line of the network feature file above. If the file has no header row, use network_feat_header=None (written as ~ in YAML).
Example
network_feat_header: [[0], [0, 0]]