You do need a configuration file
When loading a model, you need to write a yaml file in advance giving the parameters needed to initialize the model.
Your yaml file should be named with the lowercase form of the value you pass to the "model" parameter, followed by the ".yaml" suffix, and stored in the directory recstudio/model/**submodule**/config (where **submodule** is one of ['ae', 'mf', 'seq', 'fm', 'graph', 'kg']).
Example
Your code:
from recstudio import quickstart
quickstart.run(model='LR', dataset='ml-1m', gpu=[2])
Then your yaml file should be named lr.yaml and stored in the directory recstudio/model/fm/config.
Your yaml configuration file needs to set the necessary parameters to initialize the model.
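The naming rule above can be sketched in a few lines of Python. The helper `expected_config_path` is hypothetical and only illustrates the convention described here; it is not part of the RecStudio API, and the submodule still has to be chosen by hand.

```python
from pathlib import PurePosixPath

# Submodule names listed in the text above.
SUBMODULES = ("ae", "mf", "seq", "fm", "graph", "kg")


def expected_config_path(model: str, submodule: str) -> str:
    """Return the path where the yaml file for `model` is expected.

    Hypothetical helper for illustration only.
    """
    if submodule not in SUBMODULES:
        raise ValueError(f"unknown submodule: {submodule!r}")
    # The file name is the lowercase model name plus the ".yaml" suffix.
    filename = f"{model.lower()}.yaml"
    return str(PurePosixPath("recstudio/model") / submodule / "config" / filename)


print(expected_config_path("LR", "fm"))  # recstudio/model/fm/config/lr.yaml
```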
How to configure the .yaml file
For the convenience of users, we provide default parameter configurations for all models, which are stored in recstudio/model/basemodel/basemodel.yaml:
# for training all models
learning_rate: 0.001
weight_decay: 0
learner: adam
scheduler: ~
epochs: 100
batch_size: 2048
num_workers: 0 # please do not use this parameter; it slows down the training process
gpu: [0,] # TODO: gpu=int: number of gpus, use free gpus; gpu=list: gpu ids
num_threads: 10
accelerator: gpu
seed: 2022
# used for training tower-based model
#ann: {index: 'IVFx,Flat', parameter: ~} ## 1 HNSWx,Flat; 2 Flat; 3 IVFx,Flat ## {nprobe: 1} {efSearch: 1}
ann: ~
sampling_method: none #[none, dns, brute, sir, toprand, top&rand, dns&rand]
# sampler: ~ # [uniform, popularity, midx_uni, midx_pop, cluster_uni, cluster_pop, retriever_ipts, retriever_dns]
# negative_count: 1
# sampling_method: ~
# sampling_temperature: 1.0
# excluding_hist: False
init_method: xavier_normal
init_range: ~
# the sampler is configured for dataset
dataset_sampler: ~
dataset_neg_count: ~
negative_count: ~ # negative sample number in training procedure
excluding_hist: False
embed_dim: 64
item_bias: False
# used for evaluating tower-based model
eval_batch_size: 128
split_ratio: [0.8,0.1,0.1]
test_metrics: [recall, precision, map, ndcg, mrr, hit]
val_metrics: [ndcg, recall]
topk: 100
cutoff: [10, 20, 5]
early_stop_mode: max
early_stop_patience: 10
save_path: './saved/'
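As a rough illustration of how defaults and a model-specific file can interact, the sketch below overlays a model configuration on top of a few of the defaults listed above. It assumes that a key in the model's yaml file simply overrides the default with the same name; the values in `model_config` and the helper `merge_config` are hypothetical, and the way RecStudio actually combines the files may differ.

```python
# A few defaults copied from basemodel.yaml above.
base_config = {
    "learning_rate": 0.001,
    "epochs": 100,
    "batch_size": 2048,
    "embed_dim": 64,
}

# Hypothetical contents of a model-specific yaml file.
model_config = {"learning_rate": 0.01, "embed_dim": 32}


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where keys in `override` replace those in `base`."""
    merged = dict(base)      # copy so the defaults stay untouched
    merged.update(override)  # model-specific values win
    return merged


config = merge_config(base_config, model_config)
print(config["learning_rate"], config["embed_dim"], config["epochs"])
```

Under this assumption, a model file only needs to list the parameters that differ from the defaults.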
The parameters are explained below.
TODO:
url
If you choose a dataset from the demos provided by RecStudio, your url should look like "recstudio/dataset_demo/your_dataset". Example
url: recstudio/dataset_demo/ml-100k
If you want to download and use a dataset from a web page, then your url should be the web page URL. Example
url: https://files.grouplens.org/datasets/movielens/ml-1m.zip
If you want to use a local dataset, then you should put the dataset folder into the RecStudio directory and fill in the url parameter with the relative path of the folder. Example
url: recstudio/my_dataset/ml-20m
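The three cases for url can be told apart with a small check. This is only a sketch of the distinction described above; `classify_url` is a hypothetical helper, and it assumes that demo datasets always live under recstudio/dataset_demo/ and that remote datasets use an http or https URL.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Classify a dataset url as 'remote', 'demo' or 'local'."""
    if urlparse(url).scheme in ("http", "https"):
        return "remote"  # dataset is downloaded from a web page
    if url.startswith("recstudio/dataset_demo/"):
        return "demo"    # dataset shipped with RecStudio
    return "local"       # relative path inside the RecStudio directory


for url in (
    "recstudio/dataset_demo/ml-100k",
    "https://files.grouplens.org/datasets/movielens/ml-1m.zip",
    "recstudio/my_dataset/ml-20m",
):
    print(url, "->", classify_url(url))
```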
user_id_field
Feature name and type of user IDs in the dataset. Example
user_id_field: &u user_id:token
item_id_field
Feature name and type of item IDs in the dataset. Example
item_id_field: &i item_id:token
rating_field
Feature name and type of the user-item interaction ratings in the dataset. Example
rating_field: &r rating:float
time_field
Feature name and type of the timestamp at which the user-item interaction occurred in the dataset. Example
time_field: &t timestamp:float
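The field entries above all follow a "name:type" pattern. The sketch below splits such an entry into its two parts; `Field` and `parse_field` are hypothetical names used only for illustration and do not reflect how RecStudio parses these values internally.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str   # column name, e.g. "user_id"
    dtype: str  # declared type, e.g. "token" or "float"


def parse_field(spec: str) -> Field:
    """Split a 'name:type' entry at the first colon."""
    name, sep, dtype = spec.partition(":")
    if not sep or not name or not dtype:
        raise ValueError(f"expected 'name:type', got {spec!r}")
    return Field(name, dtype)


for spec in ("user_id:token", "item_id:token", "rating:float", "timestamp:float"):
    print(parse_field(spec))
```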
time_format
The format of the timestamps of user-item interactions in the dataset. Example
time_format: "%Y-%m-%dT%H:%M:%Sz"
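The example value looks like a strftime/strptime pattern. Assuming that is how it is interpreted (the source does not say so explicitly), a timestamp string in that shape could be parsed as follows; the sample timestamp and the helper `parse_time` are made up for illustration.

```python
from datetime import datetime

# Format string taken from the example above; the trailing "z" is a literal character.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%Sz"


def parse_time(value: str, time_format: str = TIME_FORMAT) -> datetime:
    """Parse a timestamp string according to `time_format`."""
    return datetime.strptime(value, time_format)


parsed = parse_time("1998-04-04T12:30:45z")  # hypothetical sample value
print(parsed.isoformat())
```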
encoding_method
The encoding format of the dataset. Example
encoding_method: utf-8
save_cache
Whether to save processed dataset to cache. Example
save_cache: True
inter_feat_name
Name of the file that stores the user-item interaction features. Example
inter_feat_name: ml-100k.inter
inter_feat_field
The column names and types of the user-item interaction features in the above file. Example
inter_feat_field: [*u, *i, *r, *t]
inter_feat_header
Row number of the line that holds the column names; the data starts after that line. By default the column names are inferred: if no names are passed, the behavior is identical to inter_feat_header=0 and the column names are read from the first line of the above interaction feature file. If the file has no header line, set inter_feat_header=None.
Example
inter_feat_header: 0
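The difference between a header value of 0 and None can be shown with the standard csv module. This is a minimal sketch that assumes the header parameter follows the same convention as the `header` argument of pandas `read_csv`; `read_table` is a hypothetical helper, not RecStudio code.

```python
import csv
import io


def read_table(text, names, header=0, sep="\t"):
    """Read delimited text into a list of dicts keyed by `names`.

    header=0    -> line 0 holds column names and is skipped
    header=None -> the file has no header line; every line is data
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header is not None:
        rows = rows[header + 1:]  # data starts after the header line
    return [dict(zip(names, row)) for row in rows]


names = ["user_id", "item_id", "rating", "timestamp"]
with_header = "user_id\titem_id\trating\ttimestamp\n1\t10\t5\t100\n"
without_header = "1\t10\t5\t100\n"

print(read_table(with_header, names, header=0))
print(read_table(without_header, names, header=None))
```

Both calls yield the same single record; passing header=None for a file that does contain a header line would instead treat that line as data.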
user_feat_name
Name of the file that stores the user features. Example
user_feat_name: ml-100k.user
user_feat_field
The column names and types of the user features in the above file. Example
user_feat_field: [[*u, age:token, gender:token, occupation:token, zip_code:token]]
user_feat_header
Row number of the line that holds the column names; the data starts after that line. By default the column names are inferred: if no names are passed, the behavior is identical to user_feat_header=0 and the column names are read from the first line of the above user feature file. If the file has no header line, set user_feat_header=None.
Example
user_feat_header: 0
item_feat_name
Name of the file that stores the item features. Example
item_feat_name: ml-100k.item
item_feat_field
The column names and types of the item features in the above file. Example
item_feat_field: [[*i, movie_title:token_seq:" ", release_year:token, class:token_seq:" "]]
item_feat_header
Row number of the line that holds the column names; the data starts after that line. By default the column names are inferred: if no names are passed, the behavior is identical to item_feat_header=0 and the column names are read from the first line of the above item feature file. If the file has no header line, set item_feat_header=None.
Example
item_feat_header: 0
field_separator
Delimiter that separates the fields within each line of the dataset files. Example
field_separator: "\t"
min_user_inter
Lower bound on the number of items a user has interacted with. If the number of items a user has interacted with is less than this value, the user will be ignored. Example
min_user_inter: 0
min_item_inter
Lower bound on the number of users an item has interacted with. If the number of users an item has interacted with is less than this value, the item will be ignored. Example
min_item_inter: 0
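The two lower bounds can be pictured as a filter over (user, item) pairs. The sketch below is an assumption about the behavior, not RecStudio's implementation: it counts distinct items per user and distinct users per item once, on the unfiltered data, and then applies both bounds in a single pass. `filter_interactions` is a hypothetical helper.

```python
from collections import Counter


def filter_interactions(interactions, min_user_inter=0, min_item_inter=0):
    """Drop interactions of users and items that fall below the bounds."""
    pairs = set(interactions)  # count each (user, item) pair once
    user_counts = Counter(user for user, _ in pairs)
    item_counts = Counter(item for _, item in pairs)
    return [
        (user, item)
        for user, item in interactions
        if user_counts[user] >= min_user_inter
        and item_counts[item] >= min_item_inter
    ]


interactions = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
print(filter_interactions(interactions, min_user_inter=2))
print(filter_interactions(interactions, min_item_inter=2))
```

With both bounds left at 0, as in the examples above, nothing is removed.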
field_max_len
Mapping from features to their maximum length. Example
field_max_len: [[rating:1]]
max_seq_len
Maximum length of sequential features. Example
max_seq_len: 20
rating_threshold
Threshold applied to the interaction ratings. If rating_threshold is not None, the rating of each interaction is replaced by a boolean value indicating whether the original rating is not less than rating_threshold. For example, if rating_threshold is 3, interactions with a rating greater than or equal to 3 have their rating set to 1, and all other interactions have their rating set to 0.
Example
rating_threshold: 0
drop_low_rating
If drop_low_rating is set to True, interaction records with ratings below rating_threshold are deleted directly, rather than being kept with their rating set to 0.
Example
drop_low_rating: False
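The interplay of rating_threshold and drop_low_rating can be sketched as follows. This follows the description in the text and is not taken from the RecStudio source; `apply_rating_threshold` and the sample records are hypothetical.

```python
def apply_rating_threshold(records, rating_threshold=None, drop_low_rating=False):
    """Binarize ratings of (user, item, rating) records.

    rating_threshold=None leaves the records unchanged.
    drop_low_rating=True removes records below the threshold
    instead of keeping them with rating 0.
    """
    if rating_threshold is None:
        return list(records)
    result = []
    for user, item, rating in records:
        keep = rating >= rating_threshold
        if not keep and drop_low_rating:
            continue  # delete the low-rated record entirely
        result.append((user, item, 1 if keep else 0))
    return result


records = [("u1", "a", 5.0), ("u1", "b", 2.0), ("u2", "a", 3.0)]
print(apply_rating_threshold(records, rating_threshold=3))
print(apply_rating_threshold(records, rating_threshold=3, drop_low_rating=True))
```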
ranker_rating_threshold
For ranker models, interaction records with a rating greater than ranker_rating_threshold are treated as positive samples (labeled 1), and all others as negative samples (labeled 0).
Example
ranker_rating_threshold: 3
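Read literally, the description uses a strict comparison here, whereas rating_threshold uses "greater than or equal to". The sketch below takes the wording at face value; whether the library really uses a strict comparison is an assumption, and `ranker_labels` is a hypothetical helper.

```python
def ranker_labels(ratings, ranker_rating_threshold):
    """Label ratings strictly above the threshold as 1, all others as 0."""
    return [1 if rating > ranker_rating_threshold else 0 for rating in ratings]


print(ranker_labels([1, 2, 3, 4, 5], ranker_rating_threshold=3))  # [0, 0, 0, 1, 1]
```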
The following parameters configure network features, including social networks and knowledge graphs.
network_feat_name
Filenames where network features are stored. Example
network_feat_name: [[social.txt], [ml-100k.kg, ml-100k.link]]
mapped_feat_field
Features mapped in network features. Example
mapped_feat_field: [[*u, *u], [*i, ~, *i]]
network_feat_field
Field names and types of the network features in the above files. Example
network_feat_field: [[[source_id:token, target_id:token]], [[head_id:token, relation_id:token, tail_id:token], [*i, entity_id:token]]]
network_feat_header
Row number of the line that holds the column names; the data starts after that line. By default the column names are inferred: if no names are passed, the behavior is identical to network_feat_header=0 and the column names are read from the first line of the above network feature file. If the file has no header line, set network_feat_header=None.
Example
network_feat_header: [[0], [0, 0]]