Title: | Model Selection Based on Machine Learning (ML) |
---|---|
Description: | Model evaluation based on a modified version of the recursive feature elimination algorithm. This package is designed to determine the optimal model(s) by leveraging all available features. |
Authors: | Hong Lee [aut, cph], Moksedul Momin [aut, cre, cph] |
Maintainer: | Moksedul Momin <[email protected]> |
License: | GPL (>=3) |
Version: | 1.0.0.1 |
Built: | 2024-10-27 04:33:50 UTC |
Source: | https://github.com/mommy003/msml |
A dataset containing N sets of covariates (N=3 in this example) that are held constant across all model configurations when using a training dataset (see the 'model_configuration2' function). If constant covariates are not required, this file is unnecessary (see the 'model_configuration' function).
cov_train
A data frame for training dataset:
covariate 1
covariate 2
covariate 3
A dataset containing N sets of covariates (N=3 in this example) that are held constant across all model configurations when using a validation dataset (see the 'model_configuration2' function). If constant covariates are not required, this file is unnecessary (see the 'model_configuration' function).
cov_valid
A data frame for validation dataset:
covariate 1
covariate 2
covariate 3
A dataset containing 7 sets of PRSs for test dataset and target phenotype
data_test
A data frame for test dataset:
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
Phenotypic values
A dataset containing 7 sets of PRSs for training dataset and target phenotype
data_train
A data frame for training dataset:
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
Phenotypic values
A dataset containing 7 sets of PRSs for validation dataset and target phenotype
data_valid
A data frame for validation dataset:
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
Phenotypic values
This function generates predicted values for the validation dataset by applying, for each model configuration, the optimal feature weights estimated in the training dataset. The total number of model configurations is the sum of the binomial coefficients C(n, k) ('n choose k') over k = 1 to n, where 'n' denotes the total number of features and 'k' the number of features included in a model. For example, with n=7, the total number of model configurations is 2^7 - 1 = 127.
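The count described above can be checked in base R. The following is a minimal sketch, not code from the package; the variable names `n_configurations` and `feature_subsets` are chosen here for illustration only.

```r
# Count and enumerate all non-empty feature subsets for n = 7 features.
n <- 7

# Number of models of each size k = 1..n, and their total.
per_size <- choose(n, 1:n)
n_configurations <- sum(per_size)

# The same total follows from the identity sum_{k=1}^{n} C(n, k) = 2^n - 1.
stopifnot(n_configurations == 2^n - 1)

# Enumerate the subsets themselves as a flat list of index vectors.
feature_subsets <- unlist(
  lapply(1:n, function(k) combn(n, k, simplify = FALSE)),
  recursive = FALSE
)

per_size                  # 7 21 35 35 21 7 1
n_configurations          # 127
length(feature_subsets)   # 127
```

Because the count grows as 2^n - 1, each additional feature roughly doubles the number of models to be fitted.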
model_configuration(data_train, data_valid, mv, model = "lm")
data_train | The training dataset, supplied as a data frame in matrix format |
data_valid | The validation dataset, supplied as a data frame in matrix format |
mv | The total number of columns in data_train/data_valid |
model | The type of model, e.g. "lm" (default) or "glm" |
This function will generate all possible model outcomes for the validation and test datasets.
data_train <- data_train
data_valid <- data_valid
mv <- 8
out <- model_configuration(data_train, data_valid, mv, model = "lm")
# This process produces predicted values for the validation dataset,
# corresponding to each model configuration trained on the training dataset.
# The output contains the variables 'predict_validation'
# and 'total_model_configurations'.
# To print the outcomes, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
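To make the idea of "one model per feature subset" concrete, the following base-R sketch fits a linear model on simulated training data for every subset and predicts into simulated validation data. It is an illustration under stated assumptions, not the package's implementation: the data are simulated, only 3 features are used to keep the output small, and all names (`make_data`, `toy_train`, `toy_valid`, `toy_predict_validation`) are hypothetical.

```r
# Illustrative sketch only: NOT the msml implementation.
set.seed(1)
n_feat <- 3  # small n so that only 2^3 - 1 = 7 models are fitted

# Simulate features V1..V3 and a phenotype that depends on V1 and V2.
make_data <- function(n_obs) {
  x <- matrix(rnorm(n_obs * n_feat), n_obs, n_feat,
              dimnames = list(NULL, paste0("V", 1:n_feat)))
  y <- as.vector(x %*% c(0.5, 0.3, 0) + rnorm(n_obs))
  data.frame(x, phenotype = y)
}
toy_train <- make_data(200)
toy_valid <- make_data(100)

# All non-empty feature subsets, as a flat list of index vectors.
subsets <- unlist(
  lapply(1:n_feat, function(k) combn(n_feat, k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset: estimate weights in the training data,
# then apply them to the validation data.
predictions <- sapply(subsets, function(idx) {
  vars <- names(toy_train)[idx]
  fit <- lm(reformulate(vars, response = "phenotype"), data = toy_train)
  predict(fit, newdata = toy_valid)
})

# Phenotype first, then one column of predicted values per configuration.
toy_predict_validation <- cbind(phenotype = toy_valid$phenotype, predictions)
dim(toy_predict_validation)
```

The resulting matrix follows the layout documented for the `predict_validation` dataset: the phenotype in the first column, followed by one column per model configuration.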
This function is similar to the model_configuration function, with the added capability to keep a set of covariates constant across all models during training and prediction (see the cov_train and cov_valid datasets). Additionally, users can choose between linear and logistic regression models.
model_configuration2( data_train, data_valid, mv, cov_train, cov_valid, model = "lm" )
data_train | The training dataset, supplied as a data frame in matrix format |
data_valid | The validation dataset, supplied as a data frame in matrix format |
mv | The total number of columns in data_train/data_valid |
cov_train | The covariates for the training dataset, supplied as a data frame in matrix format |
cov_valid | The covariates for the validation dataset, supplied as a data frame in matrix format |
model | The type of model, e.g. "lm" (default) or "glm" (logistic regression) |
This function will generate all possible model outcomes for the validation and test datasets.
data_train <- data_train
data_valid <- data_valid
mv <- 8
cov_train <- cov_train
cov_valid <- cov_valid
out <- model_configuration2(data_train, data_valid, mv, cov_train, cov_valid, model = "lm")
# This process produces predicted values for the validation dataset,
# corresponding to each model configuration trained on the training dataset.
# The output contains the variables 'predict_validation'
# and 'total_model_configurations'.
# To print the outcomes, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
# If a user intends to employ logistic regression without constant covariates,
# we advise preparing a covariate file where all values are set to 1.
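The effect of constant covariates can be sketched in base R as follows: the feature subset changes from model to model, while the same covariates enter every model. This is a hedged illustration with a simulated binary phenotype and logistic regression, not the package's implementation. For brevity the covariates are stored in the same data frame as the features rather than passed separately, and all names (`make_data`, `toy_train`, `toy_valid`, `constant_covs`, `cov1`, `cov2`) are hypothetical.

```r
# Illustrative sketch only: NOT the msml implementation.
set.seed(2)
n_feat <- 3

# Simulate features V1..V3, covariates cov1/cov2, and a binary phenotype.
make_data <- function(n_obs) {
  feats <- matrix(rnorm(n_obs * n_feat), n_obs, n_feat,
                  dimnames = list(NULL, paste0("V", 1:n_feat)))
  covs <- matrix(rnorm(n_obs * 2), n_obs, 2,
                 dimnames = list(NULL, c("cov1", "cov2")))
  liability <- feats %*% c(0.8, 0.4, 0) + 0.5 * covs[, "cov1"] + rnorm(n_obs)
  data.frame(feats, covs, phenotype = as.integer(liability > 1))
}
toy_train <- make_data(500)
toy_valid <- make_data(200)
constant_covs <- c("cov1", "cov2")  # included in every model

subsets <- unlist(
  lapply(1:n_feat, function(k) combn(n_feat, k, simplify = FALSE)),
  recursive = FALSE
)

# Logistic regression per feature subset, always adding the constant covariates.
predictions <- sapply(subsets, function(idx) {
  vars <- c(paste0("V", idx), constant_covs)
  fit <- glm(reformulate(vars, response = "phenotype"),
             family = binomial(), data = toy_train)
  predict(fit, newdata = toy_valid, type = "response")
})

dim(predictions)  # one column of predicted probabilities per configuration
```

With `type = "response"`, `predict` returns probabilities on the 0 to 1 scale rather than log-odds.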
This function will identify the best model(s) in the validation and test datasets.
model_evaluation(dat, mv, tn, prev, pthreshold = 0.05, method = "R2ROC")
dat | The predicted values for all model configurations, supplied as a data frame in matrix format |
mv | The total number of columns in data_train/data_valid |
tn | The total number of best models to be identified |
prev | The prevalence of disease in the data |
pthreshold | The significance p value threshold when comparing models (default 0.05) |
method | The method used to evaluate models, e.g. "R2ROC" (default) or "r2redux" |
This function will generate all possible model outcomes for the validation and test datasets.
dat <- predict_validation
mv <- 8
tn <- 15
prev <- 0.047
out <- model_evaluation(dat, mv, tn, prev)
# This process generates three outputs.
# out$out_all contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for all models.
# out$out_start contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the top tn models.
# out$out_selected contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the best models. It also includes the selected features for each model.
# For details, see https://github.com/mommy003/MSML.
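As a rough picture of what ranking model configurations by R2 and AUC involves, the base-R sketch below scores each column of predicted values against a simulated binary phenotype and keeps the top `tn` models. It is an assumption-laden illustration: it computes no p values, does not use the R2ROC or r2redux methods, and does not perform the package's model comparison. The data and the names `toy_dat`, `auc`, `summary_table` and `top_models` are hypothetical.

```r
# Illustrative sketch only: no p values, and not the R2ROC/r2redux methods.
set.seed(3)
n_obs <- 300
phenotype <- rbinom(n_obs, 1, 0.3)

# Phenotype first, then one column of predicted values per configuration.
toy_dat <- cbind(
  phenotype = phenotype,
  config1 = phenotype + rnorm(n_obs, sd = 1),  # informative predictor
  config2 = phenotype + rnorm(n_obs, sd = 3),  # weakly informative predictor
  config3 = rnorm(n_obs)                       # uninformative predictor
)

# AUC via the rank-sum (Mann-Whitney) identity.
auc <- function(y, score) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y <- toy_dat[, 1]
summary_table <- data.frame(
  configuration = colnames(toy_dat)[-1],
  R2  = apply(toy_dat[, -1], 2, function(p) cor(y, p)^2),
  AUC = apply(toy_dat[, -1], 2, function(p) auc(y, p))
)

# Keep the tn configurations with the highest R2.
tn <- 2
top_models <- head(summary_table[order(-summary_table$R2), ], tn)
top_models
```

Ranking on point estimates alone ignores sampling error, which is why the function documented above also reports p values and applies `pthreshold` when comparing models.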
A dataset containing target phenotype and 127 sets of model configurations based on validation dataset
predict_validation
A data frame for predicted values for target dataset from model configurations_test:
Phenotypic values in target dataset
predicted values for target dataset from model configuration1
predicted values for target dataset from model configuration2
predicted values for target dataset from model configuration3
predicted values for target dataset from model configuration4
predicted values for target dataset from model configuration5
predicted values for target dataset from model configuration6
predicted values for target dataset from model configuration7
predicted values for target dataset from model configuration8
predicted values for target dataset from model configuration9
predicted values for target dataset from model configuration10
predicted values for target dataset from model configuration11
predicted values for target dataset from model configuration12
predicted values for target dataset from model configuration13
predicted values for target dataset from model configuration14
predicted values for target dataset from model configuration15
predicted values for target dataset from model configuration16
predicted values for target dataset from model configuration17
predicted values for target dataset from model configuration18
predicted values for target dataset from model configuration19
predicted values for target dataset from model configuration20
predicted values for target dataset from model configuration21
predicted values for target dataset from model configuration22
predicted values for target dataset from model configuration23
predicted values for target dataset from model configuration24
predicted values for target dataset from model configuration25
predicted values for target dataset from model configuration26
predicted values for target dataset from model configuration27
predicted values for target dataset from model configuration28
predicted values for target dataset from model configuration29
predicted values for target dataset from model configuration30
predicted values for target dataset from model configuration31
predicted values for target dataset from model configuration32
predicted values for target dataset from model configuration33
predicted values for target dataset from model configuration34
predicted values for target dataset from model configuration35
predicted values for target dataset from model configuration36
predicted values for target dataset from model configuration37
predicted values for target dataset from model configuration38
predicted values for target dataset from model configuration39
predicted values for target dataset from model configuration40
predicted values for target dataset from model configuration41
predicted values for target dataset from model configuration42
predicted values for target dataset from model configuration43
predicted values for target dataset from model configuration44
predicted values for target dataset from model configuration45
predicted values for target dataset from model configuration46
predicted values for target dataset from model configuration47
predicted values for target dataset from model configuration48
predicted values for target dataset from model configuration49
predicted values for target dataset from model configuration50
predicted values for target dataset from model configuration51
predicted values for target dataset from model configuration52
predicted values for target dataset from model configuration53
predicted values for target dataset from model configuration54
predicted values for target dataset from model configuration55
predicted values for target dataset from model configuration56
predicted values for target dataset from model configuration57
predicted values for target dataset from model configuration58
predicted values for target dataset from model configuration59
predicted values for target dataset from model configuration60
predicted values for target dataset from model configuration61
predicted values for target dataset from model configuration62
predicted values for target dataset from model configuration63
predicted values for target dataset from model configuration64
predicted values for target dataset from model configuration65
predicted values for target dataset from model configuration66
predicted values for target dataset from model configuration67
predicted values for target dataset from model configuration68
predicted values for target dataset from model configuration69
predicted values for target dataset from model configuration70
predicted values for target dataset from model configuration71
predicted values for target dataset from model configuration72
predicted values for target dataset from model configuration73
predicted values for target dataset from model configuration74
predicted values for target dataset from model configuration75
predicted values for target dataset from model configuration76
predicted values for target dataset from model configuration77
predicted values for target dataset from model configuration78
predicted values for target dataset from model configuration79
predicted values for target dataset from model configuration80
predicted values for target dataset from model configuration81
predicted values for target dataset from model configuration82
predicted values for target dataset from model configuration83
predicted values for target dataset from model configuration84
predicted values for target dataset from model configuration85
predicted values for target dataset from model configuration86
predicted values for target dataset from model configuration87
predicted values for target dataset from model configuration88
predicted values for target dataset from model configuration89
predicted values for target dataset from model configuration90
predicted values for target dataset from model configuration91
predicted values for target dataset from model configuration92
predicted values for target dataset from model configuration93
predicted values for target dataset from model configuration94
predicted values for target dataset from model configuration95
predicted values for target dataset from model configuration96
predicted values for target dataset from model configuration97
predicted values for target dataset from model configuration98
predicted values for target dataset from model configuration99
predicted values for target dataset from model configuration100
predicted values for target dataset from model configuration101
predicted values for target dataset from model configuration102
predicted values for target dataset from model configuration103
predicted values for target dataset from model configuration104
predicted values for target dataset from model configuration105
predicted values for target dataset from model configuration106
predicted values for target dataset from model configuration107
predicted values for target dataset from model configuration108
predicted values for target dataset from model configuration109
predicted values for target dataset from model configuration110
predicted values for target dataset from model configuration111
predicted values for target dataset from model configuration112
predicted values for target dataset from model configuration113
predicted values for target dataset from model configuration114
predicted values for target dataset from model configuration115
predicted values for target dataset from model configuration116
predicted values for target dataset from model configuration117
predicted values for target dataset from model configuration118
predicted values for target dataset from model configuration119
predicted values for target dataset from model configuration120
predicted values for target dataset from model configuration121
predicted values for target dataset from model configuration122
predicted values for target dataset from model configuration123
predicted values for target dataset from model configuration124
predicted values for target dataset from model configuration125
predicted values for target dataset from model configuration126
predicted values for target dataset from model configuration127