
  • How to do train/test and cross validation error analysis

     kikike updated 4 years ago 2 Members · 9 Posts
  • kikike

    Member
    November 3, 2020 at 8:02 pm

    I am confused about how to do train/test and cross-validation error analysis for linear regression. :)

    Can anyone explain how to assess predictive performance (train/test and cross-validation)? Many thanks!

    I googled online; if we only have two parts, train/test, I am not sure whether the following steps are correct:

    1. EDA on all the data.

    2. Use the training data for model selection and diagnostics to arrive at a final model.

    3. Use the final model to assess prediction performance on the test data. How do I assess it? Calculate the MSE between the fitted values and the actual values? (A sketch is below.)

    Can anyone also explain how to assess predictive performance using cross-validation?
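
    A minimal R sketch of steps 2-3, assuming a data frame df whose response column is y (both names are hypothetical):

    set.seed(1)                                    # reproducible split
    n <- nrow(df)                                  # df and y are hypothetical names
    train_idx <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]
    fit   <- lm(y ~ ., data = train)               # model selection/diagnostics use train only
    pred  <- predict(fit, newdata = test)          # predict on the held-out test set
    mean((test$y - pred)^2)                        # test MSE: mean squared prediction error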

  • Justin

    Administrator
    November 3, 2020 at 9:12 pm

    This is a good question. I know many people are confused about it.

    To be precise, for proper validation we actually need to split the data into 3 parts:

    1. Train: for model building.

    2. Validate: for tuning and comparing candidate models.

    3. Hold-out: for the final, unbiased assessment of the chosen model.

    The suggested partition is Train : Validate : Hold-out = 60 : 20 : 20; see the sketch below.

    However, many people skip the hold-out partition and only use Train : Validate = 70:30 or 60:40. That is fine, provided you do out-of-time validation instead.
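
    A minimal R sketch of the 60/20/20 split, assuming a data frame df (hypothetical name):

    set.seed(1)
    n   <- nrow(df)                                              # df is a hypothetical data frame
    idx <- sample(n)                                             # shuffle the row indices
    train    <- df[idx[1:floor(0.6 * n)], ]                      # first 60%
    validate <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]   # next 20%
    holdout  <- df[idx[(floor(0.8 * n) + 1):n], ]                # final 20%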

    Does this surprise or confuse you further?

    • kikike

      Member
      November 3, 2020 at 9:32 pm

      Thanks for the quick reply. :)

      So we use the training data to find the final model, and use the validation set to assess predictive performance?

      I saw the following R code; is it used to assess predictive performance?

      library(boot)  # cv.glm() comes from the boot package
      set.seed(1)    # set the random number generator seed
      cv.error.10 <- cv.glm(validate, finalmodel, K = 10)  # here, does validate mean the validate dataset?
      paste("The cv error for final model is", signif(cv.error.10$delta[1], digits = 3))

      • Justin

        Administrator
        November 3, 2020 at 9:34 pm

        No, this is a different cross-validation method, called k-fold cross-validation. Google it.

        • kikike

          Member
          November 3, 2020 at 10:11 pm

          Many thanks. So for k-fold cross-validation, how do we split the data for model building and for assessing predictive performance?

          • Justin

            Administrator
            November 4, 2020 at 12:14 am

            Below is something from the internet:

            k-Fold Cross-Validation

            Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

            The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

            Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

            It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

            The general procedure is as follows:

            1. Shuffle the dataset randomly.
            2. Split the dataset into k groups
            3. For each unique group:
              1. Take the group as a hold out or test data set
              2. Take the remaining groups as a training data set
              3. Fit a model on the training set and evaluate it on the test set
              4. Retain the evaluation score and discard the model
            4. Summarize the skill of the model using the sample of model evaluation scores

            Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold-out set 1 time and used to train the model k-1 times.
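
            A minimal R sketch of this procedure for a linear regression, assuming a data frame df with response column y (hypothetical names):

            set.seed(1)
            k <- 10
            n <- nrow(df)                              # df and y are hypothetical names
            folds <- sample(rep(1:k, length.out = n))  # randomly assign each row to one of k groups
            fold_mse <- numeric(k)
            for (i in 1:k) {
              train <- df[folds != i, ]                # k-1 groups form the training set
              test  <- df[folds == i, ]                # the remaining group is held out
              fit   <- lm(y ~ ., data = train)         # fit on the training set
              pred  <- predict(fit, newdata = test)    # evaluate on the held-out group
              fold_mse[i] <- mean((test$y - pred)^2)   # retain the evaluation score
            }
            mean(fold_mse)                             # summarize: the cross-validation error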

            • kikike

              Member
              November 4, 2020 at 11:40 am

              Appreciate the detailed information. Yes, that is the theory of k-fold cross-validation.

              I think my question confused you.

              My question is:

              1. We already have a model (we need a dataset to find our final model; let's say this process uses dataset A).

              2. Then we need to assess the predictive performance of our model. We can use k-fold cross-validation to do it. During this process we still need a dataset; let's say this process uses dataset B.

              The question is: what is dataset A, and what is dataset B?

              For example, do A and B come from the same original dataset? Let's say 70% of the original dataset is used for A, and 30% for B?

              That is what I want to ask; I hope it is much clearer now.

            • Justin

              Administrator
              November 4, 2020 at 12:14 pm

              If you have the model built already, it is just model validation, not cross-validation. You don't need to split the data; just score the model on the whole validation dataset, as in the sketch below.
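
              A minimal R sketch, assuming finalmodel is the already-fitted model and new_data is a data frame with the same columns, including the response y (hypothetical names):

              pred <- predict(finalmodel, newdata = new_data)  # score the prebuilt model
              mse  <- mean((new_data$y - pred)^2)              # mean squared prediction error
              sqrt(mse)                                        # RMSE, in the units of y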

            • kikike

              Member
              November 4, 2020 at 1:17 pm

              Got it. :) Many thanks!
