
  • How to do train/test and cross validation error analysis

     kikike updated 4 years ago 2 Members · 9 Posts
  • kikike

    Member
    November 3, 2020 at 8:02 pm

    I am confused about how to do train/test and cross-validation error analysis for linear regression. :)

    Can anyone explain how to assess predictive performance (train/test and cross-validation)? Many thanks!

    I googled online; if we only have two parts, train/test, I am not sure whether the following steps are correct:

    1. EDA on all the data.

    2. Use the training data for model selection and diagnostics to arrive at a final model.

    3. Use the final model to assess prediction performance on the test data. How do I assess it? Calculate the MSE between the fitted values and the actual values? (A sketch is below.)

    Can anyone also explain how to assess predictive performance using cross-validation?
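
    A minimal R sketch of steps 2-3, assuming a data frame df whose response column is y (both names are hypothetical):

    set.seed(1)                                    # reproducible split
    n <- nrow(df)                                  # df and y are hypothetical names
    train_idx <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]
    fit   <- lm(y ~ ., data = train)               # model selection/diagnostics use train only
    pred  <- predict(fit, newdata = test)          # predict on the held-out test set
    mean((test$y - pred)^2)                        # test MSE: mean squared prediction error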

  • Justin

    Administrator
    November 3, 2020 at 9:12 pm

    This is a good question. I know many people are confused about it.

    To be precise, for proper validation we actually need to split the data into 3 parts:

    1. Train: for model building.

    2. Validate: for tuning and comparing candidate models.

    3. Hold-out: for the final, unbiased assessment of the chosen model.

    The suggested partition is Train : Validate : Hold-out = 60 : 20 : 20; see the sketch below.

    However, many people skip the hold-out partition and only use Train : Validate = 70:30 or 60:40. That is fine, provided you do out-of-time validation instead.
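
    A minimal R sketch of the 60/20/20 split, assuming a data frame df (hypothetical name):

    set.seed(1)
    n   <- nrow(df)                                              # df is a hypothetical data frame
    idx <- sample(n)                                             # shuffle the row indices
    train    <- df[idx[1:floor(0.6 * n)], ]                      # first 60%
    validate <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]   # next 20%
    holdout  <- df[idx[(floor(0.8 * n) + 1):n], ]                # final 20%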

    Does this surprise or confuse you further?

    • kikike

      Member
      November 3, 2020 at 9:32 pm

      Thanks for the quick reply. :)

      So we use the training data to find the final model, and use the validation set to assess predictive performance?

      I saw the following R code; is it used to assess predictive performance?

      library(boot)  # cv.glm() comes from the boot package
      set.seed(1)    # set the random number generator seed
      cv.error.10 <- cv.glm(validate, finalmodel, K = 10)  # here, does validate mean the validate dataset?
      paste("The cv error for final model is", signif(cv.error.10$delta[1], digits = 3))

      • Justin

        Administrator
        November 3, 2020 at 9:34 pm

        No, this is a different cross-validation method, called k-fold cross-validation. Google it.

        • kikike

          Member
          November 3, 2020 at 10:11 pm

          Many thanks. So for k-fold cross-validation, how do we split the data for model building and for assessing predictive performance?

          • Justin

            Administrator
            November 4, 2020 at 12:14 am

            Below is something from the internet:

            k-Fold Cross-Validation

            Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

            The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

            Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

            It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

            The general procedure is as follows:

            1. Shuffle the dataset randomly.
            2. Split the dataset into k groups
            3. For each unique group:
              1. Take the group as a hold out or test data set
              2. Take the remaining groups as a training data set
              3. Fit a model on the training set and evaluate it on the test set
              4. Retain the evaluation score and discard the model
            4. Summarize the skill of the model using the sample of model evaluation scores

            Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold-out set 1 time and used to train the model k-1 times.
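
            A minimal R sketch of this procedure for a linear regression, assuming a data frame df with response column y (hypothetical names):

            set.seed(1)
            k <- 10
            n <- nrow(df)                              # df and y are hypothetical names
            folds <- sample(rep(1:k, length.out = n))  # randomly assign each row to one of k groups
            fold_mse <- numeric(k)
            for (i in 1:k) {
              train <- df[folds != i, ]                # k-1 groups form the training set
              test  <- df[folds == i, ]                # the remaining group is held out
              fit   <- lm(y ~ ., data = train)         # fit on the training set
              pred  <- predict(fit, newdata = test)    # evaluate on the held-out group
              fold_mse[i] <- mean((test$y - pred)^2)   # retain the evaluation score
            }
            mean(fold_mse)                             # summarize: the cross-validation error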

            • kikike

              Member
              November 4, 2020 at 11:40 am

              Appreciate the detailed information. Yes, that is the theory of k-fold cross-validation.

              I think my question confused you.

              My question is:

              1. We already have a model (we need a dataset to find our final model; let's say this process uses dataset A).

              2. Then we need to assess the predictive performance of our model. We can use k-fold cross-validation to do it. During this process we still need a dataset; let's say this process uses dataset B.

              The question is: what is dataset A, and what is dataset B?

              For example, do A and B come from the same original dataset? Let's say 70% of the original dataset is used for A, and 30% for B?

              That is what I want to ask; I hope it is much clearer now.

            • Justin

              Administrator
              November 4, 2020 at 12:14 pm

              If you have the model built already, it is just model validation, not cross-validation. You don't need to split the data; just score the model on the whole validation dataset, as in the sketch below.
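
              A minimal R sketch, assuming finalmodel is the already-fitted model and new_data is a data frame with the same columns, including the response y (hypothetical names):

              pred <- predict(finalmodel, newdata = new_data)  # score the prebuilt model
              mse  <- mean((new_data$y - pred)^2)              # mean squared prediction error
              sqrt(mse)                                        # RMSE, in the units of y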

            • kikike

              Member
              November 4, 2020 at 1:17 pm

              Got it. :) Many thanks!
