-
How to do train/test and cross validation error analysis
-
I am confused about how to do train/test and cross-validation error analysis for linear regression :)
Can anyone help explain how to assess predictive performance (train/test and cross-validation)? Many thanks!
I googled it. If we only have two parts, train and test, I am not sure whether the following steps are correct:
1. EDA on all the data.
2. Use the training data for model selection and diagnostics to get the final model.
3. Use the final model to assess prediction performance on the test data. How do we assess it? By calculating the MSE between the predicted values and the actual values?
Can anyone also explain how to assess predictive performance using cross-validation?
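A minimal sketch of steps 2 and 3 above in R. The simulated data, the single predictor `x`, and the 70:30 split are assumptions made up for illustration, not part of the original question:

```r
# Simulated data stands in for the real dataset (an assumption for illustration).
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 1)

# Hold out 30% of the rows as test data; the rest is training data.
test_idx <- sample(n, size = round(0.3 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Model selection and diagnostics would happen on the training data only.
fit <- lm(y ~ x, data = train)

# Training MSE: fitted values versus actual values on data the model has seen.
train_mse <- mean((train$y - fitted(fit))^2)

# Test MSE: predictions versus actual values on data the model has not seen.
test_mse <- mean((test$y - predict(fit, newdata = test))^2)

c(train = train_mse, test = test_mse)
```

The test MSE is the number that speaks to predictive performance. The training MSE is usually somewhat lower because the model was fitted to those same rows.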
-
This is a good question. I know many people are confused about it.
To be accurate, for a proper assessment we actually need to split the data into 3 parts:
1. Train: for model building.
2. Validate: for comparing candidate models and fine-tuning the chosen one if needed.
3. Hold-out: set aside until the end and used only once, to estimate how the final model performs on unseen data.
The suggested partition is Train:Validate:Hold-out = 60:20:20.
However, many people skip the hold-out partition and use only Train:Validate = 70:30 or 60:40. Skipping it is fine if you do out-of-time validation instead, i.e. test the model on data collected in a later period than the training data.
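As a rough illustration, a random 60:20:20 partition could be made like this in R. The placeholder data frame `dat` and the partition labels are assumptions for the sketch:

```r
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n))  # placeholder data, for illustration only

# Build 60 + 20 + 20 partition labels, then shuffle them so rows are assigned at random.
part <- sample(rep(c("train", "validate", "holdout"), times = c(60, 20, 20)))

# split() returns one data frame per label.
parts <- split(dat, part)
train    <- parts$train
validate <- parts$validate
holdout  <- parts$holdout

sapply(parts, nrow)
```

Dropping the third partition only means changing the labels and counts, for example `times = c(70, 30)` with two labels.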
Does this surprise or confuse you further?
- This reply was modified 4 years, 2 months ago by Justin.
-
Thanks for the quick reply :)
So we use the training data to find the final model, and use the validation data to assess predictive performance?
I saw the following R code. Is it used to assess predictive performance?
library(boot)  # cv.glm comes from the boot package
set.seed(1)  # set the random number generator seed
cv.error.10 = cv.glm(validate, finalmodel, K = 10)  # here, does validate mean the validation dataset?
paste("The cv error for final model is", signif(cv.error.10$delta[1], digits = 3))
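For context, here is a small self-contained sketch of how `cv.glm` from the `boot` package is typically called. The simulated data frame `dat` and the model `y ~ x` are assumptions for illustration. The package documentation describes the second argument as a `glm` object fitted to the data frame given as the first argument, and `cv.glm` refits that model on the folds of that data frame:

```r
library(boot)  # provides cv.glm; boot is a recommended package bundled with standard R

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))  # simulated data, for illustration only
dat$y <- 2 + 3 * dat$x + rnorm(n)

# glm() with the default gaussian family fits the same linear model as lm(),
# but returns the object class that cv.glm() expects.
fit <- glm(y ~ x, data = dat)

# cv.glm() splits `dat` into K folds, refits the model on K-1 folds each time,
# and predicts the left-out fold. The default cost is mean squared error.
cv_out <- cv.glm(dat, fit, K = 10)

cv_out$delta[1]  # raw 10-fold CV estimate of prediction error (MSE here)
cv_out$delta[2]  # bias-adjusted version of the same estimate
```

So whichever data frame is passed as the first argument is the one that gets split into folds.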
-
Below is something from the internet:
“k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups.
- For each unique group:
  - Take the group as a hold out or test data set.
  - Take the remaining groups as a training data set.
  - Fit a model on the training set and evaluate it on the test set.
  - Retain the evaluation score and discard the model.
- Summarize the skill of the model using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times”
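The quoted procedure can be sketched by hand in base R. This is only an illustration: the simulated data, the linear model `y ~ x`, the choice of k = 5, and MSE as the evaluation score are all assumptions:

```r
set.seed(1)
n <- 100
dat <- data.frame(x = runif(n, 0, 10))  # simulated data, for illustration only
dat$y <- 2 + 3 * dat$x + rnorm(n)

k <- 5

# Shuffle and split: each row gets one fold label and keeps it for the whole procedure.
fold <- sample(rep(1:k, length.out = n))

# For each fold: hold it out, fit on the remaining folds, score on the held-out rows.
fold_mse <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)  # keep the score, discard the model
})

# Summarise the k evaluation scores.
mean(fold_mse)
sd(fold_mse)
```

Each row appears in the held-out fold exactly once and in the training folds k-1 times, which matches the last paragraph of the quote.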
-
I appreciate the detailed information. Yes, that is the theory of k-fold cross-validation.
I think my question was confusing.
My question is:
1. We already have a model (we need a dataset to find our final model; let's say we use dataset A for this process).
2. Then we need to assess the predictive performance of our model. We can use k-fold cross-validation to do it. This process also needs a dataset; let's say we use dataset B.
The question is: what is dataset A, and what is dataset B?
For example, do A and B come from the same original dataset? Let's say 70% of the original dataset is used for A and 30% for B?
That is what I want to ask. I hope it is much clearer now.