justin
-
It is true. This is one of the distinct differences between Call Symput() and Select Into when creating macro variables, so the way we reference the resulting macro variables is different too. Please see the syntax below.
data CCC;
set BBB;
where Birth_Date< &MV_1 ;
where Birth_Date< "&MV_2"n ;
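run;
Please note: your code will fail if you do NOT reference them correctly. Use only one of the two WHERE statements in a given step, since a second WHERE statement replaces the first. Be careful: details are critically important in programming.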
- This reply was modified 4 years, 2 months ago by Justin.
-
In this case, we use an F-test to check this categorical predictor: test the reduced model (without the predictor) against the full model, and if the predictor is insignificant, simply drop it.
If it is significant overall but a few levels are NOT significant, we can regroup it by combining the insignificant levels, which often makes more sense for this variable.
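The reduced-versus-full comparison can be sketched in plain Python for the simplest case, where the categorical predictor is the only predictor, so the reduced model is intercept-only. This is a minimal illustration under that assumption; the function name and the data are hypothetical, and in practice the resulting F value would be compared against the F distribution with the returned degrees of freedom.

```python
def partial_f_one_factor(groups):
    """F statistic for dropping a single categorical predictor.

    groups maps each level to its list of response values.
    Reduced model: intercept only. Full model: one mean per level.
    """
    values = [y for ys in groups.values() for y in ys]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Residual sum of squares of the reduced (intercept-only) model.
    rss_reduced = sum((y - grand_mean) ** 2 for y in values)

    # Residual sum of squares of the full model (a separate mean per level).
    rss_full = 0.0
    for ys in groups.values():
        level_mean = sum(ys) / len(ys)
        rss_full += sum((y - level_mean) ** 2 for y in ys)

    df_num = k - 1   # extra parameters in the full model
    df_den = n - k   # residual degrees of freedom of the full model
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical data: response values for three levels of one predictor.
example = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [7, 8, 9]}
f_stat, df_num, df_den = partial_f_one_factor(example)
print(f_stat, df_num, df_den)
```

A large F relative to the critical value suggests keeping the predictor; a small one supports dropping it.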
-
This is a good question. I know many people are confused about it.
To be accurate, for cross validation, we actually need to split the data into 3 parts:
1. Train: for model building.
2. Validate: for model validation.
3. Hold-out: use it to fine-tune the model if needed.
The suggested partition is Train : Validate : Hold-out = 60:20:20.
However, many people skip the hold-out partition and only use Train : Validate = 70:30 or 60:40. Skipping it is fine if you do out-of-time validation instead.
Does this surprise or confuse you further?
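For reference, the 60:20:20 partition can be sketched with the standard library alone. This is only an illustration: the function name, the seed, and the stand-in data are assumptions, not part of the original discussion.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows and cut them into train, validate and hold-out parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_valid]
    holdout = shuffled[n_train + n_valid:]   # whatever remains
    return train, validate, holdout

rows = list(range(100))  # hypothetical stand-in for the real observations
train, validate, holdout = three_way_split(rows)
print(len(train), len(validate), len(holdout))
```

Passing `fractions=(0.7, 0.3, 0.0)` gives the two-way 70:30 split with an empty hold-out part.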
- This reply was modified 4 years, 2 months ago by Justin.
-
Thanks to all for the contributions and discussion, very useful and helpful.
Based on Yi’s contribution, we can go one step further. If the names of the newly created data frames don’t follow any common pattern, we can use the method below to create all of them.
### create a list of data frame names.
dflist = ['Core', 'Spouse', 'Education', 'Work', 'Certificate', 'Additional']
### loop over each name and bind it to the matching element of raw
### (raw is assumed to be the list of data frames built earlier in the thread)
for i in range(len(dflist)):
    print("loop round: ", i)
    globals()[dflist[i]] = raw[i]
print(Work)
It will create the 6 data frames: Core, Spouse… Additional. Building idea on idea, this is a more generic way to do it.
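As a possible alternative to writing into globals(), the same names can be kept as keys of a dictionary. This is only a sketch: the strings in `raw` are hypothetical stand-ins for the six data frames, and `frames` is a name introduced here for illustration.

```python
dflist = ['Core', 'Spouse', 'Education', 'Work', 'Certificate', 'Additional']

# Hypothetical stand-ins; in the thread, raw would hold the six data frames.
raw = [f"<data frame {i}>" for i in range(len(dflist))]

# Pair each name with the element at the same position.
frames = dict(zip(dflist, raw))
print(frames['Work'])
```

The trade-off is that you write `frames['Work']` instead of `Work`, but no variables are created behind the scenes.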
-
One way is to use the pandas package: its read_html() function reads HTML tables into a list of data frames. It is very simple and convenient. Any other method?
import requests
import pandas as pd
url = 'https://www.welcomebc.ca/Immigrate-to-B-C/B-C-Provincial-Nominee-Program/Invitations-to-Apply'
html = requests.get(url).content
html
tables = pd.read_html(html)
tables
df = tables[1]
df
df.to_excel('/kaggle/working/skilled.xlsx', sheet_name='skilled', index = False)
Also, I want to extract the table names from the web content:
Table 1: Skills Immigration and Express Entry BC
Table 2: Entrepreneur Immigration
How can I grab them and assign them to each list element?
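One possible approach, sketched with the standard library only: remember the most recent heading and attach it to the next table. It assumes each title sits in a heading tag directly before its table, which may not match the real page; the sample markup and the class name below are hypothetical.

```python
from html.parser import HTMLParser

class TableTitleParser(HTMLParser):
    """Remember the latest heading text and attach it to the next <table>."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.last_heading = ""
        self.titles = []   # one entry per <table>, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.in_heading = True
            self.last_heading = ""
        elif tag == "table":
            self.titles.append(self.last_heading.strip())

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.last_heading += data

# Hypothetical markup, not copied from the real page.
sample_html = """
<h3>Table 1: Skills Immigration and Express Entry BC</h3>
<table><tr><td>1</td></tr></table>
<h3>Table 2: Entrepreneur Immigration</h3>
<table><tr><td>2</td></tr></table>
"""

parser = TableTitleParser()
parser.feed(sample_html)
print(parser.titles)
```

If the titles and the list returned by read_html() come out in the same order and with the same length, `dict(zip(parser.titles, tables))` would pair them up; that needs checking against the actual page.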
-
Not actually!
In my opinion, SAS is commercial software, widely used in the banking, telecom and government sectors. So it is useful to learn if you want to seek a career there.
R and Python are quite similar and both are open source tools. Grasping one is sufficient.
One commercial tool plus one open source tool is enough for most data analysis. What do you think?
-
In a data step, if we use multiple SET statements rather than one SET statement, the outcome is overwriting rather than appending: the observations read by the later SET overwrite those read by the previous one.
Another key point is: when will it stop, and how many observations will it produce? Consider the example below, where A and B have exactly the same variables.
data C;
set A; * 5 records;
set B; * 10 records;
run;
data D;
set B; * 10 records;
set A; * 5 records;
run;
Both data C and D have only 5 records, but the records are different. Why?
As you remember, each data set has an end-of-file indicator (which can be monitored with the END=EOF option). The data step stops executing as soon as any of the data sets reaches the end of the file. In the above case, data A has only 5 obs, so it always reaches the end first and determines the final number of observations in the output data set: 5 observations! In summary, if you have multiple SET statements, the number of output observations is always determined by the data set with the fewest observations.
However, although the number of observations is the same in data C and D, the records are different, because the later SET always overwrites the previous one: the ORDER does matter!
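The behaviour can be mimicked in Python as a rough analogy. This is not SAS and not how the data step is implemented: the helper name and the toy records are hypothetical, and it assumes both inputs have identical variables.

```python
def multi_set(first, second):
    """Mimic `set first; set second;` for data sets with identical variables."""
    output = []
    # zip stops at the shorter input, like the step stopping at the first end of file.
    for rec_first, rec_second in zip(first, second):
        row = dict(rec_first)
        row.update(rec_second)   # values from the later SET overwrite the earlier ones
        output.append(row)
    return output

A = [{"id": i, "src": "A"} for i in range(1, 6)]    # 5 records
B = [{"id": i, "src": "B"} for i in range(1, 11)]   # 10 records

C = multi_set(A, B)   # like: set A; set B;
D = multi_set(B, A)   # like: set B; set A;
print(len(C), len(D))
print(C[0], D[0])
```

Both results have 5 rows, but C carries the values from B and D carries the values from A.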
-
Below is something from the internet:
“k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
  - Take the group as a hold out or test data set
  - Take the remaining groups as a training data set
  - Fit a model on the training set and evaluate it on the test set
  - Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times”
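The quoted procedure can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names, the toy "model" that predicts the training mean, and the data are all assumptions made for the example.

```python
import random

def k_fold_scores(data, k, fit, score, seed=0):
    """Return one evaluation score per fold."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)         # shuffle the dataset
    folds = [shuffled[i::k] for i in range(k)]    # split into k groups
    scores = []
    for i, test_fold in enumerate(folds):
        # Every group except the current one forms the training set.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)                        # fit on the other k-1 groups
        scores.append(score(model, test_fold))    # evaluate on the held-out group
    return scores                                 # each model is discarded after scoring

# Hypothetical toy "model": predict the training mean, score by mean squared error.
def fit_mean(train):
    return sum(train) / len(train)

def mse(model, test):
    return sum((y - model) ** 2 for y in test) / len(test)

data = [float(v) for v in range(20)]
scores = k_fold_scores(data, k=5, fit=fit_mean, score=mse)
print(len(scores), sum(scores) / len(scores))
```

Each observation lands in exactly one fold, so it is tested once and used for training k-1 times, as the quote describes.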