Forum Replies Created

Viewing 1 - 15 of 30 posts
  • Justin

    Administrator
    July 7, 2022 at 9:37 am
    Up
    0
    Down

    ### 3) Deactivation forecasting

    ######### Install the pmdarima library to perform time series analysis.
    !pip install pmdarima
    !pip install matplotlib
    !pip install statsmodels

    # !pip uninstall statsmodels -y
    # !pip install statsmodels==0.11.0

    # Import the library
    from pmdarima import auto_arima
    from pmdarima.arima import ADFTest
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Ignore harmless warnings
    import warnings
    warnings.filterwarnings(“ignore”)

    # Fit auto_arima function to deactivation dataset

    churn = impute[impute[‘active’]==0].groupby([‘deact_yymm’]).agg( churn=(‘acctno’, ‘nunique’) )
    churn.info()

    churn[‘yymm_day’]= churn.index + ’01’
    churn

    # ETS Decomposition
    result = seasonal_decompose(churn[‘churn’],
    model =’multiplicative’, period=1)

    # ETS plot
    result.plot()

    ### Test for Stationarity: H0: unit root is present (non-stationary). H1: the series is stationary. ###########################
    # In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis
    # of a unit root is present in a time series sample. The alternative hypothesis is different depending
    # on which version of the test is used, but is usually stationarity or trend-stationarity.
    # It is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models.

    ### should_diff(x): Test whether the time series is stationary or needs differencing.

    ADF_Test=ADFTest(alpha=0.05)
    pval, should_diff= ADF_Test.should_diff(churn[‘churn’])
    print(‘p-value={}, should differencing = {}’.format(pval, should_diff))

    # Test results: p-value=0.9770785716280405, should differencing = True.
    # It is non stationary because p-value>0.05. True means that it needs differencing.

    stepwise_fit = auto_arima(churn[‘churn’],
    start_p = 1, start_q = 1,
    max_p = 3, max_q = 3, m = 12,
    start_P = 0, seasonal = True,
    d = None, D = 1, trace = True,
    error_action =’ignore’, # we don’t want to know if an order does not work
    suppress_warnings = True, # we don’t want convergence warnings
    stepwise = True) # set to stepwise

    # To print the summary
    stepwise_fit.summary()

    # Best model: ARIMA(2,0,1)(0,1,0)[12] intercept
    # Total fit time: 6.703 seconds
    # ARIMA(2,0,1)(0,1,0)[12]: AIC=197.856, Time=0.13 sec

    # Fit the best model
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(churn[‘churn’],
    order = (2,0,1),
    seasonal_order =(0,1,0,12) )

    result = model.fit()
    result.summary()

    ##### Use the time series model to forecast for next 6 months. ###############
    ### predict() function: generate in-sample predictions from the fit ARIMA model.

    forecast = result.predict(start = len(churn), #### start=25, len(churn)=25
    end = (len(churn)-1) + 6, #### end=25-1+6= 30.
    typ = ‘levels’).rename(‘Forecast’)

    # Plot the forecast values
    churn[‘churn’].plot(figsize = (12, 5), legend = True)
    forecast.plot(legend = True)

  • Justin

    Administrator
    August 3, 2021 at 7:07 pm

    proc sql;
    create table CCC as
    select a.*,
    b.Year as Year_2,
    c.Year as Year_3

    from AAA a inner join AAA b
    on a.Name=b.Name and a.Year=b.Year +1
    inner join AAA c
    on a.Name=c.Name and a.Year=c.Year +2
    order by a.Name, a.Year;

    create table Output as

    select distinct Name

    from CCC;

    run;

  • Justin

    Administrator
    February 7, 2021 at 5:25 pm

    Method 2: Use ODS tagsets.ExcelXP destination

    Microsoft Excel uses the general format when importing data values that do not have an Excel format applied. This general format attempts to determine how the value should be formatted. For example, an account number that is not alpha numeric and has leading zeroes will lose the leading zero. The
    same problem occurs when you type a value in a cell of Excel.

    To get around this problem, the Excel Text format can be applied so that the value comes over the same way it was displayed in SAS. How this is done depends on the ODS destination that you use to generate the file. If the ExcelXP destination is used, the Excel text format can be applied using the
    format: parameter within the TAGATTR= attribute. Using the MSOffice2k/HTML/HTML3 destinations, the Microsoft Office CSS style property mso-number-format can be used with the same text format.

    Example:

     data one;
    input account name $;
    cards;
    023456 Bob
    054556 Henry
    034456 Wes
    ;
    run;

     /* ExcelXP destination */
    ods tagsets.ExcelXP file="temp.xls";
    proc print data=one;
    var name;
    var account / style(data)={tagattr="format:@"};
    format account z6.;
    run;
    Ods tagsets.ExcelXP close;

     /* MSOffice2k destination */
    ods MSoffice2k file="temp1.xls";
    proc print data=one;
    var name;
    var account / style(data)={htmlstyle="mso-number-format:\@"};
    format account z6.;
    run;
    Ods msoffice2k close;

    Note: The above options work with tagsets.ExcelXP, MSOffice2k, but does not work with ODS HTML. Please refer below for more information: http://support.sas.com/kb/32/414.html

  • Justin

    Administrator
    February 7, 2021 at 4:51 pm

    Method A:

     proc sort data=test;
    by ID;
    run;

     data B;
    set test;
    by ID;
    retain Z1-Z4;
    if not missing(V1) then Z1=V1;
    if not missing(V2) then Z2=V2;
    if not missing(V3) then Z3=V3;
    if not missing(V4) then Z4=V4;
    if last.ID;
    drop V1-V4;
    rename Z1=V1 Z2=V2 Z3=V3 Z4=V4;
    run;

    Method B:

    proc sql;
    select ID,
    max(V1) as V1,
    max(V2) as V2,
    max(V3) as V3,
    max(V4) as V4
    from test 
    group by ID;
    quit;

    Method C:

    %let NM=is not missing;
    data one(keep=id v1 v2 where=(v1 &NM))
    two(keep=v3 where=(v3 &NM))
    three(keep=v4 where=(v4 &NM));
    set test;
    ;
    run;

    data all;
    set one;
    set two;
    set three;
    run;
    %let NM=is not missing;
    data test2;
    merge test(keep=id v1 v2 where=(v1 &NM))
    test(keep=v3 where=(v3 &NM))
    test(keep=v4 where=(v4 &NM));
    ;
    run;

    Method D

     %macro test;
    proc sql;
    create table var_ls as
    select distinct name
    from dictionary.columns
    where libname='WORK' and memname='TEST' and upcase(name) ne 'ID';
    quit;

    data _null_;
    set var_ls;
    call symput(cats('var',_n_),name);
    call symput('cnt',_n_);
    run;

    proc sql;
    create table test2 as
    select id
    %do i = 1 %to &cnt;
    , max(&&var&i) as &&var&i
    %end;

    from test
    group by id;
    quit;
    %mend;
    %test;

    • This reply was modified 3 years, 10 months ago by  Justin.
  • Justin

    Administrator
    February 7, 2021 at 4:31 pm

    CALL MISSING Routine

    Assigns missing values to the specified character
    or numeric variables.

    Syntax
    CALL MISSING(variable-name-1 <, variable-name-2, …>);

    Required Argument
    variable-name

    specifies the name of SAS character or numeric variables.

    Details
    The CALL MISSING routine assigns an ordinary numeric missing value (.) to each numeric variable in the argument list.

    The CALL MISSING routine assigns a character missing value (a blank) to each character variable in the argument list. If the current length of the character variable equals the maximum length, the current length is not changed. Otherwise, the current length is set to 1.

    You can mix character and numeric variables in the argument list.

    Comparisons
    The MISSING function checks whether the argument has a missing value but does not change the value of the argument.

    data one;
    prod='shoes';
    invty=7498;
    sales=23759;
    call missing(prod, invty);
    put prod= invty= sales=;
    run;
    data one;
    prod='shoes';
    invty=7498;
    sales=23759;
    call missing(of _all_ );
    put prod= invty= sales=;
    run;

    The above code set the specified variables to missing values, including both numeric and char variables.

    • This reply was modified 3 years, 10 months ago by  Justin.
  • Justin

    Administrator
    February 7, 2021 at 3:53 pm

    Method B:

    Use SUM statement to create a counter variable CNT, then split the dataset into 2 parts based on CNT=0 or not. Then merge them back together. Drop and rename vars if needed. See below:

    data W;
    length CNT 3 New $20 ;
    set Alice;
    retain New ' ' ;
    if Name =' ' then CNT +1;
    else CNT=0;
    if CNT=0 then New=Name;
    run;

    data Old(drop=CNT);
    set W;
    where CNT=0;
    run;

    proc sort data=Old; by New; run;
    proc sort data=W(keep= New Name Q5) out= New;
    by New;
    run;
    data final;
    merge old new;
    by New;
    run;

    Method 3: Use By—NotSorted option.

     data B;
    length New $10 Pre $10;
    set Alice;
    by Name notsorted;
    retain Pre 'AAA';
    F=first.Name;
    L=last.Name;
    Flag=F+L;
    if flag=2 and not missing(Name) then New=Name;
    else New=Pre;
    Pre=New;
    run;

     data Old;
    set B;
    if F=1 and not missing(Name) then output Old;
    run;

    Then merge the OLD with B by New.

    Reach-out:

    Some people may not have SAS in their company. If so, can you only use SQL to do it? It will be much more challenging in coding. But no pain, no gain!

    • This reply was modified 3 years, 10 months ago by  Justin.
  • Justin

    Administrator
    February 7, 2021 at 3:48 pm

    Method A: Retain + Merge

    filename Alice 'file path\alice.xls';
    proc import datafile=Alice out=alice(rename=(Name_ = Name)) dbms=Excel replace;
    run;
    *************** Condition: By default, the blank in Name means it has same value  as the previous obs. ************;

    data B(drop=pre);
    length Pre $20 New_Name $20 ;
    retain Pre ' ';
    set Alice;

    if Name =" " then New_Name=Pre;
    else New_Name=Name;
    Pre=New_Name;
    run;

    data First_Name ;
    set B;
    if New_Name = Name;
    run;

    proc sort data=First_Name; by name; run;
    proc sort data=B(keep=New_Name Q5) out=Full ;
    by New_Name;
    run;

    data final(drop=New_Name);
    length name $20;
    merge First_Name full(rename=(new_name=name)) ;
    by name;
    run;

  • Justin

    Administrator
    February 6, 2021 at 11:27 pm

    3. IFN and IFC: Conditional Processing: Equivalent to ifelse() function in R.

    Syntax
    IFN (logical-expression, value-returned-when-true, value-returned-when-false
    , <,value-returned-when-missing>)

    Details

    The IFN function uses conditional logic that enables you to select among several values based on the value of a logical expression.

    IFN evaluates the first argument, then logical-expression. If logical-expression is true (that is, not zero and not missing), then IFN returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFN returns the value in the fourth argument. Otherwise, if logical-expression is false, IFN returns the value in the third argument.

    The IFN function, an IF/THEN/ELSE construct, or a WHERE statement can produce the same results. (See Examples.) However, the IFN function is useful in DATA step expressions when it is not convenient or possible to use an IF/THEN/ELSE construct or a WHERE statement.

    data _null_;
    input TotalSales;
    commission= IFN (TotalSales > 10000, TotalSales*.05, TotalSales*.02);
    put commission=;
    datalines;
    25000
    10000
    .
    0
    663
    .
    500
    ;
    run;

    IFC: Character Function

    Details
    Length of Returned Variable
    In a DATA step, if the IFC function returns a value to a variable that has not previously been assigned a length, then that variable is given a length of 200 bytes.

    The Basics
    The IFC function uses conditional logic that enables you to select among several values based on the value of a logical expression.

    IFC evaluates the first argument, logical-expression. If logical-expression is true (that is, not zero and not missing), then IFC returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFC returns the value in the fourth argument. Otherwise, if logical-expression is false, IFC returns the value in the third argument.

    The IFC function is useful in DATA step expressions, and even more useful in WHERE clauses and other expressions where it is not convenient or possible to use an IF/THEN/ELSE construct.

    data B;
    length name $12 grade 3 performance $30;
    input name $ grade;
    performance = IFC(grade GE 60, 'Pass', 'Fail! Work hard...', 'NA');
    * performance = IFC( substr (name,1,1)='K', 'First name starts with K', 'Others');
    datalines;
    Ted 60
    John 74
    Kareem 89
    . 65
    Jack 53
    Peter .
    Amanda 99
    Smith 0
    . 55
    Sara .
    Kati 35
    Maria 92
    ;
    run;

    Note: In IFN and IFC, a logic expression can have 3 levels:
    1 = true
    0 = false
    (.) = missing when specified.

    You can use functions in the logic expression, such as Substr, Put, Scan etc…

  • Justin

    Administrator
    February 6, 2021 at 6:13 pm

    The COALESCE function in SQL returns the first non-NULL expression among its arguments. It is the same as the following CASE WHEN statement:

     SELECT 
    CASE ("column_name") WHEN "expression 1 is not NULL" THEN "expression 1"
    WHEN "expression 2 is not NULL" THEN "expression 2"
    ...
    [ELSE "NULL"]
    END
    FROM "table_name"

    For examples, say we have the following table Table Contact_Info
    Name Business_Phone Cell_Phone Home_Phone
    Jeff 531-2531 622-7813 565-9901
    Laura NULL 772-5588 312-4088
    Peter NULL NULL 594-7477

    and we want to find out the best way to contact each person according to the following rules:
    1. If a person has a business phone, use the business phone number.
    2. If a person does not have a business phone and has a cell phone, use the cell phone number.
    3. If a person does not have a business phone, does not have a cell phone, and has a home phone, use the home phone number.

    We can use the COALESCE function to achieve our goal:

    SELECT Name, 
    COALESCE(Business_Phone, Cell_Phone, Home_Phone) as Contact_Phone
    FROM Contact_Info;

    Result:
    Name Contact_Phone
    Jeff 531-2531
    Laura 772-5588
    Peter 594-7477

    They do the same job as the below SAS code:

    If Business_phone is not null then contact_phone=business_phone;
    Else if cell_phone is not null then contact_phone=cell_phone;
    Else if home_phone is not null then contact_phone=home_phone;
    Else contact_phone=”N/A”;

    Isn’t it cool? Then practice and use it!

  • Justin

    Administrator
    January 22, 2021 at 11:24 am

    It is very easy to count the total number of missing values in a data frame with R and Python. Below is the code. Suppose we have a data frame named df.

    R code

    sum( is.na(df) )  
    df[ is.na(df) ]<- -999 ### we can replace all the missing values in this way.

    Python code: we can use either isnull() or is.na() function. Please note: isna()/notna() are same as isnull()/notnull() in Python.

    ### Below code counts the number of missing values of each column.
    df.isnull().sum()
    df.isna().sum()

    ### Below code counts the number of missing values of the whole data frame.
    df.isnull().sum().sum()
    df.isna().sum().sum()

    For SAS programming, it is not an easy job. Some people may think of using macro to do it. SAS macro will work it out certainly, but it is too much work. My suggestion is to use macro only when you have to. Actually, we have a much easier and better solution if we use the CMISS() function in SAS. Below gives the SAS code.

    /*cmiss() function: This function counts the number of missing arguments.*/
    /* argument*: specifies a constant, variable, or expression. Argument can be either a character value or a numeric value.*/

    data AAA;
    set P1.New_Wireless End=EOF ;
    miss=CMISS(of acctno -- sales);
    Total_Miss + Miss;
    if EOF=1 then output;
    run;
    ************* The correct answer is 371,468 for the above input data. ********;

    The most challenging task is to use SQL to do it. It is very complicated and we may have to use the “information schema” to do it. Does any one have a better solution?

    • This reply was modified 3 years, 11 months ago by  Justin.
  • Justin

    Administrator
    November 26, 2020 at 4:21 pm
    Up
    0
    Down

    Generally, you should remove the null values and use non-equal variances.

  • Justin

    Administrator
    November 28, 2020 at 9:03 pm
    Up
    0
    Down

    Yes, correct!

  • Justin

    Administrator
    November 26, 2020 at 4:23 pm
    Up
    0
    Down

    It’s too complicated, we have a better way.

  • Justin

    Administrator
    November 26, 2020 at 4:22 pm
    Up
    0
    Down

    Not a sum, just a sequence ID. In SAS, we can use the SUM statement to create a sequence ID or a cumulative sum.

  • Justin

    Administrator
    November 25, 2020 at 7:42 pm

    Yes, you are correct. Proc Sort includes one more step in sorting data, its output is also based on the original sequence of the input data.

    • This reply was modified 4 years ago by  Justin.
Viewing 1 - 15 of 30 posts
error: Content is protected !!