23 May 2006. The MAKEDATA program.


The MAKEDATA GAUSS program lets you
   * Create GAUSS datasets from text (ASCII) files, and from EXCEL files.
   * Create a new GAUSS dataset, containing transformations of data in an old GAUSS dataset.

Contents:

     I. Synopsis
   IIa. CREATE: Create a new dataset from data in an existing dataset.
   IIb. ATOG  : Convert a text data file into a GAUSS dataset
   IIc. MERGE : Merge two gauss dataset sets
   IId. XVC   : Compute and save means, and variance-covariance, for data in "classes"
   III. Examples


         ----------------------------------
I. Synopsis

To use MAKEDATA, you must create a "GRBL2-style" input file.

These input files are comprised of "keyphrases".
 
  The syntax of these keyphrases is:

     KEYWORD option_list ;

  where:
    Keyword: one of several keywords understood by MAKEDATA
    option_list: a list of one or more space delimited options.
       The syntax of these options depends on the keyword.
  Notes: 
     * comments are enclosed between ampersand characters.
     * each keyphrase MUST end with a semi-colon 
     * keywords and options are case-INsensitive
     * For a complete description of GRBL2 input files, see in GRBL2_BATCH.TXT

MAKEDATA supports four basic actions:

   CREATE: Create a new dataset from data in an existing dataset.
   ATOG  : Convert a text data file (or EXCEL spreadsheet) into a GAUSS dataset
   MERGE : Merge two GAUSS datasets.
    XVC  : Compute and save means, and variance-covariance, for data in "classes"


In addition, you specify where to write a "report" to.
   OUTPUT : Name of a file to write a report of results to 

  
In addition, a number of other keyphrases are supported, depending on which of the four
basic actions you have running.


   --------------------------------

II. Description of CREATE, ATOG, MERGE, and XVC.

   --------------------------------


IIa. CREATE: Create a new dataset from data in an existing dataset.

  CREATE is used to create a new dataset from an existing dataset.
  For example, CREATE allows you to
    * You can create new variables
    * Given the values of existing variabiles, conditionally create new variables,
      or drop observations
    * Create ID (identifier) variables
    * Create "dummy" variables.

  Note that CREATE will read data from the GAUSS dataset specified in a INPUT keyphrase.
  It will write the new data file to the file specified in an OUTDATA keyphrase.

  For example:
        CREATE ;
          INPUT mydata0;
          OUTDATA mydata1 ;

      ... CREATE options (as described below)  ...
   RUN;

  Hints:
     *  for complicated new variables, you can run CREATE several times in succession.


  The following keywords are used in CREATE.
      BY     : Compute statistics across a "panel", rather than entire dataset
      DROP   : Drop (do not include) variables
      DUMMY  : Create one (or several) vectors of "dummy" variables
      ID, ID2, ID3 : Create "ID" variables
      INPUT  : The name of the input GAUSS dataset (it will be read)
      NEW    : Create one (or several) new  variables (using a specified equation)
      NEW  MATRIX   : Create one (or several) temporary matrices (using a specified equation)
      NEW IF : Conditionally create variable, drop variable, or drop observation
      PRINT  : Write a line of text and data values to the output file (used with NEW IF)
      OUTDATA : Name of GAUSS dataset to create
      OUTPUT : Name of a file to write a report of results to
      RUN    : Do the action (you must specify RUN at the "end" of the list of CREATE options).
      SAVE   : Save a matrix to a .FMT file
      SORT   : Sort all observations
      X      : Existing variables to retain

  The details follow ....


   BY  Compute statistics across panels

       BY is used to divide the dataset into panels.  Statistical, and data, functions will then operate on the
       data in each panel, rather than across all rows of the dataset.

       Syntax:
            BY varname ;
       or
            BY numeric_value ;

       where VARNAME is one of the variables in the dataset (it can NOT be a value you create with NEW).

       If varname is used, panels are formed from all contiguous rows sharing the same value (of varname).
       If numeric_value is used, panels consist of subsets of rows.
       Thus, BY 10  means rows 1 to 10 are in the first panel, 11-20 the second, etc.

       Note that BY will NOT sort the observations -- thus if non-contiguous blocks of
       rows (blocks seperated by other rows) have the same value of the BY variable,
       each block will be treated as seperate panels.

          If this is a problem, you should create a temporary dataset using SORT, and then
          use BY (on this temporary dataset).

       When BY is specified, the statistical functions (when used in "observation mode") will compute
       their values over the panel, not over the entire dataset

      The statistical and data functions that BY effects are:
            MIN, MAX, MEAN, MEDIAN, PROD, SUM, VAR, QUANTILE, N, NMISS, VAL.
      To reiterate, these functions are effected only when used in the "by-observation" mode (that is,
      when you only specify one argument).

      CLASS is also effected -- CLASS values will be unique to the BY group only!

      Notes:
        * When using a statistical function (across observations) and BY, all rows in the panel
          will be given the same value.

        * Extraction of an element, say  x[1,3], is relative to the BY group.
          Thus, if you want to give every row, in a by group, the value of a variable in the first row:
               BY  MYID ;
               NEW INCOME=MYINC[1,1];

        * If you use ID to create a new variable, it does NOT need to be the same size as BY.
          (though it seems odd to create a "balanced" panel using ID that is different than
          the panel specified in BY).

        * Caution:

             BY does NOT work with "math matrices" -- with matrices stored (using NEW MATRIX)
             that are not NxK (N=number of rows in the dataset).
                For example:  NEW MATRIX INVXX= INV((x1,x2,x3)'(x1,x2,x3))  -- a 3x3 matrix.

             BY DOES work with "data matrices" -- with matrices stored (using NEW MATRIX) that
             are NxK (say, a  matrix that is a concatenation of several variables)
                For example:  NEW MATRIX XVARS= (x1,x2,x3) .* exp(z)  -- a Nx3 matrix.

    DROP    Drop variables (do NOT include in new dataset).

       DROP is used to drop variables (such as X and NEW variables) that you may
       have needed temporarily.

       Syntax:
          DROP  varlist  ;
       Example:
          DROP X1 X2 XNEW1 ;

       Example:
         You create a NEW variable  using X1 and X2, but you don't want to include
         X1 and X2 (and you don't want to include some temporary variables);
                  X  xa xb  x1 x2 ;
                   NEW XTMP1 = X1 + X2 ;
                  NEW XTMP2 = min(x1,x2);
                   NEW X12 xtmp1/xtmp2 ;
                   DROP x1 x2 XTMP1 XTMP2 ;
         The outdata file will contain XA XB and X12.

    Notes:
             * to DROP an observation, you should use NEW IF.
             * to KEEP an observation, you should use NEW IF (see the description of KEEP for
               details on how DROP and KEEP work together).


  DUMMY    Specify "dummy" variables to create.

    Create vectors of 0/1 (or 0/value) dummies, using an "ordinal variable" to define the levels;
      and (optionally) a cardinal value to defined the "values".

        Syntax:
            DUMMY ORDVAR CARDVAR -disp -BASE=bname ~
            DUMMY ORDINAL_VAR CARDINAL_VAR_OR_1 -DROP_or_-KEEP  -BASE=ABCDE ~  def2 ...  , ;

    Where:

      ORDVAR : a "ordinal variable", that contains a limited set of values.
               A seperate dummy variable will be created for each possible value.

     CARDVAR : A variable name, or a 1.
               If 1, the dummy variables will be 0/1.
               If variable name, instead of using "1" for the (single) non-zero dummy value, you
               can use this observation's value of cardvar. Cardvar
               can be one of the variables in the dataset, OR it can be new variable.

     -disp   : Optional. -DROP or -KEEP
                If -DROP, then drop the first level
                If -KEEP, then keep the first level
                -KEEP is the default

   -BASE=bname : Optional. Base name for dummies. Must be no more than 5 characters.
                  It's a good idea to end with underscore ("_").
                  If not specified. "Dnn_" is used, where nn starts with "01" (for the
                  first dummy specified), etc.

       ~ (or | ) are used to seperate dummy specifications


   Notes
    * If you are using "1" as the cardinal variable, and you are including a constant, we highly recommend
      specifying -DROP.

    *   You can specify several dummys. For example:
           DUMMY REGION  1 -DROP  ~
                 RACE , INCOME -KEEP -BASE=INC_ ;

         (note that the commas in the above example are used for readability, they are ignored by GRBL2)

    *  To NOT have dummies, you can us:   DUMMY NO

    *  Example: to use log of XINCOME in a dummy on RACE :
           new LNINC = ln(xincome) ;
           dummy RACE LNINC -DROP ;

   Examples:
        DUMMY REGION 1 -DROP  ; if 5 different regions, makes a vector of four 0/1 dummies (first region has no dummy)
        DUMMY RACE INCOME -KEEP ;  if 4 races, makes a 4x1 vector of zeros, with one of the zeros replaced by INCOME
        DUMMY NO  ;

       If
         *  VERSION has 4 distinct values: 1, 2, 3, and 4
         *  The variables you wish to "expand" are a 0/1 dummy, and AGE.
       then:
          DUMMY VERSION 1 -DROP ~  VERSION AGE -KEEP ;
       will yield two sets of dummies will be.
       For example:
            For observations    /\ the following variables are
                with values:    /\ created
            VERSION and AGE     ||
                   4  35        ||  0 0 1    0   0   0 35
                   2  50        ||  1 0 0    0  50   0  0
                   1  12        ||  0 0 0   12   0   0  0
            where the first three columns correspond to the 0/1 dummy (note that the effect
            of -DROP when VERSION=1), and the last four correspond to AGE



These ID variables can be used to create "balanced" ID variables.

    ID  Create a "balanced panel" indicator variable

      ID is used to identify a "panel" of rows -- row belonging to a single "observation".

         For example, several years of data for one indiviual.
         Or, several choices available to an individual.

      Syntax:
              ID numeric_value

      A new variable, with a name of ID, is created.
      The ID variable consists of a sequence of subsets; each subset has the same number of
      rows, and the value within the subset is the same (the values differ across
      subsets).

      Thus:
           * the specified numeric value is the number of rows per panel; NOT the number of 
             panels.
           * all rows belonging to a panel will have the same value of the ID variable
           * the number_of_rows in the dataset MUST be a multiple of ID.

    ID2   Create a "balanced nest" indicator variable

       ID2 is used to identify "nests" within a "panel".
       It creates a new variable with the name ID2.

       For example, an indivdual may have multiple years of data, with each year having several rows
       (say, 1 row per month).
          * The rows belonging to an observation are a panel,
          * the rows belonging to one observation's year, are a nest.

      Syntax:
              ID2 numeric_value

      Notes:
          * ID2 is the number of rows per nest.
            It is NOT the number of nests per panel.
          * If you specify ID2  and you must also specify ID 


    ID3   Create a "balanced sub-nest" indicator variable
       ID3 identifies "sub-nests" within a nest. It creates a new variable with the name ID3.

      Syntax:
              ID3 numeric_value

       For example, an indivdual may have multiple years of data, with each year having
       12 months of data, and each month consisting of several rows
       (say, 1 row per week).
          * The rows belonging to an observation are a panel,
          * the rows belonging to one observation's year, are a nest.
          * the rows belonging to one month are a sub-nest.


      Notes:
          *  ID3 is the number of rows per nest.
             It is NOT the number of sub-nests per nest.
          *  ID2 must be a multiple of ID3
          *  If you specify ID3, you MUST also specify ID2



    Example, given an N row dataset:
      ID 12 ; ID2 4 ; ID3 2 ;
      Would yield:
             Row       ID      ID2     ID3
               1  1  1  1
               2  1  1  1
               3  1  1  2
               4  1  1  2
               5  1  2  1
               6  1  2  1
               7  1  2  2
               8  1  2  2
               9  1  3  1
                     ..
               12 1  3  2     
               13 2  1  1
               14 2  1  2
               ...
               N*12  N  3  2        

               
           

    INPUT    Name of a pre-existing GAUSS dataset. Values will be read from this dataset.
       Example: INPUT mydata1 ;
       If no path is specified, data will be read from the GAUSS working directory.


    KEEP   Specify which variables to keep.
       KEEP is used to retain only a set of variables (such as X and NEW variables).

       Syntax:
          KEEP  varlist  ;
       Example:
          KEEP X1NEW X2NEW ZIPCODE ;

       Or, KEEP * to keep all the variables.

       Example: 
         You create several statistical variables, and you don't want to keep the
         original data.
                   X * ;
                   BY FIPS :
                  NEW X1SD=sd(x1);
                   NEW X5MEAN=mean(x5)
                   NEW x12_25=quantile(0.5,x12);
                   KEEP FIPS x1sd x5mean x12_25 ;
         The outdata file will contain FIPS, X1SD, X5MEAN, and X12_25.

       Notes:

        *  KEEP will override any DROP statements.
           However, DROP statements (including DROP VAR statements in NEW IF keyphrases),
           that appear after a KEEP, will remove variables that are in this KEEP.

           Note that variables not in this preceding KEEP will NOT be kept. 

           That is, DROP modifies KEEP. KEEP overrides DROP.

           Example:
             X FIPS X1 X2 X3 X4 X5 X6 X7 ;
             KEEP FIPS X1 X2 X3 X4 X5 ;
             DROP X2 X3 ;               @ this is okay but a bit silly: is simpler to just use KEEP X1 X4 X5 @
             NEW if AGE gt 100 then drop x5 ;  @ if any observation has age>100, X5 is dropped @
             NEW if AGE lt 10 then drop x6 ;  @ this won't hurt, but is pointless (since X6 is not being kept @
        
          
        *  If you specify:
                 KEEP  varlist1 ;
                    ...
                 KEEP varlist2 ;

           Only varlist2 variables are kept.

        *  To KEEP an observation, you should use NEW IF.
    
        *  KEEP * is somewhat redundant (the default is to keep all the
           variables). However, it can be useful if you use the GRBL2
           IF syntax to "conditionally execute keyphrases".

        
    NEW   Specify a new variable to create, using existing variables and matrices
    NEW MATRIX  Specify a new matrix to create, using existing variables and matrices

      Syntax:
           NEW  newvar1 = an_equation ;
           NEW MATRIX  newmtx1 = an_equation ;
      where:
          an_equation is an equation.

      You can specify as many NEW (and NEW MATRIX and NEW IF) keyphrases as you want --
      all the specified NEW variables (and matrices) will be created.

      Note that NEW MATRIX creates a matrix in temporary storage -- matrices are NOT
      written to the new dataset.

      An_equation can either return a scalar, or a Nx1 row (where N is the number of 
      rows in the input dataset), or a JxK matrix. 
  
         * scalar: Example:  new z1= (pi*5)/exp(3)). 
                   A new variable (in this example, Z1) is added to the new dataset.
                   Each row (in the new dataset) will have this scalar value
                   
         * vector: Example:  new z2 = ( (DIST/MPH) * (INCOME/2040)) *0.3
                   A new variable (in this example, Z2) is added to the new dataset.
                   Each row will be computed using the values from the corresponding row in the old dataset.

         * matrix : Example: new matrix m4= (x1,x2,x3) '  (x1,x2,x3) 
                  The matrix (in this example, M4) is stored for internal used (by later equations). 
                  It is NOT written to the new dataset.

                  Note: to specify a matrix, see the following description of parenthesis.

          If you NEW varname=..; and you return a matrix, only the first column of the matrix is used
          (if necesary, missing values will be used if the rows of the matrix < N).

          If you use NEW MATRIX mtxname= something ; and you return a scalar or a Nx1 vector, it will be
          stored as is (no padding is done).
                   
      Equations can be arbitrarily long; containing parenthesis, brackets, colon, commas, variable names,
      numbers, math and logic operators, and functions.

         Parenthesis: ( and ) can be used to group terms together.

                 Parenthesis are also used to specify one or more "arguments" for a
                 function. In this case, they are preceded by a function name.
                
                 Furthemore, ( ) is used to specify a matrix. In this case, they are
                 NOT preceded by a function name.
                     
                   To form a matrix, use  
                            (arg1 , .. ,argk )
                     -- commas seperate columns of the matrix.
                     -- if an argument is a scalar, it will be expanded to a Nx1 vector
                     -- if an argument is a matrix, it will be inserted (column-wise).
                      
                   Example: new matrix  m1 = (x1,x2,10*exp(3),sin(x4));
                            new matrix  m2 = (z1,m1,z2) .* weight1 ;
                          In the above example, m2 will have 6 columns -- the middle 4
                          consisting of the columns of m1 (multiplied by the weight1
                          variable).  
                     
                   Note: you can include matrices (defined using parenthesis) in 
                         NEW variable definitions: just be sure the resulting equation evaulates
                         to a scalar or to an Nx1  vector (N=number of rows in data).
                         Example:
                              new matrix m1=(x1,x2,x3);       @ creates a Nx3 matrix @
                              new matrix beta={12,12.5,61.2} ; @ creates a 3x1 matrix @
                              new  y = sum(m1 * beta )  ==  Nx3 * 3x1 = Nx1 
                              

         Brackets: { and } are used to specify a Kx1 matrix.

                   You can specify a Kx1 matrix by using { , ... , }.
                   Examples:
                       new matrix  vec1 = {1 , 5 , 6 }
                       new matrix  b1   = {bcost, bincome, 301, exp(dist)/ln(time) }

                   If  the arguments are variable or matrices, the first row is used.
                   
                   Basically: { } is a shorthand for the ROW(1,..,) function.

                   Note: you can include matrices, defined using brackets, in 
                         NEW variable definitions: remember that only the first column
                         of the result will be saved!

         Square Brackets: [ and ] are used to extract elements from a matrix.
                   Examples:
                       new matrix  ds= (dist1,dist2,dist3)
                       new matrix cmtx=corr(ds);
                       new c23=cmtx[2,3] ;

    Colons:   Colons are used to select a range of values from a matrix.

                   Examples:
                       new matrix ds = {d1,d2,d3,d4}
                       new matrix c2050_2=ds[20:50,1:2] ;   @ extracts a 30 x 2 matrix @


         Commas:   used to supply multiple arguments to a function (such as LN, the statistical
                   functions, and in matrix definition).

                   Example: NEW MINCOST=min(mycost,hiscost,hercost);  @ 3 arguments@

                   Hint: for easy reading of numbers, instead of comma you can use `.
                      Example: 125`535.34
                   BE CAREFUL: ` is a spacer, while ' is "transpose".

         Variable name: any of the variables mentioned in X, ID, or DUMMY.
                  In addition, variables and matrices defined in preceding NEW .. keyphrases
                  can be used in later NEW.. keyphrases.
         
                  Lastly, several built-in variables are also understood:
                        PI : 3.1416...
                       _N  : The row number 
                     _PANEL : The panel number 
                             If a BY variable is not specified, this equals 1.
                             If a BY variable IS specified, this is sequential panel number,
                             of the panel this row belongs to.
                             I.e.; rows in the the 3rd panel will have _PANEL = 3.
                             Thus, this is NOT the same as the ID variable used to identify the panel!
              
                Note:  _N and _PANEL are computed before obseravtions are DROPped, and before they are sorted.

                             
         Number: any numeric value. Scientific notation is supported (such as 1.08e-3). 
                 You can use . for "missing value"

         Operators:
              Math operators:
                 ^, *, /, +,  - , \,  %, !, .* , '

              \   is truncated division = Trunc(10/3) = 10\3 = 3 ;
              %   is remainder  10%3 =1
              .* and * are both multiplication. They differ when two K (K>1) 
                  column  matrices are used --
                  *  is matrix multiplication,
                  .* forces element-by-element multiplication
              ' is transpose. It can also be used for matrix multiplication: so that
                 M1'M2 is a shorthand for M1'*M2.
               ! is factorial. You can also use the FACT() function.

              Logic operators 
                 =, <, >, >=, <=, /=,  OR, AND  (returns a 0 or 1)
            
                  AND -- returns 1 if both sides are ne 0
                  OR  -- returns 1 if either side is ne 0

              Or, for you old folks who remember Fortran, you can use the following two 
              character codes: GT, GE, LT, LE, EQ, and NE.

              Examples: NEW ISBIG = WEIGHT > 200 ;
                        NEW ISSMALL = WEIGHT LE 120 ;

         Functions: Math, statistical, data functions, and random number:

           Math functions. 
                LN    - natural log. values <=0 return NAN.
                      You can specify a 2nd argument: if the value <= 0, the ln of this 2nd
                      argument is used. Example:  NEW  vnew=ln(height,0.001) ;
               LOG    -base 10 log. As with LN, you can specify a 2nd argument (used if value<=0)
               EXP   - exponentiation

               SQRT  - Square root (missing value used for arguments <= 0)

               TRUNC  - truncate (remove stuff to right of decimal) -- basically, the GAUSS TRUNC function
               CEIL   - truncate up (round up ) -- basically, the GAUSS CEIL function
               FACT   - factorial. FACT(4) = 4! = 24
               LNFACT - ln of factorial. LNFACT(4)=3.178

               SIN   - sin (radians)
               COS   - cosine (radians)
               TAN   - tangent (degrees)

               SIND   - sin (degrees)
               COSD   - cosine (radians)
               TAND   - tangent (degrees)

               GAMMA: The Gamma function: Gamma(n+1) = N!
               LogGamma: The log of the gamma function: LogGamma(N+1)= lnfact(n);
               DiGamma: First derivative of LogGamma 
               TriDamma: Second derivative of the LogGamma

               PDFN  - normal pdf
               CDFN  - normal cdf


             PDFBVN - PDF of the bivariate normal distribution.
               This can take 2,3,4, or 5 arguments
               The first two arguments are the X1 and X2 to evaluate the bivariate normal at.
               The 3, 4 and 5 arguments are the SD1, SD2, and rho coeffficients.
               If they are not specified, default values of 1.0, 1.0, and 0.0 are used (respectively).
               Note that these arguments can be variables or scalars.
               Note that a bivariate normal centered at zero is assumed (subtract the mean from
               first and/or second arguments if the mean is not zero).

           CDFBVN - CDF of the bivariate normal distribution.
               This can take 2,3,4, or 5 arguments
               The first two arguments are the X1 and X2 to evaluate the bivariate normal at.
               The 3, 4 and 5 arguments are the SD1, SD2, and rho coeffficients.
               If they are not specified, default values of 1.0, 1.0, and 0.0 are used (respectively).
               Note that these arguments can be variables or scalars.
               Note that a bivariate normal centered at zero is assumed (subtract the mean from
               first and/or second arguments if the mean is not zero).



            Statistical functions.

            Statistical functions can operate on arguments, or on observations:
              on arguments: if you specify more than one argument, then the statistic is
                            computed on the arguments. In other words, the statistic is 
                            computed seperately for each observation, using the values of the
                            variables you specified (in the argument list)
              on observations: if you specify just one argument, then the statistic is computed
                               across the rows. Thus, one unique value is computed -- the
                               statistic for the variable specified in the argument.
                               That is ... N rows are returned (N being the number of rows in the
                               dataset, or panel) -- but each row has the same value (the value of the
                               statistic computed across all the observations in the dataset or panel).

        Reminder: if you have specified a BY -- then statistics are computed across observations in a "panel".


               MIN   - minimum.
               MAX   - maximum
               MEAN  - mean
               MEDIAN  - median
               PROD  - product
               SD    - standard deviation
               SUM   - sum
               VAR   - variance

               QUANTILE - value of a specified quantile.
                     The syntax for QUANTILE is a bit different:
                     the first argument is the quantile (a number between  0 and 1).
                     For observations, specify a 2nd argument: its quantile statistic will be computed.
                     For arguments:  for each row, the quantile across the 2,3,... arguments is computed.

                         Note that QUANTILE(0.5,x) and MEDIAN(X) return identical results.


              CORRX - Correlation matrix. Creates the KxK correlation of the K arguments.
                      Or, if one (or more) of the arguments is a matrix; of the K-J arguments,
                      where J=# columns of the data matrices).

              VCX   - Covariance matrix. Creates the KxK correlation of the K arguments.
                      Or, if one (or more) of the arguments is a matrix; of the K-J arguments,
                      where J=# columns of the data matrices).




          Data functions.

               CLASS - Creates "class" identifiers. Every unique combination of values of
                       the several (one to ten) arguments gets a unique value of the class
                       variable.  Thus, for observations that have identical values of the
                       arguments, CLASS will return the same value.

                       Notes:
                       * the rows do NOT need to be sorted.
                       * the value of the class variable is somewhat random. Thus, although the
                       value of the class variable is "internally" consistent, it can not dependably be
                       used cross files.
                       Thus, if you create a CLASS on two files, using the same class variables;
                       then an observation in file 1 will NOT necessarily have the same value (of its CLASS variable)
                       as the SAME observation (an observation that has exactly the value values) in file 2.

               N     - number of arguments, or number of rows
               NMISS - number of arguments with missing values, or number of rows with
                       missing values



               IN    - Compares value of argument 1 to other arguments, returns column number
                       of first match; or 0 if no match (basically, the GAUSS INDNV function
                       across each column of a matrix).

               IN_COL - Compare value of argument 1 to the rows of argument 2, returns row number
                        of first match, or 0 if no match (basically, INDNV across the rows of a
                        variable).
                         Notes:
                            * the first argument can be a scalar value, or a vector
                              If scalar, then all rows (of the new variable) will have the same value
                              If vector, then for each row of argument 1, return its  match across all the rows of argument 2

                         Example (assuming v1 and v2 are variables):
                            If  v1 = 40,5,20,20,10
                            and v2 = 10,20,30,40,50
                            And aa=In_Col(v1,v2)
                            then aa=4,0,2,2,1

              IN_RANGE - Returns index of column greater than a value.
                    The first column contains the value, columns 2... contain the range
                    Each row can have a different range -- but each row MUST have
                    columns 2 ... sorted (column 2 having the smallest value).
                    Returns 1 if less than value of Column 1, k if less than value of column k).
                   If K row, and value > largest value, returns K+1.
                   Example:
                           a=IN_range(5,1,3,7,12)      : a=3
                           a=IN_range(0.5,1,3,7,12)    : a=1
                           a=IN_range(200,1,3,7,12)    : a=5


               UNIQUE - Returns number of unique values.
                        If just one argument,
                          returns  number of unique values across rows -- like the
                          statistical functions, the same value is returned for every row.
                        If more than one argument,
                          returns number of unique values across the arguments --
                          a different value will be returned for each row.

               ISMISS  - returns 1 if any of the arguments are missing (0 otherwise)

               MISSRV - converts missing values to a numeric value.
                        This requires two arguments: the first is the "replacement", the second is the target-variable
                        Any missing value in the target variable is replaced with the appropriate value from the replacement
                        (the replacement can be a scalar, or another variable).
                        Note that the 2nd argument should be a variable (if a matrix is provided, only the first column is examined).
                         
                        Example:  new  a1=missrv(25000,income);
                                 would replace missing values of INCOME with 25,000.

               NOT   - Returns 1 if argument equals 0, 1 otherwise (only uses first argument)

               VAL   -  Returns the n'th largest value of value of the nth row of the data, where the first argument
                        is used to specify n.
                        If n<1, return the smallest value
                        If n>rows(data), or if n is missing (.), return the largest values.
                  Thus, VAL is similar to QUANTILE, but using whole numbers rather than a fraction.

                Notes on data functions:
                   *  N, UNIQUE, and NMISS are like the statistics functions-- they can be
                      across observations, or they can be across arguments.

                   *  IN and MISS are across arguments (hence, there must be more than one argument).

                   *  NOT uses just the first argument.

                   *  VAL is like QUANTILE: the first argument identifies which row's data to use.
                      However, VAL only works on observations (there is no "val of nth argument"
                      feature).

                Example of the CLASS function:
                       Example:   IDC=class(v1,v2,v3)
                                row    v1   v2   v3   IDC
                                 1      1   2     12   1
                                 2      1   3     12   2
                                 3      1   2     12   1
                                 4      2   2     12   3
                                 5      2   2     15   4
                                 6      1   3     12   2
`                                7      15  6     22   5
                                 8      1   2     12   1

                        Note that the unique values assigned to each "class" are somewhat
                        arbitrary -- there is no definitive relationship between the value
                        assigned to the class and the values (or row numbers) of the observations.

          Random functions:

             RNDN : Normal random number. Two arguments should be specified, the first is the
                   mean, the 2nd the standard deviation. If you specify just one argument, a
                   mean of 0 and sd of 1 is used (that is, the value of this first argument is
                   ignored).

                   Example: RN1 = rndn(.)    -- mean 0, sd 1, normal random variable
                            RN2 = rndn(100,20) --  mean 100, sd 20, normal random variable
             
             RNDU : Uniform random number. Two arguments should be specified, the first is the
                   minimum, the 2nd the maximum. If you specify just one argument, a
                   min of 0 and max of 1 is used.

             RNDR : Select number randomly from a list of arguments.
                   Multiple arguments must be specified (you can specify as many arguments as
                   you want, but there should be at least two).
                   A number from this list will be chosen.

              For all three of the above, the 2nd, 3rd, etc. arguments can be number or a variable.
              If a variable, each observation will have its own min, or max ,or sd, etc.


          Matrix functions:

            ELEMENT :  Returns a 1x1 element of a matrix 
                       The first argument (r) is the row number.
                       The second argument (c) is the column number.

                       Basically, ELEMENT(r,c,amtx) is identical to amtx[r,c] --
                       except that you can specify amtx within ELEMENT 
                       (for example: ELEMENT(r,c,corrx(v1,v3,v3)) ;

                ROW :  Returns a row of a matrix (as a Kx1 matrix).
                       The first argument (r) is the number of the row you want to extract.

                       Returns a Kx1 vector containing the r'th row of this matrix.

                       Or, if the first argument is 0, returns number of rows in the matrix.

                       If r > number of rows of the matrix, an error occurs.

                       Example: r10=row(10,x1,x2,x3);

               ROWS:   Return several rows of a vector (or a matrix).
                       The first argument should be an index into the rows of the 2nd (and other) arguments.
                       Thus, the first argument must consist of integer values greater than 0 and
                       less than the number of rows in the 2nd (et al) argument(s).
                       
                       Special case: like with ROW, ROWS(0,x) returns "number of rows in x".
                       
                       The first argument can be a matrix with few rows than the 2nd (and later) arguments.
                       
                       Example:
                       If ID1=1 3 2 4 1 5
                       and X= 11 22 33 44 55 66
                       then vv=rows(id1,x);
                       would yield vv=11 33 22 44 11 55 ;
                       
                       Or, if you used :  NEW MATRIX ID1={3,2,1} ;
                                          vv=rows(id1,x);
                                          then vv= 22 33 11 . . .   (i.e.; missings are for rows 4...)
                                          
                       If X1 is a matrix (or you specify multiple values)
                          If you use NEW -- only the first column of X1 is returned
                          If you use NEW MATRIX -- all the rows  are returned

                COL :  Returns a column of a matrix (as a Kx1 matrix).
                       The first argument is a number (k).

                       Returns a Rx1 vector containing the k'th column of this matrix.

                       Or, if the first argument is 0, returns number of columns in the matrix.

                       If K > number of columns of the matrix, an error occurs.

                       Example: kk=col(10,x1,x2,x3);
                                ncols =col(0,m1)

                INV:   Inverse of a matrix.
                       This expects one argument; which should be a matrix (either one
                       defined using NEW MATRIX, or one created on the fly)

                       Example:  m1=(x1,x2,x3);
                                 m2= m1'm1;
                                 invm1=inv(m2);  @ invm1 will be a 3x3 matrix @

                Notes on ELEMENT, ROW, and COL:
                 *  As with other matrix operations, you can save the results to a variable
                   (or to a matrix).

                 *  The remaining arguments specify the matrix from which to extract the row.
                    As with the ( ,..,) syntax for creating matrices, these arguments can
                    contain values, variables, or matrices.
                    If other matrices are included, they MUST be N rows long.

                 *  If the rows returned by ROW, COL, or MTX (K) is less than the number of
                    observations in the dataset (N), missing values will be used to "pad".
                    For example:
                         NEW MY5= row(5,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10) ;
                     will return a 10x1 matrix (K=10) -- the fifth row of variables X1 to X10.
                     If N=100, then the MY5 variable will have 100 rows;
                         * rows 1 to 10  containing the values (of the 5th row of X1 to X10)
                         * rows 11 to 100 containing missing values.


          Examples:
                 NEW X11 = X1 * X1 ;
                 NEW X11 =  X1 * X1 ;
                 NEW X22 =   X2 * X2  ;
                 NEW X1_2 = (X11 - X22) / (x11 + x12 ) ;

                 NEW YBIG = Y1 > 100 ;
                 NEW  Y2  =  (ln( x1 + x2 ) / exp(x1 +x2) ) - 100 ; 

                 new obsnum= _n ;
                 new panelnum = _panel ;
         
                 NEW xsum = sum(x1,x2,x3,x4);
                 New xmin = min(xa,xb,xc,xd,xe,xf/100); 

                 NEW ISBIG = ( (WEIGHT > 200) or (SIZE>6) ) and (CLASSA = 1) ;

                 NEW ISONE = IN(1,v1,v2,v3)  ;
  
                 new cost_10=VAL(10,COST)   ;    @ the value of the COST variable of the 10th row @
                                        
                 new ithpanel=panelnum(1);

                 new nsites=unique(visit1,visit2,visit3,visit4,visit5,visit6,visit7) ;

                 new nsites=unique(visit1,visit2,visit3,visit4,visit5,visit6,visit7) -
                            (nmiss(visit1,visit2,visit3,visit4,visit5,visit6,visit7)>0) ;

                 new nsites=unique(visit1,visit2,visit3,visit4,visit5,visit6,visit7) -
                            (in(0,visit1,visit2,visit3,visit4,visit5,visit6,visit7)>0) ;


                 new rndcauchy = rndn(HEIGHT,10) / rndu(.)  ;  @ the . means use the default, 0 to 1 , range @
                 new rndwhich = rndr(1,2,20,100,mymax,mymin) ;    

                 new matrix m1= (x1,x2,x3).* (income / 2040)  ;
                 new matrix r1= row(1,25,m1,wage);
                 new matrix r1 = {1,25,m1,wage}  ;            @ same as the above @

                 new matrix  m1 = (x1,x2,x3);     @ x1,x3,and x3 are variables @
                 new matrix  m2 = m1'm1 ;         @ KxN * NxK matrix multiplication @
                 new   ncol=col(0,m2)
                 new matrix mc1 = col(ncol,m2);   @ returns the last column as a Kx1 vector @

                 new matrix xs=(1,bwtc,percap,coldeg);   @ create a data matrix @
                 new matrix xsp= xs'permits;             @ a "math matrix" @
                 new beta_us= inv(xs'xs) * xsp;       @ Beta OLS coeffients in rows 1 to 4, 5 and up are missings @

                 by state ;                         @ compute seperate OLS beta for each STATE @
                 new matrix xs=(1,bwtc,percap,coldeg);   @ this "data matrix" can be used with BY @
                 new beta_st=  inv(xs'xs) * (xs'permits);  @ Since BY is being used, do NOT use the XSP math matrix @
                  

                 new pbvn=pdfbvn(age-mean(age),income-mean(income),sd(age),sd(income),0.5);   @assuming 0.5 correlation @

                 BY MYID ;
                 NEW mymax=max(price) ;
                 new lowval=min(price);
                 new val4=value(4,price);
                   could yield:
                  Row  MYID   PRICE MYMAX LOWVAL VAL4
                    1   1    11  16  2     16
                    3   1    2       16    2     16
                    4   1    16    16   2     16
                    5   2     3  4      1      2
                    6   2     2  4   1      2
                    7   2     4  4   1      2
                    8   2     2  4   1      2
                    9   2     1  4   1      2


      Notes:
         *  IMPORTANT NOTE:
                You can NOT refer to a variable in the dataset that is
                NOT included in X keyphrase.
                However, you can use DROP to not save X variables (say, X variables
               which are only used to create a NEW variable.

          *   Matrices are NOT saved to the new dataset. If you want
              to save one, or more, columns of a matrix, create new variables (one
              for each column you want to save) using the COL function.
              Note that if the column has more than N rows, an error will occur.

          *   "Math matrices" (matrices that are not NxK) do NOT work with BY --
               BY will not pull out a "panel specific" matrix (it will pull out junk).

               However, when creating a NEW variable, you CAN use existing "data" (NxK) matrices
               BY will pull out the relevant rows (belonging to this panel) from this matrix.

               You can also create matrices on the fly - BY will use the
               relevant rows  to form a "panel specific" matrix.

          *   NEW keyphrases are processed in order of appearance. Hence, NEW
              keyphrases can refer to variables and matrices defined in "earlier" NEW keyphrases.

          *   When a BY variable is specified, statistics are computed across observations in the
              BY group (as identified by the BY idvar).

          *   Variable and function names are case-insensitive

    NEW IF    Perform an action if a condition is met

     Supported actions are: DROP a variable, DROP or KEEP an observation,
     or create a NEW variable.

        Syntax:

            NEW IF condition  THEN  action ELSE action2 ;


        where:

            condition: an equation, using the same syntax as NEW.  
                       If this "condition" equation evaluates to 1, the "action" is done.
                       Otherwise, action2 is done.

                 Action 2 is optional -- if not specified, and the condition is 0, nothing
                 is done.
                       
            action:  Either
             and        DROP  ;
            action2   or
                       DROP  avar  ;
                      or
                        KEEP ;
                      or
                         newvar = an_equation
                 or
         PRINT "text string in quotes " var_or_matrix "more text " var_or_matrix ...
                 or
         PRINT_MTX "text string in quotes " var_or_matrix "more text " var_or_matrix ...

                 where the newvar and an_equation are the same as in NEW


       DROP AVAR means "DROP the AVAR variable from ALL observations".
            Thus, if just one observation has the condition equal 1, 
            AVAR will be dropped from ALL observations.
            Note: you can only specify one variable at a time to drop.

       DROP and KEEP (not followed by a variable name) mean "drop this observation" and 
       "keep this observation".
          CREATE processes DROP and KEEP statements in the order entered, using
          the following logic:
             i) to start, an index to all observations is created
            ii) if a DROP is encountered, all observations satisfying the condition
                (i.e.; condition equals 1 and DROP is used in a THEN clause)
                are removed from the index.
           iii) if a KEEP is encountered, all observations satisfying the condition,
                are re-entered (if necessary) into the index.
                That is, previously DROPped observations are un-dropped. 
                Otherwise (the observation was never dropped), it stays in the index.
            iv) After all DROP and KEEP keyphrases have been processed, the index is used
                to extract observations.
               
            In other words, 
              * KEEP counteracts DROP
              * KEEP (by itself) will NEVER lead to an observation being dropped.
       
       PRINT is used to print text and data value to the output file.
          You can have several text strings, and variables or matrices, on a line.
          Text strings MUST be enclosed by double quotes (").
          If you print a variable, all rows will be written to the output file.
          You can only PRINT matrices or variables -- in PRINT, equations will NOT be analyzed.
     
     PRINT can ONLY appear after a THEN or an ELSE. 
          If you want to uncondtionally PRINT, use:
       IF 1=1 then Print "This is my output ";

     Notes:
          * If you specify several  variables, they will be printed sequentially (all the values of the first
            variable, then the second variable, etc.). 
          * If you want to print the variables for each row on its own line, first create a matrix, and then
                 specify the matrix name in the PRINT statement. For example:
                      new matrix  incinfo=(wage,workhours,bonus,tax);
                      new if 1==1 then print_mtx "Info used to compute income: Wage Hours Bonus Tax " incinfo ;
          * If you use BY, then only the values for the last "panel" are printed (this may be modified in
            the near future).                      

          * To print a single value, use a matrix. For example:
                     new  z=sum(trips);
                     new matrix z1=z[1,1];
                     new if 1==1 then print_mtx "Total trips (across all observations): " z1 ;


       PRINT_MTX is identical to PRINT, except matrices will be displayed in a formatted fashion (equal sized columns).          
           
       Notes:
 
             * New variables, created with IF, have default values of missing
               Thus, if you do not specify an ELSE, and a row does not satisfy the condition,
               a value of missing is assigned to the new variable (for that row)

             * Typically, one would specify the same variable name in action and action2.
               However, you CAN specify two different variables. Hence, observations where
               action2 is done (where condition evaluates to 0) will have missing values for the
               variable defined in action.
        
             * To facilitate evaluation of multiple conditions, NEW IF  can modify
               pre-existing values. In contract, NEW can not overwrite an existing
               variable.

               A trivial example:
                      if age > 65 then retired = 1 ;
                      if age <=65 then retired = 2 ;
                  Is the same as
                     if age > 65 then retired = 1  else retired=2 ;

                 A more interesting example, that could NOT be done easily
                 just using THEN ... ELSE.
                     retired=0;
                     if age > 65 then retired = 1 ;
                     if pension > 0 then retired = 2 ;
                     if in(trust_fund,101,202,303)>0 then retired=3 ;
               
               You can also use NEW IF to change the values of an existing variable (one
               in the input dataset).
               For example:
                   IF 1 then oldvar=oldvar*10 ;
               (since 1 is always true, the oldvar will be transformed)


             * NEW IF, as a subset of NEW, keyphrases are processed in order of appearance. 
               Hence, NEW keyphrases can refer to variables defined in "earlier" NEW keyphrases.

             * To KEEP observations that satisfy a condition, and  DROP observations that don't, use:
                   NEW IF condition then keep else drop  ;

             * We don't expect the DROP AVAR syntax to be used very often. Hence we do
               not provide a KEEP AVAR complement.

             * Sorry, we do not currently support conditional creation of "math" matrices. The bookkeeping
               is too cumbersome (perhaps we will tackle this problem later).

             * Historical note: since IF is used as a GRBL2 keyword, it is necessary to
               use NEW IF for "conditional variable creation".

      Examples:

          NEW IF  V1 > 100 then  DROP v1 ;  @ will drop the V1 variable if ANY v1 value is > 100 @
             
          NEW If VOLD = 1  then vnew=1 else vnew=0;   @ set VNEW=1 if vold=1, otherwise VNEW=0 @
             
          New if (exp(3)/v100) > (ln(tan(vdog)-sin(vcat)) then monkey=2 ; @monkey = missing otherwise@

          NEW ISBIG = ( (WEIGHT > 200) or (SIZE>6) ) and (CLASSA = 1) ;
          NEW if not(isbig) then keep else drop ;   @ only keep the non-BIG observations @
          DROP ISBIG ;                       @ and dont keep the ISBIG variable @
        
          BY PID ;                  @ create summary stats on a panel, and save just the stats (one row per panel)@
          X visits cost ;
          TOTTRIP=sum(visits);
          MAXTRIP=max(visits);
          avgcost=means(cost);
          pnlseq= _panel ;
          if _n = 1 then keep else drop ;
          DROP visits cost;         @ new dataset will have PID TOTTRIP MAXTRIP AVGCOST and PNLSEQ @

     NEW IF VOLD=1 then print "This has OLD values " else PRINT "This has NO old values ";
         
          new matrix corrds=corrx(d1,d2,d3,d4);    
          NEW IF 1==1 then print_mtx corrds ;

     new matrix  incinfo=(wage,workhours,bonus,tax);
          new if 1==1 then print_mtx "Info used to compute income: Wage Hours Bonus Tax " incinfo ;


    OUTDATA   Name of the GAUSS dataset to create.
        Example: OUTDATA mydata2 ;

        Or, to suppress creation of a new data file, use OUTDATA 0 ;
        This can be useful if you are just SAVEing matrices (or PRINTing results).,
            
        Notes:
            * If no path is specified, data will be written to the GAUSS working directory.
            * Do NOT specify an extension -- .DAT is automatically used.
            * If the specified file (say,mydata2.dat), it will be overwritten.
            * This can be a good thing...
                  if you want to overwrite the "current" database -- 
                  just set the NEW= value to be the same as the IN= value.
                  For example: 
                     CREATE ;
                       INPUT mydata0;
                       OUTDATA mydata0 ;
           * OUTDATA is NOT the same as OUTPUT -- OUTDATA is where the GAUSS dataset is written to,
             OUTPUT is where the short summary report is written to.

     OUTPUT  Name of a file to write a report of results to 
        See GRBL2_BATCH.TXT for the details.


   Notes:
      * the "avar" variables must be one of the variables included in the new dataset, as specified in
        a X, ID, ID2, ID3, NEW, or DUMMY option.
      *  _ can be used in variable names


    RUN  Create the new data file

       You must have a RUN ; keyphrase after all your CREATE keyphrases.
       It tells MAKEDATA to go ahead and do it!.


    SAVE  Save a matrix to a GAUSS .FMT file

        You can save matrices you create (using the NEW MATRIX keyphrase) to files by using the
   SAVE a GRBL2 command.

   You can only have 1 save command per CREATE -- though you can specify multiple matrices in this
   save command. 

      Syntax:
          SAVE FILE=mtx file2=mtx2 ...;
     where
        file - a file name (do NOT include an extension)
        var - a matrix created with NEW MATRIX 

     If you do not include FILE=, then a file with the same name as
     the matrix will be created.
   
     Example:
        new matrix xx=(age,income,race,educ);
        SAVE XX   ;     @   save XX to XX.FMT  @

        new matrix xx=(age,income,race,educ);
        new matrix xcorr=corrx(xx);
       new matrix axvc=vcx(xx);  
        SAVE  XCORR=x1  AXVC=X2     @ save XCORR to  X1.FMT, and axVC to X2.FMT @

   
 
    SORT Sort the data

      SORT has the following syntax:
          SORT [-d] var1 var2 var3 ;
       where var2 and var3 are optional.

      SORT will sort the data in ascending order, or if -d is specfied, in descending order.
      You can specify up to 3 variables; sort will sort by the first variable, then by the 2nd variable
      (retaining sort order of the first), then the second.

      Notes:
          * sorting occur as the last step -- after all new variables have been created, and
            all observations dropped.
          * you can ONLY sort on a variable specified in a X, ID, ID2, ID3, NEW, 
            or DUMMY option. Of course, you can DROP this "sorted on variable" 
            (so that it is not included in the new dataset)
          * To sort across more than 3 variables, you can create a new variable using
            CLASS, and sort on the CLASS value. The order of the resulting observations
            may not be nested, but all "similar" observations will be contiguous.

    X    A list of variables to add, from the old dataset, to the new dataset. 
      
       See GRBL2_BATCH.TXT for the details. 
        
       Examples:   X FIPS DISTANCE PERMITS DEPTH SECCI RAMPS  FEE1 FEE2 ;
                   X  *  ;

       Notes:
         * X must include ALL the variables you might use, including:
            ~ dependent, weight, and auxillary variables.
            ~ ID variables
            ~ variables used in creating NEW variables
       
         * To include all variables in the old dataset, use:    X  *  ;

         * You can DROP variables that you don't need to retain (but that may be used in
           creating NEW variables).
  
         * Or, you can KEEP just the variables you want to retain (dropping all others)

   --------------------------------

IIb. ATOG: Convert a text data filem, or EXCELL spreadsheet, into a GAUSS dataset

  As a convenience, you can convert a text data file (say, an export from SAS), or an 
  EXCEL spreadsheet,  into a GAUSS dataset. 

  ATOG is a substitue for GAUSS's ATOG.EXE program.  If you are familiar with the GAUSS ATOG program, 
  it is probably simpler to use it  (rather then use this ATOG utility). 
  That said, this ATOG is quite fast, so there is no serious disadvantge to using
  this utility instead of ATOG.EXE.

  Data in the text file must be space, comma, or tab delimited.
  Data can be one row (in the text file) per observation,
  or spread across several rows. 

     * One row mode: each line of the input file is a seperate "observation" -- each line
       of the input file is used to create a new row in the output file.
       If a line in the input file does not contain enough values, it is padded (at the end) with missing values
       If a line contains too many values, the extra values are discarded.
 
   To specify one row mode, use the ONEROW YES keyphrase (described below).      

     * Several row mode: data is read a value at a time, filling in the data matrix in row first fashion.
       This is the default, and is how ATOG.EXE works.
       In this mode, newlines are treated as value delimiters (just as spaces are treated as value delimiters).

       Hence, you must be sure that you have the correct number of data values in the file.
       A small mistake (such as one value dropped) can completely mess up subsequent rows.

   To specify several row mode, use the ONEROW NO keyphrase (described below).      

  In either mode, the number of columns in the newly created dataset is set by one of the following variable name
  specifications (VAR or VARS_1).

  The following keyphrases are used in ATOG.
     TEXT input_file   : Text (ascii) input file
     EXCEL FILE=input_file SHEET=N START=Cn  : EXCEL input spreadsheet. 
     OUTDATA new_data_File  : Output file (where the newly created fill)
     VARS list-of-variable-names  : List of variables names
     VARS_1      : Where, in the input file, to read variables names from
     MISS  nn    : Numeric value used to indicate a missing value 
     ONEROW YES  : One row of data per observation
     OUTPUT      : Name of a file to write a report of results to 
     RUN         : Create new data file
   
  The details follow ....

     TEXT input_file ; 
     The ascii text file to read

     EXCEL FILE=input_file SHEET=N START=Cn  ; 
     Specify an excel file to read
     FILE = The file name
     SHEET = Optional: the sheet number. If not specified, sheet 1 is read.
     START = Start reading from  this EXCEL cell. If not specified, start at cell A1.

     Note: Obviously, you should not use both TEXT and EXCEL at the same time.

     OUTDATA new_File ; 
            The gauss dataset to create. 
       For both INPUT and NEW, if no path is specified, the Gauss working directory is first examined.
       We recommend NOT specifying an extension (.DAT is used by default).


     VARS list-of-variable-names; 
          A list of variable names. 
          Note: Only the first 8 characters of a variable name are used. 


     VARS_1  s:e ;
         Instead of VARS, you can look for the variable names in the file.
         For  text files:
              s and e are the start and end rows (variable names are read from these rows)
              if s:e is not specified, use row 1
         if s (without an :e) is specified, use row s
    For excel files:
         s:e is an excel range (i.e.; A2:F2). The entire range must be specified.
         Or, if s:e is not specified, the first row is read (implicitily, A1:FF1).
         You MUST specify the ":e" component of "s:e".
       
         Note: 
      * Only the first 8 characters of a variable name are used. 


     MISS  nn ; 
            nn is a value to be treated as missing. For example, SAS uses -9 as a missing value flag.
       Note that by default, a . (period surrounded by spaces) signifies missing value.

     ONEROW YES ;
      If you wish to force GAUSS to treat each row of the input file as a seperate row, you can specify
           ONEROW YES ;.
           If a row has fewer variables than specified (in the VARS or VARS_1 keyphrase), missing values are padded to
      the end.
           If a row has more variables, the extra values are discarded.

     Note: ONEROW NO ;, which is the default,  means 
           load values into data as encountered -- newlines are treated as value seperators,
           not as "new row in output dataset" indicators    
           
     Basically, ONEROW NO is what ATOG does.


   OUTPUT   : Name of a file to write a report of results to 
          See GRBL2_BATCH.TXT for the details.

   RUN : Create new data file

       You must have a RUN ; keyphrase after all your ATOG keyphrases.
       It tells MAKEDATA to go ahead and do it!.

         
   Examples:

     ATOG  ;
       TEXT  my_foo1.txt ;
       OUTDATA  My_foo1_new ;
       VARS  X1 Y1 X2 Y3 Z1 Z2 ;
       ONEROW YES ;
       Run ;


    ATOG   ;
       TEXT my_foo.txt ;
       OUTDATA My_foo_new ;
       VARS_1 ;         @ the first line of MY_FOO.TXT must contain the variable names@
       MISS -9 ;
       Run ;

    ATOG ; 
       excel file = my_foo2.xls  start=D2 ; @ skip header info @
       VARS_1 C3:C6 ;      @ variable names in cells c3 to c6 @

       OUTDATA  My_newx ;
       VARS  X1 Y1 X2 Y3 Z1 Z2 ;
       Run ;

=================================================
IIc. MERGE: Merge two gauss datasets

  MERGE allows you two create a new gauss dataset consisting of the merge of two existing gauss datasets.

  Three kinds of merge are supported:
   
    BY-ROW) The datasets are row-wise concatenated -- each row of the new dataset is formed from the
       corresponding rows of the two input (pre-existing) datasets.
            If the datasets have a different number of observations, missing values are used for the 
            variables associated with the "missing rows" of the shorter dataset
 
   BY-INDEX) Variables are added from the 2nd dataset to the first dataset, based on the value of index variables.
             This supports many-to-one matches: typically, you will have several observations with the same index
             in file 1 (say, a FIPS code); and one observation with this index in the
             second dataset (say, county-level demographic measures).

   APPEND)  Rows are appended, with missing values used for variables when necessary.


  In the BY-ROW or BY-INDEX mergnes,when the variables you wish to include from the two datasets have variables with 
  the same name, you can choose which value to use.

         That is ..if a variable name appears in both the first and 2nd file, only one of 
         the values will be written to  the output data file -- the TYPE keyphrase is used to specify which one.


 The following keyphrases are used in MERGE.

     FileA    : The first input file (pre-existing Gauss data file)
     FileB    : The 2nd input file (pre-existing Gauss data file)
     IDA      : Identifier(s) for rows in first file
     IDB      : Identifier(s) for rows in second file
     OUTDATA      : New Gauss data file to be created
     RUN         : Create new data file
     TYPE        : Type of merge, and how to deal with same named variables
     VA          : Verification variables for FileA
     VB          : Verification variables for FileB
     XA          : Variables from FileA to write to new data file
     XB          : Variables from FileB to write to new data file


 The details follow ....

     FileA  a_file ;
         The first input file (pre-existing Gauss data file)

     FileB  b_filename ;
         The second input file (pre-existing Gauss data file)

     IDA  a_variable or a_variable_list;
         An "index" variable, or several indexi variables, for the first input file.

     IDB  a_variable or a_variable_list;
         An "index" variable, or several index variables, for the second input file.

      Note: IDA and IDB are only used in by-index merges.

     OUTDATA  o_filename ;
        New Gauss data file to be created
        If it exists, it will be overwritten.
        If you want, o_filename can be the same as the FileA or FileB files (which will be overwritten).

     RUN      ;
          Create  the new data file

     TYPE -xx ;
     TYPE  -ROW -xx  -BIG ;
     TYPE  -APPEND ;

       Type of merge, and how to deal with same named variables

       If -ROW or -APPEND is not specified, a BY-INDEX merge is attempted.

      -ROW specifies row-by-row concatenation (BY-ROW).
      -APPEND signals that rows from the 2nd dataset should be appended to rows from the first
       dataset (APPEND).

       Notes:
           * in a by-index merge, the number of rows in the output dataset will equal the
             number of rows in the first input file.  This means that rows in the 2nd input file, that
             have an index value that is NOT present in the first input file, are discarded.
           * in a by-row merge, the number of rows is the greater of the number of rows in the first or
             second file.
           * in an append merge, the number of rows is the sum of the rows in the first and second files.

       -BIG is ONLY used with by-row merges.
          -BIG is used to avoid "out-of-memory" problems when merging large datasets -- you do NOT need to specify
          it for smaller datasets.  With smaller datasets, execution is somewhat faster if you do NOT
          specify -BIG.  By large dataset, I mean a file having a size (in bytes) about 10% of system memory
          (a 100M file on a system with 1G of RAM).

       -xx dictates how to deal with matching variables (it is NOT used if -APPEND is specified).
            -A   : Always use the variable from the first file
            -B   : Always use the variable from the second file
           -AB   : Use the variable from the first file if it is non-missing. Otherwise, use the variable from the 2nd file.
           -BA   : Use the variable from the 2nd file if it is non-missing. Otherwise, use the variable from the first file.

         Notes:
           * the default type is AB.
           * Merging is faster if you do NOT have any variables that overlap.

     VA varlist ;

         Varlist is a space delimited list of variables to check (pulled from FileA).
         For each row to be added to the new dataset, the VA varlist is compared to the VB varlist.
         If there are ANY unequal values, an error occurs (the program stops).

         Thus, VA and VB MUST have the same number of variables (the variables can be different).

         VA and VB are ONLY used in by-row merges.

    VB varlist
         Varlist is a space delimited list of variables to check (pulled from FileB).
         See above for the details.

     XA varlist ;
         Varlist is a space delimited list of variables to add, from the first input file, to the new dataset.
         Alternatively, you can "rename" variables in your varlist, using  NEWNAME=OLDNAME

         Note that the variables specified in X  SHOULD include ALL the variables you might use; including any index variables
         (say, index variables specified by IDA).

        Hint: to include all variables in the first input file, use:    XA  *  ;

          Examples:
                   XA * ;
                   XA SITEID DISTANCE WAREA SECCI :
                   XA MILES  TIME  AGE=YRSOLD  INCOME  MALE=GENDER ;

          In the last example, the YRSOLD variable (in file B) will be saved to the new data set using
          a variable name of AGE. Similarly, GENDER will be saved using a variable name of MALE.

     XB varlist ;
         Varlist is a space delimited list of variables to add, from the 2nd input file, to the new dataset.
         Alternatively, you can "rename" variables in your varlist, using  NEWNAME=OLDNAME

         See GRBL2_BATCH.TXT for the details. Note that the variables specified in X  SHOULD include
         ALL the variables you might use; including any index variables (say, index variables specified by IDB).

        Hint: to include all variables in the second input file, use:    XB  *  ;


  Note on BY-INDEX merge:
     * If you want the index variables written to the data file, you MUST include them in  XA
       (don't include them in XB).

  Note on APPEND:
     * The rows in the new dataset will have all the variables specified in XA and XB (with duplicates removed).
       If no such variable exists in one of the "appended" datasets, a missing value is used.


  Reminder:
   XA and XB can contain variables with the same name -- the -A, -B, -AB, or -BA option (in the TYPE
   keyphrase) dictates which of these will be used.


    Thus, if:
        XA V1 V2 V3 ;
        XB V1 Z1 Z3 ;
        TYPE -AB -ROW ;
    is specified, then ...
       * Row-by-row concatenation is done.
       * Each row of the new dataset will have 5 variables: V1 V2 V3 Z1 Z3
       * The value of V1 from the first input file is used. However, if this value is missing, then the value of
         V1 from the 2nd input file is used (even if it, also, is missing).

     Alternative:
        You can include the value of both variables (that have the same name in each file)
        by using the NewName=OldName syntax in XA or XB.

        If you rename, then TYPE will compare the "new" names, NOT the old ones!
        Thus, if you
           * if you rename var X1 in file A to Z1 (i.e.; XA  MYID INCOME Z1=X1),
           *  var Y1 in file B to Z1, (i.e.; XB  MYID AGE Z1=Y1),
           then  the TYPE option will be used to chose which value (from File A or from File B)
           will be used for the Z1 variable (in the new dataset).

  On multiple index variables:

    If you are doing a BY-INDEX merge, you must specify IDA and IDB.  IDA, and IDB, can contain 1, or more than 1,
    index variables -- however, they must contain the same number.

    Examples:
        IDA MYID ; IDB  MYID ;
        IDA MYIDA ; IDB  MYIDB ;
        IDA MODEL MAKE YEAR  ; IDB  MODEL TYPE DATE ;

   If you use more than one index variable, matching is done in order.
   In the above example, the following are compared:
       MODEL  <--> MODEL
       MAKE   <--> TYPE
       YEAR   <--> DATE

   In any case (regardless of how many index variables you use), the first row in the 2nd file that matches a row
   in the first file, is used. "Match" means "all values of the index variables are identical".

   If there is no match, missing values are used (for the variables contained in file 2).




=================================================
IIc. XVC: Compute mean and variance/covariance information

XVC is used to create a gauss dataset that contains "class" specific means
and variance/covariance information, for a set of "X" variables.
These are called VCI files.

For example, VCI files are used by the UNC=XB mixed-logit model.

XVC can take two options: -CORR and -NOVC.

   -NOVC : do NOT compute covariance -- just compute mean.
   -CORR : instead of covariance matrix, compute a correlation matrix

The following keywords are used by XVC:
  INPUT    Specify an existing gauss dataset

  OUTDATA  The output data set (the VCI file to create)

  X    A list of variables (from the INPUT file). You can use * to include
      all variables in the INPUT file.

  ID   The "CLASS" variable. The "CLASS" variable identifies observations that
      belong in a class.  Variance information is computed across all observations
      belonging to a class. 

           Note: rows do NOT have to be sorted by this ID variable -- rows with the same value
                 of the ID variable can be scattered randomly throughout the file.
                 This is different than the usage of ID in mixed and other "panel" models.

 CLASS     Name to give the "class" variable identifier in the OUTDATA file. If not
           specified, "CLASSID" is used.

 MEAN     If YES, then a
       i) a row containing means will also be computed. 
      ii) a column, with variable name of _N, will be added. This will contain the 
               number of non-missing values for this variable in this CLASS.
           Note that the _N variable for the "means row" (classid=0 row) will be the total 
           number of rows in this class (including rows with missing values).

 WEIGHT   A weight variable.
     If specified, a weighted covariance (and mean) is computed -- observations with larger
          weights are  given more influence.
     At this time, WEIGHTed correlation matrix are not supported.

A VCI file produced by XVC has a special structure:

If you did NOT specify -NOVC ....

   * The columns have variable names that are identical to the X variable names you choose.
 
   * Two additional variables are included: the "class id" and the "_name_" variables.

        i) the class ID: By default, it has a name of CLASSID -- you can change this using CLASS.
           This identifies the "class".
     
       ii) _NAME_  :  a _NAME_ variable (this can not be modified).
           For each "class", there will be KX (KX=number of X variables specified) rows. 
           The  _NAME_ variable for each row will be a X-variable name.

  *   If you specify MEAN, an additional row and an additional column is added.
     This column has a name of _N_ 
      Note that the _NAME_ variable for this "means" row equals 0

If you DID specify -NOVC ....

   * The columns have variable names that are identical to the X variable names you choose.
   * Two additional variables are included: the "class id" and the "_n_" variables.
     The class id is as described above.
     The _N_ variable is the number of observations in this class.

Basic structure of a VCI file:
   The covariance matrix for the XI & XJ variables of the CTH class (the class with
   a CLASSID variable equal to cth) is stored in the VCI file, in  row where:
        i) The CLASSID variable will have value of CTH (say, cth=1023)
       ii) The _NAME_ variable wil have a value of XI (say, XI=INCOME)
      iii) Use the column for the XJ variable (say, XJ=AGE is the 5th column)


Example: 

   XVC ;
   X X1 X12 X5 X5 ;
   ID FIPS ;
   CLASS CLASSJ
   MEAN YES ;  

could produce

              CLASSJ _NAME_    X1  X12      X5      X4     _N_
           -------------------------------------------------------
               1      X1        20    1       5      -3      16
               1      X12        1   51      -2       6      16   
               1      X5         5   -2      22      12      16
               1      X4        -3    6      12       8      16
               1       0        5.6  13.5   9.25    4.6      16
 


   XVC  -NOVC ;
   X X1 X12 X5 X5 ;
   ID FIPS ;
   CLASS CLASSJ
   MEAN YES ;  

could produce
              CLASSJ    X1   X12      X5     X4     _N_
           -------------------------------------------------------
               1        5.6  13.5   9.25    4.6      16

Notes:

  *  Means and covariances are computed using element-wise deletion of missing values --
     the mean (or covariance) if a given variable (say, X1)  is estimated seperately
     (missing values for other variables, say XM3, row will NOT effect the mean or covariances
      computed for X1).

  *  If MEAN YES is specified, the last row of a class has the "means" -- 
     with CLASSID value  of 0. 
     Note that the last value of the last row is the total number of observations.
     In the above example, which assumes there are no missing values, it is the same as the variable
     specific "number of non-missing" rows (stored in the first 4 rows of the _N variable).

  * A "class" can include observations from many different sites, or whatever. 
    That is, the designation of classes in your data need NOT correspond to any
    other division of your data. In particular, it need not correspond to the
    "panel" or "observation" identifier.
    All that is must do is identify rows of data that are "created" by the 
    same data generating process (in the sense of data that share a mean and covariance).

    In fact, every single alternative in the main gauss dataset 
    can use the same value for class id.

 * Symmetry will be maintained: the value of row XA and column XB will equal row XB column XA.

 * Variance and covariance values of 0 are legit -- they signal no variance (say, if there is only
   one observation in a class)





=================================================
III. Examples:


OUTPUT Status.out  -reset ;

ATOG ;
  TEXT  e:\grbl2\discrete\doubleb.asc ;
  OUTDATA e:\grbl2\discrete\doubleb ;
  vars_1 ;
  onerow yes ;
  run;

Create ;
  TEXT e:\grbl2\discrete\doubleb ;
  OUTDATA e:\grbl2\discrete\doubleb2 ;
  x dep bid1 bidlow bidhigh x1 x2 ;
  new  CYY = dep = 1 ;
  new CYN = dep =2 ;
  new CNY = dep =3  ;
  new CNN= dep = 4 
  new l1 = bidhigh * CYY ;
  new l2= bid1 * CYN
  new  l3= bidlow * CNY ; 
  new   u1 = 1000000 * CYY  ;
  new  u2=bidhigh * CYN ; 
  new u3= bid1 * CNY ;
  new u4=bidlow * CNN ;
  run;


Create  ;
   TEXT e:\projects\grbl2\discrete\doubleb2 ;
  OUTDATA e:\projects\grbl2\discrete\doubleb3 ; 
   x dep bid1 bidlow bidhigh x1 x2 CYY CYN CNY CNN ;
   NEW  LOWBID = l1 + l2 + l3 ;
   new  HIGHBID= U1 + u2 + u3 + u4 ;
   new isrand=rndr(0,1,2) ;
  run;

Merge ;
  FileA e:\grbl2\data\set1 ;
  FileB e:\grbl2\data\set2 ;
  Type -row -big        @ use default match-type of -AB @
  XA X1 X2 X3 Z4 ;      @ but no variable name match @
  XB Y1 Y22  H33 ;
  OUTDATA e:\grbl2\data\set12 ;
 run ;


Merge ;        @ add county level socioeconomics, and replace missing values of income @
  FileA e:\grbl2\data\people ;
  FileB e:\grbl2\data\county ;
  Type  -AB ;                    @ use default of by-index @
  XA INCOME AGE MILES  VISITS FIPS ;
  XB PAG PCRP INCOME   ;         @ County income used if people income is missing @
  IDA  FIPS ;
  IDB FIPS ;
  OUTDATA e:\grbl2\data\people_2 ;

 run;


XVC   ;
  INPUT allobs ;
  outdata obsvc ;
     X price income age distance educ permits ;
  id fips ;
  class class1 ;
  mean yes ;
run;
