Process Data using DSL (from Fraktal SAS Programming)

Aus phenixxenia.org
Zur Navigation springen Zur Suche springen

Zurück

Übersicht

Vorwärts

General Remarks on table-formatted Data Repositories

In 21st century data processing data are kept in databases managed by active server software already listed elsewhere in these guidelines. Hence, the most common scenario met by SAS coders is working on data which are already in tabular format. These are not as aged as, but quite near to the concept SAS uses for its proprietary table format known as SAS datasets. Having emerged in 1990, the SAS Multiple Engine Architecture provides a continuously growing number of means to seamlessly integrate data tables managed by 3rd party database servers. To the SAS programmer these means are known as engines that provide mainly read/write interaction with most data sources used in professional IT.

See chapter DBMS Interaction for technically more detailed information on this topic.

As a result from the forementioned situation we do not put emphasis on data server technology nor vendor when talking about processing data sources of tabular format. Everthing behind a library reference is assumed to be of compatible structure and, hence, treated equally throughout these guidelines. Data processing performed in a data step is done sequentially, i.e. observation by observation or row by row.

Welcome to the universe of record-wise data processing!


Table-to-Table

Code executed Function performed
data hotels_2;
Generate a new data table hotels_2
 set hotels;
Use SET statement to read lines from data table hotels
run;
End data step run-group

This maximum simple table-to-table data step uses a SET statement to loop over the observation/row number to process each line from the source dataset and write the result to the target dataset. No OUTPUT statement is needed here since the data step automatically performs this function at the end of processing unless no explicit OUTPUT statement is used.

Since there's no processing performed here, dataset hotel_2 is a perfect copy of dataset hotel. To copy datasets as a whole SAS provides other means such as PROC COPY. Generating copies line-by-line is the most inefficient method to do this.


Table-to-two-Tables

Code executed Function performed
data random_lo random_hi;
Generate new data tables random_lo and random_hi
 set random;
Use SET statement to read lines from data table random
select (group);
Branch over values of group to appropriately partition the random dataset
when('A','B','C','D') do;
Work on uniformly distributed noise
  if noise < 0.5 
then output random_lo;
else output random_hi;
Write lower interval of equally distributed noise to random_lo and vice versa
end;
Terminate uniform noise branch
otherwise do;
Work on normally distributed noise
  if noise < 0 
then output random_lo;
else output random_hi;
Write lower interval of normal distributed noise to random_lo and vice versa
end;
Terminate normal noise branch
end;
Terminate branching over group
run;
End data step run-group

This is a hard-coded data step of medium range complexity using two methods of branching logic.

  • The IF-THEN-ELSE logic is most appropriate for single value checks and may turn out as a performance de-accelerator when used in IF-THEN-IF-THEN-... cacades.
  • The SELECT-WHEN logic is a condensed branch for any number of values and hence, pretty fast.


Results

When issuing the phrase "vt random_lo" in the command window of your SAS session, the generated table will pop-up in a "Viewtable Window". From the unifom distribution only the lower interval has remained and from the normal distribution only the negative values have remained. This is exactly the result intended.

group seed noise
D 993 0.22943
D 994 0.28024
D 996 0.27041
D 998 0.40269
D 999 0.47663
X 2 -1.04528
X 4 -1.61314
X 8 -0.08472
X 9 -1.37055
X 10 -0.20202


On the contrary, when issuing the phrase "vt random_hi" in the command window of your SAS session, the 2nd generated table will pop-up in a "Viewtable Window". From the unifom distribution only the upper interval has remained and from the normal distribution only the positive values have remained. Again, this is exactly the result intended.

group seed noise
D 988 0.52537
D 992 0.58119
D 995 0.94102
D 997 0.86849
D 1000 0.63694
X 1 0.82346
X 3 2.59510
X 5 0.78804
X 6 1.24662
X 7 1.31970


Zurück

Übersicht

Vorwärts