Data Step Programming (from Fraktal SAS Programming)
What is this?
The three-part name "Data Step Programming" is best explained word by word:
- DATA is the SAS technical term for values operated on.
- STEP is the SAS conceptual name for a self-contained segment of code; SAS programs are structured segment by segment.
- PROGRAMMING is the SAS term for coding a scripted (not compiled) algorithm.
Data Step Programming is done using the "Data Step Language" (DSL). The Data Step Language is a fully equipped third-generation language, modelled on IBM Corporation's "PL/1", which was once regarded as a candidate successor to FORTRAN.
What does it do?
Simply put, SAS Data Step Programming processes one "observation" at a time when generating a "dataset". An observation is a line of data, known as a "row" to the RDBMS specialist coding SQL; the concept derives from the punch card of the pioneering age of IT. A dataset, accordingly, is a table made of observations that share a common structure.
Observations are processed in a one-line register called the "Program Data Vector" (PDV).
Generally speaking, each line of DSL code applies some function to the PDV. The altered content of the PDV is then written to the dataset being generated, either implicitly at the end of each pass through the step or explicitly by means of an "OUTPUT" statement.
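The difference between implicit and explicit writing can be illustrated with a minimal sketch. The dataset names work.implicit and work.explicit and the variables i and square are chosen for illustration only and do not come from the text above.

```sas
/* Implicit output: the step contains no OUTPUT statement, so the
   PDV is written once when the end of the step is reached.
   Result: one observation (i=1, square=1). */
data work.implicit;
    i = 1;
    square = i * i;
run;

/* Explicit output: because an OUTPUT statement appears in the step,
   the implicit write at the end of the step is switched off and
   the PDV is written only where OUTPUT is executed.
   Result: three observations (i=1,2,3 with square=1,4,9). */
data work.explicit;
    do i = 1 to 3;
        square = i * i;
        output;
    end;
run;
```

In the second step the PDV is written three times within a single pass, which shows that the number of observations written is controlled by OUTPUT rather than by the number of code lines.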
The Data Step
A SAS code segment written in DSL is called a Data Step.
- A Data Step starts with a "DATA" statement, containing up to 32 names for datasets to be created in this step.
- When reading from an already existing data source in
  - text file format, an "INFILE" statement is used with a file reference that points to the text file, followed by an "INPUT" statement to code the two-dimensional reading, that is, by line and by column;
  - tabular format, a "SET" statement is used with the library reference and table name joined by a period; it initiates looping over the records (or "observations" or "rows") in the data source, applying the DSL-coded algorithm to each record.
- When all processing is coded, a "RUN" statement is issued to invoke the SAS DSL compiler, which then performs a compile-and-go, immediately producing results such as the generated datasets.
Being terminated with a RUN statement, each Data Step is called a "Run Group".
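The two ways of reading described above might look as follows. This is a sketch under assumptions: the dataset names work.people and work.people_next_year and the variables name, age and age_next_year are hypothetical, and in-stream DATALINES stands in for an external text file so that the example runs on its own. With a real file, a file reference (for instance one assigned by a FILENAME statement) would take the place of the DATALINES keyword in the INFILE statement.

```sas
/* Text file format: INFILE names the source and INPUT describes
   the fields. DATALINES replaces an external file here so that
   the sketch is self-contained. */
data work.people;
    infile datalines dlm=',';
    input name $ age;
    datalines;
Ada,36
Grace,45
;
run;

/* Tabular format: SET reads work.people (library reference WORK,
   table name PEOPLE) one observation per pass through the step. */
data work.people_next_year;
    set work.people;
    age_next_year = age + 1;
run;
```

Each step ends with RUN, so the first dataset already exists when the second step reads it.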
Hello World
No surprise: the standard introduction to coding looks like PL/1 code, except for the DATA and RUN statements:
data _NULL_; put 'hello world'; run;
Not important here, but quite useful once in a while, is the dataset name "_NULL_": no dataset is created, yet the statements of the step are still executed. Here that is only the PUT statement, which writes its text to the SAS log.
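As a small sketch of where "_NULL_" becomes useful, the step below reads a table and writes selected values to the log without creating any dataset. It assumes that the sample table SASHELP.CLASS, which ships with standard SAS installations, is available; any other table and variable names could be used instead.

```sas
/* No dataset is created; each observation read by SET is
   written to the SAS log in the form name=... age=... */
data _null_;
    set sashelp.class;
    put name= age=;
run;
```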