SAS: Reading Census Data

Census data is exceptionally large. The 2016 census profile for Ontario is 4.5G and has more than 46,694,909 lines of records. To extract the data efficiently, StatsCan provides a csv file that identifies the starting row number for each geography. Using this file, you can compile the parameter list for the geographical area of interest at the selected geographic level, eg. province level, census division level, and census subdivision level.
Census file can be downloaded at link
Step 1: Compile parameter lists

%put &name.;
Canada Ontario Durham York Toronto Peel Halton
 %put &start.;
2 2249 7287023 9513800 12198965 20521853 25289987
 %put &end.;
2248 4495 7289269 9516046 12201211 20524099 25292233

Step 2: Extract and Compile data

%macro ext (namelst=, startlst=, endlst=);
proc datasets library=work noprint;
delete census;
quit;
%let i=1;
%do %while (%scan(&namelst., &i, ' ') ne );
%let parm1=%scan(&namelst., &i, ' ');
%let parm2=%scan(&startlst., &i, ' ');
%let parm3=%scan(&endlst., &i, ' ');
data census_&parm1.;
infile 'X:\Work\Stats Can\98-401-X2016044_ONTARIO_eng_CSV\98-401-X2016044_ONTARIO_English_CSV_data.csv'
delimiter = ',' firstobs=&parm2. obs=&parm3. TRUNCOVER  DSD LRECL=32767 ;
INFORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $80.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $50.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
FORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $50.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $80.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
input
year 
geo_code $
geo_level 
geo_name $
gnr
gnr_lf
quality_flag 
alt_geo_code 
Item $
itemID 
Notes 
Total
Male $
Female $
;
run;
proc append base = census data = census_&parm1.;
run;
%let i = %eval (&i +1);
%end;
%mend ext;
%ext (namelst=&name., startlst=&start., endlst=&end. );

The resulting dataset is only 4.1mb, which you can manipulate efficiently.

Reading Data from fixed position txt file

Read data from fixed position txt file to SAS.

This type of file normally don’t have the variable name as the first observation because the length of the variable name might not be consistent with the length of the variable. The starting position for each variable is the same for all the records in the txt file.

Below in the admit table in fixed position txt file.

2458 Murray, W      M 27 1  72 168 HIGH 85.20
2462 Almers, C      F 34 .  66 152 HIGH 124.80
2501 Bonaventure, T F 31 .  61 123 LOW  149.75
2523 Johnson, R     M 43 31 63 137 MOD  149.75
2539 LaMance, K     M 51 4  71 158 LOW  124.80
2544 Jones, M       M 29 6  76 193 HIGH 124.80
  • Require . for missing value
  • use $ in input statement to identify character variable
  • use X-X to identify the starting position and ending position for the character variables.  This also determines the length of the character variable.
  • use @x to identify the starting position for the numeric variable.
  • use x. to identify the length of the numeric variable. eg: @23 age 2. indicates to read from position 23 the age variable with a length of 2.

Method1:  not identify the position for the numeric variable.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input ID $ 1-4 name $ 6-19 sex $ 21 age date height weight actlevel $ 36-39 fee;
run;

Method2: identify the position for each variable.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input
ID $ 1-4
name $ 6-19
sex $ 21
@23 age 2.  /* length for age is 2 */
@26 date 
@29 height
@32 weight
actlevel $ 36-39
@41 fee ;
run;

Read only  certain observations that meet the criteria for selected variables from the txt file, using 2 input statements and if statement.  Second input statement only contains the required variables, so not all the variables are read into the target SAS dataset.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input actlevel $ 36-39 @23 age  @;
if  age < 35 and actlevel EQ 'HIGH' then delete;
input
name $ 6-19
sex $ 21
@29 height
@32 weight
@41 fee ;
run;

 

SAS: Reading Data from tab delimited txt file

Read tab delimited txt file to sas.

data admit;
length name $14.;
infile ‘C:\Users\Documents\My SAS Files(32)\9.3\Tutorial\admit.txt’ dlm=’09’x firstobs =2  ;
input ID $4. name $ sex $1. age date height weight actlevel $4. fee;
run;

  • missing value in the txt file should be either . or blank space.
  • use dlm=’09’x to read the tab delimited file
  • SAS default character variable length is 8. In order to read the name variable, need to add length statement before the infile statement, otherwise the name field will be truncated at 8.
  • add firstobs =2 if the first line in the txt file is the variable names.
  • use lrecl= 32760 if the observation is very long (longer than the default 256).

Read only certain observations that meet the criteria from the tab delimited file to sas.

Method1:

data admit;
length name $14.;
infile 'C:\Users\Documents\My SAS Files(32)\9.3\Tutorial\admit.txt' dlm='09'x firstobs =2;
input ID $4. name $ sex $1. @;
if sex = 'F' then delete;
input age date height weight actlevel $4. fee;
run;

Method2:

data admit;
length name $14.;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit.txt' dlm='09'x firstobs =2;
input ID $4. name $ sex $1. @ ;
if sex = 'M' then do;
input  age date height weight actlevel $4. fee;end;
else delete;
run;
  • use 2 input statements to select ‘M’ admits.
  • in the first input statement, use one trailing @ to tell SAS to hold the record for if statement. In method 1, the ‘F’ records will not be passed on to the second input statement.
  • for tab delimited data, it is not possible to only input selected variables from the source file.  All the variables need to be input in the variable sequence of source file from left to right.

SAS: Input Raw data with different length of character variable

data map;
infile datalines delimiter=',';
length faculty $2 progname $90. facdesc $25.;
input  faculty $ progname $ facdesc $;
datalines;
AB, AB-Digital Media, , Faculty AB
AB, AB-Interdisc. Fine Arts, Faculty AB
CD,SC-Biology,Science, Faculty CD
;
run;

NOTE:

  • Can’t use Input statement to set length for the variable.  Input statement only identify the character variables by putting $ after the variable name.
  • Use ‘,’ to separate the value of the variable in the datalines; no need to put ‘;’ to end each dataline, just end the whole datalines section with one ‘;’.
  • Put Length statement before the Input statement to set the desired variable length for the data needed input.

BENEFIT:

  • No need to use “” for character values that contain spaces.
  • No need to align the variables in datalines.