Data: cURL installation and batch downloading PDF files from a website

cURL Installation

  1. Download cacert.pem from https://curl.haxx.se/docs/caextract.html
  2. Download curl from https://curl.haxx.se/download.html (I use a Windows system, so I downloaded the 64-bit Windows binary)
  3. Extract curl to c:\curl and put cacert.pem in the c:\curl\bin folder
  4. Add an environment variable curl with the path of curl.exe
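
To check the installation from Git Bash, something like the following sketch may help. It assumes the c:\curl\bin location from the steps above (written as /c/curl/bin in Git Bash) and sets PATH for the current session only. CURL_HOME is just a name chosen for this example. CURL_CA_BUNDLE is an environment variable that curl reads when its TLS backend uses CA bundle files, which may not apply to every Windows build.

```shell
# Assumed install location from the steps above, in Git Bash path form.
CURL_HOME="/c/curl/bin"

# Make curl.exe reachable from any folder for this shell session.
export PATH="$CURL_HOME:$PATH"

# Point curl at the downloaded CA bundle (honored by builds that use CA bundle files).
export CURL_CA_BUNDLE="$CURL_HOME/cacert.pem"

# Print the version if curl was found; otherwise say so.
if command -v curl >/dev/null 2>&1; then
    curl --version
else
    echo "curl not found on PATH"
fi
```

If the version line prints, curl can be started from any folder in that session.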

curl is useful for downloading files from a website. The basic command is:

curl -O url

This command needs to be run from the c:\curl\bin folder. The files are downloaded to C:\Users\username\.
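
As a small illustration of how -O picks the local file name, here is a sketch in bash. The example.com URL is hypothetical and nothing is downloaded; the snippet only shows the name that curl -O would use, which is the part of the URL after the last slash, saved in the directory the command is run from.

```shell
# Hypothetical URL, used only for illustration.
url="https://example.com/docs/guide.pdf"

# -O keeps the remote name: everything after the last slash of the URL.
remote_name="${url##*/}"
echo "curl -O would save to: $PWD/$remote_name"

# For comparison (not executed here):
#   curl -O "$url"               keeps the remote name, guide.pdf
#   curl -o my_copy.pdf "$url"   saves under a name you choose
```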

Purpose: download the SAS 9.3 user guide PDF files from https://support.sas.com/documentation/onlinedoc/stat/930/

  1. In a Windows cmd window, under c:\curl\bin, run curl -o index https://support.sas.com/documentation/onlinedoc/stat/930/
    • This generates the file index, which contains the HTML source code of the web page
  2. Open Git Bash and run the following
      1. cd /c
      2. cd curl
      3. cd bin
      4. grep -i pdf index > list
  3. The file list now contains the lines with href="*.pdf". Use Excel's Text to Columns to keep only the names of the PDF files.
  4. Open list in Notepad++. At the bottom of the window it shows "Windows (CR LF)"; right-click it and select "Unix (LF)". This removes the carriage return at the end of each line, which otherwise causes the error "curl: (3) Illegal characters found in URL".
  5. Start a new bash file in Notepad++ with the following code
    1. echo "Start!"
      url=https://support.sas.com/documentation/onlinedoc/stat/930/
      while read -r query
      do
      curl -O "$url${query}"
      echo "$url${query}"
      done < list
    2. Save it as echo
  6. In Git Bash, navigate to the folder containing the echo file and run the following
    1. bash echo
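
The Excel and Notepad++ steps above could also be done on the command line. The following is only a sketch of one way to do it, not a tested replacement for the steps above: it assumes the links in index are written as href="name.pdf" relative to the page URL, and the function names extract_pdf_names and download_all are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: build the list of PDF names from the saved page, then download each one.
base_url="https://support.sas.com/documentation/onlinedoc/stat/930/"

# extract_pdf_names FILE: print the target of every href="...pdf" in FILE, one per line.
extract_pdf_names() {
    tr -d '\r' < "$1" |                    # drop Windows carriage returns
        grep -io 'href="[^"]*\.pdf"' |     # keep only the href="...pdf" attributes
        sed -e 's/^[^"]*"//' -e 's/"$//'   # strip href=" and the closing quote
}

# download_all LISTFILE: fetch every name in LISTFILE relative to base_url.
download_all() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue         # skip empty lines
        echo "$base_url$name"
        curl -O "$base_url$name"
    done < "$1"
}

# Run only when the index file from step 1 is in the current folder.
if [ -f index ]; then
    extract_pdf_names index > list
    download_all list
fi
```

Saved under a name of your choice in the same folder as index, it can be run with bash in the same way as the echo file.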