A tool for downloading web resources is wget. It comes with a feature to mirror whole web sites, but you can also use it to download individual files, like PDFs. In the simplest case this is easy and straightforward:
wget <url>

Example: wget http://localhost/doc.pdf
This instructs wget to download the file doc.pdf from localhost and save it as doc.pdf. It is not as easy when the web server is
- requesting authentication, or
- serving every PDF under a URL that ends in the same file name.
The documentation of wget states that you can provide a username and password for BASIC authentication. But what about a web site that asks for SAML 2.0? You can pass HTTP headers to wget via the parameter --header. This feature makes it easy: log on to the server in a browser and then copy the request headers. These headers contain the session information of your user and can be used by wget to connect as an authenticated user.
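A call with copied headers might look like this (the header and cookie values below are placeholders taken from the example later in this article, not a real session):

```
wget --header='User-Agent: Mozilla/5.0 Chrome/56' \
     --header='Cookie: JSESSIONID=DBE1FED5C040B2DF7' \
     http://localhost/doc.pdf
```

Each copied header line becomes one --header option.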
How to get the HTTP headers
- Log on to the web site
- Open developer tools
- Select a web resource
- Copy the HTTP headers. For cURL, it's just selecting Copy all as cURL. This gives the complete cURL command. For just the headers, select Copy Request Headers.
User-Agent: Mozilla/5.0 Chrome/56
Accept-Encoding: gzip, deflate, sdch, br
Cookie: JSESSIONID=DBE1FED5C040B2DF7;
Each line is one --header parameter for wget. It is not feasible to add all these headers to each wget request individually. For maintainability and better readability these values should be read from a file. Problem: wget cannot read the header parameter from a file. There is no option for something like --header <file_with_headers>. What does exist is the .wgetrc file. This is the configuration file wget reads when called, and in this file it's possible to define HTTP header values. For each HTTP header, create a new “header = <value>” entry in the file.
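With the headers from the example above, ~/.wgetrc might look like this (the values are the placeholders from this article; use the ones copied from your own browser session):

```
# ~/.wgetrc — send these headers with every wget request
header = User-Agent: Mozilla/5.0 Chrome/56
header = Accept-Encoding: gzip, deflate, sdch, br
header = Cookie: JSESSIONID=DBE1FED5C040B2DF7
```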
With this configured in the file, wget will always send these HTTP headers with each request. If the session cookies copied from the browser are still valid, the requests are authenticated and wget is able to download the file.
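If you prefer not to change .wgetrc globally, a small wrapper script can emulate the missing "--header <file_with_headers>" option. This is a sketch, assuming the headers are stored one per line in a file called headers.txt (the file name and its contents are assumptions for this example):

```shell
# Create the header file (normally you would paste the browser headers here).
cat > headers.txt <<'EOF'
User-Agent: Mozilla/5.0 Chrome/56
Cookie: JSESSIONID=DBE1FED5C040B2DF7
EOF

# Turn every non-empty line into one --header=<line> argument.
set --
while IFS= read -r line; do
  [ -n "$line" ] && set -- "$@" --header="$line"
done < headers.txt

# Show the resulting command; remove the leading 'echo' to actually run it.
echo wget "$@" http://localhost/doc.pdf
```

The positional parameters keep each header as a single argument, so spaces inside header values survive without extra quoting.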
Sometimes the file you want to download has a generic URL: every file ends in the same file name at the server, for instance http://localhost/category/doc.pdf, or /uid/E.pdf. In such cases, wget downloads the file and saves it as doc.pdf or E.pdf. This is not a problem when you download just one file, but when you download more files, like 20, wget numbers the files: E.pdf.1, E.pdf.2, E.pdf.3, …
This makes it hard to work with the files. A solution can be to check whether the web server supports content disposition. If so, the server sends the real file name of the document in the HTTP response. The real file name can be seen in the Content-Disposition header as filename.
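One way to check this is to request only the response headers and look for Content-Disposition (the URL is the example from this article; replace <uid> with a real value):

```
wget --server-response --spider 'http://localhost/<uid>/E.pdf' 2>&1 | grep -i 'Content-Disposition'
```

--spider asks wget not to download the body, and --server-response prints the HTTP response headers, which go to stderr.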
With content-disposition, wget can save the file downloaded from /<UID>/E.pdf under the server-provided name, which contains the UID, instead of E.pdf. As the UID is unique, the file can easily be identified after download.
wget --content-disposition http://localhost/<uid>/E.pdf
Given the above example, the downloaded file is saved locally as 2399104_E_20170304.pdf