A tool for downloading web resources is wget. It comes with a feature to mirror whole web sites, but you can also use it to download individual files, like PDFs. In the simplest case this is easy and straightforward:
wget <url>

Example: wget http://localhost/doc.pdf
This instructs wget to download the file doc.pdf from localhost and save it as doc.pdf. It is not as easy when the web server is
- requesting authentication, or
- serving every PDF under a URL that ends in the same file name.
The documentation of wget states that you can provide a username and password for BASIC authentication. But what about a web site that asks for SAML 2.0? You can pass HTTP headers to wget via the parameter --header. This feature makes it easy: log on to the server in a browser and then copy the request headers. These headers contain the session information of your user and can be used by wget to connect as an authenticated user.
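A call with copied headers might look like this (the header and cookie values below are placeholders taken from the example later in this article, not a real session):

```
wget --header='User-Agent: Mozilla/5.0 Chrome/56' \
     --header='Cookie: JSESSIONID=DBE1FED5C040B2DF7' \
     http://localhost/doc.pdf
```

Each copied header line becomes one --header option.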
How to get the HTTP headers
- Log on to the web site
- Open developer tools
- Select a web resource
- Copy the HTTP headers. For cURL, it's just selecting Copy all as cURL. This gives the complete cURL command. For just the headers, select Copy Request Headers.
User-Agent: Mozilla/5.0 Chrome/56
Accept-Encoding: gzip, deflate, sdch, br
Cookie: JSESSIONID=DBE1FED5C040B2DF7;
Each line is one --header parameter for wget. It is not feasible to add all these headers to each wget request individually. For maintainability and better readability these values should be read from a file. Problem: wget cannot read the header parameter from a file. There is no option for something like --header <file_with_headers>. What does exist is the .wgetrc file. This is the configuration file wget reads when called, and in this file it's possible to define HTTP header values. For each HTTP header, create a new “header = <value>” entry in the file.
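With the headers from the example above, ~/.wgetrc might look like this (the values are the placeholders from this article; use the ones copied from your own browser session):

```
# ~/.wgetrc — send these headers with every wget request
header = User-Agent: Mozilla/5.0 Chrome/56
header = Accept-Encoding: gzip, deflate, sdch, br
header = Cookie: JSESSIONID=DBE1FED5C040B2DF7
```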
With this configured in the file, wget will always send these HTTP headers with each request. If the session cookies copied from the browser are still valid, the requests are authenticated and wget is able to download the file.
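If you prefer not to change .wgetrc globally, a small wrapper script can emulate the missing "--header <file_with_headers>" option. This is a sketch, assuming the headers are stored one per line in a file called headers.txt (the file name and its contents are assumptions for this example):

```shell
# Create the header file (normally you would paste the browser headers here).
cat > headers.txt <<'EOF'
User-Agent: Mozilla/5.0 Chrome/56
Cookie: JSESSIONID=DBE1FED5C040B2DF7
EOF

# Turn every non-empty line into one --header=<line> argument.
set --
while IFS= read -r line; do
  [ -n "$line" ] && set -- "$@" --header="$line"
done < headers.txt

# Show the resulting command; remove the leading 'echo' to actually run it.
echo wget "$@" http://localhost/doc.pdf
```

The positional parameters keep each header as a single argument, so spaces inside header values survive without extra quoting.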
Sometimes the file you want to download has a generic URL: every file ends in the same file name at the server, for instance http://localhost/category/doc.pdf, or /uid/E.pdf. In such cases, wget downloads the file and saves it as doc.pdf or E.pdf. This is not a problem when you download just one file, but when you download more files, like 20, wget numbers the files: E.pdf.1, E.pdf.2, E.pdf.3, …
This makes it hard to work with the files. A solution can be to check whether the web server supports content disposition. If so, the server sends the real file name of the document in the HTTP response. The real file name can be seen in the Content-Disposition header as filename.
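One way to check this is to request only the response headers and look for Content-Disposition (the URL is the example from this article; replace <uid> with a real value):

```
wget --server-response --spider 'http://localhost/<uid>/E.pdf' 2>&1 | grep -i 'Content-Disposition'
```

--spider asks wget not to download the body, and --server-response prints the HTTP response headers, which go to stderr.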
With content-disposition, wget can save the file downloaded from /<UID>/E.pdf under the server-provided name, which contains the UID, instead of E.pdf. As the UID is unique, the file can easily be identified after download.
wget --content-disposition http://localhost/<uid>/E.pdf
Given the above example, the downloaded file is saved locally as 2399104_E_20170304.pdf