Parallel download of files using curl
In a previous blog, I showed how to download files using wget. The interesting part of that post was passing the authentication cookies to the server and using the file name given by the Content-Disposition header when saving the file. That example downloaded a single file. What if you want to download several files from a server? Maybe hundreds or even thousands of files? Neither wget nor curl can read the locations from a file and download them in parallel on their own. You can run the download as a sequence, letting wget/curl fetch the files one by one as shown in my other blog: simply loop over the URL list until you reach the end, as in the sketch below.
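As a rough sketch (using the same urls.txt and headers.txt files described below, and the same curl options as the parallel command), the sequential variant could look like this:

# Sequential download: curl fetches one URL after the other
while read -r url; do
  curl -O -J -H "$(cat headers.txt)" "$url"
done < urls.txt

Each iteration waits for the previous download to finish, which is exactly what the parallel approach below avoids.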
Commands
For downloading a large number of files in parallel, you'll have to start the download command several times in parallel. To achieve this, several bash tools must be combined.
Create the list of files to download. This is the same as shown in my previous blog.
for i in {1..100}; do printf "https://server.fqdn/path/to/files/%07d/E\n" "$i" >> urls.txt; done
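The file urls.txt then contains one URL per line, for example:

https://server.fqdn/path/to/files/0000001/E
https://server.fqdn/path/to/files/0000002/E
...
https://server.fqdn/path/to/files/0000100/E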
Start the parallel download of the files. Ten curl processes run in the background. This is an enhanced version of the curl download command from my previous blog; xargs is used to run several instances of curl.
nohup sh -c 'cat urls.txt | xargs -P 10 -n 1 curl -O -J -H "$(cat headers.txt)"' >nohup.out 2>&1 &
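While the downloads are running, progress can be checked. The following commands are not part of the setup itself; they compare the number of downloaded files against the URL list and show how many curl processes are active:

pgrep -c curl        # number of curl processes currently running
ls | wc -l           # files downloaded so far (run in the download directory)
wc -l < urls.txt     # total number of URLs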
Explanation
The first command creates the list of URLs to download and stores them in the file urls.txt.
The second command is more complex. First, cat prints the content of urls.txt to standard output. Then, xargs reads from standard input and uses each line as input for the curl command. For authentication and other headers, the content of the file headers.txt is used. For the first line, the resulting curl call is:
curl -O -J -H "$(cat headers.txt)" https://server.fqdn/path/to/files/0000001/E
The parameter -P 10 tells xargs to run up to 10 commands in parallel. It takes the first 10 lines of input and starts a new curl process for each one; as soon as one finishes, the next URL is processed. Therefore, 10 curl processes are running in parallel. To run more downloads in parallel, use a higher value for -P, such as 20 or 40. The parameter -n 1 makes xargs pass exactly one URL to each curl invocation.
To keep the download running in the background, the whole pipeline is wrapped in sh -c and started with nohup. All output is redirected to nohup.out: >nohup.out 2>&1
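The log file can be followed while the job is running:

tail -f nohup.out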
SSH
To keep the download running even if the SSH session is disconnected, the tool screen should be used. After logging on via ssh, start screen, run the above command, and press CTRL+A, then D to detach from the screen session.
ssh user@server.fqdn
screen
nohup sh -c 'cat urls.txt | xargs -P 10 -n 1 curl -O -J -H "$(cat headers.txt)"' >nohup.out 2>&1 &
CTRL+A, then D   (detach from screen)
exit
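To check on the download later, log on again and reattach to the screen session (screen -ls lists the available sessions if there is more than one):

ssh user@server.fqdn
screen -r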