How to use find to sort files across folders

Short version

You have files named File0.txt to File100.txt spread across different folders and want to move the first 30 files to a separate directory (the commands below are for Mac users; Linux users can use plain find and mv):

For sorting FileNN.txt (character + number)

gfind -type f -printf "%f %p\n" | sort -n -k 1.5 | sed 's/.* //' | head -30 | xargs gmv -t ./A/

For sorting NN.txt (numeric filename)

gfind -type f -printf "%f %p\n" | sort -n | sed 's/.* //' | head -30 | xargs gmv -t ./A/

Preparation

For the below commands to work, you'll need GNU find. If you are using a Mac, install the GNU versions of find and mv via Homebrew.

brew install findutils coreutils
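
Homebrew installs these tools with a g prefix, so GNU find is available as gfind and GNU mv as gmv. A quick check that both are on your PATH:

gfind --version | head -1
gmv --version | head -1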

Create a test folder structure. There will be 3 folders and several files in them.

mkdir 1
mkdir 2
mkdir 3

Create 101 files named FileNN.txt with sample content and place each of them randomly in one of the three directories.

for i in {0..100}
  do
    Num=$((1 + RANDOM % 3))
    echo hello > "$Num/File${i}.txt"
done
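
To see how the random distribution turned out, you can count the files per folder, for example:

for d in 1 2 3; do printf "%s: %s files\n" "$d" "$(ls "$d" | wc -l)"; done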

After running the above script, each of the three folders contains a random subset of the files; you can inspect the structure with ls -R.

Also create the target directory A:

mkdir A

Commands

After the initial setup is done, we have several files in 3 directories. If you use find to get a list of all files, you’ll see that the output is not sorted.

gfind ./ -type f

The standard Unix command for sorting is sort. Applying sort in this scenario won't help, as the lines are sorted by folder name first:

gfind ./ -type f | sort -n

The output is now sorted by folder name first and only then by file name, not by file name alone. Taking the first 50 entries therefore won't give you File1 to File50: a line like ./1/File99.txt sorts before ./2/File1.txt, and the files are distributed randomly across the directories.

The solution suggests itself: sort on the file name only, while still having the complete path in the output, so the path can be piped to the move command. find offers exactly this possibility: printing specific fields. The -printf parameter controls the output; %f prints the file name, while %p includes the folder.

gfind -type f -printf "%f\n"

The command prints only the file names.

To output the files with their paths, use %p. In both cases, \n puts each file on its own line.

gfind -type f -printf "%p\n"

Both output parameters can be combined: %f %p\n first prints the file name, then a space, then the path.

gfind -type f -printf "%f %p\n"

Applying sort on this output will sort on the file name only.

gfind -type f -printf "%f %p\n" | sort -n

Close, but not exactly how it should be. If your file names consist only of numbers, this already works. In the example, however, the file name starts with characters, so the sort goes wrong: it starts with File0.txt, then File1.txt, but then comes File10.txt instead of File2.txt. To sort by the number, give sort an additional parameter: -k 1.5. Because the file name begins with a fixed prefix (File, 4 characters), this instructs sort to start the sort key at character 5 of the first field, skipping the prefix and comparing only the numbers.
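
A minimal illustration of the key position, using two lines where plain string order and numeric order differ:

printf "File10.txt\nFile2.txt\n" | sort -n -k 1.5

This prints File2.txt before File10.txt, because the keys 10.txt and 2.txt are compared numerically.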

Note: you can apply the same sort parameter to ls output instead of find. As long as every path has the same length, it works: for folders named 1..9 the number sits at a fixed position, but as soon as a folder name has two or more characters (like 10, 213, or test), the key position needs to be adjusted.

List all files with their directory names using ls:

ls -d1 */*

Sort by number in filename:

ls -d1 */* | sort -n -k 1.7

Applied to the find output, the command is:

gfind -type f -printf "%f %p\n" | sort -n -k 1.5

With the last command, the output is correctly sorted by file name. Now, how can this output be used to move the files to the target directory? Simply piping it to mv won't work: the first part with the file name is no longer needed, only the second part with the path. The two parts are separated by a blank, and with sed it's possible to remove everything up to and including that blank:

gfind -type f -printf "%f %p\n" | sort -n -k 1.5 | sed 's/.* //'
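
For a single line, this step turns File0.txt ./2/File0.txt into ./2/File0.txt:

echo "File0.txt ./2/File0.txt" | sed 's/.* //'

Note that .* is greedy, so sed deletes everything up to the last blank in the line; that is safe here because neither the folder nor the file names contain blanks.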

The last step is to move the files to the target directory with mv. To avoid moving all files, head takes only the first 30. GNU mv is needed here, as the default macOS BSD mv does not support the -t parameter. To pass the files to gmv, xargs is used.

gfind -type f -printf "%f %p\n" | sort -n -k 1.5 | sed 's/.* //' | head -30 | xargs gmv -t ./A/

Result

Folder A now contains the first 30 files, File0.txt through File29.txt.

gls -1v ./A

Parallel download of files using curl

In a previous blog, I showed how to download files using wget. The interesting part of that blog was passing the authentication cookies to the server and using the file name given by the Content-Disposition directive when saving the file. The example in the previous blog downloaded a single file. What if you want to download several files from a server? Maybe hundreds or even thousands of files? Neither wget nor curl will download a list of URLs in parallel by itself. You can run the download as a sequence, letting wget/curl fetch the files one by one as shown in my other blog: just use a FOR loop until you reach the end.

Commands

To download a large number of files in parallel, you'll have to start the download command several times in parallel. To achieve this, several shell programs must be combined.

Create the list of files to download. This is the same as shown in my previous blog.

for i in {1..100}; do printf "https://server.fqdn/path/to/files/%0*d/E\n" 7 $i >> urls.txt; done
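
The %0*d mask pads each number to 7 digits, so the first lines of urls.txt look like this (head -3 urls.txt):

https://server.fqdn/path/to/files/0000001/E
https://server.fqdn/path/to/files/0000002/E
https://server.fqdn/path/to/files/0000003/E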

Start the parallel download of the files: 10 instances of curl run in the background. This is an enhanced version of the curl download command from my previous blog; xargs is used to run several instances of curl in parallel.

nohup cat urls.txt | xargs -P 10 -n 1 curl -O -J -H "$(cat headers.txt)" >nohup.out 2>&1 &

Explanation

The first command creates the list of files to download and stores it in the file urls.txt.

The second command is more complex. First, cat prints the content of urls.txt to standard output. Then xargs reads from standard input and uses each line as input for the curl command. For authentication and other headers, the content of the file headers.txt is used. For the first line, the resulting curl invocation is:

curl -O -J -H "$(cat headers.txt)" https://server.fqdn/path/to/files/0000001/E

The parameter -P 10 tells xargs to run the command 10 times in parallel: it takes the first 10 lines of input and starts a new curl process for each, so 10 curl processes run at the same time; -n 1 passes exactly one URL to each invocation. To run more downloads in parallel, use a higher value for -P, like 20 or 40.

To keep the download running in the background, nohup is used, and all output is redirected to nohup.out: >nohup.out 2>&1
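
To check on the running downloads, you can watch the log file or count the active curl processes, for example:

tail -f nohup.out
pgrep -c curl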

SSH

To keep the download running while being logged on via SSH, use the tool screen. After logging on via ssh, start screen, run the above command, and press CTRL+A, then D, to detach from the screen session.

ssh user@server.fqdn
screen
nohup cat urls.txt | xargs -P 10 -n 1 curl -O -J -H "$(cat headers.txt)" >nohup.out 2>&1 &
CTRL+A+D
exit
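
To check on the download later, log on again and reattach the screen session:

ssh user@server.fqdn
screen -r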

Download files with leading zeros in the name using wget

In my previous blog I showed how wget can be used to download a file from a server using HTTP headers for authentication, and how to use the Content-Disposition directive sent by the server to determine the correct file name. With the information from that blog it's possible to download a single file from a server. But what if you must download several files? Maybe hundreds or thousands of files? Files whose names are created using a mask, adding leading zeros?

Add leading zeros

What you need is a list of files to download. I'll follow the example from my previous post: the file names follow a specific pattern, a number. All files are numbered from 1 to n. To make it more complicated, it's not simply 1 to n. A mask is applied: 7 digits in total, with leading zeros. 123 becomes 0000123, and 5301 becomes 0005301. In recent versions of Bash, you can use a FOR loop to iterate over the numbers and printf to format the output and add the leading zeros. To get the numbers correctly formatted, the command is:

for i in {140000..140005};
  do printf "%0*d\n" 7 $i;
done

This prints the numbers 140000 to 140005 with leading zeros: 0140000 through 0140005.

Start download

Combining printf with the wget command allows you to download the files. The execution flow: the FOR loop together with printf creates the correctly masked URL, and wget downloads the file. After the file is downloaded, the next iteration of the FOR loop starts, and the next file is downloaded. Assuming I have PDF documents named 0140000.pdf to 0140005.pdf on the server http://localhost:9080, the FOR loop with wget is:

for i in {140000..140005};
  do wget -nc --content-disposition "$(printf "http://localhost:9080/%0*d.pdf" 7 $i)";
done

Result
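
After the loop finishes, the current directory contains the six documents. A quick check, assuming the server sent the expected Content-Disposition file names:

ls -1 014000*.pdf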

Alternative

The above example uses wget. Of course, you can do the same with curl; a sketch follows below.
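
A minimal sketch of the curl variant, reusing the -O -J flags from the parallel-download section above (-O saves under the remote file name, -J lets the Content-Disposition header override it):

for i in {140000..140005};
  do curl -O -J "$(printf "http://localhost:9080/%0*d.pdf" 7 $i)";
done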