This document explains a Bash script that recursively downloads HTML pages from a specified URL, extracts all the .html links from the downloaded pages, counts the number of extracted links, and then displays them. The script also cleans up temporary files created during the operation.
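Before running this script on a Linux system (or Termux), ensure the following tools are installed: wget, grep, awk, and coreutils (which provides sort, uniq, wc, cat, and rm).
To install these tools on a Debian-based system such as Ubuntu, use the following command (awk is a virtual package there, so a concrete implementation such as gawk is named instead):
sudo apt-get install wget grep gawk coreutils
In Termux, which does not use sudo, the equivalent command is:
pkg install wget grep gawk coreutils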
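Before running the script, it can help to confirm that each tool is actually available on the PATH. The following is a minimal sketch and is not part of the original script; the function name find_missing_tools is an illustrative assumption.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): prints the name of
# every tool in its argument list that cannot be found on the PATH.
find_missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Tools used by the download-and-extract script described in this document.
missing=$(find_missing_tools wget grep awk sort uniq wc)

if [ -z "$missing" ]; then
  echo "All required tools are installed."
else
  echo "Missing tools:"
  echo "$missing"
fi
```

The check only reports what is missing; it does not install anything or stop the shell.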
#!/bin/bash
# Main URL
url="https://www.miralishahidi.ir"
# Output file to store the links
output_file="html_links.txt"
# Display message to start downloading the page
echo "Starting to download and extract .html links from $url ..."
# Step 1: Recursively download all pages
echo "Step 1: Downloading all pages recursively..."
wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"
echo "Step 1 completed: Pages downloaded."
# Step 2: Extract .html links from the downloaded files
echo "Step 2: Extracting .html links..."
grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"
echo "Step 2 completed: Links extracted."
# Step 3: Count the number of extracted links
echo "Step 3: Counting the extracted links..."
count=$(wc -l < "$output_file")
echo "Step 3 completed: Number of .html links extracted: $count"
# Step 4: Display the extracted links
echo "Step 4: Displaying the extracted links..."
echo "The extracted links have been saved in the file $output_file:"
cat "$output_file"
# Clean up temporary files
echo "Cleaning up temporary files..."
rm -rf temp
echo "Temporary files cleaned up."
# Final message
echo "Operation completed successfully."
url="https://www.miralishahidi.ir": Sets the URL from which HTML files are to be downloaded.
output_file="html_links.txt": Specifies the output file where extracted .html links will be stored.
The following command is used to download all HTML pages recursively from the specified URL:
wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"
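--recursive: Downloads pages recursively by following links.
--no-parent: Prevents wget from ascending to the parent directory of the starting URL.
--accept "*.html": Keeps only files with the .html extension.
--level=inf: Places no limit on the recursion depth.
--no-check-certificate: Skips validation of the server's TLS certificate.
--quiet: Suppresses wget's normal output.
--output-file=wget.log: Writes log messages to wget.log.
--directory-prefix=temp: Saves the downloaded files in the temp directory.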
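A long wget command line can be easier to read and maintain when its options are kept in a Bash array, one per line. The sketch below uses the same options as the command above and only prints the resulting command (a dry run); the array name wget_args is an illustrative assumption, not something the original script defines.

```shell
#!/bin/bash
url="https://www.miralishahidi.ir"

# The same options as the original command, one per line so each can carry a comment.
wget_args=(
  --recursive                # follow links found in downloaded pages
  --no-parent                # never ascend above the starting directory
  --accept "*.html"          # keep only files whose names end in .html
  --level=inf                # no limit on recursion depth
  --no-check-certificate     # skip TLS certificate validation (insecure)
  --quiet                    # suppress wget's normal output
  --output-file=wget.log     # write log messages to wget.log
  --directory-prefix=temp    # save everything under ./temp
)

# Dry run: print the command instead of executing it.
echo "wget ${wget_args[*]} $url"

# To actually start the download, run:
# wget "${wget_args[@]}" "$url"
```

Quoting the array as "${wget_args[@]}" passes each option to wget as a separate argument, so the "*.html" pattern is not expanded by the shell.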
This step extracts .html links from the downloaded files:
grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"
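grep -Eroh 'href="[^"]+\.html"' temp: Searches recursively for all double-quoted href attributes that point to .html links in the temp directory, printing only the matching text without file names.
awk -F\" '{print $2}': Splits each match on double quotes and prints the second field, which is the link itself.
sort | uniq: Sorts the links and removes duplicates.
The redirection (>): Writes the result to the file named by output_file, which is html_links.txt.
The script counts the number of lines (links) in html_links.txt using the following command: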
count=$(wc -l < "$output_file")
The script then echoes the count to the user.
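The extraction and counting steps can be tried without downloading anything. The sketch below runs the same pipeline against a throwaway directory containing one hand-written sample page; the sample links and the variable name sample_dir are made up for illustration.

```shell
#!/bin/bash
# Build a throwaway directory with one sample page (contents are made up).
sample_dir=$(mktemp -d)
cat > "$sample_dir/index.html" <<'EOF'
<a href="about.html">About</a>
<a href="contact.html">Contact</a>
<a href="about.html">About (duplicate)</a>
<a href="style.css">Not an .html link</a>
EOF

# Keep the output outside the searched directory so grep never reads it.
output_file=$(mktemp)

# Same pipeline as the script, pointed at the sample directory.
grep -Eroh 'href="[^"]+\.html"' "$sample_dir" | awk -F\" '{print $2}' | sort | uniq > "$output_file"

# Count and display the unique links.
count=$(wc -l < "$output_file")
echo "Number of .html links extracted: $count"
cat "$output_file"

rm -rf "$sample_dir"
```

With this sample page, the output file contains two unique links, about.html and contact.html: the duplicate is removed by sort and uniq, and the style.css link never matches the pattern.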
The extracted links are displayed using the cat "$output_file" command.
The script cleans up temporary files created during the operation:
rm -rf temp
This command deletes the temp directory and everything in it. The wget.log file, which is written to the current directory, is not removed.
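Because the cleanup runs as the last step, the temp directory is left behind if the script is interrupted earlier. One common alternative, shown here only as a sketch and not as the original script's method, is to create the directory with mktemp and register the cleanup with trap. The subshell and the path_file variable exist only so that the effect can be observed in this demonstration.

```shell
#!/bin/bash
# File used only to report the temporary directory's path out of the subshell.
path_file=$(mktemp)

(
  work_dir=$(mktemp -d)              # unique directory instead of a fixed "temp"
  trap 'rm -rf "$work_dir"' EXIT     # runs when this subshell exits, for any reason
  echo "$work_dir" > "$path_file"
  touch "$work_dir/page.html"        # stand-in for downloaded pages
  exit 1                             # simulate a step failing partway through
) || echo "A step failed, but the EXIT trap still ran."

work_dir_path=$(cat "$path_file")
rm -f "$path_file"

if [ -d "$work_dir_path" ]; then
  echo "Temporary directory still exists: $work_dir_path"
else
  echo "Temporary directory was removed."
fi
```

In a real script the trap would be set once near the top, without the subshell, so that the directory is removed however the script ends.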
A message is displayed indicating the successful completion of the operation.
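Note that the script prints this message unconditionally, even if wget or grep failed. A minimal sketch of reporting success only when every step succeeded is shown below; run_step is a hypothetical helper, and true and false stand in for the real commands.

```shell
#!/bin/bash
# Hypothetical helper: runs a command and reports whether it succeeded.
run_step() {
  local description="$1"
  shift
  echo "$description..."
  if "$@"; then
    echo "$description: completed."
  else
    local status=$?
    echo "$description: failed with exit status $status." >&2
    return "$status"
  fi
}

# "true" and "false" stand in for real steps such as the wget or grep commands.
if run_step "Step A" true && run_step "Step B" false; then
  echo "Operation completed successfully."
else
  echo "Operation finished with errors."
fi
```

Here the second step fails, so the script reports errors instead of success.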
In Termux, to transfer a file from the Termux environment to the internal storage of your Android device, you first need to grant Termux access to internal storage. Afterward, you can use the cp command to copy the file to internal storage.
First, you need to enable access to internal storage for Termux. Run the following command:
termux-setup-storage
This command displays a permission request that you need to approve. After approval, a storage directory is created in Termux's home directory (~/storage), containing links to your device's internal storage.
Once access is granted, you can copy the html_links.txt file to internal storage using the following command:
cp html_links.txt /sdcard/
This command copies html_links.txt to /sdcard/, the root of the device's shared internal storage. To place the file in a specific folder instead, change the destination path.
To transfer the file to the Download folder in internal storage, use:
cp html_links.txt /sdcard/Download/
After copying, you can use the device's file manager application to navigate to the desired folder (such as Download) and verify that the file exists.
If the file does not appear in the destination, check that the necessary permissions have been granted to Termux.
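The copy and the follow-up check can also be done from the command line. The sketch below is one possible way to do this; copy_and_verify is a hypothetical helper name, and the destination shown in the usage comment assumes that storage access has already been granted.

```shell
#!/bin/bash
# Hypothetical helper: copies a file into a directory and confirms it arrived.
copy_and_verify() {
  local source_file="$1"
  local dest_dir="$2"

  if [ ! -f "$source_file" ]; then
    echo "Source file not found: $source_file" >&2
    return 1
  fi

  # A missing or read-only destination usually means storage access was not granted.
  if [ ! -d "$dest_dir" ] || [ ! -w "$dest_dir" ]; then
    echo "Cannot write to $dest_dir (has termux-setup-storage been run and approved?)" >&2
    return 1
  fi

  cp "$source_file" "$dest_dir/" || return 1

  if [ -f "$dest_dir/$(basename "$source_file")" ]; then
    echo "Copied $source_file to $dest_dir"
  else
    echo "Copy reported success, but the file is not in $dest_dir" >&2
    return 1
  fi
}

# Example call on a Termux device:
# copy_and_verify html_links.txt /sdcard/Download
```

The function returns a non-zero status on any failure, so it can be used in a condition or followed by further checks.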