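This guide explains a Bash script that recursively downloads the pages of a website, extracts the .html links from the downloaded pages, counts and displays them, and then removes the temporary files.
To run this script on a Linux system or in Termux, you need the following tools installed: wget, grep, awk, sort, and uniq.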
On Debian-based systems (such as Ubuntu), install these tools using:
sudo apt-get install wget grep gawk coreutils
If you're using Termux, install the required tools with pkg instead. The sort and uniq commands come from the coreutils package, which handles sorting and removing duplicates.
# Update repositories and upgrade existing packages
pkg update && pkg upgrade -y
# Install wget
pkg install wget -y
# Install grep
pkg install grep -y
# Install awk
pkg install gawk -y
# Install sort and uniq (provided by coreutils)
pkg install coreutils -y
To confirm that each tool is installed and functioning correctly, run the following commands:
wget --version # Verify wget installation
grep --version # Verify grep installation
awk --version # Verify awk installation
sort --version # Verify sort availability
uniq --version # Verify uniq availability
Each command should display the version information, confirming successful installation.
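The individual checks above can also be collapsed into one loop. The following is a minimal sketch, assuming a helper named check_tools (a hypothetical name, not part of the original script) that uses the shell's built-in command -v to test whether each tool is on the PATH:

```shell
# check_tools is a hypothetical helper, not part of the original script.
# It prints the names of any tools that cannot be found on the PATH.
check_tools() {
    local missing=""
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$tool" > /dev/null 2>&1; then
            missing="${missing:+$missing }$tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing tools: $missing"
        return 1
    fi
    echo "All required tools are installed."
}

# Tools used by the link-extraction script
check_tools wget grep awk sort uniq || echo "Install the missing tools before running the script."
```

Unlike the --version checks, this only confirms that each command exists; it does not confirm which version is installed.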
The complete script is shown below:
#!/bin/bash
# Main URL
url="https://www.miralishahidi.ir"
# Output file to store the links
output_file="html_links.txt"
# Display message to start downloading the page
echo "Starting to download and extract .html links from $url ..."
# Step 1: Recursively download all pages
echo "Step 1: Downloading all pages recursively..."
wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"
echo "Step 1 completed: Pages downloaded."
# Step 2: Extract .html links from the downloaded files
echo "Step 2: Extracting .html links..."
grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"
echo "Step 2 completed: Links extracted."
# Step 3: Count the number of extracted links
echo "Step 3: Counting the extracted links..."
count=$(wc -l < "$output_file")
echo "Step 3 completed: Number of .html links extracted: $count"
# Step 4: Display the extracted links
echo "Step 4: Displaying the extracted links..."
echo "The extracted links have been saved in the file $output_file:"
cat "$output_file"
# Clean up temporary files
echo "Cleaning up temporary files..."
rm -rf temp
echo "Temporary files cleaned up."
# Final message
echo "Operation completed successfully."
The script starts by defining two variables:
url="https://www.miralishahidi.ir": Sets the URL from which HTML files are downloaded.
output_file="html_links.txt": Specifies the output file where the extracted .html links are stored.
The following command downloads all HTML pages recursively from the specified URL:
wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"
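In this command, --recursive with --level=inf follows links to any depth, and --no-parent keeps wget from ascending above the starting URL.
--accept "*.html" keeps only files with the .html extension.
--no-check-certificate disables TLS certificate validation, so use it only with sites you trust.
--output-file=wget.log directs log messages to wget.log. Because --quiet suppresses wget's normal output, this file typically stays mostly empty.
--directory-prefix=temp saves the downloaded files in the temp directory.
The next step extracts .html links from the downloaded files: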
grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"
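grep searches the temp directory recursively for href attributes that point to .html files, printing only the matching text (-o) without file names (-h).
awk splits each match on double quotes and prints the second field, which is the link itself.
sort and uniq order the links and remove duplicates, and the result is saved to html_links.txt.
The script then counts the number of lines (links) in html_links.txt using the following command: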
count=$(wc -l < "$output_file")
The script then echoes the count to the user.
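To see the extraction and counting steps in isolation, the pipeline can be run against a small hand-made page instead of a real download. This is a sketch under stated assumptions: the sample directory, the file names, and the links inside it are invented for illustration, and sort -u is used as a shorthand for sort | uniq.

```shell
# Build a small sample directory standing in for wget's temp/ folder.
sample_dir=$(mktemp -d)
cat > "$sample_dir/index.html" <<'EOF'
<a href="about.html">About</a>
<a href="docs/guide.html">Guide</a>
<a href="about.html">About again</a>
<a href="style.css">Not a page</a>
EOF

# Keep the result file outside the directory that grep scans.
links_file=$(mktemp)

# Same pipeline as the script; sort -u is equivalent to sort | uniq
grep -Eroh 'href="[^"]+\.html"' "$sample_dir" | awk -F\" '{print $2}' | sort -u > "$links_file"

# One link per line, so the line count is the link count
count=$(wc -l < "$links_file")
links=$(cat "$links_file")

echo "Number of .html links extracted: $count"
echo "$links"

# Remove the sample data
rm -rf "$sample_dir" "$links_file"
```

The duplicate about.html entry is collapsed and style.css is ignored, so the sample yields two links: about.html and docs/guide.html. Note that the links are printed exactly as written in the href attribute, so relative links stay relative.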
The extracted links are displayed using the cat $output_file command.
The script cleans up temporary files created during the operation:
rm -rf temp
This command deletes the temp directory and everything in it. The wget.log file is not removed.
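If the script is interrupted or fails before it reaches rm -rf temp, the temp directory is left behind. One possible alternative, sketched below and not what the original script does, is to create the working directory with mktemp -d and remove it from an EXIT trap. The function name run_in_temp_dir and the stand-in file are hypothetical.

```shell
# run_in_temp_dir is a hypothetical name; the body runs in a subshell ( ... )
# so the EXIT trap fires as soon as the function finishes.
run_in_temp_dir() (
    work_dir=$(mktemp -d) || exit 1
    # Delete the directory on exit, even if a later command fails
    trap 'rm -rf "$work_dir"' EXIT

    # Stand-in for the download step: wget --directory-prefix="$work_dir" ...
    echo '<a href="page.html">Page</a>' > "$work_dir/index.html"

    echo "Working directory: $work_dir"
    ls "$work_dir"
)

run_in_temp_dir
```

Using mktemp -d also avoids deleting an unrelated directory that happens to be named temp in the current folder.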
A message is displayed indicating the successful completion of the operation.
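The script prints this message unconditionally, even if the download failed and no links were found. A minimal sketch of a stricter ending, assuming a helper named report_result (hypothetical, not part of the original script), reports success only when the output file is non-empty:

```shell
# report_result is a hypothetical helper, not part of the original script.
report_result() {
    local result_file="$1"
    # -s is true only if the file exists and is not empty
    if [ -s "$result_file" ]; then
        echo "Operation completed successfully."
    else
        echo "No .html links were extracted; check the URL and your connection." >&2
        return 1
    fi
}

# Demonstration with a throwaway file holding one link
demo_file=$(mktemp)
echo "about.html" > "$demo_file"
report_result "$demo_file"
rm -f "$demo_file"
```

In the script itself, the call would be report_result "$output_file" in place of the final echo.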
In Termux, to copy a file from the Termux environment to your Android device's internal storage, first grant Termux access to that storage, then use the cp command to copy the file.
First, you need to enable access to internal storage for Termux. Run the following command:
termux-setup-storage
This command displays a permission request that you need to approve. After approval, a storage directory is created in Termux's home directory (~/storage), containing links to your device's internal storage.
Once access is granted, you can copy the html_links.txt file to internal storage using the following command:
cp html_links.txt /sdcard/
This command copies html_links.txt to /sdcard/, which on Android is the root of the shared internal storage rather than a removable SD card. To place the file in a specific folder, change the destination path.
To copy the file to the Download folder in internal storage, use:
cp html_links.txt /sdcard/Download/
After copying, you can use the device's file manager application to navigate to the desired folder (such as Download) and verify that the file exists.
If the file does not appear in the destination, check that the necessary permissions have been granted to Termux.
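The copy and the permission check can be combined so that a missing permission is reported immediately rather than discovered later. This is a sketch only: copy_to_storage is a hypothetical helper, and it simply tests that the destination directory exists and is writable before calling cp.

```shell
# copy_to_storage is a hypothetical helper; pass it a source file and a destination directory.
copy_to_storage() {
    local src="$1"
    local dest_dir="$2"
    if [ ! -f "$src" ]; then
        echo "Source file not found: $src" >&2
        return 1
    fi
    # Without storage permission the destination is missing or not writable
    if [ ! -d "$dest_dir" ] || [ ! -w "$dest_dir" ]; then
        echo "Cannot write to $dest_dir; run termux-setup-storage and grant the permission." >&2
        return 1
    fi
    cp "$src" "$dest_dir"/ && echo "Copied $src to $dest_dir"
}

# Example call in Termux (commented out so the snippet runs anywhere):
# copy_to_storage html_links.txt /sdcard/Download
```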