Bash Script to Download and Extract HTML Links

This document explains a Bash script that recursively downloads HTML pages from a specified URL, extracts all the .html links from the downloaded pages, counts the number of extracted links, and then displays them. The script also cleans up temporary files created during the operation.

Requirements

Before running this script on a Linux system (or Termux), ensure the following tools are installed: wget, grep, awk, and the coreutils utilities (sort, uniq, wc, cat, and rm).

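To install these tools on a Debian-based system such as Ubuntu, use the following command. On Debian and Ubuntu, awk is a virtual package, so the command names gawk explicitly. In Termux, run pkg install with the same package names and without sudo.

sudo apt-get install wget grep gawk coreutils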
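Before running the script, it can help to confirm that each tool is actually available on the PATH. The following is a minimal sketch and not part of the original script; the helper name missing_tools is hypothetical.

```shell
#!/bin/bash

# Print the name of every listed command that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of the original script.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Commands the script relies on.
missing=$(missing_tools wget grep awk sort uniq wc cat rm)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are joined on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools are installed."
fi
```

If anything is reported as missing, install it with the command above and run the check again.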

The Bash Script

#!/bin/bash

# Main URL
url="https://www.miralishahidi.ir"

# Output file to store the links
output_file="html_links.txt"

# Display message to start downloading the page
echo "Starting to download and extract .html links from $url ..."

# Step 1: Recursively download all pages
echo "Step 1: Downloading all pages recursively..."
wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"
echo "Step 1 completed: Pages downloaded."

# Step 2: Extract .html links from the downloaded files
echo "Step 2: Extracting .html links..."
grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"
echo "Step 2 completed: Links extracted."

# Step 3: Count the number of extracted links
echo "Step 3: Counting the extracted links..."
count=$(wc -l < "$output_file")
echo "Step 3 completed: Number of .html links extracted: $count"

# Step 4: Display the extracted links
echo "Step 4: Displaying the extracted links..."
echo "The extracted links have been saved in the file $output_file:"
cat "$output_file"

# Clean up temporary files
echo "Cleaning up temporary files..."
rm -rf temp
echo "Temporary files cleaned up."

# Final message
echo "Operation completed successfully."

Explanation of Each Step

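Main URL and Output File Setup:

The url variable holds the address of the site to crawl, and the output_file variable holds the name of the file (html_links.txt) that receives the extracted links.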

Step 1 - Recursively Download Pages:

The following command is used to download all HTML pages recursively from the specified URL:

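wget --recursive --no-parent --accept "*.html" --level=inf --no-check-certificate --quiet --output-file=wget.log --directory-prefix=temp "$url"

Here, --recursive follows links found in each page, --no-parent prevents wget from ascending above the starting directory, --accept "*.html" keeps only files whose names end in .html, and --level=inf removes the limit on recursion depth. --directory-prefix=temp saves everything under the temp directory, and --output-file=wget.log sends log messages to wget.log. Two options deserve care: --no-check-certificate disables TLS certificate validation, which is insecure and unnecessary when the site has a valid certificate, and --quiet turns off wget's output, so wget.log is expected to contain little or nothing. Replacing --quiet with --no-verbose gives a compact but useful log.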
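The script prints "Step 1 completed" whether or not wget succeeded. As a hedged sketch, the exit status could be inspected after the download. The helper name describe_wget_status is hypothetical; the meanings of codes 0 to 8 follow the exit statuses documented in the wget manual.

```shell
#!/bin/bash

# describe_wget_status is a hypothetical helper that turns a wget exit
# status into a short description.
describe_wget_status() {
    case "$1" in
        0) echo "no problems occurred" ;;
        1) echo "generic error" ;;
        2) echo "command-line parse error" ;;
        3) echo "file I/O error" ;;
        4) echo "network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "authentication failure" ;;
        7) echo "protocol error" ;;
        8) echo "server issued an error response" ;;
        *) echo "unknown status $1" ;;
    esac
}

# In the script, the status would be captured right after the wget call:
#   wget ... "$url"
#   status=$?
#   echo "Step 1 finished: $(describe_wget_status "$status")"
echo "Status 8 means: $(describe_wget_status 8)"
```

During a recursive download, status 8 often only means that one linked page returned an error (for example a broken link) while the other pages were saved, so it may be more appropriate to treat it as a warning than as a fatal error.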

Step 2 - Extract .html Links:

This step extracts .html links from the downloaded files:

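grep -Eroh 'href="[^"]+\.html"' temp | awk -F\" '{print $2}' | sort | uniq > "$output_file"

In the grep call, -E enables extended regular expressions, -r searches the temp directory recursively, -o prints only the matching part of each line, and -h omits file names from the output. awk then splits each match on the double-quote character and prints the second field, which is the link itself. Finally, sort and uniq remove duplicate links before the result is written to the output file.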
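To see what this pipeline produces without downloading anything, it can be run against a small sample page. The sketch below is illustrative only: the sample directory and the page content are made up for the demonstration.

```shell
#!/bin/bash

# Build a throwaway directory with one sample page (hypothetical content).
sample_dir=$(mktemp -d)
cat > "$sample_dir/index.html" <<'EOF'
<a href="about.html">About</a>
<a href="contact.html">Contact</a>
<a href="about.html">About again</a>
<a href="style.css">Not an HTML link</a>
EOF

# Same pipeline as in the script: match href="....html", keep the quoted
# value, then sort and remove duplicates.
links=$(grep -Eroh 'href="[^"]+\.html"' "$sample_dir" | awk -F\" '{print $2}' | sort | uniq)
echo "$links"

# Remove the sample directory again.
rm -rf "$sample_dir"
```

The duplicate about.html entry is collapsed into one line, and style.css is skipped because it does not end in .html. Links written with single quotes, such as href='page.html', would not be matched by this pattern.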

Step 3 - Count Extracted Links:

The script counts the number of lines (links) in html_links.txt using the following command:

count=$(wc -l < "$output_file")

The script then echoes the count to the user.
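The input redirection in this command matters: when wc reads from standard input it prints only the number, without a file name. A small sketch with a made-up list file (standing in for html_links.txt) illustrates this, along with the result for an empty file.

```shell
#!/bin/bash

# Hypothetical sample list standing in for html_links.txt.
list_file=$(mktemp)
printf '%s\n' "about.html" "contact.html" "index.html" > "$list_file"

# Reading from standard input makes wc print only the number.
count=$(wc -l < "$list_file")
echo "Number of links: $count"

# An empty file yields a count of 0.
: > "$list_file"
empty_count=$(wc -l < "$list_file")
echo "Number of links in an empty file: $empty_count"

rm -f "$list_file"
```

A count of 0 after a run of the script therefore usually means that no pages were downloaded or that none of them contained matching links.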

Step 4 - Display Extracted Links:

The extracted links are displayed using the cat "$output_file" command.

Cleanup:

The script cleans up temporary files created during the operation:

rm -rf temp

This command deletes the temp directory.
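With a fixed directory name such as temp, the cleanup would also delete an unrelated directory of the same name, and it is skipped if the script is interrupted before reaching that line. One possible alternative, sketched here and not taken from the original script, is a uniquely named directory from mktemp combined with a trap. The names work_dir and cleanup are assumptions made for this example.

```shell
#!/bin/bash

# Create a uniquely named working directory instead of a fixed "temp".
work_dir=$(mktemp -d)

# Remove the working directory whenever the script exits.
cleanup() {
    rm -rf "$work_dir"
}
trap cleanup EXIT

# Placeholder for the download step; the script would pass
# --directory-prefix="$work_dir" to wget here.
echo "placeholder" > "$work_dir/page.html"
echo "Working in $work_dir"
```

In Bash, the EXIT trap also runs when the script stops because of an error or an interrupt such as Ctrl+C, so the directory does not linger after a failed run.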

Final Message:

A message is displayed indicating the successful completion of the operation.

Transferring Files from Termux to Internal Storage

In Termux, to transfer a file from the Termux environment to the internal storage of your Android device, you need to first grant Termux the necessary access to internal storage. Afterward, you can use the cp command to copy the file to internal storage.

Steps to Follow:

1. Grant Access to Internal Storage:

First, you need to enable access to internal storage for Termux. Run the following command:

termux-setup-storage

This command displays a permission request that you need to approve. After approval, a storage directory is created in Termux's home directory (~/storage). It contains symbolic links to folders in your device's internal storage.

2. Copy File to Internal Storage:

Once access is granted, you can copy the html_links.txt file to internal storage using the following command:

cp html_links.txt /sdcard/

This command copies html_links.txt to /sdcard/. Despite its name, on Android this path refers to the root of the device's shared internal storage, not to a removable SD card. To place the file in a specific folder, replace /sdcard/ with the path of that folder.

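Default Termux Paths to Internal Storage:

/sdcard/ (also reachable as ~/storage/shared/ after running termux-setup-storage): the root of the shared internal storage

/sdcard/Download/ (also reachable as ~/storage/downloads/): the Download folder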

Example:

To transfer the file to the Download folder in internal storage, use:

cp html_links.txt /sdcard/Download/
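If the copy fails, the usual cause is a missing or unwritable destination. As a hedged sketch, the copy could be wrapped in a small check that reports this before calling cp. The helper name copy_to_storage is hypothetical.

```shell
#!/bin/bash

# copy_to_storage is a hypothetical helper: copy a file into a directory
# only if the file exists and the directory exists and is writable.
copy_to_storage() {
    local src="$1"
    local dest_dir="$2"
    if [ ! -f "$src" ]; then
        echo "Source file not found: $src" >&2
        return 1
    fi
    if [ ! -d "$dest_dir" ] || [ ! -w "$dest_dir" ]; then
        echo "Cannot write to $dest_dir; run termux-setup-storage and grant the permission." >&2
        return 1
    fi
    cp -- "$src" "$dest_dir"/ && echo "Copied $src to $dest_dir"
}

# Usage in Termux (commented out so the sketch has no side effects):
# copy_to_storage html_links.txt /sdcard/Download
```

The function returns a non-zero status when either check fails, so a calling script can stop instead of reporting success.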

Checking the File:

After copying, you can use the device's file manager application to navigate to the desired folder (such as Download) and verify that the file exists.

Note:

If the file does not appear in the destination, check that the necessary permissions have been granted to Termux.