Overview
This Bash script generates an XML sitemap for a given website. It recursively downloads the site's pages (excluding PDF files), extracts their URLs, and formats them into an XML sitemap.
Installation and Execution
On Linux
To use this script on a Linux system, follow these steps:
- Ensure that `wget` and `grep` are installed. You can install them using your package manager if needed: `sudo apt-get install wget grep`
- Save the script to a file, e.g., `generate_sitemap.sh`.
- Make the script executable: `chmod +x generate_sitemap.sh`
- Run the script: `./generate_sitemap.sh`
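Before running the script, it can help to confirm the required tools are actually on `PATH`. The following sketch uses a helper function (`check_deps` is a name introduced here for illustration, not part of the script):

```shell
# Check that each named tool exists on PATH; report the first one missing.
check_deps() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
  echo "all dependencies present"
}

check_deps wget grep || echo "install the missing tools before running the script"
```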
On Termux (Android)
To use this script on Termux (Android), follow these steps:
- Install Termux from the Google Play Store or F-Droid.
- Install the required packages: `pkg install wget grep`
- Save the script to a file, e.g., `generate_sitemap.sh`, in Termux's home directory.
- Make the script executable: `chmod +x generate_sitemap.sh`
- Run the script: `./generate_sitemap.sh`
Transferring Files from Termux to Internal Storage
To transfer the `sitemap.xml` file from Termux to your Android device's internal storage, follow these steps:
- Grant Access to Internal Storage: First, grant Termux access to internal storage by running the following command in Termux: `termux-setup-storage`
- Transfer the File: After granting access, you can transfer the `sitemap.xml` file to a folder in internal storage. For example, to transfer it to the `Download` folder, use: `cp sitemap.xml /sdcard/Download/`
- Common Storage Paths:
  - `/sdcard/`: Main internal storage.
  - `/sdcard/Download/`: Downloads folder.
  - `/sdcard/Documents/`: Documents folder.
- Verify the File: After transferring, use a file manager app on your device to navigate to the folder where you transferred the file and verify its presence.
- Note: If the file does not appear in the destination folder, ensure that Termux has the necessary permissions and that the destination path is correct.
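The transfer steps above can be wrapped in a small helper. This is a sketch, not part of the original script: `transfer_sitemap` is a hypothetical name, and the default destination assumes `termux-setup-storage` has already been granted.

```shell
# Copy sitemap.xml to a destination folder and confirm it arrived.
# Default destination is Termux's view of internal storage; pass
# another path as the first argument to override it.
transfer_sitemap() {
  src="sitemap.xml"
  dest="${1:-/sdcard/Download}"
  [ -f "$src" ] || { echo "error: $src not found" >&2; return 1; }
  [ -d "$dest" ] || { echo "error: $dest is not a directory" >&2; return 1; }
  cp "$src" "$dest"/ && echo "copied $src to $dest"
}
```

Run it as `transfer_sitemap` for the Downloads folder, or e.g. `transfer_sitemap /sdcard/Documents` for another destination.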
Script Code
```bash
#!/bin/bash

# Main URL to generate the sitemap for
url="https://www.miralishahidi.ir"

# Output file name for the sitemap
sitemap_file="sitemap.xml"

# Temporary file to store extracted links
temp_links="links.txt"

# Step 1: Recursively download all pages, excluding PDF files
echo "Downloading and extracting links from $url ..."
wget --recursive \
     --no-parent \
     --reject "*.pdf" \
     --level=inf \
     --no-check-certificate \
     --quiet \
     --output-file=wget.log \
     --directory-prefix=temp \
     "$url"

# Step 2: Extract all URLs that belong to the domain, excluding PDFs
echo "Extracting URLs..."
grep -Eroh 'href="[^"]+"' temp | \
    awk -F\" '{print $2}' | \
    grep "^$url" | \
    grep -v "\.pdf$" | \
    sort -u > "$temp_links"

# Step 3: Generate the XML sitemap
echo "Generating $sitemap_file..."
cat <<EOF > "$sitemap_file"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
EOF

# Get the current date and time in the format YYYY-MM-DDTHH:MM:SS+00:00
current_date=$(date -u +"%Y-%m-%dT%H:%M:%S+00:00")

# Step 4: Add each URL to the sitemap with proper XML formatting and a lastmod tag
while IFS= read -r link; do
    cat <<EOF >> "$sitemap_file"
  <url>
    <loc>$link</loc>
    <lastmod>$current_date</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
EOF
done < "$temp_links"

# Close the XML file
echo "</urlset>" >> "$sitemap_file"

# Cleanup temporary files
rm -rf temp "$temp_links" wget.log

# Final message
echo "Sitemap generated successfully and saved to $sitemap_file."
```
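The heredoc loop from step 4 can be exercised in isolation. Here `emit_entries` is a name introduced for illustration; it reads URLs from stdin and emits one `<url>` entry per line, using the same heredoc technique as the script:

```shell
# Sketch of step 4: emit one sitemap <url> entry per URL read from stdin.
emit_entries() {
  lastmod=$(date -u +"%Y-%m-%dT%H:%M:%S+00:00")
  while IFS= read -r link; do
    cat <<EOF
  <url>
    <loc>$link</loc>
    <lastmod>$lastmod</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
EOF
  done
}

printf '%s\n' "https://www.example.com/" "https://www.example.com/about" | emit_entries
```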
Script Functionality
- Downloading Pages: Uses `wget` to recursively download pages from the specified URL while excluding PDF files. The downloaded pages are stored in a directory named `temp`.
- Extracting URLs: Processes the downloaded HTML files to extract URLs. It filters out URLs that do not belong to the specified domain and excludes PDFs. The results are saved to `links.txt`.
- Generating Sitemap: Creates an XML sitemap file named `sitemap.xml`. Each URL is formatted with `<loc>`, `<lastmod>`, `<changefreq>`, and `<priority>` tags.
- Cleaning Up: Removes temporary files and directories created during the process.