Bash Script for Generating XML Sitemap

Overview

This Bash script generates an XML sitemap for a given website. It recursively downloads all pages from the site (excluding PDF files), extracts the URLs that belong to the domain, and writes them into an XML sitemap.
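
Each discovered URL becomes one <url> entry in the generated file. For illustration only, the output has roughly the following shape; the loc and lastmod values shown here are placeholders, and in a real run lastmod is the UTC time of generation:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://www.miralishahidi.ir/</loc>
        <lastmod>2024-01-01T00:00:00+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
</urlset>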

Installation and Execution

On Linux

To use this script on a Linux system, follow these steps:

  1. Ensure that wget and grep are installed. You can install them using your package manager if needed:
sudo apt-get install wget grep
  2. Save the script to a file, e.g., generate_sitemap.sh.
  3. Make the script executable:
chmod +x generate_sitemap.sh
  4. Run the script (an example session is shown after these steps):
./generate_sitemap.sh
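
For reference, a complete session on a Debian/Ubuntu system might look like the following; the progress messages come from the script itself, and the download can take a while on larger sites:

sudo apt-get install wget grep      # only if not already installed
chmod +x generate_sitemap.sh
./generate_sitemap.sh
# Downloading and extracting links from https://www.miralishahidi.ir ...
# Extracting URLs...
# Generating sitemap.xml...
# Sitemap generated successfully and saved to sitemap.xml.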

On Termux (Android)

To use this script on Termux (Android), follow these steps:

  1. Install Termux from the Google Play Store or F-Droid.
  2. Install the required packages:
pkg install wget grep
  3. Save the script to a file, e.g., generate_sitemap.sh, in Termux's home directory.
  4. Make the script executable:
chmod +x generate_sitemap.sh
  5. Run the script (see the example session below):
./generate_sitemap.sh
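
As a quick sketch, the same workflow in Termux looks like this. Termux's home directory is /data/data/com.termux/files/home (usually written as ~), and the generated sitemap.xml is written to the directory the script is run from:

pkg install wget grep
cd ~                                # Termux home directory
chmod +x generate_sitemap.sh
./generate_sitemap.sh
ls sitemap.xml                      # confirm the sitemap was created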

Transferring Files from Termux to Internal Storage

To transfer the sitemap.xml file from Termux to your Android device's internal storage, follow these steps:

  1. Grant Access to Internal Storage: First, grant Termux access to internal storage by running the following command in Termux:
termux-setup-storage
  2. Transfer the File: After granting access, you can transfer the sitemap.xml file to a folder in internal storage. For example, to transfer it to the Download folder, use:
cp sitemap.xml /sdcard/Download/
  3. Common Storage Paths: Frequently used destinations include /sdcard/Download/, /sdcard/Documents/, /sdcard/DCIM/, and /sdcard/Pictures/.
  4. Verify the File: After transferring, use a file manager app on your device to navigate to the folder where you transferred the file and verify its presence (a command-line check is shown below).
  5. Note: If the file does not appear in the destination folder, ensure that Termux has storage permission and that the destination path is correct.
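
Putting these steps together, a minimal transfer-and-verify sequence looks like this (the Download folder is just an example destination; any of the paths above works):

termux-setup-storage                      # one-time permission prompt
cp sitemap.xml /sdcard/Download/
ls -l /sdcard/Download/sitemap.xml        # confirm the copy succeeded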

Script Code


#!/bin/bash

# Main URL to generate the sitemap for
url="https://www.miralishahidi.ir"

# Output file name for the sitemap
sitemap_file="sitemap.xml"

# Temporary file to store extracted links
temp_links="links.txt"

# Step 1: Recursively download all pages, excluding PDF files
echo "Downloading and extracting links from $url ..."
wget --recursive \
     --no-parent \
     --reject "*.pdf" \
     --level=inf \
     --no-check-certificate \
     --quiet \
     --output-file=wget.log \
     --directory-prefix=temp \
     "$url"

# Step 2: Extract all URLs that belong to the domain, excluding PDFs
echo "Extracting URLs..."
grep -Eroh 'href="[^"]+"' temp | \
    awk -F\" '{print $2}' | \
    grep "^$url" | \
    grep -v "\.pdf$" | \
    sort -u > $temp_links

# Step 3: Generate the XML sitemap
echo "Generating $sitemap_file..."
cat <<EOF > $sitemap_file
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
EOF

# Get the current date and time in the format YYYY-MM-DDTHH:MM:SS+00:00
current_date=$(date -u +"%Y-%m-%dT%H:%M:%S+00:00")

# Step 4: Add each URL to the sitemap with proper XML formatting and lastmod tag
while IFS= read -r link; do
    cat <<EOF >> $sitemap_file
    <url>
        <loc>$link</loc>
        <lastmod>$current_date</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
EOF
done < $temp_links

# Close the XML file
echo "</urlset>" >> $sitemap_file

# Cleanup temporary files
rm -rf temp $temp_links wget.log

# Final message
echo "Sitemap generated successfully and saved to $sitemap_file."

Script Functionality