Overview
This Bash script generates an XML sitemap for a given website. It recursively downloads the site's pages (excluding PDF files), extracts their URLs, and formats them into an XML sitemap.
Installation and Execution
On Linux
To use this script on a Linux system, follow these steps:
- Ensure that `wget` and `grep` are installed. You can install them using your package manager if needed: `sudo apt-get install wget grep`
- Save the script to a file, e.g., `generate_sitemap.sh`.
- Make the script executable: `chmod +x generate_sitemap.sh`
- Run the script: `./generate_sitemap.sh`
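Before running the script, it can help to confirm the required tools are actually on `PATH`. The following sketch uses a helper function (`check_deps` is a name introduced here for illustration, not part of the script):

```shell
# Check that each named tool exists on PATH; report the first one missing.
check_deps() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
  echo "all dependencies present"
}

check_deps wget grep || echo "install the missing tools before running the script"
```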
On Termux (Android)
To use this script on Termux (Android), follow these steps:
- Install Termux from the Google Play Store or F-Droid.
- Install the required packages: `pkg install wget grep`
- Save the script to a file, e.g., `generate_sitemap.sh`, in Termux's home directory.
- Make the script executable: `chmod +x generate_sitemap.sh`
- Run the script: `./generate_sitemap.sh`
Transferring Files from Termux to Internal Storage
To transfer the `sitemap.xml` file from Termux to your Android device's internal storage, follow these steps:
- Grant Access to Internal Storage: First, grant Termux access to internal storage by running the following command in Termux: `termux-setup-storage`
- Transfer the File: After granting access, you can transfer the `sitemap.xml` file to a folder in internal storage. For example, to transfer it to the `Download` folder, use: `cp sitemap.xml /sdcard/Download/`
- Common Storage Paths:
  - `/sdcard/`: Main internal storage.
  - `/sdcard/Download/`: Downloads folder.
  - `/sdcard/Documents/`: Documents folder.
- Verify the File: After transferring, use a file manager app on your device to navigate to the folder where you transferred the file and verify its presence.
- Note: If the file does not appear in the destination folder, ensure that Termux has the necessary permissions and that the destination path is correct.
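The transfer steps above can be wrapped in a small helper. This is a sketch, not part of the original script: `transfer_sitemap` is a hypothetical name, and the default destination assumes `termux-setup-storage` has already been granted.

```shell
# Copy sitemap.xml to a destination folder and confirm it arrived.
# Default destination is Termux's view of internal storage; pass
# another path as the first argument to override it.
transfer_sitemap() {
  src="sitemap.xml"
  dest="${1:-/sdcard/Download}"
  [ -f "$src" ] || { echo "error: $src not found" >&2; return 1; }
  [ -d "$dest" ] || { echo "error: $dest is not a directory" >&2; return 1; }
  cp "$src" "$dest"/ && echo "copied $src to $dest"
}
```

Run it as `transfer_sitemap` for the Downloads folder, or e.g. `transfer_sitemap /sdcard/Documents` for another destination.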
Script Code
```bash
#!/bin/bash

# Main URL to generate the sitemap for
url="https://www.miralishahidi.ir"

# Output file name for the sitemap
sitemap_file="sitemap.xml"

# Temporary file to store extracted links
temp_links="links.txt"

# Step 1: Recursively download all pages, excluding PDF files
echo "Downloading and extracting links from $url ..."
wget --recursive \
     --no-parent \
     --reject "*.pdf" \
     --level=inf \
     --no-check-certificate \
     --quiet \
     --output-file=wget.log \
     --directory-prefix=temp \
     "$url"

# Step 2: Extract all URLs that belong to the domain, excluding PDFs
echo "Extracting URLs..."
grep -Eroh 'href="[^"]+"' temp | \
    awk -F\" '{print $2}' | \
    grep "^$url" | \
    grep -v "\.pdf$" | \
    sort -u > "$temp_links"

# Step 3: Generate the XML sitemap
echo "Generating $sitemap_file..."
cat <<EOF > "$sitemap_file"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
EOF

# Get the current date and time in the format YYYY-MM-DDTHH:MM:SS+00:00
current_date=$(date -u +"%Y-%m-%dT%H:%M:%S+00:00")

# Step 4: Add each URL to the sitemap with proper XML formatting and a lastmod tag
while IFS= read -r link; do
    cat <<EOF >> "$sitemap_file"
  <url>
    <loc>$link</loc>
    <lastmod>$current_date</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
EOF
done < "$temp_links"

# Close the XML file
echo "</urlset>" >> "$sitemap_file"

# Cleanup temporary files
rm -rf temp "$temp_links" wget.log

# Final message
echo "Sitemap generated successfully and saved to $sitemap_file."
```
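The heredoc loop from step 4 can be exercised in isolation. Here `emit_entries` is a name introduced for illustration; it reads URLs from stdin and emits one `<url>` entry per line, using the same heredoc technique as the script:

```shell
# Sketch of step 4: emit one sitemap <url> entry per URL read from stdin.
emit_entries() {
  lastmod=$(date -u +"%Y-%m-%dT%H:%M:%S+00:00")
  while IFS= read -r link; do
    cat <<EOF
  <url>
    <loc>$link</loc>
    <lastmod>$lastmod</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
EOF
  done
}

printf '%s\n' "https://www.example.com/" "https://www.example.com/about" | emit_entries
```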
Script Functionality
- Downloading Pages: Uses `wget` to recursively download pages from the specified URL while excluding PDF files. The downloaded pages are stored in a directory named `temp`.
- Extracting URLs: Processes the downloaded HTML files to extract URLs. It filters out URLs that do not belong to the specified domain and excludes PDFs. The results are saved to `links.txt`.
- Generating Sitemap: Creates an XML sitemap file named `sitemap.xml`. Each URL is formatted with `<loc>`, `<lastmod>`, `<changefreq>`, and `<priority>` tags.
- Cleaning Up: Removes temporary files and directories created during the process.