website downloading & mirroring
How to mirror a website
Some people are into data hoarding, some people are also into phishing, and some people just think a website is useful and should be kept for further use. Website mirroring, the process of creating an exact replica of a website on a different server, can be a powerful tool when used ethically and responsibly. However, it’s essential to understand both the benefits and potential drawbacks.
HTTrack - command-line tool
HTTrack is a powerful tool we can use to save and mirror websites. There are several cases where this tool is crucial. Say you need to mirror a website for a phishing exercise, or there are useful resources on a website that others should be able to access. Either way, this is how we can use HTTrack.
To install it on Kali Linux we are going to use the command ‘sudo apt install httrack’.
Then we need to create a directory to store our website using the command ‘mkdir website’ or name the directory whatever you like.
We are going to use the ‘cd’ command to open the website directory, then ‘httrack’ to run the HTTrack application.
HTTrack will then prompt us for several options. I am entering the project name as ‘httrack’. We need to define the base path, which for me is ‘/root/website’. I am using the Vulnweb pentest page for this blog because I feel like they probably won’t mind me mirroring their website, so I am going to copy and paste their URL, ‘login page (vulnweb.com)’, under ‘enter URLs’. I would also recommend practicing on pages like this before you start mirroring other websites.
HTTrack then gives us a list of actions and the option to add switches. Under actions we can choose to mirror all links in the URLs or mirror just the website. The wildcards are for defining filters:
+*.gif accept all gif images
-*.js exclude all files with the extension js
+*/images/*.jpg include all .jpg files in the images directory
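To get a feel for how + and - rules combine, here is a rough sketch in plain shell. It is not HTTrack's own matcher: filter_decision is a hypothetical helper that uses shell glob patterns, and it assumes the behaviour described in the HTTrack filter documentation where the last rule that matches takes priority.

```shell
# Simplified illustration only, HTTrack has its own matching engine.
# filter_decision PATH RULE...  prints accept, reject or no-match.
filter_decision() (
  path=$1
  shift
  verdict="no-match"
  for rule in "$@"; do
    pattern=${rule#?}              # strip the leading + or -
    case $path in
      $pattern)                    # unquoted so it is treated as a glob
        case $rule in
          +*) verdict="accept" ;;
          -*) verdict="reject" ;;
        esac
        ;;
    esac
  done
  echo "$verdict"
)

filter_decision "image1.gif" "+*.gif" "-image*.gif"   # prints: reject
filter_decision "image1.gif" "-image*.gif" "+*.gif"   # prints: accept
```

Swapping the order of the two rules flips the result, which is why the order you type filters in matters.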
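We can define the recurse level with the ‘-r’ flag followed by the recurse level we want (HTTrack lists this option as -r<number>). The lower the number, the closer HTTrack stays to the start page:
the lowest level only downloads the start page
the next level downloads the start page and all links directly from it
the level after that also follows the links found on those linked pages, and so on
I would run a quick test on a small site to see which number gives you the depth you expect.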
Then we choose ‘Y’
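If you don't want to answer the prompts every time, the same choices can be passed on the command line. This is a minimal sketch, not the only way to do it: mirror_httrack is a hypothetical wrapper name, and the URL, output path and depth in the example call are just the values from this walkthrough. It relies on HTTrack's -O (output path) and -r<number> (recurse level) options plus the +/- filters.

```shell
# mirror_httrack URL OUTPUT_DIR DEPTH [FILTER...]
mirror_httrack() (
  url=$1 out_dir=$2 depth=$3
  shift 3
  mkdir -p "$out_dir" || exit 1
  # -O sets where the mirror is saved, -r<number> sets the recurse level,
  # and anything left over is passed through as +/- filters.
  httrack "$url" -O "$out_dir" "-r$depth" "$@"
)

# Example call (commented out so nothing is downloaded by accident):
# mirror_httrack "http://testphp.vulnweb.com/" "/root/website/vulnweb" 2 "+*.gif"
```

Quote the filters so the shell does not expand the wildcards before HTTrack sees them.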
Then use the ‘ls’ command to list your files under the website directory
Then ‘cd vulnweb’ because that is what my file is called
‘ls’ will list the files HTTrack has saved. I’m going to use the command ‘python -m http.server’. The ‘python’ command starts the Python interpreter (on newer systems the command may be ‘python3’). The ‘-m’ flag tells Python to run a module as a script, and ‘http.server’ is the built-in HTTP server module, which serves the files in the current directory on port 8000 by default.
Now that our HTTP server is running we can navigate to ‘http://localhost:8000’ to view our mirrored website.
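If you would rather not ‘cd’ into the mirror first, the server can be pointed at a directory and port directly. A small sketch, assuming Python 3.7 or newer (needed for --directory); serve_mirror is a hypothetical helper name and binding to 127.0.0.1 is my own choice so the copy is only reachable from the local machine.

```shell
# serve_mirror DIR [PORT]  (PORT defaults to 8000)
serve_mirror() {
  # --bind 127.0.0.1 keeps the server local, --directory picks the folder to serve
  python3 -m http.server "${2:-8000}" --bind 127.0.0.1 --directory "$1"
}

# Example call (runs until you press Ctrl+C):
# serve_mirror vulnweb 8000
```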
WinHTTrack - GUI
WinHTTrack is the GUI equivalent of HTTrack, which can be downloaded from HTTrack Website Copier - Free Software Offline Browser (GNU GPL)
Here we will be interacting with a graphical interface to copy our website of choice
Once downloaded, we will launch the WinHTTrack application and press ‘next’
I’m going to name my project ‘test website’ and save it to the website directory
Here we are given the option to choose an action, where we can test links or download all links to fully mirror the site. In this case I am just mirroring the front page.
Once we click ‘next’ we are taken to the page where our website mirror is downloaded
Then we can click ‘browse mirrored website’
And we have our mirrored website saved with WinHTTrack
Mirror prevention measures- how can we bypass them?
HTTrack can be hit or miss. It’s easy to mirror a website front page like we have above, but what if the website we are trying to mirror has measures to prevent it?
One technique some websites use is robots.txt, which tells crawlers which paths they should not fetch when the relevant directives are set. It is a request rather than an enforcement mechanism, though well-behaved tools honour it.
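Before mirroring, it can be worth reading what a site's robots.txt actually asks for. The sketch below lists the Disallow paths that apply to every crawler. list_disallowed is a hypothetical helper, and it is deliberately simplified: it only looks at groups that start with "User-agent: *" and does not handle several User-agent lines stacked on one group.

```shell
# list_disallowed FILE: print Disallow paths from the "User-agent: *" group.
list_disallowed() {
  awk -F': *' '
    tolower($1) == "user-agent" { applies = ($2 == "*"); next }
    applies && tolower($1) == "disallow" && $2 != "" { print $2 }
  ' "$1"
}

# Example, assuming you have already saved the file locally as robots.txt:
# list_disallowed robots.txt
```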
IP blocking is another technique, where websites recognise the IP addresses of certain crawlers trying to access them and block them. This method can potentially be bypassed using a proxy server or VPN.
User agent detection works by analysing the User-Agent string sent with each request. If the user agent looks like a mirroring tool, the website may block it. We may be able to avoid user agent detection by sending a user agent string that mimics a real user’s browser. Proxy servers and rotating IPs do not change the user agent, but they can help if the IP itself has already been flagged as a bot.
Rate limiting restricts the number of requests one IP address can send in a given time. This can be bypassed with a proxy server too.
Legal action: some websites are copyrighted and it is illegal to copy them without permission. Owners of the website will not be happy with your actions, so do not do illegal stuff for malicious purposes.
wget
Wget has switches that we can use to get past mirror prevention measures on websites. There are several switches I am going to use to attempt this on my test website (I am not actually doing anything illegal). However, wget is more difficult to use than HTTrack.
The basic syntax for wget is ‘wget {website URL}’. Before using wget, navigate to the directory where you want to store your website.
If we want to download the whole website including links, we are going to use several switches.
-r enables recursive downloading
-p downloads the files needed to display HTML pages correctly, such as images and CSS
-E saves HTML files with a .html extension
-k converts the links in downloaded HTML files to point to local files
-np prevents wget from going up into parent directories, ensuring you only download files at or below the starting directory
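Put together, the five switches look like this. A minimal sketch: wget_site is a hypothetical wrapper name, and note that the link-conversion switch is the lowercase -k (uppercase -K is a different option that keeps backups of the converted files).

```shell
# wget_site URL: recursive download with the five switches from the list
wget_site() {
  # -r recurse, -p page requisites, -E add .html, -k convert links, -np no parent
  wget -r -p -E -k -np "$1"
}

# Example call (commented out so nothing is downloaded by accident):
# wget_site "http://testphp.vulnweb.com/"
```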
In this attempt, the website I tried to copy has mirror prevention measures.
I’m going to try to bypass these measures by setting the user agent string so it looks like I am coming from a legit browser, as many websites block crawlers and I don’t want to be blocked. The command I am using is:
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" -r -p -E -k -np (domain url)
Success! But I want the whole website.
So we are going to need some more switches to achieve the results we want. I want to fully mirror the website, so I am also going to use the switches ‘--mirror --convert-links --adjust-extension --page-requisites --no-parent’
--mirror turns on recursive downloading with unlimited depth and timestamping, which suits keeping a mirror of a server.
--convert-links converts links in downloaded files to point to the local copies for offline viewing.
--adjust-extension adds appropriate extensions to downloaded files (like .html, .css).
--page-requisites downloads all necessary files for page rendering (images, CSS, etc.).
--no-parent prevents going up directory levels.
We can also use the ‘--wait’ and ‘--random-wait’ options to avoid overloading the server.
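Here is how the long options and the wait options might sit together in one command. It is only a sketch: polite_mirror is a hypothetical name and the 2 second pause is a value I picked for illustration, not a recommendation from the wget manual.

```shell
# polite_mirror URL: full mirror with a pause between requests
polite_mirror() {
  # --wait=2 pauses about 2 seconds between requests;
  # --random-wait varies that pause so the requests look less mechanical.
  wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
       --wait=2 --random-wait "$1"
}

# Example call (commented out so nothing is downloaded by accident):
# polite_mirror "http://testphp.vulnweb.com/"
```

The trade-off is speed: every pause adds up, so a large site will take noticeably longer to mirror.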
I’ve written a simple bash script to mirror websites with wget because I am lazy and don’t want to write the commands over and over again. It uses the switches and user agent string mentioned above. You can use the same script, just swap out the website URL for the one you want. Warning: this script may take a while to run because a full mirror downloads every page and file it can reach.
I’m copying and pasting my script below in case someone finds it useful
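#!/bin/bash
# Mirror a site with wget; swap in the URL you want to copy.
website="http://testphp.vulnweb.com/login.php"
mirror_dir="website_mirror"
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
# Create the output directory and stop if we cannot enter it.
mkdir -p "$mirror_dir"
cd "$mirror_dir" || exit 1
# --mirror already implies -r, and the long options cover -p, -E, -k and -np, so the duplicate short switches are dropped.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --user-agent="$user_agent" "$website"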
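If editing the script for every new site gets old, one option is to take the URL as an argument instead. This is a sketch of that idea rather than a replacement for the script: mirror_site is a hypothetical function name, and the optional second argument for the output directory is my own addition.

```shell
# mirror_site URL [DIR]  (DIR defaults to website_mirror)
mirror_site() (
  website=${1:?usage: mirror_site URL [DIR]}
  mirror_dir=${2:-website_mirror}
  user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
  # The body runs in a subshell, so this cd does not change the caller's directory.
  mkdir -p "$mirror_dir" && cd "$mirror_dir" || exit 1
  wget --mirror --convert-links --adjust-extension --page-requisites \
       --no-parent --user-agent="$user_agent" "$website"
)

# Example call (commented out so nothing is downloaded by accident):
# mirror_site "http://testphp.vulnweb.com/login.php"
```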
Once I run my script, I’m going to navigate to where the mirror is stored.
I’m then going to open my file manager and locate the files under my Kali Linux folder.
Here I am given every file from my mirrored website
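If you prefer the terminal to the file manager, a quick way to sanity-check a mirror is to count the files by extension. A rough sketch: summarise_mirror is a hypothetical helper, and files with no dot in the name are grouped under "(none)".

```shell
# summarise_mirror DIR: print "extension count" for every file under DIR
summarise_mirror() {
  find "$1" -type f | awk -F/ '
    {
      n = split($NF, parts, ".")            # $NF is the file name without its path
      ext = (n > 1) ? parts[n] : "(none)"
      count[ext]++
    }
    END { for (ext in count) print ext, count[ext] }
  ' | LC_ALL=C sort
}

# Example call, using the directory name from the script:
# summarise_mirror website_mirror
```

A mirror with only one or two HTML files usually means the site blocked the crawl or the recursion stopped early.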
And here we go, our mirrored website.
If you are mirroring websites, please respect copyright laws and do not use this for unethical purposes.