How to Find All Current and Archived URLs on a Website

There are many reasons you might need to locate all the URLs on a website, but your exact goal will determine what you're searching for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data with a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But, if you're reading this, you probably did not get so lucky.
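If you do turn up an old sitemap, pulling its URLs out takes only a few lines. Here's a minimal Python sketch using the standard library; the file name is a placeholder for whatever export you found:

```python
import xml.etree.ElementTree as ET

# Sitemaps use the sitemaps.org namespace for their <url>/<loc> elements.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder: your saved sitemap file
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Extracted {len(urls)} URLs")
```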

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. When you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't tell you whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
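Another way around both the 10,000-URL cap and the missing export button is the Wayback Machine's CDX API, which returns captured URLs for a domain in bulk. Here's a minimal Python sketch; the collapse and limit parameters shown are one reasonable configuration, not the only one:

```python
import requests

# Query the Wayback Machine CDX API for unique URLs captured on a domain.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # placeholder: the domain to query
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate by normalized URL
        "limit": "50000",
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the field header
print(f"Retrieved {len(urls)} archived URLs")
```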

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
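For the API route, here's a rough Python sketch of pulling links programmatically. The endpoint, auth scheme, and response fields are assumptions based on the Moz Links API v2, so check them against Moz's current documentation before relying on this; the credentials are placeholders:

```python
import requests

# Rough sketch of a Moz Links API v2 request (assumed endpoint and fields;
# verify against Moz's current API docs). Credentials are placeholders.
ACCESS_ID = "your-access-id"
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lz.moz.com/v2/links",   # assumed v2 links endpoint
    auth=(ACCESS_ID, SECRET_KEY),    # HTTP Basic auth
    json={
        "target": "example.com",
        "target_scope": "root_domain",
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()

# Collect the pages on your own site that the index has discovered
# as link targets (assumed response shape).
target_pages = {r["target"]["page"] for r in resp.json().get("results", [])}
print(sorted(target_pages))
```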

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
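If you go the API route, here's a minimal sketch using the official Python client, assuming a service account that has been granted access to the property; the key file, property URL, and dates are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Authenticate with a service account key (placeholder file name).
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

# Page the Search Analytics report to collect every URL with impressions.
pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"Found {len(pages)} pages with impressions")
```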

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for gathering URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
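If you'd rather skip the UI export entirely, the GA4 Data API can pull page paths directly. Here's a minimal sketch with the official Python client, assuming application-default credentials with access to the property; the property ID and date range are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account
# with access to the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"Collected {len(paths)} page paths")
```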

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Things to consider:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a minimal parsing example follows).
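As a starting point, here's a small Python sketch that extracts unique request paths from an access log. It assumes the common Apache/Nginx combined log format, and the file name is a placeholder:

```python
import re

# Combined log format:
# IP - - [timestamp] "METHOD /path HTTP/1.1" status size "referer" "user-agent"
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+" (\d{3}) .*?"([^"]*)"$')

paths, googlebot_paths = set(), set()
with open("access.log") as f:  # placeholder log file name
    for line in f:
        m = LINE_RE.search(line)
        if not m:
            continue
        path, status, user_agent = m.groups()
        paths.add(path)
        if "Googlebot" in user_agent:  # crude UA check; verify via reverse DNS for rigor
            googlebot_paths.add(path)

print(f"{len(paths)} unique paths, {len(googlebot_paths)} requested by Googlebot")
```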
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
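If you go the Jupyter route, a few lines of pandas handle the normalization and deduplication. The file names are placeholders for whichever exports you collected, and stripping query strings is a judgment call for your site:

```python
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop query/fragment, trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))

# Placeholder file names: one column of URLs per source export,
# each assumed to have a header row.
sources = ["wayback.csv", "gsc.csv", "ga4.csv", "logs.csv"]
frames = [pd.read_csv(f, names=["url"], header=0) for f in sources]

all_urls = pd.concat(frames, ignore_index=True)
all_urls["url"] = all_urls["url"].map(normalize)
all_urls = all_urls.drop_duplicates().sort_values("url")
all_urls.to_csv("all-urls-deduped.csv", index=False)
```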

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
