Getting started with Manual Content Discovery

Understanding Robots.txt, Favicon, Sitemap.xml, HTTP Headers, and the Framework Stack

To begin, we should ask: in the context of web application security, what is content? Files, videos, photographs, backups, and website features are all examples of content. When we use the term "content discovery," we're not referring to things that are readily apparent on a website; we mean everything else that wasn't intended for public viewing. This content might be hidden in the code, or it might reside on a separate server (thus requiring additional steps for retrieval). Examples range from pages or portals for employee use to previous versions of the website, backup files, system configuration documents, and administration panels.

In many cases, content discovery is a necessary step in the overall process of web application security testing. By uncovering content that was not intended for public viewing, we can better understand how the application works and identify potential vulnerabilities. There are a variety of methods that can be used for content discovery, each with its own advantages and disadvantages.

There are three primary methods for finding content on a website: manually, automatically, and through OSINT (Open-Source Intelligence). In this post, we'll focus on manual content discovery.

To get started, there are several places on a website where we can look for additional content.

Robots.txt

The robots.txt file tells search engines which pages or sections of a website they may or may not crawl, and it can also bar specific crawlers from the site entirely. It's not unusual to keep certain sections of a website out of search engine results: these might be areas like administration interfaces or files intended only for existing customers. For us as penetration testers, the file therefore provides a ready-made list of locations that the site's owners don't want us to find.
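As a quick illustration (using http://ip as a placeholder for the target, and hypothetical paths), we can fetch the file directly and read the disallowed entries:

curl http://ip/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /backups/

In this illustrative output, /admin/ and /backups/ are exactly the kind of paths worth visiting manually during a test.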

Favicon

The favicon is the small icon that appears in the browser tab next to the page title; it also shows up in bookmarks and browser history. It can be a useful fingerprinting clue when examining a website.

When a website is built with a framework, the default favicon that ships with the framework sometimes remains in place. If the developer never replaces it with a custom icon, it can reveal which framework the site is running on.

OWASP hosts a database of standard framework icons that you can check the target's favicon against (the OWASP Favicon Database). Once we've identified the framework, we can use external sources to learn more about it.
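One way to do this check (a sketch, assuming the target serves its favicon at the default /favicon.ico path) is to download the icon and compute its MD5 hash, which can then be looked up in the favicon database:

curl -s http://ip/favicon.ico | md5sum

If the resulting hash matches an entry in the database, we have a strong hint about the framework behind the site.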

Sitemap.xml

A sitemap is a file that lists the pages of a website its owner wants search engines to index, so it provides a useful overview of a site's content. Sitemaps are especially helpful because they sometimes include pages that are hard to reach by browsing, or old pages that are no longer linked from the current site. The sitemap.xml file can usually be accessed by appending /sitemap.xml to the website URL.
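For example (again using http://ip as a placeholder and hypothetical URLs), retrieving the sitemap might return XML entries like these:

curl http://ip/sitemap.xml

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://ip/</loc></url>
  <url><loc>http://ip/old-catalogue/</loc></url>
</urlset>

Each <loc> element is a page the owner wanted indexed at some point, and entries like the hypothetical /old-catalogue/ above are worth a closer look.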

HTTP Headers

HTTP headers can be a valuable source of information for content discovery. The response headers returned by the web server contain metadata such as the server software and, in many cases, the programming or scripting language in use; the page's HTML source can add further clues such as the title, description, and keyword meta tags. For example, the headers might reveal that the web server is NGINX 1.18.0 running PHP 7.4.3. Using this information, we may discover that vulnerable versions of software are in use.

We can view the HTTP headers by running curl against the web server with the -v switch, which enables verbose mode and prints the request and response headers:

curl http://ip -v
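The verbose output includes the response headers; assuming a server configured as in the example above, the relevant lines might look like this (illustrative output, not from a real host):

< HTTP/1.1 200 OK
< Server: nginx/1.18.0
< X-Powered-By: PHP/7.4.3

The Server and X-Powered-By headers are the ones that most often leak software names and version numbers.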

Framework stack

Once we've identified the content management system (CMS) in use, we can investigate further to learn about the framework stack: the collection of software that powers the website, usually a web server, the CMS itself, and a database. Often, parts of the stack are revealed in the HTTP headers or in the page source. From there we may discover even more, such as the software's version and features, which can point us towards additional content.
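As a simple, hedged example: many CMSs advertise themselves in a generator meta tag in the page source, which we can extract with a quick filter (the WordPress value shown is hypothetical):

curl -s http://ip | grep -i generator

<meta name="generator" content="WordPress 5.8" />

From a result like this we could go on to research that framework's default paths, changelogs, and known vulnerabilities for the reported version.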

Conclusion

In this post, we've looked at some methods for manual content discovery. We've seen how to use the robots.txt file, the favicon, sitemap.xml, and HTTP headers to get started. We've also looked at how to identify the content management system (CMS) in use and investigate the framework stack. These techniques can be valuable in our efforts to gather more information about a website.
