A practitioner’s guide to classifying every asset in your attack surface

Detectify, May 13, 2025

TLDR: This article details methods and tools (from DNS records and IP addresses to HTTP analysis and HTML content) that practitioners can use to classify every web app and asset in their attack surface. You’ll learn to view your assets from an attacker’s perspective, enabling you to understand not only that an asset exists but also its exact nature.

“You can’t secure what you don’t know exists.” It’s a common refrain in cybersecurity (and for good reason!). But the reality is a bit more complex: it’s not enough to just know that something exists. To effectively secure your assets, you need to understand what each of them is. Without proper classification, applying the right security processes or tools becomes a guessing game.

There’s a discrepancy between what you think you’re exposing and what you are actually exposing. Critically, an attacker only cares about what is actually accessible to them, not what you think is accessible. Research from Detectify indicates that the average organization fails to test 9 out of 10 of its complex web apps that are potential attack targets.

Imagine you’ve identified a few thousand assets exposed to the internet. The crucial next step is to determine what you are actually exposing. Different tools can help depending on what’s on your attack surface, but instead of focusing on specific tools right away, let’s concentrate on the methods and data points used to understand what each asset is.

Data points for an outside-in perspective

Numerous data points can be used for classification. Let’s examine them in the order of a typical connection flow, assuming an outside-in, black-box analysis perspective. Analysis based on internal network data or source code would require a different approach.

Asset classification methods covered in this guide

Handshake

DNS: Where is the DNS hosted? What types of pointers (A, CNAME, MX, etc.) are used? Where are they pointing? Are there informative TXT records (e.g., SPF, DKIM, DMARC)?
IPs: Where is the IP address geographically located? What Autonomous System Number (ASN) does it belong to? Is it an individual IP or part of a larger range?
Ports: Which ports are open or closed? How does the firewall behave (e.g., treatment of TCP vs. UDP, dropped vs. rejected packets)?
Protocol/Schema: What protocol responds on an open port (e.g., HTTP, FTP, SSH)? Are there nested protocols (e.g., HTTP over TLS, WebSocket over HTTP)?
SSL/TLS: Which Certificate Authority (CA) issued the certificate? What do JARM fingerprinting and handshake data reveal? What Subject Alternative Names (SANs) are listed?

Deep dive into HTTP

The data available for deeper classification heavily depends on the protocol encountered. For this blog post, we’ll focus primarily on HTTP, the backbone of web applications.

Key HTTP data points include:

Response Codes: Is it a 200 OK, a 30X redirect (and where to?), or a 50X server error?
Headers: Response headers are particularly rich, including custom X-headers, Cookies and security headers.
File Signatures: These are unique identifiers forming part of a file’s binary data, often found in the first few bytes of a response body.
Content-Type and Length: Is the response JSON, XML, HTML? What’s the size of the response?

Further down into HTML

If the response is HTML, we can delve even deeper:

Favicon: Many applications use default favicons. Hashing these icons can quickly identify known software.
URL Patterns: Are there detectable patterns in URLs (e.g., /wp-admin/, /api/v1/, specific query parameter structures)?
Meta-tags: name attributes (e.g., for description, keywords, generator) or http-equiv attributes (simulating response headers) can reveal underlying technologies or CMS.
Form-tags: The structure, input field names, and action URLs within forms (especially login forms) can indicate specific systems.
Links in Code: Are there hardcoded links to known sources, documentation, or license agreements?
Code Patterns: Detectable patterns in JavaScript, HTML structure, or CSS can point to specific frameworks or libraries.
Third-Party Resources: What external resources (scripts, images, APIs, tracking pixels) are being loaded, and from where?

Other Protocols

If we haven’t gone down the HTTP and HTML path (e.g., we’ve encountered an SSH or SMTP server), we would then look further into the binary response or protocol-specific handshake data to understand what software components are running. However, that’s a topic for another article.

Data Points Unpacked

When we examine each data point individually, significant opportunities for fingerprinting and understanding exposed assets emerge. Combining them provides even richer insights:

DNS

NS records and CNAMEs: Can be used to understand hosting providers (e.g., AWS, Azure, GCP), third-party SaaS applications, and CDNs/WAFs. Analyzing the domain name itself often yields this information.
DNS security records (e.g., SPF, DMARC): Can reveal third-party services used for functions like marketing automation or invoicing, which can be relevant for supply chain risk assessment or social engineering attack vectors.

Tools and Techniques: Manual inspection can be done with the dig command and basic human pattern recognition for small-scale analysis. For larger-scale testing, open-source tools like MassDNS can be highly effective.
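As a rough illustration of the pattern recognition described above, the sketch below maps a CNAME target to a provider by suffix and pulls the include: mechanisms out of an SPF record. The suffix table and the function names are assumptions made for this example. It works on strings you have already resolved (for instance with dig) and does no DNS lookups itself.

```python
import re
from typing import Optional

# Illustrative suffix-to-provider table (an assumption for this sketch);
# extend it with the providers relevant to your own attack surface.
CNAME_SUFFIXES = {
    ".cloudfront.net": "Amazon CloudFront",
    ".azurewebsites.net": "Azure App Service",
    ".github.io": "GitHub Pages",
    ".myshopify.com": "Shopify",
}


def classify_cname(target: str) -> Optional[str]:
    """Return the provider whose suffix matches a CNAME target, if any."""
    host = target.rstrip(".").lower()  # drop the trailing root dot from dig output
    for suffix, provider in CNAME_SUFFIXES.items():
        if host.endswith(suffix):
            return provider
    return None


def spf_includes(txt_record: str) -> list[str]:
    """List the domains referenced by include: mechanisms in an SPF record."""
    if not txt_record.lower().startswith("v=spf1"):
        return []
    return re.findall(r"\binclude:(\S+)", txt_record, flags=re.IGNORECASE)
```

The included domains often name the mail, marketing, or invoicing vendors an organization relies on, which is the supply chain signal mentioned above.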

IPs

ASN (Autonomous System Number): Helps determine organizational ownership, network size and scope, and geographical footprint. ASN data can also indicate underlying technology providers, as vendors often allocate IP blocks to different products or services.

Tools and Techniques: Nmap is a widely used tool for IP and port scanning. Alternatives for large-scale scanning include ZMap and masscan. Whois lookups (command-line or web-based) are essential for ASN information.
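Once you have prefix data from whois or ASN lookups, matching an address against it is straightforward. The sketch below assumes a hypothetical prefix table built from documentation ranges (RFC 5737). The owner labels are invented for illustration.

```python
import ipaddress
from typing import Optional

# Hypothetical prefix-to-owner table using documentation ranges (RFC 5737).
# Real data would come from whois/ASN lookups or providers' published ranges.
KNOWN_PREFIXES = {
    "192.0.2.0/24": "ExampleCDN edge",
    "198.51.100.0/24": "ExampleCloud compute",
}


def owner_of(ip: str) -> Optional[str]:
    """Return the owner label of the first known prefix containing ip."""
    addr = ipaddress.ip_address(ip)
    for prefix, owner in KNOWN_PREFIXES.items():
        if addr in ipaddress.ip_network(prefix):
            return owner
    return None
```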

Ports

Understanding which ports are open can help determine the firewall in place and the underlying systems running.

Single Ports: While specific ports are commonly associated with certain services (e.g., port 80 for HTTP), this isn’t guaranteed. Misconfigurations can lead to odd combinations of ports and services. Port status is an indication, not proof; probing the service is necessary for confirmation.
Combination of Ports: Certain combinations of open ports can strongly indicate an underlying system. For example, Cloudflare often presents a standard set of 13 open ports, while Imperva Incapsula might show all TCP ports as open.
Port “Spoofing”/Firewall Behavior: If a firewall detects a port scan, it might respond by showing no open ports, dropping packets, or indicating all ports are open. Analyzing this behavior in detail can provide clues about the edge device (firewall/WAF) in use.
Malformed Requests: Sending malformed requests that don’t adhere to RFCs can sometimes elicit responses that reveal more information than standard requests.

Tools and Techniques: For scanning at scale, masscan is fast, though it may produce a higher number of false positives. Speed and accuracy usually trade off against each other, so decide which matters more for your use case. Nmap offers more accuracy and service detection features.
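To make the "combination of ports" idea concrete, here is a minimal sketch that compares an observed set of open ports with known profiles using Jaccard similarity. The profile names and the 0.8 threshold are assumptions. The 13-port set reflects the HTTP/HTTPS ports Cloudflare documents as proxied, to the best of our knowledge, and should be checked against current vendor documentation.

```python
from typing import Iterable, Optional

# Ports commonly documented as proxied by Cloudflare (verify before relying on it).
CLOUDFLARE_PORTS = frozenset(
    {80, 8080, 8880, 2052, 2082, 2086, 2095, 443, 2053, 2083, 2087, 2096, 8443}
)

# Hypothetical profile table for this sketch.
PROFILES = {
    "Cloudflare-style proxy": CLOUDFLARE_PORTS,
    "Plain web server": frozenset({80, 443}),
}


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_profile(open_ports: Iterable[int], threshold: float = 0.8) -> Optional[tuple]:
    """Return (profile name, similarity) for the closest profile, or None."""
    ports = frozenset(open_ports)
    name, score = max(
        ((n, jaccard(ports, p)) for n, p in PROFILES.items()),
        key=lambda item: item[1],
    )
    return (name, score) if score >= threshold else None
```

As noted above, a match is an indication rather than proof. Probe the service itself before treating the result as a classification.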

Protocol/Schema

The identified protocol/schema is connected to the combination of hostname (e.g., the Host header for domain fronting, or TLS-based routing using SNI), IP address, and port in the request.

Nested Communications: Communications can be nested. Many basic tools might not capture these nested communications, whether they result from intentional design or misconfiguration. This can lead to an incomplete understanding of what’s truly exposed.

Tools and Techniques: Nmap is the best-known tool for service detection. Other tools like the JA4+ fingerprint suite (JA4 and JA4S for TLS clients and servers, JA4T for TCP) and fingerprintx can also help identify protocols and services.
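A very small part of what these tools do can be sketched as a check on the first bytes a service sends back. The function below is an assumption-laden simplification: it only recognises a handful of well-known greetings and does not detect nested protocols.

```python
def identify_protocol(first_bytes: bytes) -> str:
    """Guess the protocol from the first bytes received on a port."""
    if first_bytes.startswith(b"SSH-"):
        return "ssh"  # SSH identification string, e.g. SSH-2.0-...
    if first_bytes.startswith(b"HTTP/"):
        return "http"  # HTTP status line
    # TLS record: content type 0x16 (handshake) or 0x15 (alert), major version 0x03
    if len(first_bytes) >= 3 and first_bytes[0] in (0x15, 0x16) and first_bytes[1] == 0x03:
        return "tls"
    if first_bytes.startswith(b"220"):
        # Both FTP and SMTP greet with reply code 220; look for an SMTP hint.
        return "smtp" if b"SMTP" in first_bytes.upper() else "ftp-or-smtp"
    return "unknown"
```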

SSL/TLS

Certificate Authority: Are certificates updated manually or automatically (e.g., Let’s Encrypt certificates are short-lived and usually automated)? Are different certificate authorities used in different parts of the infrastructure? This can hint at internal processes or even supply chain elements.
Subject Alternative Names (SANs): Is the certificate used for other domains? What can be learned from them? For example, google.com’s certificate lists over 50 domains under SAN.
JARM: JARM is an active TLS server fingerprinting technique. Analyzing the resulting hashes can group disparate servers by configuration, identify default applications or infrastructure, and even fingerprint malware command and control servers.
Handshake Details: Different TLS server implementations respond differently when actively probed. Analyzing supported ciphers and TLS versions provides insights into the server’s configuration and potential vulnerabilities.

Tools and Techniques: JARM fingerprinting tools actively probe servers. Certificate Transparency (CT) logs, searchable through services like crt.sh, are valuable public data sources for discovering certificates issued for your domains.
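The certificate questions above can be answered from a parsed certificate. The sketch below assumes the dictionary shape returned by Python’s ssl.SSLSocket.getpeercert(). The approved-CA set is a hypothetical policy, and the sample certificate in the usage note is invented.

```python
from typing import Optional

# Hypothetical internal policy: which issuing organizations are approved.
APPROVED_CAS = {"Let's Encrypt"}


def issuer_org(cert: dict) -> Optional[str]:
    """Return the issuer's organizationName from a getpeercert()-style dict."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def san_dns_names(cert: dict) -> list[str]:
    """Return all DNS entries from the subjectAltName extension."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def ca_policy_ok(cert: dict) -> bool:
    """True when the issuing organization is on the approved list."""
    return issuer_org(cert) in APPROVED_CAS


# Invented sample in the same shape as ssl.SSLSocket.getpeercert() output.
SAMPLE_CERT = {
    "issuer": ((("countryName", "US"),), (("organizationName", "Let's Encrypt"),)),
    "subjectAltName": (("DNS", "example.com"), ("DNS", "staging.example.com")),
}
```

Running san_dns_names over every certificate you collect is a cheap way to surface hostnames that never appeared in your DNS enumeration.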

Deep Dive into HTTP Responses

Response Codes

A simple 200 OK status code might offer limited information in isolation. However, observing an application’s status codes in response to crafted payloads can be far more revealing. Different payloads will trigger different behaviors, and a WAF may interfere. Additionally, response codes can vary based on the User-Agent and Accept request headers.

10X (Informational): Commonly seen when upgrading to WebSockets or when Expect headers are used.
20X (Successful): Limited use in isolation for system identification without further context.
30X (Redirection): Redirect headers can give hints about underlying systems, authentication flows, or application structure. An example:

$ curl -v http://whitehouse.gov
* Trying 192.0.66.51:80...
* Connected to whitehouse.gov (192.0.66.51) port 80 (#0)
> GET / HTTP/1.1
> Host: whitehouse.gov
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Wed, 16 Apr 2025 12:15:21 GMT
< Content-Type: text/html
< Content-Length: 162
< Connection: keep-alive
< Location: https://whitehouse.gov/
<
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

The response body of the redirect clearly states that nginx is used.

40X (Client Error): These are often very interesting, as they can be triggered with specially crafted payloads tailored to specific types of systems. Different systems have unique 404 pages or error messages.
50X (Server Error): It’s not uncommon for 50X errors to present custom error pages or verbose error messages that can be connected to a specific system type, framework, or even programming language. If a 50X error can be triggered, you might be able to detect more.

Tools and Techniques: Common web scanning tools like Burp Suite, combined with human ingenuity, can help us understand more.

For example, triggering a non-200 status code can sometimes expose more information about a system or its underlying technology. If you’re looking to identify assets running IBM Notes/IBM Domino, it can be helpful to request an .nsf file that does not exist.

Sending a GET request to example.com/foo.nsf can trigger a 404 response containing strings such as <h1>Error 404</h1>HTTP Web Server: IBM Notes Exception - File does not exist</body>. However, simply sending a request to a non-existent path such as example.com/foo will not trigger the same descriptive error.
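The two examples in this section (the nginx redirect body and the Domino 404) can be expressed as simple signatures over status code and body. This is a minimal sketch: the signature table and function name are assumptions, and a real signature set would be far larger.

```python
from typing import Optional

# (status code, body marker, label) triples taken from the examples in this section.
SIGNATURES = [
    (404, "IBM Notes Exception", "IBM Notes/Domino"),
    (301, "<center>nginx</center>", "nginx"),
]


def match_error_page(status: int, body: str) -> Optional[str]:
    """Return the label of the first signature matching status and body."""
    for sig_status, marker, label in SIGNATURES:
        if status == sig_status and marker in body:
            return label
    return None
```

To use it, fetch a deliberately unusual path (such as /foo.nsf) with your HTTP client of choice and pass the resulting status and body to match_error_page.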

Response Headers

This category is vast, so we’ll focus on key areas:

X- Headers: Custom headers can explicitly state technologies used by the target (e.g., X-Powered-By: Express, X-Generator: Drupal). Some X- headers are unique to specific technologies or intermediary devices.
Server Header: Often specifies the web server software (e.g., Server: Apache/2.4.58, Server: nginx).
Security Headers:
- Content-Security-Policy (CSP): Can help understand the underlying resources being loaded, such as CDNs, cloud storage buckets for static assets, or types of tracking pixels/marketing systems used. (This can be invaluable for sales teams building prospect lists too!)
- Example: Checking the CSP of paypal.com can reveal reliance on Salesforce:

$ curl -Iks https://www.paypal.com/se/home | grep -Eo '.{16}salesforce.{16}'
e.com https://*.salesforce.com https://*.f
l.com https://*.salesforce.com https://sec

Tools and Techniques: Command-line tools like concurl or HTTPie are handy for inspecting headers. Web fuzzers like ffuf, dirsearch, or gobuster (often used for content discovery) can also be used to observe header variations based on different paths or inputs.
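Rather than grepping a CSP for one vendor at a time, you can split the header into directives and list every host it allows. The parser below is a simplified sketch, not a full CSP implementation: it skips quoted keywords, bare schemes, and the bare wildcard, and the function name is an assumption.

```python
def csp_hosts(csp: str) -> dict[str, set[str]]:
    """Map each CSP directive to the set of host sources it allows."""
    result: dict[str, set[str]] = {}
    for directive in csp.split(";"):
        parts = directive.split()
        if not parts:
            continue
        name, sources = parts[0].lower(), parts[1:]
        hosts = set()
        for src in sources:
            # Skip keywords ('self', 'nonce-...'), schemes (data:, https:), and "*".
            if src.startswith("'") or src.endswith(":") or src == "*":
                continue
            host = src.split("://", 1)[-1].split("/", 1)[0]
            hosts.add(host.lower())
        result[name] = hosts
    return result
```

Grouping the resulting hosts by registered domain across all your assets gives a quick inventory of third-party dependencies.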

File Signatures

Many file types can be identified by the first few bytes of the file.

Tools and Techniques: The file command and other commands like xxd or hexdump can be used to inspect these bytes.
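The same check is easy to script against response bodies. The sketch below covers a few widely documented signatures. The table is deliberately short and the function name is an assumption.

```python
from typing import Optional

# A small table of well-known leading byte sequences ("magic numbers").
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
]


def sniff(body: bytes) -> Optional[str]:
    """Return a file type label based on the first bytes of a response body."""
    for magic, label in MAGIC:
        if body.startswith(magic):
            return label
    return None
```

Comparing the sniffed type with the declared Content-Type header can also highlight misconfigured or unexpected responses.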

Content-Type and Length

The combination of the Content-Type header and Content-Length can indicate application types:

Traditional web apps: Typically respond with HTML and longer Content-Length for full pages.
APIs: Typically respond with JSON or XML, usually with a shorter Content-Length for unauthenticated requests.
Single-Page Applications (SPAs): Typically respond with HTML, often containing at least one <script> tag, but generally with a much shorter Content-Length than traditional web apps.

Tools and Techniques: curl is useful here.
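The heuristics above can be written down as a small decision function. This is only a sketch: the 5,000-byte SPA cutoff is an arbitrary assumption chosen for illustration, and real thresholds should be tuned against your own data.

```python
# Assumed cutoff for "short" HTML shells; tune this against real responses.
SPA_MAX_BYTES = 5000


def guess_app_type(content_type: str, length: int, script_tags: int = 0) -> str:
    """Guess the application type from Content-Type, body length, and script count."""
    media = content_type.split(";", 1)[0].strip().lower()
    if media in ("application/json", "application/xml", "text/xml") or media.endswith(
        ("+json", "+xml")
    ):
        return "api"
    if media == "text/html":
        # A short HTML shell that loads at least one script suggests an SPA.
        if length < SPA_MAX_BYTES and script_tags >= 1:
            return "spa"
        return "traditional web app"
    return "unknown"
```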

Further Down in HTML

Favicon Fingerprinting

With libraries of known favicons (or their hashes), this can be a very fast way to scan a large number of assets. Running favicon fingerprinting across broad domain sets can yield significant insights into the technologies used.

Tools and Techniques: Tools like httpx -favicon or platforms like Shodan (which has a favicon hash search) can automate this.
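The lookup itself is a hash-table match. Shodan-style favicon hashes are, as far as we know, MurmurHash3 values computed over the base64-encoded icon, which needs a third-party library. The sketch below swaps in SHA-256 from the standard library to show the same idea. The sample icon bytes and the "ExampleCMS" label are invented.

```python
import hashlib
from typing import Optional


def favicon_hash(icon_bytes: bytes) -> str:
    """Hash raw favicon bytes (SHA-256 here, standing in for MurmurHash3)."""
    return hashlib.sha256(icon_bytes).hexdigest()


# Hypothetical library of known icons; in practice you would build this from
# default favicons shipped with the software you want to recognise.
SAMPLE_ICON = b"\x00\x00\x01\x00 placeholder icon bytes"
KNOWN_FAVICONS = {favicon_hash(SAMPLE_ICON): "ExampleCMS"}


def identify_favicon(icon_bytes: bytes) -> Optional[str]:
    """Return the software label for a known favicon, or None."""
    return KNOWN_FAVICONS.get(favicon_hash(icon_bytes))
```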

URL Patterns

The structure of URLs can be very revealing. The most basic example is the existence of an admin page at a specific path (e.g., /wp-admin for WordPress). Other examples are how product categories or user profiles are represented (e.g., /product/{id}, /user/{username}), the encoding used in parameters and the presence of directory listings.

Tools and Techniques: Wordlists of common paths can be sent with tools like Burp Intruder, ffuf, or dirsearch. Regex patterns can then be applied to the response data to identify interesting results.
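As a minimal sketch of that last step, the function below applies a few regular expressions to discovered paths and groups the hits by label. The pattern table and labels are assumptions for illustration.

```python
import re

# Illustrative path patterns; each maps to a label of our own choosing.
URL_PATTERNS = [
    (re.compile(r"^/wp-(admin|content|includes)(/|$)"), "WordPress"),
    (re.compile(r"^/api/v\d+(/|$)"), "Versioned API"),
    (re.compile(r"^/product/\d+$"), "Product catalogue route"),
]


def classify_paths(paths: list[str]) -> dict[str, list[str]]:
    """Group paths by the label of the first pattern each one matches."""
    hits: dict[str, list[str]] = {}
    for path in paths:
        for pattern, label in URL_PATTERNS:
            if pattern.search(path):
                hits.setdefault(label, []).append(path)
                break
    return hits
```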

Meta-Tags

name attribute: Default descriptions, keywords, or generator tags often indicate a specific CMS or framework.

<meta name="generator" content="WordPress 6.2.2" />
<meta id="shopify-digital-wallet" ...>
<meta name="shopify-checkout-api-token" ...>

http-equiv attribute: Can be used in the same way as HTTP response headers and can therefore carry similar identifying information.

Tools and Techniques: These tricks are often hidden within tools and are typically opaque. Any DAST tool would fit into this category; Nuclei, for example, has open-source signatures for these purposes. Tools like Wappalyzer and WhatWeb could also be included, as they utilize similar techniques.
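To make the technique less opaque, here is a small sketch that collects name and http-equiv meta tags using Python’s standard html.parser module. The class and function names are assumptions.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=...> and <meta http-equiv=...> content values."""

    def __init__(self) -> None:
        super().__init__()
        self.named: dict[str, str] = {}
        self.http_equiv: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {key: (value or "") for key, value in attrs}
        if "name" in a:
            self.named[a["name"].lower()] = a.get("content", "")
        if "http-equiv" in a:
            self.http_equiv[a["http-equiv"].lower()] = a.get("content", "")


def collect_meta(html: str) -> tuple[dict[str, str], dict[str, str]]:
    """Return (named meta tags, http-equiv meta tags) found in html."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.named, parser.http_equiv
```

With the WordPress example above, collect_meta returns a named dictionary whose "generator" entry carries the product and version string.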

Form-Tags

The structure and patterns in form tags, especially for logins, along with their action URLs, can provide strong clues about the underlying CMS.

Tools and Techniques: Some tools that can be used are Wappalyzer, WhatWeb, curl+grep.
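As one concrete example, a default WordPress login form posts to wp-login.php with input fields named log and pwd. The sketch below checks for that combination. The class and function names are assumptions, and customized or themed login pages may not match.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Record each form's action URL and the names of its input fields."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: list[dict] = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {"action": a.get("action") or "", "inputs": set()}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and a.get("name"):
            self._current["inputs"].add(a["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_like_wordpress_login(html: str) -> bool:
    """True if any form posts to wp-login.php with 'log' and 'pwd' inputs."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    return any(
        "wp-login.php" in form["action"] and {"log", "pwd"} <= form["inputs"]
        for form in collector.forms
    )
```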

Code Patterns

Looking at code patterns involves identifying characteristic snippets, function names, variable naming conventions, CSS class structures, or HTML element arrangements typical of certain frameworks or libraries.

When analyzing code patterns, we are increasingly using statistical and linguistic models to match identified applications with known examples. One approach is to examine the linguistic structure of the code and remove all plain text content. However, this process becomes more challenging when the code is obfuscated or compressed.

Tools and Techniques: Bishop Fox has utilized bindings for Tree-sitter to parse JavaScript into an abstract syntax tree (AST). Detectify co-founder and security researcher Fredrik Almroth has explored ANTLR for similar purposes, specifically aiming to parse GraphQL. Some useful links are https://tree-sitter.github.io/tree-sitter/ and https://www.antlr.org/. Comparing tree structures after obtaining an AST involves a relatively advanced field of mathematics.
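Full AST comparison is beyond a short example, but the underlying idea of stripping text and comparing structure can be sketched with the standard library alone. The code below is a much simpler stand-in, not the Tree-sitter or ANTLR approach mentioned above: it reduces HTML to a sequence of tags and CSS classes and compares two sequences with difflib. All names are assumptions.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Skeleton(HTMLParser):
    """Reduce a page to its tag/class sequence, discarding all text content."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        # Sort class names so attribute order does not affect the comparison.
        self.tokens.append(tag + "".join("." + c for c in sorted(classes.split())))

    def handle_endtag(self, tag):
        self.tokens.append("/" + tag)


def skeleton(html: str) -> list[str]:
    parser = Skeleton()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_similarity(html_a: str, html_b: str) -> float:
    """Similarity in [0, 1] between the structural skeletons of two pages."""
    return SequenceMatcher(None, skeleton(html_a), skeleton(html_b)).ratio()
```

Two pages rendered from the same template in different languages should score close to 1.0, because only the discarded text differs.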

Links in Code

It’s not uncommon for CMSs, themes, or plugins to include default links to their documentation or license agreement. For example:

<a href="https://www.espocrm.com" title="Powered by EspoCRM"
Powered by <a href="http://ofbiz.apache.org"
href="https://about.gitea.com">Powered by Gitea
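Links like these can be matched with a small host lookup. The sketch below uses the three hosts from the examples above. The table, regular expression, and function name are assumptions, and a regex is a deliberately crude way to read href attributes.

```python
import re

# Hosts taken from the examples above, mapped to product names.
POWERED_BY_HOSTS = {
    "espocrm.com": "EspoCRM",
    "ofbiz.apache.org": "Apache OFBiz",
    "about.gitea.com": "Gitea",
}

HREF_HOST = re.compile(r'href="https?://([^/"]+)', re.IGNORECASE)


def products_from_links(html: str) -> set[str]:
    """Return products whose default links appear in href attributes."""
    found = set()
    for host in HREF_HOST.findall(html):
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in POWERED_BY_HOSTS:
            found.add(POWERED_BY_HOSTS[host])
    return found
```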

Third-Party Resources

Applications frequently load third-party resources, and both the type and location of these resources can provide insights about the application itself. This information can reveal details about any supply chain dependencies or technical components being utilized. For instance, the presence of an analytics platform (such as Amplitude) typically suggests that the application is of significant importance and is actively being developed.

Tools and Techniques: Wappalyzer provides this information. External JavaScript, such as files hosted on CloudFront, is also a great source of links, domains, API operations, and more. Occasionally, these JavaScript files contain sensitive information. Two resources worth bookmarking are TruffleHog and KeyHacks.
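For a quick look without a dedicated tool, you can list the external hosts a page loads resources from. The sketch below covers only script, img, iframe, and link tags and treats any host outside the given first-party domain as third-party. The names and the sample hosts in the usage note are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ResourceCollector(HTMLParser):
    """Collect URLs of resources referenced by common tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("script", "img", "iframe") and a.get("src"):
            self.urls.append(a["src"])
        elif tag == "link" and a.get("href"):
            self.urls.append(a["href"])


def third_party_hosts(html: str, first_party: str) -> set[str]:
    """Return hosts of loaded resources that fall outside first_party."""
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    hosts = set()
    for url in collector.urls:
        host = urlsplit(url).hostname  # None for relative URLs
        if host and host != first_party and not host.endswith("." + first_party):
            hosts.add(host)
    return hosts
```

For a page on example.com that loads a script from a CloudFront distribution and a stylesheet from static.example.com, only the CloudFront host would be reported.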

Combining all methods

While individual data points are helpful, their true power is unlocked when combined. Only then can we answer more elaborate and critical questions about our attack surface. One might envision an AI agent piecing this together, but a more standard approach involves defining the question and then selecting the appropriate data points and tools needed.

Some questions might require only a single data point, while others necessitate combining many to achieve an acceptable confidence level in the classification. Consider these:

Do we have stale DNS endpoints?
Are we adhering to internal policies for approved Certificate Authorities?
Are our redirects configured correctly, and are their targets appropriate?
Are we following our internal security policies regarding the desired tech stack?
Are all applications covered with appropriate protection?
Where are all our APIs?
Is our Configuration Management Database (CMDB) accurate and up-to-date with what’s actually exposed?
How comprehensively are we assessing our attack surface?
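One simple way to combine several data points into a confidence level is a noisy-OR: treat each observation as independent evidence and multiply the chances that all of them are wrong. The weights below are invented for illustration, and the independence assumption rarely holds exactly, so read the output as a ranking aid rather than a probability.

```python
from collections import defaultdict

# Hypothetical reliability of each data point when it suggests a label.
WEIGHTS = {
    "meta_generator": 0.9,
    "favicon": 0.7,
    "url_pattern": 0.6,
    "server_header": 0.4,
}
DEFAULT_WEIGHT = 0.2  # assumed weight for data points not listed above


def combine(observations: list[tuple[str, str]]) -> dict[str, float]:
    """Combine (data_point, label) observations into per-label confidence."""
    miss: dict[str, float] = defaultdict(lambda: 1.0)
    for data_point, label in observations:
        # Probability that every observation so far is wrong about this label.
        miss[label] *= 1.0 - WEIGHTS.get(data_point, DEFAULT_WEIGHT)
    return {label: round(1.0 - m, 3) for label, m in miss.items()}
```

A label supported by both a URL pattern and a favicon match scores higher than either signal alone, which mirrors the point that some classifications need several data points before they are trustworthy.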

Systematically collecting and analyzing these diverse data points can help security teams move beyond simple asset discovery to a much deeper understanding, classification, and potentially testing of their web applications. Some tools can automate asset classification and deliver intelligent recommendations on what assets are potential attack targets and warrant deep testing.

