How to Compare Uniform Resource Identifiers
By: Monika • Essay • 2,050 Words • May 17, 2010 • 1,285 Views
Author: Tim Bray
Abstract
This document discusses issues concerning the comparison of Uniform Resource Identifiers (URIs) and documents common practice.
Introduction
Software is commonly required to compare two URIs. Such a comparison is always made for some particular purpose, and different software modules might reasonably come to different conclusions about the same pair of URIs. This document uses the terms "different" and "equivalent" to describe the possible outcomes of such comparisons, but, as the discussion of examples and procedures makes clear, there are many possible application-dependent versions of equivalence.
Since URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. This definition of equivalence is not of much practical use for reasons which include:
* Resources may have many different identifiers.
* Web architecture defines how resources are named and how their representations are interchanged, but doesn't define resource equivalence.
For these reasons, determination of equivalence or difference must be based on string comparison, perhaps augmented by reference to additional rules provided in one or more RFCs.
Software modules performing such comparisons differ in their requirements and therefore their URI equivalence criteria. This document describes a variety of methods which may be used to compare URIs, the trade-offs between them, and the types of applications which might use them.
The expressiveness of URIs is limited by their small character repertoire. The IRI specification currently under development is aimed at addressing this. The material in this note applies equally to URIs and IRIs.
Status of This Document
This is the second draft of this document. It reflects editorial input from members of the TAG and the broader community, but may not represent the consensus of the TAG.
Background
Inevitability of False Negatives
URIs exist to identify resources. A resource, in the Web Architecture, is an abstraction; a URI may in some cases be dereferenced to yield a representation of the resource. Any two different URIs may identify the same resource, in the view of the user or publisher of that resource. Thus, while comparison of two URIs can establish with confidence that they are equivalent and identify the same resource, such comparisons can always yield "false negatives". Put another way, it is often possible to determine that two URIs are equivalent, but it is never possible to be sure that they identify different resources.
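The asymmetry can be illustrated with a minimal Python sketch. It assumes the simplest possible comparison, exact character-by-character string equality; the function name and the example.com URIs are hypothetical and chosen only for illustration. A match is reliable evidence of equivalence, while a mismatch says nothing certain about the resources.

```python
def simple_string_equal(uri_a: str, uri_b: str) -> bool:
    """Return True only when the two URIs are identical, character for character."""
    return uri_a == uri_b


# Identical strings: safe to treat the URIs as equivalent.
same = simple_string_equal("http://example.com/uri", "http://example.com/uri")

# Different strings: reported as different, although a publisher might
# serve the same resource at both (a possible false negative).
maybe_same = simple_string_equal("http://example.com/uri", "http://EXAMPLE.com/uri")

print(same, maybe_same)
```

The second result is False even though many applications would regard the two URIs as naming the same resource; string comparison alone cannot settle the question.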
Rules Governing URIs
The syntax of URIs is defined by RFC2396; the present document cannot really be understood without reference to that RFC. RFC2396 defines a URI as a sequence of characters, with the definition of "character" not tied to any particular form of storage; the characters may be stored on disk one byte per character, in a Java string two bytes per character, painted on the side of a bus, or spoken in conversation.
The repertoire of characters in URIs is limited, comprising a subset of US-ASCII. Certain of these characters have special roles, for example : and /, and may not be otherwise used in URIs.
The world contains many characters useful in identifying resources beyond those in US-ASCII, and furthermore the special characters such as : and / are also often useful. RFC2396's "%-escaping" mechanism is helpful in these situations. %-escaping is a two-step process; the logical characters in the URI are encoded in some fashion (such as ASCII, UTF-8, or Shift-JIS) as a series of octets; each octet is then represented as a 2-digit hexadecimal code preceded by the percent sign %.
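The two steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than a complete implementation: the helper name percent_escape is hypothetical, and it escapes every character it is given, whereas real software escapes only the characters that need it (the standard library's urllib.parse.quote behaves that way).

```python
from urllib.parse import quote


def percent_escape(text: str, encoding: str = "utf-8") -> str:
    """Escape every character of text using the two-step %-escaping process."""
    # Step 1: encode the logical characters as a series of octets.
    octets = text.encode(encoding)
    # Step 2: write each octet as a percent sign plus two hexadecimal digits.
    return "".join(f"%{octet:02X}" for octet in octets)


slash = percent_escape("/")                      # '%2F'
e_acute_utf8 = percent_escape("é")               # '%C3%A9'
e_acute_latin1 = percent_escape("é", "latin-1")  # '%E9'

# The standard library agrees for characters that require escaping.
assert quote("é", safe="") == e_acute_utf8
```

Note that the same logical character produces different escaped forms depending on the encoding chosen in the first step, which matters when two escaped URIs are later compared.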
URI Schemes
RFC2396 specifies that every URI has a "scheme", a leading sequence of characters delimited by a colon character :. Two examples are http://example.com/uri and ftp://ftp.example.com/uri; their schemes respectively are "http" and "ftp". It is common practice to name classes of URIs by their scheme, for example "HTTP URIs" and "FTP URIs".
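As a small illustration, the scheme of each example above can be extracted with Python's standard urllib.parse module. The wrapper name uri_scheme is hypothetical; note also that urlsplit returns the scheme in lower case, which is a behaviour of that library rather than something stated in this document.

```python
from urllib.parse import urlsplit


def uri_scheme(uri: str) -> str:
    """Return the scheme: the leading characters that precede the first colon."""
    return urlsplit(uri).scheme


http_scheme = uri_scheme("http://example.com/uri")     # 'http'
ftp_scheme = uri_scheme("ftp://ftp.example.com/uri")   # 'ftp'

print(http_scheme, ftp_scheme)
```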
Each URI scheme which is appropriately registered with the Internet Assigned Numbers Authority (IANA) has a governing RFC; for example, HTTP URIs are described