RDF [[RDF-CONCEPTS]] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.
This document is a work in progress.
When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to be able to transform each of these different ways into a single, standard format. With a standard format, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.
In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.
Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.
This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.
There are different use cases where graph or dataset canonicalization are important:
A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases. The use of blank nodes in RDF graphs and datasets has a long history and creates inevitable complexities. Blank nodes are used for different purposes:
Further more, RDF semantics dictate that deserializing the same RDF document results in the creation of unique blank nodes, unless it can be determined that on each occasion, the blank node identifiers the same resource; this is due to the fact that blank node identifiers are an aspect of a concrete RDF syntax and are not intended to be persistent or portable. Within the abstract RDF model, blank nodes do not have identifiers (although some RDF store implementations may use stable identifiers and choose to make them portable). See Blank Nodes in [[!RDF-CONCEPTS]] for more information.
RDF does have a provision for allowing blank nodes to be published in an externally identifiable way through the use of Skolem IRIs, which allows a given RDF store to replace the use of blank nodes in a concrete syntax with IRIs, which serve to repeatably identify that blank node within that particular RDF store, however, this is not generally useful for talking about the same graph in different RDF stores, or other concrete representations. In any case, a stable blank node identifier defined for one RDF store, serialization is arbitrary, and typically not relatable to the context within which it is used.
This specification defines an algorithm for creating stable blank node identifiers repeatably for different serializations possibly using individualized blank node identifiers of the same RDF graph (dataset) by grounding each blank node through the nodes to which it is connected, essentially creating Skolem blank node identifiers. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting normalized dataset, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.
TimBL has a design note on problems with Diff which should be referenced.
Jerremy Carroll has a paper on signing RDF graphs.
This document is a detailed specification for an RDF dataset canonicalization algorithm. The document is primarily intended for the following audiences:
To understand the basics in this specification you must be familiar with basic RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of graph theory and graph isomorphism is also recommended.
There are a number of ways that one may participate in the development of this specification:
_:that is used as an identifier for an blank node. Blank node identifiers are typically implementation-specific local identifiers; this document specifies an algorithm for deterministically specifying them.
Canonicalization is the process of transforming an input dataset to a normalized dataset. That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into identical normalized dataset. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.
Strictly speaking, the normalized dataset must be serialized to be stable, as within a dataset, blank node identifiers have no meaning. This specification defines a normalized dataset to include stable identifiers for blank nodes, but practical uses of this will always generate a canonical serialization of such a dataset.
In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Canonicalization Algorithm 2015" (URDNA2015).
This statement is overly prescriptive and does not include normative language. This spec should describe the theoretical basis for graph canonicalization and describe behavior using normative statements. The explicit algorithms should follow as an informative appendix.
When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state. The information contained in the canonicalization state is described below.
_:c14n, for issuing canonical blank node identifiers.
During the canonicalization algorithm, it is sometimes necessary to issue new identifiers to blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.
_:c14nis a proper initial value for the identifier prefix that would produce blank node identifiers like
The canonicalization algorithm converts an input dataset into a normalized dataset. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset.
Documenting the algorithm is a WIP, various steps will become more detailed in time.
See the note for the hash first degree quads
algorithm. We should either remove the loop based on
simple here but indicate that the original design
of the algorithm was to have such a loop, or leave it but
inform implementers that it is safe to break after one iteration
of the loop (again, indicating why). A future version of this
algorithm should make the loop effectual.
true, issue canonical identifiers for blank nodes:
1, continue to the next mapping.
This algorithm issues a new blank node identifier for a given existing blank node identifier. It also updates state information that tracks the order in which new blank node identifiers were issued.
This algorithm takes an identifier issuer and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:
Add note that the result of this algorithm for a
particular blank node will always be the same. This is only true
because there was a typo in the spec that has now been implemented
by many implementations. The design of the algorithm was to use the
assigned canonical blank node identifier, if available, instead of
_:z, similar to how it is used in
the related hash algorithm, but this text never made it into the spec
before implementations moved forward. Therefore, the hashes here
never change, making the loop based on the
flag that calls this algorithm unnecessary; it needs to only run
once. A future version of this algorithm should correct this mistake.
This algorithm takes the canonicalization state and a reference blank node identifier as inputs.
_:a, otherwise, use the blank node identifier
Note potential need to normalize literals to their canonical representation here as well, if not done on the original input dataset.
This algorithm creates a hash to identify how one blank node is related to another. It takes the canonicalization state, a related blank node identifier, a quad, an identifier issuer, issuer, and a string position as inputs.
<, the value of the predicate in quad, and
The relationship and difference between this algorithm and the hash first degree quads algorithm should be better explained. There may also be better names for the two algorithms.
The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).
Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.
This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.
The inputs to this algorithm are the canonicalization state, the identifier for the blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary blank node identifiers. The output from this algorithm will be a hash and the identifier issuer used to help generate it.
gbased on whether component is a subject, object, graph name, respectively.
<, the hash in result, and
A previous version of this algorithm has light deployment. For purposes of identification, the algorithm is called the "Universal RDF Graph Canonicalization Algorithm 2012" (URGNA2012), and differs from the stated algorithm in the following ways:
_:g, instead of
>; there were no delimiters.
p, for property, when the related blank node was a subject and the value
r, for reverse or reference, when the related blank node was an object. Since URGNA2012 only normalized graphs, not datasets, there was no use of the graph name position.
The editors would like to thank Jeremy Carroll for his work on the graph canonicalization problem, Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners Lee for his thoughts on graph canonicalization over the years, Jesús Arias Fisteus for his work on a similar algorithm.