The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Unique Identifiers for GEDCOM Records and Their Attributes

Terminology

o Must means mandatory and visa versa

You know the drill.

o UUID 'v' GUID

UUIDs (Universally Unique Identifiers) are also known as GUIDs (Globally Unique Identifiers). This document uses UUID.

o HEADER/INDI/FAM/etc

To avoid endlessly referring to HEADER/INDI/FAM/etc, this document just uses INDI.

o Page numbers

Page numbers refer to DRAFT Release 5.5.1, in Ged551-5.pdf. See "References" for downloading details.

Purpose of this document

This document is intended to formalize a set of proposals to add Unique Identifiers (UUIDs) to the GEDCOM specification, and hence to GEDCOM-structured data.

These proposals are expected to be implementated within various software packages, without any expectation that they will be accepted by the original authors of the GEDCOM specification.

UUIDs function to make it possible to combine GEDCOM files from multiple sources (files or systems), while retaining enough information so as to be able to uniquely identify the specific source from which any particular assertion about an individual or family originated.

This document is written from the point of view that a specific software system handling GEDCOM data has the capability of storing UUIDs, without necessarily being able to generate them.

This means that such a system which cannot generate UUIDs must nevertheless have the capability of reliably processing 1 or more UUIDs per GEDCOM record and/or attribute.

This means handling an INDI record which has 1 or more UUIDs, and further it means handling a NAME (within that INDI) which itself has 1 or more UUIDs.

Also, for a system which can generate UUIDs, the question arises: At what stage should these UUIDs be generated. This is discussed below.

Limitations of this document

This document does not attempt to define any mechanism whereby data is stored in any particular system.

It discusses everthing in terms of GEDCOM-compatible (i.e. stand-alone) files.

A Use Case for UUIDs

o Import

Import 2 GEDCOM files into a program which uses UUIDs, assigning a UUID to each input file.

o Match individuals

The program (somehow) decides an INDI record from file A matches an INDI record from file B. Let's say the names match.

o Flag inconsistencies

Then when there is inconsistent data (e.g. the asserted birthdays of the individual are different in the 2 files), the program saves the respective files' UUIDs attached to each to birthday.

o Multiple versions of data

In this manner inconsistent data is saved permanently (carried forward in time), without any specific determination that one version is 'right' and one is 'wrong'.

o Returning to the roots

At some later time we can then use these UUIDs to determine which source the specific values of the birthdays came from. Naturally any piece of data can be treated likewise.

Identifying Unique Identifiers per INDI

Proposal: Use _UID, UUID and TYPE as the GEDCOM-like tags for UUIDs.

Usage:

o Extension of the definition of HEADER and INDI
          HEADER :=
          n HEAD                                    {1:1}
          ...
            +1 _UID <<UNIQUE_IDENTIFIER_STRUCTURE>> {0:M}

          INDIVIDUAL_RECORD :=
          n @XREF:INDI@ INDI                        {1:1}
          ...
            +1 _UID <<UNIQUE_IDENTIFIER_STRUCTURE>> {0:M}

Note:

o The {0:M} limit

The limit is not {0:1} because there needs to be provision for storing multiple UUIDs per HEADER or INDI, assuming these have been generated on, or on behalf of, different source systems.

Perhaps every UUID in the file should be specified once in the header, for the convenience of the software importing the data. This also helps validation of the data.

o REFN

REFN is rejected as being suitable for this purpose because its length is limited to 20.

o RIN

RIN is rejected as being suitable for this purpose because its length is limited to 12.

o UNIQUE_IDENTIFIER_STRUCTURE
          UNIQUE_IDENTIFIER_STRUCTURE :=
          n UUID <UNIQUE_IDENTIFIER>         {34:34}
            +1 TYPE <UNIQUE_IDENTIFIER_TYPE> {1:248}
o UUID
          UUID := {Size=34:34}
          A unique identifier assigned to the INDI and, perhaps implicitly, to all attributes of the INDI.

The following is a quotation from the documentation of the Perl module Data::UUID:

"The algorithm for UUID generation, used by this extension is described in the Internet Draft "UUIDs and GUIDs" by Paul J. Leach and Rich Salz. (See RFC 4122.) It provides reasonably efficient and reliable framework for generating UUIDs and supports fairly high allocation rates -- 10 million per second per machine -- and therefore is suitable for identifying both extremely short-lived and very persistent objects on a given system as well as across the network."

The following is an adaptation of a quotation from the documentation of the Perl module Data::Session::ID::UUID34:

"Perl usage: my($uuid) = Data::UUID -> new -> create_hex.

Note: Data::UUID returns '0x' as the prefix of the 34-byte hex digest. You have been warned."

The point of including the '0x' in the UUID is to indicate to any system that the characters in the UUID are meant to be hex digits.

o TYPE
          TYPE := {Size=1:248}
          A user-defined definition of the UUID.

Avoiding Unnecessary Proliferation of UUIDs

Proposal: A system must only generate the minimum number of UUIDs to satisfy the requirements of this document.

This means only a single UUID need be stored in the header of a GEDCOM file, on the understanding it automatically applies ('trickles down') to each record within the file.

Likewise, only a single UUID need be stored in an INDI record, on the understanding it trickles down to each attribute (e.g. NAME) within that INDI.

The attribute is said to 'inherit' the UUID of its parent.

Generation of UUIDs

Proposal: That UUIDs, by default, need not be generated.

In other words, UUIDs are not, by default, mandatory.

Hence any (isolated) system can continue to operate in the absence of UUIDs.

Lifetime of UUIDs

Proposal: When a system does generate an INDI, then only 1 UUID is permanently associated with the corresponding data.

Hence, within any (isolated) system, any number of UUIDs can be generated per INDI record, but only the most recently generated one is to be preserved. All trace of all preceeding UUIDs is to be expunged from the system.

Importation of GEDCOM data

When a file of GEDCOM data is imported, various cases arise:

o Importation into an empty system

In this case, one UUID must be generated, which willl then apply to all of the incoming data.

o Importation into a system with pre-existing data
o When the pre-existing data has at least 1 UUID

Then, the imported data must have a UUID, or a UUID must be generated on-the-fly for the incoming data.

o When the pre-exising data does not have UUIDs

Then a UUID must be generated for the pre-existing data.

After that, the incoming data is treated as in the preceeding case.

The point of this is to specifically exclude any attempt to combine 2 datasets in some manner based on the false assumption that they can't have any matching data.

In each case, 2 (at least) UUIDs must be output in the header when the data is exported.

Merging and Matching Sources

o Exactly matching data

When data coming from the 2 sources exactly matches, there is no need to store a UUID with each attribute of the combined result. Here, attribute refers to, say, NAME within INDI.

It only suffices to store the 2 UUIDs, if they exist, at the INDI level.

Actually, if every data record matches, then there is only a need to store the UUID are the header level, and not at each INDI level.

o Mis-matched data

When the data does not match, then the UUIDs, if they exist, are to be stored at the INDI level. In practice, this means 1 or more levels below the INDI tag itself. See "Exporting UUIDs" just below.

At the level of the attribute which mis-matches, say NAME within INDI, it only suffices to store 1 UUID, although both can be stored. The one not stored can be inferred from the pair stored at the INDI level. This optimization is designed to reduce file sizes.

Perhaps this can be extended to the header, where the first mentioned UUID becomes the default (or the default is explicitly tagged), so that any data lacking but needing a UUID takes the default as its UUID.

Exporting UUIDs

The Header

Proposal: The header must list all UUIDs which appear in the data records.

This allows importing code to better prepare for the data, and to perform validation of the data.

Exporting more than 1 UUID per INDI

If we start with 2 GEDCOM files:

   0 @I1@ INDI
         1 NAME Alfred E. /Newman/

   0 @I2@ INDI
         1 Name Alfred Einstein /Newman/

Then, after they are been combined, how should they be exported? There are 2 obvious possibilities:

o As above, but with cross-referencing
   0 @I1@ INDI
         1 NAME Alfred E. /Newman/
           2 ALIA @I2@
           2 _UID 0x1...

   0 @I2@ INDI
         1 NAME Alfred Einstein /Newman/
           2 ALIA @I1@
           2 _UID 0x2...
o As one INDI with an AKA (alias)
   0 @I1@ INDI
         1 NAME Alfred E. /Newman/
           2 _UID 0x1...
         1 NAME Alfred Einstein /Newman/
           2 TYPE aka
           2 _UID 0x2...

Proposal: That only the second format be supported.

There is less processing complexity in handling the second format.

References

General:

o GEDCOM Specification
o The Perl Genealogy::* Namespace

Perl modules:

o Gedcom
o Data::UUID
o Data::Session::ID::UUID34