Creating Universally Unique IDs in Python

2009-10-05 17:41:30 » distributed data storage, guid, id, persistence, Python, Python Standard Library, unique identifier, uuid

GUID was a term bandied about in my office for any unique id used to identify our database records. I never gave it a second thought until I heard UUID mentioned in the context of CouchDB, as something that enables distributed data storage. (By the way, our GUIDs were just sequential numbers generated by our database.) This piqued my interest, so I started digging into UUIDs in general and Python's uuid module in particular.

Unique as a Snowflake

What is UUID?

A UUID, or universally unique identifier, is a 128-bit number that guarantees (or at least provides an extremely high probability) that the id is unique across space and time. Hence a UUID can be used to identify any entity without having to register with a central authority or coordinate with other services, which reduces the cost of assigning ids. According to RFC 4122, the generation algorithms support allocation rates of up to 10 million UUIDs per second per machine, so UUIDs can even be used as transaction ids. Since a UUID is a fixed-size number, storing, searching, and sorting are efficient compared to other alternatives. UUIDs were originally used in the Apollo Network Computing System (NCS) and were later standardized as part of the Distributed Computing Environment (DCE) of the Open Software Foundation (OSF). GUID (Globally Unique ID) is Microsoft's adaptation of the UUID; it is used to identify component interfaces in the Component Object Model (COM), among other things.

Representing UUID

The UUID is a 128-bit (16-byte) number, so the number of possible values is 2^128, or about 3.4 x 10^38. To put that in perspective, 1 trillion UUIDs would have to be created every nanosecond for roughly 10 billion years to exhaust them. That’s a lot of UUIDs. Several operating systems use UUIDs as ids. For example, Linux identifies partitions by UUID, as the commands below illustrate.

blkid

/dev/sda1: TYPE="ntfs" UUID="7AB07E19B07DDC57" LABEL="Fracshun"
/dev/sda2: UUID="974b7a60-ff19-4517-982e-a41134661936" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda4: LABEL="IBM_SERVICE" UUID="5CAC-523F" TYPE="vfat"
/dev/sda5: TYPE="ntfs" UUID="68ACBC3EACBC0918" LABEL="Wyldfyr"
/dev/sdb1: TYPE="ntfs"
/dev/sdb3: TYPE="ntfs"
/dev/sdb5: TYPE="ntfs"
/dev/sdb2: TYPE="ntfs"
/dev/mmcblk0p1: LABEL="CANON_DC" UUID="40D9-5C92" TYPE="vfat"

ls /dev/disk/by-uuid/ -lrth

lrwxrwxrwx 1 root root 10 2009-10-03 22:25 7AB07E19B07DDC57 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-10-03 22:25 974b7a60-ff19-4517-982e-a41134661936 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-10-03 22:25 5CAC-523F -> ../../sda4
lrwxrwxrwx 1 root root 10 2009-10-03 22:25 68ACBC3EACBC0918 -> ../../sda5
lrwxrwxrwx 1 root root 15 2009-10-04 22:47 40D9-5C92 -> ../../mmcblk0p1
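
Not every value that blkid labels as a UUID is a full 128-bit UUID: the short ids shown for the NTFS and vfat partitions are filesystem serial numbers. As a rough sketch (written for Python 3, with a few of the lines above pasted into a string called SAMPLE and a helper named full_uuid_devices, both names invented for illustration), the uuid module can tell them apart:

```python
import re
import uuid

# A few lines copied from the blkid output above (SAMPLE is a hypothetical name).
SAMPLE = """\
/dev/sda1: TYPE="ntfs" UUID="7AB07E19B07DDC57" LABEL="Fracshun"
/dev/sda2: UUID="974b7a60-ff19-4517-982e-a41134661936" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda4: LABEL="IBM_SERVICE" UUID="5CAC-523F" TYPE="vfat"
/dev/sdb1: TYPE="ntfs"
"""

# Captures the device name and the value of the UUID="..." field.
LINE = re.compile(r'^(\S+):.*?\bUUID="([^"]+)"')

def full_uuid_devices(text):
    """Return the devices whose UUID field parses as a 128-bit UUID."""
    devices = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # this line has no UUID field at all
        device, value = match.groups()
        try:
            uuid.UUID(value)  # raises ValueError unless it has 32 hex digits
        except ValueError:
            continue
        devices.append(device)
    return devices

print(full_uuid_devices(SAMPLE))
```

In this sample only the ext3 partition carries an id that the uuid module accepts.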

The UUID can also be represented as a 32-character hexadecimal string. The canonical form groups the hexadecimal digits as 8-4-4-4-12 (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx), e.g. “550e8400-e29b-41d4-a716-446655440000”.
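
The figures in this section can be checked with a short Python 3 sketch. The names TOTAL_UUIDS and canonical are invented here for illustration, and a year is assumed to be 365.25 days long.

```python
import uuid

TOTAL_UUIDS = 2 ** 128  # every possible 128-bit value
print(TOTAL_UUIDS)

# How long would 1 trillion UUIDs per nanosecond take to use them all up?
per_second = 10 ** 12 * 10 ** 9          # 1 trillion per nanosecond
seconds_per_year = 365.25 * 24 * 3600    # assumption: 365.25-day year
years = TOTAL_UUIDS / (per_second * seconds_per_year)
print("%.1f billion years" % (years / 1e9))

def canonical(hex32):
    """Split 32 hex digits into the 8-4-4-4-12 canonical groups."""
    if len(hex32) != 32:
        raise ValueError("expected 32 hexadecimal digits")
    bounds = [0, 8, 12, 16, 20, 32]
    return "-".join(hex32[a:b] for a, b in zip(bounds, bounds[1:]))

print(canonical("550e8400e29b41d4a716446655440000"))
```

The hand-rolled canonical helper produces the same string that str() gives for a uuid.UUID object built from the same hex digits, so in practice the module does this formatting for you.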

There are 5 versions of UUID:
Version 1 (MAC address)
Version 2 (DCE Security)
Version 3 (MD5 hash)
Version 4 (random)
Version 5 (SHA-1 hash)

We will look at each version and how it can be generated in Python.
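
The version number is stored inside the UUID itself, as the first hexadecimal digit of the third group in the canonical form. A minimal Python 3 sketch, using UUIDs that appear elsewhere in this post (the helper name version_digit is made up for illustration):

```python
import uuid

def version_digit(text):
    """Read the version straight from the canonical string form."""
    # Index 14 is the first character of the third group (8 + 1 + 4 + 1).
    return int(str(uuid.UUID(text))[14], 16)

samples = [
    "c4dfb150-b1a9-11de-a16c-00c09fed538e",  # produced by uuid1()
    "c9f8b609-b81e-3c95-8188-914324e741c8",  # produced by uuid3()
    "c4017364-4093-4622-bc0e-1defe804636e",  # produced by uuid4()
]
for text in samples:
    # The digit in the string agrees with the .version attribute.
    print(text, version_digit(text), uuid.UUID(text).version)
```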

python-uuid

Python has provided the uuid module as part of the standard library since version 2.5.

Version 1 (MAC address)

A version 1 UUID is generated from the MAC address of the computer and a timestamp, together with a clock sequence that is typically initialized randomly. Since the UUID is stamped with the MAC address of the creating machine and the time of creation, uniqueness is guaranteed as long as MAC addresses are unique; the clock sequence guards against duplicates if the clock is set backwards. Because the MAC address and the timestamp can be recovered from the UUID, version 1 is not preferred where that would leak information.

>>> import uuid
>>> id = uuid.uuid1()
>>> id
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')

The uuid1 function call creates an object of class uuid.UUID. Let us look at the different attributes of the UUID object.

>>> id.bytes #Bytes representation of the uuid
'\xc4\xdf\xb1P\xb1\xa9\x11\xde\xa1l\x00\xc0\x9f\xedS\x8e'
>>> id.bytes_le #Little endian bytes representation
'P\xb1\xdf\xc4\xa9\xb1\xde\x11\xa1l\x00\xc0\x9f\xedS\x8e'
>>> id.clock_seq
8556L
>>> id.clock_seq_hi_variant
161L
>>> id.clock_seq_low
108L
>>> id.fields #(time_low, time_mid,  time_hi_version, clock_seq_hi_variant, clock_seq_low, node)
(3302994256L, 45481L, 4574L, 161L, 108L, 827316851598L)
>>> id.hex
'c4dfb150b1a911dea16c00c09fed538e'
>>> id.int
261690165753032864726052546050565952398L
>>> id.node
827316851598L
>>> id.time
134740381578277200L
>>> id.time_hi_version
4574L
>>> id.time_low
3302994256L
>>> id.time_mid
45481L
>>> id.urn #Uniform Resource Name
'urn:uuid:c4dfb150-b1a9-11de-a16c-00c09fed538e'
>>> id.variant
'specified in RFC 4122'
>>> id.version
1
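
To see why version 1 gives away information, here is a small Python 3 sketch that recovers the creation time and the node from the UUID shown above. The helper names created_at and mac_address are invented for illustration, and the node is only a real MAC address when the generating machine could supply one.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID timestamps count 100-nanosecond intervals since 15 October 1582.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def created_at(u):
    """Return the UTC creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("expected a version 1 UUID")
    # u.time is in 100 ns units; dividing by 10 gives microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

def mac_address(u):
    """Format the 48-bit node field as a colon-separated MAC address."""
    digits = "%012x" % u.node
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

u = uuid.UUID("c4dfb150-b1a9-11de-a16c-00c09fed538e")
print(created_at(u))
print(mac_address(u))
```

For this UUID the timestamp decodes to 5 October 2009, the day this post was written, and the node prints as 00:c0:9f:ed:53:8e.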

Version 2 (DCE Security)

Version 2 UUIDs are similar to version 1 UUIDs, except that the clock_seq_low field is replaced with a local domain id, usually “POSIX UID domain” or “POSIX GID domain”, and the time_low field is replaced with a local identifier. The domain in the clock_seq_low field determines how the time_low field is interpreted: as a POSIX UID or as a POSIX GID. Python's uuid module does not support this version.

Version 3 (MD5 hash)

Version 3 UUIDs are generated from a namespace, such as a URL or a domain name, and a name or object id within that namespace. Version 3 UUIDs are of the form xxxxxxxx-xxxx-3xxx-xxxx-xxxxxxxxxxxx where x represents a hexadecimal digit. To generate one, the 16 bytes of the namespace's UUID are concatenated with the name, and the result is hashed with MD5 to produce a 128-bit value. The 4 version bits and the 2 variant bits of the UUID format are then set, and the value is rendered in hexadecimal form.

The Python uuid3 function takes a namespace UUID and a name string as input. For an identical namespace and name, the output is also identical. Hence this is useful where an entity can be identified by its name within a namespace.

>>> uuid.uuid3(uuid.NAMESPACE_DNS, 'python')
UUID('c9f8b609-b81e-3c95-8188-914324e741c8')
>>> uuid.uuid3(uuid.NAMESPACE_DNS, 'python')
UUID('c9f8b609-b81e-3c95-8188-914324e741c8')
>>> x = uuid.uuid3(uuid.NAMESPACE_DNS, 'python')
>>> y = uuid.uuid3(uuid.NAMESPACE_DNS, 'python')
>>> x  == y
True
>>> mac = uuid.uuid1()
>>> x = uuid.uuid3(mac, 'python')
>>> x  == y
False
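
The steps described for version 3 can be sketched by hand in Python 3 and compared against the standard library. The function name manual_uuid3 is hypothetical, and encoding the name as UTF-8 is an assumption that matches what Python 3's own uuid3 does with string names.

```python
import hashlib
import uuid

def manual_uuid3(namespace, name):
    """Build a version 3 UUID by hand from a namespace UUID and a name."""
    # Hash the 16 raw bytes of the namespace followed by the name.
    digest = hashlib.md5(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest)            # MD5 already yields 16 bytes
    raw[6] = (raw[6] & 0x0F) | 0x30    # version bits: 0011 (version 3)
    raw[8] = (raw[8] & 0x3F) | 0x80    # variant bits: 10 (RFC 4122)
    return uuid.UUID(bytes=bytes(raw))

mine = manual_uuid3(uuid.NAMESPACE_DNS, "python")
print(mine)
print(mine == uuid.uuid3(uuid.NAMESPACE_DNS, "python"))
```

Only six bits of the hash are overwritten, which is why the same namespace and name always map to the same UUID.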

Version 5 (SHA-1 hash)

Version 5 UUIDs are generated using SHA-1 hashing; otherwise they are similar to version 3. According to RFC 4122, version 5 is preferred over version 3 for name-based UUIDs. The output of the SHA-1 hash is 160 bits, which is truncated to 128 bits for use as a UUID. Version 5 UUIDs are of the form xxxxxxxx-xxxx-5xxx-xxxx-xxxxxxxxxxxx where x represents a hexadecimal digit.

>>> y = uuid.uuid5(uuid.NAMESPACE_DNS, 'python')
>>> x = uuid.uuid5(uuid.NAMESPACE_DNS, 'python')
>>> x  == y
True
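
The truncation step can be illustrated with a short Python 3 sketch. The name manual_uuid5 is invented for this example, and here the UUID constructor's version argument is used to set the version and variant bits rather than doing it by hand.

```python
import hashlib
import uuid

def manual_uuid5(namespace, name):
    """Build a version 5 UUID from a namespace UUID and a name."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    assert len(digest) == 20  # SHA-1 produces 160 bits
    # Keep the first 16 bytes (128 bits); version=5 makes the
    # constructor overwrite the version and variant bits.
    return uuid.UUID(bytes=digest[:16], version=5)

mine = manual_uuid5(uuid.NAMESPACE_DNS, "python")
print(mine)
print(mine == uuid.uuid5(uuid.NAMESPACE_DNS, "python"))
```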

Version 4 (random)

Version 4 UUIDs are generated from random numbers. This algorithm too sets the version number as well as the two variant bits; all other bits are obtained from a random number generator. A version 4 UUID is of the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx, where x is any hexadecimal digit and y is one of 8, 9, A, or B.

>>> r1 = uuid.uuid4()
>>> r1
UUID('c4017364-4093-4622-bc0e-1defe804636e')
>>> r2 = uuid.uuid4()
>>> r2
UUID('6a4540e4-992d-4a45-92b7-c5d37bf599a9')
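
The shape described above can be checked with a regular expression. This Python 3 sketch also builds a version 4 UUID from 16 random bytes; V4_PATTERN and manual_uuid4 are names made up for the example, and os.urandom is simply one convenient source of random bytes.

```python
import os
import re
import uuid

# xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx with y in 8, 9, a, b (lowercase hex).
V4_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

def manual_uuid4():
    """Build a version 4 UUID from 16 random bytes."""
    # version=4 makes the constructor set the version and variant bits.
    return uuid.UUID(bytes=os.urandom(16), version=4)

for candidate in (uuid.uuid4(), manual_uuid4()):
    print(candidate, bool(V4_PATTERN.match(str(candidate))))
```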

Reconstructing UUID from different formats

Let us take the different formats of the version 1 UUID we generated above and see how we can reconstruct the UUID object from each of them.

>>> uuid.UUID('c4dfb150b1a911dea16c00c09fed538e') #From hex representation
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
>>> uuid.UUID(bytes='\xc4\xdf\xb1P\xb1\xa9\x11\xde\xa1l\x00\xc0\x9f\xedS\x8e') #From bytes format
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
>>> uuid.UUID(bytes_le='P\xb1\xdf\xc4\xa9\xb1\xde\x11\xa1l\x00\xc0\x9f\xedS\x8e') #From bytes Little Endian Format
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
>>> uuid.UUID(int=261690165753032864726052546050565952398L) #Integer representation
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
>>> uuid.UUID('{c4dfb150b1a911dea16c00c09fed538e}') #Hex string wrapped in curly braces
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
>>> uuid.UUID('urn:uuid:c4dfb150-b1a9-11de-a16c-00c09fed538e')#Uniform Resource Name format
UUID('c4dfb150-b1a9-11de-a16c-00c09fed538e')
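
The session above is from Python 2. In Python 3 the bytes and bytes_le arguments must be bytes objects and integer literals no longer take the L suffix, so a round trip is easiest to write using the attributes of an existing object. A minimal sketch (the fields keyword is not shown above but is also accepted by the constructor):

```python
import uuid

original = uuid.UUID("c4dfb150-b1a9-11de-a16c-00c09fed538e")

# Rebuild the same UUID from each of its representations.
rebuilt = {
    "hex": uuid.UUID(original.hex),
    "bytes": uuid.UUID(bytes=original.bytes),
    "bytes_le": uuid.UUID(bytes_le=original.bytes_le),
    "int": uuid.UUID(int=original.int),
    "fields": uuid.UUID(fields=original.fields),
    "urn": uuid.UUID(original.urn),
}
for name, value in rebuilt.items():
    print(name, value == original)
```

Every line should report True, since all of these forms carry the same 128 bits.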