UID Numbering Scheme
Some thoughts on A UID Numbering Scheme published by Unique Identification Authority of India (UIDAI).
Process for De-duplication (page 5):
Since biometric information contain no ordering and hence cannot be indexed like text based information, when a resident applies for a UID with his/her fingerprints, iris and photo of face, these biometrics have to be compared against the entire UID database (existing residents with UIDs) to ensure that this new applicant is indeed unique and has not already been allotted a UID (even under a different name, address etc). This 1:N biometric comparison (N=size of the UID database) is the most compute intensive operation of the UID server systeem.
It sounds as if the search operation would take O(N) time. Why can't the data be treated as binary numbers that can be ordered and indexed? It'll improve the search time to at least O(log N) time and possibly even further to O(1) time if a hashtable is used. The performance improvement is going to be huge since N is as large as 1.2 billion.
Memorization of UID (page 6):
This section is about how long the string length should be. In short, the string has to be as short as possible but that meets density requirement and does not include alphabet characters, just numbers. It is important to keep the UID simple and small to help residents to remember their number.
Firstly the use of the hindu-arabic numeral system(0,1,2,3,4,5,6,7,8,9) is suggested since these numerals are recognized/used by the largest subset of people in the country. Secondly we suggest the use of 12 digits (11 + 1 check sum) since 11 digits gives us a 100 billion number space which in turn can provide a low density of used numbers.
I don't think a 12 digit UID is easy to memorize. However, I agree that this length is necessary to ensure unique UIDs for a large population like India's. The population of India is about 1.2 billion now. Now issuing 1.2 billion UIDs from a maximum possible space of 100 billion numbers implies that any UID chosen at random has a 1.2% probability of being a valid UID issued to someone. In other words, 1 out of every 83 possible UID values is a valid UID issued to someone. Therefore, these UIDs should not be treated as secret. It should be assumed that an adversary can easily guess valid UIDs issued to actual residents.
Note that this problem gets worse as the population of India grows. Currently, it is growing at the rate of about 1.4% every year. The growth rate is expected to slow down in future. For now, I will assume that the growth rate is never going to exceed 1.4% to get a conservative estimate. Then by the year 2050, the population of India would not exceed 2.1 billion. Then the probability that a randomly generated UID is an actual UID issued to a resident would increase to 2.1%.
Therefore, any critical operation performed using a UID must perform an independent verification, such as two-factor authentication, SMS-based verification, etc. to ensure that the operation is approved by the actual resident the UID is issued to.
UID static PIN and dynamic PIN (page 7):
In order to authenticate (ascertain it is who s/he claims to be) a resident needs to provide his/her UID number as well as say a biometric marker – such as a fingerprint.
Using biometrics while issuing UIDs may be fine. But using biometrics for other important transactions might put the resident at risk. For example, see this BBC news story: Malaysia Car Thieves Steal Finger.
Principles and Requirements (page 11):
Number Generation: The numbers are generated in a random, non-repeating sequence. There are several approaches to doing this in the computer science literature. The algorithm and any "seed" chosen to generate IDs should not be made public and should be considered a national secret.
This violates Shannon's maxim, "The enemy knows the system." The security of the system must rely on the secrecy of the seed only. It must not depend on the secrecy of the algorithm. Further, as explained in point 2 above, an adversory can randomly generate 12 digit number with a high likelihood of it being an actual UID to a resident.
The Checksum (page 12):
There is one scheme that meets our requirements: the Verhoeff Scheme. This scheme is relatively complex, and in the days before ubiquitous computing, there was a tendency to avoid it in favor of simpler schemes. In this day and age however, and at the scale of the UID, precision must be the goal. The Verhoeff scheme catches all single errors and all adjacent transpositions. It also catches >95% of twin errors and >94% of jump transpositions.
For those who are curious about what this scheme is, more information can be found at http://www.cs.utsa.edu/~wagner/laws/verhoeff.html and http://en.wikipedia.org/wiki/Verhoeff_algorithm.
Update on 31 May 2010: After an email conversation with Nandan Nilekani about the points I have documented in this blog post, he requested that I send these points in the form a document to him so that he can have it reviewed by his team. I have done so today.