voice of humanity: technical progress - Chandler UUID
17 May 2004 @ 19:41, by Roger Eaton

InterMix will use Chandler for a front end. At least that is the plan at this point. So as a first order of business, I decided to adapt the Chandler UUID for InterMix, thinking that having the same format would make the meshing of the two programs easier. The UUID is a Universally Unique IDentifier and is used as a key for all Chandler repository items.

So first I tried to compile the Chandler uuid.c module. But after much research, and I am still not quite sure, it appears that building C extensions to Python for Windows requires one to purchase Microsoft Visual C++, which comes in the .NET package and runs $700 or so. Not what I wanted to hear. There does seem to be a possibility of using a free Delphi C compiler, but then one needs to go through some sort of gyration involving renaming this or that -- also not what I wanted to hear.

Next up, then, was to try to implement the uuid.c module in Python. This got me down to the nitty-gritty of understanding the C code, and what I found was surprising. The Chandler format is based on a specification written in 1998 by Paul J. Leach of Microsoft and Rich Salz of CertCo as a possible IETF standard. That standard was never adopted by the IETF, but it appears that a close relative was adopted as ISO-11578.

The Leach / Salz spec calls for a 16-byte (128-bit) uuid, and the specific Chandler format is divided into three parts: eight bytes of timestamp plus version number, two bytes of "clock sequence", and six bytes of ethernet card "mac" address, or random bits if the computer has no ethernet card. Each ethernet card has a unique address built into it, and the combination of timestamp and unique mac address should make for a truly globally unique ID. It turns out to be not so simple.
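
To make the layout concrete, here is a rough sketch in Python of how those three parts pack into the 16 bytes. The field names follow the Leach / Salz draft; the values are made up for illustration, and this is not Chandler's actual code:

    import struct

    time_low = 0x12345678                 # low 32 bits of the 60-bit timestamp
    time_mid = 0x9ABC                     # middle 16 bits of the timestamp
    time_hi_and_version = 0x1DEF          # top 12 bits of the timestamp, with the
                                          # 4-bit version number in the high nibble
    clock_seq = 0x8123                    # 14-bit clock sequence plus 2 variant bits
    node = b'\x00\x11\x22\x33\x44\x55'    # 6-byte mac address (or random bits)

    uuid_bytes = struct.pack('>IHHH6s', time_low, time_mid,
                             time_hi_and_version, clock_seq, node)
    assert len(uuid_bytes) == 16          # 8 + 2 + 6 bytes, as described above
    print(uuid_bytes.hex())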

The system calls to get the mac address were too technical for me to try to implement in Python, newbie that I am, so I looked around and found another implementation of the uuid concept, and also some pointers that called into question the wisdom of using the mac address. Here is a passage from the WebDAV Distributed Authoring Protocol: "This is a security concern, due to the potential to trace hardware by following the IEEE 802 address. The UUID/GUID I-D by Paul Leach contains an alternate mechanism for generating the "node" field using pseudo-random numbers which doesn't have this security concern." And see also this for slightly more detailed commentary along the same lines.

So that finished off the mac address concept for me -- but note that Chandler is still using it! The other python uuid generator at [link] was nice and simple -- use the current IP address if the computer is on the internet, otherwise random bits. At first blush it seems that time plus internet address should be unique, but it finally dawned on me that IP addresses are often dynamically allocated, so when person A shuts down an internet connection, person B, with a machine whose clock happens to be set to an earlier time, might get the same IP address, and you could get a duplicate uuid. Thinking along those same lines, I began to wonder, too, about the mac address, which is supposed to be unique. What if some company in China is reusing addresses for local ethernet cards? So it looks like random bits is the way to go for at least the last 6 bytes of the uuid.

Now, what about the timestamp? Here's another surprise. The Chandler implementation overlays the high order bits of the date/time with a hardcoded uuid version number. So the time and date cannot be recovered from the Chandler uuid, and the resulting time portion of the uuid is cyclical over I dunno what, maybe some years, instead of occurring just once until the timestamp runs out in the year 3000 and something. I guess this is just a bug that could be fixed, but it means, as things are, this part of the Chandler format is also not usable.

The third part of the uuid is the clock sequence. The purpose of the clock sequence is to make the uuid unique for your own system even if the timestamp is a duplicate. Here Chandler saves the previously used timestamp, and if the new timestamp is *equal to* the old, which it definitely can be because some machines give time in whole seconds, then it adds one to the clock sequence. Error! One should add 1 if the new timestamp is *less than or equal to* the old timestamp, not just if it is equal. Why? Because if someone sets the clock back on their machine, it might, just might, try to issue another uuid at a duplicate time later down the road without realizing it, because an intervening issue has replaced the saved old timestamp. So that was strike three in my book, and I decided to go with the simplest possible uuid - 128 random bits.
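
To pin down the rule I am arguing for, here is a sketch in Python of clock sequence handling that bumps on "less than or equal". This is my own illustration of the idea, not the actual Chandler uuid.c logic:

    import random
    import time

    _last_timestamp = None
    _clock_seq = random.getrandbits(14)       # 14-bit clock sequence, random start

    def next_timestamp_and_clock_seq():
        """Bump the clock sequence whenever time has not moved strictly forward."""
        global _last_timestamp, _clock_seq
        ts = int(time.time())                 # whole seconds, as on some machines
        if _last_timestamp is not None and ts <= _last_timestamp:
            _clock_seq = (_clock_seq + 1) & 0x3FFF   # wrap within 14 bits
        _last_timestamp = ts
        return ts, _clock_seq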

The Chandler uuid.c module can currently be found in the full source download at [link] -- look for chandler/Chandler/repository/util/ext/uuid.c.

Now that I have decided to use a simple design, well of course, it strikes me that simpler is better! And really it is true. There is much to be said for a simple design, as long as it works. 128 random bits is plenty, for the next couple decades, anyway, to ensure uniqueness of the InterMix uuid, and the simple solution has the great advantage of allowing other programmers to easily create InterMix style uuids in their own code, and that is important. I think the Chandler people should reconsider their uuid design, so no one else gets hung up on this. It would be great if they adopted 128 random bits as their uuid design!
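
For what it's worth, here is roughly how small the whole thing gets. Just a sketch: os.urandom draws on the operating system's entropy pool and is a newer convenience than the Python of the time, and the function name is mine, not anything from InterMix:

    import binascii
    import os

    def simple_uuid():
        """Return 128 random bits as a 32-character hex string."""
        return binascii.hexlify(os.urandom(16)).decode("ascii")

    print(simple_uuid())    # 16 random bytes, 32 hex digits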

In any case, please do not interpret this article as being AGAINST Chandler. Not at all. I think Chandler is going to go and go big time and that this is a tiny problem, easily cleaned up, in a small corner of an early version of what OSAF is out there accomplishing.




11 comments

18 May 2004 @ 11:42 by Roger Eaton @208.187.17.232 : error correction / random uuid hard
A couple of good responses on the Chandler development list to this article -- mainly, a correction is in order -- the "bug" I referred to in the Chandler uuid.c code is not a bug at all. It was my mistake to think there was a problem.

However, the good news is that the Chandler people do think a random uuid would be best; they are just not sure how to make one, because it is hard to get a "cryptographically secure" 128 random bits. Certainly knowledgeable people writing on the internet say randomness is a hard problem -- see {link:http://www.merrymeet.com/jon/usingrandom.html|Using and Creating Cryptographic-Quality Random Numbers}.

All we are trying to do, though, is to cover the 128 bits evenly so as to produce as few duplicates as possible. It doesn't matter if there is a systematic aspect to the process that might make it possible for a code-breaker to deduce the next uuid from the previous one. The Python random number generator does not recycle until a truly huge number of draws: 2**19937-1. But it produces 53-bit floats, so I'm guessing the seed cannot effectively cover more than 53 bits. That would seem to greatly reduce the coverage of 128 bits. Might it not be possible to use three such generators and XOR the resulting 128-bit outputs together? Too much guessing -- I am an amateur! Nothing to be done, though, but to go ahead as best I can and then come back and fix it later if and when there is money to hire a professional.
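
To show what I mean, here is a sketch of the several-generators idea, combined with XOR so that no bit gets biased toward 1. The seed sources and names are my own guesses for illustration, and getrandbits and os.urandom are newer calls than the Python 2.3 quoted further down:

    import os
    import random
    import time

    def mixed_random_uuid():
        """128 bits built by XORing three independently seeded generators."""
        seeds = (os.urandom(16), time.time(), id(object()))   # three unrelated sources
        bits = 0
        for seed in seeds:
            gen = random.Random(seed)         # a separate Mersenne Twister instance
            bits ^= gen.getrandbits(128)      # XOR keeps each bit evenly 0 or 1
        return "%032x" % bits

    print(mixed_random_uuid())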

Also, here is info about C compilers for Python -- but you have to compile Python itself with them -- so it is not very practical:

- compile Python with a free compiler such as cygwin's gcc or mingw
- compile Python with the $100.00 version of Microsoft Visual C++, same as the $700.00 version but without full optimization
- compile Python with the $0.00 version of Microsoft Visual C++, same as the $700.00 version but command line only: http://msdn.microsoft.com/visualc/vctoolkit2003/

info from Andi Vajda on the chandler dev list  



22 May 2004 @ 19:33 by Roger Eaton @208.187.17.227 : what are the odds?
The attraction of time + clock_seq + ethernet_node_address is that it will not produce any duplicates at all if done correctly (and if there truly is no Chinese factory putting out ethernet cards with duplicate addresses), but it has the drawback of allowing items to be tracked by source and time. If items are to be trackable, it should be done explicitly.

128 bits covers an immense range: there are 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values. It gets used up surprisingly fast, though, if the aim is never to have any duplicates ever. If you have 20 people at a party and check whether any two have the same birthday, the first person has to be checked against 19 others, the second against 18, etc, so there are 19+18+...+2+1 = 190 pairs. Each pair has one chance in 365 of matching, so by this rough count there is close to an even chance that two people in a group of 20 share a birthday (the exact figure is about 41%, since the pair probabilities overlap, but the ballpark is right).

The formula for the number of pairs among n items, where n is an even number, is ((n + 1) * ((n/2) - 1)) + 1, which works out to exactly n*(n-1)/2.

Throw away the +1 and that is approximately (n**2 - n - 2)/2.

How big does n have to be to get halfway to 2**128? Halfway to 2**128 is 2**127.

So solve (n**2 - n - 2)/2 = 2**127, or equivalently n**2 - n - 2 = 2**128.

Throw out the -2 and even the -n, and n is approximately the square root of 2**128 = 2**64 = 18,446,744,073,709,551,616.

The ballpark result then is that when we hit about 18 quintillion uuids issued, if they are spread out over the 128 bits randomly, we have half a chance of getting one duplicate.

That's actually quite encouraging, I think. With 1 billion users each creating 1 billion uuids a year, it would take about 18 years before we are likely to get our first duplicate with good randomness. That does give an idea of the scope that 128 bits allows.
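
For anyone who wants to check the ballpark, the standard birthday-problem approximation, p roughly 1 - exp(-n*(n-1)/(2*N)) with N = 2**128 possible values, gives about the same answer (the exact 50% point is near 1.18 times 2**64). A few lines of Python:

    import math

    N = 2.0 ** 128
    for n in (2.0 ** 63, 2.0 ** 64, 2.0 ** 65):
        p = 1.0 - math.exp(-n * (n - 1.0) / (2.0 * N))
        print("%.0f uuids issued -> chance of a duplicate: %.2f" % (n, p))

    # Roughly: 2**63 -> 0.12, 2**64 -> 0.39, 2**65 -> 0.86,
    # so "about 2**64, give or take" is the right order of magnitude.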

So the only question in my mind is whether pseudo random can be made to work.

Here is the description of the Python random number generator from the Python 2.3 Library Documentation:

"Almost all module functions depend on the basic function random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1. The underlying implementation in C is both fast and threadsafe. The Mersenne Twister is one of the most extensively tested random number generators in existence. However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes."

That sounds good. I think I can get good random 128-bit uuids by using the Python SHA-1 hash routine on item content to get seeds for the random number generator, so I am going to go ahead along these lines. At some point, there will have to be an expansion of the uuid. 18 billion billion is not enough!
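
As a sketch of what I have in mind -- hashlib and getrandbits are newer conveniences than the Python 2.3 quoted above, and mixing the clock into the seed is my own addition so that two identical items do not collide:

    import hashlib
    import random
    import time

    def content_seeded_uuid(item_content):
        """128-bit uuid seeded from the item's own content plus the clock."""
        seed_material = item_content.encode("utf-8") + repr(time.time()).encode()
        digest = hashlib.sha1(seed_material).digest()   # 160-bit digest as the seed
        gen = random.Random(digest)                     # an independent generator
        return "%032x" % gen.getrandbits(128)

    print(content_seeded_uuid("hello, InterMix"))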



23 May 2004 @ 15:35 by ming : Random numbers
The problem with any use of random numbers for unique IDs is that, however unlikely, there's nothing in the statistics that prevents you from getting duplicates. All we can do is make them statistically unlikely, but I'm not sure that means much. After you've generated one random number, and you generate the next one, that same number is just as likely as any other number. Us silly humans would consider that some kind of mistake, of course. Anyway, I use stuff like MD5(microtime()) for IDs to use in cookies. But I'm not sure I'd dare use it for more important IDs.  
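
A rough Python equivalent of that MD5(microtime()) recipe, just to make it concrete -- microtime() is PHP's; this is an illustration, not the actual cookie code:

    import hashlib
    import time

    # MD5 of the current high-resolution clock, as a hex string cookie id.
    cookie_id = hashlib.md5(repr(time.time()).encode("utf-8")).hexdigest()
    print(cookie_id)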


23 May 2004 @ 15:38 by ming : C
If you use Cygwin, I suppose you could use a Python already made for that, which I assume would be compiled with gcc. I'm not sure if it makes sense for you that it all lives in Cygwin, though.  


14 Jun 2004 @ 11:26 by ov : unique id
First time I've seen this article, wonder how it fell through the cracks.

My experience in this is way outdated (1989), so take this with a grain of salt. Yes, it is essential that data items have a unique id. I also think that it is necessary to have a fine granularity, perhaps along the lines of being able to point to a single line item in a list, where everything is a list, a list of lists, just like the unix file directory and the lisp infrastructure.

A model that I used for the unique ID issue was using the virtual memory address as the unique ID. The virtual address was constructed from my own local virtual memory tables, and was a composite built through an intermediate index table which also contained a few bytes of boolean flags in a fixed size record, so that a lot of sorting and selection could be done directly using the index file without having to access the data itself. The composite consisted of the position within the page, plus the file designator, plus the directory tree, and above that was an address to "external systems" which would have to be accessed through a shared database somewhere on the internet. (heh, this was 1989 and I hadn't heard yet about Tim Berners-Lee's work)

The thing was that even though each node had a unique ID, the ID was not stored anywhere in memory, but was implicitly stored in the location of the item within memory. There was a bit stored in memory, well actually a couple of bits, using the same technique as in pdp-11 assembly language, where a byte was either an instruction or an address, and the instructions weren't numbers but an array of boolean bits; so each entry in my index file started with a bit to indicate whether this was a local entry or an extended entry. So okay, you had to construct the id based on location, but the time that this took was more than compensated for by the fact that you didn't have to search for anything, you just went there directly.

As far as the security issues, okay, I think they should be handled on a separate level. Security based on not knowing the location depends, to me, on secrecy, which means the only people that know about the locations are those that hack, and those are the only ones you really have to worry about. In my opinion security should be handled by encryption and not by keeping the data location a secret.

Maybe this fires some neurons, maybe not, but like I said it is old stuff and possibly not relevant.  



6 Jul 2004 @ 23:17 by Roger Eaton @208.187.17.232 : a fine granularity
Apologies, Ov. I did not mean to ignore your comment at all. I've just been swamped by work and life in general. But now I am back.

I think I'll have to pass on the idea of the offset in memory being the ID. That would preclude using a database, such as the Berkeley DB that I have decided on.

However, the issue of granularity is worth considering. Your system let you index right into the item, which seems right conceptually, though was it really useful? I suppose so. I do think there needs to be a way of referring to just a section of an item. The longer the item, the greater the need.

Perhaps sections can be a special kind of item branched off the original whole item.

Security seems impossible to guarantee, and open source makes it worse. Anyone can see exactly what your security measures are. I am thinking that part of connecting two hubs should be that the two owners exchange passwords via email or better yet, voice. I don't mind having a chokepoint just at the point where the initial connection is made, since it will make it harder to build a farm of dummy hubs whose real purpose is to distort the ratings.  



7 Jul 2004 @ 00:31 by ov : Chandler UUID
Heh, not to worry Roger, things get done in their own time. Don't blame you for not using the offset, that involves a whole different strategy, and it doesn't really fit with using markup language or using databases (it does however eliminate a few dozen layers of browser and database crap :-) ). Adobe Acrobat seems to still get lots of use and I absolutely detest that interface, kind of like looking at porno through a keyhole just because that's the way you first come across it -- I just made that up, eh.

Impossible to say whether it was useful, Roger, because it never did get beyond a few data models and a prototype. Couldn't convince anybody that it was a significant idea, I mean the whole distributed hypertext thing, and then when the markup language came out the whole idea took off in a different direction. There is a huge step between a small model and a commercial product. I read a hindsight article a few years back that said the original html specs probably could have been a lot better, but then it wouldn't have been as popular or evolved as far as it did. Shortly after html there was a hyperwave project by Hoeflinn (sp?), same group of people behind the Principia Cybernetica, that wanted to add some structure, but it doesn't seem to have gotten far, and I think a lot of that has to do with the anarchist crowd of end users that has this obstinate defiance of structure purely out of principle.

At the very minimum the article needs a unique ID, and each comment in the article needs an ID. The problem with the granularity as I see it is in the browsers themselves. Sure you can do lots of things with dynamic html, but it involves inserting little anchors all over the place, just in case you might use them, when it should be something that is embedded in the document tree of the browser rather than kludged onto the source document. Most of what I have to say on this is irrelevant because it has little to do with conceptualizing anymore and almost everything to do with how to kludge other people's software.

I still think it would be worthwhile to have a client side and a server side component even with the hub model. The hubs can send out long documents, and the client side can enfoliate that long document down to an effective length of two or three screens. The scripts wouldn't even need to be included on each html page since they could be accessed as include files; I forget the details on this but I remember seeing it in some tutorials a few years ago. (Then again, now the thing is style sheets and xhtml, and double the effort for the M$ way and the way everybody else wants it.)

Something that might be worth looking at though is a browser plug-in that processes batch packages sent from the hubs. People use them for trivial little widgets, so I don't think it would be asking too much to have them install it if they want to participate in the conferencing.  



12 Dec 2008 @ 03:23 by tibet tour @222.210.174.104 : tibet
Want to go to tibet tour? You can rely on us! We are expert in Tibet Tours.  


30 Dec 2008 @ 23:26 by 彼女 @61.210.115.201 : thanks
thanks  


3 Jul 2009 @ 20:06 by Carsten Klein @91.54.97.90 : Globally unique is near to impossible
Globally unique is near impossible to do, since there are no real random number generators available, and even if you are doing random number generation on the basis of, for example, white noise generators embedded in your system, it might turn out that a different system will actually produce the same white noise as your own white noise based random number generator.

As such, all uuids are always to be considered locally unique, as is stated in the Java APIs.

Pointer:
Java implements a UUID generator that generates version 3 and version 4 uuids.

Since the sources are freely available, you should have a look into those.
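
For comparison, the Python standard library has the same thing: a uuid module (since Python 2.5) with version 3 (name-based) and version 4 (random) generators. A quick illustration, with a made-up URL:

    import uuid

    # Version 3: name-based (MD5 of a namespace plus a name) -- the URL is made up.
    name_based = uuid.uuid3(uuid.NAMESPACE_URL, "http://example.org/intermix/item/42")

    # Version 4: 122 random bits plus fixed version and variant bits.
    random_based = uuid.uuid4()

    print(name_based)
    print(random_based)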

And, as a personal perception of mine, never trust anything that ejects from so-called Microsoft research labs.  



3 Jul 2009 @ 20:08 by Carsten Klein @91.54.97.90 : Besides that
you should definitely fix your CAPTCHA generator, see the spam above.  

