David Farrar
2008-04-11 17:18:42 UTC
Hello,
I've been looking recently at the cdb format and come across a
discrepancy between the actual output of cdbmake and the output I
expected. Hopefully somebody will be able to tell me where I have gone
wrong.
If I create a small cdb file with cdbmake, using the records
+13,3:thomas.mangin->foo
+11,3:markcowgill->bar
so that the hashes of both keys are stored in the same hash table, I
notice that this table is 4 slots long.
Parsing the cdb file, I can see that the four slots are
2417049413, pointer to record with key markcowgill,
empty slot
empty slot
8948549, pointer to record with key thomas.mangin
where the value preceeding each pointer is a stored hash of the
respective key.
Taken from http://cr.yp.to/cdb/cdb.txt,
"""
A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.
"""
Reading this paragraph, I expected to be able able to find the entry for
'markcowgill' by seeking to the starting position of the hash table +
(2*4) * n bytes, where n = (hash / 256) % 4.
In this particular case, (hash / 256) % 4 = (2417049413 / 256) % 4 = 3,
meaning that I skip past the value I'm looking for (jumping straight to
'thomas.mangin') and falsely report that the key does not exist.
I can still retrieve a key by hardcoding the slot position to 0 or even
reading each record sequentially but this doesn't seem to be the best
way to read quickly and I would appreciate it if somebody could put me
ok the right track.
In case it helps, I have written a rough and ready python cdb
implementation to check the data and rewrite it in a format that does
what I expect. I'm not sure that the write function is correct but I ran
it on a fairly large dataset and it gave me the results I wanted.
Setting self.broken on line 31 to False forces slot number = 0 when
performing a hash lookup and is the only way I can parse data generated
by cdbmake
David
I've been looking recently at the cdb format and come across a
discrepancy between the actual output of cdbmake and the output I
expected. Hopefully somebody will be able to tell me where I have gone
wrong.
If I create a small cdb file with cdbmake, using the records
+13,3:thomas.mangin->foo
+11,3:markcowgill->bar
so that the hashes of both keys are stored in the same hash table, I
notice that this table is 4 slots long.
Parsing the cdb file, I can see that the four slots are
2417049413, pointer to record with key markcowgill,
empty slot
empty slot
8948549, pointer to record with key thomas.mangin
where the value preceeding each pointer is a stored hash of the
respective key.
Taken from http://cr.yp.to/cdb/cdb.txt,
"""
A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.
"""
Reading this paragraph, I expected to be able able to find the entry for
'markcowgill' by seeking to the starting position of the hash table +
(2*4) * n bytes, where n = (hash / 256) % 4.
In this particular case, (hash / 256) % 4 = (2417049413 / 256) % 4 = 3,
meaning that I skip past the value I'm looking for (jumping straight to
'thomas.mangin') and falsely report that the key does not exist.
I can still retrieve a key by hardcoding the slot position to 0 or even
reading each record sequentially but this doesn't seem to be the best
way to read quickly and I would appreciate it if somebody could put me
ok the right track.
In case it helps, I have written a rough and ready python cdb
implementation to check the data and rewrite it in a format that does
what I expect. I'm not sure that the write function is correct but I ran
it on a fairly large dataset and it gave me the results I wanted.
Setting self.broken on line 31 to False forces slot number = 0 when
performing a hash lookup and is the only way I can parse data generated
by cdbmake
David