Databases: Cached Functions and a First Taste of SQL


First, some quick applications of what we did last time (these would have made good exercises): using a database to make a cached function.

A dbm-based Cached Function Decorator

Let's create a "@db_cached_function" decorator, which has the advantage of being really easy to use while providing the key functionality from last time that you might care about.

{{{id=4|
class db_cached_function:
    def __init__(self, file):
        self.file = os.path.abspath(file)
    def __call__(self, f):
        A = sage.misc.function_mangling.ArgumentFixer(f)
        import anydbm, cPickle
        def g(*args, **kwds):
            k = A.fix_to_named(*args, **kwds)
            s = cPickle.dumps(k, 2)
            db = anydbm.open(self.file, 'c')
            if db.has_key(s):
                ans = cPickle.loads(db[s])
                db.close()
                return ans
            # We close, compute, then reopen, since computing
            # f(...) will typically take a while, and there is
            # no reason to "lock" the database during that time.
            # (Of course, we aren't really locking it.  This just
            # decreases the chances of corruption slightly.)
            db.close()
            x = f(*args, **kwds)
            db = anydbm.open(self.file, 'c')
            db[s] = cPickle.dumps(x, 2)
            db.close()
            return x
        def keys():
            db = anydbm.open(self.file, 'c')
            # program around bug
            db.has_key('')
            keys = db.keys()
            keys = [cPickle.loads(k) for k in keys]
            db.close()
            return keys
        g.keys = keys
        return g
///
}}}

This is really easy to use.

{{{id=3|
os.system('rm -f /tmp/dbcache1*')

@db_cached_function('/tmp/dbcache1')
def fac(n):
    return factor(n)
///
}}}
{{{id=1|
for p in primes(100,220):
    print p, fac(2^p-1)
///
101 7432339208719 * 341117531003194129
103 2550183799 * 3976656429941438590393
107 162259276829213363391578010288127
109 745988807 * 870035986098720987332873
113 3391 * 23279 * 65993 * 1868569 * 1066818132868207
127 170141183460469231731687303715884105727
131 263 * 10350794431055162386718619237468234569
137 32032215596496435569 * 5439042183600204290159
139 5625767248687 * 123876132205208335762278423601
149 86656268566282183151 * 8235109336690846723986161
151 18121 * 55871 * 165799 * 2332951 * 7289088383388253664437433
157 852133201 * 60726444167 * 1654058017289 * 2134387368610417
163 150287 * 704161 * 110211473 * 27669118297 * 36230454570129675721
167 2349023 * 79638304766856507377778616296087448490695649
173 730753 * 1505447 * 70084436712553223 * 155285743288572277679887
179 359 * 1433 * 1489459109360039866456940197095433721664951999121
181 43441 * 1164193 * 7648337 * 7923871097285295625344647665764672671
191 383 * 7068569257 * 39940132241 * 332584516519201 * 87274497124602996457
193 13821503 * 61654440233248340616559 * 14732265321145317331353282383
197 7487 * 26828803997912886929710867041891989490486893845712448833
199 164504919713 * 4884164093883941177660049098586324302977543600799
211 15193 * 60272956433838849161 * 3593875704495823757388199894268773153439
}}}
{{{id=16|
fac.keys()
///
[((), (('n', 2535301200456458802993406410751),)),
 ((), (('n', 649037107316853453566312041152511),)),
 ((), (('n', 10384593717069655257060992658440191),)),
 ((), (('n', 713623846352979940529142984724747568191373311),)),
 ((), (('n', 11972621413014756705924586149611790497021399392059391),)),
 ((), (('n', 3064991081731777716716694054300618367237478244367204351),)),
 ((), (('n', 12554203470773361527671578846415332832204710888928069025791),)),
 ((), (('n', 162259276829213363391578010288127),)),
 ((), (('n', 170141183460469231731687303715884105727),)),
 ((), (('n', 174224571863520493293247799005065324265471),)),
 ((), (('n', 182687704666362864775460604089535377456991567871),)),
 ((), (('n', 187072209578355573530071658587684226515959365500927),)),
 ((), (('n', 200867255532373784442745261542645325315275374222849104412671),)),
 ((), (('n', 10141204801825835211973625643007),)),
 ((), (('n', 2722258935367507707706996859454145691647),)),
 ((), (('n', 696898287454081973172991196020261297061887),)),
 ((), (('n', 2854495385411919762116571938898990272765493247),)),
 ((), (('n', 11692013098647223345629478661730264157247460343807),)),
 ((), (('n', 766247770432944429179173513575154591809369561091801087),)),
 ((), (('n', 3138550867693340381917894711603833208051177722232017256447),)),
 ((), (('n', 803469022129495137770981046170581301261101496891396417650687),)),
 ((), (('n', 3291009114642412084309938365114701009965471731267159726697218047),))]
}}}
{{{id=5|
for p in primes(100,220):
    print p, fac(2^p-1)
///
101 7432339208719 * 341117531003194129
103 2550183799 * 3976656429941438590393
107 162259276829213363391578010288127
109 745988807 * 870035986098720987332873
113 3391 * 23279 * 65993 * 1868569 * 1066818132868207
127 170141183460469231731687303715884105727
131 263 * 10350794431055162386718619237468234569
137 32032215596496435569 * 5439042183600204290159
139 5625767248687 * 123876132205208335762278423601
149 86656268566282183151 * 8235109336690846723986161
151 18121 * 55871 * 165799 * 2332951 * 7289088383388253664437433
157 852133201 * 60726444167 * 1654058017289 * 2134387368610417
163 150287 * 704161 * 110211473 * 27669118297 * 36230454570129675721
167 2349023 * 79638304766856507377778616296087448490695649
173 730753 * 1505447 * 70084436712553223 * 155285743288572277679887
179 359 * 1433 * 1489459109360039866456940197095433721664951999121
181 43441 * 1164193 * 7648337 * 7923871097285295625344647665764672671
191 383 * 7068569257 * 39940132241 * 332584516519201 * 87274497124602996457
193 13821503 * 61654440233248340616559 * 14732265321145317331353282383
197 7487 * 26828803997912886929710867041891989490486893845712448833
199 164504919713 * 4884164093883941177660049098586324302977543600799
211 15193 * 60272956433838849161 * 3593875704495823757388199894268773153439
}}}
{{{id=6|
os.system('ls -lh /tmp/dbcache1*')
///
-rw-r--r--  1 wstein  wheel    24K Nov 30 16:20 /tmp/dbcache1.db
0
}}}
{{{id=106|
%time
for n in range(1,10000):
    z = fac(n)
///
CPU time: 6.04 s, Wall time: 12.88 s
}}}

The above is pretty slow (compared to what we do below), because of all the opening and closing of the database file on every call.
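To see that the open/close overhead really is the cost, here is a minimal modern Python 3 sketch (not part of the original worksheet: it uses the standard dbm and pickle modules in place of Sage's anydbm/cPickle, and the names cached_reopen/cached_shared are made up) comparing the open-per-call pattern above against keeping a single handle open:

```python
import dbm, os, pickle, tempfile, time

tmp = tempfile.mkdtemp()
path_reopen = os.path.join(tmp, 'cache_reopen')
path_shared = os.path.join(tmp, 'cache_shared')

def cached_reopen(n):
    """Open-check-close on every call, like the decorator above."""
    key = pickle.dumps(n, 2)
    with dbm.open(path_reopen, 'c') as db:
        if key in db:
            return pickle.loads(db[key])
    value = n * n                     # stand-in for the expensive computation
    with dbm.open(path_reopen, 'c') as db:
        db[key] = pickle.dumps(value, 2)
    return value

# Keep one handle open for the lifetime of the program instead.
shared_db = dbm.open(path_shared, 'c')

def cached_shared(n):
    key = pickle.dumps(n, 2)
    if key in shared_db:
        return pickle.loads(shared_db[key])
    value = n * n
    shared_db[key] = pickle.dumps(value, 2)
    return value

t0 = time.time()
for n in range(1000):
    cached_reopen(n)
t_reopen = time.time() - t0

t0 = time.time()
for n in range(1000):
    cached_shared(n)
t_shared = time.time() - t0

print(t_reopen > t_shared)   # the open/close overhead dominates
```

The single-handle version is what we can't safely do with plain dbm when several processes share the file, which is exactly the problem the rest of this section addresses.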


WARNING: If multiple threads or processes try to simultaneously open and write to the same dbm database, this can cause trouble.  E.g., the following code does exactly that.  Strangely, I can't get it to break -- maybe the ndbm implementation on OS X has some locking capabilities built in...

{{{id=11|
# Potentially dangerous because of locking issues?
os.system('rm -f /tmp/dbcache2*')

@parallel(6)
def hard_work(n):
    @db_cached_function('/tmp/dbcache2')
    def f(n):
        sleep(.5)
        return n*n
    return f(n)
///
}}}
{{{id=18|
for X in hard_work([1..20]):
    print X
///
(((1,), {}), 1)
(((2,), {}), 4)
(((4,), {}), 16)
(((5,), {}), 25)
(((6,), {}), 36)
(((3,), {}), 9)
(((7,), {}), 49)
(((8,), {}), 64)
(((9,), {}), 81)
(((11,), {}), 121)
(((10,), {}), 100)
(((12,), {}), 144)
(((13,), {}), 169)
(((14,), {}), 196)
(((15,), {}), 225)
(((16,), {}), 256)
(((17,), {}), 289)
(((18,), {}), 324)
(((19,), {}), 361)
(((20,), {}), 400)
}}}
{{{id=102|
for X in hard_work([1..20]):
    print X
///
(((1,), {}), 1)
(((5,), {}), 25)
(((4,), {}), 16)
(((6,), {}), 36)
(((3,), {}), 9)
(((2,), {}), 4)
(((8,), {}), 64)
(((11,), {}), 121)
(((10,), {}), 100)
(((9,), {}), 81)
(((12,), {}), 144)
(((7,), {}), 49)
(((13,), {}), 169)
(((17,), {}), 289)
(((16,), {}), 256)
(((18,), {}), 324)
(((15,), {}), 225)
(((14,), {}), 196)
(((19,), {}), 361)
(((20,), {}), 400)
}}}

See StackOverflow: "'separate process that coordinates read/write access to that file' - in other words, implement a database server :-)"

And hence we are motivated to consider more sophisticated databases.

We'll start with SQLite.

Using SQLite in Sage

Sage comes with the powerful SQLite database library -- a self-contained, serverless SQL engine.   Check out the SQLite website.

Here's an example of using it at a very low level to make a database of integer factorizations.

{{{id=28|
import sqlite3   # a standard Python module

file = '/tmp/sqlite1'
if os.path.exists(file):
    os.unlink(file)

db = sqlite3.connect(file)
cursor = db.cursor()
cursor.execute("""CREATE TABLE factorizations
         (number TEXT, factorization TEXT, UNIQUE(number))""")
cursor.execute("CREATE INDEX factorizations_idx ON factorizations(number)")
db.commit()
///
}}}
{{{id=27|
cursor.execute('INSERT INTO factorizations VALUES("6","[(2,1),(3,1)]")')
///
}}}
{{{id=25|
db.commit()
///
}}}

We can look at our new database on the command line, completely independently of Sage/Python:

deep:~ wstein$ sage -sh
(sage subshell)$ sqlite3 /tmp/sqlite1
SQLite version 3.6.22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .schema
CREATE TABLE factorizations 
         (number TEXT, factorization TEXT, UNIQUE(number));
CREATE INDEX factorizations_idx ON factorizations(number);
sqlite> select * from factorizations;
6|[(2,1),(3,1)]

The UNIQUE constraint above makes it so you can't insert another factorization of the same number:

{{{id=35|
cursor.execute('INSERT INTO factorizations VALUES("6","[(2,1),(3,1)]")')
///
Traceback (most recent call last):
  ...
sqlite3.IntegrityError: column number is not unique
}}}

Let's populate the database with factorizations:

{{{id=34|
%time
for n in range(1,10000):
    f = str(list(factor(n))).replace(' ','')
    try:
        z = cursor.execute('INSERT INTO factorizations VALUES("%s","%s")'%(n,f))
    except:
        print "Unable to insert factorization of %s"%n
///
Unable to insert factorization of 6
CPU time: 1.91 s, Wall time: 1.92 s
}}}
{{{id=39|
db.commit()
///
}}}

If you use the command line again (exit and reload) you'll find:

sqlite> SELECT * FROM factorizations;
6|[(2,1),(3,1)]
1|[]
2|[(2,1)]
3|[(3,1)]
4|[(2,2)]
5|[(5,1)]
7|[(7,1)]
8|[(2,3)]
9|[(3,2)]
10|[(2,1),(5,1)]
...
9998|[(2,1),(4999,1)]
9999|[(3,2),(11,1),(101,1)]

Obviously, to use SQLite the way we're using it here, you have to know something of the SQL language.   Fortunately, you don't need to know very much to do a few basic things: there are tons of examples on the web, many tutorials, and basic SQL isn't hard to learn.
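In fact, the handful of SQL statements used in this section can be tried out in a few lines. The following is a self-contained Python 3 sketch (not from the original worksheet; it uses an in-memory database and made-up sample rows, with a table name mirroring the one above):

```python
import sqlite3

# An in-memory database; the rows are made-up sample data.
db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute("""CREATE TABLE factorizations
               (number TEXT, factorization TEXT, UNIQUE(number))""")

# ?-placeholders let sqlite3 quote the values for you (and protect
# against SQL injection), unlike the string formatting used above.
rows = [('4', '[(2,2)]'), ('6', '[(2,1),(3,1)]'), ('9', '[(3,2)]')]
cur.executemany("INSERT INTO factorizations VALUES (?, ?)", rows)
db.commit()

cur.execute("SELECT factorization FROM factorizations WHERE number=?", ('6',))
print(cur.fetchone()[0])    # [(2,1),(3,1)]

cur.execute("SELECT count(*) FROM factorizations")
print(cur.fetchone()[0])    # 3

cur.execute("DELETE FROM factorizations WHERE number=?", ('9',))
db.commit()
print(cur.execute("SELECT count(*) FROM factorizations").fetchone()[0])   # 2
```

CREATE, INSERT, SELECT ... WHERE, and DELETE are essentially everything we need below.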

To illustrate that SQLite3 can at least be a good key:value store, let's implement a simple pickle-based key:value store on top of it.   This will work like a dictionary, but be stored on disk using SQLite3, and you have to call the commit method to write your changes out to disk.   The complete implementation is fairly short.

I wrote the following from scratch.  It was surprising how many tricks I had to figure out by Googling around...

{{{id=47|
import cPickle, sqlite3

class SQLiteKeyValueStore:
    def __init__(self, file):
        """Create or open the SQLite3-based key:value database stored in the given file."""
        self._db = sqlite3.connect(file)
        self._cursor = self._db.cursor()
        self._file = file
        try:
            self._cursor.execute("select * from sqlite_master").next()
        except StopIteration:
            # This exception will occur if the database is brand new (has no tables yet).
            try:
                self._cursor.execute("CREATE TABLE cache (key BLOB, value BLOB, UNIQUE(key))")
                self._cursor.execute("CREATE INDEX cache_idx ON cache(key)")
                self._db.commit()
            except sqlite3.OperationalError:
                # Failure could happen if another process created and
                # initialized the database at the same time.  That's fine.
                pass

    def __del__(self):
        """Called when the database object is freed, to close the connection."""
        self._db.close()

    def __repr__(self):
        """String representation of the database."""
        return "SQLite3-based key:value database stored in '%s'"%self._file

    def has_key(self, key):
        """Return True if the database has the given key."""
        return self._cursor.execute(
            "SELECT count(*) FROM cache WHERE key=?", (self._dumps(key),)).next()[0] > 0

    def __getitem__(self, key):
        """Return the item in the database with the given key, or raise KeyError."""
        s = self._cursor.execute("SELECT value FROM cache WHERE key=?", (self._dumps(key),))
        try:
            return self._loads(str(s.next()[0]))
        except StopIteration:
            raise KeyError, str(key)

    def __setitem__(self, key, value):
        """Set an item in the database.  Call commit to make this permanent."""
        self._cursor.execute("INSERT OR REPLACE INTO cache VALUES(?, ?)",
                             (self._dumps(key), self._dumps(value)))

    def __delitem__(self, key):
        """Remove an item from the database.  Call commit to make this permanent."""
        self._cursor.execute("DELETE FROM cache WHERE key=?", (self._dumps(key),))

    def _dumps(self, x):
        """Convert a Python object to a binary string that can be stored in the database."""
        return sqlite3.Binary(cPickle.dumps(x, 2))

    def _loads(self, x):
        """Used internally to turn a pickled object in the database into a Python object."""
        return cPickle.loads(x)

    def keys(self):
        """Return the list of keys in the database."""
        return [self._loads(str(x[0])) for x in self._cursor.execute("SELECT key FROM cache")]

    def commit(self):
        """Write assignments made to the database out to disk."""
        self._db.commit()
///
}}}

First a quick test.

{{{id=49|
os.system('rm -f /tmp/sql_test')
S = SQLiteKeyValueStore('/tmp/sql_test'); S
///
SQLite3-based key:value database stored in '/tmp/sql_test'
}}}
{{{id=50|
S[2] = 10
S[3] = 8
///
}}}
{{{id=58|
S.keys()
///
[2, 3]
}}}
{{{id=51|
S[2]
///
10
}}}
{{{id=60|
timeit('S[3]')
///
625 loops, best of 3: 29.5 µs per loop
}}}
{{{id=61|
timeit('S[3]=8')
///
625 loops, best of 3: 34.3 µs per loop
}}}
{{{id=72|
timeit('S[3]=8; S.commit()')
///
625 loops, best of 3: 1.16 ms per loop
}}}
{{{id=62|
del S[3]
///
}}}
{{{id=64|
S.keys()
///
[2]
}}}
{{{id=65|
S = SQLiteKeyValueStore('/tmp/factorizations.sql')
///
}}}
{{{id=66|
%time
for n in range(1,10000):
    S[n] = factor(n)
///
CPU time: 2.36 s, Wall time: 2.37 s
}}}
{{{id=69|
S.commit()
///
}}}
{{{id=68|
os.system("ls -lh /tmp/factorizations.sql")
///
-rw-r--r--  1 wstein  wheel   3.6M Nov 30 17:06 /tmp/factorizations.sql
0
}}}

Incidentally, if you delete data, the database file doesn't automatically shrink.  As the SQLite documentation explains, you can "vacuum" the database to reclaim the space.
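The same effect is easy to reproduce outside of Sage. Here is a small Python 3 sketch (not from the original worksheet; the file name is made up, and it assumes SQLite's default settings, i.e. auto_vacuum off) showing that DELETE leaves the file size unchanged while VACUUM shrinks it:

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), 'vacuum_demo.db')
db = sqlite3.connect(path)
cur = db.cursor()
cur.execute("CREATE TABLE cache (key BLOB, value BLOB)")

# Insert ~2MB of blobs so the effect is clearly visible.
cur.executemany("INSERT INTO cache VALUES (?, ?)",
                [(sqlite3.Binary(i.to_bytes(8, 'big')),
                  sqlite3.Binary(bytes(10000))) for i in range(200)])
db.commit()
size_full = os.path.getsize(path)

cur.execute("DELETE FROM cache")
db.commit()
size_after_delete = os.path.getsize(path)   # freed pages stay in the file

cur.execute("VACUUM")                       # rebuild the file, dropping free pages
size_after_vacuum = os.path.getsize(path)

print(size_after_delete == size_full)       # deleting rows does not shrink the file
print(size_after_vacuum < size_after_delete)
```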

{{{id=67|
%time
for n in range(1,10000):
    del S[n]
S.commit()
///
CPU time: 0.23 s, Wall time: 0.52 s
}}}
{{{id=55|
os.system("ls -lh /tmp/factorizations.sql")
///
-rw-r--r--  1 wstein  wheel   3.6M Nov 30 17:07 /tmp/factorizations.sql
0
}}}
{{{id=77|
S.keys()
///
[]
}}}
{{{id=79|
S._cursor.execute('VACUUM')
///
}}}
{{{id=73|
os.system("ls -lh /tmp/factorizations.sql")
///
-rw-r--r--  1 wstein  wheel   4.0K Nov 30 17:09 /tmp/factorizations.sql
0
}}}

Given any key:value store, it is easy to implement a cached function.  For simplicity of use, we call commit every time a new value is computed, which costs about a millisecond.  (The alternative would be to require an explicit call to f.commit(), where f is the decorated function.)

{{{id=40|
class sqlite_cached_function:
    def __init__(self, file):
        self.db = SQLiteKeyValueStore(file)
    def __call__(self, f):
        """Return decorated version of f."""
        A = sage.misc.function_mangling.ArgumentFixer(f)
        def g(*args, **kwds):
            k = A.fix_to_named(*args, **kwds)
            try:
                return self.db[k]
            except KeyError:
                pass
            x = self.db[k] = f(*args, **kwds)
            self.db.commit()
            return x
        def keys():
            return self.db.keys()
        g.keys = keys
        g.db = self.db
        return g
///
}}}
{{{id=45|
os.system('rm -f /tmp/sqlite_factor_func')

@sqlite_cached_function('/tmp/sqlite_factor_func')
def fac(n):
    return factor(n)
///
}}}
{{{id=82|
fac.keys()
///
[]
}}}
{{{id=83|
fac(2010)
///
2 * 3 * 5 * 67
}}}
{{{id=84|
fac.db.keys()
///
[((), (('n', 2010),))]
}}}
{{{id=89|
%time
for n in range(1,1000):
    z = fac(n)
///
CPU time: 0.88 s, Wall time: 1.83 s
}}}
{{{id=87|
%time
for n in range(1,1000):
    z = fac(n)
///
CPU time: 0.07 s, Wall time: 0.07 s
}}}
{{{id=86|
os.system('ls -lh /tmp/sqlite_factor_func')
///
-rw-r--r--  1 wstein  wheel   389K Nov 30 17:17 /tmp/sqlite_factor_func
0
}}}

SQLite also deals with the concurrent access problem.  You can have many programs all open the same database and use it.  When one does a commit, the database is locked for a moment and nobody else can commit, etc.  This is all handled automatically by SQLite -- you don't have to worry about it.  Thus we finally have cached functions and a key:value store that work well and support concurrent read/write access.   (NOTE: According to the SQLite docs, locking may unfortunately not work correctly in some cases on an NFS shared filesystem.)
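The commit-level isolation described above can be seen directly with two connections to the same file. The following Python 3 sketch is not from the original worksheet: the two connections stand in for two separate processes, and the file name and row data are made up.

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), 'concurrent_demo.db')

# Two independent connections, standing in for two separate processes.
# The timeout tells sqlite3 how long to wait if the other connection
# holds the write lock, instead of raising "database is locked" at once.
writer = sqlite3.connect(path, timeout=10)
reader = sqlite3.connect(path, timeout=10)

writer.execute("CREATE TABLE cache (key TEXT UNIQUE, value TEXT)")
writer.commit()

writer.execute("INSERT INTO cache VALUES ('2010', '2 * 3 * 5 * 67')")
# Before the writer commits, the reader sees nothing ...
print(reader.execute("SELECT count(*) FROM cache").fetchone()[0])   # 0

writer.commit()
# ... and after the commit, the row is visible to everyone.
print(reader.execute("SELECT count(*) FROM cache").fetchone()[0])   # 1
```

This is exactly the coordination we could not get from a plain dbm file.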

{{{id=94|
# SHOULD WORK -- using @parallel makes it so 6 new processes are spawned.
# They all open the database file and simultaneously try to commit to it.
os.system('rm -f /tmp/sqlite_func')

@parallel(6)
def fac(n):
    # It is important to put the decorator inside fac -- otherwise, when
    # @parallel spawns 6 forked processes, they will all have exactly
    # the same database file handle.  By doing this, we ensure each
    # call to fac(n) opens a new connection to the database.
    @sqlite_cached_function('/tmp/sqlite_func')
    def f(n):
        sleep(.5)   # make it harder to factor :-)
        return factor(n)
    return f(n)
///
}}}
{{{id=85|
# safe
for X in fac([1..12]):
    print X
///
(((1,), {}), 1)
(((4,), {}), 2^2)
(((6,), {}), 2 * 3)
(((3,), {}), 3)
(((2,), {}), 2)
(((5,), {}), 5)
(((7,), {}), 7)
(((8,), {}), 2^3)
(((9,), {}), 3^2)
(((10,), {}), 2 * 5)
(((11,), {}), 11)
(((12,), {}), 2^2 * 3)
}}}
{{{id=95|
for X in fac([1..20]):
    print X
///
(((2,), {}), 2)
(((3,), {}), 3)
(((5,), {}), 5)
(((6,), {}), 2 * 3)
(((4,), {}), 2^2)
(((1,), {}), 1)
(((7,), {}), 7)
(((8,), {}), 2^3)
(((11,), {}), 11)
(((10,), {}), 2 * 5)
(((14,), {}), 2 * 7)
(((12,), {}), 2^2 * 3)
(((9,), {}), 3^2)
(((15,), {}), 3 * 5)
(((16,), {}), 2^4)
(((17,), {}), 17)
(((13,), {}), 13)
(((19,), {}), 19)
(((18,), {}), 2 * 3^2)
(((20,), {}), 2^2 * 5)
}}}

See the source code here for something like the above, with tests and compression: http://code.google.com/p/purplesage/source/browse/psage/lmfdb/sqlite_keyval.py
