Math 581d: Nov 29, 2010:  Persistent Python Databases for Sage -- beyond pickle

NOTE: Be aware of trac 10352; it has an spkg (which I installed before writing this lecture) that updates ZODB to the latest version.

In this lecture we'll talk about:

  • anydbm -- Python's simple persistent string-to-string database interface
  • shelve -- a layer on top of anydbm that pickles values for you
  • ZODB -- an object-oriented Python database

We assume the reader understands Python's pickle system, as explained in the previous lecture.

Our running sample database problem: make a database of factorizations of integers. This is a good example because factorizations are expensive to compute, cheap to store, and there are a huge number of them -- exactly the situation where a persistent database pays off.


Here's the sort of thing that happens if you try to use your filesystem as a database, and your filesystem happens to be on a slightly old critical server:

 

root@boxen:/sagenb.orig/sagenb/sage_notebook.sagenb# df -i /
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            4481024 4477699    3325  100% /
root@boxen:/sagenb.orig/sagenb/sage_notebook.sagenb# find new_home |wc -l
643101
root@boxen:/sagenb.orig/sagenb/sage_notebook.sagenb# time find home |wc -l
3021185

By default, an older ext3 filesystem only has a few million inodes, so you can only create a few million files; once they're used up, you can't create any more, even if you have plenty of free disk space. (So don't mess up like I did!)


dbm -- "Simple Database Interface"

Official Python docs: http://docs.python.org/library/anydbm.html

Python comes with something called anydbm, which implements a very simple, but potentially very powerful, database. What you get is a persistent (on-disk) object that works much like a Python dictionary, except the keys and values must both be strings. The neat thing is that anydbm uses the best dbm implementation available when you build Sage from source: one of several possible fast C implementations, or a really slow pure-Python fallback (dumbdbm) if none of those was available. So it just works.

Let's try it out. 

First, create a database.

{{{id=23|
file = DATA + 'anydbm.db'
if os.path.exists(file):
    os.unlink(file)
import anydbm
D = anydbm.open(DATA + 'anydbm', 'c'); D
///
}}}

For us, it turns out this uses dbm, the standard UNIX dbm interface. I think the OS X implementation of this is "ndbm".

{{{id=31|
print anydbm._defaultmod
///
}}}

{{{id=27|
D['6'] = '[(2,1), (3,1)]'
D['12'] = '[(2,2), (3,1)]'
///
}}}

{{{id=163|
D['6']
///
'[(2,1), (3,1)]'
}}}
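As an aside (this cell isn't part of the original demo), the standard library's whichdb module can often tell you which dbm implementation wrote a given database file:

{{{id=199|
# whichdb inspects the file(s) on disk and guesses which dbm module
# created them; it returns None or '' when it can't tell.
import whichdb
print whichdb.whichdb(DATA + 'anydbm')
///
}}}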

You can only store strings, and the keys must be strings.

{{{id=26|
D['1'] = []
///
Traceback (most recent call last):
  ...
TypeError: dbm mappings have string elements only
}}}

{{{id=35|
D[1] = '[]'
///
Traceback (most recent call last):
  ...
TypeError: dbm mappings have string indices only
}}}

Now we close the database, then reopen and see that it really stored our data.  If you don't explicitly close the database, then the data might not get stored to disk.

{{{id=36|
D.close()
///
}}}

{{{id=39|
D.keys()
///
Traceback (most recent call last):
  ...
dbm.error: DBM object has already been closed
}}}

{{{id=37|
import anydbm
D = anydbm.open(DATA + 'anydbm', 'c')
D.keys()
///
['6', '12']
}}}

{{{id=56|
D['6']
///
'[(2,1), (3,1)]'
}}}

{{{id=55|
D['12']
///
'[(2,2), (3,1)]'
}}}

{{{id=91|
eval(D['12'])
///
[(2, 2), (3, 1)]
}}}

{{{id=78|
del D['6']
D.keys()
///
['12']
}}}

{{{id=79|
D.close()
///
}}}

{{{id=81|
D = anydbm.open(DATA + 'anydbm', 'c')
D.keys()
///
['12']
}}}

Key Point: This is dramatically different from just storing and loading one big pickle: if there are billions of key/value pairs, only the ones you actually use get loaded into memory. Databases are all about scalability!

According to the Python docs (and others), anydbm has various limitations, depending on which database implementation is actually used (which in turn depends on how and where Python/Sage was built!). On my OS X laptop, NDBM is used, which has severe limits, though I had trouble finding out exactly what they are -- one page says "there is a limit on the total size of the key/content pairs that can be stored (this ranges from 1018 bytes to 4096 bytes)".

Let's break stuff.

{{{id=57|
file = DATA + 'bigdbm.db'
if os.path.exists(file):
    os.unlink(file)
import anydbm
bigdb = anydbm.open(DATA + 'bigdbm', 'c'); bigdb
///
}}}

{{{id=59|
s = 'x'*(20 * 10^6)   # a 20-megabyte string
bigdb['1'] = s
///
}}}

{{{id=54|
bigdb.close()
print os.system('ls -lh %s'%DATA)
///
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:41 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0
}}}

Now try to reopen it.

{{{id=62|
bigdb = anydbm.open(DATA + 'bigdbm', 'c')
///
}}}

{{{id=61|
bigdb
///
}}}

It seems to work fine. So, I guess we didn't hit a size limit. It's worrisome though.

{{{id=67|
len(bigdb['1'])
///
20000000
}}}

{{{id=70|
del bigdb['1']
bigdb.close()
///
}}}

{{{id=73|
bigdb = anydbm.open(DATA + 'bigdbm', 'c')
len(bigdb['1'])
///
Traceback (most recent call last):
  ...
KeyError: '1'
}}}

But the database file is still big:

{{{id=88|
print os.system('ls -lh %s'%DATA)
///
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:42 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0
}}}

Pros:

  • Really simple to learn and understand.
  • You can store a huge number of records efficiently.
  • There are modern very scalable implementations like Tokyo Cabinet.

Cons:

  • You can only store strings.
  • The database file stays large even when you delete keys and close.
  • It's very unclear what the maximum record size is, or even which database implementation you'll be using... The Python docs say the following scary thing: "The database is also (unfortunately) subject to the limitations of dbm, if it is used — this means that (the pickled representation of) the objects stored in the database should be fairly small, and in rare cases key collisions may cause the database to refuse updates."

shelve

Shelve is a standard Python module [Python docs] that provides a disk-persistent dictionary-like object. It is built on top of anydbm. The main point is that the values in the dictionary do not have to be strings: shelve simply pickles the objects you store, and stores the pickles. You can read the complete source code of shelve in a few minutes -- it's just a light wrapper around anydbm.
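To make that concrete, here is a stripped-down paraphrase of what shelve does internally (a sketch in the spirit of the real shelve.py, not a verbatim copy of it):

{{{id=200|
# A minimal paraphrase of shelve's core idea: values get pickled on
# the way in and unpickled on the way out; keys stay plain strings.
import anydbm
from cStringIO import StringIO
from pickle import Pickler, Unpickler

class MiniShelf:
    def __init__(self, filename, protocol=0):
        self.dict = anydbm.open(filename, 'c')
        self.protocol = protocol
    def __setitem__(self, key, value):
        f = StringIO()
        Pickler(f, self.protocol).dump(value)   # pickle the value...
        self.dict[key] = f.getvalue()           # ...and store the string
    def __getitem__(self, key):
        return Unpickler(StringIO(self.dict[key])).load()
    def close(self):
        self.dict.close()
///
}}}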

Let's take shelve for a spin.

 

{{{id=92|
file = DATA + 'shelve.db'
if os.path.exists(file):
    os.unlink(file)
import shelve
D = shelve.open(DATA + 'shelve', 'c'); D
///
{}
}}}

{{{id=93|
D[6] = factor(6)
///
Traceback (most recent call last):
  ...
  File ".../local/lib/python/shelve.py", line 132, in __setitem__
    p.dump(value)
  File ".../local/lib/python/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle SageObject objects
}}}

But no worries: shelve.open takes a protocol option, which specifies the pickle protocol to use. Sage objects are instances of Cython classes, and require protocol 2.

{{{id=90|
file = DATA + 'shelve'
if os.path.exists(file + '.db'):
    os.unlink(file + '.db')
import shelve
D = shelve.open(file, 'c', protocol=2); D
///
{}
}}}

{{{id=164|
D[15] = factor(15)
///
Traceback (most recent call last):
  ...
TypeError: dbm mappings have string indices only
}}}

{{{id=17|
D['6'] = factor(6)
D['12'] = factor(12)
///
}}}

{{{id=96|
D['12']
///
2^2 * 3
}}}

Write it to disk, then reload and see what we got.

{{{id=97|
D.close()
os.system('ls -lh %s/shelve*'%DATA)
///
-rw-r--r--  1 wstein  staff   16K Nov 29 13:09 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
}}}

{{{id=95|
D['6']
///
Traceback (most recent call last):
  ...
  File ".../local/lib/python/shelve.py", line 79, in closed
    raise ValueError('invalid operation on closed shelf')
ValueError: invalid operation on closed shelf
}}}

{{{id=100|
D = shelve.open(file, 'c', protocol=2); D
///
{'12': 2^2 * 3, '6': 2 * 3}
}}}

{{{id=99|
D['6']
///
2 * 3
}}}

Great!  Let's try putting a big object in there, say a 10-million digit integer. Watch out -- sadly, printing of shelve objects is totally, massively braindead: it loads the whole database into memory and would print out the whole damn 10-million digit number. So do not ever print D once you have some nontrivial data in it!

{{{id=165|
D
///
{'12': 2^2 * 3, '6': 2 * 3}
}}}

{{{id=106|
D.__repr__??
///
File: /Users/wstein/purple/install/psage-10.10.26/local/lib/python/UserDict.py

Source Code (starting at line 170):

def __repr__(self):
    return repr(dict(self.iteritems()))
}}}

{{{id=105|
D['test'] = 11^(10^7)
os.system('ls -lh %s/shelve*'%DATA)
///
-rw-r--r--  1 wstein  staff   6.6M Nov 29 13:11 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
}}}

As you can see above, the file suddenly got way bigger, so this record got written to disk immediately, for some reason...

The following loads a new copy of the integer from disk into memory.

{{{id=103|
time n = D['test']
///
Time: CPU 0.04 s, Wall: 0.04 s
}}}

{{{id=102|
n.ndigits()
///
10413927
}}}

{{{id=107|
D.close()
///
}}}

Gotcha: the D.keys() method doesn't work as you might expect until some query has been made. If you want to keep track of what you've put into the database, you should store that as a special entry as well.

{{{id=112|
D = shelve.open(DATA+'shelve', 'c', protocol=2)
D.keys()
///
[]
}}}

{{{id=115|
D.has_key('test')
///
True
}}}

{{{id=166|
D.keys()
///
['test', '12', '6']
}}}

{{{id=110|
time n = D['test']
///
Time: CPU 0.05 s, Wall: 0.05 s
}}}

{{{id=111|
n.ndigits()
///
10413927
}}}

{{{id=101|
D.keys()
///
['test', '12', '6']
}}}

So we'll store which numbers we've factored in a set.

{{{id=117|
D['numbers'] = set([6, 12])
///
}}}

{{{id=121|
# the following has an intentional bug!
for p in primes(150):
    n = 2^p - 1
    f = factor(n)
    D[str(n)] = f
    D['numbers'].add(n)
///
}}}

{{{id=116|
D['numbers']
///
set([12, 6])
}}}

Hey, what happened?  Each time we evaluate D['numbers'], a new copy of the set is loaded from disk, then appended to, then forgotten (since we never save it back). There is a writeback=True option we could pass to open, but it has a bunch of problems of its own (see the sketch after the next cell). So we instead do the following:

{{{id=122|
# this works fine, but has scalability issues
for p in primes(150):
    n = 2^p - 1
    f = factor(n)
    D[str(n)] = f
    nums = D['numbers']
    nums.add(n)
    D['numbers'] = nums
///
}}}

{{{id=113|
D['numbers']
///
set([3, 147573952589676412927, 7, 8191, 12, 2147483647, 131071, 576460752303423487, 524287, 31, 8388607, 137438953471, 6, 2047, 10141204801825835211973625643007, 140737488355327, 2722258935367507707706996859454145691647, 604462909807314587353087, 158456325028528675187087900671, 2361183241434822606847, 8796093022207, 2535301200456458802993406410751, 9444732965739290427391, 2199023255551, 9671406556917033397649407, 170141183460469231731687303715884105727, 162259276829213363391578010288127, 127, 618970019642690137449562111, 2305843009213693951, 696898287454081973172991196020261297061887, 10384593717069655257060992658440191, 9007199254740991, 649037107316853453566312041152511, 536870911, 174224571863520493293247799005065324265471, 713623846352979940529142984724747568191373311])
}}}
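For comparison, here's a minimal sketch of the writeback alternative mentioned above (the file name shelve_wb is made up for this illustration; this cell wasn't part of the original demo):

{{{id=201|
# With writeback=True, shelve keeps every entry you access in an
# in-memory cache and pickles the whole cache back to disk on
# sync()/close().  In-place mutation then works, but memory usage and
# close() time can blow up as the cache grows.
import shelve
E = shelve.open(DATA + 'shelve_wb', 'c', protocol=2, writeback=True)
E['numbers'] = set([6, 12])
E['numbers'].add(35)    # the mutation sticks, since E['numbers'] is cached
E.close()               # the cache is written back to disk here
///
}}}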

We have our database, and it works!

{{{id=125|
D.close()
D = shelve.open(DATA+'shelve', 'c', protocol=2)
D[str(2^97-1)]
///
11447 * 13842607235828485645766393
}}}

Let's make a table:

{{{id=134|
for n in sorted(D['numbers']):
    if n > 10^30: break
    print '%-30s\t%s'%(n, D[str(n)])
///
3                                 3
6                                 2 * 3
7                                 7
12                                2^2 * 3
31                                31
127                               127
2047                              23 * 89
8191                              8191
131071                            131071
524287                            524287
8388607                           47 * 178481
536870911                         233 * 1103 * 2089
2147483647                        2147483647
137438953471                      223 * 616318177
2199023255551                     13367 * 164511353
8796093022207                     431 * 9719 * 2099863
140737488355327                   2351 * 4513 * 13264529
9007199254740991                  6361 * 69431 * 20394401
576460752303423487                179951 * 3203431780337
2305843009213693951               2305843009213693951
147573952589676412927             193707721 * 761838257287
2361183241434822606847            228479 * 48544121 * 212885833
9444732965739290427391            439 * 2298041 * 9361973132609
604462909807314587353087          2687 * 202029703 * 1113491139767
9671406556917033397649407         167 * 57912614113275649087721
618970019642690137449562111       618970019642690137449562111
158456325028528675187087900671    11447 * 13842607235828485645766393
}}}

Note: deleting that big entry doesn't make the database file smaller, unfortunately. This is the same problem we had with anydbm above. Your mileage may vary; for most research applications this may be fine. (A compaction workaround is sketched after the next cell.)

{{{id=129|
del D['test']
D.close()
os.system('ls -lh %s/shelve*'%DATA)
///
-rw-r--r--  1 wstein  staff   6.6M Nov 28 16:17 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
}}}
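If the dead space ever matters, one workaround (a hypothetical sketch -- compact_shelve and the file names are invented here) is to copy the live entries into a brand-new shelve and use that instead:

{{{id=202|
# Hypothetical compaction sketch: a fresh shelve containing only the
# live entries has none of the old file's dead space.  Beware the
# keys() gotcha above: with some dbm backends you may want to query
# the shelf once (e.g. via has_key) before trusting keys().
import shelve
def compact_shelve(old_file, new_file):
    old = shelve.open(old_file, 'r')
    new = shelve.open(new_file, 'c', protocol=2)
    for k in old.keys():
        new[k] = old[k]
    old.close(); new.close()
///
}}}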

Let's store a hundred thousand factorizations in the database, close it, then reopen it and see how long it takes to load a factorization.

{{{id=135|
%time
nums = D['numbers']
for n in range(1,10^5):
    f = factor(n)
    D[str(n)] = f
    nums.add(n)
D['numbers'] = nums
///
CPU time: 21.70 s,  Wall time: 23.03 s
}}}

{{{id=137|
D.close()
///
}}}

{{{id=124|
time D = shelve.open(DATA+'shelve', 'c', protocol=2)
time k = D['2010']
///
Time: CPU 0.00 s, Wall: 0.00 s
Time: CPU 0.00 s, Wall: 0.00 s
}}}

{{{id=167|
k
///
2 * 3 * 5 * 67
}}}

Compare doing the same with one huge Python dictionary that is pickled, and you begin to see the scalability differences!

{{{id=142|
%time
D2 = {}
for n in range(1,10^5):
    f = factor(n)
    D2[str(n)] = f
///
CPU time: 15.03 s,  Wall time: 15.04 s
}}}

{{{id=140|
time save(D2, DATA+'pickle1.sobj')
///
Time: CPU 9.18 s, Wall: 9.19 s
}}}

{{{id=143|
time D2 = load(DATA + 'pickle1.sobj')
time k = D2['2010']
///
Time: CPU 3.30 s, Wall: 3.30 s
Time: CPU 0.00 s, Wall: 0.00 s
}}}

I hope the above example gives you a sense of the scalability of the different approaches. The database will easily scale out to millions or billions of numbers. The dict will fail miserably, with load times stretching to hours and memory usage skyrocketing. And making a billion files on your filesystem (one for each integer) obviously won't work either: it's fine for a few numbers, and fails miserably for a lot of them (as I've learned the hard way...).

And so far we aren't using much: just some very simple Python code built on top of dbm, which dates back almost to the original UNIX days of the 1970s... oh, and pickle, of course.

What we have done so far in this worksheet is the bare minimum you need to know in order to use Sage to write genuinely scalable research-oriented code. The cool thing is that you can mostly understand everything deeply -- it isn't too complicated. (That said, I once took a semester course at Harvard from a co-author of Berkeley DB, which is a dbm implementation, and programs that implement the dbm API can get pretty sophisticated.)


Scalability: Size isn't the only thing that matters -- concurrent writes.

Another key part of scalability is having support for more than one process simultaneously write to a single database.  You might think this doesn't matter for you, and maybe it doesn't.  But for research math applications, it often does.  Imagine you have a cluster of computers with 120 cores total, and you want to launch 120 processes that all compute some integer factorizations, and store their results in the database.  Their results might be a few hundred million integer factorizations (say).  If you try to do this using the above approach, you'll run into serious hell:

  1. The database file has to be stored somewhere that all the processes have read/write access to, and every read/write will involve updating that file.
  2. If multiple processes are all trying to read/write, they will just mess each other up, and the database file will soon become totally corrupted.  Game over.
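One crude way around this (a hypothetical sketch -- merge_shelves and the file lists are invented here): give each worker its own shelve file, and merge them in a single pass afterwards, so that no two processes ever write to the same file.

{{{id=203|
# Hypothetical sketch: each of the 120 workers writes to its own
# shelve; afterwards a single process merges them all, so there are
# never any concurrent writers.
import shelve
def merge_shelves(worker_files, merged_file):
    out = shelve.open(merged_file, 'c', protocol=2)
    for fn in worker_files:
        w = shelve.open(fn, 'r')
        for k in w.keys():
            out[k] = w[k]
        w.close()
    out.close()
///
}}}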

For some applications, shelve still works fine: when you only need to scale up in size, or will mainly be using the database read-only. E.g., suppose you compute a few gigabytes of data stored in files all over the place, and want to put it all in a database so you can work with it more easily (Cremona's tables, which are read-only, are like this). Then you just make one quick pass through the files, build up a database, and work with that.


Shelve pros and cons.

Pros:

  • Much easier to use than anydbm, since it uses pickle.  (In Sage, make sure to use protocol=2!).
  • Scales up size wise.
  • Fast, assuming anydbm turns out to be fast on your system.

Cons:

  • Total failure to scale up for multiple simultaneous writes.
  • Performance depends on which dbm implementation anydbm picks on your system (so even the underlying database file need not be portable!).
  • No way to do interesting queries on your data.
  • No undo (no transaction support).

ZODB

ZODB is an object-oriented Python database. It's pretty much the canonical one for this, and has been around for a looooong time. It's kind of like SQLite in that the entire database is usually stored in a single file (so if your database will get large, make sure your filesystem supports files bigger than 2GB!).

Some Pros:

  • ZODB is actively maintained and there are new bugfix releases (and minor feature improvements) regularly.  It's definitely not dead.
  • ZODB is mature and stable.
  • ZODB supports transactions: "Full ACID-compatible transaction support (including savepoints); History/undo ability."
  • In some cases, a perfect intermediate between going with a full database and just writing files to disk.   Can be a good part of a math research workflow.
  • Included in every copy of Sage, and fairly easy to easy_install into any Python.
  • ZODB does have support for simultaneous writes but you have to use something called ZEO, which can have very serious performance and security implications. 

Cons: 

  • Your data is tied to Python and your Python code (via pickles).  It's not language agnostic like sqlite or mongodb.   
  • Indexing and querying of data is severely limited compared to what is possible with SQL or noSQL databases.
  • I can't get their test suite to work.
  • I can't get it to pickle SageObject, so its usefulness is severely crippled in practice.
  • It's an optional install for Python, with several dependencies.

Here's a math-oriented tutorial to whet your appetite. We're in the Sage notebook, so there is a variable DATA naming a scratch directory where we can create files. This is where we'll put our database.

{{{id=158|
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
///
}}}

{{{id=152|
file = DATA + 'zodb.fs'
os.system('rm -f %s/zodb.fs*'%DATA)
///
0
}}}

{{{id=3|
storage = FileStorage(file)
db = DB(storage)
connection = db.open()
root = connection.root()
///
}}}

To store data in the database:  just use root like a dictionary and store objects.  It's as easy as that.

To actually permanently save your changes, call transaction.commit(); to abort (and not store them), call transaction.abort().
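Here is the basic pattern as a tiny sketch (the keys and values are throwaways, invented purely for illustration):

{{{id=204|
# Mutate root, then either commit (persist) or abort (roll back).
root['greeting'] = 'hello'
transaction.commit()       # 'greeting' is now safely on disk
root['oops'] = 'mistake'
transaction.abort()        # rolled back; 'oops' never reaches the disk
///
}}}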

Sample application: making a database of factorizations of integers.

{{{id=9|
for p in primes(100,150):
    n = 2^p - 1
    print p,
    time F = factor(n)
    root[n] = F
///
101 Time: CPU 0.03 s, Wall: 0.03 s
103 Time: CPU 0.03 s, Wall: 0.03 s
107 Time: CPU 0.00 s, Wall: 0.00 s
109 Time: CPU 0.04 s, Wall: 0.04 s
113 Time: CPU 0.00 s, Wall: 0.00 s
127 Time: CPU 0.00 s, Wall: 0.00 s
131 Time: CPU 0.01 s, Wall: 0.01 s
137 Time: CPU 0.22 s, Wall: 0.24 s
139 Time: CPU 0.25 s, Wall: 0.27 s
149 Time: CPU 0.50 s, Wall: 0.53 s
}}}

{{{id=8|
root
///
{713623846352979940529142984724747568191373311: 86656268566282183151 * 8235109336690846723986161, 10141204801825835211973625643007: 2550183799 * 3976656429941438590393, 2722258935367507707706996859454145691647: 263 * 10350794431055162386718619237468234569, 162259276829213363391578010288127: 162259276829213363391578010288127, 2535301200456458802993406410751: 7432339208719 * 341117531003194129, 10384593717069655257060992658440191: 3391 * 23279 * 65993 * 1868569 * 1066818132868207, 696898287454081973172991196020261297061887: 5625767248687 * 123876132205208335762278423601, 170141183460469231731687303715884105727: 170141183460469231731687303715884105727, 174224571863520493293247799005065324265471: 32032215596496435569 * 5439042183600204290159, 649037107316853453566312041152511: 745988807 * 870035986098720987332873}
}}}

{{{id=1|
transaction.commit()
///
Traceback (most recent call last):
  ...
  File ".../ZODB/serialize.py", line 416, in _dump
    self._p.dump(state)
  File ".../local/lib/python/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle SageObject objects
}}}

Whoops, what just happened?  The default pickle protocol in ZODB is 1, and it is not user configurable (so far as I can tell). According to a recent discussion on the ZODB devel list, there is some worry about this...

I absolutely can't figure out how to deal with this nicely, beyond making a bunch of changes to ZODB's code itself -- though one crude workaround is sketched below. So...
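One crude workaround (an untested sketch): do the protocol-2 pickling yourself with Sage's dumps()/loads(), so that ZODB only ever sees plain strings, which pickle fine under its protocol 1.

{{{id=205|
# Untested sketch: hand ZODB only str keys and str values.  dumps()
# produces a protocol-2 pickle string of the Sage object, and strings
# themselves pickle fine under ZODB's protocol 1.
for p in primes(100, 150):
    n = 2^p - 1
    root[str(n)] = dumps(factor(n))
transaction.commit()

F = loads(root[str(2^101 - 1)])   # unpickle on the way back out
///
}}}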

Next we'll talk about SQLAlchemy and SQLite/PostgreSQL, which provide a very different approach to this whole problem, as we will see.
