Math 480: Lecture 26 -- Databases, Part 1

William Stein, 2011-05-25


Motivation: Be able to store and manipulate data that arises when using Sage. 
Good news!  You're using Sage, hence Python, and there is a huge range of excellent database technology available.   Many object oriented, relational, and noSQL databases have excellent Python interfaces and support, and the Python language supports object serialization.   With Sage you have far more powerful and scalable tools available for storing data to disk, indexing it, and manipulating it, than with any other mathematics software platform out there. 


Topics we will discuss:

  1. pickle: Python object serialization
  2. open: Using the filesystem to store and retrieve data
  3. SQLite: a relational database, included in Sage

 

{{{id=105| /// }}}

If you are just going to fall asleep, the following is the one thing you really need to learn today.   Everything else below just enhances your depth of understanding.

Make a complicated object:

{{{id=107| A = [(2/3, int(5)), matrix(QQ, 1, 4, [1,2,-5/3,8]), sin(x^2)] /// }}}

You can save that one object to a file on disk:

{{{id=104| save(A, '/tmp/A.sobj') /// }}}

You can then load it back from disk:

{{{id=108| load('/tmp/A.sobj') /// [(2/3, 5), [ 1 2 -5/3 8], sin(x^2)] }}}

Cleanup our "mess":

{{{id=109| os.unlink('/tmp/A.sobj') /// }}}

You can also just save A to the current cell, then click to download it to your computer, and possibly load it into another copy of Sage elsewhere.

{{{id=110| save(A, 'A.sobj') /// }}}

The rest of this lecture will give you a bit more depth of understanding about how this works.

{{{id=118| /// }}} {{{id=9| /// }}}

pickle: Python object serialization

Pickling refers to turning almost any object $X$ into a single ugly-to-look-at string s.  You can then save s somewhere, and (hopefully) load it later.    This process is known as object serialization [wikipedia], and is also very important for parallel distributed computation.

Remarks about other mathematics software:

  1. Some Sage objects are wrappers around objects defined in other systems, e.g., Maxima, Pari, Singular, GAP, etc.,   In some case, these may be difficult to pickle.  However, in most cases math software does provide some form of serialization of objects.
  2. In PARI, object data structures are all fairly straightforward, so the print representation of most objects can simply be evaluated to get them back using the eval command.
  3. In Magma, object data structures are very complicated and there is no way to serialize most of them (as far as I know).  There also wasn't even an eval command until fairly recently, but fortunately there is one now.  (On very simple input, the eval in Magma is roughly 10 times slower to call than the eval command in PARI and Python, so watch out.)

Examples:

{{{id=8| import pickle s = pickle.dumps(int(2011)) s /// 'I2011\n.' }}} {{{id=68| type(s) /// }}} {{{id=69| print s /// I2011 . }}}

The loads command turns our pickle back into an object:

{{{id=72| n = pickle.loads(s); n /// 2011 }}} {{{id=75| type(n) /// }}} {{{id=73| /// }}}

The explain_pickle command attempts to produce Sage code that, when evaluated in Sage, produces the same result as unpickling the pickle.

{{{id=70| explain_pickle(s) /// 2011r }}} {{{id=76| /// }}}

Next, let's pickle a more complicated data structure:

{{{id=7| s = pickle.dumps([20r, long(11)]); s /// '(lp0\nI20\naL11L\na.' }}} {{{id=78| print s /// (lp0 I20 aL11L a. }}} {{{id=79| explain_pickle(s) /// [20r, long(11)] }}} {{{id=81| pickle.loads(s) /// [20, 11L] }}} {{{id=80| /// }}}

The pickle of a Sage integer is even more complicated, since the pickle stores the callable that can be used to recreate the integer, along with binary data that efficiently represents the integer (not in base 10!).   The representation is not in a base 10, since base conversion is potentially slow, and all numbers are stored internally in base 2.

{{{id=6| s = pickle.dumps(2011); s /// "csage.rings.integer\nmake_integer\np0\n(S'1ur'\np1\ntp2\nRp3\n." }}} {{{id=83| print s /// csage.rings.integer make_integer p0 (S'1ur' p1 tp2 Rp3 . }}} {{{id=85| explain_pickle(s) /// pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer') pg_make_integer('1ur') }}}

How fast is pickling and unpickling a big Sage integer?

{{{id=86| n = ZZ.random_element(10^1000) # a 1000 digit Sage Integer timeit('s = pickle.dumps(n)') s = pickle.dumps(n) timeit('k = pickle.loads(s)') /// 625 loops, best of 3: 45.9 µs per loop 625 loops, best of 3: 34.4 µs per loop }}}

It takes much longer (ten times longer!) to pickle a Python int.   Part of this is probably base 2 <--> base 10 conversion overhead...

{{{id=90| n = int(n) # same 1000 digit Python int timeit('s = pickle.dumps(n)') s = pickle.dumps(n) timeit('k = pickle.loads(s)') /// 625 loops, best of 3: 476 µs per loop 625 loops, best of 3: 72.9 µs per loop }}} {{{id=84| /// }}}

You can also pickle objects of classes you define...

{{{id=5| class Foo: def __init__(self, x): self.x = x def __repr__(self): return 'Foo x=%s'%self.x f = Foo('2010') s = pickle.dumps(f); s /// "(i__main__\nFoo\np0\n(dp1\nS'x'\np2\nS'2010'\np3\nsb." }}} {{{id=4| C = pickle.loads(s); type(C) /// }}} {{{id=1| C /// Foo x=2010 }}} {{{id=93| /// }}}

BIG FAT WARNING:  The Python modules (code or compiled .so's) that defines objects is NOT stored in the pickled form of the object.  (Pretty obvious with the integer example above!)   If the relevant Python modules don't exist in the right place, then the pickle will simply be useless.

This means that if somebody decides to rename or move some code in Sage, it can easily render pickles useless.  So be careful.  We do have something called "the pickle jar", which helps ensure that in Sage itself this doesn't cause too much trouble.  

Example: All of the state of the Sage notebook used to be stored as pickles of Python classes that are part of the source code of the notebook.   I wanted to move the code of the Sage notebook out of the Sage library, and make the notebook a separate project.  This was nearly impossible because of how I had designed those pickles.   Tim Dumol and I spent over a week writing and testing code to load notebook pickles, them convert the data structures to very simple data structures (e.g., dictionaries, strings) that didn't use any special classes, then resave them.  The resulting new saved pickles can be read by any Python independently of Sage or the notebook.  This makes it possible to move the notebook code out of the Sage library.  However, it is still there (just waiting to confuse you!), in case somebody tries to load an old Sage Notebook instance using a new version of Sage, since we want to migrate the old notebook pickles to the new format.  (This code and capability will be removed soon, since it was over a year ago that the notebook was removed from the Sage library.)

Customization: You can fully customize how any class gets pickled, including Cython classes (where you pretty much have to customize them).  This can make pickling more robust and potentially faster.  Also, careful thought about customizing how objects get pickled can make them more robust in case you change your mind later (the matrix code in Sage is particularly good this way).  The example below illustrates how two seemingly similar classes can have massively difference pickling performance, depending on whether somebody cared to write some fast pickling code.

Moral: For longterm use of data, using pickles is very dangerous and should be avoided if possible.  For shortterm use (over the course of a few minutes, weeks or months), using pickles is incredibly useful.  Think of pickles like a jar of pickles that you buy from the store (and open).  They have to be refrigerators and they have an expiration date.  But they last a while.

{{{id=14| A = random_matrix(Integers(10^100), 200) time s = pickle.dumps(A) /// Time: CPU 6.26 s, Wall: 6.26 s }}}

Here B is exactly the same matrix as A, except the entries are viewed as being in $\ZZ$ instead of $\ZZ/10^{100}\ZZ$.  Yet it pickles 60 times more quickly (somebody should fix this!). 

{{{id=13| B = A.change_ring(ZZ) time t = pickle.dumps(B) /// Time: CPU 0.11 s, Wall: 0.11 s }}} {{{id=11| 6.26/.11 /// 56.9090909090909 }}}

Pickles in Sage

Sage has some convenience functions for working with pickles:  load, save, loads, dumps

There are also save and dumps methods on any classes that derives from SageObject.

The main thing that the load/save/loads/dumps functions in Sage do, over the pickle methods, is they transparently by default do in memory zlib compression. Also, save and load combine pickling with actually writing the pickle string out to a file.  Also, load can load many other types of objects, for example load a pickle off of a webpage.

We illustrate all this below.

{{{id=19| A = matrix(ZZ, 4, 20, [1..80]); A /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=22| len(pickle.dumps(A)) /// 489 }}} {{{id=23| # the sage dumps method compresses by default -- here we get a factor of 2 savings len(dumps(A)) /// 282 }}}

Of course, the compressed version is unreadable to the eye since it is zlib compressed:

{{{id=24| print dumps(A) /// xmNa]p?!6k2 ?ð l,|;I%Wq(׿ш|%b4qycKTnxLv,zؓ jۯ0* UB<ΖYxw&-O̘ ))32"gOAI&j6%hhrFJP^mhXkamubl%sK'D }}}

Compared to:

{{{id=25| print pickle.dumps(A) /// csage.matrix.matrix0 unpickle p0 (csage.matrix.matrix_integer_dense Matrix_integer_dense p1 csage.matrix.matrix_space MatrixSpace p2 (csage.rings.integer_ring IntegerRing p3 (tRp4 I4 I20 I00 tp5 Rp6 csage.structure.mutability Mutability p7 (I00 tp8 Rp9 (dp10 S'1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 1g 1h 1i 1j 1k 1l 1m 1n 1o 1p 1q 1r 1s 1t 1u 1v 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 2g' p11 I0 tp12 Rp13 . }}}

loads can parse both the compressed and uncompressed pickles (it figures out which is right by assuming compressed, getting an error, then trying uncompressed).

{{{id=28| loads(dumps(A)) /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=29| loads(pickle.dumps(A)) /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}}

Compression has a performance penalty:

{{{id=30| timeit('loads(dumps(A))') /// 625 loops, best of 3: 192 µs per loop }}} {{{id=33| timeit('loads(dumps(A,compress=False), compress=False)') /// 625 loops, best of 3: 130 µs per loop }}}

We can save a pickle to a file and load it from a file:

{{{id=97| save(A, 'A.sobj') /// }}} {{{id=34| save(A, '/tmp/A.sobj') /// 34 }}} {{{id=36| load('/tmp/A.sobj') /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=96| os.unlink('/tmp/A.sobj') # clean up /// }}}

We can load a pickle from a webpage too, which is pretty cool:

{{{id=37| X = load('http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj') X /// Attempting to load remote file: http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj Loading: [.] [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=39| X = load('http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj', verbose=False); X /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}}

Conclusion: 

  • Understanding object serialization is useful if you do some research computations, and want to record the results in a way that you can later easily recover. As long as later isn't "too late".  
  • It requires very little thought to use.  save(obj, 'filename.sobj') and load('filename.sobj')
  • You could make a simple "database" that anybody can easily use over the web by: (1) putting a bunch of sobj's on a webpage, and (2) writing a Python function that uses Sage's load command to remotely grab them off that webpage when requested. Very simple. 
{{{id=43| /// }}}

Opening Files

If you want to store a plain string to disk, and load it later, it is critical to master the Python open command.  This is very similar to the C library open command, hence to the open command in most programming languages.  You can use this one builtin Python command to both read and write files, and also to iterate through the lines of a file, seek to given positions, etc. 

{{{id=42| file = open('/tmp/file', 'w'); file /// }}} {{{id=40| file.write("This is a line.") /// }}} {{{id=46| file.close() /// }}} {{{id=45| open('/tmp/file').read() /// 'This is a line.' }}} {{{id=47| file = open('/tmp/file'); file /// }}} {{{id=48| file.seek(3) /// }}} {{{id=49| file.read(4) /// 's is' }}} {{{id=50| file.seek(0) file.read() /// 'This is a line.' }}} {{{id=103| file.close() os.unlink('/tmp/file') /// }}} {{{id=102| /// }}}

One can do a lot with a file, or a bunch of files in a directory.  Don't use a sophisticated database just because you don't understand or know how to use files.  Now you do.  In some cases, they are a great solution.  

 

Pickling + Files: @disk_cached_function

Here's a nice decorator (written by Tom Boothby) that combines files with pickling.  

{{{id=54| disk_cached_function? ///

File: /sagenb/flask/sage-4.6.2/local/lib/python2.6/site-packages/sage/misc/cachefunc.py

Type: <type ‘classobj’>

Definition: disk_cached_function(f)

Docstring:

Decorator for DiskCachedFunction.

EXAMPLES:

sage: dir = tmp_dir()
sage: @disk_cached_function(dir)
... def foo(x): return next_prime(2^x)%x
sage: x = foo(200);x
11
sage: @disk_cached_function(dir)
... def foo(x): return 1/x
sage: foo(200)
11
sage: foo.clear_cache()
sage: foo(200)
1/200
}}} {{{id=61| /// }}} {{{id=51| if os.path.exists('/tmp/factor_cache'): import shutil shutil.rmtree('/tmp/factor_cache') @disk_cached_function('/tmp/factor_cache') def my_factor(n): return factor(n) /// }}} {{{id=57| time my_factor(2^157+1) /// 3 * 15073 * 2350291 * 17751783757817897 * 96833299198971305921 Time: CPU 0.08 s, Wall: 0.08 s }}} {{{id=58| time my_factor(2^157+1) /// 3 * 15073 * 2350291 * 17751783757817897 * 96833299198971305921 Time: CPU 0.00 s, Wall: 0.00 s }}} {{{id=59| os.listdir('/tmp/factor_cache') /// ['my_factor-182687704666362864775460604089535377456991567873.sobj', 'my_factor-182687704666362864775460604089535377456991567873.key.sobj'] }}} {{{id=60| time my_factor(2^157+3) /// 5^3 * 557 * 2623880856967509727475197186205175977838299 Time: CPU 0.02 s, Wall: 0.02 s }}} {{{id=62| os.listdir('/tmp/factor_cache') /// ['my_factor-182687704666362864775460604089535377456991567875.sobj', 'my_factor-182687704666362864775460604089535377456991567875.key.sobj', 'my_factor-182687704666362864775460604089535377456991567873.sobj', 'my_factor-182687704666362864775460604089535377456991567873.key.sobj'] }}} {{{id=63| load('/tmp/factor_cache/%s'%os.listdir('/tmp/factor_cache')[0]) /// (((182687704666362864775460604089535377456991567875,), ()), 5^3 * 557 * 2623880856967509727475197186205175977838299) }}} {{{id=64| load('/tmp/factor_cache/%s'%os.listdir('/tmp/factor_cache')[1]) /// ((182687704666362864775460604089535377456991567875,), ()) }}} {{{id=98| /// }}}

Clean our mess:

{{{id=100| import shutil shutil.rmtree('/tmp/factor_cache') /// }}} {{{id=99| /// }}}

Summary:

  • save/load: If you remember nothing else from today's lecture, remember these two commands, which allow you to very easily store and load most any object.
  • open: It is easy to open and write to and read from files in Python.
  • disk_cached_function: provides a function decorator that makes a function only ever have to evaluate a given input once, and in the future it just remembers the inputs automatically.  It combines pickling with using open to read and write data to the filesystem. You can manage the cached inputs to the function outside of Sage, just by adding files to the cache directory (e.g., if you computed values of a disk_cached_function on different computers, you could just dump all the directories of files that result into a single big directory and have the combined cache).

Next:

  • SQLite: a relational database that is included in Sage.  This is not at all Python specific, but it has excellent support for using it from Python.  It's an extremely popular database -- according to their website it is the most widely deployed database there is.   For example, iPhone apps all use it track their data... 
  • (Maybe) SQLalchemy: an object relational mapper that is included in Sage.  SQLalchemy provides much more Python-friendly support on top of some relational database, but built on SQLite or MySQL or PostgreSQL.
{{{id=66| /// }}} {{{id=65| /// }}}