Math 480: Lecture 26 -- Databases, Part 1
William Stein, 2011-05-25
Topics we will discuss:
{{{id=105| /// }}}
If you are just going to fall asleep, the following is the one thing you really need to learn today. Everything else below just enhances your depth of understanding.
Make a complicated object:
{{{id=107| A = [(2/3, int(5)), matrix(QQ, 1, 4, [1,2,-5/3,8]), sin(x^2)] /// }}}You can save that one object to a file on disk:
{{{id=104| save(A, '/tmp/A.sobj') /// }}}You can then load it back from disk:
{{{id=108| load('/tmp/A.sobj') /// [(2/3, 5), [ 1 2 -5/3 8], sin(x^2)] }}}Cleanup our "mess":
{{{id=109| os.unlink('/tmp/A.sobj') /// }}}You can also just save A to the current cell, then click to download it to your computer, and possibly load it into another copy of Sage elsewhere.
{{{id=110| save(A, 'A.sobj') /// }}}The rest of this lecture will give you a bit more depth of understanding about how this works.
{{{id=118| /// }}} {{{id=9| /// }}}Pickling refers to turning almost any object $X$ into a single ugly-to-look-at string s. You can then save s somewhere, and (hopefully) load it later. This process is known as object serialization [wikipedia], and is also very important for parallel distributed computation.
Remarks about other mathematics software:
Examples:
{{{id=8| import pickle s = pickle.dumps(int(2011)) s /// 'I2011\n.' }}} {{{id=68| type(s) ///The loads command turns our pickle back into an object:
{{{id=72| n = pickle.loads(s); n /// 2011 }}} {{{id=75| type(n) ///The explain_pickle command attempts to produce Sage code that, when evaluated in Sage, produces the same result as unpickling the pickle.
{{{id=70| explain_pickle(s) /// 2011r }}} {{{id=76| /// }}}Next, let's pickle a more complicated data structure:
{{{id=7| s = pickle.dumps([20r, long(11)]); s /// '(lp0\nI20\naL11L\na.' }}} {{{id=78| print s /// (lp0 I20 aL11L a. }}} {{{id=79| explain_pickle(s) /// [20r, long(11)] }}} {{{id=81| pickle.loads(s) /// [20, 11L] }}} {{{id=80| /// }}}The pickle of a Sage integer is even more complicated, since the pickle stores the callable that can be used to recreate the integer, along with binary data that efficiently represents the integer (not in base 10!). The representation is not in a base 10, since base conversion is potentially slow, and all numbers are stored internally in base 2.
{{{id=6| s = pickle.dumps(2011); s /// "csage.rings.integer\nmake_integer\np0\n(S'1ur'\np1\ntp2\nRp3\n." }}} {{{id=83| print s /// csage.rings.integer make_integer p0 (S'1ur' p1 tp2 Rp3 . }}} {{{id=85| explain_pickle(s) /// pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer') pg_make_integer('1ur') }}}How fast is pickling and unpickling a big Sage integer?
{{{id=86| n = ZZ.random_element(10^1000) # a 1000 digit Sage Integer timeit('s = pickle.dumps(n)') s = pickle.dumps(n) timeit('k = pickle.loads(s)') /// 625 loops, best of 3: 45.9 µs per loop 625 loops, best of 3: 34.4 µs per loop }}}It takes much longer (ten times longer!) to pickle a Python int. Part of this is probably base 2 <--> base 10 conversion overhead...
{{{id=90| n = int(n) # same 1000 digit Python int timeit('s = pickle.dumps(n)') s = pickle.dumps(n) timeit('k = pickle.loads(s)') /// 625 loops, best of 3: 476 µs per loop 625 loops, best of 3: 72.9 µs per loop }}} {{{id=84| /// }}}You can also pickle objects of classes you define...
{{{id=5| class Foo: def __init__(self, x): self.x = x def __repr__(self): return 'Foo x=%s'%self.x f = Foo('2010') s = pickle.dumps(f); s /// "(i__main__\nFoo\np0\n(dp1\nS'x'\np2\nS'2010'\np3\nsb." }}} {{{id=4| C = pickle.loads(s); type(C) ///BIG FAT WARNING: The Python modules (code or compiled .so's) that defines objects is NOT stored in the pickled form of the object. (Pretty obvious with the integer example above!) If the relevant Python modules don't exist in the right place, then the pickle will simply be useless.
This means that if somebody decides to rename or move some code in Sage, it can easily render pickles useless. So be careful. We do have something called "the pickle jar", which helps ensure that in Sage itself this doesn't cause too much trouble.
Example: All of the state of the Sage notebook used to be stored as pickles of Python classes that are part of the source code of the notebook. I wanted to move the code of the Sage notebook out of the Sage library, and make the notebook a separate project. This was nearly impossible because of how I had designed those pickles. Tim Dumol and I spent over a week writing and testing code to load notebook pickles, them convert the data structures to very simple data structures (e.g., dictionaries, strings) that didn't use any special classes, then resave them. The resulting new saved pickles can be read by any Python independently of Sage or the notebook. This makes it possible to move the notebook code out of the Sage library. However, it is still there (just waiting to confuse you!), in case somebody tries to load an old Sage Notebook instance using a new version of Sage, since we want to migrate the old notebook pickles to the new format. (This code and capability will be removed soon, since it was over a year ago that the notebook was removed from the Sage library.)
Customization: You can fully customize how any class gets pickled, including Cython classes (where you pretty much have to customize them). This can make pickling more robust and potentially faster. Also, careful thought about customizing how objects get pickled can make them more robust in case you change your mind later (the matrix code in Sage is particularly good this way). The example below illustrates how two seemingly similar classes can have massively difference pickling performance, depending on whether somebody cared to write some fast pickling code.
Moral: For longterm use of data, using pickles is very dangerous and should be avoided if possible. For shortterm use (over the course of a few minutes, weeks or months), using pickles is incredibly useful. Think of pickles like a jar of pickles that you buy from the store (and open). They have to be refrigerators and they have an expiration date. But they last a while.
{{{id=14| A = random_matrix(Integers(10^100), 200) time s = pickle.dumps(A) /// Time: CPU 6.26 s, Wall: 6.26 s }}}Here B is exactly the same matrix as A, except the entries are viewed as being in $\ZZ$ instead of $\ZZ/10^{100}\ZZ$. Yet it pickles 60 times more quickly (somebody should fix this!).
{{{id=13| B = A.change_ring(ZZ) time t = pickle.dumps(B) /// Time: CPU 0.11 s, Wall: 0.11 s }}} {{{id=11| 6.26/.11 /// 56.9090909090909 }}}Pickles in Sage
Sage has some convenience functions for working with pickles: load, save, loads, dumps
There are also save and dumps methods on any classes that derives from SageObject.
The main thing that the load/save/loads/dumps functions in Sage do, over the pickle methods, is they transparently by default do in memory zlib compression. Also, save and load combine pickling with actually writing the pickle string out to a file. Also, load can load many other types of objects, for example load a pickle off of a webpage.
We illustrate all this below.
{{{id=19| A = matrix(ZZ, 4, 20, [1..80]); A /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=22| len(pickle.dumps(A)) /// 489 }}} {{{id=23| # the sage dumps method compresses by default -- here we get a factor of 2 savings len(dumps(A)) /// 282 }}}Of course, the compressed version is unreadable to the eye since it is zlib compressed:
{{{id=24| print dumps(A) /// xmNa]p?!6k2?ðl,|;I%Wq(ш|%b4qycKTnxLv,zؓjۯ0* UB<ΖYxw&-O̘ ))32"gOAI&j6%hhrFJP^mhXkamubl%sK'D }}}Compared to:
{{{id=25| print pickle.dumps(A) /// csage.matrix.matrix0 unpickle p0 (csage.matrix.matrix_integer_dense Matrix_integer_dense p1 csage.matrix.matrix_space MatrixSpace p2 (csage.rings.integer_ring IntegerRing p3 (tRp4 I4 I20 I00 tp5 Rp6 csage.structure.mutability Mutability p7 (I00 tp8 Rp9 (dp10 S'1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 1g 1h 1i 1j 1k 1l 1m 1n 1o 1p 1q 1r 1s 1t 1u 1v 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 2g' p11 I0 tp12 Rp13 . }}}loads can parse both the compressed and uncompressed pickles (it figures out which is right by assuming compressed, getting an error, then trying uncompressed).
{{{id=28| loads(dumps(A)) /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=29| loads(pickle.dumps(A)) /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}}Compression has a performance penalty:
{{{id=30| timeit('loads(dumps(A))') /// 625 loops, best of 3: 192 µs per loop }}} {{{id=33| timeit('loads(dumps(A,compress=False), compress=False)') /// 625 loops, best of 3: 130 µs per loop }}}We can save a pickle to a file and load it from a file:
{{{id=97| save(A, 'A.sobj') /// }}} {{{id=34| save(A, '/tmp/A.sobj') /// 34 }}} {{{id=36| load('/tmp/A.sobj') /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=96| os.unlink('/tmp/A.sobj') # clean up /// }}}We can load a pickle from a webpage too, which is pretty cool:
{{{id=37| X = load('http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj') X /// Attempting to load remote file: http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj Loading: [.] [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}} {{{id=39| X = load('http://wiki.wstein.org/11/480a/5-25?action=AttachFile&do=get&target=A.sobj', verbose=False); X /// [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80] }}}Conclusion:
If you want to store a plain string to disk, and load it later, it is critical to master the Python open command. This is very similar to the C library open command, hence to the open command in most programming languages. You can use this one builtin Python command to both read and write files, and also to iterate through the lines of a file, seek to given positions, etc.
{{{id=42| file = open('/tmp/file', 'w'); file ///One can do a lot with a file, or a bunch of files in a directory. Don't use a sophisticated database just because you don't understand or know how to use files. Now you do. In some cases, they are a great solution.
Here's a nice decorator (written by Tom Boothby) that combines files with pickling.
{{{id=54| disk_cached_function? ///File: /sagenb/flask/sage-4.6.2/local/lib/python2.6/site-packages/sage/misc/cachefunc.py
Type: <type ‘classobj’>
Definition: disk_cached_function(f)
Docstring:
Decorator for DiskCachedFunction.
EXAMPLES:
sage: dir = tmp_dir() sage: @disk_cached_function(dir) ... def foo(x): return next_prime(2^x)%x sage: x = foo(200);x 11 sage: @disk_cached_function(dir) ... def foo(x): return 1/x sage: foo(200) 11 sage: foo.clear_cache() sage: foo(200) 1/200
Clean our mess:
{{{id=100| import shutil shutil.rmtree('/tmp/factor_cache') /// }}} {{{id=99| /// }}}Summary: