Math 581d: 2010-12-03 -- Databases for Math Research -- MongoDB

William Stein

 

The last database we'll talk about is MongoDB.  It's another example of a NoSQL database (so SQL isn't used).  It is far more powerful than the key:value stores we've talked about (you can query on the contents of the stored documents and build indexes on their fields), it is designed to scale, and it is very efficient.   I ran many benchmarks comparing MongoDB with other very fast databases (e.g., Tokyo Cabinet), and, surprisingly, for large numbers of small records MongoDB is as good or better.

Getting Started

MongoDB does not come with Sage.  To use it, you must install it yourself.

  1. Install the MongoDB program if you want to run your own MongoDB server.   You have to put the mongodb binaries somewhere in your PATH. 
  2. Install the pymongo Python package, so you can use MongoDB from Sage.   You just type "sage: !easy_install pymongo" and wait a few seconds. 
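
A quick way to confirm that step 2 worked is to ask Python whether the package can be found.  The sketch below uses only the standard library; the helper name have_module is made up for illustration.  (On a current Python installation, "pip install pymongo" is the usual replacement for easy_install.)

```python
import importlib.util

def have_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None

if have_module("pymongo"):
    print("pymongo is available")
else:
    print("pymongo is not installed yet")
```
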
{{{id=6| /// }}}

Try it Out

How:

  1. Make sure you have MongoDB installed on your computer.
  2. Type the following to start your own MongoDB server:  
    mkdir -p /tmp/mongotest
    mongod --dbpath /tmp/mongotest/ --bind_ip localhost --port 29000
    
  3. Then proceed as below (in Sage).
{{{id=4|
import pymongo
connection = pymongo.Connection('localhost:29000')
db = connection.db
F = db.factorizations
///
}}}
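
The cell above uses the pymongo API as it existed when these notes were written.  In PyMongo 3 and later, pymongo.Connection no longer exists and MongoClient takes its place; likewise insert, count and ensure_index were later replaced by insert_one/insert_many, count_documents and create_index.  Here is a minimal sketch of the same connection step with the newer API, assuming the server from step 2 is listening on localhost:29000.  The function name get_factorizations is made up for illustration.

```python
try:
    from pymongo import MongoClient    # PyMongo 3.x and later
except ImportError:                    # pymongo is not installed
    MongoClient = None

def get_factorizations(host="localhost", port=29000):
    """Return the `factorizations` collection of the database named `db`."""
    if MongoClient is None:
        raise RuntimeError("pymongo is not installed")
    # Creating the client does not block waiting for the server;
    # the connection is established when the first operation runs.
    client = MongoClient(host, port)
    return client["db"]["factorizations"]
```

With a server running, F = get_factorizations() plays the same role as F in the cell above.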

Now:

{{{id=11|
F.drop()  # empty contents of this collection
///
}}}

{{{id=3|
for p in primes(150):
    n = 2^p-1
    f = factor(n)
    # You insert a document by calling insert with a document
    # or list of documents to insert.
    doc = {'n':str(n), 'factor':[(str(q),int(e)) for (q,e) in f], 'shape':"2^%s-1"%p}
    z = F.insert(doc, check=True)
///
}}}

{{{id=1|
F.count()
///
35
}}}
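
To see exactly what one of these documents looks like without Sage or a running server, here is a sketch in plain Python.  It swaps Sage's factor for a naive trial-division routine (so it is only sensible for small p), and the names trial_factor and make_doc are made up for illustration.

```python
def trial_factor(n):
    """Factor n > 1 by trial division; return a list of (prime, exponent)."""
    factors = []
    d = 2
    while d * d <= n:
        e = 0
        while n % d == 0:
            n //= d
            e += 1
        if e:
            factors.append((d, e))
        d += 1
    if n > 1:
        factors.append((n, 1))   # whatever is left over is prime
    return factors

def make_doc(p):
    """Build the document for 2^p - 1 with the same fields as above."""
    n = 2**p - 1   # plain Python needs ** where the Sage preparser accepts ^
    return {
        "n": str(n),
        "factor": [(str(q), int(e)) for (q, e) in trial_factor(n)],
        "shape": "2^%s-1" % p,
    }

print(make_doc(29))
```

Every value in the document is a string, an int, or a list of those, which is what makes it safe to hand to the database.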

You typically query the database by constructing a query dictionary.

{{{id=15|
query = {'n':str(2^29-1)}
results = F.find(query); results
///
}}}

The result of the query is a cursor, which is an iterator over the matching documents:

{{{id=13|
results.next()
///
{u'n': u'536870911', u'shape': u'2^29-1', u'_id': ObjectId('4cf87bcb8c667a74970000ba'), u'factor': [[u'233', 1], [u'1103', 1], [u'2089', 1]]}
}}}

{{{id=23|
results.next()
///
Traceback (most recent call last):
  File "", line 1, in
  File "_sage_input_64.py", line 10, in
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("cmVzdWx0cy5uZXh0KCk="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))' + '\n', '', 'single')
  File "", line 1, in
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpEqb476/___code___.py", line 2, in
    exec compile(u'results.next()' + '\n', '', 'single')
  File "", line 1, in
  File "build/bdist.macosx-10.6-i386/egg/pymongo/cursor.py", line 604, in next
StopIteration
}}}

{{{id=32|
F.find({'shape':'2^19-1'}).next()
///
{u'n': u'524287', u'shape': u'2^19-1', u'_id': ObjectId('4cf87bcb8c667a74970000b8'), u'factor': [[u'524287', 1]]}
}}}

{{{id=34|
///
}}}
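
The StopIteration above is just the ordinary Python iterator protocol: once the matching documents are used up, asking for another one raises it.  The sketch below shows the same behavior with a plain generator standing in for the cursor (fake_cursor is made up for illustration, and no database is involved), along with two ways to avoid the exception.

```python
def fake_cursor(docs):
    """Stand-in for a query result: yields each matching document once."""
    for doc in docs:
        yield doc

results = fake_cursor([{"n": "536870911", "shape": "2^29-1"}])

first = next(results)          # what results.next() does in Python 2
second = next(results, None)   # a default value instead of StopIteration

# A for loop (or a list comprehension) stops cleanly when the results run out.
shapes = [doc["shape"]
          for doc in fake_cursor([{"shape": "2^19-1"}, {"shape": "2^29-1"}])]
```

In practice you would usually loop over the result of F.find(...) rather than call next by hand.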

You can insert any documents you want into F.  There's no fixed "schema" that they all have to have.

{{{id=25|
F.insert({"description":"This is a collection of factorizations of numbers.", "author":"William Stein"})
///
ObjectId('4cf87f598c667a7497000186')
}}}

{{{id=22|
F.count()
///
36
}}}

{{{id=36|
%time
# First, we make a list of the factorizations.  We do not bother to set the 'shape'
# field below, since these are not factorizations of numbers of a special form.  Note
# that we take care to only store basic types (strings, ints, lists, dicts, etc.) in
# the database.
v = []
for n in range(1,10^5):
    f = factor(n)
    v.append({'n':str(n), 'factor':[(str(q),int(e)) for (q,e) in f]})
///
CPU time: 16.60 s,  Wall time: 16.62 s
}}}
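
One reason to store n and the prime factors as strings: the integer types in BSON (the format MongoDB stores documents in) are at most 64 bits wide, so a number like 2^149-1 cannot be stored as an integer at all.  Here is a small sketch of the range check and of the round trip through a string; the helper name fits_in_int64 is made up for illustration.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def fits_in_int64(n):
    """True if n can be represented as a signed 64-bit integer."""
    return INT64_MIN <= n <= INT64_MAX

small = 2**29 - 1
big = 2**149 - 1

print(fits_in_int64(small))   # True
print(fits_in_int64(big))     # False

# Converting to a string and back loses nothing.
assert int(str(big)) == big
```

Storing every n as a string, even the small ones, also keeps the field's type uniform, so a query like {'n': '2010'} behaves the same way for every document.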

We do the actual insert in one single call to the database.

{{{id=28|
# manipulate=False speeds it up a little
time z = F.insert(v, manipulate=False, check_keys=False)
///
Time: CPU 1.04 s, Wall: 1.04 s
}}}

{{{id=30|
F.find({'description': {'$exists':True}}).next()
///
{u'_id': ObjectId('4cf87be48c667a74970000d4'), u'description': u'This is a collection of factorizations of numbers.', u'author': u'William Stein'}
}}}

{{{id=31|
///
}}}

The queries you can express are very flexible.  See the query page of the MongoDB documentation.
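
To make the meaning of a query dictionary concrete, here is a toy in-memory model of the two kinds of query used in these notes: exact match on a field, and the $exists operator.  It is only a sketch of the semantics, not how MongoDB evaluates queries, and it ignores the many other operators; the names matches and find are made up for illustration.

```python
_MISSING = object()   # sentinel meaning "the field is not present"

def matches(doc, query):
    """Simplified model: every condition in `query` must hold for `doc`."""
    for key, cond in query.items():
        if isinstance(cond, dict) and "$exists" in cond:
            if (key in doc) != bool(cond["$exists"]):
                return False
        elif doc.get(key, _MISSING) != cond:
            return False
    return True

def find(docs, query):
    """Return the documents in `docs` that satisfy `query`."""
    return [doc for doc in docs if matches(doc, query)]

docs = [
    {"n": "524287", "shape": "2^19-1", "factor": [["524287", 1]]},
    {"n": "2010", "factor": [["2", 1], ["3", 1], ["5", 1], ["67", 1]]},
    {"description": "This is a collection of factorizations of numbers.",
     "author": "William Stein"},
]

print(find(docs, {"shape": "2^19-1"}))
print(find(docs, {"description": {"$exists": True}}))
```
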

{{{id=29| /// }}}

You can also make an index on a given field, which matters greatly once the database gets bigger.  To illustrate this, we use the collection as it now stands, with the roughly 100000 factorizations inserted above.

{{{id=20|
F.count()
///
100035
}}}

{{{id=18|
timeit("F.find({'n':str(randint(2,10000))})")
///
625 loops, best of 3: 15.7 µs per loop
}}}

{{{id=19|
F.ensure_index('n')
///
u'n_1'
}}}

{{{id=27|
timeit("F.find({'n':str(randint(2,10000))})")
///
625 loops, best of 3: 15.5 µs per loop
}}}

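The two timings are essentially identical.  Note that F.find only constructs a cursor; the query is not sent to the server until you ask for a result, so these timings do not measure the lookup itself.  For a fair comparison, time something that actually fetches a document, such as F.find_one(...) or F.find(...).next().  One hundred thousand records with such simple keys is also still a fairly small collection.  For larger data sets, ensure_index is absolutely critical.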
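
Why an index helps can be seen with a toy model in plain Python: without an index, a lookup has to examine documents one at a time, while with an index it can jump straight to the right one.  This is only an analogy (a Python dict is a hash table, whereas MongoDB's indexes are B-trees), and the names scan and lookup are made up for illustration.

```python
# The same shape of data as above: one document per n, with n stored as a string.
docs = [{"n": str(n)} for n in range(1, 100000)]

def scan(docs, value):
    """Unindexed lookup: return (document, number of documents examined)."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["n"] == value:
            return doc, examined
    return None, examined

# "Building the index" is a one-time cost paid up front.
index = {doc["n"]: doc for doc in docs}

def lookup(index, value):
    """Indexed lookup: a single dictionary access."""
    return index.get(value)

doc, examined = scan(docs, "2010")
print(examined)                       # documents touched by the scan
print(lookup(index, "2010") is doc)   # same document, found directly
```

The cost of the scan grows with the size of the collection, while the indexed lookup stays cheap, which is the effect ensure_index is meant to achieve.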

We can also do almost everything from the mongo command line prompt, which is literally a JavaScript interpreter.

MongoDB shell version: 1.6.3
connecting to: localhost:29000/test
> use db
switched to db db
> db.factorizations
db.factorizations
> f = db.factorizations
db.factorizations
> z=0; for(i=1;i<=4;i++) { z+=i; }; z     /* Look, javascript! */
10
> f.find({'n':'2010'})               
{ "_id" : ObjectId("4cf87de0f94482457dbfbbbe"), "factor" : [ [ "2", 1 ], [ "3", 1 ], [ "5", 1 ], [ "67", 1 ] ], "n" : "2010" }
> f.find({'shape':'2^29-1'})
{ "_id" : ObjectId("4cf87dd78c667a749700016c"), "shape" : "2^29-1", "factor" : [ [ "233", 1 ], [ "1103", 1 ], [ "2089", 1 ] ], "n" : "536870911" }
> f.find({"description":{"$exists":true}})
{ "_id" : ObjectId("4cf87f598c667a7497000186"), "description" : "This is a collection of factorizations of numbers.", "author" : "William Stein" }

{{{id=38| /// }}}

Next lecture: discussion of the overall architecture for the modular forms and L-functions database project: database + webserver, etc.