TITLE: The Architecture I am Using for my Next Modular Forms Database
SPEAKER: William Stein
DATE: October 2010

------------------------------------

ABSTRACT: I have oodles of data, available on various web pages,
computed in files only I know how to use, and I have the potential to
generate much new data.  This year, I am putting all of this data into
a massive database server, and making everything available in a
queryable form on the web.  Thanks to the National Science Foundation,
I have no real worries about hardware; I can easily allocate several
terabytes of very fast disk space to this database, and have several
thousand dollars budgeted that can be used to buy extra computers and
disks for backup purposes.  Last time I seriously pursued putting
together a big database like this was in 2003, and technology has come
a long way since then.  Computers are 64-bit, which helps enormously
with scaling, and there are finally useful document-oriented
databases.  This talk is about the technological architecture I intend
to use to put together this cutting-edge database, and I hope it will
be of interest to other people in our Focused Research Group (FRG),
who have similar goals.  I am very willing to help them get going with
this technology.   (This is joint work with Mike Hansen.)

--------------------------------------------------------
Part 1. Overall Architecture
--------------------------------------------------------

 1. Database: * MongoDB master -- disk.math.washington.edu (in Seattle)
              * MongoDB slave 1 -- in Seattle on William Stein's OS X desktop (?)
              * MongoDB slave 2 -- in Waterloo on Mike Rubinstein's OS X computer

    I will attempt to limit the database footprint for this project to
    4 terabytes, so that a single $350 USB disk plugged into any
    computer (Linux, OS X, etc.)  can serve as a redundant MongoDB
    slave.  No sharding will be used for my project.

    The master MongoDB server will run directly on our 24TB fileserver
    listening only on localhost; this machine has very few user
    accounts. A user who needs direct access to the database will have
    their ssh key added to a limited account on this machine, and via
    ssh port forwarding, they will be able to access the database,
    using a login and password that gives them access to a subset of
    the databases or collections served by MongoDB.  Note that a
    single MongoDB server can simultaneously serve numerous
    completely independent databases.

 2. Web Interface:
              * Flask microframework: http://flask.pocoo.org/
              * Apache: via mod_wsgi
 
    Mike Hansen and I are using the Flask Python library (Flask is
    from the same group that brought us Jinja, Sphinx, etc.) to
    develop a web front end to the database that will enable
    anybody to easily make fast queries on certain collections of
    data, and easily scan through the results.  We will create indexes
    in the MongoDB database that specifically support the queries that
    are available through the web interface.  We will deploy our Flask
    application using Apache's mod_wsgi module, which is quick and
    scalable.

 3. Programmatic Interface:
 
    We will set up a read-only MongoDB slave server on a separate
    machine, which will be available for sophisticated users that wish
    to make arbitrary queries against the database, or use Sage to
    grab objects out of the database.  Some interesting queries (e.g.,
    map-reduce, which can run JavaScript on millions of documents) can
    take a long time and put a heavy load on the database server, but
    since this will be a separate server, such queries will have no
    impact at all on our master MongoDB server.

    MongoDB has official support for fully accessing a MongoDB server
    using any of C, C++, Java, JavaScript, Perl, PHP, Python (hence
    Sage!), and Ruby.  There are numerous other languages that are not
    officially supported, but are listed here:
    http://www.mongodb.org/display/DOCS/Drivers
    Unfortunately, no other math software, e.g., Magma, Mathematica, Maple, 
    or Matlab, is in that list.
    

WEB UPLOAD? 

   Data upload is also done through the programmatic interface (item
   3 above), but with a connection to the master server instead of a
   slave.  There is definitely no web page
   upload for data in this model, and I have no interest or plans in
   creating such a thing as part of this architecture, due to the
   security issues.  However, if somebody else makes one, they could
   act as an "editor" and submit the results of uploads to the MongoDB
   database.


--------------------------------------------------------
Part 2. The Database -- MongoDB  
--------------------------------------------------------

David Farmer has put forth an idea that "the basic building blocks in
the project are the individual homepages of each object of interest."
MongoDB (http://www.mongodb.org/) is a relatively new
*document-oriented* database management system, hence much different than a SQL
database such as SQLite, MySQL, or PostgreSQL.  MongoDB documents
correspond to Farmer's idea of homepages.  With MongoDB, not only can
you store and retrieve *documents*, you can also build indexes and do
elaborate optimized queries.  Also, all data can be automatically
replicated on any number of backup servers.
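
As a rough sketch of the "build indexes" step, the snippet below keeps
one index specification per web query and creates each of them on a
pymongo-style collection.  The field names (level, iso_class, number)
are borrowed from the ellcurves example later in this talk; the helper
name, the particular list of indexes, and the stand-in collection class
are assumptions made for illustration, so that the sketch runs without
a MongoDB server.

```python
# Sketch only: which indexes are actually needed depends on the real queries.
ASCENDING = 1  # same value as pymongo.ASCENDING

# One entry per web query that should be fast (hypothetical choice).
WEB_QUERY_INDEXES = [
    [("level", ASCENDING)],
    [("level", ASCENDING), ("iso_class", ASCENDING), ("number", ASCENDING)],
]


def create_web_indexes(collection, specs=WEB_QUERY_INDEXES):
    """Call create_index once per spec and return the resulting index names."""
    return [collection.create_index(keys) for keys in specs]


class RecordingCollection(object):
    """Stand-in for a pymongo collection; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys):
        self.calls.append(keys)
        # Build a name in the "field_direction" style that pymongo also uses.
        return "_".join("%s_%d" % (field, direction) for field, direction in keys)


if __name__ == "__main__":
    print(create_web_indexes(RecordingCollection()))
```

With a real server, one would pass an actual collection object (for
example the ellcurves collection of the research database) in place of
the stand-in.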

I've tested using MongoDB to deal with tons of data I generated this
summer, related to modular forms, for a project with Barry Mazur.  I
also tested putting all of Cremona's tables of elliptic curves and the
Stein-Watkins tables of elliptic curves in a single big MongoDB
database.  It made a vast amount of data (hundreds of gigabytes) feel
"small".  This "feel" is critical to a database solution for this
project, and I've never had this feeling before with any other
database I've seriously used, which includes: PostgreSQL, MySQL,
sqlite, ZODB, and custom filesystem based stores.

How to learn about MongoDB: Go to http://www.mongodb.org/ and start
browsing.  There are tons of quickstarts, tutorials, articles, and
videos of talks, slides, etc.  Though MongoDB is free and open source
(and written in C++), there is a company behind MongoDB, which does a
lot of proselytizing.  There are also some not-quite-finished books
about MongoDB; I read them by temporarily signing up for an 
O'Reilly Safari books membership (my.safaribooksonline.com), reading
them, then unsubscribing.  They may have been published by now.

(NOTE: MongoDB has essentially only one competitor, which is the
"apache CouchDB" project: http://couchdb.apache.org/.)

Setting up your own simple MongoDB server is easy:

   1. Download binaries from http://www.mongodb.org/downloads and put 
      them somewhere in your PATH.  They are available for Linux, 
      OS X, Windows, and Solaris.

   2. Start a MongoDB server running by typing "mongod".

      TECHNICAL NOTES: I usually type something more involved:

        mongod --dbpath /lvm/array/lmfdb/mongodb --bind_ip localhost  --port 29000

        The dbpath option specifies where the files for the database
        are stored, and the bind_ip and port options make mongod
        accept connections only on localhost port 29000; otherwise,
        anybody in the world could just connect to your mongodb and
        delete all your data!!  If you want to run mongod on a remote
        server somewhere, but easily connect to it from your laptop
        (say), set up an ssh tunnel by simply typing:

            ssh -L 29000:localhost:29000 remote.computer.edu

        Then you can pretend that port 29000 on your laptop *is* port
        29000 on the remote server, and things will just work.  It's
        also possible to create accounts with various permissions from
        the mongo console -- see the mongo documentation for details.
        However, if you set up accounts, make sure to use an ssh tunnel
        whenever using them, since mongod itself doesn't use a secure
        socket, so your password would get sent in the clear.

   3. You can connect to your new MongoDB server with the mongo console:

        wstein@disk$ mongo localhost:29000
        MongoDB shell version: 1.6.1
        connecting to: localhost:29000/test
        > show dbs
        admin
        local
        research
        > help 
                db.help()                    help on db methods
                ...

   4. More importantly, you can also connect from Sage (or any Python):

        (a) If you have not already done so, install pymongo, which
        takes about 10 seconds:

              sage: !easy_install pymongo
              Searching for pymongo
              Reading http://pypi.python.org/simple/pymongo/
              Reading http://github.com/mongodb/mongo-python-driver
              Best match: pymongo 1.9
              Downloading http://pypi.python.org/packages/source/p/pymongo/pymongo-1.9.tar.gz#md5=12e12163e6cc22993808900fb9629252
              Processing pymongo-1.9.tar.gz
              Running pymongo-1.9/setup.py -q bdist_egg --dist-dir /tmp/easy_install-nIodTu/pymongo-1.9/egg-dist-tmp-gCPRfG
              warning: no files found matching '*.h' under directory 'pymongo'
              bson/time64.c:279: warning: ‘check_tm’ defined but not used
              zip_safe flag not set; analyzing archive contents...
              Adding pymongo 1.9 to easy-install.pth file

              Installed /usr/local/sage/sage-4.6.alpha1/local/lib/python2.6/site-packages/pymongo-1.9-py2.6-linux-x86_64.egg
              Processing dependencies for pymongo
              Finished processing dependencies for pymongo
              
              sage: quit   # important! 

        (b) Now use it:
              sage: import pymongo
              sage: C = pymongo.Connection('localhost:29000')
              sage: C.database_names()
              [u'research', u'admin', u'local']
              sage: R = C.research; R
              Database(Connection('localhost', 29000), u'research')
              sage: R.[tab key]
              R.add_son_manipulator  R.create_collection    R.logout               R.remove_user
              R.add_user             R.dereference          R.name                 R.reset_error_history
              R.authenticate         R.drop_collection      R.next                 R.set_profiling_level
              R.collection_names     R.error                R.previous_error       R.system_js
              R.command              R.eval                 R.profiling_info       R.validate_collection
              R.connection           R.last_status          R.profiling_level      
              sage: R.collection_names()
              [u'mazur_irreg.done', u'system.indexes', u'mazur_irreg', u'mazur_irreg2.done',
               u'mazur_irreg2', u'mazur_irreg3.done', u'mazur_irreg3', u'mazur_irreg.f2_multiplicities',
               u'ellcurves', u'heegner_point_heights', u'shimura_curves',
               u'fs.chunks', u'fs.files']
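
The ssh tunnel from the technical notes in step 2 can also be scripted.
The sketch below only builds the command line; it mirrors the
"ssh -L 29000:localhost:29000 remote.computer.edu" example above.  The
function names, the optional separate local port, and the optional -N
flag (a standard ssh option that forwards ports without running a
remote command) are additions for illustration and not part of the
setup described in this talk.

```python
def tunnel_command(remote_host, port=29000, local_port=None, no_shell=False):
    """Build the argv list for an ssh tunnel to a mongod bound to localhost."""
    if local_port is None:
        local_port = port              # same port on both ends, as in the talk
    argv = ["ssh"]
    if no_shell:
        argv.append("-N")              # forward ports only, run no remote command
    argv += ["-L", "%d:localhost:%d" % (local_port, port), remote_host]
    return argv


def local_address(port=29000):
    """Address to give the mongo console or pymongo once the tunnel is up."""
    return "localhost:%d" % port


if __name__ == "__main__":
    print(" ".join(tunnel_command("remote.computer.edu")))
    print(local_address())
```

The returned list can be handed to subprocess.Popen to start the
tunnel from Python, after which local_address() is what you would pass
to the mongo console or to pymongo.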


DATABASES, COLLECTIONS, and DOCUMENTS:

    A MongoDB server serves a collection of completely *independent*
    databases.  A database is a set of *collections*, and a collection
    is a set of documents.  A document is basically like a
    Python dictionary, but only a limited number of datatypes are
    allowed.  Technically, a document is a "BSON" document, where BSON
    is a slight binary generalization of the JavaScript notion of
    JSON.
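
    To make the "limited number of datatypes" concrete: BSON integers
    are at most 64 bits and document keys must be strings, which
    matters for number-theoretic data.  The sketch below is one
    possible way to coerce a Python dictionary into something BSON can
    hold.  Storing oversized integers (and any unknown type) as
    strings is an assumption of this sketch, not a rule of MongoDB,
    and the field names ainvs and big_invariant are hypothetical.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def bson_friendly(value):
    """Recursively convert a Python value into types that BSON can represent."""
    if value is None or isinstance(value, (bool, float, str, bytes)):
        return value
    if isinstance(value, int):
        # BSON has no arbitrary-precision integers; fall back to a decimal string.
        return value if INT64_MIN <= value <= INT64_MAX else str(value)
    if isinstance(value, dict):
        return {str(k): bson_friendly(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [bson_friendly(v) for v in value]   # BSON has one array type
    return str(value)                              # last resort for other objects


curve = {
    "level": 11, "iso_class": "a", "number": 1,    # fields used later in this talk
    "ainvs": (0, -1, 1, -10, -20),                 # hypothetical field name
    "big_invariant": 2**70,                        # hypothetical; exceeds 64 bits
}

if __name__ == "__main__":
    print(bson_friendly(curve))
```

    Values stored as strings can no longer be compared numerically by
    the server, so in practice one would keep anything that has to be
    queried by range within the 64-bit limit.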

DOCUMENTS are limited:

    A MongoDB document must be at most 4MB in size.  Let's push
    the limits, to see what this means in practice:

             sage: foo = R.foo
             sage: foo.insert({'test':'a'*(4*10^6)})
             ObjectId('4cae369075688b3eab000006')
             sage: foo.insert({'test':'a'*(5*10^6)})
             Traceback (most recent call last):
             ...
             InvalidDocument: document too large - BSON documents are limited to 4 MB

    So you could store a string with 4 million characters, but not
    5 million; for reference, this is about 1,000 typed pages of text.
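
    A rough pre-check along these lines can be done before calling
    insert.  This is only a heuristic sketch: it assumes that "4 MB"
    means 4 * 1024 * 1024 bytes, and it reserves a made-up allowance
    of 1024 bytes for the rest of the document, since the exact BSON
    size is only known after encoding.  Note that plain Python writes
    powers as 10**6, whereas the Sage session above uses 10^6.

```python
MAX_DOCUMENT_BYTES = 4 * 1024 * 1024   # assumed meaning of the 4 MB limit


def fits_in_document(payload, overhead=1024):
    """Guess whether a single string or bytes field fits in one document.

    `overhead` is a hypothetical allowance for field names and other fields.
    """
    if isinstance(payload, str):
        size = len(payload.encode("utf-8"))
    else:
        size = len(payload)
    return size + overhead <= MAX_DOCUMENT_BYTES


if __name__ == "__main__":
    print(fits_in_document("a" * (4 * 10**6)))
    print(fits_in_document("a" * (5 * 10**6)))
```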
              
HOW TO STORE BIG STUFF:

    The "fs.chunks" and "fs.files" collections above are created
    automatically in MongoDB to implement something called "GridFS" =
    "Grid Filesystem".  As mentioned above, each MongoDB document must
    be at most 4MB in size, but GridFS allows you to get around this
    and store gigantic data in MongoDB, but with no indexing and
    searching capabilities.  It's basically just a key:value store,
    built on top of MongoDB.

    Here is how to use it (continuing our example):

        sage: import gridfs
        sage: G = gridfs.GridFS(R)
        sage: G.put('a'*(5*10^6), filename='test1')
        ObjectId('4cae3ba075688b3eab000008')
        sage: x = G.get_last_version('test1').read()
        sage: len(x)
        5000000
        sage: x[:30]
        'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

    You can store arbitrary Sage objects using dumps and loads:

        sage: M = ModularSymbols(389,2)
        sage: G.put(dumps(M), filename='modsym389')
        ObjectId('4cae3c1a75688b3eab00001d')
        sage: loads(G.get_last_version('modsym389').read())
        Modular Symbols space of dimension 65 for Gamma_0(389) of weight 2 with sign 0 over Rational Field
           
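    By default there is one GridFS per database (pymongo's GridFS
    constructor also accepts a collection argument that selects a
    collection prefix other than "fs"), so if you have documents in
    all sorts of collections that somehow point to GridFS "files",
    you'll need to choose some systematic way of naming the files.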
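
    One possible naming scheme, sketched below, builds the GridFS
    filename from the collection name, the key of the document that
    points to the file, and a short label for what the file holds.
    The scheme, the "/" separator, and both function names are
    assumptions made for illustration; any convention works as long
    as it is applied consistently.

```python
def gridfs_name(collection, key, kind):
    """Build a filename such as 'ellcurves/11/a/1/modsym'."""
    parts = [collection] + [str(k) for k in key] + [kind]
    for part in parts:
        if not part or "/" in part:
            raise ValueError("empty part or '/' in %r" % (part,))
    return "/".join(parts)


def parse_gridfs_name(name):
    """Inverse of gridfs_name; key components come back as strings."""
    parts = name.split("/")
    if len(parts) < 2:
        raise ValueError("not a name built by gridfs_name: %r" % (name,))
    return parts[0], tuple(parts[1:-1]), parts[-1]


if __name__ == "__main__":
    name = gridfs_name("ellcurves", (11, "a", 1), "modsym")
    print(name)
    print(parse_gridfs_name(name))
```

    The resulting string is what would be passed as the filename
    argument of G.put and G.get_last_version in the session above.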
            

--------------------------------------------------------
Part 3. The Web Interface -- FLASK, mod_wsgi, apache
--------------------------------------------------------

FLASK: "Flask is a micro webdevelopment framework for Python." 

       http://flask.pocoo.org/

Here is "hello world" written using Flask:

       from flask import Flask
       app = Flask(__name__)
       
       @app.route("/")
       def hello():
           return "Hello World!"
       
       if __name__ == "__main__":
           app.run()

To install Flask into any Python instance, just do "easy_install Flask".  For example,
to make Flask work in Sage, do 
       
      sage -sh
      easy_install Flask

and you got it.

I don't have time in this talk to go into detail about how to use Flask in
general.  The documentation at their website is excellent.  In short,
you use decorators to construct the URL mapping, deal with GET and
POST requests, etc.  You can also put static/ and templates/
subdirectories in your Python project, and relevant files will get
pulled from there, e.g., for static HTML and Jinja2 templates (Flask
is heavily tied to Jinja2).
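
To show the idea behind decorator-based URL mapping without requiring
Flask, here is a toy router that uses only plain Python.  It is not
Flask's implementation and the class name TinyRouter is made up; it
only mimics the shape of app.route, including a methods argument like
the one Flask's route decorator accepts for GET and POST.

```python
class TinyRouter(object):
    """Toy illustration of decorator-based routing (not Flask itself)."""

    def __init__(self):
        self.routes = {}                       # (method, path) -> handler

    def route(self, path, methods=("GET",)):
        def decorator(func):
            for method in methods:
                self.routes[(method, path)] = func
            return func                        # leave the function usable as-is
        return decorator

    def dispatch(self, method, path, **kwargs):
        handler = self.routes.get((method, path))
        if handler is None:
            return "404 Not Found"
        return handler(**kwargs)


app = TinyRouter()


@app.route("/")
def hello():
    return "Hello World!"


@app.route("/search", methods=("GET", "POST"))
def search(level=11):
    return "curves of level %d" % level


if __name__ == "__main__":
    print(app.dispatch("GET", "/"))
    print(app.dispatch("POST", "/search", level=37))
```

In real Flask the dispatch step is driven by incoming HTTP requests,
and handlers read their parameters from the request object instead of
keyword arguments.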

Mike Hansen and I are building a demo Flask site that illustrates the
architecture sketched above and provides access to a large table of
over a hundred million elliptic curves, consisting of the union of the
Cremona tables and the Stein-Watkins tables.  This section describes
this demo in detail.  This will in fact form the core for the new
modular forms database, though I'm sure much of it will get
rewritten and polished.   In the rest of this talk, I'll give a quick
demo of this site and a walk-through of the code.

The *DEMO* is here:

      http://db.modform.org/

Try it out!  The rest of this talk is about the architecture of this
website, which is running on boxen.math.washington.edu, in Seattle. 

APACHE/WSGI setup:

(1) We created a file

         /etc/apache2/sites-available/lmfdb
         
    with the contents:

        NameVirtualHost db.modform.org:80
        <VirtualHost db.modform.org:80>
            ServerName db.modform.org
            WSGIDaemonProcess lmfdb threads=5
            WSGIScriptAlias / /home/mhansen/lmfdb/lmfdb.wsgi
            <Directory /home/mhansen/lmfdb>
                WSGIProcessGroup lmfdb
                WSGIApplicationGroup %{GLOBAL}
                Order deny,allow
                Allow from all
            </Directory>
        </VirtualHost>

    and made a symbolic link in sites-enabled that points to it:

         /etc/apache2/sites-enabled/lmfdb --> /etc/apache2/sites-available/lmfdb

(2) The WSGI application is defined by this file:
  
         http://sage.math.washington.edu/home/mhansen/lmfdb/lmfdb.wsgi

    The main thing that this file has to do is define some object called "application"
    which will obey the WSGI protocol.   There are a few other things in there to let
    it know about the virtual environment where Mike has Flask, etc. installed.  Here
    are the contents of the file:

        import os, sys
        sys.path.append('/home/mhansen/lmfdb')
        os.environ['PYTHON_EGG_CACHE'] = '/home/mhansen/lmfdb/.python-eggs'

        activate_this = '/home/mhansen/lmfdb/env/bin/activate_this.py'
        execfile(activate_this, dict(__file__=activate_this))

        from lmfdb import app as application

    Note that this file doesn't really make sense out of context, and it is important
    to look at the files mentioned above. 
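
    For readers who have not seen the WSGI protocol, the sketch below
    shows what an "application" object has to look like, using only
    the Python standard library.  It is a stand-alone toy and not the
    Flask app from lmfdb.py; the helper call_app is made up here so
    that the callable can be exercised without Apache.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A minimal WSGI callable: take the request environ, return body chunks."""
    body = ("Hello from %s" % environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app directly, roughly the way a server would."""
    environ = {}
    setup_testing_defaults(environ)            # fill in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app(application, "/ellcurve"))
```

    A Flask app object is itself such a callable, which is why the
    line "from lmfdb import app as application" is all mod_wsgi needs.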

(3) To understand the application, you need to look at the files in the following tarball:
  
        http://wstein.org/talks/2010-10-lmfdb/lmfdb-flack.tar.bz2

In there, in addition to the templates you'll find lmfdb.py, which looks like this:

##################################################################

from flask import Flask, url_for, render_template, request
app = Flask(__name__)

from pymongo import Connection
db = Connection(port=int(29000)).research

def ellcurves_list(f):
    from functools import wraps
    from utils import LazyMongoDBPagination
    @wraps(f)
    def wrapper(**kwargs):
        kwds = f(**kwargs)
        pagination = LazyMongoDBPagination(query=kwds.pop('curves'),
                                       per_page=50,
                                       page=request.args.get('page', 1),
                                       endpoint=f.__name__,
                                       endpoint_params=kwargs)

        return render_template(f.__name__ + '.html',
                               pagination=pagination, **kwds)
    return wrapper

@app.route('/ellcurves/conductor/<query>')
def ellcurves(query):
    import re
    values = map(int, re.findall(r'(\d+)', query))
    if len(values) == 2:
        return ellcurves_conductor_range(*values)
    elif len(values) == 1:
        return ellcurves_of_conductor(*values)
    else:
        return render_template('invalid_query.html', query=query)

# omitted ...

@app.route('/ellcurve')
def ellcurve():
    a = request.args
    level = int(a.get('level','11'))
    iso_class = a.get('iso_class', 'a')
    number = int(a.get('number', 1))
    cursor = db.ellcurves.find({'level':level, 'iso_class':iso_class, 'number':number})
    return render_template('ellcurve.html', count=cursor.count(True),
                           curve=cursor.next())

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True,host='0.0.0.0', port=8765)

##################################################################
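
The listing omits ellcurves_conductor_range and ellcurves_of_conductor,
and LazyMongoDBPagination lives in a utils module that is not shown.
As a hedged sketch of what such pieces might compute, the code below
turns the <query> part of the URL into a MongoDB filter on the level
field (using the standard $gte and $lte operators) and turns a page
number into skip and limit values.  The function names, the use of
the level field for conductor queries, and the sorting of the two
bounds are assumptions, not the code from the tarball.

```python
import re


def conductor_filter(query):
    """Map '11' or '11-20' to a MongoDB filter dict; None if the query is invalid."""
    # A list comprehension (not map) so that len() also works in Python 3.
    values = [int(v) for v in re.findall(r"\d+", query)]
    if len(values) == 1:
        return {"level": values[0]}
    if len(values) == 2:
        low, high = sorted(values)             # accept '20-11' as well (a choice)
        return {"level": {"$gte": low, "$lte": high}}
    return None


def page_window(page, per_page=50):
    """Return (skip, limit) for a 1-based page; bad input falls back to page 1."""
    try:
        page = max(1, int(page))
    except (TypeError, ValueError):
        page = 1
    return (page - 1) * per_page, per_page


if __name__ == "__main__":
    print(conductor_filter("11-20"))
    print(page_window("3"))
```

With pymongo, the filter would be passed to find on the ellcurves
collection and the two numbers to the cursor's skip and limit methods,
so that only one page of results is fetched per request.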

SUMMARY:

This talk has laid out the architecture that I will be using for my
new web-based databases, which I recently developed jointly with Mike
Hansen.  It uses the following free open source tools together in a
natural way:

     * Python: a high quality programming language
     * MongoDB: a scalable document-oriented database
     * Flask: a "micro-framework" for Python-based web apps
     * Jinja2: a general purpose templating language
     * Apache + WSGI: high performance scalable web server for Python

You can try out a demo that combines the above right now at:

      http://db.modform.org

There are other related technologies that I'm currently not planning
on using, but might use, depending on further investigation, e.g.,
     * MongoKit -- a Python module that brings a structured schema and
       validation layer on top of the great pymongo driver.