TITLE: The Architecture I am Using for my Next Modular Forms Database
SPEAKER: William Stein
DATE: October 2010
------------------------------------
ABSTRACT: I have oodles of data, available on various web pages,
computed in files only I know how to use, and I have the potential to
generate much new data. This year, I am putting all of this data into
a massive database server, and making everything available in a
queryable form on the web. Thanks to the National Science Foundation,
I have no real worries about hardware; I can easily allocate several
terabytes of very fast disk space to this database, and have several
thousand dollars budgeted that can be used to buy extra computers and
disks for backup purposes. Last time I seriously pursued putting
together a big database like this was in 2003, and technology has come
a long way since then. Computers are 64-bit, which helps enormously
with scaling, and there are finally useful document-oriented
databases. This talk is about the technological architecture I intend
to use to put together this cutting-edge database, and I hope it will
be of interest to other people in our Focused Research Group (FRG),
who have similar goals. I am very willing to help them get going with
this technology. (This is joint work with Mike Hansen.)
--------------------------------------------------------
Part 1. Overall Architecture
--------------------------------------------------------
1. Database:
* MongoDB master -- disk.math.washington.edu (in Seattle)
* MongoDB slave 1 -- in Seattle on William Stein's OS X desktop (?)
* MongoDB slave 2 -- in Waterloo on Mike Rubinstein's OS X computer
I will attempt to limit the database footprint for this project to
4 terabytes, so that a single $350 USB disk plugged into any
computer (Linux, OS X, etc.) can serve as a redundant MongoDB
slave. No sharding will be used for my project.
The master MongoDB server will run directly on our 24TB fileserver
listening only on localhost; this machine has very few user
accounts. A user who needs direct access to the database will have
their ssh key added to a limited account on this machine, and via
ssh port forwarding, they will be able to access the database,
using a login and password that gives them access to a subset of
the databases or collections served by MongoDB. Note that a
single MongoDB server can simultaneously serve numerous
completely independent databases.
2. Web Interface:
* Flask microframework: http://flask.pocoo.org/
* Apache: via mod_wsgi
Mike Hansen and I are using the Flask Python library (Flask is
from the same group that brought us Jinja, Sphinx, etc.) to
develop a web front end to the database that will enable
anybody to easily make fast queries on certain collections of
data, and easily scan through the results. We will create indexes
in the MongoDB database that specifically support the queries that
are available through the web interface. We will deploy our Flask
application using Apache's mod_wsgi module, which is quick and
scalable.
3. Programmatic Interface:
We will set up a read-only MongoDB slave server on a separate
machine, which will be available for sophisticated users that wish
to make arbitrary queries against the database, or use Sage to
grab objects out of the database. Some interesting queries (e.g.,
map-reduce, which can run JavaScript on millions of documents) can
take a long time and put a heavy load on the database server, but
since this will be a separate server, such queries will have no
impact at all on our master MongoDB server.
MongoDB has official support for fully accessing a MongoDB server
using any of C, C++, Java, JavaScript, Perl, PHP, Python (hence
Sage!), and Ruby. There are numerous other languages that are not
officially supported, but are listed here:
http://www.mongodb.org/display/DOCS/Drivers
Unfortunately, no other math software, e.g., Magma, Mathematica, Maple,
or Matlab, is in that list.
WEB UPLOAD?
Data upload is also done through 3, but with a connection to the
master server instead of a slave. There is definitely no web page
upload for data in this model, and I have no interest or plans in
creating such a thing as part of this architecture, due to the
security issues. However, if somebody else makes one, they could
act as an "editor" and submit the results of uploads to the MongoDB
database.
--------------------------------------------------------
Part 2. The Database -- MongoDB
--------------------------------------------------------
David Farmer has put forth an idea that "the basic building blocks in
the project are the individual homepages of each object of interest."
MongoDB (http://www.mongodb.org/) is a relatively new *document-oriented*
database management system, and hence quite different from a SQL
database such as SQLite, MySQL, or PostgreSQL. MongoDB documents
correspond to Farmer's idea of homepages. With MongoDB, not only can
you store and retrieve *documents*, you can also build indexes and do
elaborate optimized queries. Also, all data can be automatically
replicated on any number of backup servers.
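As a rough illustration of "build indexes and do optimized queries" from Python: the sketch below assumes a pymongo collection whose documents carry an integer "level" field (the conductor, as in the demo code later in this talk). The helper names are made up for illustration and are not part of pymongo or of the demo site; only create_index and find are real pymongo collection methods.

```python
# Sketch only. "collection" is any pymongo collection object, for example
# the ellcurves collection used later in this talk. Helper names are
# hypothetical; the field name "level" is an assumption taken from the demo.

ASCENDING = 1  # same value as pymongo.ASCENDING


def ensure_conductor_index(collection):
    """Ask MongoDB to build (or reuse) an index on the "level" field."""
    return collection.create_index([("level", ASCENDING)])


def conductor_range_spec(lo, hi):
    """Build the query document matching lo <= level <= hi."""
    if lo > hi:
        lo, hi = hi, lo  # tolerate bounds given in the wrong order
    return {"level": {"$gte": lo, "$lte": hi}}


def curves_in_conductor_range(collection, lo, hi):
    """Return a cursor over documents whose conductor lies in [lo, hi]."""
    return collection.find(conductor_range_spec(lo, hi))
```

With an index on "level" in place, a range query of this shape can be answered from the index rather than by scanning every document, which is the point of creating indexes that match the queries the web interface exposes.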
I've tested using MongoDB to deal with tons of data I generated this
summer, related to modular forms, for a project with Barry Mazur. I
also tested putting all of Cremona's tables of elliptic curves and the
Stein-Watkins tables of elliptic curves in a single big MongoDB
database. It made a vast amount of data (hundreds of gigabytes) feel
"small". This "feel" is critical to a database solution for this
project, and I've never had this feeling before with any other
database I've seriously used, which includes: PostgreSQL, MySQL,
sqlite, ZODB, and custom filesystem based stores.
How to learn about MongoDB: Go to http://www.mongodb.org/ and start
browsing. There are tons of quickstarts, tutorials, articles, and
videos of talks, slides, etc. Though MongoDB is free and open source
(and written in C++), there is a company behind MongoDB, which does a
lot of proselytizing. There are also some not-quite-finished books
about MongoDB; I read them by temporarily signing up for an
O'Reilly Safari books membership (my.safaribooksonline.com), reading
them, then unsubscribing. Perhaps they will be published by now.
(NOTE: MongoDB has essentially only one competitor, which is the
"apache CouchDB" project: http://couchdb.apache.org/.)
Setting up your own simple MongoDB server is easy:
1. Download binaries from http://www.mongodb.org/downloads and put
them somewhere in your PATH. They are available for Linux,
OS X, Windows, and Solaris.
2. Start a MongoDB server running by typing "mongod".
TECHNICAL NOTES: I usually type something more involved:
mongod --dbpath /lvm/array/lmfdb/mongodb --bind_ip localhost --port 29000
The dbpath option specifies where the files for the database
are stored, and the bind_ip and port options make it so mongod
accepts connections on localhost port 29000; otherwise,
anybody in the world could just connect to your mongodb and
delete all your data!! If you want to run mongod on a remote
server somewhere, but easily connect to it from your laptop
(say), set up an ssh tunnel by simply typing:
ssh -L 29000:localhost:29000 remote.computer.edu
Then you can pretend that port 29000 on your laptop *is* port
29000 on the remote server, and things will just work. It's
also possible to create accounts with various permissions from
the mongo console -- see the mongo documentation for details.
However, if you set up accounts, make sure to use an ssh tunnel
whenever using them, since mongod itself doesn't use a secure
socket, so your password would get sent in the clear.
3. You can connect to your new MongoDB server with the mongo console:
wstein@disk$ mongo localhost:29000
MongoDB shell version: 1.6.1
connecting to: localhost:29000/test
> show dbs
admin
local
research
> help
db.help() help on db methods
...
4. More importantly, you can also connect from Sage (or any Python):
(a) If you have not already done so, install pymongo, which
takes about 10 seconds:
sage: !easy_install pymongo
Searching for pymongo
Reading http://pypi.python.org/simple/pymongo/
Reading http://github.com/mongodb/mongo-python-driver
Best match: pymongo 1.9
Downloading http://pypi.python.org/packages/source/p/pymongo/pymongo-1.9.tar.gz#md5=12e12163e6cc22993808900fb9629252
Processing pymongo-1.9.tar.gz
Running pymongo-1.9/setup.py -q bdist_egg --dist-dir /tmp/easy_install-nIodTu/pymongo-1.9/egg-dist-tmp-gCPRfG
warning: no files found matching '*.h' under directory 'pymongo'
bson/time64.c:279: warning: ‘check_tm’ defined but not used
zip_safe flag not set; analyzing archive contents...
Adding pymongo 1.9 to easy-install.pth file
Installed /usr/local/sage/sage-4.6.alpha1/local/lib/python2.6/site-packages/pymongo-1.9-py2.6-linux-x86_64.egg
Processing dependencies for pymongo
Finished processing dependencies for pymongo
sage: quit # important!
(b) Now use it:
sage: import pymongo
sage: C = pymongo.Connection('localhost:29000')
sage: C.database_names()
[u'research', u'admin', u'local']
sage: R = C.research; R
Database(Connection('localhost', 29000), u'research')
sage: R.[tab key]
R.add_son_manipulator R.create_collection R.logout R.remove_user
R.add_user R.dereference R.name R.reset_error_history
R.authenticate R.drop_collection R.next R.set_profiling_level
R.collection_names R.error R.previous_error R.system_js
R.command R.eval R.profiling_info R.validate_collection
R.connection R.last_status R.profiling_level
sage: R.collection_names()
[u'mazur_irreg.done', u'system.indexes', u'mazur_irreg', u'mazur_irreg2.done',
u'mazur_irreg2', u'mazur_irreg3.done', u'mazur_irreg3', u'mazur_irreg.f2_multiplicities',
u'ellcurves', u'heegner_point_heights', u'shimura_curves',
u'fs.chunks', u'fs.files']
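The mongod and ssh command lines from step 2 can also be assembled in a script, which helps keep the port number consistent between the server, the tunnel, and the client. This is only a sketch: the function names are hypothetical, and the "-N" flag (tell ssh not to run a remote command, so the process only forwards the port) is an addition that does not appear in the command shown above.

```python
import shlex


def mongod_command(dbpath, port=29000, bind_ip="localhost"):
    """Argument list for a mongod that only listens on bind_ip:port."""
    return ["mongod", "--dbpath", dbpath,
            "--bind_ip", bind_ip, "--port", str(port)]


def tunnel_command(remote_host, port=29000):
    """Argument list for an ssh tunnel from local `port` to the same port
    on remote_host. "-N" is an assumption added here, not from the talk."""
    forward = "%d:localhost:%d" % (port, port)
    return ["ssh", "-N", "-L", forward, remote_host]


def as_shell(argv):
    """Render an argument list as a copy-and-paste shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell(mongod_command("/lvm/array/lmfdb/mongodb")))
    print(as_shell(tunnel_command("remote.computer.edu")))
```

The argument lists can be handed directly to subprocess.Popen; once the tunnel is up, the pymongo connection in step 4 is made to localhost on the same port.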
DATABASES, COLLECTIONS, and DOCUMENTS:
A MongoDB server serves a collection of completely *independent*
databases. A database is a set of *collections*, and a collection
is a set of documents. A document is basically like a
Python dictionary, but only a limited number of datatypes are
allowed. Technically, a document is a "BSON" document, where BSON
is a slight binary generalization of the JavaScript notion of
JSON.
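To make "only a limited number of datatypes" concrete: BSON keys must be strings, and BSON integers are at most 64 bits, so a dictionary of mathematical data usually needs a little massaging before it can be inserted. The sketch below is one possible convention, not the scheme used in this project; the function name and the "ainvs" and "big" fields are hypothetical.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def to_bson_friendly(value):
    """Recursively convert a Python value into something BSON can hold.

    Assumed convention (not from the talk): integers that do not fit in
    64 bits are stored as decimal strings, tuples become lists, keys
    become strings, and unknown objects are stored as their str() form.
    """
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, int):
        return value if INT64_MIN <= value <= INT64_MAX else str(value)
    if isinstance(value, (float, str, bytes)):
        return value
    if isinstance(value, dict):
        return {str(k): to_bson_friendly(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_bson_friendly(v) for v in value]
    # Anything else (for instance a Sage Integer, which is not a Python
    # int) falls through to its string form; call int() on it first if
    # it should be stored as a number.
    return str(value)


if __name__ == "__main__":
    curve = {"level": 11, "iso_class": "a", "number": 1,
             "ainvs": (0, -1, 1, -10, -20), "big": 2**70}
    print(to_bson_friendly(curve))
```

Storing oversized integers as strings keeps them exact, at the cost that numeric range queries on that field no longer work; whether that trade-off is acceptable depends on which fields the web interface needs to query.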
DOCUMENTS are limited:
A MongoDB document must be at most 4MB in size. Let's push
the limits, to see what this means in practice:
sage: foo = R.foo
sage: foo.insert({'test':'a'*(4*10^6)})
ObjectId('4cae369075688b3eab000006')
sage: foo.insert({'test':'a'*(5*10^6)})
Traceback (most recent call last):
...
InvalidDocument: document too large - BSON documents are limited to 4 MB
So you could store a string with 4 million characters, but not
5 million; for reference, this is about 1,000 typed pages of text.
HOW TO STORE BIG STUFF:
The "fs.chunks" and "fs.files" collections above are created
automatically in MongoDB to implement something called "GridFS" =
"Grid Filesystem". As mentioned above, each MongoDB document must
be at most 4MB in size, but GridFS allows you to get around this
and store gigantic data in MongoDB, but with no indexing and
searching capabilities. It's basically just a key:value store,
built on top of MongoDB.
Here is how to use it (continuing our example):
sage: import gridfs
sage: G = gridfs.GridFS(R)
sage: G.put('a'*(5*10^6), filename='test1')
ObjectId('4cae3ba075688b3eab000008')
sage: x = G.get_last_version('test1').read()
sage: len(x)
5000000
sage: x[:30]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
You can store arbitrary Sage objects using dumps and loads:
sage: M = ModularSymbols(389,2)
sage: G.put(dumps(M), filename='modsym389')
ObjectId('4cae3c1a75688b3eab00001d')
sage: loads(G.get_last_version('modsym389').read())
Modular Symbols space of dimension 65 for Gamma_0(389) of weight 2 with sign 0 over Rational Field
By default a database has a single GridFS (the fs.* collections above), so if you have
documents in all sorts of collections that somehow point to
GridFS "files", you'll need to choose some systematic way
of naming the files.
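One possible "systematic way of naming the files" is to derive the GridFS filename from the collection name plus the fields that identify the document, and to store that filename in the document itself. The sketch below is an assumption for illustration, not the scheme used in this project; the function names, the "/" separator, and the headroom value are all hypothetical.

```python
MAX_DOCUMENT_BYTES = 4 * 1024 * 1024  # the 4MB document limit quoted above


def gridfs_filename(collection_name, *key_parts):
    """Build a name such as 'ellcurves/11/a/1' from a collection name and
    the key fields of the document that points at the file."""
    parts = [collection_name] + [str(p) for p in key_parts]
    for part in parts:
        if not part or "/" in part:
            raise ValueError("bad filename component: %r" % (part,))
    return "/".join(parts)


def needs_gridfs(payload, headroom=1024):
    """Guess whether a byte string is too big to live inside a document.

    `headroom` leaves room for the other fields of the document; the
    value 1024 is an arbitrary illustrative choice.
    """
    return len(payload) + headroom > MAX_DOCUMENT_BYTES


if __name__ == "__main__":
    print(gridfs_filename("ellcurves", 11, "a", 1))
    print(needs_gridfs(b"a" * (5 * 10**6)))
```

Continuing the Sage session above, one could then write G.put(dumps(M), filename=gridfs_filename('modsym', 389, 2)) and record the same string in the corresponding document, so that the document and its large payload can always be matched up.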
--------------------------------------------------------
Part 3. The Web Interface -- FLASK, mod_wsgi, apache
--------------------------------------------------------
FLASK: "Flask is a micro webdevelopment framework for Python."
http://flask.pocoo.org/
Here is "hello world" written using Flask:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
To install Flask into any Python instance, just do "easy_install Flask". For example,
to make Flask work in Sage, do
sage -sh
easy_install Flask
and you got it.
I don't have time in this talk to go into detail about how to use Flask in
general. The documentation at their website is excellent. In short,
you use decorators to construct the URL mapping, deal with GET and
POST requests, etc. You can also put static/ and templates/
subdirectories in your Python project, and relevant files will get
pulled from there, e.g., for static HTML and Jinja2 templates (Flask
is heavily tied to Jinja2).
Mike Hansen and I are building a demo Flask site that illustrates the
architecture sketched above and provides access to a large table of
over a hundred million elliptic curves, consisting of the union of the
Cremona tables and the Stein-Watkins tables. This section describes
this demo in detail. This will in fact form the core for the new
modular forms database, though I'm sure much of it will get
rewritten and polished. In the rest of this talk, I'll give a quick
demo of this site and a walk-through of the code.
The *DEMO* is here:
http://db.modform.org/
Try it out! The rest of this talk is about the architecture of this
website, which is running on boxen.math.washington.edu, in Seattle.
APACHE/WSGI setup:
(1) We created a file
/etc/apache2/sites-available/lmfdb
with the contents:
NameVirtualHost db.modform.org:80
ServerName db.modform.org
WSGIDaemonProcess lmfdb threads=5
WSGIScriptAlias / /home/mhansen/lmfdb/lmfdb.wsgi
WSGIProcessGroup lmfdb
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
and made a symbolic link:
/etc/apache2/sites-enabled/lmfdb --> /etc/apache2/sites-available/lmfdb
(2) The WSGI application is defined by this file:
http://sage.math.washington.edu/home/mhansen/lmfdb/lmfdb.wsgi
The main thing that this file has to do is define some object called "application"
which will obey the WSGI protocol. There are a few other things in there to let
it know about the virtual environment where Mike has Flask, etc. installed. Here
are the contents of the file:
import os, sys
sys.path.append('/home/mhansen/lmfdb')
os.environ['PYTHON_EGG_CACHE'] = '/home/mhansen/lmfdb/.python-eggs'
activate_this = '/home/mhansen/lmfdb/env/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))
from lmfdb import app as application
Note that this file doesn't really make sense out of context, and it is important
to look at the files mentioned above.
(3) To understand the application, you need to look at the files in the following tarball:
http://wstein.org/talks/2010-10-lmfdb/lmfdb-flack.tar.bz2
In there, in addition to the templates you'll find lmfdb.py, which looks like this:
##################################################################
from flask import Flask, url_for, render_template, request
app = Flask(__name__)

from pymongo import Connection
db = Connection(port=int(29000)).research

def ellcurves_list(f):
    from functools import wraps
    from utils import LazyMongoDBPagination
    @wraps(f)
    def wrapper(**kwargs):
        kwds = f(**kwargs)
        pagination = LazyMongoDBPagination(query=kwds.pop('curves'),
                                           per_page=50,
                                           page=request.args.get('page', 1),
                                           endpoint=f.__name__,
                                           endpoint_params=kwargs)
        return render_template(f.__name__ + '.html',
                               pagination=pagination, **kwds)
    return wrapper

@app.route('/ellcurves/conductor/<query>')
def ellcurves(query):
    import re
    # Extract every run of digits in the URL fragment as an integer.
    values = [int(v) for v in re.findall(r'(\d+)', query)]
    if len(values) == 2:
        return ellcurves_conductor_range(*values)
    elif len(values) == 1:
        return ellcurves_of_conductor(*values)
    else:
        return render_template('invalid_query.html', query=query)

# omitted ...

@app.route('/ellcurve')
def ellcurve():
    a = request.args
    level = int(a.get('level','11'))
    iso_class = a.get('iso_class', 'a')
    number = int(a.get('number', 1))
    cursor = db.ellcurves.find({'level':level, 'iso_class':iso_class, 'number':number})
    return render_template('ellcurve.html', count=cursor.count(True),
                           curve=cursor.next())

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True,host='0.0.0.0', port=8765)
##################################################################
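Step (2) above says that lmfdb.wsgi only has to define an object called "application" that obeys the WSGI protocol; the Flask app object imported there is such an object. For comparison, here is a minimal sketch of a bare WSGI callable written without Flask. It is not part of the demo site, and the response text is made up for illustration.

```python
def application(environ, start_response):
    """A bare WSGI application: takes the request environment and a
    start_response callback, and returns an iterable of byte strings."""
    path = environ.get("PATH_INFO", "/")
    body = ("Hello from %s\n" % path).encode("utf-8")
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Serve the callable with the standard library's reference server.
    # The port 8765 matches the debug port used in lmfdb.py above.
    from wsgiref.simple_server import make_server
    make_server("localhost", 8765, application).serve_forever()
```

mod_wsgi looks up the name "application" in the script named by WSGIScriptAlias and calls it once per request in exactly this way, which is why the last line of lmfdb.wsgi imports the Flask app under that name.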
SUMMARY:
This talk has laid out the architecture that I will be using for my
new web-based databases, which I recently developed jointly with Mike
Hansen. It uses the following free open source tools together in a
natural way:
* Python: a high quality programming language
* MongoDB: a scalable document-oriented database
* Flask: a "micro-framework" for Python-based web apps
* Jinja2: a general purpose templating language
* Apache + WSGI: high performance scalable web server for Python
You can try out a demo that combines the above right now at:
http://db.modform.org
There are other related technologies that I'm currently not planning
on using, but might use, depending on further investigation, e.g.,
* MongoKit -- a python module that brings structured schema and
validation layer on top of the great pymongo driver.