TITLE: The Architecture I am Using for my Next Modular Forms Database
SPEAKER: William Stein
DATE: October 2010
------------------------------------
ABSTRACT: I have oodles of data, available on various web pages,
computed in files only I know how to use, and I have the potential to
generate much new data. This year, I am putting all of this data into
a massive database server, and making everything available in a
queryable form on the web. Thanks to the National Science Foundation,
I have no real worries about hardware; I can easily allocate several
terabytes of very fast disk space to this database, and have several
thousand dollars budgeted that can be used to buy extra computers and
disks for backup purposes. The last time I seriously pursued putting
together a big database like this was in 2003, and technology has come
a long way since then. Computers are 64-bit, which helps enormously
with scaling, and there are finally useful document-oriented
databases. This talk is about the technological architecture I intend
to use to put together this cutting-edge database, and I hope it will
be of interest to other people in our Focused Research Group (FRG),
who have similar goals. I am very willing to help them get going with
this technology. (This is joint work with Mike Hansen.)

--------------------------------------------------------
Part 1. Overall Architecture
--------------------------------------------------------

1. Database:

   * MongoDB master  -- disk.math.washington.edu (in Seattle)
   * MongoDB slave 1 -- in Seattle on William Stein's OS X desktop (?)
   * MongoDB slave 2 -- in Waterloo on Mike Rubinstein's OS X computer

I will attempt to limit the database footprint for this project to 4
terabytes, so that a single $350 USB disk plugged into any computer
(Linux, OS X, etc.) can serve as a redundant MongoDB slave. No
sharding will be used for my project.

The master MongoDB server will run directly on our 24TB fileserver,
listening only on localhost; this machine has very few user accounts.
A user who needs direct access to the database will have their ssh key
added to a limited account on this machine, and via ssh port
forwarding, they will be able to access the database, using a login
and password that gives them access to a subset of the databases or
collections served by MongoDB. Note that a single MongoDB server can
simultaneously serve numerous completely independent databases.

2. Web Interface:

   * Flask microframework: http://flask.pocoo.org/
   * Apache: via mod_wsgi

Mike Hansen and I are using the Flask Python library (Flask is from
the same group that brought us Jinja, Sphinx, etc.) to develop a web
front end to the database that will enable anybody to easily make fast
queries on certain collections of data, and easily scan through the
results. We will create indexes in the MongoDB database that
specifically support the queries that are available through the web
interface. We will deploy our Flask application using Apache's
mod_wsgi module, which is quick and scalable.

3. Programmatic Interface:

We will set up a read-only MongoDB slave server on a separate machine,
which will be available for sophisticated users who wish to make
arbitrary queries against the database, or use Sage to grab objects
out of the database. Some interesting queries (e.g., map-reduce, which
can run JavaScript on millions of documents) can take a long time and
put a heavy load on the database server, but since this will be a
separate server, such queries will have no impact at all on our master
MongoDB server. MongoDB has official support for fully accessing a
MongoDB server using any of C, C++, Java, JavaScript, Perl, PHP,
Python (hence Sage!), and Ruby.
There are numerous other languages that are not officially supported,
but are listed here:

    http://www.mongodb.org/display/DOCS/Drivers

Unfortunately, no other math software, e.g., Magma, Mathematica,
Maple, or Matlab, is in that list.

WEB UPLOAD? Data upload is also done through the programmatic
interface (3), but with a connection to the master server instead of a
slave. There is definitely no web page upload for data in this model,
and I have no interest in, or plans for, creating such a thing as part
of this architecture, due to the security issues. However, if somebody
else makes one, they could act as an "editor" and submit the results
of uploads to the MongoDB database.

--------------------------------------------------------
Part 2. The Database -- MongoDB
--------------------------------------------------------

David Farmer has put forth an idea that "the basic building blocks in
the project are the individual homepages of each object of interest."

MongoDB (http://www.mongodb.org/) is a relatively new
*document-oriented* database management system, and hence much
different from a SQL database such as SQLite, MySQL, or PostgreSQL.
MongoDB documents correspond to Farmer's idea of homepages. With
MongoDB, not only can you store and retrieve *documents*, you can also
build indexes and do elaborate optimized queries. Also, all data can
be automatically replicated on any number of backup servers.

I've tested using MongoDB to deal with tons of data I generated this
summer, related to modular forms, for a project with Barry Mazur. I
also tested putting all of Cremona's tables of elliptic curves and the
Stein-Watkins tables of elliptic curves in a single big MongoDB
database. It made a vast amount of data (hundreds of gigabytes) feel
"small". This "feel" is critical to a database solution for this
project, and I've never had this feeling before with any other
database I've seriously used, which includes: PostgreSQL, MySQL,
SQLite, ZODB, and custom filesystem-based stores.
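
To make the "document as homepage" idea concrete, here is a minimal
sketch in plain Python (no MongoDB server needed) of what an
elliptic-curve document might look like and how a simple equality
query selects it. The field names level, iso_class, and number are the
ones used by the demo query later in this talk; the fields ainvs and
rank and the helper find() are hypothetical, introduced only for
illustration. A real query would go through pymongo and could use an
index instead of scanning every document.

```python
import json

# Two "homepage" documents; a document is essentially a Python dictionary.
# 'ainvs' and 'rank' are hypothetical field names chosen for this sketch.
curves = [
    {'level': 11, 'iso_class': 'a', 'number': 1,
     'ainvs': [0, -1, 1, -10, -20], 'rank': 0},
    {'level': 37, 'iso_class': 'a', 'number': 1,
     'ainvs': [0, 0, 1, -1, 0], 'rank': 1},
]

def find(documents, query):
    """Return the documents whose fields equal every key/value in query.

    This imitates only the simplest kind of MongoDB query (exact match
    on each field) with a linear scan; it is not how MongoDB works
    internally.
    """
    return [doc for doc in documents
            if all(doc.get(key) == value for key, value in query.items())]

hits = find(curves, {'level': 37, 'iso_class': 'a'})

# Documents hold only simple types, so they serialize cleanly to JSON.
as_json = json.dumps(hits[0], sort_keys=True)
```

The point of the sketch is that the query itself is also just a
dictionary, which is the shape pymongo's find() takes as well.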

How to learn about MongoDB: Go to http://www.mongodb.org/ and start
browsing. There are tons of quickstarts, tutorials, articles, and
videos of talks, slides, etc. Though MongoDB is free and open source
(and written in C++), there is a company behind MongoDB, which does a
lot of proselytizing. There are also some not-quite-finished books
about MongoDB; I read them by temporarily signing up for an O'Reilly
Safari books membership (my.safaribooksonline.com), reading them, then
unsubscribing. Perhaps they will be published by now.

(NOTE: MongoDB has essentially only one competitor, which is the
"Apache CouchDB" project: http://couchdb.apache.org/.)

Setting up your own simple MongoDB server is easy:

  1. Download binaries from http://www.mongodb.org/downloads and put
     them somewhere in your PATH. They are available for Linux, OS X,
     Windows, and Solaris.

  2. Start a MongoDB server running by typing "mongod".

     TECHNICAL NOTES: I usually type something more involved:

         mongod --dbpath /lvm/array/lmfdb/mongodb --bind_ip localhost --port 29000

     The dbpath option specifies where the files for the database are
     stored, and the bind_ip and port options make mongod accept
     connections only on localhost, port 29000; otherwise, anybody in
     the world could just connect to your MongoDB and delete all your
     data!!

     If you want to run mongod on a remote server somewhere, but
     easily connect to it from your laptop (say), set up an ssh tunnel
     by simply typing:

         ssh -L 29000:localhost:29000 remote.computer.edu

     Then you can pretend that port 29000 on your laptop *is* port
     29000 on the remote server, and things will just work.

     It's also possible to create accounts with various permissions
     from the mongo console -- see the mongo documentation for
     details. However, if you set up accounts, make sure to use an ssh
     tunnel whenever using them, since mongod itself doesn't use a
     secure socket, so your password would get sent in the clear.

  3. You can connect to your new MongoDB server with the mongo console:

         wstein@disk$ mongo localhost:29000
         MongoDB shell version: 1.6.1
         connecting to: localhost:29000/test
         > show dbs
         admin
         local
         research
         > help
                 db.help()                    help on db methods
         ...

  4. More importantly, you can also connect from Sage (or any Python):

     (a) If you have not already done so, install pymongo, which takes
         about 10 seconds:

         sage: !easy_install pymongo
         Searching for pymongo
         Reading http://pypi.python.org/simple/pymongo/
         Reading http://github.com/mongodb/mongo-python-driver
         Best match: pymongo 1.9
         Downloading http://pypi.python.org/packages/source/p/pymongo/pymongo-1.9.tar.gz#md5=12e12163e6cc22993808900fb9629252
         Processing pymongo-1.9.tar.gz
         Running pymongo-1.9/setup.py -q bdist_egg --dist-dir /tmp/easy_install-nIodTu/pymongo-1.9/egg-dist-tmp-gCPRfG
         warning: no files found matching '*.h' under directory 'pymongo'
         bson/time64.c:279: warning: ‘check_tm’ defined but not used
         zip_safe flag not set; analyzing archive contents...
         Adding pymongo 1.9 to easy-install.pth file

         Installed /usr/local/sage/sage-4.6.alpha1/local/lib/python2.6/site-packages/pymongo-1.9-py2.6-linux-x86_64.egg
         Processing dependencies for pymongo
         Finished processing dependencies for pymongo
         sage: quit     # important!

     (b) Now use it:

         sage: import pymongo
         sage: C = pymongo.Connection('localhost:29000')
         sage: C.database_names()
         [u'research', u'admin', u'local']
         sage: R = C.research; R
         Database(Connection('localhost', 29000), u'research')
         sage: R.[tab key]
         R.add_son_manipulator  R.create_collection  R.logout           R.remove_user
         R.add_user             R.dereference        R.name             R.reset_error_history
         R.authenticate         R.drop_collection    R.next             R.set_profiling_level
         R.collection_names     R.error              R.previous_error   R.system_js
         R.command              R.eval               R.profiling_info   R.validate_collection
         R.connection           R.last_status        R.profiling_level
         sage: R.collection_names()
         [u'mazur_irreg.done', u'system.indexes', u'mazur_irreg',
          u'mazur_irreg2.done', u'mazur_irreg2', u'mazur_irreg3.done',
          u'mazur_irreg3', u'mazur_irreg.f2_multiplicities', u'ellcurves',
          u'heegner_point_heights', u'shimura_curves', u'fs.chunks',
          u'fs.files']

DATABASES, COLLECTIONS, and DOCUMENTS: A MongoDB server serves a
collection of completely *independent* databases. A database is a set
of *collections*, and a collection is a set of documents. A document
is basically like a Python dictionary, but only a limited number of
datatypes are allowed. Technically, a document is a "BSON" document,
where BSON is a slight binary generalization of the JavaScript notion
of JSON.

DOCUMENTS are limited: A MongoDB document must be at most 4MB in size.
Let's push the limits, to see what this means in practice:

    sage: foo = R.foo
    sage: foo.insert({'test':'a'*(4*10^6)})
    ObjectId('4cae369075688b3eab000006')
    sage: foo.insert({'test':'a'*(5*10^6)})
    Traceback (most recent call last):
    ...
    InvalidDocument: document too large - BSON documents are limited to 4 MB

So you could store a string with 4 million characters, but not 5
million; for reference, this is about 1,000 typed pages of text.

HOW TO STORE BIG STUFF: The "fs.chunks" and "fs.files" collections
above are created automatically in MongoDB to implement something
called "GridFS" = "Grid Filesystem".
As mentioned above, each MongoDB document must be at most 4MB in size.
GridFS lets you get around this limit and store gigantic data in
MongoDB, but with no indexing and searching capabilities. It's
basically just a key:value store, built on top of MongoDB. Here is how
to use it (continuing our example):

    sage: import gridfs
    sage: G = gridfs.GridFS(R)
    sage: G.put('a'*(5*10^6), filename='test1')
    ObjectId('4cae3ba075688b3eab000008')
    sage: x = G.get_last_version('test1').read()
    sage: len(x)
    5000000
    sage: x[:30]
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

You can store arbitrary Sage objects using dumps and loads:

    sage: M = ModularSymbols(389,2)
    sage: G.put(dumps(M), filename='modsym389')
    ObjectId('4cae3c1a75688b3eab00001d')
    sage: loads(G.get_last_version('modsym389').read())
    Modular Symbols space of dimension 65 for Gamma_0(389) of weight 2 with sign 0 over Rational Field

By default there is one GridFS per database (stored in the fs.*
collections), so if you have documents in all sorts of collections
that somehow point to GridFS "files", you'll need to choose some
systematic way of naming the files.

--------------------------------------------------------
Part 3. The Web Interface -- FLASK, mod_wsgi, apache
--------------------------------------------------------

FLASK: "Flask is a micro webdevelopment framework for Python."

    http://flask.pocoo.org/

Here is "hello world" written using Flask:

    from flask import Flask
    app = Flask(__name__)

    @app.route("/")
    def hello():
        return "Hello World!"

    if __name__ == "__main__":
        app.run()

To install Flask into any Python instance, just do "easy_install
Flask". For example, to make Flask work in Sage, do

    sage -sh
    easy_install Flask

and you've got it.

I don't have time in this talk to go into detail about how to use
Flask in general. The documentation at their website is excellent. In
short, you use decorators to construct the URL mapping, deal with GET
and POST requests, etc.
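
The decorator idea can be illustrated without Flask itself. The
following is a toy sketch of how a route() decorator can build a table
mapping URLs to functions. The names ROUTES, route, and dispatch are
hypothetical, and this is not Flask's implementation, which also
handles URL variables, HTTP methods, and much more.

```python
# Toy URL mapping built with a decorator (not Flask's actual code).
ROUTES = {}   # maps a URL path to the function that handles it

def route(path):
    """Return a decorator that registers a function under the given path."""
    def decorator(func):
        ROUTES[path] = func
        return func          # leave the function itself unchanged
    return decorator

@route('/')
def index():
    return 'Hello World!'

@route('/about')
def about():
    return 'A toy URL mapping.'

def dispatch(path):
    """Call the handler registered for path, or return a 404 message."""
    handler = ROUTES.get(path)
    if handler is None:
        return '404 Not Found'
    return handler()
```

Calling dispatch('/') looks up the function registered by the
decorator and runs it, which is roughly the job a web framework does
for each incoming request.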
You can also put static/ and templates/ subdirectories in your Python
project, and relevant files will get pulled from there, e.g., for
static HTML and Jinja2 templates (Flask is heavily tied to Jinja2).

Mike Hansen and I are building a demo Flask site that illustrates the
architecture sketched above and provides access to a large table of
over a hundred million elliptic curves, consisting of the union of the
Cremona tables and the Stein-Watkins tables. This section describes
this demo in detail. This will in fact form the core of the new
modular forms database, though I'm sure much of it will get rewritten
and polished.

In the rest of this talk, I'll give a quick demo of this site and a
walk-through of the code.

The *DEMO* is here:

    http://db.modform.org/

Try it out! The rest of this talk is about the architecture of this
website, which is running on boxen.math.washington.edu, in Seattle.

APACHE/WSGI setup:

(1) We created a file /etc/apache2/sites-available/lmfdb with the
    contents:

        NameVirtualHost db.modform.org:80
        ServerName db.modform.org
        WSGIDaemonProcess lmfdb threads=5
        WSGIScriptAlias / /home/mhansen/lmfdb/lmfdb.wsgi
        WSGIProcessGroup lmfdb
        WSGIApplicationGroup %{GLOBAL}
        Order deny,allow
        Allow from all

    and made a symbolic link:

        /etc/apache2/sites-enabled/lmfdb --> /etc/apache2/sites-available/lmfdb

(2) The WSGI application is defined by this file:

        http://sage.math.washington.edu/home/mhansen/lmfdb/lmfdb.wsgi

    The main thing that this file has to do is define some object
    called "application" which will obey the WSGI protocol. There are
    a few other things in there to let it know about the virtual
    environment where Mike has Flask, etc. installed.
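
For readers unfamiliar with WSGI: an object "obeys the WSGI protocol"
if it can be called with a request environment and a start_response
callback, and returns an iterable of byte strings. Here is a minimal
sketch using only the Python standard library. The response text and
the helper call_app are made up for illustration; a Flask app object
provides this same calling interface.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A minimal WSGI application: every request gets a plain-text reply."""
    body = ('You asked for %s\n' % environ.get('PATH_INFO', '/')).encode('utf-8')
    headers = [('Content-Type', 'text/plain; charset=utf-8'),
               ('Content-Length', str(len(body)))]
    start_response('200 OK', headers)
    return [body]

def call_app(app, path):
    """Invoke a WSGI app the way a server would, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)   # fill in the required WSGI keys
    environ['PATH_INFO'] = path
    captured = {}
    def start_response(status, headers):
        captured['status'] = status
        captured['headers'] = headers
    chunks = app(environ, start_response)
    return captured['status'], b''.join(chunks)

status, body = call_app(application, '/ellcurve')
```

mod_wsgi plays the role of call_app here: it builds the environment
from the real HTTP request and sends the returned bytes to the client.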

    Here are the contents of lmfdb.wsgi:

        import os, sys
        sys.path.append('/home/mhansen/lmfdb')
        os.environ['PYTHON_EGG_CACHE'] = '/home/mhansen/lmfdb/.python-eggs'

        activate_this = '/home/mhansen/lmfdb/env/bin/activate_this.py'
        execfile(activate_this, dict(__file__=activate_this))

        from lmfdb import app as application

    Note that this file doesn't make much sense out of context; it is
    important to also look at the files it mentions.

(3) To understand the application, you need to look at the files in
    the following tarball:

        http://wstein.org/talks/2010-10-lmfdb/lmfdb-flack.tar.bz2

    In there, in addition to the templates, you'll find lmfdb.py,
    which looks like this:

##################################################################
from flask import Flask, url_for, render_template, request
app = Flask(__name__)

from pymongo import Connection
db = Connection(port=int(29000)).research

def ellcurves_list(f):
    # Decorator: wrap a view so its 'curves' query is shown in pages of 50.
    from functools import wraps
    from utils import LazyMongoDBPagination
    @wraps(f)
    def wrapper(**kwargs):
        kwds = f(**kwargs)
        pagination = LazyMongoDBPagination(query=kwds.pop('curves'),
                                           per_page=50,
                                           page=request.args.get('page', 1),
                                           endpoint=f.__name__,
                                           endpoint_params=kwargs)
        return render_template(f.__name__ + '.html',
                               pagination=pagination, **kwds)
    return wrapper

@app.route('/ellcurves/conductor/<query>')
def ellcurves(query):
    import re
    # Python 2: map() returns a list, so len() works on it.
    values = map(int, re.findall(r'(\d+)', query))
    if len(values) == 2:
        return ellcurves_conductor_range(*values)
    elif len(values) == 1:
        return ellcurves_of_conductor(*values)
    else:
        return render_template('invalid_query.html', query=query)

# omitted ...
@app.route('/ellcurve')
def ellcurve():
    a = request.args
    level = int(a.get('level','11'))
    iso_class = a.get('iso_class', 'a')
    number = int(a.get('number', 1))
    cursor = db.ellcurves.find({'level':level, 'iso_class':iso_class,
                                'number':number})
    return render_template('ellcurve.html', count=cursor.count(True),
                           curve=cursor.next())

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True,host='0.0.0.0', port=8765)
##################################################################

SUMMARY:

This talk has laid out the architecture that I will be using for my
new web-based databases, which I recently developed jointly with Mike
Hansen. It uses the following free open source tools together in a
natural way:

   * Python: a high quality programming language
   * MongoDB: a scalable document-oriented database
   * Flask: a "micro-framework" for Python-based web apps
   * Jinja2: a general purpose templating language
   * Apache + WSGI: high performance scalable web server for Python

You can try out a demo that combines the above right now at:

    http://db.modform.org

There are other related technologies that I'm currently not planning
on using, but might use, depending on further investigation, e.g.,

   * MongoKit -- a Python module that brings a structured schema and
     validation layer on top of the great pymongo driver.