author | Marc Abramowitz <marc@marc-abramowitz.com> | 2016-03-07 18:52:36 -0800 |
---|---|---|
committer | Marc Abramowitz <marc@marc-abramowitz.com> | 2016-03-07 18:52:36 -0800 |
commit | cc83e06efff71b81ca5a3ac6df65775971181295 (patch) | |
tree | d52fa3f1a93730f263c2c5ac8266de8e5fb12abf /docs/url-parsing-with-wsgi.txt | |
tox.ini: Measure test coverage (tox_coverage)
Diffstat (limited to 'docs/url-parsing-with-wsgi.txt')
-rw-r--r-- | docs/url-parsing-with-wsgi.txt | 304 |
1 files changed, 304 insertions, 0 deletions
diff --git a/docs/url-parsing-with-wsgi.txt b/docs/url-parsing-with-wsgi.txt
new file mode 100644
index 0000000..856971f
--- /dev/null
+++ b/docs/url-parsing-with-wsgi.txt
@@ -0,0 +1,304 @@
+URL Parsing With WSGI And Paste
++++++++++++++++++++++++++++++++
+
+:author: Ian Bicking <ianb@colorstudy.com>
+:revision: $Rev$
+:date: $LastChangedDate$
+
+.. contents::
+
+Introduction and Audience
+=========================
+
+This document is intended for web framework authors and integrators,
+and people who want to understand the internal architecture of Paste.
+
+.. include:: include/contact.txt
+
+URL Parsing
+===========
+
+.. note::
+
+   Sometimes people use "URL", and sometimes "URI". I think URLs are
+   a subset of URIs. But in practice you'll almost never see URIs
+   that aren't URLs, and certainly not in Paste. URIs that aren't
+   URLs are abstract Identifiers, that cannot necessarily be used to
+   Locate the resource. This document is *all* about locating.
+
+Most generally, URL parsing is about taking a URL and determining what
+"resource" the URL refers to. "Resource" is a rather vague term,
+intentionally. It's really just a metaphor -- in reality there aren't
+any "resources" in HTTP; there are only requests and responses.
+
+In Paste, everything is about WSGI. But that can seem too fancy.
+There are four core things involved: the *request* (personified in the
+WSGI environment), the *response* (personified in the
+``start_response`` callback and the return iterator), the WSGI
+application, and the server that calls that application. The
+application and request are objects, while the server and response are
+really more like actions than concrete objects.
+
+In this context, URL parsing is about mapping a URL to an
+*application* and a *request*. The request actually gets modified as
+it moves through different parts of the system. Two dictionary keys
+in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
+but any part of the environment can be modified as it is passed
+through the system.
+
+Dispatching
+===========
+
+.. note::
+
+   WSGI isn't object oriented? Well, if you look at it, you'll notice
+   there are no objects except built-in types, so it shouldn't be a
+   surprise. Additionally, the interface and promises of the objects
+   we do see are very minimal. An application doesn't have any
+   interface except one method -- ``__call__`` -- and that method
+   *does* things, it doesn't give any other information.
+
+Because WSGI is action-oriented, rather than object-oriented, it's
+more important what we *do*. "Finding" an application is probably an
+intermediate step, but "running" the application is our ultimate goal,
+and the only real judge of success. An application that isn't run is
+useless to us, because it doesn't have any other useful methods.
+
+So what we're really doing is *dispatching* -- we're handing the
+request and responsibility for the response off to another object
+(another actor, really). In the process we can actually retain some
+control -- we can capture and transform the response, and we can
+modify the request -- but that's not what the typical URL resolver will
+do.
+
+Motivations
+===========
+
+The most obvious kind of URL parsing is finding a WSGI application.
+
+Typically when a framework first supports WSGI or is integrated into
+Paste, it is "monolithic" with respect to URLs. That is, you define
+(in Paste, or maybe in Apache) a "root" URL, and everything under that
+goes into the framework. What the framework does internally, Paste
+does not know -- it probably finds internal objects to dispatch to,
+but the framework is opaque to Paste. Not just to Paste, but to
+any code that isn't in that framework.
+
+That means that we can't mix code from multiple frameworks, or as
+easily share services, or use WSGI middleware that doesn't apply to
+the entire framework/application.
+
+An example of someplace we might want to use an "application" that
+isn't part of the framework would be uploading large files. It's
+possible to keep track of upload progress, and report that back to the
+user, but frameworks typically aren't capable of this. This is usually
+because the POST request is completely read and parsed before the
+framework invokes any application code.
+
+This is resolvable in WSGI -- a WSGI application can provide its own
+code to read and parse the POST request, and simultaneously report
+progress (usually in a way that *another* WSGI application/request can
+read and report to the user on that progress). This is an example
+where you want to allow "foreign" applications to be intermingled with
+framework application code.
+
+Finding Applications
+====================
+
+OK, enough theory. How does a URL parser work? Well, it is a WSGI
+application, and a WSGI server, in the typical "WSGI middleware"
+style. Except that it determines which application it will serve
+for each request.
+
+Let's consider Paste's ``URLParser`` (in ``paste.urlparser``). This
+class takes a directory name as its only required argument, and
+instances are WSGI applications.
+
+When a request comes in, the parser looks at ``PATH_INFO`` to see
+what's left to parse. ``SCRIPT_NAME`` represents where we are *now*;
+it's the part of the URL that has been parsed.
+
+There are a couple of special cases:
+
+The empty string:
+
+    URLParser serves directories. When ``PATH_INFO`` is empty, that
+    means we got a request with no trailing ``/``, like, say, ``/blog``.
+    If URLParser serves the ``blog`` directory, then this won't do --
+    the user is requesting the ``blog`` *page*. We have to redirect
+    them to ``/blog/``.
+
+A single ``/``:
+
+    So, we got a trailing ``/``. This means we need to serve the
+    "index" page. In URLParser, this is some file named ``index``,
+    though that's really an implementation detail. You could create
+    an index dynamically (like Apache's file listings), or whatever.
+
+Otherwise we get a string like ``/path...``. Note that ``PATH_INFO``
+*must* start with a ``/``, or it must be empty.
+
+URLParser pulls off the first part of the path. E.g., if
+``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
+It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
+(which becomes ``/edit/285``).
+
+It then searches for a file that matches "blog". In URLParser, this
+means it looks for a filename which matches that name (ignoring the
+extension). It then uses the type of that file (determined by
+extension) to create a WSGI application.
+
+One case is that the file is a directory. In that case, the
+application is *another* URLParser instance, this time with the new
+directory.
+
+URLParser actually allows per-extension "plugins" -- these are just
+functions that get a filename, and produce a WSGI application. One of
+these is ``make_py`` -- this function imports the module, and looks
+for special symbols; if it finds a symbol ``application``, it assumes
+this is a WSGI application that is ready to accept the request. If it
+finds a symbol that matches the name of the module (e.g., ``edit``),
+then it assumes that is an application *factory*, meaning that when
+you call it with no arguments you get a WSGI application.
+
+Another function takes "unknown" files (files for which no better
+constructor exists) and creates an application that simply responds
+with the contents of that file (and the appropriate ``Content-Type``).
+
+In any case, ``URLParser`` delegates as soon as it can. It doesn't
+parse the entire path -- it just finds the *next* application, which
+in turn may delegate to yet another application.
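The segment-popping step described above (move the first part of ``PATH_INFO`` onto ``SCRIPT_NAME``) can be tried out with the standard library alone. This is only a sketch, not Paste's code: it uses ``wsgiref.util.shift_path_info`` as a stand-in for Paste's own helper, on the assumption that the two behave alike for a simple path, and the ``environ`` dictionary is a hand-built substitute for a real request.

```python
from wsgiref.util import shift_path_info

# Hand-built stand-in for a WSGI environ; a real server supplies many more keys.
environ = {"SCRIPT_NAME": "", "PATH_INFO": "/blog/edit/285"}

# Record (popped segment, SCRIPT_NAME, PATH_INFO) after each pop.
consumed = []
while True:
    segment = shift_path_info(environ)  # moves one segment onto SCRIPT_NAME
    if segment is None:  # PATH_INFO is empty: nothing left to parse
        break
    consumed.append((segment, environ["SCRIPT_NAME"], environ["PATH_INFO"]))

for row in consumed:
    print(row)
```

After the first pop, ``SCRIPT_NAME`` is ``/blog`` and ``PATH_INFO`` is ``/edit/285``, matching the example in the text. Each later pop is what the *next* application in the chain would do.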
+
+Here's a very simple implementation of URLParser::
+
+    class URLParser(object):
+        def __init__(self, dir):
+            self.dir = dir
+        def __call__(self, environ, start_response):
+            segment = wsgilib.path_info_pop(environ)
+            if segment is None:  # No trailing /
+                # do a redirect...
+            for filename in os.listdir(self.dir):
+                if os.path.splitext(filename)[0] == segment:
+                    return self.serve_application(
+                        environ, start_response, filename)
+            # do a 404 Not Found
+        def serve_application(self, environ, start_response, filename):
+            basename, ext = os.path.splitext(filename)
+            filename = os.path.join(self.dir, filename)
+            if os.path.isdir(filename):
+                return URLParser(filename)(environ, start_response)
+            elif ext == '.py':
+                module = import_module(filename)
+                if hasattr(module, 'application'):
+                    return module.application(environ, start_response)
+                elif hasattr(module, basename):
+                    return getattr(module, basename)()(
+                        environ, start_response)
+            else:
+                return wsgilib.send_file(filename)
+
+Modifying The Request
+=====================
+
+Well, URLParser is one kind of parser. But others are possible, and
+aren't too hard to write.
+
+Let's imagine a URL like ``/2004/05/01/edit``. It's likely that
+``/2004/05/01`` doesn't point to anything on file, but is really more
+of a "variable" that gets passed to ``edit``. So we can pull them off
+and put them somewhere. This is a good place for a WSGI extension.
+Let's put them in ``environ["app.date_parts"]``.
+
+We'll pass one other application in -- once we get the date (if any)
+we need to pass the request onto an application that can actually
+handle it. This "application" might be a URLParser or similar system
+(that figures out what ``/edit`` means).
+
+::
+
+    class GrabDate(object):
+        def __init__(self, subapp):
+            self.subapp = subapp
+        def __call__(self, environ, start_response):
+            date_parts = []
+            while len(date_parts) < 3:
+                first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
+                try:
+                    date_parts.append(int(first))
+                    wsgilib.path_info_pop(environ)
+                except (ValueError, TypeError):
+                    break
+            environ['app.date_parts'] = date_parts
+            return self.subapp(environ, start_response)
+
+This is really like traditional "middleware", in that it sits between
+the server and just one application.
+
+Assuming you put this class in the ``myapp.grabdate`` module, you
+could install it by adding this to your configuration::
+
+    middleware.append('myapp.grabdate.GrabDate')
+
+Object Publishing
+=================
+
+Besides looking in the filesystem, "object publishing" is another
+popular way to do URL parsing. This is pretty easy to implement as
+well -- it usually just means using ``getattr`` with the popped
+segments. But we'll implement a rough approximation of `Quixote's
+<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::
+
+    class ObjectApp(object):
+        def __init__(self, obj):
+            self.obj = obj
+        def __call__(self, environ, start_response):
+            next = wsgilib.path_info_pop(environ)
+            if next is None:
+                # This is the object, let's serve it...
+                return self.publish(self.obj, environ, start_response)
+            next = next or '_q_index'  # the default index method
+            if next in self.obj._q_exports and getattr(self.obj, next, None):
+                return ObjectApp(getattr(self.obj, next))(
+                    environ, start_response)
+            next_obj = self.obj._q_traverse(next)
+            if not next_obj:
+                # Do a 404
+            return ObjectApp(next_obj)(environ, start_response)
+
+        def publish(self, obj, environ, start_response):
+            if callable(obj):
+                output = str(obj())
+            else:
+                output = str(obj)
+            start_response('200 OK', [('Content-type', 'text/html')])
+            return [output]
+
+The ``publish`` method is a little weak, and functions like
+``_q_traverse`` aren't passed interesting information about the
+request, but this is only a rough approximation of the framework.
+Things to note:
+
+* The object has standard attributes and methods -- ``_q_exports``
+  (attributes that are public to the web) and ``_q_traverse``
+  (a way of overriding the traversal without having an attribute for
+  each possible path segment).
+
+* The object isn't rendered until the path is completely consumed
+  (when ``next`` is ``None``). This means ``_q_traverse`` has to
+  consume extra segments of the path. In this version ``_q_traverse``
+  is only given the next piece of the path; Quixote gives it the
+  entire path (as a list of segments).
+
+* ``publish`` is really a small and lame way to turn a Quixote object
+  into a WSGI application. For any serious framework you'd want to do
+  a better job than what I do here.
+
+* It would be even better if you used something like `Adaptation
+  <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
+  applications. This would include removing the explicit creation of
+  new ``ObjectApp`` instances, which could also be a kind of fall-back
+  adaptation.
+
+Anyway, this example is less complete, but maybe it will get you
+thinking.
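The simpler idea mentioned at the start of this section, using ``getattr`` with the popped segments, might look roughly like the following under Python 3. Everything here is a sketch under stated assumptions rather than Paste's or Quixote's API: ``GetattrApp``, ``Root``, ``Blog``, and ``call`` are hypothetical names, ``wsgiref.util.shift_path_info`` stands in for the path-popping helper, and hiding every attribute that starts with an underscore is just one possible policy for deciding what is public.

```python
from wsgiref.util import shift_path_info


class GetattrApp:
    """Hypothetical getattr-based publisher (not Quixote's or Paste's API)."""

    def __init__(self, obj):
        self.obj = obj

    def __call__(self, environ, start_response):
        segment = shift_path_info(environ)
        if segment is None or segment == "":
            # Path fully consumed (or only a trailing slash): render the object.
            return self.publish(start_response)
        if segment.startswith("_"):
            # Assumed policy: underscore-prefixed attributes are never public.
            return self.not_found(start_response)
        child = getattr(self.obj, segment, None)
        if child is None:
            return self.not_found(start_response)
        # Delegate the rest of the path to a publisher for the child object.
        return GetattrApp(child)(environ, start_response)

    def publish(self, start_response):
        obj = self.obj
        body = str(obj() if callable(obj) else obj).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return [body]

    def not_found(self, start_response):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]


class Blog:
    def index(self):
        return "blog index"


class Root:
    blog = Blog()


def call(app, path):
    """Run one fake request and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"SCRIPT_NAME": "", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call(GetattrApp(Root()), "/blog/index"))
print(call(GetattrApp(Root()), "/missing"))
```

As in ``ObjectApp``, each step wraps the next object in a new application and hands the request on, so no single object ever parses the whole path.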