Posts Tagged ‘Python’

How to extract just the text from html page articles

Saturday, October 27th, 2012

One of the reasons I keep going back to Python is the lxml library.

Not only is it terrific in terms of handling xml, it can do wonders with html of all flavors, even badly-formed and specification-invalid html data.

A common task I have these days is to grab the text from an html page or article (e.g., in curating content for Macaronics).

As this gist shows, lxml makes this dead simple, using xpath and the “descendant-or-self::” axis selector.

The only real work is understanding the page structure and creating the correct xpath expression for each site (the readability algorithm is essentially a collection of these rules), and monitoring their changes over time so that the xpath expression can be updated accordingly.

Another bonus is that it works with foreign language sites, too, provided the parser is passed the same encoding as defined in the target page’s Content-Type meta tag.
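As a rough sketch of the approach (not the code in the gist): the function name grab_text, the sample markup, and the xpath expression below are assumptions for illustration, and the code is written for Python 3 with lxml installed. It passes the page's encoding to the parser and joins the text nodes found along the descendant-or-self axis of each matched element:

```python
from lxml import html

def grab_text(page_bytes, xpath_expr, encoding='utf-8'):
    """Return the text under every element matched by xpath_expr, one match per line."""
    # tell the parser which encoding the page declares
    parser = html.HTMLParser(encoding=encoding)
    tree = html.fromstring(page_bytes, parser=parser)
    blocks = []
    for node in tree.xpath(xpath_expr):
        # every text node at or below the matched element, in document order
        text = ''.join(node.xpath('descendant-or-self::text()')).strip()
        if text:
            blocks.append(text)
    return '\n'.join(blocks)

# Hypothetical page: the real expression depends on each site's structure
SAMPLE = (u'<html><body><div id="nav">Home</div>'
          u'<div class="article"><h1>\u3053\u308c\u306f\u30da\u30f3\u3067\u3059</h1>'
          u'<p>First <b>bold</b> line.</p></div></body></html>').encode('utf-8')

article_text = grab_text(SAMPLE, '//div[@class="article"]/*')
```

The navigation text is skipped because the expression only selects children of the article div; each site would need its own expression.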

Here’s an example of grabbing the text from a web article by Facta, a Japanese business magazine, and saving it as a text file, so I can add it to the list of articles in Macaronics:

>>> import urllib, text_grabber
>>> data=urllib.urlopen('http://facta.co.jp/article/201211043-print.html').read()
>>> t=text_grabber.facta_print(data)
>>> import codecs
>>> f=codecs.open('facta-201211043-print.txt', 'w', 'utf-8'); f.write(t); f.close()

Using Microsoft’s Translator API with Python

Monday, May 7th, 2012

Before Macaronics, I experimented with automated machine translation.

Microsoft provides a Translator API which performs machine translation on any natural language text.

Unlike Google’s paid Translation API, Microsoft offers a free tier in theirs, for up to 2 million characters per month.

I found the signup somewhat confusing, though, since I had to create more than one account and register for a couple of different services:

  1. I had to register for a Windows Live ID
  2. While logged in with my Live ID, I needed to create an account at the Azure Data Market
  3. Next, I had to go to the Microsoft Translator Data Service and pick a plan (I chose the free, 2 million characters per month option)
  4. Finally, I had to register an Azure Application (since I was testing, I didn’t want to use a public url, and fortunately that form accepted ‘localhost’, though it insisted on my using ‘https’ in the definition)

The last form, i.e., the Azure Application registration, provides two critical fields for API access:

  • Client ID — this is any old string I want to use as an identifier (i.e., I choose it)
  • Client Secret — this is provided by the form and cannot be changed

With all the registrations out of the way, it was time to try a few translations.

The technical docs were well-written, but since there was nothing for Python, I’ve included an example for accessing the HTTP Interface.

My code is based on Doug Hellmann’s article on urllib2, enhanced with Michael Foord’s examples for error-handling urllib2 requests.

Here’s a simple usage example from Japanese to English, in the Python REPL:

>>> import msmt
>>> token = msmt.get_access_token(MY_CLIENT_ID, MY_CLIENT_SECRET)
>>> msmt.translate(token, 'これはペンです', 'en', 'ja')
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">This is a pen</string>

The API returns XML, so a final processing step for a real program would be to use something like lxml to parse out the translation result.

Here’s a snippet for getting just the translated result out of the XML object returned by the API.
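A minimal sketch of that step, using the standard library's xml.etree.ElementTree rather than lxml and written for Python 3 (the function name extract_translation is an assumption for illustration):

```python
import xml.etree.ElementTree as ET

def extract_translation(xml_reply):
    """Return the text inside the single <string> element sent back by the API."""
    root = ET.fromstring(xml_reply)
    # the reply is one element, so its text is the translation itself
    return root.text

reply = ('<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">'
         'This is a pen</string>')
translation = extract_translation(reply)
```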

In the case of the example above, this is just the classic[1] phrase:

This is a pen

[1] It’s classic in that “This is a pen” is the first English sentence Japanese students learn in school (or so I’m told)

Re-creating Mailinator in Python

Friday, November 11th, 2011

Update: February 21, 2012

I’ve extended this concept into a framework for creating an intelligent email-based agent server, whereby email sent to designated inboxes gets dynamic or custom replies.

It’s the same logic used by the TeamWork.io web service and I’ve decided to open source it on github: https://github.com/dpapathanasiou/intelligent-smtp-responder


Paul Tyma, the creator of Mailinator, once wrote about its architecture. He said that after starting with sendmail, he found it necessary to write his own SMTP server from scratch.

While he never released the Java source code of his server, I wanted to see if I could re-create it using Python, since I also wanted to understand how state machines work in that language.

The Basic Server

To start, I needed some code that would listen on a specific port, and read and respond to clients.

Python’s SocketServer module makes this simple.

Here, in a few lines, is a multi-threaded TCP server that listens on port 8888 of the local machine and echoes back what a connected client sends to it:

#!/usr/bin/python
import SocketServer
cr_lf = "\r\n"
class SMTPRequestHandler (SocketServer.StreamRequestHandler):
    def handle (self):
        try:
            while 1:
                client_msg = self.rfile.readline()
                if not client_msg:
                    break # an empty read means the client has disconnected
                self.wfile.write(client_msg.rstrip()+cr_lf) # a simple echo
        except Exception, e:
            print e
# server hostname and port to listen on
server_config = ('localhost', 8888) 
if __name__ == '__main__':
    tcpserver = SocketServer.ThreadingTCPServer(server_config, SMTPRequestHandler) 
    tcpserver.serve_forever() 

Start it from a command line prompt (if the port number you choose is below 1024, then you need to do this as root):

$ python server.py

And test it using telnet:

$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
This is an echo
This is an echo
Ok, I get it
Ok, I get it
What next?
What next?

Handling SMTP

Now I needed to be able to understand and reply to SMTP requests. The protocol is fairly simple, with only a handful of commands.

Each command consists of four letters, which appear at the start of the line sent by the client, and each line is terminated with “\r\n”.

SMTP commands

Tyma did not, however, implement the full list of SMTP commands, since RSET (Reset), VRFY (Verify), NOOP (No operation), and others are used by spammers to abuse or even take over a server, and are rarely required by legitimate email clients.

The server needs to be able to handle the basic interaction, so HELO (Hello) / EHLO (Extended Hello), MAIL (Mail from), RCPT TO (Recipient To), and DATA all need to be supported.

At first glance, it’s tempting to try to implement it like this:

class SMTPRequestHandler (SocketServer.StreamRequestHandler):
    def handle (self):
        try:
            data = {}
            while 1:
                client_msg = self.rfile.readline()
                if client_msg.startswith('MAIL FROM:'):
                    data['sender'] = get_email_address(client_msg)
                elif client_msg.startswith('RCPT TO:'):
                    data['recipient'] = get_email_address(client_msg)
                ...
                elif client_msg.startswith('QUIT'):
                    break
        except Exception, e:
            print e

Where get_email_address() is defined as, for example, something like this:

def get_email_address (s):
    """Parse out the first email address found in the string and return it"""
    for token in s.split():
        if token.find('@') > -1:
            # token will be in the form:
            # 'FROM:<user@example.com>' or 'TO:<user@example.com>'
            # and with or without the <>
            for email_part in token.split(':'):
                if email_part.find('@') > -1:
                    return email_part.strip('<>')
    return None # no email address was found

But this gets messy in a hurry. While some commands fit within the neat single-line /^CMND rest of data\r\n/ pattern, others do not.

RCPT, for example, can be repeated multiple times, and once DATA is seen, every subsequent line must be collected until the final /^\.$/ appears.
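The DATA rule alone needs a loop of its own. A minimal sketch in Python 3 (read_data_section is a hypothetical helper, and dot-stuffed lines are left exactly as received):

```python
import io

def read_data_section(rfile):
    """Collect lines from rfile until a line holding only '.' is read."""
    lines = []
    while True:
        raw = rfile.readline()
        if not raw:
            break  # the client went away before sending the final '.'
        line = raw.rstrip(b'\r\n')
        if line == b'.':
            break  # end of the message body
        lines.append(line)
    return lines

# an in-memory stand-in for the client's socket stream
sample = io.BytesIO(b'Subject: hi\r\n\r\nHello\r\n.\r\nQUIT\r\n')
body = read_data_section(sample)
```

Whatever follows the final '.' (here, QUIT) is left unread on the stream for the next step to handle.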

State Machines to the rescue

A state machine provides a much better way of handling SMTP requests. In his excellent article, David Mertz defines a state machine as:

a directed graph, consisting of a set of nodes and a corresponding set of transition functions. The machine “runs” by responding to a series of events. Each event is in the domain of the transition function belonging to the “current” node, where the function’s range is a subset of the nodes. The function returns the “next” (perhaps the same) node. At least one of these nodes must be an end-state. When an end-state is reached, the machine stops.

And that corresponds exactly to what happens when a client interacts with an SMTP server:

[Figure: SMTP state diagram]

Brass Tacks

Creating a state machine in Python is simple, since Python treats functions as first-class objects that can be passed around like any other value. The statemachine.py implementation in Mertz’s article was done in just a few lines of code.
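To make the idea concrete, here is a minimal sketch of such a class. It is not Mertz’s statemachine.py; the method names simply mirror the calls used later in this post, the error checks and the returned cargo are my own assumptions, and the counting handler is made up for the demonstration:

```python
class StateMachine(object):
    """Run handler functions until one of them returns the name of an end state."""

    def __init__(self):
        self.handlers = {}
        self.start_state = None
        self.end_states = set()

    def add_state(self, name, handler, end_state=False):
        self.handlers[name] = handler
        if end_state:
            self.end_states.add(name)

    def set_start(self, name):
        self.start_state = name

    def run(self, cargo):
        if self.start_state is None:
            raise RuntimeError('call set_start() before run()')
        if not self.end_states:
            raise RuntimeError('at least one state must be an end state')
        state = self.start_state
        while state not in self.end_states:
            # each handler returns (name of the next state, cargo to pass along)
            state, cargo = self.handlers[state](cargo)
        return cargo

def count_up(n):
    """Toy handler: loop back to itself until the counter reaches 3."""
    if n >= 3:
        return ('done', n)
    return ('count_up', n + 1)

m = StateMachine()
m.add_state('count_up', count_up)
m.add_state('done', None, end_state=1)
m.set_start('count_up')
result = m.run(0)
```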

To handle each SMTP node, I defined a series of functions, one for each server response or command.

Here are the function prototypes, where the cargo parameter is a tuple containing both the stream from which requests are read and to which responses are written, and a dict of data collected from the request:

def greeting (cargo):
def helo (cargo):
def mail (cargo):
def rcpt (cargo):
def data (cargo):
def process (cargo):

The state machine is defined within the SMTPRequestHandler class like this:

class SMTPRequestHandler (SocketServer.StreamRequestHandler):
    def handle (self):
        try:
            m = StateMachine()
            m.add_state('greeting', greeting)
            m.add_state('helo', helo)
            m.add_state('mail', mail)
            m.add_state('rcpt', rcpt)
            m.add_state('data', data)
            m.add_state('process', process)
            m.add_state('done', None, end_state=1)
            m.set_start('greeting')
            m.run((self, {}))
        except Exception, e:
            print e

So that each function knows how to recognize its assigned command, I defined and compiled these regular expressions. These are created as globals, since it’s more efficient to compile them once, and have each subsequent function call use the already-existing version.

import re
helo_pattern = re.compile('^HELO', re.IGNORECASE)
ehlo_pattern = re.compile('^EHLO', re.IGNORECASE)
mail_pattern = re.compile('^MAIL', re.IGNORECASE)
rcpt_pattern = re.compile('^RCPT', re.IGNORECASE)
data_pattern = re.compile('^DATA', re.IGNORECASE)
end_pattern = re.compile(r'^\.$') # a lone '.' (escaped, so it does not match any single character)

The greeting() function, which begins the interaction with the client, sends a simple message and passes control to the helo() function. It looks like this:

def greeting (cargo):
    stream = cargo[0]
    stream.wfile.write('220 localhost SMTP'+cr_lf)
    return ('helo', cargo)

Later in the sequence, the mail() function, which is the first node from which data is collected (in this case, the email address of the sender), is the first to save information in the cargo’s dict. It looks like this:

def mail (cargo):
    stream = cargo[0]
    client_msg = stream.rfile.readline()
    if mail_pattern.search(client_msg):
        sender = get_email_address(client_msg)
        if sender is None:
            stream.wfile.write(bad_request+cr_lf)
            return ('done', cargo)
        else:
            email_data = cargo[1]
            email_data['sender'] = sender
            return ('rcpt', (stream, email_data))
    else:
        stream.wfile.write(bad_request+cr_lf)
        return ('done', cargo)        

Here, if the request is not recognized or invalid, the client sees the bad_request message, and the connection is closed, since control passes to the done end-state.

I followed Tyma’s example and defined bad_request as “550 No such user” (which, as he notes, is ironic, since Mailinator accepts email sent to any user).

It also doesn’t conform to the protocol, since I’m supposed to give different error messages at different nodes, but since clients are always disconnected after any type of invalid request, it hardly matters what they see in that scenario.
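The same pattern extends to the other nodes. As a sketch only (the rcpt() body, the 250 and 354 reply texts, the address regex and the FakeStream helper are assumptions, not the original code, and it is written for Python 3 with text streams to keep it short), a node can return its own name to stay in the same state, which is how repeated RCPT commands can be accepted before DATA moves things along:

```python
import io
import re

cr_lf = '\r\n'
bad_request = '550 No such user'
rcpt_pattern = re.compile('^RCPT', re.IGNORECASE)
data_pattern = re.compile('^DATA', re.IGNORECASE)
address_pattern = re.compile(r'<?([^\s<>:]+@[^\s<>:]+)>?')

def rcpt(cargo):
    stream, email_data = cargo
    client_msg = stream.rfile.readline()
    if rcpt_pattern.search(client_msg):
        found = address_pattern.search(client_msg)
        if found is None:
            stream.wfile.write(bad_request + cr_lf)
            return ('done', cargo)
        email_data.setdefault('recipients', []).append(found.group(1))
        stream.wfile.write('250 Ok' + cr_lf)
        return ('rcpt', cargo)  # stay in this state: more RCPT lines may follow
    if data_pattern.search(client_msg) and email_data.get('recipients'):
        stream.wfile.write('354 End data with <CR><LF>.<CR><LF>' + cr_lf)
        return ('data', cargo)  # at least one recipient was seen, so move on
    stream.wfile.write(bad_request + cr_lf)
    return ('done', cargo)

class FakeStream(object):
    """In-memory stand-in for the request handler's rfile and wfile."""
    def __init__(self, text):
        self.rfile = io.StringIO(text)
        self.wfile = io.StringIO()

stream = FakeStream('RCPT TO:<a@example.com>\r\nrcpt to: b@example.com\r\nDATA\r\n')
cargo = (stream, {})
states = []
for _ in range(3):
    state, cargo = rcpt(cargo)
    states.append(state)
```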

If a client is well-behaved, the final method called is process() which decides what to do with the client’s email. The data dict will contain three parameters: ‘sender’ (the email address of the sender), ‘recipients’ (a list of email addresses), and ‘data’ (the contents which followed the DATA command ahead of the final ‘.’).

def process (cargo):
    email_data = cargo[1]
    # do something with the email_data dict here
    return ('done', cargo)

Basically, this is where the data can be saved to disk/db (so that it can be served by a web browser later, e.g.), MIME-parsed (to remove attachments, etc.), or just trashed (if you have reason to believe the sender is a spambot or zombie network, e.g.).

Tyma describes various measures for dealing with attacks from spambots and zombies, which I haven’t implemented here but which would be relatively easy to add to both the data() and process() functions.

Obtaining the ip address of the client is done using the stream.client_address[0] attribute.

 

Quoted Name Searching in Pyparsing with searchparser.py

Friday, September 17th, 2010

The searchparser.py module has a flaw when it comes to quoted phrases with punctuation.

Searching for something like this:

“C. Montgomery Burns, Esq.”

results in a nasty stacktrace:

Traceback (most recent call last):
  File "searchparser.py", line 302, in <module>
    if ParserTest().Test():
  File "searchparser.py", line 289, in Test
    r = self.Parse(item)
  File "searchparser.py", line 170, in Parse
    return self.evaluate(self._parser(query)[0])
  File "/var/lib/python-support/python2.5/pyparsing.py", line 1049, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/var/lib/python-support/python2.5/pyparsing.py", line 925, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/var/lib/python-support/python2.5/pyparsing.py", line 2560, in parseImpl
    return self.expr._parse( instring, loc, doActions, callPreParse=False )
  File "/var/lib/python-support/python2.5/pyparsing.py", line 925, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/var/lib/python-support/python2.5/pyparsing.py", line 2431, in parseImpl
    raise maxException
pyparsing.ParseException: Expected """ (at char 2), (line:1, col:3)

The reason is that words in searchparser.py are defined as consisting of just letters and numbers (or alphanums, in pyparsing-speak) in lines 94 and 95:

Word(alphanums)

The solution is to define a string which contains all the possible punctuation characters to expect in a quoted search string, and include it in the parser grammar for word.

For example, for people’s names, the likely punctuation characters to expect are:

punctuation = ",.'`&-"

So, using that definition at the start of the parser(self) method, we edit the lines that define operatorWord to look like this:

operatorWord = Group(Combine(Word(alphanums+punctuation) + Suppress('*'))).setResultsName('wordwildcard') | \
                    Group(Word(alphanums+punctuation)).setResultsName('word')

With that change, we can find Monty Burns using his exact name, periods and commas included.



Creating and Sending Emails with File Attachments in Python

Saturday, September 4th, 2010

This is a good example of how to create and send emails with file attachments in Python.

It assumes, however, that every attached file is binary, and uses a generic application/octet-stream mimetype for each.

It also makes encoding assumptions about the strings passed to the Subject line, recipient email addresses, and the message body text which may be incorrect in actual use.

This updated version addresses those potential problems, no pun intended.

Attached File Mimetype

Instead of assuming application/octet-stream, we can use the python-magic module (available here, here, or by apt-get install python-magic on Debian and Ubuntu systems) to determine it explicitly:

MIME_MAGIC = None
try:
    import magic
    MIME_MAGIC = magic.open(magic.MAGIC_MIME)
    MIME_MAGIC.load()
except ImportError:
    pass
    
def get_file_mimetype (filename):
    """Return the mimetype string for this file"""
    result = None
    if MIME_MAGIC:
        try:
            result = MIME_MAGIC.file(filename)
        except (IOError):
            pass
    return result

The get_file_mimetype() function returns a string in the form 'Content-Type major type/Content-Type minor type', e.g. 'text/plain charset=us-ascii', 'application/pdf', etc.
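Because the string may carry a trailing charset parameter, it helps to isolate the major and minor types before using them. A minimal sketch (split_mimetype is a hypothetical helper, and falling back to application/octet-stream is my assumption about the sensible default):

```python
def split_mimetype(mimestring):
    """Return (major, minor) from a string such as 'text/plain charset=us-ascii'."""
    fallback = ('application', 'octet-stream')
    if not mimestring:
        return fallback
    # keep only the 'major/minor' token, dropping any charset parameter
    tokens = mimestring.replace(';', ' ').split()
    if not tokens:
        return fallback
    major, sep, minor = tokens[0].partition('/')
    if not (major and sep and minor):
        return fallback
    return (major, minor)

examples = [split_mimetype(s) for s in
            ('text/plain charset=us-ascii', 'text/html; charset=utf-8',
             'application/pdf', None, 'garbage')]
```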

Now we can change the loop that attaches files to the message from this:

    for file in files:
        part = MIMEBase('application', "octet-stream")
        part.set_payload( open(file,"rb").read() )
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition', 'attachment; filename="%s"'
                       % os.path.basename(file))
        msg.attach(part)

To this, below. Note that since the mimetype can be text, we change the file read flag from “rb” (binary) to just “r”, as necessary.

    for file in files:
 
        file_read_flags = "rb"
        try:
            mimestring = get_file_mimetype(file)
            if mimestring.startswith('text'):
                file_read_flags = "r"
            # keep only 'major/minor', dropping any trailing charset parameter
            mimestring_parts = mimestring.replace(';', ' ').split()[0].split('/')
            part = MIMEBase(mimestring_parts[0], mimestring_parts[1])
        except (AttributeError, IndexError):
            # cannot determine the mimetype so use the generic 'application/octet-stream'
            part = MIMEBase('application', 'octet-stream')

        part.set_payload( open(file, file_read_flags).read() )
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition', 'attachment; filename="%s"'
                       % os.path.basename(file))
        msg.attach(part)

Subject Encoding

The original version used a simple assignment to define the Subject line:

    msg['Subject'] = subject

But this limits the Subject line to 7-bit ASCII characters only. For foreign language support and other encodings, it’s better to use the email.Header module, which requires an additional import:

from email.Header import Header

The Subject line assignment changes to:

    # always pass Unicode strings to Header, otherwise it will use RFC 2047 encoding even on plain ASCII strings
    msg['Subject'] = Header(to_unicode(subject), 'iso-8859-1') 

Where the to_unicode() function is defined as:

def to_unicode (s):
    """Convert the given byte string to unicode, using the standard encoding,
    unless it's already encoded that way"""
    if s:
        if isinstance(s, unicode):
            return s
        else:
            return unicode(s, 'utf-8')
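To see the effect of Header on a non-ASCII subject, here is a small round trip. It is written for Python 3, where the module is spelled email.header, and the sample subject is made up:

```python
from email.header import Header, decode_header

subject = u'R\u00e9sum\u00e9 attached'

# Header produces an RFC 2047 encoded word that is safe to put on the wire
encoded = Header(subject, 'iso-8859-1').encode()

# decode_header reverses the process, returning (bytes, charset) pairs
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
```

The encoded form contains only ASCII characters, and decoding it gives back the original subject.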

Email Address Encoding

Unlike the Subject line, all email addresses must be ascii, so instead of defining the recipient list like this:

    msg['To'] = COMMASPACE.join(to)

We should map an explicit ascii encoding function over each email address, like this:

    msg['To'] = COMMASPACE.join(map(lambda x: x.encode('ascii'), to)) 

Body Text Encoding

Finally, the message body text, whether it’s plain text, html, or both, must be passed as a UTF-8 encoded bytestring, with the charset declared explicitly. So we go from this:

    msg.attach( MIMEText(text) )

To this:

    msg.attach(MIMEText(to_bytestring(text), 'plain', 'utf-8'))

If we want an html message body, we would do this:

    msg.attach(MIMEText(to_bytestring(html_text), 'html', 'utf-8'))

Actually, if you are going to use html in email messages at all, the best practice is to provide both a plain text and an html equivalent together, like this:

    msg.attach(MIMEText(to_bytestring(text), 'plain', 'utf-8'))
    msg.attach(MIMEText(to_bytestring(html_text), 'html', 'utf-8'))

In all the examples above, the to_bytestring() function is defined as:

def to_bytestring (s):
    """Convert the given unicode string to a bytestring, using the standard encoding,
    unless it's already a bytestring"""
    if s:
        if isinstance(s, str):
            return s
        else:
            return s.encode('utf-8')

A Complete Example

Putting it all together, this function lets you send the same email to multiple recipients, with optional files (binary or text) as attachments, and an optional message body in html.

It also allows you to define the “Reply-To” header of the message as an email address different from the one used to send the message.

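import os, smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.Header import Header
from email.Utils import COMMASPACE, formatdate
from email import Encoders

mail_server = 'localhost' # assumption: replace with the hostname of your SMTP server

# parameters without defaults must come before those with defaults
def send(sender, subject, recipient_list, text, html=None, files=(), replyto=None):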
    """Send a message to the given recipient list, with the optionally attached files"""
    msg = MIMEMultipart('alternative')
    msg['From'] = sender.encode('ascii') 
    # make sure email addresses do not contain non-ASCII characters
    msg['To'] = COMMASPACE.join(map(lambda x: x.encode('ascii'), recipient_list)) 
    if replyto:
        # make sure email addresses do not contain non-ASCII characters
        msg['Reply-To'] = replyto.encode('ascii') 
    msg['Date'] = formatdate(localtime=True)

    # always pass Unicode strings to Header, otherwise it will use RFC 2047 encoding even on plain ASCII strings
    msg['Subject'] = Header(to_unicode(subject), 'iso-8859-1') 

    # always use Unicode for the body text, both plain and html content types
    msg.attach(MIMEText(to_bytestring(text), 'plain', 'utf-8'))
    if html:
        msg.attach(MIMEText(to_bytestring(html), 'html', 'utf-8'))

    for file in files:
 
        file_read_flags = "rb"
        try:
            mimestring = get_file_mimetype(file)
            if mimestring.startswith('text'):
                file_read_flags = "r"
            # keep only 'major/minor', dropping any trailing charset parameter
            mimestring_parts = mimestring.replace(';', ' ').split()[0].split('/')
            part = MIMEBase(mimestring_parts[0], mimestring_parts[1])
        except (AttributeError, IndexError):
            # cannot determine the mimetype so use the generic 'application/octet-stream'
            part = MIMEBase('application', 'octet-stream')

        part.set_payload( open(file, file_read_flags).read() )
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition', 'attachment; filename="%s"'
                       % os.path.basename(file))
        msg.attach(part)

    smtp = smtplib.SMTP(mail_server)
    smtp.sendmail(sender, recipient_list, msg.as_string() )
    smtp.close()

python setup.py uninstall

Tuesday, July 20th, 2010

Python’s distutils mechanism makes distributing and installing modules simple.

In most cases, either

python setup.py build
python setup.py install

or just

python setup.py install

is all that’s necessary.

Unfortunately (and somewhat surprisingly), there’s no uninstall option specified.

Manually deleting the .egg-info file and corresponding folder from the python site-packages folder is one way, but if the installer used an alternative or custom setup, then there is no way to be sure all the associated files and dependencies are gone.

The way around this is to use the --record switch with setup.py at install, which will log all the files corresponding to the module:

python setup.py install --record files.txt

Then, to uninstall (either ahead of a version upgrade or outright deletion), just use the contents of files.txt to guide the removal:

cat files.txt | xargs rm -rf
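The same cleanup can be done from Python. A minimal sketch for Python 3 (remove_recorded_files is a hypothetical helper; unlike rm -rf it only deletes regular files and leaves directories alone):

```python
import os

def remove_recorded_files(record_path):
    """Delete every file listed, one path per line, in record_path."""
    removed = []
    with open(record_path) as record:
        for line in record:
            path = line.strip()
            # skip blank lines and anything that is already gone
            if path and os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return removed
```

Calling remove_recorded_files('files.txt') returns the list of paths it actually deleted, which makes it easy to check the result before removing any leftover directories by hand.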

Hat tip to Michal Čihař, via Stack Overflow.

Search Engine Style Query Handling

Wednesday, June 16th, 2010

The pyparsing library is a terrific way of parsing and executing grammars.

It’s yet another reason I continue to work more and more in Python at the expense of Common Lisp, despite Python’s pedigree as a language for teaching programming to the uninitiated.

Among the examples in the wiki is searchparser.py which adapts pyparsing to the task of handling full-text queries in the way most search engines do: exact phrases in quotes, multiple phrases grouped by parentheses, compound queries joined by “AND”, “OR”, and “NOT” operators recursively, etc.

After experimenting with it for a while, there was one change I made which seemed an improvement over the original:

The evaluateQuotes() method takes an argument, which represents the string containing an exact phrase defined by quotes in the original query.

def evaluateQuotes(self, argument):
    """Evaluate quoted strings

    First it does an 'and' on the individual search terms, then it asks the
    function GetQuoted to only return the subset of ID's that contain the
    literal string.
    """
    r = Set()
    search_terms = []
    for item in argument:
        search_terms.append(item[0])
        if len(r) == 0:
            r = self.evaluate(item)
        else:
            r = r.intersection(self.evaluate(item))
    return self.GetQuotes(' '.join(search_terms), r)

As the documentation says, it looks up each individual word of the phrase first, and then invokes GetQuotes() with two parameters: the entire phrase string, and the result of all the individual lookups which were common to every word in the phrase.

If, however, the underlying data structure supports the idea of finding an exact phrase within a block of text efficiently, then there is no need to look up each word of the larger phrase individually.

So evaluateQuotes() can be simplified to:

def evaluateQuotes(self, argument):
    """Evaluate quoted strings by invoking GetQuotes() on the entire quoted term"""
    search_terms = []
    for item in argument:
        search_terms.append(item[0])
    return self.GetQuotes(' '.join(search_terms))

The signature for the GetQuotes() method becomes:

def GetQuotes(self, search_string):

And finally, implementing GetQuotes() is simple: all it has to do is return a set containing occurrences of the exact search_string within the database.
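As an illustration only, with an in-memory dict standing in for the database (the PhraseIndex class, its documents attribute and the sample data are assumptions, and a real backend would use its own phrase search instead of a substring scan):

```python
class PhraseIndex(object):
    """Toy backend: maps document ids to their full text."""

    def __init__(self, documents):
        self.documents = documents

    def GetQuotes(self, search_string):
        """Return the set of ids whose text contains the exact phrase."""
        needle = search_string.lower()
        return set(doc_id for doc_id, text in self.documents.items()
                   if needle in text.lower())

index = PhraseIndex({
    1: 'Release the hounds, said Montgomery Burns',
    2: 'Burns Montgomery is not the same phrase',
    3: 'Nothing to see here',
})
matches = index.GetQuotes('montgomery burns')
```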

Using reCaptcha with a Python Server

Wednesday, August 19th, 2009

Here’s how to handle reCaptcha form submissions in Python, on the server side:

#!/usr/bin/python

import urllib, urllib2

recaptcha_private_key = '...[your private key goes here]...'

recaptcha_server_name = 'http://www.google.com/recaptcha/api/verify'
recaptcha_server_form = 'https://www.google.com/recaptcha/api/challenge'

def check (client_ip_address, recaptcha_challenge_field, recaptcha_response_field):
    """Return the recaptcha reply for the client's challenge responses"""
    params = urllib.urlencode(dict(privatekey=recaptcha_private_key,
                                   remoteip=client_ip_address,
                                   challenge=recaptcha_challenge_field,
                                   response=to_bytestring(recaptcha_response_field)))
    data = None
    try:
        f = urllib2.urlopen(recaptcha_server_name, params)
        data = f.read()
        f.close()
    # both exception classes live in urllib2, so they must be qualified
    except urllib2.HTTPError:
        pass
    except urllib2.URLError:
        pass
    return data

def confirm (client_ip_address, recaptcha_challenge_field, recaptcha_response_field):
    """Return True/False based on the recaptcha server's reply"""
    result = False
    reply = check (client_ip_address, recaptcha_challenge_field, recaptcha_response_field)
    if reply:
        if reply.lower().startswith('true'):
            result = True
    return result

Just call confirm (client_ip_address, recaptcha_challenge_field, recaptcha_response_field) to get the result, either True (i.e., the captcha was completed correctly) or False.

The client_ip_address, recaptcha_challenge_field, and recaptcha_response_field values are provided by parsing the results of POSTing the form (e.g., using a web server module like mod_python or mod_wsgi).

And the to_bytestring() function, which is defined as follows, ensures that any unicode characters (which do appear occasionally in captchas) get handled correctly:

def to_bytestring (s):
    """Convert the given unicode string to a bytestring, using the standard encoding,
    unless it's already a bytestring"""
    if s:
        if isinstance(s, str):
            return s
        else:
            return s.encode('utf-8')

This code is now available for download from github.


Update August 3, 2012: Since I’m now learning Go, I’ve also created go-recaptcha, a package version in that language #golang

 

Date and Time Representation in Python

Thursday, August 13th, 2009

This is a good summary of all the date and time possibilities in Python, with examples.

Handling File Uploads from a Flex client using Python

Thursday, June 18th, 2009

Handling file uploads in python is fairly simple.

The server receives a file object which contains both the file’s name and its data content.

Reading the file object depends on the html form that sends the request to the server:

<form enctype="multipart/form-data" action="save_file.py" method="post">
<p>File: <input type="file" name="filename"></p>
<p><input type="submit" value="Upload"></p>
</form>

Since, in this example, the input variable whose type is “file” is named “filename”, the file object is accessed by reading form["filename"] on the server.

The file’s name string is obtained by reading the form["filename"].filename attribute, and form["filename"].file contains the actual file data.

Of course, “filename” is arbitrary, and can be anything, as long as the server is synced with the client form.

Or so I thought until I tried handling a file upload from an Adobe Flex client.

Flex doesn’t bother with html forms, of course, and has its own logic for handling file uploads from the client.

What’s not clear in those docs (though it is referenced somewhat opaquely in the FileReference class sample http post) is that Flex will send the file object as “Filedata”.

So the python code on the server must use form["Filedata"] or the upload won’t work.
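One way to cope with both kinds of client is to look for either field name. A minimal sketch in Python 3, assuming the parsed form behaves like a mapping whose entries expose filename and file attributes, as described above (get_upload and FakeItem are hypothetical names used only for this example):

```python
import io

def get_upload(form, field_names=('Filedata', 'filename')):
    """Return (name, data) for the first upload field present in form, else None."""
    for field in field_names:
        if field in form:
            item = form[field]
            return (item.filename, item.file.read())
    return None

class FakeItem(object):
    """Stand-in for one uploaded-file entry of a parsed form."""
    def __init__(self, filename, data):
        self.filename = filename
        self.file = io.BytesIO(data)

# a Flex client sends 'Filedata'; the html form above sends 'filename'
flex_form = {'Filedata': FakeItem('photo.png', b'\x89PNG')}
html_form = {'filename': FakeItem('notes.txt', b'hello')}
```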