Rolling Shutter
September 2nd, 2010
Cause:
And Effect:
Ever since Google’s PageRank came to dominate online search, scammers have tried to promote their own sites (usually related to porn, prescription drugs, mortgage rip-offs, etc.) by creating as many links from other sites as they can.
After starting with web guestbooks (which were all the rage in the GeoCities era), they moved to blog comments.
The problem was so bad a few years ago that Jeremy Zawodny declared PageRank dead.
Fortunately, both Google and various blogging platforms made improvements in terms of how they identified comment spam, and now most of it gets trapped easily.
In fact, it’s amusing to read through the spam logs and see what kinds of messages the commenters (usually robots, but not always) leave for me to approve.
Here’s a compendium of the different types of comment spam messages I’ve gotten recently. I’ve preserved the spelling and grammar as-is.
1. Straight-up Spammers
These comments make no bones about the spam they’re promoting. The text of the comment has nothing at all to do with the post or the blog, and there’s a direct link in the comment text to the url they want people and search engine spiders to visit.
The amusing thing about this type of comment is that most of them use “Google.com” as their homepage url.
Hey guys i would like to tell you about this [spam site link goes here]
Heya, I have found a great way to [a brief description of "amazing" results with a spam link]
Hi guys im so excited, i recently [you know what goes here]
2. Flatterers
These comments are more subtle.
At first glance, they seem legitimate, because they’re laudatory and brief (which gets them off the hook in terms of matching the context of the original post).
Instead, the spam is in their homepage urls.
I suppose the idea is that I would get taken in by the nice sentiment and wouldn’t notice where it links back to.
ya — nice blog. i love it
Usually I do not write-up on blogs, but I wish to say that this article quite forced me to perform so! Thanks, extremely nice article.
I find you entry interesting do I’ve added the track to it on our blog
Great blog! I definitely love how it is easy on my eyes and also the data are well written.
I find you entry interesting do I’ve added the track to it on our blog
amazing thanks already bookmark
Keep up the amazing work!! I love how you wrote this and I also like the colors here on this site. Very good opinions expressed here
¡Gracias!
3. Flatterers with Comprehension Problems
Are these hoping for an approval AND a reply, perhaps as a signal to the spambots to post another spam comment under the same post?
I would need to be significantly dumber than I am now to fall for that.
A mutual associate sent me the url to your blog. I like how you really focus and get to the point but can you run through that last part again?
Fantastic blog post, thanks. Can you expand on the second para-graph in a little more detail please?
4. The Indignants
These are straight-up spammers who are angry at me for deleting their original comment (which never existed, of course).
I’m not sure of the psychology behind this concept; even if I believed they came from real people, I’m not sure why I’m supposed to unblock them now.
Why have you deleted my comment? It’s in fact useful unlike almost all the comments posted here… I’m going to post it again please don’t remove it as lots of people will discover it very valuable. Hey guys [spam pitch and link follows]
Why did you remove my post… My post was actually useful unlike most of these comments. Ill post it again. Hello, I have been using [spam]
5. Non-Sequiturs
These make no sense at all, perhaps deliberately as a means of avoiding Bayesian filters. As with the flatterers, the spam is in the homepage url.
These have been on the decline recently.
Compassion, empathy and recognition of others’ humanity is on the decline in this country. Along with these declines will be the decline of our democracy. I believe that medialoid [it goes on like this for a while]
well… good luck finding a free of charge minute. I hope you’ve better luck doing that than I do.
6. Internet Meme Echoes
A newer trend is to take a meme or tweet that has been recently popular and echo it as a comment. The comment itself is innocuous or funny, etc, but the spam is in the homepage url.
I’m not sure what the idea behind these is, either. Other than relying on me to ignore where the commenter’s homepage links back to, why am I supposed to allow this comment?
To show that I’m hip, I’m with it?
Welcome to the new decade: Java is a restricted platform, Google is evil, Apple is a monopoly and Microsoft are the underdogs
PDFMiner is a pdf parsing library written in Python by Yusuke Shinyama.
In addition to the pdf2txt.py and dumppdf.py command line tools, there is a way of analyzing the content tree of each page.
Since that’s exactly the kind of programmatic parsing I wanted to use PDFMiner for, this is a more complete example, which continues where the default documentation stops.
This example is still a work-in-progress, with room for improvement.
In the next few sections, I describe how I built up each function, resolving problems I encountered along the way. The impatient can just download the completed file here instead.
Basic Framework
Here are the python imports we need for PDFMiner:
from pdfminer.pdfparser import PDFParser, PDFDocument, PDFNoOutlines
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
Note that for the layout access logic to work properly, you need to be using pdfminer version 20100619p1 (for a short time last month, there was a mismatch between the documentation, which said the latest version was pdfminer-20100424, and the source trunk; it turns out these layout examples won’t work with version 20100424 at all).
Since PDFMiner requires a series of initializations for each pdf file, I’ve started with this wrapper (Lisp macro style) function to take care of the basic preliminary actions (file IO, PDFMiner object creation and connection, etc.).
def with_pdf (pdf_doc, pdf_pwd, fn, *args):
    """Open the pdf document, and apply the function, returning the results"""
    result = None
    try:
        # open the pdf file
        fp = open(pdf_doc, 'rb')
        # create a parser object associated with the file object
        parser = PDFParser(fp)
        # create a PDFDocument object that stores the document structure
        doc = PDFDocument()
        # connect the parser and document objects
        parser.set_document(doc)
        doc.set_parser(parser)
        # supply the password for initialization
        doc.initialize(pdf_pwd)
        if doc.is_extractable:
            # apply the function and return the result
            result = fn(doc, *args)
        # close the pdf file
        fp.close()
    except IOError:
        # the file doesn't exist or similar problem
        pass
    return result
The first two parameters are the name of the pdf file, and its password. The third parameter, fn, is a higher-order function which takes the instance of the pdfminer.pdfparser.PDFDocument created, and applies whatever action we want (get the table of contents, walk through the pdf page by page, etc.)
The last part of the signature, *args, is an optional list of parameters that can be passed to the higher-order function as needed (I could have gone with keyword arguments here instead, but a simple list is enough for these examples).
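The wrapper pattern itself doesn't depend on PDFMiner. As a minimal sketch, here is the same idea applied to a plain text file, written in Python 3 syntax (unlike the Python 2 code in the rest of this post). The names with_file() and count_lines() are made up for illustration and are not part of the pdf code:

```python
import os
import tempfile


def with_file(path, fn, *args):
    """Open the file, apply fn to the open handle, and return the result
    (None if the file can't be opened)."""
    result = None
    try:
        with open(path, 'r') as fp:
            # any extra positional arguments are forwarded to fn untouched
            result = fn(fp, *args)
    except IOError:
        # the file doesn't exist or a similar problem
        pass
    return result


def count_lines(fp, prefix):
    """Callback: count the lines in the open file that start with prefix."""
    return sum(1 for line in fp if line.startswith(prefix))


# write a small sample file, then run the callback through the wrapper
sample = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
sample.write('# title\nbody\n# section\n')
sample.close()

print(with_file(sample.name, count_lines, '#'))
print(with_file(sample.name + '.missing', count_lines, '#'))
os.remove(sample.name)
```

The callback never has to know how the file was opened or whether it existed; the wrapper hands it an open handle plus whatever extra arguments the caller supplied.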
As a warm-up, here’s an example of how to use the with_pdf() function to fetch the table of contents from a pdf file:
def _parse_toc (doc):
    """With an open PDFDocument object, get the table of contents (toc) data
    [this is a higher-order function to be passed to with_pdf()]"""
    toc = []
    try:
        outlines = doc.get_outlines()
        for (level,title,dest,a,se) in outlines:
            toc.append( (level, title) )
    except PDFNoOutlines:
        pass
    return toc
The _parse_toc() function is the higher-order function which gets passed to with_pdf() as the fn parameter. It expects a single parameter, doc, which is the instance of the pdfminer.pdfparser.PDFDocument created within with_pdf() itself (note that if with_pdf() couldn’t find the file, then _parse_toc() doesn’t get called).
With all the PDFMiner overhead and initialization done by with_pdf(), _parse_toc() can just focus on collecting the table of contents data and returning it as a list. The get_outlines() method can raise a “PDFNoOutlines” error, so I catch it as an exception, and simply return an empty list in that case.
All that’s left to do is define the function that invokes _parse_toc() for a specific pdf file; this is also the function that any external users of this module would use to get the table of contents list. Note that the pdf password defaults to an empty string (which is what PDFMiner will use for documents that aren’t password-protected), but that can be overridden as needed.
def get_toc (pdf_doc, pdf_pwd=''):
    """Return the table of contents (toc), if any, for this pdf file"""
    return with_pdf(pdf_doc, pdf_pwd, _parse_toc)
Page Parsing
Next, onto layout analysis. Using the with_pdf() wrapper, we can reproduce the example in the documentation with this higher-order function:
def _parse_pages (doc):
    """With an open PDFDocument object, get the pages and parse each one
    [this is a higher-order function to be passed to with_pdf()]"""
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in doc.get_pages():
        interpreter.process_page(page)
        # receive the LTPage object for this page
        layout = device.get_result()
        # layout is an LTPage object which may contain child objects like LTTextBox, LTFigure, LTImage, etc.
And this external function, which defines the specific pdf file to analyze:
def get_pages (pdf_doc, pdf_pwd=''):
    """Process each of the pages in this pdf file"""
    with_pdf(pdf_doc, pdf_pwd, _parse_pages)
So far, this code doesn’t do anything exciting: it just loads each page into a pdfminer.layout.LTPage object, closes the pdf file, and exits.
Within each pdfminer.layout.LTPage instance, though, is an objs attribute, which defines the tree of pdfminer.layout.LT* child objects as in the documentation:
In this example, I’m going to collect all the text from each page in a top-down, left-to-right sequence, merging any multiple columns into a single stream of consecutive text.
The results are not always perfect, but I’m using fuzzy matching based on physical position and column width, which works well in most cases.
I’m also going to save any images found to a separate folder, and mark their position in the text with <img /> tags.
Right now, I’m only able to extract jpeg images, whereas xpdf’s pdfimages tool is capable of getting to non-jpeg images and saving them in ppm format.
I’m not sure if the problem is within PDFMiner or how I’m using it, but since someone else asked the same question in the PDFMiner mailing list, I suspect it’s the former.
This requires a few updates to the _parse_pages() function, as follows:
def _parse_pages (doc, images_folder):
    """With an open PDFDocument object, get the pages, parse each one, and return the entire text
    [this is a higher-order function to be passed to with_pdf()]"""
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text_content = [] # a list of strings, each representing text collected from each page of the doc
    for i, page in enumerate(doc.get_pages()):
        interpreter.process_page(page)
        # receive the LTPage object for this page
        layout = device.get_result()
        # layout is an LTPage object which may contain child objects like LTTextBox, LTFigure, LTImage, etc.
        text_content.append(parse_lt_objs(layout.objs, (i+1), images_folder))
    return text_content
and the updated get_pages() function becomes:
def get_pages (pdf_doc, pdf_pwd='', images_folder='/tmp'):
    """Process each of the pages in this pdf file and print the entire text to stdout"""
    print '\n\n'.join(with_pdf(pdf_doc, pdf_pwd, _parse_pages, images_folder))
New in both function signatures is images_folder, a parameter that refers to the place on the local filesystem where any extracted images will be saved (this is also an example of why defining with_pdf() with an optional *args list comes in handy).
Aggregating Text
Within the _parse_pages() function, text_content is a new variable of type list, which collects the text of each page, and I’ve added an enumeration structure around doc.get_pages(), to keep track of which page we’re accessing at any given time. This is useful for saving images correctly, since some pdf files use the same image name in multiple places to refer to different images (this creates problems for dumppdf.py’s -i switch, for example).
The new critical line in _parse_pages() is this one:
text_content.append(parse_lt_objs(layout.objs, (i+1), images_folder))
Since the tree of page objects is recursive in nature (e.g., a pdfminer.layout.LTFigure object may have multiple child objects), it’s better to handle the actual text parsing and image collection in a separate function. That function, parse_lt_objs(), looks like this:
def parse_lt_objs (lt_objs, page_number, images_folder, text=[]):
    """Iterate through the list of LT* objects and capture the text or image data contained in each"""
    text_content = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            # text
            text_content.append(lt_obj.get_text())
        elif isinstance(lt_obj, LTImage):
            # an image, so save it to the designated folder, and note its place in the text
            saved_file = save_image(lt_obj, page_number, images_folder)
            if saved_file:
                # use html style <img /> tag to mark the position of the image within the text
                text_content.append('<img src="'+os.path.join(images_folder, saved_file)+'" />')
            else:
                # repr() is called here; the bare lt_obj.__repr__ would print the bound method instead
                print >> sys.stderr, "Error saving image on page", page_number, repr(lt_obj)
        elif isinstance(lt_obj, LTFigure):
            # LTFigure objects are containers for other LT* objects, so recurse through the children
            text_content.append(parse_lt_objs(lt_obj.objs, page_number, images_folder, text_content))
    return '\n'.join(text_content)
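In this example, I’m concerned with just four objects which may appear within a pdfminer.layout.LTPage object:
• LTTextBox (text)
• LTTextLine (text)
• LTImage (an image)
• LTFigure (a container for other LT* objects)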
For the simple text and image extraction I’m doing here, this is enough. There is room for improvement, though, since I’m ignoring several types of pdfminer.layout.LT* objects which do appear in pdf pages.
If you try to run get_pages() now, you might get this error, in the text_content.append(lt_obj.get_text()) line (it will depend on the content of the pdf file you’re trying to parse, as well as how your instance of Python is configured, and whether or not you installed PDFMiner with cmap for CJK languages).
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 61: ordinal not in range(128)
This function, which I wrote after reading this article, solves the problem:
def to_bytestring (s, enc='utf-8'):
    """Convert the given unicode string to a bytestring, using the standard encoding,
    unless it's already a bytestring"""
    if s:
        if isinstance(s, str):
            return s
        else:
            return s.encode(enc)
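The same failure, and the same fix, can be reproduced in isolation. This is a minimal sketch in Python 3 syntax, where str is already unicode and the encoding step has to be spelled out explicitly; the sample string is made up for illustration:

```python
text = 'caf\u00e9 \u2014 menu'   # contains an e-acute and the \u2014 dash from the error above

try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    # the ascii codec has no representation for code points above 127
    print('ascii failed:', err.reason)

# utf-8 can represent every code point, so the same call succeeds
encoded = text.encode('utf-8')
print(type(encoded).__name__, len(text), len(encoded))
```

The encoded result is longer than the original string because the two non-ASCII characters take more than one byte each in utf-8.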
So the updated version of parse_lt_objs() becomes:
def parse_lt_objs (lt_objs, page_number, images_folder, text=[]):
    """Iterate through the list of LT* objects and capture the text or image data contained in each"""
    text_content = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            # text, converted to a bytestring to avoid the UnicodeEncodeError
            text_content.append(to_bytestring(lt_obj.get_text()))
        elif isinstance(lt_obj, LTImage):
            # an image, so save it to the designated folder, and note its place in the text
            saved_file = save_image(lt_obj, page_number, images_folder)
            if saved_file:
                # use html style <img /> tag to mark the position of the image within the text
                text_content.append('<img src="'+os.path.join(images_folder, saved_file)+'" />')
            else:
                print >> sys.stderr, "Error saving image on page", page_number, repr(lt_obj)
        elif isinstance(lt_obj, LTFigure):
            # LTFigure objects are containers for other LT* objects, so recurse through the children
            text_content.append(parse_lt_objs(lt_obj.objs, page_number, images_folder, text_content))
    return '\n'.join(text_content)
Running this version gives reasonable results on pdf files where the text is single-column, and without many sidebars, abstracts, summary quotes, or other fancy typesetting layouts.
It really breaks down, though, in the case of multi-column pages: the resulting text_content jumps from one paragraph to the next, in no coherent order.
PDFMiner does provide two grouping functions, group_textbox_lr_tb and group_textbox_tb_rl [lr=left-to-right, tb=top-to-bottom], but they do the grouping literally, without considering the likelihood that the content of one textbox logically belongs after another’s.
Fortunately, though, each object also provides a bbox (bounding box) attribute, which is a four-part tuple of the object’s page position: (x0, y0, x1, y1).
Using the bbox data, we can group the text according to its position and width, making it more likely the columns we join together this way represent the correct logical flow of the text.
To aggregate the text this way, I added the following Python dictionary variable to the parse_lt_objs() code, just before iterating through the list of lt_objs: page_text={}.
The key for each entry is a tuple of the bbox’s (x0, x1) points, and the corresponding value is a list of text strings found within that bbox. The x0 value tells me the left offset for a given piece of text, and the x1 value (its right edge) tells me, together with x0, how wide it is.
So by grouping text which starts at the same horizontal offset and has the same width, I can aggregate all paragraphs belonging to the same column, regardless of their vertical position or length.
Conceptually, each entry in the page_text dictionary represents all the text associated with each physical column.
When I tried this the first time, I was surprised (though in retrospect, I shouldn’t have been, since nothing about parsing pdfs is neat or clean), that two textboxes which look perfectly aligned visually have slightly different x0 and x1 values (at least according to PDFMiner).
For example, one paragraph may have x0 and x1 values of 28.16 and 153.32 respectively, while the paragraph right underneath it has an x0 value of 29.04 and an x1 value of 152.09.
To get around this, I wrote the following update function, which assigns key tuples based on whether an (x0, x1) pair lies within a given percentage of an existing entry’s key. The 20 percent value was arrived at by trial-and-error, and seems to be acceptable for most pdf files I tried.
def update_page_text_hash (h, lt_obj, pct=0.2):
    """Use the bbox x0,x1 values within pct% to produce lists of associated text within the hash"""
    x0 = lt_obj.bbox[0]
    x1 = lt_obj.bbox[2]
    key_found = False
    for k, v in h.items():
        hash_x0 = k[0]
        if x0 >= (hash_x0 * (1.0-pct)) and (hash_x0 * (1.0+pct)) >= x0:
            hash_x1 = k[1]
            if x1 >= (hash_x1 * (1.0-pct)) and (hash_x1 * (1.0+pct)) >= x1:
                # the text inside this LT* object was positioned at the same
                # width as a prior series of text, so it belongs together
                key_found = True
                v.append(to_bytestring(lt_obj.get_text()))
                h[k] = v
    if not key_found:
        # the text, based on width, is a new series,
        # so it gets its own series (entry in the hash)
        h[(x0,x1)] = [to_bytestring(lt_obj.get_text())]
    return h
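To see the tolerance check at work without a pdf file, here is a minimal sketch in Python 3 syntax that runs the same comparison on plain (x0, x1, text) tuples instead of LT* objects. The names same_column() and group_by_column() are made up for illustration, the right-column coordinates are invented, and unlike the function above this version stops at the first matching key:

```python
def same_column(key, x0, x1, pct=0.2):
    """True if (x0, x1) lies within pct of an existing (x0, x1) key."""
    key_x0, key_x1 = key
    return (key_x0 * (1.0 - pct) <= x0 <= key_x0 * (1.0 + pct) and
            key_x1 * (1.0 - pct) <= x1 <= key_x1 * (1.0 + pct))


def group_by_column(boxes, pct=0.2):
    """Group (x0, x1, text) tuples into a dict keyed by the first
    (x0, x1) pair seen for each column."""
    columns = {}
    for x0, x1, text in boxes:
        for key in columns:
            if same_column(key, x0, x1, pct):
                columns[key].append(text)
                break
        else:
            # no existing column is close enough, so start a new one
            columns[(x0, x1)] = [text]
    return columns


boxes = [
    (28.16, 153.32, 'first paragraph, left column'),
    (310.50, 435.70, 'first paragraph, right column'),   # invented values
    (29.04, 152.09, 'second paragraph, left column'),
]
for key, texts in group_by_column(boxes).items():
    print(key, texts)
```

The two left-column paragraphs end up under one key even though their x0 and x1 values differ slightly, while the right-column paragraph gets a key of its own.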
With this in place, I could update the parse_lt_objs() to use it.
def parse_lt_objs (lt_objs, page_number, images_folder, text=[]):
    """Iterate through the list of LT* objects and capture the text or image data contained in each"""
    text_content = []
    page_text = {} # k=(x0, x1) of the bbox, v=list of text strings within that bbox width (physical column)
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            # text, so arrange it logically based on its column width
            page_text = update_page_text_hash(page_text, lt_obj)
        elif isinstance(lt_obj, LTImage):
            # an image, so save it to the designated folder, and note its place in the text
            saved_file = save_image(lt_obj, page_number, images_folder)
            if saved_file:
                # use html style <img /> tag to mark the position of the image within the text
                text_content.append('<img src="'+os.path.join(images_folder, saved_file)+'" />')
            else:
                print >> sys.stderr, "error saving image on page", page_number, repr(lt_obj)
        elif isinstance(lt_obj, LTFigure):
            # LTFigure objects are containers for other LT* objects, so recurse through the children
            text_content.append(parse_lt_objs(lt_obj.objs, page_number, images_folder, text_content))
    for k, v in sorted([(key,value) for (key,value) in page_text.items()]):
        # sort the page_text hash by the keys (x0,x1 values of the bbox),
        # which produces a top-down, left-to-right sequence of related columns
        text_content.append('\n'.join(v))
    return '\n'.join(text_content)
The last block before the return statement sorts the page_text (x0, x1) keys so that the resulting text is returned in a top-down, left-to-right sequence, based on where the text appeared visually on the page.
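The sorting step relies on how Python compares tuples. A minimal sketch in Python 3 syntax, using a hand-built page_text dictionary with invented coordinates and text:

```python
page_text = {
    (310.5, 435.7): ['right column, para 1', 'right column, para 2'],
    (28.16, 153.32): ['left column, para 1', 'left column, para 2'],
}

# tuples compare element by element, so sorting the items orders the
# columns by x0 first (left to right), then by x1 as a tie-breaker
ordered = ['\n'.join(texts) for _, texts in sorted(page_text.items())]
print('\n'.join(ordered))
```

Within each column, the paragraphs stay in the order in which they were appended, so the top-down part of the sequence depends on the order in which the text objects were encountered.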
Extracting Images
The last thing to discuss in this example is the extraction of images.
As I mentioned above, this area needs improvement, since it seems that I can only extract jpeg images using PDFMiner (though to be fair to Yusuke, he does describe it as a tool that “focuses entirely on getting and analyzing text data”, so perhaps doing more than jpeg is out-of-scope for this library).
Within parse_lt_objs(), the following function is called if an LTImage is found; it was based on studying the dumppdf.py source code and how it handled image extraction requests:
def save_image (lt_image, page_number, images_folder):
    """Try to save the image data from this LTImage object, and return the file name, if successful"""
    result = None
    if lt_image.stream:
        file_stream = lt_image.stream.get_rawdata()
        file_ext = determine_image_type(file_stream[0:4])
        if file_ext:
            file_name = ''.join([str(page_number), '_', lt_image.name, file_ext])
            if write_file(images_folder, file_name, file_stream, flags='wb'):
                result = file_name
    return result
The save_image() function needs the following two supporting functions defined:
def determine_image_type (stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    bytes_as_hex = b2a_hex(stream_first_4_bytes)
    if bytes_as_hex.startswith('ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':
        file_type = '.png'
    elif bytes_as_hex == '47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):
        file_type = '.bmp'
    return file_type
The determine_image_type() function is based on the concept of magic numbers, where it’s (sometimes) possible to tell what a binary stream means by examining the first two or four bytes.
In theory, a pdf file can have any of these image types, but in practice, the only ones PDFMiner seems to find as LTImage objects are jpegs.
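The magic number check can be tried on its own, without a pdf file. This is a table-driven variation sketched in Python 3 syntax, where b2a_hex() returns bytes rather than a str; the names MAGIC_NUMBERS and guess_extension() are made up for illustration:

```python
from binascii import b2a_hex

# leading bytes ("magic numbers") for a few common image formats
MAGIC_NUMBERS = [
    (b'ffd8', '.jpeg'),
    (b'89504e47', '.png'),
    (b'47494638', '.gif'),
    (b'424d', '.bmp'),
]


def guess_extension(data):
    """Return a file extension based on the first four bytes, or None."""
    head = b2a_hex(data[:4])   # lowercase hex, as bytes in Python 3
    for magic, ext in MAGIC_NUMBERS:
        if head.startswith(magic):
            return ext
    return None


print(guess_extension(b'\xff\xd8\xff\xe0' + b'\x00' * 16))
print(guess_extension(b'\x89PNG\r\n\x1a\n'))
print(guess_extension(b'GIF89a'))
print(guess_extension(b'plain text'))
```

Keeping the signatures in a list makes it easy to add another format later without touching the lookup logic.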
def write_file (folder, filename, filedata, flags='w'):
    """Write the file data to the folder and filename combination
    (flags: 'w' for write text, 'wb' for write binary, use 'a' instead of 'w' for append)"""
    result = False
    if os.path.isdir(folder):
        try:
            file_obj = open(os.path.join(folder, filename), flags)
            file_obj.write(filedata)
            file_obj.close()
            result = True
        except IOError:
            pass
    return result
The write_file() function is just basic file IO, but it does some convenient things around checking that the designated folder exists, too.
Finally, to support all three image saving functions, we need the following python imports:
import sys
import os
from binascii import b2a_hex
Sample Results
So, how well does it work? It’s surprisingly good, as it turns out.
Here’s an example from using the above code to process the Hacker Monthly Issue 2 pdf file (this was part of the process I used to convert this file to e-book format for inclusion in the Fifobooks Catalog).
Page 5, which looks like this visually:
came out like this:
<img src="/tmp/5_Im0.jpeg" /> “Leave the ad revenue and crazy business model revenue streams to the startups with venture funding.” on the company. But the advantage here is that after a few months off the ground you'll have a clear sense of how soon that day can come. Another advantage of a bootstrapped company on the SaaS model is that it's really easy to calculate your cash flow. It goes without saying that the people you work with should have complementary skills to your own, but the bootstrapper's "slow but steady" mindset is just as important to the health of your company. you'll find a lot of people may not be comfortable with this approach. Weed those people out as co-found- ers when you're bootstrapping a company. A one and done approach won't work here. off Hours Almost every bootstrapped company begins as an off-hours tinkering project. That's true of Carbonmade, which Dave built for himself first; that's true of TypeFrag, which I built over the course of a week during my sophomore year in college; that's true of 37signals' Basecamp, true of Anthony's Hype Machine and lots of other companies. The good thing about bootstrap- ping is that you don't need to spend a single penny outside of server costs and you can even do most things locally before having to pay any money on a server. your biggest expense is time, and that's why off hours are so important. Consult on the Side The way we started Carbonmade, the way 37signals started, the way Harvest started, and many other startups too, was by first running a consulting shop. We ran a design con- sulting company called nterface that Carbonmade grew out of. It's great, because the money you're bringing in through client work tides you over while you're waiting for your startup to grow. Carbonmade was live for nearly 18 months before we started working on it full-time. During those first 18 months, we were taking on lots of client work to pay our bills. 
The great thing about consulting through the early months is that you can take on fewer and fewer jobs as your revenue builds up. For example, you may need a dozen large projects during the first year and only two or three during the second year. That was the case for us. I know of other successful bootstrapped companies that during the first year would take on a single client project for a month or two, charging an appropriate amount, and that would give them just enough leeway to work on their startup for two or three months. Then they'd rinse and repeat. They did this for the first year and a half before making enough money to work on their startup full-time. there's no need to Rush When you're bootstrapping there's no rush to get things out the door, even though that's all you hear these 5
While there were some small problems around capitalization and spacing, the conversion did recognize and save the background image, it distinguished the summary quote as being separate from the rest of the text, and the columns were merged correctly, flowing in the same manner the author wrote them.
There are several things I’d like to be able to do better; some probably require changes to PDFMiner itself, while others are things in my code which I should improve.
Python’s distutils mechanism makes distributing and installing modules simple.
In most cases, either
python setup.py build
python setup.py install
or just
python setup.py install
is all that’s necessary.
Unfortunately (and somewhat surprisingly), there’s no uninstall option specified.
Manually deleting the .egg-info file and corresponding folder from the python site-packages folder is one way, but if the installer used an alternative or custom setup, then there is no way to be sure all the associated files and dependencies are gone.
The way around this is to use the --record switch with setup.py at install, which will log all the files corresponding to the module:
python setup.py install --record files.txt
Then, to uninstall (either ahead of a version upgrade or outright deletion), just use the contents of files.txt to guide the removal:
cat files.txt | xargs rm -rf
Hat tip to Michal Čihař, via Stack Overflow.
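One caveat with the xargs approach is that it splits on whitespace, so a path containing a space gets broken apart, and nothing is shown before it is deleted. As a rough alternative, here is a sketch in Python 3 syntax that reads the record file and supports a dry run. It assumes files.txt holds one path per line; remove_recorded() is a made-up helper, not part of distutils, and the demo runs against a scratch directory rather than a real site-packages folder:

```python
import os
import tempfile


def remove_recorded(record_path, dry_run=True):
    """Remove every file listed (one path per line) in record_path.
    With dry_run=True, only report what would be removed."""
    removed = []
    with open(record_path) as record:
        for line in record:
            path = line.strip()
            if path and os.path.isfile(path):
                if not dry_run:
                    os.remove(path)
                removed.append(path)
    return removed


# demo against a scratch directory
scratch = tempfile.mkdtemp()
installed = os.path.join(scratch, 'example_module.py')
open(installed, 'w').close()
record = os.path.join(scratch, 'files.txt')
with open(record, 'w') as fh:
    fh.write(installed + '\n')

print(remove_recorded(record))                  # dry run: lists the file
print(remove_recorded(record, dry_run=False))   # actually deletes it
print(os.path.exists(installed))
```

This only removes files, so any directories that become empty are left in place.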
This morning, we were practicing koshinage, and my uke was doing some terrific ukemi, rolling forward out of the throw, smoothly.
I asked him how he did it.
“Simple,” he replied. “Instead of grabbing nage’s arm on the way down [in preparation for a break fall], just extend both your arms in front of you, and follow them forward. Try it.”
We switched, and instead of forward, I kept finding myself vertical, landing more quietly than my usual crash-and-slam, but somewhat awkwardly.
“I’m having trouble projecting out enough to go forward,” I told him.
He made another suggestion, but by this time, the instructor noticed our conversation, and walked over to us.
“What’s going on?”
I explained about the soft ukemi out of koshinage, and how I was having trouble mastering it.
“Soft ukemi?” the instructor chortled sarcastically, albeit with a hint of a smile, “Just hit the ground!”
The pyparsing library is a terrific way of parsing and executing grammars.
It’s yet another reason I continue to work more and more in Python at the expense of Common Lisp, despite Python’s pedigree as a language for teaching programming to the uninitiated.
Among the examples in the wiki is searchparser.py which adapts pyparsing to the task of handling full-text queries in the way most search engines do: exact phrases in quotes, multiple phrases grouped by parentheses, compound queries joined by “AND”, “OR”, and “NOT” operators recursively, etc.
After experimenting with it for a while, there was one change I made which seemed an improvement over the original:
The evaluateQuotes() method takes an argument, which represents the string containing an exact phrase defined by quotes in the original query.
def evaluateQuotes(self, argument):
    """Evaluate quoted strings
    First is does an 'and' on the indidual search terms, then it asks the
    function GetQuoted to only return the subset of ID's that contain the
    literal string.
    """
    r = Set()
    search_terms = []
    for item in argument:
        search_terms.append(item[0])
        if len(r) == 0:
            r = self.evaluate(item)
        else:
            r = r.intersection(self.evaluate(item))
    return self.GetQuotes(' '.join(search_terms), r)
As the documentation says, it looks up each individual word of the phrase first, and then invokes GetQuotes() with two parameters: the entire phrase string, and the result of all the individual lookups which were common to every word in the phrase.
If, however, the underlying data structure supports the idea of finding an exact phrase within a block of text efficiently, then there is no need to look up each word of the larger phrase individually.
So evaluateQuotes() can be simplified to:
def evaluateQuotes(self, argument):
    """Evaluate quoted strings by invoking GetQuotes() on the entire quoted term"""
    search_terms = []
    for item in argument:
        search_terms.append(item[0])
    return self.GetQuotes(' '.join(search_terms))
The signature for the GetQuotes() method becomes:
def GetQuotes(self, search_string):
And finally, implementing GetQuotes() is simple, i.e., all it has to do is return a set containing occurrences of the exact search_string within the database.
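As an illustration only, here is a toy GetQuotes() over an in-memory dictionary, in Python 3 syntax. The InMemorySearch class, its docs attribute, and the case-insensitive matching are all assumptions made for this sketch; a real backend would delegate to whatever phrase search its database provides rather than scanning every document:

```python
class InMemorySearch(object):
    """Toy backend: maps document ids to their text."""

    def __init__(self, docs):
        self.docs = docs

    def GetQuotes(self, search_string):
        """Return the set of ids whose text contains the exact phrase."""
        phrase = search_string.lower()
        return set(doc_id for doc_id, text in self.docs.items()
                   if phrase in text.lower())


backend = InMemorySearch({
    1: 'Rolling forward out of the throw',
    2: 'The throw went forward, rolling',
    3: 'Just hit the ground',
})
print(backend.GetQuotes('rolling forward'))
```

Document 2 contains both words but not the phrase, so it is left out of the result, which is the behavior the quoted-string lookup needs.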
When Debian is the only OS running on a machine, running apt-get update and apt-get upgrade works well.
Under my Mac/Debian dual boot setup, though, upgrading has always been something of a crap shoot.
The most recent problem was this error, after an upgrade and a reboot:
EBDA is big; kernel setup stack overlaps LILO second stage
This solution was closest to my situation, and although the advice was helpful, I managed to fix mine with fewer steps:
• diskutil list command to find out where Linux was installed on the hard drive (in my setup, it was /dev/sda3)
• /etc/lilo.conf (it looked correct, so I didn’t make any edits)
• /sbin/lilo -v
This time, when I chose Linux from the rEFIt menu, the kernel loaded normally (/dev/sda3 was scanned because of an improper unmount, or lack thereof, but fsck did its job without incident), and my Linux partition was usable again.
It was nice to be at BarCampNYC5 today. I presented a talk on e-book creation in the morning, and I was fortunate to have a great audience with good questions.
The presentation touched on the following topics:
• “Build a digital book with EPUB” — a programming tutorial, for those familiar with generating XHTML and XML
• Sigil — a free and open source WYSIWYG e-book editor
• Calibre — a free and open source e-book library app which can handle format conversions and device syncing
• FiFoBooks.com — an e-book marketplace created by my NYC-based startup
The full slide deck can be downloaded here.
UPDATE, June 8, 2010: for creating all-illustrated e-books, such as comics, graphic novels, and manga, we’ve released a new tool called Composer.
All you need is a stack of images (png, gif, jpg) — upload them, set the page sequence, and (optionally) define chapters. Composer does the rest, and produces a file in e-book format (either .epub or .cbz).
UPDATE, June 27, 2010: with Apple’s update of the iPhone OS to version 4 (renamed iOS), reading .epub files on the iPhone and iPod touch is even simpler than before. Get the free iBooks app from Apple, and use iTunes to load .epub files onto your device, as per the instructions shown here.
Now that baseball season has started, I’m reminded of the following anecdote from the autobiography of Oh Sadaharu, a legendary Japanese baseball player who still holds the record for career home runs with 868 (sorry, Barry).
Oh is well-known for his Flamingo Batting Stance, which had him coiled and standing on one leg at the plate. Less well-known is the story behind the stance, which was influenced by Aikido.
In this excerpt, Oh tells how he and Hiroshi Arakawa, his batting coach, came to consult with Ueshiba Sensei for the first time:
One day — or rather late one night — Arakawa-san confronted me as I was about to retire. “A discovery!” he said. He was waving a book in his hand. It was by yet another actor, the well-known Kikugoro. The celebrated performer had disclosed in his book that he had tried to incorporate Aikido into his own training. Specifically, what he sought from Aikido was the idea of ma, the space and/or time “in between.”
“This,” Arakawa-san said, was the “essence of what we are looking for. All that remains is to apply it. Now you may wonder how this is to be done? Here we have a chance, because we have a living example to learn from.”
He had me read a chapter of the book. This excerpt told of Kikugoro’s visit to the great Aikido Master Ueshiba Morihei Sensei. Kikugoro waited around and waited around until the Sensei would speak to him. He asked, “Sir, what is ma?”
To this, the great teacher coolly replied, “If that’s all you’ve got to ask me, you must be a lousy actor.”
I was puzzled. I handed the book back to Arakawa-san, with no idea as to what I was supposed to have drawn from it. He could barely contain himself.
“Can you imagine a guy saying something like that to Kikugoro!”
I nodded, still uncomprehending. “So?”
“So, the Sensei is a living master. He is there for us as well as Kikugoro. We will go to him.”
The first time I saw him, he was approaching eighty. His appearance and manner, though, were vigorous. He looked more like a fifteenth-century village elder than a master of the martial arts — that is, until he began to perform the movements he had perfected over a lifetime. When he finished his session, we spoke to him. It was Arakawa-san’s turn to play the straight man.
“What is ma?” he asked, deliberately echoing Kikugoro. But the Sensei answered him differently.
“Ma exists because there is an opponent.”
“I understand,” Arakawa-san said. This seemed to jibe with something he was thinking. He took me by the elbow.
“You see,” he said to me, “in the case of baseball it would be the pitcher and batter. The one exists for the other; they are caught, both, in the ma of the moment. The pitcher tries in that instance of time and space to throw off a batter’s timing; the batter tries to outwit the pitcher. The two are struggling to take advantage of the ma that exists between them. That’s what makes baseball so extraordinarily difficult.”
The Sensei looked at both of us as if we were crazy men. His eyes seemed to darken, and he turned them on Arakawa-san. He remained silent for a moment, then said:
“I will tell you something, you’re a lousy teacher!”
I tried not to smile as I saw Arakawa-san lower his head, bowed with almost the same words that had been heaped on Kikugoro.
“You see, you’re no good when thinking of ma,” Ueshiba Sensei continued. “Ma is there because the opponent is there. If you don’t like that situation, all you have to do is eliminate the ma between you and the opponent. That is the real task. To eliminate the ma. Make the opponent yours. Absorb and incorporate his thinking into your own. Become one with him so you know him perfectly and can be one step ahead of his every movement.”