There are many situations where automating a text processing chore is essential, and it can save time and money compared with processing many large text files by hand.
Some of the most common examples of text that can be processed by computer are mailing or address lists, web site content, source code, web data entry forms, emails, stock tickers, books and literature, scientific data such as genome data, and numerical data in the form of tables of text (tab- or comma-separated text is a common way to transfer tabular data such as spreadsheets).
There are many reasons to process text, and many different types of processing. For example, search engines process text to create an index so that words and phrases can be found quickly; other text processing tasks create or format data using a raw text stream as input. A real-time news feed from a service like Dow Jones, for example, requires significant processing to add line breaks, strip off header information, and tidy up the layout before it becomes a human-readable news story that can be put on Yahoo News or another web site.
In automated text processing I think of two different "paradigms." One is transformational programming where the goal is to take text A and turn it into text B in a systematic way. The other is data extraction where the goal is to extract information about the text for use in a database or search engine or other similar system. This distinction is useful when considering what sorts of tasks and tools are necessary for your project.
Tools suitable for transformational programming include macro languages, text editors, UNIX shell utilities, and text processing languages like awk and Perl. Other high-level languages, such as Python and the Lisp family, are suitable to an extent. I don't classify C or other lower-level languages as suitable, because they take too long to develop with and don't provide suitable text-specific tools within the languages themselves. However, there is a family of little languages, some of which compile into C programs, that is extremely powerful. I am thinking of compiler toolkits such as Lex and YACC, and their GNU counterparts Flex and Bison.
If your goal is to create text output from text input, then you are doing transformational programming. Many programmers would immediately reach for Perl and begin hacking away until they had something that seemed to work. This is generally a bad idea. Perl may be the language of choice, but before reflexively grabbing it, it pays to consider other options. Shell tools like textutils, macro languages such as M4, and stream editors such as sed are some of the first transformational programming tools one should consider. Why? Because they are simple, and much more easily debugged interactively.
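As a small illustration of how interactively debuggable these tools are, here is a sed one-liner (the sample input is made up) that strips trailing whitespace and deletes blank lines; each expression can be tried on its own at the shell prompt:

```shell
# Clean up a text stream with sed: strip trailing whitespace,
# then delete the lines that are left empty.
printf 'hello   \n\nworld\n' \
  | sed -e 's/[[:space:]]*$//' -e '/^$/d'
```

Because each -e expression is independent, you can drop one, rerun the pipeline, and see exactly what changed.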
A macro language is a system specifically for transforming one text into another. The C preprocessor is a macro language, albeit a clumsy one. In Lisp, you can create macros using Lisp itself; this is one of the most powerful features of the language, and it lets you write small programs that write your larger program for you. In SAS, a programming language I personally hate, macros are the only means to get a full-featured, Turing-complete language; the base SAS system without the macro language is terribly limited.
M4 is a general-purpose macro language designed to be used on any sort of text. Some examples of what you can do with M4 are to create web pages (all of the pages on my site were created with the aid of M4 and "make"), to extract specific portions of a text (such as all the links from an HTML file, or a list of dependency files for your C programs), to insert information into text files (such as dates, copyright notices, and so forth), and to restructure a file (for example, putting one data record on each line).
The most effective way to harness m4 is to use it when writing more complicated programs. By creating macros that expand into often-repeated or complicated expressions, you can automate the task of writing other code, such as web pages. Put together with sed, the m4 macro language can be used for general manipulation of text files. While sed itself is a very powerful language, it is hard to rely on it for more than simple substitutions and deletions. However, by using sed to substitute macro calls into a file and passing the result on to m4, you can harness the power of m4 to make more sophisticated changes.
This illustrates a common principle of transformational programming: stepwise refinement. If you can transform a text from A to B, then B to C, and C to D, you can make complicated changes through several simple steps.
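The principle can be sketched with nothing more than a three-stage shell pipeline; the record format here is made up for the illustration:

```shell
# A -> B: split one line of ;-separated records onto separate lines.
# B -> C: strip the "name:" field label from each record.
# C -> D: sort the records.
printf 'name:bob;name:alice\n' | tr ';' '\n' | sed 's/^name://' | sort
```

Each stage is trivial to test in isolation, which is exactly what makes the whole chain easy to debug.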
To rearrange the chapters of your book, for example, you might edit all the chapter headings with sed to insert m4 macros, and then use m4's "diversions" to reorder the chapters or split them into separate files. Or you might edit your address list to insert macros that expand into SQL commands, then pass them to an SQL interpreter and insert the names of all your clients into your relational database. You might even use sed to extract certain statements from your source code, change them into calls to an m4 macro, and have the macro expand into Prolog statements. The Prolog statements could provide raw data for a program that deduces which portions of your data processing system will fail if one of your data vendors misses a deadline. In other words, the technique is very powerful.
Although the boundary between data extraction and transformational programming is fuzzy (consider the address list example above), there are some tasks for which macros and editors are not a good fit. Examples would be making a table of word frequencies, or trying to detect common grammatical errors. Actually, the word frequency table could be built by transforming every non-letter into a line break, deleting blank lines, sorting the resulting word list, and counting the duplicates. This example shows that transformational programming is perhaps more powerful than you realize. Still, there are times when you need a full programming language, and high-level languages provide an excellent tool. Lexical analyzers, parsers, and programming languages like Python, Perl, Scheme, Emacs Lisp, or any other of your favorites come in handy when you need to parse, extract, compare, and quantify text contents. There are some especially powerful techniques created by computer scientists who study computer language theory: parser generators and lexical analyzers can write programs for you, a technique called code generation. Combining these with a high-level language allows you to process computer languages and summarize your own code, letting you understand it better.
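The word-frequency recipe described above can be sketched as a classic shell pipeline; the sample sentence is arbitrary:

```shell
# Turn every run of non-letters into a line break, then sort,
# count the duplicates, and list the most frequent words first.
printf 'the cat sat on the mat\n' \
  | tr -cs 'A-Za-z' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

Every stage is a pure text-to-text transformation, yet the end result is a small piece of data extraction.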