Making Digital Resources from Print Material

April 3rd, 2004

Making digital resources from print material isn’t easy, nor should it be done without a plan. Many philosophies exist on how and why certain things should be done. This article is simply one man’s views for the future. Hopefully, these ideas can be implemented, in full, or in part, to the benefit of mankind.

Global Premises of the System

The system described here assumes a web-based implementation - that is to say, all the processing is done via the web, and no special software is required by the end user, except a compatible browser.

Several terms are used in this article, as defined here:

  1. System: The whole system of scripts, applications, etc.
  2. User(s): Any person doing the human labor for any part of the processing.
  3. Database: The core system database, or series of databases.
  4. Project: Each item being processed, whether that item is a series of illustrations, a book, a magazine, a newspaper, etc.
  5. Scan(s): The scanned page images.
  6. Image(s): Any portion of a scan which would be included in the final content, such as illustrations, mathematical formulas, or other non-text data.
  7. History: The complete record of any actions taken by a user or a system for each project.

The article assumes that the system operates in a certain way, namely with the following features:

  1. Multiple users do any manual labor needed, volunteers or employees, whatever the case may be.
  2. Every action taken by a user or the system should be recorded in the database.
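The second feature can be sketched in a few lines. This is only a minimal illustration, assuming a SQLite database; the `history` table, its columns, and the function names are hypothetical and not part of any existing system.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per action taken by a user or by the system.
SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    project_id INTEGER NOT NULL,
    actor      TEXT    NOT NULL,   -- user name, or 'system'
    step       TEXT    NOT NULL,   -- e.g. 'scan', 'ocr', 'proof1'
    detail     TEXT    NOT NULL,   -- free-text description of the action
    logged_at  TEXT    NOT NULL    -- UTC timestamp, ISO 8601
)
"""


def open_history(path=":memory:"):
    """Open (or create) the history database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_action(conn, project_id, actor, step, detail):
    """Append one action to the history; this sketch never updates or deletes rows."""
    conn.execute(
        "INSERT INTO history (project_id, actor, step, detail, logged_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (project_id, actor, step, detail, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def project_history(conn, project_id):
    """Return (actor, step, detail) for every action on a project, oldest first."""
    rows = conn.execute(
        "SELECT actor, step, detail FROM history WHERE project_id = ? ORDER BY id",
        (project_id,),
    )
    return rows.fetchall()
```

Each processing step would call something like `record_action(conn, 100, "system", "ocr", "OCR run on scan 0001")` as it finishes, so the history can be exported later with the rest of the project.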

Basic Processing Steps

Step 1: Finding Desirable Print Material

In my opinion, not everything should be digitized. A thoughtful selection process is needed to decide what to work on, and what to ignore. Along with this comes the prioritization of the material. There is not a whole lot to say about this, except to note that it really comes down to personal preference and interest, and possibly, a defined project’s scope. Once the material is selected, a processing project record should be created in the system database.

Step 2: Copyright Clearance

If the target content is public domain, a copyright clearance check is required to verify the status of copyright on the selected material. This clearance should require very little human interaction, which would consist solely of a person entering selected metadata, such as title, author, and publication date, into a form. The system should query various sources, as needed, to derive the copyright status. This discovered data should be verified by a qualified person. The verified information should then be added to the database and associated with the proper project.

Any copyright information which is retrieved by the system should be stored in the database for possible future use, whether it applies to the current project or not. This data could be added to an existing local copyright clearance database, or used to build one from scratch over time.

If the target content is not public domain, then it is assumed that the system has implicit permission to process the copyrighted material, as the case might be for a company providing conversion services to clients who provide the content.

In the case of copyrighted material, the copyright info should be added to the project record, and put into the copyright clearance database as well.

Step 3: Scanning

At this point, there is no good way to automate the scanning - it takes manpower. Even automated scanning machines require a human’s physical interaction. This article assumes that the user who originally selected the text for processing has access to the physical material. This user would then scan each and every page of the print material, naming the files using numeric identifiers according to their sequence in the print material, ignoring book page numbers, etc. The scans should then be submitted to the system, individually or in batch(es). In theory, this submission would be done via some sort of upload system, either direct FTP, or via a webpage. The webpage method is generally impractical since it isn’t designed to handle the size of files required for page scans, which can often exceed several megabytes each. These scans should then be associated with the current project.
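As a rough sketch of the naming rule, the function below generates sequential filenames. The `<project id>-<sequence>` pattern and the four-digit padding are assumptions borrowed from the repository layout later in this article; the function name is hypothetical.

```python
def scan_filenames(project_id, page_count, extension="png", width=4):
    """Return scan filenames numbered by scanning order, not by printed page number."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return [
        f"{project_id}-{number:0{width}d}.{extension}"
        for number in range(1, page_count + 1)
    ]


# A three-page project with ID 100:
print(scan_filenames(100, 3))
# prints ['100-0001.png', '100-0002.png', '100-0003.png']
```

Zero-padding keeps the files in scan order when they are sorted by name, which later steps can rely on.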

Step 4: Scan Verification

Once the scans are fully uploaded, a second user (a user who did not scan and upload) should verify that the sequence of scans makes sense. Any questions or possible errors should be brought to the attention of the person who scanned the material. Each issue should be documented and resolved. If any issues cannot be resolved (torn pages, blotted ink, etc), they should be noted as such in the history record. Once this process is completed, the scans should be made available for the next step of processing.

Step 5: Scan Processing

Next, each scan image should be processed on an individual basis. This step would include fixing skew angle, cropping multi-column content, resizing, resampling, or other cleanup of the image which might be required to ensure usability of the image. Multi-column images should be renamed by appending a letter (a-z) to the image filename. After scan cleanup, images which should be part of the final text should be created from the original page scans and saved in a separate directory using filenames which easily correlate to the scan each image is derived from. Any captions associated with the images should be saved to the database or a text file and associated with the image record. This step is complete only after every scan and image is fully processed for the entire project.
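The letter-suffix naming can be sketched as follows. The function name is hypothetical, and the sketch assumes that at most 26 pieces are ever cut from a single scan.

```python
import string
from pathlib import PurePosixPath


def derived_filename(scan_filename, index):
    """Name the index-th piece (0-based) cut from a scan by appending a letter a-z."""
    letters = string.ascii_lowercase
    if not 0 <= index < len(letters):
        raise ValueError("only 26 pieces (a-z) can be derived from one scan")
    path = PurePosixPath(scan_filename)
    # Keep the scan's own name as the prefix so the piece is easy to trace back.
    return f"{path.stem}{letters[index]}{path.suffix}"


print(derived_filename("100-0001.png", 0))  # prints 100-0001a.png
print(derived_filename("100-0001.png", 1))  # prints 100-0001b.png
```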

Step 6: OCR Processing

The OCR processing should be as automatic as possible, and a simple interface would allow the users to control OCR software located on the server. OCR for many projects could be highly automated, with users simply activating the OCR process, and verifying the output to make sure the OCR didn’t encounter issues. More complicated material may require special OCR configuration, and if that is not possible on the server, the image should be downloaded by users with OCR capability, with the output text uploaded back to the system. Text output should be saved to files using the same filename as the images it was derived from, but in a separate directory. Once all OCR is complete, the project can be released for the next step. The history log should reflect how each page was OCRed, by the server, or manually by a user.

Step 7: Proofreading Round 1

The first round of proofreading should concentrate only on correcting errors where the OCR text output does not match the scan. It should not be concerned with formatting (italics, etc), or spelling errors, except where spelling errors don’t match the scan. Any corrections should be logged to the history, recording the correcting user, the page it was found on, and the exact correction. This data is important for proactively teaching the OCR system to eliminate errors for later projects.
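One way to capture "the exact correction" without asking proofreaders to type it in is to compare the OCR text with the proofed text. The sketch below uses Python's standard `difflib` module and compares word by word; the function names and the shape of the log entries are assumptions for illustration only.

```python
import difflib


def find_corrections(ocr_text, proofed_text):
    """Return (ocr_fragment, corrected_fragment) pairs, compared word by word."""
    ocr_words = ocr_text.split()
    proofed_words = proofed_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=proofed_words, autojunk=False)
    corrections = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            corrections.append(
                (" ".join(ocr_words[a1:a2]), " ".join(proofed_words[b1:b2]))
            )
    return corrections


def correction_log(user, page, ocr_text, proofed_text):
    """Build one history entry per correction made on a page."""
    return [
        {"user": user, "page": page, "ocr": old, "corrected": new}
        for old, new in find_corrections(ocr_text, proofed_text)
    ]


print(correction_log("alice", "100-0001", "Tbe quick brown fox", "The quick brown fox"))
# prints [{'user': 'alice', 'page': '100-0001', 'ocr': 'Tbe', 'corrected': 'The'}]
```

Pairs such as `Tbe` and `The`, collected over many projects, are the kind of data that could later be used to spot recurring OCR errors.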

Step 8: Proofreading Round 2

Round two of the proofreading should concentrate on formatting and correcting spelling mistakes which do appear in the original material, should such spelling correction be desired. For example, spell correcting a Mark Twain novel is not desired, as the errors are the author’s intended style. The locations of footnotes (and other forms thereof) should be marked in the text, and the actual footnote text saved in a separate file. Again, each change should be logged. Formatting changes don’t really need to be logged, but could be for consistency.

Step 9: Preassembly

After proofing, the system should automatically assemble the text files into a single merged text file, saving the file using the project’s ID in the project’s directory. Markers denoting each file’s content should be placed where each scan’s text starts. Once the system has completed this task, the text should be made available to a user to verify the assembly is complete and accurate.
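A minimal sketch of this merge is shown below. The `[[scan:...]]` marker syntax is hypothetical, and the sketch assumes the proofed page files sit in one subdirectory of the project directory with zero-padded names, so that sorting by name gives scan order.

```python
from pathlib import Path


def preassemble(page_dir, project_id, marker="[[scan:{name}]]"):
    """Merge proofed page files into one text file, with a marker before each page."""
    page_dir = Path(page_dir)
    # Assumption: the project directory is the parent of the page directory.
    out_path = page_dir.parent / f"{project_id}.txt"
    parts = []
    for page in sorted(page_dir.glob("*.txt")):
        parts.append(marker.format(name=page.stem))
        parts.append(page.read_text(encoding="utf-8").rstrip("\n"))
    out_path.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return out_path
```

Because each marker carries the page file's name, later steps can always link a passage in the merged file back to its scan.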

Step 10: Assembly

This step’s purpose is to introduce markers for images (created in step 5), including image captions into the merged text file. During this step, structure should be defined, such as chapter breaks, and other such presentational features (blockquotes, stanzas of verse, etc). Footnote references should also be verified. Once this step is complete, the updated file should be saved as a different filename than the preassembly output file in the project directory.

Step 11: Markup

After assembly, the system should automatically convert scan image number markers, structure markers, footnote references, and formatting markers to full markup. During the markup step, a user should verify this automatic markup and add any other markup required to the text. The final output should be saved as an XML file using the project’s ID number and in the project directory.

Step 12: Cataloging

Metadata from the project should be verified by a qualified cataloger. The cataloger may update the metadata, in which event the system should add the changes to the history record. At this point, the release ID should be assigned to the project.

Step 13: Master Creation

During final assembly, the system should prepend the metadata, in full XML format, to the marked-up file. It should also do any other markup, such as adding license data, processing notes and history, etc to the file. This XML master file should be verified by a qualified user, either making changes as needed, or adding notes of any changes needed and sending the file back to the proper step for reprocessing.

Step 14: Format Generation

Any formats which will be static files should be generated by the system and verified by users.

Step 15: Archiving

Where necessary, the system should package up the master XML document, along with the images, scans, full history, OCR text output, and output files from other steps, including the generated static format output. A manifest file should be included, citing each file and its description. This archive should be saved to various locations, and preferably burned to CD for safekeeping.
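Generating the manifest is easy to automate. The sketch below walks a project directory and lists every file with a short description; the descriptions, the tab-separated layout, and the function name are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical descriptions, keyed by top-level directory name.
DESCRIPTIONS = {
    "scans": "original page scan",
    "images": "illustration cut from a scan",
    "text": "generated text format",
    "audio": "generated audio format",
    "masters": "master document",
}


def build_manifest(root):
    """Return sorted 'relative/path<TAB>description' lines for every file under root."""
    root = Path(root)
    lines = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Files directly in the project root get a generic description.
        top = relative.parts[0] if len(relative.parts) > 1 else ""
        lines.append(f"{relative.as_posix()}\t{DESCRIPTIONS.get(top, 'project file')}")
    return sorted(lines)
```

The returned lines can be written to the manifest file as-is, or converted to XML if an XML manifest is wanted as well.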

Step 16: Packaging

Where needed, the system should package up static formats, such as HTML+images, etc. Again, a manifest file should be generated and included. The packages should be saved using the release ID as part of or the whole filename. These packages, and any single-file versions of the project should be saved to the public repository. The master document, scan images, images, etc should be saved to the master repository under the release ID. Derivative versions, such as might be needed for multi-volume projects, should be packaged under the same release ID, with added information to denote it as individual volumes of a single project. These multiple volumes should not be given new release IDs.

Step 17: Delivery

Once the master files are saved to the private repository, and the packages saved to the public repository, the catalog data should be added to the public catalog and the file made officially available to the public. The catalog record interface should provide links to download mirrors, along with access to on-the-fly format conversion for formats which are not stored statically.

Optional Sub Processes

Sub 1: Format Conversion

On-demand format conversion should parse the master document as required by the particular output format’s needs. If required, a package should be created as described in step 16. This process can be accomplished by the same system used in step 14. These files should be named like 100.html or 100.txt, etc.
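To give a feel for how little work a conversion is once a structured master exists, here is a small sketch using Python's standard XML parser. The element names (`book`, `title`, `chapter`, `heading`, `p`) are a made-up vocabulary for this example only, not the master format itself.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny master document in a hypothetical vocabulary.
MASTER = (
    "<book><title>Example</title><chapter><heading>One</heading>"
    "<p>First paragraph.</p><p>Second &amp; last.</p></chapter></book>"
)


def to_text(master_xml):
    """Render the master as plain text with blank lines between blocks."""
    root = ET.fromstring(master_xml)
    lines = [root.findtext("title", default="").upper(), ""]
    for chapter in root.iter("chapter"):
        lines += [chapter.findtext("heading", default=""), ""]
        for paragraph in chapter.iter("p"):
            lines += ["".join(paragraph.itertext()), ""]
    return "\n".join(lines).rstrip("\n") + "\n"


def to_html(master_xml):
    """Render the same master as a simple HTML fragment."""
    root = ET.fromstring(master_xml)
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    for chapter in root.iter("chapter"):
        parts.append(f"<h2>{escape(chapter.findtext('heading', default=''))}</h2>")
        for paragraph in chapter.iter("p"):
            parts.append(f"<p>{escape(''.join(paragraph.itertext()))}</p>")
    return "\n".join(parts) + "\n"
```

Both functions read the same master and differ only in how they render it, so a new output format is one more small function rather than a new round of proofreading.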

Sub 2: Language Translation

Automatic language translation of texts should be made available. Once the system has automatically translated the content, the new content could be processed as described in step 7 through step 16. Sub 1 may also be implemented. Again, these derivative packages should be given the same release ID as the parent document, but marked as a different language derivative, using standard three-letter ISO 639-2 language codes. For example: 100-eng.xml and 100-spa.xml.

Sub 3: Voice Synthesis

Voice synthesis of the texts can be greatly automated using a series of processing steps. The voice synthesis system should implement multiple voices for different characters, narration, etc. Consistent rules should be created for denoting footnotes and other "peripheral" data, since it must be synthesized inline, not as an appendix. All synthesis media masters should be saved in Shorten Audio format, which provides lossless compression.

  • Step 1: Voice Markup

    Each character in the content should be assigned a voice, and content marked to identify where each person is speaking. In the case of theatrical content (Hamlet, for example), this may already exist and only need to be verified and associated with the selected voice profile.

  • Step 2: Test Synthesis

    A few passages for each voice profile should be sampled and synthesized for verification by a user. Once verified, the system should synthesize the full content.

  • Step 3: Synthesis

    The system should synthesize the entire content in blocks, possibly as small as a few paragraphs at a time. This makes it possible to correct a mistake by re-synthesizing only a small portion and replacing that block’s output file.

  • Step 4: Proofing Round 1

    Users should listen to the blocks of synthesis output and read along with the text. Any errors should be corrected by instructing the synthesis engine to re-synthesize the block containing the error. All such corrections should be added to the project history.

  • Step 5: Block Merging

    The system should automatically merge the blocks into a seamless file or series of files (for large content). This process should be verified by comparing a log file generated during the process to existing synthesis block counts.

  • Step 6: Proofing Round 2

    A user should listen to the full synthesis, reading along in the text and checking for continuity, noting any issues, and sending the content back to step 3 for correction.

  • Step 7: Approval

    Final approval of the synthesis output should result in the output being cataloged as defined in step 12, then archived, packaged, and delivered as defined in step 15 through step 17. A master synthesis output copy should be burned to CD, again, for safekeeping.

  • Step 8: Format Conversion

    Shorten Audio files aren’t common in the real world, but the file format is the best for masters. Once the masters are created, they should be automatically converted to more popular formats, such as patent-encumbered MP3, or open-source Ogg Vorbis, for delivery to the users. These copies should be created and saved as static files, as on-the-fly conversion is simply too hardware intensive for a repository server to handle.
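The verification described under Block Merging could look something like the sketch below. It assumes the merge process writes a log with one merged block filename per line; that log format and the function name are hypothetical.

```python
def verify_merge(merge_log_lines, block_filenames):
    """Check that every synthesis block was merged exactly once, in order.

    merge_log_lines: lines of the merge log, assumed to hold one merged
    block filename per line (hypothetical format).
    block_filenames: names of the block files produced by synthesis.
    Returns a list of problem descriptions; an empty list means success.
    """
    merged = [line.strip() for line in merge_log_lines if line.strip()]
    expected = sorted(block_filenames)
    problems = []
    if len(merged) != len(expected):
        problems.append(
            f"block count mismatch: merged {len(merged)}, expected {len(expected)}"
        )
    for name in expected:
        count = merged.count(name)
        if count == 0:
            problems.append(f"missing block: {name}")
        elif count > 1:
            problems.append(f"block merged {count} times: {name}")
    for name in sorted(set(merged) - set(expected)):
        problems.append(f"unknown block in log: {name}")
    if merged != sorted(merged):
        problems.append("blocks were merged out of order")
    return problems
```

An empty result lets the project move on to the second proofing round; anything else is reported to a user and recorded in the history.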

The Repository

After processing, the work is not completed. The repository management and use is just as important as the processing. The following section details the construction and management of the repository. The repository contains the original scans, the master XML document, proofread OCR files, images, and possibly Shorten Audio masters, alternative language master XML documents, etc. Along with the masters, any static versions of the content (audio and text) should be stored in the repository.

Part 1: Directory Structure

The repository is a large filesystem, and careful consideration should go into creating it. The following directory structure is my suggestion for a good filesystem (assuming the release ID of the content is 100):

repo/
    1/
        0/
            100/
                100-man.xml
                100-man.txt
                100-log.xml
                100-log.txt
                masters/
                    text/
                        100-eng.xml
                        ...
                        100-spa.xml
                    audio/
                        eng/
                            100-eng-001.shn
                            ...
                            100-eng-999.shn
                        spa/
                            100-spa-001.shn
                            ...
                            100-spa-999.shn
                text/
                    eng/
                        100-eng.txt
                        100-eng.xml
                        100-eng.html
                        100-eng.txt.zip
                        100-eng.xml.zip
                        100-eng.html.zip
                        ...
                    spa/
                        100-spa.txt
                        100-spa.xml
                        100-spa.html
                        100-spa.txt.zip
                        100-spa.xml.zip
                        100-spa.html.zip
                        ...
                scans/
                    100-0001.png
                    ...
                    100-9999.png
                images/
                    100-0001a.png
                    ...
                    100-9999z.png
                audio/
                    eng/
                        100-eng-001.ogg
                        ...
                        100-eng-999.ogg
                    spa/
                        100-spa-001.ogg
                        ...
                        100-spa-999.ogg
                iso/
                    100.iso

Most of the filesystem is self-explanatory, with a couple of possible exceptions:

  1. 100-man.* files in the root are the manifest files which should contain all metadata associated with the content, along with a list of all files available for the content. This file should be automatically updated whenever the system changes any of the data, adds a new format or language translation, etc.
  2. 100-log.* files in the root are the change log files which should contain a log of all processing steps and updates to the content files. This file should be automatically updated whenever changes are made to the content or the repository for the content.
  3. 100.iso in the iso directory is a CD image of the entire contents of the 100 root directory, except the ISO itself of course. This image should contain all the manifest, change log, scans, images, text, audio, and translations of the content which are available. This iso file should be rebuilt whenever changes are made to the content.

The masters subdirectory should not be writable except by the system itself, nor should it be available via FTP or rsync, etc. Everything else should be readable by the world, and available via FTP, etc.

None of the files in the repository should ever be manually edited for any reason!
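Since the system, not a person, writes to the repository, it needs a rule for turning a release ID into a path. The sketch below is one possible reading of the layout above: it assumes every digit of the release ID except the last becomes a directory level, which is inferred from the single example (100 becomes repo/1/0/100). The function names are hypothetical.

```python
from pathlib import PurePosixPath


def release_dir(release_id, root="repo"):
    """Map a release ID to its directory, for example 100 -> repo/1/0/100."""
    number = int(release_id)
    if number < 0:
        raise ValueError("release IDs must not be negative")
    digits = str(number)
    # Assumption: all digits but the last become nested directories.
    return PurePosixPath(root, *digits[:-1], digits)


def text_file(release_id, language, extension, root="repo"):
    """Path of a static text format, for example repo/1/0/100/text/eng/100-eng.html."""
    name = f"{int(release_id)}-{language}.{extension}"
    return release_dir(release_id, root) / "text" / language / name


print(release_dir(100))                 # prints repo/1/0/100
print(text_file(100, "eng", "html"))    # prints repo/1/0/100/text/eng/100-eng.html
```

Spreading releases across nested digit directories keeps any single directory from growing to thousands of entries.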


Although I never finished writing this concept, I thought I’d put it up for those who may be interested in the part that is done. One day, I may revisit this.

Digitizing Texts: The next generation.

September 13th, 2003

What is it about the eBook world that blinds people to quality and makes them wholly quantity-centric? Do numbers mean more than usefulness? The last twenty years of converting paper books to eTexts and eBooks - particularly works in the public domain - has been an evolutionary “how to” learning curve. Although the quantity of texts converted is impressive, the quality has greatly suffered, and this lack of quality is beginning to catch up with us.

It is time that we take a brief look at the issue of quality digital conversion of paper texts: to study what has been done, and to offer a revolutionary new path integrating what we now know is best, and what we predict will be important in the future. After all, the purpose of digitizing the “Paper Domain” is not only preservation, but to make the content wholly useful for future generations and their needs, and not just meet present-day needs.

Thus, in the first part of this article, I will present a summary overview of the preferred, next generation process that anyone, whether a private individual or a major organized effort, should follow when digitizing paper works.

In the second part of this article I will review several of the larger organized efforts to digitize public domain books and similar types of works. This review will focus on the deficiencies they have, as well as mention the innovative approaches they have implemented.


Locating appropriate printed material

The first hurdle is to find and appropriate usable copies of the print books. This is often difficult because of the lack of quantity for many older books. Even when a book is found, one must consider the impact destroying the book will have on the literary world. For example, it would not be wise to destroy an original copy of a book if only half a dozen are known to exist. For those of you who may be less knowledgeable about working with printed material, I will explain that often, it is necessary to cut the binding of a book in order to get the loose pages to scan accurately. Once cut, the book cannot generally be re-bound. Scanning a rare book can be done without cutting, but it requires more manual labor. In my example, of a book with only six copies, this would be acceptable “extra” work.

Copyright clearance

Getting authoritative copyright clearance isn’t generally too difficult, because of basic copyright laws in various countries. Still, each book needs to be cleared individually to ensure that no copyright extensions or special case rules might apply. For books where copyright may apply, it is necessary to get permission from the copyright holder to digitize their work. Once this copyright information is obtained, it should be saved in a database for future reference and inclusion in an authoritative catalog (described later). Once cleared and prepared, a barcode should be assigned to the printed book for tracking.

Scanning the printed pages

Scanning the loose pages of the print book isn’t a particularly difficult task. However, it is very important to save the original scan images. Many eBooks no longer have these images available, making it extremely difficult to track typographical errors. Also, these books don’t often include illustrations because they were done only in plain text, and without the original scans (and no long-term storage for the printed pages, as noted above), it is very difficult to get these illustrations from the specific edition of the book that the text came from. These scans should be saved in an economical format such as PNG or compressed TIF, which maintains scalability and quality without sacrificing file space. Original scan images are particularly important for tracking the pedigree of the book. Scan images should also be saved to some distributable medium, such as CD, for archival purposes, bar-coded to match the printed book’s barcode and stored in a safe location.

OCR (Optical Character Recognition) of the scanned images

The OCRing process is relatively straightforward with various OCR software packages. The resulting raw text is then used to put together the actual full text. The resulting raw text files need to be archived in their original form. These original text files are important for version tracking and error correction, among other things. Again, these should be burned to CD and bar-coded.

Proofing

This step is actually a multipart process. Humans should proofread the raw OCR’ed text, at least twice, and preferably, three times. This ensures that the OCR accurately deciphered all the characters from the scans. It is crucial that at least two different people proofread each page of text. This keeps the human brain’s natural “auto-adjust” feature from becoming too much of an issue. The proofing step’s main complication comes at the end, with the pre-assembly task. Because the raw OCR text files are straight text, they lose the entire original formatting of the print book. As the pages are being proofed, it is imperative to rebuild all the original formatting.

Assembly

Assembly of the master document is the single most complex step in the entire process. If the master document is straight text, it cannot maintain the formatting and character encoding which is necessary to properly reproduce the content. Flat text also does not offer a consistent metadata and content structure. While no file format is going to be the absolute best solution for every case, we can certainly do better than plain text and arbitrary XML vocabularies. Personally, I recommend a strict subset of the Open eBook Publication Structure specification, which can implement Dublin Core metadata as well as MathML and SVG as needed. My only point of opposition with the OEBPS format is that it is a subset of XHTML, and is therefore prone to the many problems of such a loosely specified vocabulary. This issue, however, can be resolved easily enough. The assembly process is crucial to the proper production of eBooks. Well marked-up master documents make it very simple to convert to other formats, both text and other media. This master format is not for normal distribution uses, although it can be distributed along with the “user” formats such as HTML, RTF, PDF, etc. The master document needs to include page numbers, full metadata, references to embedded illustrations, etc. This information is needed so that it can automatically reference the scanned images and raw OCR’ed text files for each page, among other uses. This master document needs to be accessibility-friendly as well and should contain only structural and semantic markup, and exclude presentational markup. This enables format conversion systems to create versions of the document supporting all the features of a particular format, without being stuck with too many complex special cases that would have to be addressed with custom code for each case.

Format conversion

Creating these “user” formats is very simple once a rich master document is available. By using a single master document, many other formats can be generated without having to redo massive amounts of work, as is the case currently. These “user” formats can even be generated on the fly, based on the user’s own preferences for things such as font, font size, colors, etc.

Alternative formats

Another crucial piece of the puzzle addresses accessibility. Voice synthesis technology has advanced far enough to make synthesized books quite enjoyable. While human “readings” will always be better, synthesized audio is a very large step forward. (For those of you whose only exposure to synthesis is from Microsoft Reader, check out Rhetorical Systems’ text-to-speech demo.) The voice synthesis system could use Shorten audio format for masters (or another lossless format), and convert to various other audio formats as needed, including MP3, Ogg Vorbis, WAV, AIFF, and even streamed formats, again even based on user preferences. Once these masters are made, they should be archived and placed on CDs, with the barcode.

Language translation

Another large hole in the picture is the lack of texts in multiple languages. Generally, only the more popular books are ever translated to other languages, leaving tens of thousands of books out of foreign libraries. Language translation software can be 90% accurate, according to one expert (my grandfather, Eldred Linden, who has been translating texts in half a dozen languages for over a decade). With a proofing system similar to the OCR proofing step, language translation could be quality-controlled well enough to make it a very real possibility in the immediate future.

Cataloging

Implementation of an authoritative catalog is relatively simple in the scheme of things. This authoritative catalog should index scanned images, raw OCR’ed text, proofread text, rich master documents, format options, alternative media, and language translations - all in one database. Of course, this database should include WorldCat and Library of Congress compliant MARC data, and Dublin Core metadata, along with copyright clearance information, processing history, version history, and pointers to each format at various locations, etc. Graphics for CD label artwork could be built on the fly from catalog data, or graphic artists could create CD label images that can be indexed within the catalog. Similarly, graphics for other uses, such as printable covers, etc. can be indexed as well. The catalog itself should be widely distributed and made available in multiple languages. Maintaining the catalog data could then be done by small groups of people interested in particular subsets of the catalog. For example, the Edgar Allan Poe Society of Baltimore might be interested in maintaining the catalog entries for Poe’s works.

Distribution

Of course, users must be able to access these texts for them to be truly useful. The most obvious interface is the aforementioned authoritative catalog. Not only should users be able to download the format of their choice in the language of their choice, but they should also have the option of downloading CD images or receiving CDs via snail mail. With all the data available to the catalog, CD masters can be created on-the-fly based on user preferences, giving users the ability to select what they want included on their CD (either image or disc). For example, I might want a CD image for all of Edgar Allan Poe’s works for research purposes. For this, I would want the scanned images and a couple of different format flavors. On the other hand, I may want every work done in a particular year range, and I could simply select this year range via a web form and have the system build my CD image for me. When the image is done, I could receive an email with a link to download my image so I can burn it, along with a link to the appropriate CD label graphic so I can print a nice label. Building CD case sleeve graphics could be done on the fly, with a complete list of everything on the CD. Again, I could print this on sleeve paper for my CD case. Naturally, once a custom collection is made (similar to a shopping cart system), the files can be downloaded in native format or compressed into a package, skipping the CD image step entirely.


While this system may sound daunting to many people, it is nothing to be afraid of. When broken down properly, the entire puzzle can be put together piece by piece. Most of the technology needed to make it all happen is either open source, or can be purchased for moderate sums using grant funds. The equipment to run such a system is also quite simple when broken down into pieces. The benefit of the “puzzle” approach is that no two steps would have to be done using the same equipment, or even in the same physical location.

Yes, I have left out many smaller items of interest, but rest assured, they all have their place, and can all be taken care of. This entire vision is scalable and sustainable - to enhance our lives now, and for our children and grandchildren to use and expand to suit their own needs.


There are currently four major eBook production projects, each of which has its own way of doing things, some positive and some negative.

Project Gutenberg

Project Gutenberg is the original eBook producer on the Internet, starting back in July 1971 with the “Declaration of Independence”. In the following 32 years, not a whole lot has changed for some aspects of the work, but recently, several important things have changed. PG has done well by allowing other people to mirror their archives and by actively maintaining a good volunteer group. The biggest problem with the system is the core file format in use. Gutenberg relies on a decades-old vanilla text format that does not handle character encoding, embedded illustrations, or other much-needed data. The inconsistent data structure also makes it hard to work with the files. Another large issue is the current state of PG’s catalog. With very little metadata available, the catalog has little use except to decipher the extremely vague file-naming scheme that PG employs. Gutenberg currently has no real version tracking in place, so when content is updated (usually typographical fixes), it is hard to find these things out. Again, I must commend PG on their volunteer base. Doing a production project on any scale requires volunteers, and here, Gutenberg has succeeded. In recent history, PG’s biggest improvement came with the introduction of the Distributed Proofreaders website. The DP system allows volunteers to proofread the raw OCR texts against the original scanned images, one page at a time. By implementing a two-step proofing process, DP has done well on quality. However, this system does not create rich master documents, so most of the original formatting is still lost in the final product. Any eBook production system would be wise to take advantage of DP’s successful system.

Million Book Project at the Internet Archive [Link]

The Million Book Project is a very ambitious project to - you guessed it - digitize one million books. While this goal is indeed admirable, so far, the project has done little but make existing problems even worse. They have, however, maintained archives of the original scan images, an improvement on PG’s work. The biggest problem I have found with MBP is their catalog - and I use the term “catalog” very loosely because their “catalog” is basically just a very rough listing that can be viewed using various criteria. An example of one of the issues with their catalog is the highly inaccurate topic headings. For instance, there are six topic heads for Shaktism, not one of which is correctly spelled. Another issue is that author names and titles are not consistently normalized using standard notation. This, of course, makes alphabetical browsing very difficult, not to mention forcing the user to constantly make mental adjustments when reading the list. While MBP has attempted to implement Dublin Core metadata, their implementation lacks anything beyond the very basic fields, so it is useless in any practical sense. Even while pointing out these issues, I must commend them on their scanned images archive.
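As a rough illustration of what normalizing names and titles means for alphabetical browsing, the sketch below builds simple sort forms. The function names and rules are assumptions made for this example; real cataloging rules (name particles such as “van”, corporate authors, non-English articles) are considerably more involved.

```python
LEADING_ARTICLES = ("the ", "a ", "an ")  # English only, an assumption of this sketch

def title_sort_key(title: str) -> str:
    """Lower-case a title and move a leading English article to the end."""
    lowered = " ".join(title.split()).lower()  # collapse stray whitespace first
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            return f"{lowered[len(article):]}, {article.strip()}"
    return lowered

def author_sort_form(name: str) -> str:
    """Turn 'First Middle Last' into 'Last, First Middle'."""
    cleaned = " ".join(name.split())
    # Names already in 'Last, First' form, or single names, are left alone.
    if "," in cleaned or " " not in cleaned:
        return cleaned
    given, family = cleaned.rsplit(" ", 1)
    return f"{family}, {given}"

titles = ["The Time Machine", "A Tale of Two Cities", "Emma"]
print(sorted(titles, key=title_sort_key))
print(author_sort_form("Mark Twain"))
```

With keys like these, a listing sorts by the first significant word of the title and by family name, which is what a reader browsing alphabetically expects.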

Electronic Book Center at the University of Virginia Library [Link]

This project is probably the most complete academic-centric library of eBooks and eTexts. They have carefully created master documents in SGML, therefore maintaining more of the original formatting than other projects have. The Electronic Book Center doesn’t really have a catalog, per se, although they have very neatly laid-out listing pages. This is their biggest problem, as it makes it hard to find different eBooks. Another, smaller, issue is that a lot of the library is only accessible from the University, and inaccessible to the general public. EBC did this for various reasons, primarily to control access to copyrighted work, and while unfortunate, it is necessary to keep some publishers and providers happy. While I would like to see EBC implement an XML vocabulary for their master documents, their use of SGML puts them well ahead of the pack in regard to complete masters.

Making of America [Link]

The Making of America project deals mostly with American content from the late 18th century to the late 19th century, but they have done an exceptional job with the content they have. The most impressive part of MOA is their catalog and search, which far exceed those of any other project. They have carefully indexed a lot of the important metadata, using normalized notation that makes the catalog very searchable. While their particular search engine technology is a bit slow, that should not detract too much from the good work on the data behind it. MOA has also preserved the original scanned images, one more point that puts them ahead of most projects. So far, I have not located any text versions of the content, which is a bit odd, considering the generally good system they seem to have. Unfortunately, scanned images alone aren’t enough for eBook work - there must be text flavors available. The project’s website can be a little confusing, but that is understandable given the search engine behind it and all the metadata they try to display.


Even combining all four of these projects, we don’t have one full “next generation” project, a disheartening thought. The good news is that with a little work, all these projects (and others) can support a full complement of features, proper archives, and invaluable format options for their entire content holdings.

Not one of these projects deals with accessibility issues. In particular, none of them has any system, rudimentary or not, for speech synthesis. All of the projects have major issues with their catalogs, either missing crucial metadata, or lacking a good, understandable interface to display the metadata that is available. Some of the projects have multiple formats available, but they are ad-hoc and inconsistent at best. It is hard to find out some information regarding the internal workings of the projects, with the exception of Project Gutenberg, where I have volunteered a good deal in the past few years. Some of the projects may indeed have long-term storage systems for the print books after they have been processed.

EBooks are an invaluable technology in this era, and will continue to be so for at least a few decades. Even in the future, should the current concept of eBooks be usurped by newer technology, the work done now will be important in any migration effort. In order to ensure full reusability, current projects need to keep the larger picture in mind.

For those of you who may be directly involved with any of these projects, I welcome feedback and corrections. I have endeavored to make this article as accurate as possible, but with limited access to parts of the projects, it is difficult to clearly address each point.


Special thanks go to Dr. Jon Noring for graciously allowing me to use his “Paper Domain” concept.

eTextBooks: The case for electronic textbooks in schools.

June 26th, 2003

In response to Why not eBook textbooks? @ Geek.com

I’ve been a complete geek since I was old enough to even say the word. I would have loved to have PDA and eBook technology available when I was in high school.

By making digital formats available, publishers can actually increase their profit margin. Because production costs (paper, ink, etc.) are reduced, electronic formats could be priced lower than printed textbooks and still provide the better profit margin. Distribution is wholly online, or on cheap-to-ship CDs, rather than thousands of pounds of paper books. This significantly cuts down on distribution costs as well.

Further savings can be made on the publishing end when new editions are needed, as publishers can quite easily issue addendums and/or partial replacements (depending on the file format), instead of incurring a whole new round of production and distribution costs. This would make it much more economical to keep the textbooks up to date - which is a serious problem in schools. I often had arguments with my teachers because new facts contradicted the science books.

The matter of backups for electronic media is quite simple to take care of. While it might be costly to set up initially, having a couple of kiosks in school hallways or offices would make it quite simple for students to reload their particular eTextbooks if they upgraded their personal PDA, or a file became corrupted, etc. Backups on the school level are even less complex.

The simple facts are:

  1. CDs are cheaper and more efficient to create, store, distribute, and replace.
  2. PDA technology (Hiebook, eBookMan, Palm, PocketPC based) is getting cheaper and cheaper.
  3. File management for archives, along with appropriate DRM standards, is not that complex to work out.
  4. In a work world which is almost completely immersed in technology in almost every field, the use of such technology in high schools would surely better prepare our students for the “real world.” No, students should not learn to completely rely on calculators and spell-checkers, but face it, everyone else does.
  5. A custom eTextbook device could be engineered to meet the specific needs of such a system. The sheer market size for such a device would make it very cheap when purchased in large quantities (at a school system, county, or state level).
  6. Most of the software infrastructure for such a system (eTextbook device included) is, or can be made, available via open source licenses.

Really, the only things keeping such a large technology advancement out of our schools are greedy companies and timid politicians.

On eBooks: An almost pointless rant about eBooks.

June 24th, 2003

I’ve often been asked about my personal opinions regarding the eBook format wars and related topics. Generally, I avoid lengthy discussions about them, but I have decided it is time to put some of my thoughts into words. It may be rambling at times, so bear with me. Many of the things I think are by no means original, and acknowledgements to all the different contributors would be longer than the text itself, so forgive me for not typing out a bibliography.

Currently, there are many formats from which a user may choose when creating, publishing, or using an eBook. This often leads to confusion and frustration, as there never seems to be a true “correct” format to choose. One format may work well for the user’s PDA, while another may be suited for their desktop computer. This means that users need to worry about two different formats, just to fulfill their own particular needs. The case for publishers is quite different. Publishers must attempt to gauge the market saturation of each format based on only vague statistics which are often highly inaccurate in the first place. Once they have made assumptions on the value of each format, they must decide which ones to actually provide to the end-user. Naturally, we cannot ignore such factors as DRM, distribution methods, etc. in this equation. So many variables often intimidate small-time publishers and keep them from taking advantage of eBook technology. Larger publishers may choose to “ramrod” a particular format, generally based solely on their own market perspective. All of these factors are further influenced by corporate marketing and direct vendor pressure. Unfortunately for the end-user, such is the taste of capitalism.

Capitalism should not be the name of the game in certain venues, and I would much prefer a socialist mentality in regard to eBook technology. Innovation is crucial to any growth, but too many conflicting changes surely hurt all. Some of the “bigger names” in the eBook world, limited as it is, have come together to attempt to build a “standard” for eBook formatting under the banner of the Open eBook Forum. While noble, this act alone will not suffice to solve the prevailing issue. Content authors, publishers, distributors, and end-users alike must collaborate on the solution, or the entire puzzle will not be completed. OeB’s high membership fees and limited perspective, if left unchecked, will only serve to make such matters worse.

I have had the pleasure of living both in the United States and Canada, and I dream of a country built upon the merger of the two. To some, this would be heresy, to others, a pipe dream. Call it what you will, but I will not cease to hope for such improvements. In Canada, I feel safe walking down almost every street, no matter the time of the day or night. When I go into the local library, I am received by staff who are helpful and courteous - from deep down, not fabricated or forced. People help each other in small ways everywhere I look: to cross the street, or hold the door, or clean up spilled coffee. Older people often offer to let me go ahead of them in line, which I always refuse. Yes, there are homeless and needy persons in Canada, most of whom have simply fallen upon harder times than others. In the States, I generally analyze each street, judging threats to my safety and sanity, and what such threats may inflict upon others around me. When I walk into a store, I am taken care of by employees who seem to be only waiting for lunch, closing time, or their next break. They are polite, but not often forthcoming with real helpfulness. I often feel the need to manipulate situations to extract the assistance I may need, or feel compelled to spend long minutes “shooting the shit” with nervous, chatty people who seem to only want to make themselves feel like they are doing their job, and don’t particularly seem to care what I think. Most of the homeless people I have met in the States don’t appear to be motivated to change themselves or their lives. Don’t get me wrong, my comparisons between the States and Canada are not based on scientific statistics or studies, but more on a random sampling of my own experiences in small towns and big cities. I love the United States and Canada, each for its own particular qualities.

So, you ask, what does this have to do with eBooks? Like most things in life, everything has something to do with everything else. The mentality seems to be so similar from one venue to the next that it is hard to decipher the line between them. Many people I talk to about eBooks don’t grasp the possibilities of the technology, nor can they appreciate the hardships in the evolution of it. Authors seem to be timid about eBooks, lest their meaning be lost. Publishers are paranoid and don’t want “their” product to be pirated. Vendors don’t see it as a viable technology, because they can’t make a big enough profit. End-users are confused by too many variables, and give up on getting a solution for their needs.

If each group of people in the equation decided to lay aside their own personal fear, wealth, paranoia, and frustration, we could build the eBook puzzle, together, to the benefit of all. Unrealistic? Maybe. Impossible? Never!

What if one construction company decided to do every portion of a project instead of subcontracting out to qualified people for each piece? If this were the case, a single contractor would have to be so large as to have all its own experts from every imaginable field, from concrete to plumbing and everything in between. Such a company would be so large as to surely collapse upon itself under its own administrative load. Projects would be bogged down in their own mire. This is insanity, you are surely saying, but is it? The industrial revolution led to specialization of professions so that each piece of a puzzle could be put in its place by those best qualified to do so. Have we, the eBook community as a whole, not learned from the lessons of history? A single format will never solve every issue, nor a single company serve every need.

The existence of multiple formats is not the issue at the heart of our woes. Capitalistic companies do not shape our perspective on their own. We, the authors, publishers, vendors, and end-users, all must build our own solutions based on our needs, working together when possible.

When the founding fathers of the United States came together and wrote the framework for modern government, they did not have all the solutions, nor were they so arrogant as to believe they did. They did, however, believe that they and their successors and descendants would and could find solutions to the issues. Abraham Lincoln did not have answers to the underlying sociological issues of racism and slavery when he uttered the words “…all persons held as slaves…be then, thenceforward, and forever free;”. Winston Churchill said, in 1941, “Never give in–never, never, never, never, in nothing great or small, large or petty, never give in except to convictions of honour and good sense. Never yield to force; never yield to the apparently overwhelming might of the enemy.” Little did he know the price of such a stance that would come in just a few short years. To this day, history shows that because a portion of the world did not give in or yield to force, most of the world remains generally free.

There is no right or wrong way in the eBook world, much like most of life itself. We must simply use “good sense” and “convictions of honor” in building the foundation for the future of literature. Printed books will never become extinct, but like everything else, they will have their place in the world. Bantering and soapboxing issues doesn’t usually solve them. If you think you have a better idea, work on it, and see what the world has to say. Alexander Graham Bell did. Henry Ford did. Edward Jenner did. Igor Sikorsky did. Why not you?

Do I have practical solutions to some of the actual issues? Yes, I do. I have been working on them, with a lot of input from other people, for the past four years. Do I have everything solved? Don’t be absurd, I’m only one man. Do I want to help solve the entire puzzle? Absolutely!

eBooks: The case for electronic books.

June 11th, 2003

If people started thinking outside the spine, e-books would really start to take off as the viable technology we know it to be. Here are some of the things I have run across and/or thought about in my e-bookie life.

  1. Most people I talk to think that ebooks are only things like “How to get your site listed in search engines” and “Make $500,000 a month at home” kind of crap. The saddest part is that several of these people worked in arenas where they should know better (blind services, etc).
  2. Some ebook formats (PDF, LIT, OeB) are far too stuck in the print mentality to one extent or another. E-books don’t have fixed page numbers or spines or covers. I will never understand why people think they do or should. E-books don’t need to be displayed on a “cool” page-oriented background image, etc. Yes, one may argue that scholars want page numbers, or librarians want spine metadata, or that covers look “cool,” but it all comes down to one fundamental thing–they aren’t needed for anything. Having nice images (including covers) is good, don’t get me wrong, but they should be molded into the overall presentation, not like a cover on a book. The spine metadata can be represented in more suitable ways. Unfortunately, these formats are the most prevalent ones (in the media, etc.–not necessarily in implementation).
  3. Complicated DRM is only going to stunt technology and/or market persuasion. Look at what happened with PDF, for example. I think their DRM would have worked, had they not made such a big deal out of it, and had they kept it in proportion to the need. DRM is mandatory for some things, useful for others, but completely wasted for some. Implementing DRM for public domain material, for example, seems daft to me.
  4. Finding specific e-book titles is hit-or-miss. There is no easy and accurate way to find them. Google is the best, but getting the actual book one is searching for can be interesting. I’ve ended up on those dreaded “porn-popper” pages because the page happened to match somehow, and wasn’t an obvious bad result. Building a good online catalog with a simple search interface would go a long way, in my opinion. After talking with the powers that be, I’ve found that getting such a catalog going for PG, for example, would be a weekend project–and not just a hack job of it, either. Maintaining it, because of PG’s current cataloging system, would be more complex, but certainly doable.
  5. Of course, the general format issue must come up in any list of ails and ills. While I have no hard statistical evidence to back up my opinions on this matter, I have strong feelings that PG’s text format (among others) is holding things up somewhat. While PGTXT is usable and non-proprietary, it is hideous to most people for one reason or another. Keeping plain text versions is a wise plan, so we don’t get stuck with unusable data later, but, we should have more enhanced versions available for EVERY text. Yes, this is my most common point of discussion, and yes, I have been working on the issue myself, via raptorbook.org, but solitary work won’t get it done anytime soon. There are others, such as BlackMask–I’m not alone–but back to my point. When I hear complaints about PGTXT or other formats, I can’t help but grunt and say to myself, “So do something about it.” An average of 10 minutes per text is all it takes to get an etext from PG into various other formats using the RaptorBook Engine, and I’m just hoping someone else comes up with an even better solution than I have.
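To give a feel for the format point in item 5, here is a minimal sketch of turning a plain text file into a very basic HTML version. It is not how the RaptorBook Engine works; the function name `text_to_html` and the assumption that paragraphs are separated by blank lines are purely illustrative, and real texts need far more care (headings, italics, poetry, illustrations).

```python
import html

def text_to_html(plain_text: str, title: str) -> str:
    """Wrap blank-line-separated paragraphs of plain text in minimal HTML."""
    paragraphs = []
    for block in plain_text.replace("\r\n", "\n").split("\n\n"):
        # Re-join hard-wrapped lines into one paragraph, then escape &, <, >.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            paragraphs.append(f"<p>{html.escape(joined)}</p>")
    body = "\n".join(paragraphs)
    return (
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )

# Hypothetical sample text with hard-wrapped lines and a blank-line paragraph break.
sample = "First line\nsecond line\n\nTom & Jerry"
print(text_to_html(sample, "Sample"))
```

Keeping the plain text as the source and generating richer formats from it means the plain version is never lost, which fits the point about keeping plain text versions around.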

So, out of my list of ills, there are two items I am working on myself–cataloging and formatting, and while slow, it certainly isn’t hard work. A couple hundred ants can move a pile of leaves in a matter of hours; a couple dozen humans could format PG’s entire archive in a matter of weeks.

If you want better formatting for eBooks–then do it. Improve upon it, then do it again.

If you want audio versions of eBooks–then do it. Improve upon it, then do it again.

Work, when done properly, is not wasted, no matter how repetitious or incorrect it is. Life is simple, as are life’s problems. Break it down into smaller problems and solve each one, thus solving the entire problem.