Friday, August 27, 2010

Back to School

There's been a little discussion over on the flickr AlphaSmart forum about NaNoWriMo, and (as is often the case there) the topic has broadened into talking about other writing machines and old computers we've had, and of course I've butted in and talked up typewriters and the Typewriter Brigade. Regular readers will know that I spent quite a bit of time using my AlphaSmart Pro to retype last year's draft, originally written on a Royal KMG and a Smith-Corona Skyriter. We talked about OCR software in the Brigade topic last year in the NaNo forums, and I did some experiments, as I have access to a sheet-feeding scanner on our office copier. The best of the choices open to me appeared to be the Tesseract OCR engine. I installed it sometime in early November, ran it on scans of my pages thus far, and then quit when it spat out a bunch of text like "1112 98djhs do pifu hjegf".

Not helpful.

So, I dug out the AlphaSmart and spent time retyping it with the goal of getting my proof printed before the deadline. (Just made it.) The AlphaSmart has strengths like super-portability and super-simplicity which are ideal for producing words, but I ultimately wanted to see more of the text at a time as I was rewriting or laying in my changes from the typed copy, something impossible to do in the 4x40 confines of the AlphaSmart window. And after typing all that stuff in, I'm burned out on the whole book in general right now.

So, I'm thinking about looking back into OCR as an option for this year. Tesseract is a very no-frills OCR program, and lacks pretty interfaces and intuitive controls, but that's OK, since I am, at heart and by choice, a geek. I've been reading more about training Tesseract -- sending it to school, essentially -- and am thinking about writing up some sample pages on my main machines that I can use to refine its guesses. My own retyped copy has a number of transcription errors, beyond the usual first-draft grammar hangups and plot issues. If you see a typecast discussing a lazy dog, jumping foxes, and brown liquor jugs, know that school's in session.
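For the sample pages, one simple sanity check is to confirm that the text covers every character the typewriter will be asked to produce. The snippet below is only a rough sketch of that idea, not part of Tesseract's actual training procedure. The two sentences are assumed stand-ins for the dog-and-jugs pangrams, and `missing_characters` is a hypothetical helper name.

```python
import string

# Assumed sample sentences; the post only hints at pangrams like these.
SAMPLE_LINES = [
    "The quick brown fox jumps over the lazy dog.",
    "Pack my box with five dozen liquor jugs.",
]


def missing_characters(lines, wanted):
    """Return the wanted characters that never appear in the sample lines."""
    seen = set("".join(lines))
    return sorted(set(wanted) - seen)


# Lowercase letters are covered by the pangrams; digits are not.
missing_lower = missing_characters(SAMPLE_LINES, string.ascii_lowercase)
missing_digits = missing_characters(SAMPLE_LINES, string.digits)
print("missing lowercase:", missing_lower)
print("missing digits:", missing_digits)
```

Anything reported as missing (digits, capitals, punctuation) would need its own line on the sample page before the machine's full character set is represented.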

16 comments:

deek said...

One OCR methodology would be to scan pages after you type them, whether a page at a time or after a day's writing session.

I know that my scanner/software setup gives me great OCR text, but at about 30 seconds per page, the thought of doing that with a 200-page document is overwhelming (even though it's less than 2 hours, full-bore).

mpclemens said...

Our setup is a bit more automatic. The copier has a sheet feeder, so I just plunk down the draft in the feeder and hit "scan", and it all gets shot into a multi-page .TIF file, which is exactly what Tesseract wants to read. It can chew on the text as long as it wants.

The trick is having it turn out something useful from the typing in the first place.
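For the curious, the hand-off from scanner to OCR can be scripted. This is a minimal sketch, assuming the `tesseract` command-line program is installed and on the PATH, and that the installed build can read multi-page TIFF files. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_scan(image_path):
    """Run Tesseract on a scanned image and return the recognized text."""
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base on its own.
    output_base = image_path.with_suffix("")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_scan("draft.tif")` (a hypothetical file name) would leave `draft.txt` next to the scan and hand back its contents for editing.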

Elizabeth H. said...

I'm just a tech support geek and not a programmer, so I come at these things differently, but I'd recommend scaring up an out of date copy of OmniPage. It isn't free or open source, but it sure works! And if you get a back version on eBay or whatever, I don't believe it's all that expensive. The OmniPage I use is actually a very limited version that came with my scanner. It does what I need to do and does it well.

I only have a flatbed scanner, so I had to do the very slow page at a time thing. Took forever, even for just a hundred pages, but then I did a first-sweep edit as I was doing it. I copied the text and pasted as plain text into a word processor and tidied up as I went along. But you should also be able to just create a tif with your big office scanner and then work through it at your leisure with the OCR software.

mpclemens said...

Yes, OmniPage is regularly brought up in OCR discussions. It is rather dear in terms of cost and tech requirements, though, and doesn't run on either of my platforms-of-choice. Just for fun, I re-ran Tesseract on my 2009 draft, and it actually did a good job with a lot of it, without training. I think dirty typeslugs and using multiple machines might have tripped it up.

James Watterson said...

I just got new batteries for my AS. It comes in handy, especially with school starting back next week. I have a horrible OCR setup, the flatbed. I wish I had a sheet-fed scanner!!! It would make the nightmare of OCR seem less involving.

Unknown said...

This is a sample of my OCR output, totally unedited and unretouched:

- - -

This was originally typed on my Olympia SM-7 portable (7)
typewriter with a faded ribbon. (I have new ribbons coming.)
The OCR scan is done on a sheet-fed Canon LiDE 90 USB-
powered flatbed scanner. The key to successful OCR output
is the scanning software, Vuescan, from wwwl www.hamrick.com
for either windows or Mac, $39.95 a copy. It has OCR built
in as part of its output.
I use the Mac version, running OS X "Tiger." My typed page
scan output is completely automatic; no "training" needed.
I will try to post a grayscale scan of this typed page so
you can see that it is far from being an optimal scan candidate
=GB=


- - -

For anyone curious, here is a grayscale scan of the typed sheet:

http://www.graybyrd.com/ftp/typed-page.jpg

I was not able to override the automatic contrast enhancement of the scan software, so this scan looks significantly better than the actual sheet.

I'm reminded of the City Editor who would tell his reporters only once to put a new ribbon in their typewriter, as he was sick of reading the dim type. Next time he would visit the reporter's desk with a pair of scissors in hand: "snip" ... "Oh," he would say. "It seems you'll need a new ribbon!"

=GB=

mpclemens said...

That's a good exercise: here's the same text run through Tesseract, after I saved it to a .tif file first:

-----
This was originally typed on my Olympia SM-? portable (7)
typewriter with a faded ribbon. (I have new ribbons coming.)
The OCR scan is done on a sheet-fed Canon LiDE 90 USB-
powered flatbed scanner. The key to successful OCR output
is the scanning software, Vuescan, from www; www.hamrick.com
for either windows or Mac, $39.95 a copy. It has OCR built
in as part of its output.
I use the Mac version, running OS X "Tiger." My typed page
scan output is completely automatic; no "training" needed.
I will try to post a grayscale scan of this typed page so
you can see that it is far from being an optimal scan candidate.
=GB=
-----

No training is needed for Tesseract out of the box, but I think the typeface and the state of the ribbon on my Royal gave it some fits.
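Spotting where two OCR passes disagree is easier with a character-level comparison than by eye. Here is a rough sketch using Python's standard `difflib` module. Only the three lines that differ between the two samples above are copied in, and `differing_spans` is a hypothetical helper name.

```python
import difflib

# Only the lines where the two samples above differ, copied from the comments.
vuescan = [
    "This was originally typed on my Olympia SM-7 portable (7)",
    "is the scanning software, Vuescan, from wwwl www.hamrick.com",
    "you can see that it is far from being an optimal scan candidate",
]
tesseract = [
    "This was originally typed on my Olympia SM-? portable (7)",
    "is the scanning software, Vuescan, from www; www.hamrick.com",
    "you can see that it is far from being an optimal scan candidate.",
]


def differing_spans(a, b):
    """Return (text_in_a, text_in_b) pairs for each stretch where a and b disagree."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Print the disagreements line by line.
for left, right in zip(vuescan, tesseract):
    print(differing_spans(left, right))
```

Neither output is treated as the correct one here; the comparison only flags the spots worth checking against the typed page.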

Unknown said...

Excellent follow-up: nice to see that your OCR worked so well on the scan sample.

FYI, I had tried the OmniPage SE OCR software from ScanSoft that came with my Canon LiDE 90. It was a dismal failure. Output from the typewritten page looked like Polish curse words.

Given the success of these scan tests, I'd have no problem typing 250-word pages for NaNoWriMo on my Olympia, knowing that I can scan them into the Mac & Scrivener for editing.

=GB=

mpclemens said...

I've always found the nice rounded typeface of the Olympia to be easy-reading. Looks like the software agrees.

@Grampa: if you're serious about NaNo'ing on a typewriter, there's always room in the Typewriter Brigade. And obviously there's crossover with the AlphaSmart folks too, despite how much we might make fun of them come November.

Elizabeth H. said...

FYI, I had tried the OmniPage SE OCR software from ScanSoft that came with my Canon LiDE 90.

Huh...that's the same scanner I have and the same software that works so well for me. I do vaguely recall that it worked much better if you did your scanning within the program--it isn't very good at scanning preexisting image files. Maybe I should retract my praise for that reason.

Thanks for the Vuescan mention! Looks like a great product at a reasonable price.

Mike Speegle said...

I...erm...I just type it again on the computer during the second draft.

...yeah.

Unknown said...

LFP -- I did the OPage-SE scan within the MP Navigator app. I'm sure that the dim-ribbon content was the problem. VueScan does a remarkable auto contrast-enhancement, which I'm thinking is the secret.

For other more normal content, the OPage-SE software is really great. I'd recommend it without reservation.

=GB=

Anonymous said...

I've never gotten OmniPage to work, though I don't really know how to work it. Besides, my scanner is pathetically slow, like the old ladies in the grocery store who lean on the carts and shuffle along. The problem is, the old lady can't see and continuously picks up the wrong items and you have to reach into her cart and put them back.


That would be your grandmother, in general. My metaphoric skills are horrible, but classes start again soon and maybe I'll be able to think up something coherent next time I decide to meta-story your blog.

Duffy Moon said...

Can't really comment on OCR, and quite frankly it sorta frightens me.

However: one thing that might help is getting the typed manuscript in more readable shape. A possible way to do that was demonstrated by Richard Polt:

He took Selectric-style ribbons (carbon on film) and re-spooled them on regular 1/2" spools, and threaded that through his manual machines. He ended up with some really superbly crisp type.

Now, the disclaimer: I've tried this myself and have had very poor results. The film ribbon keeps getting crumpled or doesn't feed fast enough/far enough for each typed letter, and so parts of letters are missing. Somehow Polt got it to work on several machines, including an SM-4 (I saw this, and tried it, first-hand). On my own SM-4 I just can't get it to work.

Hopefully your mileage may vary...

mpclemens said...

I keep passing over Selectric ribbons at the thrift store -- they had a box of six for a while, even -- because I simply cannot imagine myself being able to spool that extra width on a 1/2" spool. I have enough trouble dealing with ribbons that fit.

Clear type does appear to be tripping me up, though. It would not be out of order for me to brush the type on the Skyriter.

Charles said...

A scanning suggestion follows. This works.

1. Type a page.
2. Scan the page into .PDF format at 600x600 resolution. 300x300 resolution also works, but you will have more OCR errors.
3. Start OmniPage 17.
4. Open the document within OmniPage.
5. Save the OCR'd document in .DOC format.