Tech Tip: Convert to Text!


By Ken Fox

You know what really grinds my gears? When I open a PDF file containing what appears to be digitally-formatted text and find that it is non-copyable and non-searchable. Searching, copying and pasting text are essential functions of digital communications – so the idea that a text is born digitally and therefore ASCII (American Standard Code for Information Interchange) encoded, and that somebody wittingly or unwittingly should then remove that functionality, leads to much weeping and wailing and gnashing of teeth on my part.

Well, just last week I was sent a large PDF document with more than 70 pages of text. So I opened it in Adobe Acrobat, tried to execute a search for a key term, and found that it was (you guessed it) another one of those documents that had signs of ASCII-formatted text in its lineage but had, through the manipulations of some kind of monster, been reduced to the mere semblance of text, no more searchable than a stack of paper.

So naturally I commenced with my usual process of wailing and gnashing, but after a few minutes of that I got a notion that maybe I should try something different. In near desperation, I got the idea that – just maybe – if I “select all” and paste it into a text editor then some hitherto-hidden ASCII-encoded text might appear. Worth a try, right?

So I hit control-A, and THIS happened:

Hello!

“Why yes,” I said out loud, “in fact I WOULD like to run text recognition to make the text on this page accessible – THANKS for asking!”

I clicked Yes.

Then I got asked for some settings, which I ignored and just clicked OK – opting for the default option in my excitement.

Adobe Acrobat then leapt through my document, systematically performing the miracle of breathing life into the dead letters at the rate of about a page a second – slightly faster for the “born digital” main portion, and a bit slower for some appendices that bore the stigmata of pre-digital technology.

The result was perfectly copyable, pastable, searchable text in the main body of the document. As for the typewritten appendices, Acrobat almost flawlessly converted them into digital text as well, while maintaining the visual features of the original typed text. Basically, the document looked identical to how it had looked prior to the procedure but was now digitally functional. The only letters and numbers that resisted the resurrection were data from a single table with a very small typeface – those few characters remained a heretical community of graphics in the midst of a near-universal mass conversion.

Optical character recognition (OCR) technology has come a long way in a few short years.

Now if you work anywhere in the legal industry (or do any kind of office work), then there is a good chance you have been able to follow right along, and to some of you, this is already old news and why am I boring you. But if there are any among you who don’t know what I’m talking about with text that can be searched and copied – you need to learn a few tricks that will make your life a whole lot easier. Begin with learning these commands, which work on almost all text-editing software:

CTL-F … Find text in document

CTL-A … Select All

CTL-X … Cut selected text

CTL-C … Copy selected text

CTL-V … Paste the last text you cut or copied

CTL-Z … Undo last operation

CTL-Y … Redo undone operation

CTL-H … Find specified text in document and replace it with other text

You can use point-and-click menus for these operations as well, but I find the keyboard shortcuts easier. These features, and many others, are now standard practice in office work – so learning them will not get you ahead so much as get you caught up with the rest of us.
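For the curious, here is roughly what Find (CTL-F) and Replace (CTL-H) amount to once text really is digital text. This is only a minimal Python sketch: the sample sentence and the function names `find_all` and `replace_all` are made up for illustration, and real editors offer more options (whole-word matching, wildcards and so on).

```python
def find_all(text, term):
    """Return the starting position of every case-insensitive match of term."""
    positions = []
    haystack, needle = text.lower(), term.lower()
    if not needle:
        return positions  # nothing to search for
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Continue searching just past the match we found
        start = haystack.find(needle, start + 1)
    return positions


def replace_all(text, old, new):
    """Replace every exact (case-sensitive) occurrence of old with new."""
    return text.replace(old, new)


# Hypothetical sample text, for illustration only
sample = "The tenant shall notify the landlord. The Tenant shall pay rent."

print(find_all(sample, "tenant"))
print(replace_all(sample, "landlord", "lessor"))
```

None of this can work on a picture of text, which is why a scanned page has to go through text recognition first.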

And if you ever come across a text, especially a longish one, for which the above commands do not work, try to keep the weeping, wailing and tooth-gnashing to a minimum. When you are done with that, wipe the tears off your keyboard and try the simple operation described above: select all in Acrobat and say yes when it offers to run text recognition. Failing that, try something else. And if all else fails, ask your friend in IT to perform a miracle. Because there is no reason to tolerate text in a digital file that cannot function as digital text.
