How to convert PDF files for translation in CAT tools
CAT tools are becoming more and more indispensable in the translation business: they cut translation time and improve overall translation quality. So for my first post here I decided to write about preparing PDF files for translation in CAT tools such as SDL Trados or MemoQ. There are basically two types of PDFs. The first can be converted to Word format quite easily with well-known software such as PDF to Word converter, ABBYY PDF Transformer and the like (the list is long, so pick whichever you like best). The second type is far more interesting: PDFs produced by scanning paper documents. They are really just pictures that a persistent customer wants you to translate. I won't go into the extra cost that may be charged for this kind of translation, or whether such a job should be taken on at all. For us at STT, a job like this is quite a challenging assignment.
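If you are not sure which of the two types you have, a rough check is to see how much text can be extracted from each page. Below is a minimal Python sketch of that idea. It assumes you have already pulled the text out of each page with a PDF library of your choice (the extraction itself is not shown), and both the function name `looks_scanned` and the threshold of 50 characters per page are my own illustrative choices, not settings of any particular tool.

```python
def looks_scanned(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is a scan from the text extracted per page.

    page_texts: list of strings (or None), one per page, produced by
    whatever PDF library you use. This input is an assumption of the sketch.
    min_chars_per_page: arbitrary threshold; tune it on your own files.
    """
    if not page_texts:
        return True  # nothing could be extracted at all
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# A scan usually yields (almost) no text; a "real" PDF yields plenty.
print(looks_scanned(["", "   ", None]))        # True
print(looks_scanned(["A" * 500, "B" * 300]))   # False
```

A PDF that fails this check will most likely need the OCR route described in the steps below.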
1) First of all, this type of PDF still needs to be converted to Word, since that is the format best suited to a CAT tool (such tools work with some other popular formats too, of course, but here we are mostly interested in Word). For this purpose ABBYY FineReader seems to be the best solution, as it offers the most flexibility in processing pictures. So go ahead and read the file in FineReader.
2) However much you want the file ready for CAT at the snap of your fingers, there is no single fast solution for this. Even though FineReader can read and save the file automatically, it is highly recommended to mark it up manually after the pre-read. Mark pictures (i.e. the parts that do not have to be translated) and tables first. Try to avoid individual text blocks, as their location may be quite unpredictable later on. Also, you may not want to read headers, as they may end up anywhere in the text. Your goal is to get as much plain text ready for CAT as possible. It is quite a tedious and time-consuming job, but the time spent on markup at this stage will pay off later when working in CAT.
3) After the OCR (Optical Character Recognition) our assignment looks very much like the original paper, only in Word. You can even feed it to CAT as it is. Shortly afterwards, though, you may find yourself copying and pasting thousands and thousands of tags from source to target and excavating tiny particles of meaningful information from that deposit of tags. That is enough to make your life quite miserable in a short while. Moreover, after struggling with tags in CAT, it can be quite disappointing to see how weird your text looks after exporting the target: tables and pictures start floating everywhere despite your desperate attempts to bring everything back to normal.
4) It is after the OCR that preparing the file needs the most attention. There are two ways of getting a more or less acceptable result. One is to kill all the formatting (Ctrl+A to select all, then Ctrl+Space). Please note, however, that this way all the tables will become plain text and will have to be built up again after translation. Another solution is to use the CodeZapper macro for Word from David Turner. It is actually quite a powerful tool. In one click it removes all the unnecessary codes, hyphens and whatever else you can’t see. If the file is overloaded with pictures, it can move them to a separate file and put them back after translation. This is very helpful because SDL Trados, for example, is very slow at processing huge files where pictures make up most of the size. On top of that, CodeZapper can count the words in text boxes and in the whole file!
5) Unfortunately, Zapper does not work miracles either. The file’s formatting may still be chaotic. So after running the macro, look through the file for corrupted tables and illegible words and correct them manually. The file does not have to look like the original at this stage. Get as much legible text as possible. If required, the whole text may even be converted to a one-column table to get proper segmentation. It is quite irritating to come across a sentence that has been split into two segments in the middle, so that the translations of the two segments have to be swapped. The result will look fine in the target, but it is not good for the translation memory.
6) Once you’ve checked the tables, feed the file to CAT, translate, export the target, and only now start working on the proper formatting of the text. Make headers, insert pictures and create the styles you are going to use for building the table of contents, going through one page after another, and hopefully your document will end up looking like a properly translated original.
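To make the cleanup in steps 4) and 5) a bit more concrete, here is a minimal Python sketch of the two ideas behind it. It is a toy model, not CodeZapper's actual code and not a Word API. The `Run` class, the `spacing` attribute (standing in for the tiny formatting differences that OCR tends to scatter through a file) and both function names are assumptions made up for this illustration. The first function merges neighbouring pieces of text that differ only in such noise and drops optional hyphens. The second glues together paragraphs that were broken in the middle of a sentence, so that each sentence ends up in one segment.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"  # optional hyphen, invisible unless a line breaks there
SENTENCE_END = (".", "!", "?", ":", ";")


@dataclass
class Run:
    """A piece of text with uniform formatting (hypothetical toy model)."""
    text: str
    bold: bool = False
    italic: bool = False
    spacing: float = 0.0  # stands in for OCR noise we do not care about


def merge_runs(runs):
    """Merge neighbouring runs that differ only in noise (here: spacing)."""
    merged = []
    for run in runs:
        text = run.text.replace(SOFT_HYPHEN, "")
        if not text:
            continue
        prev = merged[-1] if merged else None
        if prev and (prev.bold, prev.italic) == (run.bold, run.italic):
            prev.text += text  # same meaningful formatting: no tag needed
        else:
            merged.append(Run(text, run.bold, run.italic))
    return merged


def rejoin_broken_paragraphs(paragraphs):
    """Join paragraphs that stop mid-sentence with the one that follows."""
    fixed = []
    carry = ""
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # skip empty paragraphs left over from OCR
        carry = f"{carry} {text}" if carry else text
        if carry.endswith(SENTENCE_END):
            fixed.append(carry)
            carry = ""
    if carry:
        fixed.append(carry)  # keep trailing text without final punctuation
    return fixed


runs = [
    Run("Trans", spacing=0.1),
    Run("la\u00adtion ", spacing=0.2),
    Run("memory", bold=True),
    Run(" matters", spacing=0.3),
]
print([r.text for r in merge_runs(runs)])
# ['Translation ', 'memory', ' matters']

print(rejoin_broken_paragraphs(
    ["The pump must be", "switched off before service.", "", "Check the valve."]
))
# ['The pump must be switched off before service.', 'Check the valve.']
```

Fewer runs means fewer tags to copy in CAT. Note that the paragraph heuristic will also glue a heading (which has no final punctuation) to the text that follows it, so the result still needs the manual look-through described in step 5).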
It must be said that the effort described above only pays off on major projects, when you are expecting or handling repetitive files of similar content with a huge number of repetitions and context matches. Otherwise the fastest way to translate a ‘pictured’ PDF is to type the translation manually from scratch in a blank document, leaving out the original formatting. Another solution is to translate one of the repetitive pages, trying to keep the required format (tables, pictures and so on), and use it as a template for further work. This way you can also save some time, as all the fuss with FineReader can be quite time-consuming.
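To get a feel for whether a project is repetitive enough to justify the preparation, you can estimate the share of repeated segments before you start. The Python sketch below is only a rough approximation under my own assumptions: it counts exact repetitions after normalising case and whitespace, whereas the analysis in a real CAT tool also reports fuzzy and context matches. The function name `repetition_rate` is made up for this example.

```python
import re
from collections import Counter


def repetition_rate(segments):
    """Share of segments that repeat an earlier segment (0.0 to 1.0)."""
    normalized = [re.sub(r"\s+", " ", s).strip().lower() for s in segments]
    normalized = [s for s in normalized if s]  # ignore empty segments
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    # Every occurrence after the first one counts as a repetition.
    repeated = sum(n - 1 for n in counts.values())
    return repeated / len(normalized)


segments = [
    "Switch off the pump.",
    "Check the valve.",
    "switch off  the pump.",
    "Switch off the pump.",
]
print(repetition_rate(segments))  # 0.5
```

Where exactly the effort starts to pay off is a judgment call that depends on the size of the project and on your rates.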