Crop and split a book scan in 3 commands

Command-line processing of multipage book-type scanned documents with ImageMagick.

TL;DR:

convert -density 300 orig-scan.pdf pages.png
convert `ls pages-*.png` -crop 3704x1852+160+20 +repage -crop 50%x100% pages-split.png
convert `ls pages-split*` -page 100%x100% result.pdf

Imagine you have scanned part of a book or some other material consisting of multiple pages, and the resulting document is a PDF where every page contains two facing pages of the original book. Moreover, there are ugly borders you don’t want either.

Screenshot of the original PDF

Now, the result you want is a clean PDF with one page of the book per PDF page. I’ll show what to use to achieve that (the quality of the output and the amount of additional work depend on how carefully you scanned the book).

You will need convert, which is installed on most Linux machines, I think. If not, get it from the ImageMagick website. (In ImageMagick 7 the same functionality is also available through the magick command.) Since we won’t use any other tool here, there’s a good chance that this tutorial applies not only to Linux and Mac but roughly to Windows, too.

One little warning: ImageMagick is not exactly fast, and whether it uses more than one core depends on how it was built, so depending on the number of pages in your scan, the operations might take a VERY long time. For large documents, try the -monitor option to watch the progress.

Step 1.: PDF to image sequence

(If you opted for image sequence output from your scanner rather than PDF, you can skip this part.)

This essentially takes every page of the PDF and saves it as a bitmap image. (It actually works for text documents too, not only for PDFs consisting of images.)

convert -density 300 orig-scan.pdf pages.png

This will result in a sequence of files like pages-0.png, pages-1.png, pages-2.png, because ImageMagick is very smart. If your implementation by any chance overwrites the output, try pages-%02d.png. Don’t ask.
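Zero-padded names such as the pages-%02d.png pattern above also help with ordering once there are ten or more pages. A small sketch with made-up file names (no ImageMagick needed) shows how a plain alphabetical sort treats both naming styles:

```shell
# Default-style names: alphabetical order puts page 10 before page 2.
unpadded=$(printf '%s\n' pages-0.png pages-2.png pages-10.png | LC_ALL=C sort)
echo "$unpadded"
# pages-0.png
# pages-10.png
# pages-2.png

# Zero-padded names: alphabetical order matches page order.
padded=$(printf '%s\n' pages-00.png pages-02.png pages-10.png | LC_ALL=C sort)
echo "$padded"
# pages-00.png
# pages-02.png
# pages-10.png
```

Shell wildcards and ls list files alphabetically in the same way, so padding the numbers is the simplest way to keep the pages in sequence.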

The -density 300 is the DPI setting. If you don’t want to OCR the result later, you can use 300 or even lower; for OCR use at least 400 (if the original scan has that resolution). There’s no “original” setting as far as I know.
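To get a feeling for what a given density means for the image size (and therefore for the crop numbers later), the pixel length of an edge is roughly its physical length in inches times the DPI. A minimal sketch, assuming an A4 sheet (297 mm long edge) purely as an example; mm_to_px is a hypothetical helper, not part of ImageMagick:

```shell
# px = mm * dpi / 25.4, rounded to the nearest pixel using integer arithmetic.
mm_to_px() {
    # $1 = length in millimetres, $2 = density in DPI
    echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

mm_to_px 297 300   # long edge of an A4 sheet at 300 DPI: prints 3508
mm_to_px 297 400   # the same edge at 400 DPI: prints 4677
```

Going from 300 to 400 DPI makes each edge a third longer, so the PNGs get noticeably bigger and every later step slower.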

If you want to extract only selected pages from the PDF, use e.g. orig-scan.pdf[0-13] (this extracts 14 pages, 0–13). It is safest to quote this argument, since square brackets are pattern characters in the shell.

Step 2.: Cleaning (cropping) and splitting into single pages

Very often you might need to crop the pages because the scanning area was larger than the book (in my case, the black borders). If you’re careful during scanning, the crop will be the same for all pages, so you can use a batch crop.

I suggest you open one of the pages in GIMP or similar software to take the measurements for cropping. You will need the resulting width and height and the X and Y offsets. See the image below for an explanation.

Setting up crop in ImageMagick

I will leave the actual measuring to you. If you can’t do that, you probably don’t need any of this. (OK, I’ll be nice: try selecting the desired region and hover with the mouse over the relevant points; you’ll see the coordinates in GIMP’s status bar at the bottom of the screen.)
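If you read off the top-left and bottom-right corners of the region, the crop geometry is just width x height + X offset + Y offset. A small sketch of that arithmetic; crop_geometry is a hypothetical helper, and the bottom-right corner 3864,1872 is an assumption back-calculated from the numbers used in this guide:

```shell
# Build a WxH+X+Y geometry string from two corner points.
crop_geometry() {
    # $1,$2 = x,y of the top-left corner; $3,$4 = x,y of the bottom-right corner
    echo "$(( $3 - $1 ))x$(( $4 - $2 ))+$1+$2"
}

crop_geometry 160 20 3864 1872   # prints 3704x1852+160+20
```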

convert `ls pages-*.png` -crop 3704x1852+160+20 +repage -crop 50%x100% pages-split.png

So, this takes all the PNGs (I use `ls` because ImageMagick screws up the sorting), crops them to 3704×1852 px starting at offset 160,20 px, then repages (that resets the positions after the crop) and splits each image into two down the middle. Keep in mind that ls sorts alphabetically, so with ten or more files you need zero-padded names to keep the page order. For debugging, I suggest leaving the -crop 50%x100% out.
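When a split page looks wrong, it helps to know which output file came from which scan image. Assuming the default numbering (pages-split-0.png upwards) and that each image is replaced by its left and right halves in that order, image n ends up in files 2n and 2n+1. A sketch of that mapping; split_names is a hypothetical helper:

```shell
# Map the 0-based index of a two-page scan image to its two split files.
split_names() {
    # $1 = index n of the scan image
    echo "pages-split-$(( 2 * $1 )).png pages-split-$(( 2 * $1 + 1 )).png"
}

split_names 3   # prints pages-split-6.png pages-split-7.png
```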

Step 3.: Recombining back to PDF

Now that we have all the single pages, we can put them back into a PDF.

convert `ls pages-split*` -page 100%x100% result.pdf

It is kinda important to set the page size correctly, which I consider a bit of magic, but basically: try it without any setting, and if it doesn’t work, fiddle with it until it does what you want. For reference, see Image Geometry in the ImageMagick documentation.

Resulting PDF
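If you do this more than once, the three steps can be wrapped in a small script. This is only a sketch under a few assumptions: the names run and split_scan are made up, the intermediate files use zero-padded numbers so that a plain wildcard keeps them in page order, and the -page setting is simply copied from the command above. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

split_scan() {
    # $1 = input PDF, $2 = crop geometry (WxH+X+Y), $3 = output PDF
    # Note: plain sh has no local variables, so these are global.
    src=$1
    geom=$2
    dst=$3

    # Step 1: render every PDF page to a zero-padded PNG.
    run convert -density 300 "$src" pages-%03d.png || return 1

    # Step 2: crop the borders, then cut each image into left and right halves.
    # The pattern pages-[0-9]* skips pages-split-* files left from earlier runs.
    run convert pages-[0-9]*.png -crop "$geom" +repage \
        -crop 50%x100% pages-split-%03d.png || return 1

    # Step 3: collect the single pages into one PDF.
    run convert pages-split-*.png -page 100%x100% "$dst"
}

# Example call (prints the three commands without touching any files):
# DRY_RUN=1 split_scan orig-scan.pdf 3704x1852+160+20 result.pdf
```

Run it in an otherwise empty working directory, because the intermediate PNGs are written to the current directory and picked up by wildcards.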

That’s it. Hope you enjoyed the ride.
