Q & A
How to extract link text from a PDF

Q. Which product do I need to get the text that sits on an internal link (a link to a page in the same document) in a PDF, and how do I do it?

A. Extracting the text on a page requires TET. Information about the internal links can be retrieved with the pCOS functions.

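Internal links in a PDF are created with elements called annotations. The approach is to retrieve each annotation's information with the pCOS functions and then extract the text located in the annotation's rectangle, which gives the link text. The procedure is as follows: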

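1. Get the number of pages in the PDF with a pCOS function.
2. Get the number of annotations on each page.
3. For each annotation that has a destpage (a target page inside the document), get its rectangle.
4. Open that rectangle with TET and extract the text.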
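
The paths and options used in these steps are plain strings, so they can be sketched without the library. The following is a minimal sketch and not part of the original sample: the class LinkPaths and its method names are hypothetical, while the path syntax (length:pages[i]/annots, pages[i]/annots[j]) and the includebox and granularity options are taken from the sample code on this page. Ordering the corners and formatting with two decimals are precautions assumed here, not documented requirements.

```java
// Sketch only: builds the pCOS paths and the TET option list as plain strings.
// The path and option syntax is copied from the sample on this page; the class
// and method names are hypothetical.
final class LinkPaths {
    private LinkPaths() {}

    /** pCOS path for the number of annotations on a page (0-based index). */
    static String annotCountPath(int pageIndex) {
        return "length:pages[" + pageIndex + "]/annots";
    }

    /** pCOS path prefix for one annotation on a page (both indices 0-based). */
    static String annotPath(int pageIndex, int annotIndex) {
        return "pages[" + pageIndex + "]/annots[" + annotIndex + "]";
    }

    /** Option list restricting extraction to the annotation rectangle. */
    static String includeBoxOptlist(double x1, double y1, double x2, double y2) {
        // Assumption: order the corners as lower-left, upper-right, because a
        // PDF rectangle may list any two diagonally opposite corners.
        double llx = Math.min(x1, x2), lly = Math.min(y1, y2);
        double urx = Math.max(x1, x2), ury = Math.max(y1, y2);
        // Fixed-point formatting avoids exponent notation such as 1.0E-4,
        // which plain string concatenation of a double can produce.
        return String.format(java.util.Locale.ROOT,
                "includebox={{%.2f %.2f %.2f %.2f}} granularity=page",
                llx, lly, urx, ury);
    }

    public static void main(String[] args) {
        System.out.println(annotCountPath(0));
        System.out.println(annotPath(0, 2) + "/destpage");
        System.out.println(includeBoxOptlist(72, 700, 200, 720));
    }
}
```

The resulting strings would be passed to tet.pcos_get_number() and tet.open_page() in the same way as in the sample code.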

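For PDFs whose pages have a Rotate entry or a CropBox, the link text is not extracted as expected with this procedure. See also "What to do when the retrieved link coordinates do not match the visual appearance".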
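
One possible way to compensate is to convert the annotation rectangle before using it. The following is a rough sketch under a stated assumption, not the method from the article referred to above: it assumes that the extraction coordinates are measured from the lower-left corner of the visible page, that is, after the CropBox (or the MediaBox if there is no CropBox) and the clockwise Rotate value have been applied. That assumption should be checked against the TET documentation. The class VisiblePageCoords and its methods are hypothetical, and the box and Rotate values are passed in as parameters because retrieving them is not shown here.

```java
// Sketch only: maps an annotation Rect from default PDF user space to the
// coordinates of the visible page. Class and method names are hypothetical.
final class VisiblePageCoords {
    private VisiblePageCoords() {}

    /**
     * rect and cropBox are {x1, y1, x2, y2}; rotate is the page's Rotate
     * value in degrees (clockwise, a multiple of 90).
     * Returns {llx, lly, urx, ury} on the visible page.
     */
    static double[] toVisible(double[] rect, double[] cropBox, int rotate) {
        double originX = Math.min(cropBox[0], cropBox[2]);
        double originY = Math.min(cropBox[1], cropBox[3]);
        double width = Math.abs(cropBox[2] - cropBox[0]);
        double height = Math.abs(cropBox[3] - cropBox[1]);
        int angle = ((rotate % 360) + 360) % 360;

        // Shift both corners so the crop box origin becomes (0, 0), then rotate.
        double[] p1 = rotatePoint(rect[0] - originX, rect[1] - originY, width, height, angle);
        double[] p2 = rotatePoint(rect[2] - originX, rect[3] - originY, width, height, angle);

        // Rotation can swap the corners, so order them again.
        return new double[] {
            Math.min(p1[0], p2[0]), Math.min(p1[1], p2[1]),
            Math.max(p1[0], p2[0]), Math.max(p1[1], p2[1])
        };
    }

    /** Rotates a point clockwise on an unrotated page of size w x h. */
    private static double[] rotatePoint(double x, double y, double w, double h, int angle) {
        switch (angle) {
            case 0:   return new double[] { x, y };
            case 90:  return new double[] { y, w - x };
            case 180: return new double[] { w - x, h - y };
            case 270: return new double[] { h - y, x };
            default:
                throw new IllegalArgumentException("Rotate must be a multiple of 90: " + angle);
        }
    }

    public static void main(String[] args) {
        double[] cropBox = { 10, 20, 210, 320 };
        double[] rect = { 30, 50, 80, 70 };
        System.out.println(java.util.Arrays.toString(toVisible(rect, cropBox, 90)));
    }
}
```

The converted rectangle would then take the place of the raw Rect values when the includebox option is built.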

The sample code is as follows.


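/*
 *  Annotations.java
 *  Extracts the annotations that link to pages inside the PDF and prints
 *  each target page together with the text on the annotation.
 *
 *  copyright (c) 1997-2012 infoTek K.K. all rights reserved.
 *  The company accepts no liability for any loss arising from this source code.
 */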

import java.io.*;
import com.pdflib.TETException;
import com.pdflib.TET;

public class Annotations{
    public static void main(String argv[]){
        TET tet = null;
        int pages, annots, doc, page, destpage;
        double x1, y1, x2, y2;
        String tmp, optlist;

        try{
            if(argv.length != 1){
                throw new Exception("usage: Annotations <filename>");
            }

            tet = new TET();
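            doc = tet.open_document(argv[0], "");
            if(doc == -1){
                // open_document returns -1 when the file cannot be opened
                throw new Exception("Error " + tet.get_errnum() + " in "
                        + tet.get_apiname() + "(): " + tet.get_errmsg());
            }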

            // Get the number of pages
            pages = (int)tet.pcos_get_number(doc, "length:pages");
            for(int i=0; i<pages; i++){
                // Get the number of annotations on this page
                annots = (int)tet.pcos_get_number(doc, "length:pages[" + i + "]/annots");
                for(int j=0; j<annots; j++){
                    tmp = "pages[" + i + "]/annots[" + j + "]";

                    // Get the target page
                    destpage = (int)tet.pcos_get_number(doc, tmp + "/destpage");
                    if(destpage == -1) continue;  // target is outside the document (e.g. a URL)
                    System.out.println("destpage: " + destpage);

                    // Get the annotation rectangle
                    x1 = tet.pcos_get_number(doc, tmp + "/Rect[0]");
                    y1 = tet.pcos_get_number(doc, tmp + "/Rect[1]");
                    x2 = tet.pcos_get_number(doc, tmp + "/Rect[2]");
                    y2 = tet.pcos_get_number(doc, tmp + "/Rect[3]");
                    // Parenthesize i + 1 so it is added before string concatenation
                    System.out.println("position: page " + (i + 1)
                            + ", (" + (int)x1 + ", " + (int)y1 + ")");

                    // Extract the text inside the rectangle
                    optlist = "includebox={{" + x1 + " "
                                              + y1 + " "
                                              + x2 + " "
                                              + y2 + "}}";
                    optlist += " granularity=page";
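                    page = tet.open_page(doc, i+1, optlist);
                    if(page == -1){
                        // Report the error and move on to the next annotation
                        System.err.println("Error " + tet.get_errnum() + " in "
                                + tet.get_apiname() + "(): " + tet.get_errmsg());
                        continue;
                    }
                    // get_text returns null when the rectangle contains no text
                    String text = tet.get_text(page);
                    System.out.println("String: " + (text != null ? text : "") + "\n");
                    tet.close_page(page);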
                }
            }
        }
        catch(TETException e){
            System.err.print("[" + e.get_errnum() + "] " + e.get_apiname() +
                            ": " + e.get_errmsg() + "\n");
        }
        catch(Exception e){
            System.err.println(e.getMessage());
        }
        finally{
            if(tet != null)
                tet.delete();
        }

        System.exit(0);
    }
}

Java 1.7 / PDFlib TET 4.1

(Jan 30, 2018 - )
