PDFlib GmbH München, Germany

www.pdflib.com

FontReporter 1.3

A plugin for analyzing fonts in PDF

1 Installing PDFlib FontReporter

2 Working with FontReporter

2.1 What can you do with FontReporter?

2.2 Overview of PDF Font Formats

2.3 Contents of a Font Report

2.4 Investigate PDF Problems with FontReporter

2.5 Error Messages

A Revision History

Copyright © 2005-2008 PDFlib GmbH. All rights reserved.

PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München, Germany, www.pdflib.com

This publication and the information herein is furnished as is, is subject to change without notice, and should not be construed as a commitment by PDFlib GmbH. PDFlib GmbH assumes no responsibility or liability for any errors or inaccuracies, makes no warranty of any kind (express, implied or statutory) with respect to this publication, and expressly disclaims any and all warranties of merchantability, fitness for particular purposes and noninfringement of third party rights.

Adobe, Acrobat, and PostScript are trademarks of Adobe Systems Inc. AIX, IBM, OS/390, WebSphere, iSeries, and zSeries are trademarks of International Business Machines Corporation. ActiveX, Microsoft, Windows, and Windows NT are trademarks of Microsoft Corporation. Apple, Macintosh and TrueType are trademarks of Apple Computer, Inc. Unicode and the Unicode logo are trademarks of Unicode, Inc. Unix is a trademark of The Open Group. Java and Solaris are trademarks of Sun Microsystems, Inc. Other company, product, and service names may be trademarks or service marks of others.

Thank you for using PDFlib FontReporter, a free Acrobat plugin provided by PDFlib GmbH. PDFlib GmbH offers software for creating and processing PDF documents. Please visit our Web site to learn more about our products.

You can use PDFlib FontReporter free of charge; however, it is not in the public domain. This software cannot be sold or redistributed (whether for a fee or at no charge), either stand-alone or in combination with any other product, without the express written permission of PDFlib GmbH.

Although PDFlib FontReporter is not a commercial product, we strive to provide high quality. If you run into problems you are encouraged to contact us at support@pdflib.com.

1 Installing PDFlib FontReporter

Requirements. The PDFlib FontReporter plugin works with Acrobat 6/7/8 Standard and Professional on Windows and Mac, and Acrobat 9 Standard, Pro and Pro Extended on Windows. The plugin doesn’t work with Acrobat Elements or any version of Acrobat Reader/Adobe Reader.

Installing FontReporter on Windows. All plugin-related files must be copied to the subdirectory »PDFlib FontReporter« in the Acrobat plugin folder. This is done automatically by the plugin installer, but can also be done manually. A typical location of the plugin folder is as follows:

C:\Program Files\Adobe\Acrobat 9.0\Acrobat\plug_ins\PDFlib FontReporter

Installing FontReporter for Acrobat 6/7/8 on the Mac. With Acrobat 6/7/8 the plugin folder is not visible in the Finder. Make sure that Acrobat is not running and follow these steps:
> Extract the plugin files by double-clicking the disk image (.dmg).
> Locate the Acrobat application icon in the Finder. It is usually located in a folder which has a name similar to the following:

/Applications/Adobe Acrobat 8.0 Professional

> Single-click on the Acrobat application icon and select File, Get Info.
> In the window that pops up click the triangle next to Plug-ins.
> Click Add... and select the FontReporter folder from the folder which has been created in the first step. Note that this folder will not immediately show up in the list of plugins, but only when you open the info window next time.

Multi-lingual Interface. FontReporter supports multiple languages in the user interface and generated font reports. Depending on the application language of Acrobat, FontReporter will choose its interface language automatically. Currently English and German interfaces are available. If Acrobat runs in any other language mode, FontReporter will use the English interface.

Troubleshooting. If the FontReporter plugin doesn’t seem to work, make sure that in Edit, Preferences, [General...], Startup the »Use only certified plug-ins« box is unchecked.

2 Working with FontReporter

2.1 What can you do with FontReporter?

FontReporter is a useful tool if you are interested in fonts within PDF documents. It provides font- and encoding-related information which helps in a variety of situations:
> analyze printing problems (e.g. a particular font causes printing errors)
> investigate text extraction problems (e.g. copying text from a PDF results in garbage)
> visualize Unicode mappings for a font
> find flaws in the PDF creation workflow (e.g. printer driver converted a PostScript Type 1 font to Type 3)
> test whether ToUnicode mapping tables (required for PDF/A compliance) are present
> identify logos and symbols which are represented as text in a PDF
> learn which fonts are contained in a PDF, and which glyphs they contain (e.g. the file size is too large because some fonts ended up in the PDF unintentionally)
> check font subsets to see which glyphs are contained in the subset
> learn more about PDF font technology

Using FontReporter is as easy as bringing up the menu Plug-Ins, PDFlib FontReporter..., Create Font Report in Acrobat. This will create a font report for all pages of the current PDF document as a separate PDF. Two pages from typical font reports are shown in Figure 2.1.

Fig. 2.1 Sample font reports

Supported PDF and font formats. FontReporter supports all PDF versions up to PDF 1.8, the file format created by Acrobat 9. All font and encoding formats available in PDF are supported, as well as all types of embedded font data.

Advantages over Acrobat’s font properties panel. All versions of Acrobat including Adobe Reader provide font information via File, Document Properties..., Fonts. However, Acrobat’s font overview is limited in use; FontReporter provides the following advantages compared to Acrobat’s font list:
> FontReporter provides much more information about each font
> FontReporter deals with CJK font names even on Western systems
> FontReporter provides glyph tables containing the glyphs of a font along with their widths, names, and Unicode values
> FontReporter presents the output as a PDF document so that you can save or print it
> FontReporter is guaranteed to process the full document, regardless of which pages have already been displayed in Acrobat

PDF text extraction with PDFlib TET. FontReporter is an auxiliary tool to our PDFlib Text Extraction Toolkit (TET). TET is software for extracting the text contents of PDF documents. It is available both as a standalone program and a programming library/component which can be integrated into existing software. TET extracts text from all kinds of PDF documents and normalizes the text to Unicode. FontReporter can be used to create Unicode mapping tables for PDF documents which do not contain enough information for extracting text, or which contain wrong Unicode mapping tables. Fully functional evaluation versions of TET are available for download from www.pdflib.com.

TET PDF IFilter. TET PDF IFilter extracts text and metadata from PDF documents and makes it available to search and retrieval software on Windows. This allows PDF documents to be searched on the local desktop, a corporate server, or the Web. TET PDF IFilter is based on the patented PDFlib Text Extraction Toolkit (TET). TET PDF IFilter is a robust implementation of Microsoft’s IFilter indexing interface. It works with all search and retrieval products which support the IFilter interface, e.g. SharePoint and SQL Server. Fully functional evaluation versions of TET PDF IFilter are available for download from www.pdflib.com.

Free TET Plugin. The TET Plugin is a free companion to the FontReporter Plugin. It can be installed in Adobe Acrobat and allows interactive use of the Text Extraction Toolkit (TET) with any PDF document that is currently open in Acrobat. Using the TET plugin you can access TET’s functionality and experiment with TET options. The TET plugin can freely be downloaded from www.pdflib.com.

2.2 Overview of PDF Font Formats

PDF supports a wide array of font formats, the details of which can get confusing. In order to help you interpret the reports created by FontReporter we provide a quick summary of PDF font formats and their most important properties.

While the format of a font in a PDF document depends on the format of the original font used to compose the document, this is not the only factor which plays a role here. Other factors include the configuration options in the PDF-creating software, the settings of the printer driver used to generate PostScript data for PDF conversion, the set of characters in the document, the overall number of used characters, and more.

A particularly important aspect is the distinction between simple fonts and composite (CID) fonts.

Simple fonts. Simple fonts comprise the PostScript Type 1 (including Multiple Master), TrueType, and Type 3 types, and are addressed with 8-bit codes. They are therefore limited to a maximum of 256 characters. Simple fonts use a name-based encoding, which is a table for mapping the character codes to the glyphs in the font.

Composite (CID) fonts. Composite or CID (character ID) fonts come in PostScript and TrueType flavors. They can contain up to 65535 characters and are much more flexible than simple fonts. While CID fonts often use 2-byte codes for addressing the glyphs in the font, more complicated schemes with a variable number of bytes per character (1-4) are used for CJK fonts. Instead of an encoding table CID fonts require a CMap (Character Map) for providing the mapping from character codes to actual glyphs in the font. Dozens of predefined CMaps are available for common CJK fonts. The font’s character collection specifies a particular set of Chinese, Japanese, or Korean characters. So-called Identity CMaps are used (mainly for Western fonts) in order to directly address the glyphs in a font without any intermediate mapping table.
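The difference between the two addressing schemes can be illustrated with a minimal Python sketch. It splits the raw bytes of a text string into character codes, once as 8-bit codes for a simple font and once as 2-byte big-endian codes as used with a 2-byte Identity CMap. The function names are hypothetical and not part of FontReporter, and variable-length CJK CMaps are deliberately left out.

```python
def simple_font_codes(data: bytes) -> list:
    """Simple fonts: every single byte is one character code (0-255)."""
    return list(data)


def identity_cmap_cids(data: bytes) -> list:
    """2-byte Identity CMap: each big-endian byte pair is used directly as the CID."""
    if len(data) % 2 != 0:
        raise ValueError("2-byte codes require an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
```

The same four bytes therefore yield four codes for a simple font but only two CIDs for a CID font with an Identity CMap.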

Comparison of PDF font formats. Table 2.1 details the font formats supported in PDF, and explains which original font formats can be converted to these types by the PDF creation software.

Table 2.1 Font formats in PDF

Name: Type1
Notes: Classic PostScript Type 1 fonts. In addition to the original Type 1 format they can also be embedded as CFF (Compressed Font Format) under the name Type1C (Type 1 Compressed). These fonts are the result of classic PostScript Type 1 fonts or OpenType fonts with PostScript outlines.

Name: MMType1 (In Acrobat: MM)
Notes: Multiple Master fonts are an extension of the Type 1 format, and are rarely used. These fonts are the result of PostScript Type 1 Multiple Master fonts.

Name: Type3
Notes: User-defined fonts, i.e. the glyphs are described by raw vector or image operations instead of a readymade font. Type 3 fonts are always embedded. They are mainly intended for bitmapped fonts and logo fonts. These fonts are often the result of a printer driver converting a PostScript Type 1 or TrueType font to a bitmap font. Some applications use Type 3 fonts for achieving special effects, such as filling an area with a pattern.

Name: TrueType
Notes: TrueType fonts can directly be embedded in PDF. Since not all parts of the original TrueType font are required in PDF the embedded font data does not necessarily comprise a valid TrueType font. These fonts are the result of TrueType or Type 42 fonts, or OpenType fonts with TrueType outlines.

Name: CIDFontType0 (In Acrobat: Type 1 (CID))
Notes: CID font with PostScript outlines; similar to Type 1 fonts the font data can be embedded as CFF under the name CIDFontType0C. These fonts are the result of OpenType fonts with PostScript outlines.

Name: CIDFontType2 (In Acrobat: TrueType (CID))
Notes: CID font with TrueType outlines. These fonts are the result of TrueType fonts or OpenType fonts with TrueType outlines.

Name: OpenType
Notes: Directly embedding OpenType fonts requires PDF 1.6. In contrast to CIDFontType0 it allows the full OpenType font file to be embedded. This format is still very rare. These fonts are the result of OpenType fonts with PostScript outlines.

2.3 Contents of a Font Report

FontReporter collects general information, font-related information, and glyph tables for all fonts in a PDF. These categories are accessible in multiple ways:
> bookmarks contain general and font-related information as clickable hypertext
> overview pages contain general and font-related information on a printable page; the overview pages contain clickable links so that you can easily navigate to a particular font’s glyph table
> detailed glyph tables repeat the font-related information, and list the glyphs in a font

By clicking one of the bookmarks or using the links in the overview section you can quickly navigate to the corresponding glyph tables for a font. FontReporter will copy the fonts from the original document to the font report without any modification. All font properties (e.g. embedding and encoding) will remain unchanged.

Note: FontReporter will only process data which is actually represented with font and encoding structures in the PDF. Other means of representing text are ignored, such as images containing text, or characters which are drawn with vector graphics (also called outline text).

2.3.1 General Information

The general information in a font report consists of the following:
> file name of the original PDF document
> number of pages and fonts in the document
> PDF Producer, i.e. the name of the software used to produce the PDF (not the software used to compose the document).

2.3.2 Font and Encoding Information

For each font in the document the following pieces of information will be provided.

Font name. If the font name starts with six random characters and a plus sign, the font is a subset which does not contain all characters which were originally present in the font. The subset prefix is useful when multiple subsets of the same font are embedded in a document. CJK font names will be displayed in their native spelling if possible.
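As an illustration of the subset naming rule, the following Python sketch separates a subset prefix from the base font name. It assumes that the six characters are uppercase ASCII letters, which is the usual convention for subset tags in PDF; the manual itself only speaks of six random characters. The function name and the sample font names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: subset tags are six uppercase ASCII letters followed by "+".
SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_prefix(font_name: str) -> Tuple[Optional[str], str]:
    """Return (subset tag or None, base font name) for a PDF font name."""
    match = SUBSET_PREFIX.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name
```

For example, a name such as ABCDEF+Helvetica-Bold would be reported as a subset of Helvetica-Bold, while plain Helvetica-Bold carries no subset tag.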

Font type. The PDF font type; see Section 2.2, »Overview of PDF Font Formats«, for a list of font formats.

Embedding and subsetting status. The embedding status of the font, including subset information and the format of the embedded font data.

Encoding. The encoding defines the mapping of character codes to glyphs for simple fonts. This may be the name of one of PDF’s predefined encodings WinAnsiEncoding or MacRomanEncoding (these are called Ansi and Roman in Acrobat), or custom for a nonstandard encoding. If no encoding information is given explicitly, so that the encoding which is built into the font must be used, the word builtin is displayed instead of an encoding name.

CMap. For CID fonts the name of a predefined CMap (character map) will be listed as encoding, plus the name of the corresponding character collection (Chinese, Japanese, Korean, or Identity).

Additional font information. If a simple font contains a CharSet entry (a list with the names of all glyphs contained in a subset) this will be noted. Similarly, if a CID font contains a CIDSet entry (a list with all CIDs contained in a subset) this will also be noted.

Symbol bit. The symbol bit signals that a font contains characters outside the Adobe standard Latin character set. This may be relevant for font substitution and text extraction operations. For example, a non-embedded font with the symbol bit cannot be substituted.
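As background that goes beyond this manual: in the PDF file format the symbol bit is stored as the Symbolic flag (bit 3, value 4) in the Flags entry of the font descriptor. The following Python sketch tests this flag and applies the substitution rule stated above. The function names are hypothetical, and the flags value is assumed to have been read from the PDF by some other means.

```python
# Background from the PDF specification, not from this manual:
# bit 3 (value 4) of the font descriptor's Flags entry is the Symbolic flag.
SYMBOLIC_FLAG = 1 << 2


def has_symbol_bit(flags: int) -> bool:
    """True if the Symbolic flag is set in the font descriptor flags."""
    return bool(flags & SYMBOLIC_FLAG)


def substitution_blocked(flags: int, embedded: bool) -> bool:
    """A non-embedded font with the symbol bit cannot be substituted."""
    return not embedded and has_symbol_bit(flags)
```

A font for which substitution_blocked returns True is a typical candidate for display problems on systems where the font is not installed.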

Unicode mapping table. If the font contains an explicit ToUnicode mapping table this will be mentioned. ToUnicode tables are crucial for text extraction and search operations.

2.3.3 Glyph Tables

FontReporter will create detailed glyph tables for each font in the document. The table organization depends on the font type. Regardless of the font type, the rows and columns of each table will be numbered from 0 to F (these are the hex codes for the numbers 0-15); empty slots in the tables will contain a small dot as a substitute.

Table organization for simple fonts. Since simple fonts can address at most 256 glyphs, a single page with 16x16 slots is sufficient. All glyphs present in the font/encoding combination will be shown. Unencoded glyphs in the font and unused encoding entries will not be shown.

Table organization for CID fonts. Since CID fonts can contain thousands of glyphs, efficient table organization is important for achieving compact font reports while at the same time providing quick access to particular code ranges.

If the font contains a CIDSet table, only the CIDs contained in this table will be shown; otherwise the CIDs which are actually used in the document will be shown. All text on all pages of the document will be processed to determine the set of used CIDs, while text in hypertext elements (such as form fields) will be ignored.

A separate page will be created for each block of 256 CIDs. The starting number of the block (in hex) and the number of glyphs per block are provided in the page heading and the corresponding bookmark. For example, the heading CID x0000 means that this block contains CIDs 0000-00FF (hex), or decimal 0-255; the heading CID x0100 means that this block contains CIDs 0100-01FF (hex), or decimal 256-511. Empty blocks will be omitted to reduce the overall size of the font report.
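The block arithmetic described above can be sketched in a few lines of Python. The heading format follows the examples in the text; the function names are hypothetical, and the sketch only mimics the grouping, not FontReporter’s actual implementation.

```python
def cid_block(cid: int):
    """Return (heading, first CID, last CID) of the 256-CID block containing cid."""
    if not 0 <= cid <= 0xFFFF:
        raise ValueError("CID out of range")
    first = (cid // 256) * 256   # start of the block, e.g. 300 -> 256
    last = first + 255           # end of the block, e.g. 511
    return "CID x{:04X}".format(first), first, last


def non_empty_blocks(used_cids):
    """Headings of all blocks that contain at least one used CID, in ascending order."""
    return sorted({cid_block(cid)[0] for cid in used_cids})
```

Blocks without any used CID never appear in the result, which corresponds to the omission of empty blocks in the font report.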

Information for each glyph. For each glyph in a table the following information will be shown:
> A gray rectangle showing the glyph width (the height of the rectangle is constant, and not related to the glyph geometry)
> The actual glyph will be displayed. Sometimes the font does not contain the corresponding glyph description. In this case the font’s .notdef glyph will be displayed; depending on the type of font it may be represented as a space character, hollow box, crossed box, or similar. For CID fonts with vertical writing mode the glyphs will be positioned such that they match the table grid.
> For simple (8-bit) fonts the name of the glyph will be shown.
> If the font contains a Unicode mapping table (a ToUnicode CMap), the glyph’s Unicode value(s) will be shown in U+xxxx notation if available. Multiple Unicode values may be present for glyphs which map to a sequence of multiple Unicode characters, such as ligatures and fractions. If a ToUnicode table is present for the font, but it does not contain an entry for this glyph (this may happen for non-textual symbols which are contained in a text font), FontReporter will display the string (missing) instead. If the Unicode values contain a surrogate pair (two UTF-16 values) the corresponding UTF-32 value will be displayed instead.
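The conversion from a surrogate pair to a single UTF-32 value follows the standard UTF-16 rules and can be sketched as follows. The function names are hypothetical; the U+xxxx formatting mirrors the notation mentioned above.

```python
def surrogate_pair_to_utf32(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def format_unicode(code_point: int) -> str:
    """Format a code point in U+xxxx notation (at least four hex digits)."""
    return "U+{:04X}".format(code_point)
```

For instance, the pair D835 DC00 combines to the single value 1D400, which would be written as U+1D400.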

2.4 Investigate PDF Problems with FontReporter

This section lists some common problem scenarios along with hints for interpreting the font report in order to identify problems.

Text extraction does not work. FontReporter is very useful if you try to extract text from a PDF (using PDFlib TET, Adobe Acrobat, or any other tool) and the extracted text is incomplete or wrong. Some hints:
> For simple fonts check the encoding, glyph name, and Unicode mapping. In many cases errors in the PDF, font, or encoding are easy to identify. In PDFlib TET you can correct many kinds of errors by supplying appropriate processing options or custom Unicode mapping tables.
> For some font/encoding combinations text extraction will only work if a proper ToUnicode mapping table is present. The font report will tell you whether or not this data structure is present in the PDF.

Complicated Unicode mappings. In some situations the Unicode mapping of a glyph may not be obvious. For example, a font’s Unicode mapping can decompose ligatures into multiple constituent characters in order to facilitate text extraction. On the other hand, ligatures sometimes have wrong Unicode mappings which thwart text extraction.

Error message in Acrobat. For some problematic PDFs Acrobat will complain

Cannot find or create font XXX. Some characters may not display or print correctly.

and display bullets instead of the font’s characters. This usually happens when the font is neither embedded nor installed on the system, and Acrobat cannot substitute the font because the symbol bit is set.

Type 3 fonts. Type 3 fonts may cause the PDF to print slowly or prevent text extraction and editing. The font report will provide a useful overview of Type 3 fonts and the glyphs they contain.

Glyph complement of font subsets. Font subsets do not contain all glyphs which were initially available in a font. The font report displays the glyph complement (set of available glyphs) in a font subset.

Duplicate fonts. In some situations (e.g. combining pages from several PDFs) multiple subsets of the same font may be present in a PDF, or even multiple instances of the same font. This can cause file size bloat or even printing problems. The font report will clearly identify this problem.
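If you copy the font names from a font report into a list, duplicates can be spotted with a short script. The following Python sketch groups names by their base font name after stripping a subset prefix. It assumes that the prefix consists of six uppercase letters and a plus sign; the function name and the sample names are hypothetical.

```python
import re
from collections import defaultdict


def find_duplicate_fonts(font_names):
    """Map each base font name to its entries if it occurs more than once."""
    groups = defaultdict(list)
    for name in font_names:
        # Assumption: a subset prefix is six uppercase letters followed by "+".
        base = re.sub(r"^[A-Z]{6}\+", "", name)
        groups[base].append(name)
    return {base: names for base, names in groups.items() if len(names) > 1}
```

Called with the names ABCDEF+Minion, GHIJKL+Minion and Helvetica, the function reports Minion as present in two subsets and leaves Helvetica out.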

Distinguish text and graphics. Sometimes it is difficult to see whether content in a PDF is actually represented as native text, vector graphics, or an image. The font report helps in identifying various uncommon scenarios:
> Text which is represented as vector graphics (outline text) or an image will be missing from the font report.
> If a logo or other symbol is represented by a special font you can easily identify the logo font in the font report.

2.5 Error Messages

Unaccessible content data. FontReporter will include this message in the font report if it was unable to enumerate the page contents for determining which glyphs are used in the document. Most commonly this will happen with encrypted documents, but also if the page description is damaged.

Invalid page content state, glyphs cannot be drawn. This message may appear for unembedded fonts which cannot be processed. Acrobat usually complains with the message »A font required for font substitution is missing«, or substitutes a system-installed font.

A Revision History

Revision history of this manual

Date                 Changes
July 3, 2008         Minor changes for FontReporter 1.3 (new: support for Acrobat 9 on Windows)
March 26, 2007       Minor changes for FontReporter 1.2 (new: support for Acrobat 8 on Mac OS X)
January 30, 2006     Minor additions for FontReporter 1.1
February 14, 2005    Initial version for FontReporter 1.0.0

Known problems in this version. We are currently aware of the following minor problems:
> In rare cases the glyphs of Type 3 fonts may appear too small or too large, or not on the baseline.
> Although PDF Producer entries with non-Latin characters will be displayed properly in the bookmarks, they will appear garbled in the overview page.
> Sometimes not all glyph names are shown for simple fonts with the predefined encodings WinAnsi or MacRoman.
> Glyph names for built-in encodings are not shown, nor those for custom encodings which are based on a font’s built-in encoding.
> Unicode values will not be shown for CID fonts with standard CJK character collections and simple fonts with the predefined encodings WinAnsi or MacRoman. However, since these are fixed mappings no document-specific information is lost.
