PDFlib GmbH
München, Germany
www.pdflib.com

FontReporter 1.3
A plugin for analyzing fonts in PDF

Copyright © 2005-2008 PDFlib GmbH. All rights reserved.

PDFlib GmbH
Franziska-Bilek-Weg 9, 80339 München, Germany
www.pdflib.com

This publication and the information herein is furnished as is, is subject to change without notice, and should not be construed as a commitment by PDFlib GmbH. PDFlib GmbH assumes no responsibility or liability for any errors or inaccuracies, makes no warranty of any kind (express, implied or statutory) with respect to this publication, and expressly disclaims any and all warranties of merchantability, fitness for particular purposes and noninfringement of third party rights.

Adobe, Acrobat, and PostScript are trademarks of Adobe Systems Inc. AIX, IBM, OS/390, WebSphere, iSeries, and zSeries are trademarks of International Business Machines Corporation. ActiveX, Microsoft, Windows, and Windows NT are trademarks of Microsoft Corporation. Apple, Macintosh and TrueType are trademarks of Apple Computer, Inc. Unicode and the Unicode logo are trademarks of Unicode, Inc. Unix is a trademark of The Open Group. Java and Solaris are trademarks of Sun Microsystems, Inc. Other company product and service names may be trademarks or service marks of others.

Thank you for using PDFlib FontReporter, a free Acrobat plugin provided by PDFlib GmbH. PDFlib GmbH offers software for creating and processing PDF documents. Please visit our Web site to learn more about our products.

You can use PDFlib FontReporter free of charge; however, it is not in the public domain. This software cannot be sold or redistributed (whether for a fee or at no charge), either stand-alone or in combination with any other product, without the express written permission of PDFlib GmbH.

Although PDFlib FontReporter is not a commercial product, we strive to provide high quality.
If you run into problems you are encouraged to contact us at support@pdflib.com.

Contents

1 Installing PDFlib FontReporter
2 Working with FontReporter
2.1 What can you do with FontReporter?
2.2 Overview of PDF Font Formats
2.3 Contents of a Font Report
2.4 Investigate PDF Problems with FontReporter
2.5 Error Messages
A Revision History

1 Installing PDFlib FontReporter

Requirements. The PDFlib FontReporter plugin works with Acrobat 6/7/8 Standard and Professional on Windows and Mac, and Acrobat 9 Standard, Pro and Pro Extended on Windows. The plugin doesn’t work with Acrobat Elements or any version of Acrobat Reader/Adobe Reader.

Installing FontReporter on Windows. All plugin-related files must be copied to the subdirectory »PDFlib FontReporter« in the Acrobat plugin folder. This is done automatically by the plugin installer, but can also be done manually. A typical location of the plugin folder is as follows:

C:\Program Files\Adobe\Acrobat 9.0\Acrobat\plug_ins\PDFlib FontReporter

Installing FontReporter for Acrobat 6/7/8 on the Mac. With Acrobat 6/7/8 the plugin folder is not visible in the Finder. Make sure that Acrobat is not running and follow these steps:
> Extract the plugin files by double-clicking the disk image (.dmg).
> Locate the Acrobat application icon in the Finder. It is usually located in a folder which has a name similar to the following:
  /Applications/Adobe Acrobat 8.0 Professional
> Single-click on the Acrobat application icon and select File, Get Info.
> In the window that pops up click the triangle next to Plug-ins.
> Click Add... and select the FontReporter folder from the folder which has been created in the first step. Note that this folder will not immediately show up in the list of plugins, but only when you open the info window next time.

Multi-lingual Interface.
FontReporter supports multiple languages in the user interface and generated font reports. Depending on the application language of Acrobat, FontReporter will choose its interface language automatically. Currently English and German interfaces are available. If Acrobat runs in any other language mode, FontReporter will use the English interface.

Trouble-shooting. If the FontReporter plugin doesn’t seem to work, make sure that in Edit, Preferences, [General...], Startup the »Use only certified plug-ins« box is unchecked.

2 Working with FontReporter

2.1 What can you do with FontReporter?

FontReporter is a useful tool if you are interested in fonts within PDF documents. It provides font- and encoding-related information which helps in a variety of situations:
> analyze printing problems (e.g. a particular font causes printing errors)
> investigate text extraction problems (e.g. copying text from a PDF results in garbage)
> visualize Unicode mappings for a font
> find flaws in the PDF creation workflow (e.g. printer driver converted a PostScript Type 1 font to Type 3)
> test whether ToUnicode mapping tables (required for PDF/A compliance) are present
> identify logos and symbols which are represented as text in a PDF
> learn which fonts are contained in a PDF, and which glyphs they contain (e.g. the file size is too large because some fonts ended up in the PDF unintentionally)
> check font subsets to see which glyphs are contained in the subset
> learn more about PDF font technology

Using FontReporter is as easy as bringing up the menu Plug-Ins, PDFlib FontReporter..., Create Font Report in Acrobat. This will create a font report for all pages of the current PDF document as a separate PDF. Two pages from typical font reports are shown in Figure 2.1.

Fig. 2.1 Sample font reports

Supported PDF and font formats. FontReporter supports all PDF versions up to PDF 1.8, the file format created by Acrobat 9. All font and encoding formats available in PDF are supported, as well as all types of embedded font data.

Advantages over Acrobat’s font properties panel. All versions of Acrobat including Adobe Reader provide font information via File, Document Properties..., Fonts. However, Acrobat’s font overview is of limited use; FontReporter provides the following advantages compared to Acrobat’s font list:
> FontReporter provides much more information about each font
> FontReporter deals with CJK font names even on Western systems
> FontReporter provides glyph tables containing the glyphs of a font along with their widths, names, and Unicode values
> FontReporter presents the output as a PDF document so that you can save or print it
> FontReporter is guaranteed to process the full document, regardless of which pages have already been displayed in Acrobat

PDF text extraction with PDFlib TET. FontReporter is an auxiliary tool to our PDFlib Text Extraction Toolkit (TET). TET is software for extracting the text contents of PDF documents. It is available both as a standalone program and a programming library/component which can be integrated into existing software. TET extracts text from all kinds of PDF documents and normalizes the text to Unicode. FontReporter can be used to create Unicode mapping tables for PDF documents which do not contain enough information for extracting text, or which contain wrong Unicode mapping tables. Fully functional evaluation versions of TET are available for download from www.pdflib.com.

TET PDF IFilter. TET PDF IFilter extracts text and metadata from PDF documents and makes it available to search and retrieval software on Windows.
This allows PDF documents to be searched on the local desktop, a corporate server, or the Web. TET PDF IFilter is based on the patented PDFlib Text Extraction Toolkit (TET). TET PDF IFilter is a robust implementation of Microsoft’s IFilter indexing interface. It works with all search and retrieval products which support the IFilter interface, e.g. SharePoint and SQL Server. Fully functional evaluation versions of TET PDF IFilter are available for download from www.pdflib.com.

Free TET Plugin. The TET Plugin is a free companion to the FontReporter Plugin. It can be installed in Adobe Acrobat and allows interactive use of the Text Extraction Toolkit (TET) with any PDF document that is currently open in Acrobat. Using the TET plugin you can access TET’s functionality and experiment with TET options. The TET plugin can freely be downloaded from www.pdflib.com.

2.2 Overview of PDF Font Formats

PDF supports a wide array of font formats, the details of which can get confusing. In order to help you interpret the reports created by FontReporter we provide a quick summary of PDF font formats and their most important properties.

While the format of a font in a PDF document depends on the format of the original font used to compose the document, this is not the only factor which plays a role here. Other factors include the configuration options in the PDF-creating software, the settings of the printer driver used to generate PostScript data for PDF conversion, the set of characters in the document, the overall number of used characters, and more. A particularly important aspect is the distinction between simple fonts and composite (CID) fonts.

Simple fonts. Simple fonts comprise the PostScript Type 1 (including Multiple Master), TrueType, and Type 3 types, and are addressed with 8-bit codes. They are therefore limited to a maximum of 256 characters.
Simple fonts use a name-based encoding, which is a table for mapping the character codes to the glyphs in the font.

Composite (CID) fonts. Composite or CID (character ID) fonts come in PostScript and TrueType flavors. They can contain up to 65535 characters and are much more flexible than simple fonts. While CID fonts often use 2-byte codes for addressing the glyphs in the font, more complicated schemes with a variable number of bytes per character (1-4) are used for CJK fonts. Instead of an encoding table CID fonts require a CMap (Character Map) for providing the mapping from character codes to actual glyphs in the font. Dozens of predefined CMaps are available for common CJK fonts. The font’s character collection specifies a particular set of Chinese, Japanese, or Korean characters. So-called Identity CMaps are used (mainly for Western fonts) in order to directly address the glyphs in a font without any intermediate mapping table.

Comparison of PDF font formats. Table 2.1 details the font formats supported in PDF, and explains which original font formats can be converted to these types by the PDF creation software.

Table 2.1 Font formats in PDF

Name:  Type1
Notes: Classic PostScript Type 1 fonts. In addition to the original Type 1 format they can also be embedded as CFF (Compact Font Format) under the name Type1C (Type 1 Compressed). These fonts are the result of classic PostScript Type 1 fonts or OpenType fonts with PostScript outlines.

Name:  MMType1 (In Acrobat: MM)
Notes: Multiple Master fonts are an extension of the Type 1 format, and are rarely used. These fonts are the result of PostScript Type 1 Multiple Master fonts.

Name:  Type3
Notes: User-defined fonts, i.e. the glyphs are described by raw vector or image operations instead of a readymade font. Type 3 fonts are always embedded. They are mainly intended for bitmapped fonts and logo fonts. These fonts are often the result of a printer driver converting a PostScript Type 1 or TrueType font to a bitmap font. Some applications use Type 3 fonts for achieving special effects, such as filling an area with a pattern.

Name:  TrueType
Notes: TrueType fonts can directly be embedded in PDF. Since not all parts of the original TrueType font are required in PDF, the embedded font data does not necessarily comprise a valid TrueType font. These fonts are the result of TrueType or Type 42 fonts, or OpenType fonts with TrueType outlines.

Name:  CIDFontType0 (In Acrobat: Type 1 (CID))
Notes: CID font with PostScript outlines; similar to Type 1 fonts the font data can be embedded as CFF under the name CIDFontType0C. These fonts are the result of OpenType fonts with PostScript outlines.

Name:  CIDFontType2 (In Acrobat: TrueType (CID))
Notes: CID font with TrueType outlines. These fonts are the result of TrueType fonts or OpenType fonts with TrueType outlines.

Name:  OpenType
Notes: Directly embedding OpenType fonts requires PDF 1.6. In contrast to CIDFontType0 it allows embedding the full OpenType font file. This format is still very rare. These fonts are the result of OpenType fonts with PostScript outlines.

2.3 Contents of a Font Report

FontReporter collects general information, font-related information, and glyph tables for all fonts in a PDF.
These categories are accessible in multiple ways:
> bookmarks contain general and font-related information as clickable hypertext
> overview pages contain general and font-related information on a printable page; the overview pages contain clickable links so that you can easily navigate to a particular font’s glyph table
> detailed glyph tables repeat the font-related information, and list the glyphs in a font

By clicking one of the bookmarks or using the links in the overview section you can quickly navigate to the corresponding glyph tables for a font.

FontReporter will copy the fonts from the original document to the font report without any modification. All font properties (e.g. embedding and encoding) will remain unchanged.

Note: FontReporter will only process data which is actually represented with font and encoding structures in the PDF. Other means of representing text are ignored, such as images containing text, or characters which are drawn with vector graphics (also called outline text).

2.3.1 General Information

The general information in a font report consists of the following:
> file name of the original PDF document
> number of pages and fonts in the document
> PDF Producer, i.e. the name of the software used to produce the PDF (not the software used to compose the document).

2.3.2 Font and Encoding Information

For each font in the document the following pieces of information will be provided.

Font name. If the font name starts with six random characters and a plus sign, the font is a subset which does not contain all characters which were originally present in the font. The subset prefix is useful when multiple subsets of the same font are embedded in a document. CJK font names will be displayed in their native spelling if possible.

Font type. The PDF font type; see Section 2.2, »Overview of PDF Font Formats«, for a list of font formats.

Embedding and subsetting status.
The embedding status of the font, including subset information and the format of the embedded font data.

Encoding. The encoding defines the mapping of character codes to glyphs for simple fonts. This may be the name of one of PDF’s predefined encodings WinAnsiEncoding or MacRomanEncoding (these are called Ansi and Roman in Acrobat) or custom for a nonstandard encoding. If no encoding information is given explicitly, but the encoding which is built into the font must be used, the word builtin is displayed instead of an encoding name.

CMap. For CID fonts the name of a predefined CMap (character map) will be listed as encoding, plus the name of the corresponding character collection (Chinese, Japanese, Korean, or Identity).

Additional font information. If a simple font contains a CharSet entry (a list with the names of all glyphs contained in a subset) this will be noted. Similarly, if a CID font contains a CIDSet entry (a list with all CIDs contained in a subset) this will also be noted.

Symbol bit. The symbol bit signals that a font contains characters outside the Adobe standard Latin character set. This may be relevant for font substitution and text extraction operations. For example, a non-embedded font with the symbol bit cannot be substituted.

Unicode mapping table. If the font contains an explicit ToUnicode mapping table this will be mentioned. ToUnicode tables are crucial for text extraction and search operations.

2.3.3 Glyph Tables

FontReporter will create detailed glyph tables for each font in the document. The table organization depends on the font type. Regardless of the font type, the rows and columns of each table will be numbered from 0 to F (these are the hex codes for the numbers 0-15); empty slots in the tables will contain a small dot as a substitute.

Table organization for simple fonts.
Since simple fonts can address at most 256 glyphs, a single page with 16x16 slots is sufficient. All glyphs present in the font/encoding combination will be shown. Unencoded glyphs in the font and unused encoding entries will not be shown.

Table organization for CID fonts. Since CID fonts can contain thousands of glyphs, efficient table organization is important for achieving compact font reports while at the same time providing quick access to particular code ranges. If the font contains a CIDSet table, only the CIDs contained in this table will be shown; otherwise the CIDs which are actually used in the document will be shown. All text on all pages of the document will be processed to determine the set of used CIDs, while text in hypertext elements (such as form fields) will be ignored.

A separate page will be created for each block of 256 CIDs. The starting number of the block (in hex) and the number of glyphs per block are provided in the page heading and the corresponding bookmark. For example, the heading CID x0000 means that this block contains CIDs 0000-00FF (hex), or decimal 0-255; the heading CID x0100 means that this block contains CIDs 0100-01FF (hex), or decimal 256-511. Empty blocks will be omitted to reduce the overall size of the font report.

Information for each glyph. For each glyph in a table the following information will be shown:
> A gray rectangle showing the glyph width (the height of the rectangle is constant, and not related to the glyph geometry)
> The actual glyph will be displayed. Sometimes the font does not contain the corresponding glyph description. In this case the font’s .notdef glyph will be displayed; depending on the type of font it may be represented as a space character, hollow box, crossed box, or similar. For CID fonts with vertical writing mode the glyphs will be positioned such that they match the table grid.
> For simple (8-bit) fonts the name of the glyph will be shown.
> If the font contains a Unicode mapping table (a ToUnicode CMap), the glyph’s Unicode value(s) will be shown in U+xxxx notation if available. Multiple Unicode values may be present for glyphs which map to a sequence of multiple Unicode characters, such as ligatures and fractions. If a ToUnicode table is present for the font, but it does not contain an entry for this glyph (this may happen for non-textual symbols which are contained in a text font), FontReporter will display the string (missing) instead. If the Unicode values contain a surrogate pair (two UTF-16 values) the corresponding UTF-32 value will be displayed instead.

2.4 Investigate PDF Problems with FontReporter

This section lists some common problem scenarios along with hints for interpreting the font report in order to identify problems.

Text Extraction does not work. FontReporter is very useful if you try to extract text from a PDF (using PDFlib TET, Adobe Acrobat, or any other tool) and the extracted text is incomplete or wrong. Some hints:
> For simple fonts check the encoding, glyph name, and Unicode mapping. In many cases errors in the PDF, font, or encoding are easy to identify. In PDFlib TET you can correct many kinds of errors by supplying appropriate processing options or custom Unicode mapping tables.
> For some font/encoding combinations text extraction will only work if a proper ToUnicode mapping table is present. The font report will tell you whether or not this data structure is present in the PDF.

Complicated Unicode mappings. In some situations the Unicode mapping of a glyph may not be obvious. For example, a font’s Unicode mapping can decompose ligatures into multiple constituent characters in order to facilitate text extraction. On the other hand, ligatures sometimes have wrong Unicode mappings which thwart text extraction.
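
The two Unicode display rules mentioned for glyph tables (several values for one glyph, and a surrogate pair shown as a single UTF-32 value) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not FontReporter's actual implementation: the function name unicode_labels is hypothetical, and the input is assumed to be the list of UTF-16 code units that a ToUnicode entry assigns to one glyph.

```python
import unicodedata


def unicode_labels(utf16_units):
    """Turn the UTF-16 code units of one glyph into U+XXXX labels.

    A high/low surrogate pair is combined into a single code point,
    so the label shows the UTF-32 value instead of two UTF-16 values.
    (Hypothetical helper, for illustration only.)
    """
    labels = []
    i = 0
    while i < len(utf16_units):
        unit = utf16_units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = (i + 1 < len(utf16_units)
                   and 0xDC00 <= utf16_units[i + 1] <= 0xDFFF)
        if is_high and has_low:
            low = utf16_units[i + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            # Ordinary BMP value (or an unpaired surrogate, shown as-is)
            code_point = unit
            i += 1
        labels.append(f"U+{code_point:04X}")
    return labels


# A ligature glyph mapped to its constituent characters "f" and "i"
print(unicode_labels([0x0066, 0x0069]))   # ['U+0066', 'U+0069']

# A surrogate pair (D835 DC00) shown as one UTF-32 value
print(unicode_labels([0xD835, 0xDC00]))   # ['U+1D400']

# Unicode itself defines a compatibility decomposition for the fi ligature
print(unicodedata.normalize("NFKC", "\uFB01"))   # fi
```

When a glyph that looks like a ligature shows only one unexpected value in the font report, comparing it against a decomposition like the one above is a quick way to spot a wrong mapping.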
Error message in Acrobat. For some problematic PDFs Acrobat will complain »Cannot find or create font XXX. Some characters may not display or print correctly.« and display bullets instead of the font’s characters. This usually happens when the font is neither embedded nor installed on the system, and Acrobat cannot substitute the font because the symbol bit is set.

Type 3 fonts. Type 3 fonts may cause the PDF to print slowly or prevent text extraction and editing. The font report will provide a useful overview of Type 3 fonts and the glyphs they contain.

Glyph complement of font subsets. Font subsets do not contain all glyphs which were initially available in a font. The font report displays the glyph complement (set of available glyphs) in a font subset.

Duplicate fonts. In some situations (e.g. combining pages from several PDFs) multiple subsets of the same font may be present in a PDF, or even multiple instances of the same font. This can cause file size bloat or even printing problems. The font report will clearly identify this problem.

Distinguish text and graphics. Sometimes it is difficult to see whether content in a PDF is actually represented as native text, vector graphics, or an image. The font report helps in identifying various uncommon scenarios:
> Text which is represented as vector graphics (outline text) or an image will be missing from the font report.
> If a logo or other symbol is represented by a special font you can easily identify the logo font in the font report.

2.5 Error Messages

Unaccessible content data. FontReporter will include this message in the font report if it was unable to enumerate the page contents for determining which glyphs are used in the document. Most commonly this will happen with encrypted documents, but also if the page description is damaged.

Invalid page content state, glyphs cannot be drawn.
This message may appear for unembedded fonts which cannot be processed. Acrobat usually complains with the message »A font required for font substitution is missing«, or substitutes a system-installed font.

A Revision History

Revision history of this manual

Date                Changes
July 3, 2008        > Minor changes for FontReporter 1.3 (new: support for Acrobat 9 on Windows)
March 26, 2007      > Minor changes for FontReporter 1.2 (new: support for Acrobat 8 on Mac OS X)
January 30, 2006    > Minor additions for FontReporter 1.1
February 14, 2005   > Initial version for FontReporter 1.0.0

Known problems in this version. We are currently aware of the following minor problems:
> In rare cases the glyphs of Type 3 fonts may appear too small or too large, or not on the baseline.
> Although PDF Producer entries with non-Latin characters will be displayed properly in the bookmarks, they will appear garbled in the overview page.
> Sometimes not all glyph names are shown for simple fonts with the predefined encodings WinAnsi or MacRoman.
> Glyph names for built-in encodings are not shown, nor those for custom encodings which are based on a font’s built-in encoding.
> Unicode values will not be shown for CID fonts with standard CJK character collections and simple fonts with the predefined encodings WinAnsi or MacRoman. However, since these are fixed mappings no document-specific information is lost.
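
Because the predefined encodings are fixed tables, the Unicode value for a given code can be looked up without consulting the document. The Python sketch below gives a rough idea of such a lookup. It rests on an assumption: Python's built-in cp1252 and mac_roman codecs are used as stand-ins for WinAnsiEncoding and MacRomanEncoding, which they closely resemble but may not match in every slot. The helper name fixed_unicode is hypothetical.

```python
# Stand-in codecs (assumption): the PDF encodings resemble these Python
# codecs, but are not guaranteed to be identical for every code.
CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def fixed_unicode(encoding_name, code):
    """Return the U+XXXX label for an 8-bit code in a predefined encoding.

    Returns None if the stand-in codec leaves the code undefined.
    (Hypothetical helper, for illustration only.)
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("simple fonts use 8-bit codes (0-255)")
    try:
        char = bytes([code]).decode(CODECS[encoding_name])
    except UnicodeDecodeError:
        return None
    return f"U+{ord(char):04X}"


# The same character (a with diaeresis) sits at different codes
print(fixed_unicode("WinAnsiEncoding", 0xE4))    # U+00E4
print(fixed_unicode("MacRomanEncoding", 0x8A))   # U+00E4
```

For fonts with a custom or built-in encoding no such fixed table exists, which is why the glyph names and ToUnicode values in the font report matter most in those cases.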