[iText-questions] BaseFont with UTF-8?

Discussion:

Lars Nagel (Trium)

2007-07-24 15:49:24 UTC

Hi all,

Is it possible to set the encoding to UTF-8 / Unicode instead of, e.g.,
BaseFont.CP1252? I want to pass UTF-8 to BaseFont.createFont(...) the way
BaseFont.CP1252 is passed in the following example:

    BaseFont baseFont = BaseFont.createFont(urlFont.getPath(),
            BaseFont.CP1252, BaseFont.EMBEDDED);

As there is no such constant in class BaseFont, I would like to
know the String I have to pass to the function.

Thanks in advance,
Lars
--
Lars Nagel

Trium Analysis Online GmbH
Hohenlindenerstr. 1
81677 München

Fon : +49 89 2060269 21
Fax : +49 89 2060269 11
Internet: www.trium.de

Amtsgericht Muenchen, HRB 134012
Managing Directors:
Dr. Martin Daumer, Michael Scholz

Paulo Soares

2007-07-24 15:53:54 UTC

UTF-8 is a Unicode byte-stream representation; it no longer means anything once the text resides in a String. The constant you're looking for is BaseFont.IDENTITY_H, but note that you'll need a TrueType font for this to work.
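A minimal Java sketch of the point above, using only the standard library (no iText calls): the same word arrives as two different byte sequences, and once each is decoded the resulting Strings are indistinguishable. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8VersusString {
    // Decode raw bytes into a String, naming the encoding of the byte stream.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Muenchen" with a real u-umlaut, as UTF-8 bytes (umlaut = 0xC3 0xBC).
        byte[] utf8 = {'M', (byte) 0xC3, (byte) 0xBC, 'n', 'c', 'h', 'e', 'n'};
        // The same word as ISO-8859-1 bytes (umlaut = 0xFC).
        byte[] latin1 = {'M', (byte) 0xFC, 'n', 'c', 'h', 'e', 'n'};

        String fromUtf8 = decodeUtf8(utf8);
        String fromLatin1 = new String(latin1, StandardCharsets.ISO_8859_1);

        // After decoding, nothing in the String records which encoding it came from.
        System.out.println("equal after decoding: " + fromUtf8.equals(fromLatin1));
        System.out.println("length in chars: " + fromUtf8.length());
    }
}
```

Either String could then be drawn with a font created along the lines of BaseFont.createFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED), i.e. the call from the question with the encoding constant swapped; fontPath here is a placeholder for the path to a TrueType font.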

Paulo

Disclaimer:
This message is destined exclusively to the intended receiver. It may contain confidential or legally protected information. The incorrect transmission of this message does not mean the loss of its confidentiality. If this message is received by mistake, please send it back to the sender and delete it from your system immediately. It is forbidden to any person who is not the intended receiver to use, distribute or copy any part of this message.

Mark Storer

2007-07-24 20:17:10 UTC

Furthermore, I don't believe UTF-8 is a legal encoding in a PDF stream. It is my understanding that you have to use a uniform byte size for characters. PDF has several means at its disposal to define an encoding, and none of them are capable of supporting characters with variable byte sizes.
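To make "variable byte sizes" concrete, here is a small standard-library Java sketch that compares how many bytes a few characters need in UTF-8 and in UTF-16BE. It only illustrates the encodings themselves and says nothing about what a PDF reader accepts; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class ByteWidths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Width(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the given text occupies as UTF-16 big-endian (no byte-order mark).
    static int utf16Width(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Latin capital A, u-umlaut, euro sign: all in the Basic Multilingual Plane.
        String[] samples = {"A", "\u00FC", "\u20AC"};
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.charAt(0)).toUpperCase()
                    + ": UTF-8 bytes = " + utf8Width(s)
                    + ", UTF-16BE bytes = " + utf16Width(s));
        }
    }
}
```

The UTF-8 width changes from character to character, while each of these characters takes exactly two bytes in UTF-16BE.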

How can you define an encoding in PDF?
1) A named encoding (WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding). All 1-byte.
2) A named encoding plus a differences array (replace this code point with that character). 1-byte.
3) A differences array without a named base encoding (which IIRC falls back to the font's built-in encoding, or to StandardEncoding for a non-embedded, non-symbolic font). 1-byte.
4) A "CMap". This is a PostScript-esque file containing a mapping of code points to characters. 1-, 2-, or 4-byte characters. I believe a 3-byte CMap would be legal, but I've never seen one.
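The "base encoding plus differences" idea from items 2 and 3 can be sketched in a few lines of Java. This is a simplified model, not how a PDF library stores it: a real differences array names glyphs (such as /Euro) rather than characters, and the base encoding here is ISO-8859-1 only because its code values equal Unicode code points. All names and the override itself are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class DifferencesSketch {
    // Decode 1-byte codes: use the override when the differences table has one,
    // otherwise fall back to the base encoding (ISO-8859-1 in this sketch,
    // where the code value is also the Unicode code point).
    static String decode(byte[] codes, Map<Integer, Character> differences) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as unsigned
            Character override = differences.get(code);
            sb.append(override != null ? override.charValue() : (char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Character> differences = new HashMap<>();
        // Hypothetical override: code 0x80 should mean the euro sign.
        differences.put(0x80, '\u20AC');

        byte[] codes = {(byte) 0x80, '5', '0'};
        String decoded = decode(codes, differences);
        System.out.println("decoded as euro sign + 50: " + decoded.equals("\u20AC50"));
    }
}
```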

There are many predefined CMaps for things like UTF-16 or Big5. Each predefined CMap is specific to a single language: there's one for UTF-16 Japanese, one for Chinese, and so forth.

You need to use different CMaps for different writing directions. I'm not really sure why. So one example of a CMap name, "UniGB-UTF16-H", maps UTF-16 characters to a Chinese-language font written horizontally.

And then there are the "Identity" CMaps. These allow you to write glyph indexes into a stream for a particular font. Glyphs are in no particular order in any given font, and their order can vary from one release of that font to the next.

iText, IIRC, is smart enough to take a TrueType (or OpenType, I'd imagine) font and pull out all the glyph indexes it needs, so you can ignore the language-specific limitations of the predefined CMaps.
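As a rough illustration of what writing glyph indexes under an Identity mapping amounts to, the Java sketch below turns each character into a 2-byte big-endian glyph index. The character-to-glyph table is entirely made up (a real one comes from the font file), the sketch ignores characters outside the Basic Multilingual Plane, and it is not iText's actual code.

```java
import java.util.HashMap;
import java.util.Map;

class IdentityHSketch {
    // Encode each char as a 2-byte big-endian glyph index.
    // Characters missing from the table fall back to glyph 0,
    // which TrueType fonts reserve for the "missing glyph".
    static byte[] encode(String text, Map<Character, Integer> charToGlyph) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            Integer glyph = charToGlyph.get(text.charAt(i));
            int gid = (glyph == null) ? 0 : glyph;
            out[2 * i] = (byte) (gid >> 8);  // high byte
            out[2 * i + 1] = (byte) gid;     // low byte
        }
        return out;
    }

    // Render bytes as uppercase hex, the way they might appear in a PDF hex string.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical glyph indexes; a different font (or font release)
        // would assign different numbers to the same characters.
        Map<Character, Integer> charToGlyph = new HashMap<>();
        charToGlyph.put('A', 0x0024);
        charToGlyph.put('\u00FC', 0x01A5);

        System.out.println("<" + toHex(encode("A\u00FC", charToGlyph)) + ">");
    }
}
```

Because the numbers depend on the font, the same text produces different bytes for different fonts, which is why the glyph lookup has to be done by the library against the font actually being embedded.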

Thus endeth the lesson.

--Mark Storer
Senior Software Engineer
Cardiff.com

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;

Mark Storer

2007-07-25 17:58:04 UTC

Quick addendum:

Just because you can't write UTF-8 bytes into a PDF doesn't mean you can't use UTF-8 as input. Just know that it'll be translated along the way: the bytes are decoded into a String first, and iText is quite capable of handling the rest for you.
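For the input side, a minimal standard-library sketch of reading UTF-8 bytes into a String might look like this. The helper name readAll is an assumption for illustration, and the in-memory stream stands in for whatever file or socket the text really comes from.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadUtf8Input {
    // Read the whole stream, decoding its bytes as UTF-8 into a String.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample text with three non-ASCII letters, encoded as UTF-8 for the demo.
        String original = "Gr\u00FC\u00DFe aus M\u00FCnchen";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String text = readAll(new ByteArrayInputStream(utf8));
        System.out.println(utf8.length + " bytes in, " + text.length() + " chars out");
        System.out.println("round trip ok: " + text.equals(original));
    }
}
```

The resulting String is what gets handed to the PDF library; the fact that it started life as UTF-8 is no longer visible at that point.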
