Discussion:
[iText-questions] Can't open or write very large PDF files
WMJ
2011-11-07 00:47:08 UTC
Permalink
iText seems to have trouble opening or writing PDF files larger than 2GB.

I have examined the source code and found that some classes use the 32-bit int type to hold file offsets. These should be changed to long (64-bit integer) to properly support large PDF files.
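
As a rough illustration of why a 32-bit offset fails past 2GB, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from the iText source; it only shows what happens when a 64-bit offset is narrowed to int.

```java
// Hypothetical illustration only; this is not iText code.
final class OffsetOverflowDemo {
    /** Narrowing a 64-bit offset to int keeps only the low 32 bits. */
    static int truncate(long offset) {
        return (int) offset;
    }

    /** Returns true when the offset fits in a signed 32-bit int. */
    static boolean fitsInInt(long offset) {
        return offset >= 0 && offset <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long threeGb = 3_000_000_000L;           // an offset beyond the 2GB mark
        System.out.println(Integer.MAX_VALUE);   // 2147483647, the largest int offset
        System.out.println(truncate(threeGb));   // -1294967296, a negative "offset"
        System.out.println(fitsInInt(threeGb));  // false
    }
}
```

Any xref entry or stream position stored this way would point to a negative or wrong location once the file grows past 2,147,483,647 bytes.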
WMJ
2011-11-07 03:09:44 UTC
Permalink
I've just uploaded a patch for the very-large PDF file issue in iTextSharp, both read and write.
Paulo Soares
2011-11-07 10:20:26 UTC
Permalink
Thank you, I'll see how to integrate it.

Paulo

WMJ
2011-11-07 15:40:27 UTC
Permalink
Thank you for working on it.


The patch pack may also contain fixes for other recently reported iText issues, for instance double-byte PDF Name decoding.

The fix for 2GB+ PDF files mostly concerns reading and writing the xref table and streams.

It may break existing code, since some 32-bit int fields and properties are changed to 64-bit long values. Perhaps it would be more appropriate to change only internal and private fields (file length, offset, and similar) to long, keep the current API unmodified, and introduce new 2GB+ capable members such as LongLength and LongOffset. The uploaded fix does change the existing API, so applications using it may need modification and recompilation. With the additive approach, existing code can be left in peace, and developers who hit the 2GB limit can adopt the long-valued properties to get past it.
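
The additive approach suggested above could look roughly like the following Java sketch. The class StreamInfo and its accessors are hypothetical names chosen for illustration and are not the iText or iTextSharp API. The assumption here is that the legacy int accessors should fail loudly rather than silently wrap when the value no longer fits.

```java
// Hypothetical sketch; names are illustrative, not the iText/iTextSharp API.
final class StreamInfo {
    private final long offset;  // internal state is 64-bit
    private final long length;

    StreamInfo(long offset, long length) {
        if (offset < 0 || length < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        this.offset = offset;
        this.length = length;
    }

    /** New accessors for callers that need to handle files beyond 2GB. */
    long getLongOffset() { return offset; }
    long getLongLength() { return length; }

    /** Legacy accessors keep their int signature, so existing callers still compile. */
    int getOffset() { return Math.toIntExact(offset); }  // throws ArithmeticException on overflow
    int getLength() { return Math.toIntExact(length); }

    public static void main(String[] args) {
        StreamInfo small = new StreamInfo(10L, 20L);
        System.out.println(small.getOffset() + " " + small.getLength());
        StreamInfo big = new StreamInfo(3_000_000_000L, 5_000_000_000L);
        System.out.println(big.getLongOffset() + " " + big.getLongLength());
    }
}
```

Existing applications keep working unchanged for files under 2GB, and only code that actually meets larger files has to move to the long-valued members.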
------------------------------------------------------------------------------
RSA(R) Conference 2012
Save $700 by Nov 18
Register now
http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________
iText-questions mailing list
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php
Paulo Soares
2011-11-07 16:03:44 UTC
Permalink
I'll probably change the API to return and use long; this area is not visible to most users.

Paulo

Kevin Day
2011-11-07 16:41:59 UTC
Permalink
Thanks guys - this has been on my 'to-do' list since July, and I haven't had
a chance to work on it yet - sorry.
If we are going to go down this path, making a bunch of breaking changes,
are there any objections to a re-implementation of RAFOA
(RandomAccessFileOrArray)?
I think a much cleaner implementation would be to have an interface for
RAFOA and have different implementations for ByteBacked,
RandomAccessFileBacked, and ByteBufferBacked. Factory methods would then
return the appropriate implementation.
I'm not sure if that's worth looking at or not... sorry again for my
tardiness in working on this.
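
A minimal Java sketch of the interface-plus-factory idea described above. All type and method names here (RandomAccessSource, RandomAccessSources, and the three implementations) are hypothetical and do not reflect any actual iText design. Positions are long throughout so that a file-backed source is not limited to 2GB.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

/** Hypothetical factory; callers never name a concrete implementation. */
final class RandomAccessSources {
    private RandomAccessSources() { }

    static RandomAccessSource of(byte[] data) { return new ByteArraySource(data); }
    static RandomAccessSource of(ByteBuffer buffer) { return new ByteBufferSource(buffer); }
    static RandomAccessSource open(String path) throws IOException {
        return new FileSource(new RandomAccessFile(path, "r"));
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessSource src = of(new byte[] {'%', 'P', 'D', 'F'})) {
            System.out.println(src.length() + " bytes, first = " + (char) src.get(0));
        }
    }
}

/** Hypothetical common interface for all backing stores. */
interface RandomAccessSource extends AutoCloseable {
    long length() throws IOException;
    /** Returns the byte at position as 0..255, or -1 when out of range. */
    int get(long position) throws IOException;
    @Override void close() throws IOException;
}

final class ByteArraySource implements RandomAccessSource {
    private final byte[] data;
    ByteArraySource(byte[] data) { this.data = data; }
    public long length() { return data.length; }
    public int get(long position) {
        return (position < 0 || position >= data.length) ? -1 : data[(int) position] & 0xFF;
    }
    public void close() { }
}

final class FileSource implements RandomAccessSource {
    private final RandomAccessFile file;
    FileSource(RandomAccessFile file) { this.file = file; }
    public long length() throws IOException { return file.length(); }
    public int get(long position) throws IOException {
        if (position < 0 || position >= file.length()) return -1;
        file.seek(position);  // seek takes a long, so offsets beyond 2GB are fine
        return file.read();
    }
    public void close() throws IOException { file.close(); }
}

final class ByteBufferSource implements RandomAccessSource {
    private final ByteBuffer buffer;
    ByteBufferSource(ByteBuffer buffer) { this.buffer = buffer; }
    public long length() { return buffer.limit(); }
    public int get(long position) {
        return (position < 0 || position >= buffer.limit()) ? -1 : buffer.get((int) position) & 0xFF;
    }
    public void close() { }
}
```

The appeal of this shape is that the array-backed and buffer-backed sources stay simple, while only the file-backed one has to care about 64-bit offsets.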

Leonard Rosenthol
2011-11-07 15:57:34 UTC
Permalink
What does "double-byte PDF Name decoding" mean?

PDF Names are always in UTF-8, so they can NOT contain double-byte characters (at least not in native form).

Leonard
WMJ
2011-11-08 23:24:41 UTC
Permalink
Yes, according to ISO 32000, PDF Names should be UTF-8 encoded if they originally contain double-byte characters.
Nevertheless, when dealing with legacy PDF files I have encountered quite a lot of documents that use ANSI encodings in their PDF Names. I therefore had to introduce a static Encoding member in the fix, which lets the developer specify which encoding is used to decode PDF Names.
Even without my private fix, iText currently cannot decode UTF-8 encoded PDF Names, since it mistakenly assumes that each character in a PDF Name is a single byte.

WMJ
WMJ
2011-11-09 02:58:03 UTC
Permalink
Thank you for pointing that out.

I know about ISO 32000.

The problem is that we keep running into a considerable number of documents that use ANSI encodings for PDF names.

For example, please take a look at the attached sample.pdf (simply opening it in Notepad is enough).

The BaseFont names of font TT0 (145 0 R) and font TT1 (139 0 R) are actually encoded with GB2312 rather than UTF-8. Represented as byte arrays, they are:

BaseFont/#BA#DA#CC#E5

[0]: 186
[1]: 218
[2]: 204
[3]: 229


and


BaseFont/#CB#CE#CC#E5

[0]: 203
[1]: 206
[2]: 204
[3]: 229


respectively. Thus I had to introduce a customized fix for those legacy documents.

Regardless of my customized fix for those legacy documents, the current implementation of UTF-8 PDF name decoding in iText is wrong.
The problem lies mostly in the NextToken method of the PRTokeniser class. It decodes each "#FF"-notated byte, converts it into a character, and immediately appends that character to a StringBuilder instead of combining it with the subsequent bytes. However, a double-byte or UTF-8 encoded character requires more than one #FF sequence, so the current implementation breaks one UTF-8 encoded character into several characters.
The correct implementation should first decode the PDF name into a byte array (resolving any #FF notation), and then use Encoding.UTF8.GetString, Encoding.GetEncoding("GBK").GetString, or a similar method to get the decoded string from the tokenized bytes.
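
The two-step approach described above (bytes first, then one charset decode) can be sketched in Java as follows. PdfNameDecoder and its methods are hypothetical names and are not iText's PRTokeniser; the sketch assumes the raw name has already been read as 8-bit characters without the leading '/'. The byte-per-character variant is included only to show the behaviour being criticised.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the iText tokenizer.
final class PdfNameDecoder {
    /** Resolves #XX escapes and returns the raw bytes of the name. */
    static byte[] unescape(String rawName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawName.length(); i++) {
            char c = rawName.charAt(i);
            if (c == '#' && i + 2 < rawName.length() + 1) {
                int hi = Character.digit(rawName.charAt(i + 1), 16);
                int lo = Character.digit(rawName.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);  // one escaped byte
                    i += 2;
                    continue;
                }
            }
            out.write(c & 0xFF);                // unescaped 8-bit character
        }
        return out.toByteArray();
    }

    /** Collect all bytes first, then decode them together with one charset. */
    static String decode(String rawName, Charset charset) {
        return new String(unescape(rawName), charset);
    }

    /** Byte-per-character decoding: each byte becomes its own character. */
    static String decodeBytePerChar(String rawName) {
        return new String(unescape(rawName), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utf8Name = "caf#C3#A9";  // two escaped bytes forming one UTF-8 character
        System.out.println(decode(utf8Name, StandardCharsets.UTF_8).length());  // 4
        System.out.println(decodeBytePerChar(utf8Name).length());               // 5
        // Legacy names need a caller-supplied charset, which may not be installed.
        if (Charset.isSupported("GBK")) {
            System.out.println(decode("#BA#DA#CC#E5", Charset.forName("GBK")));
        }
    }
}
```

Which charset to pass for a legacy file cannot be derived from the name itself, so in this sketch it is left to the caller, in the same spirit as the static Encoding member mentioned earlier.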
Post by Leonard Rosenthol
________________________________
Sent: Wednesday, November 9, 2011, 7:34 AM
Subject: Re: [iText-questions] Can't open or write very large PDF files
You need to be VERY CAREFUL about this.

A Name is composed of a '/' followed by a series of 8-bit characters that are encoded according to the rules of UTF-8. There is NO ANSI in PDF. You might have meant ASCII, but that's a subset of UTF-8, so you're still fine there.
As such, I am not sure what you think you need to do with the code in iText. It knows how to encode/decode the string as UTF-8.
If you have a sample PDF that demonstrates a problem, please post it.
Leonard
Leonard Rosenthol
2011-11-10 14:38:42 UTC
Permalink
So the #HEX syntax was introduced in PDF 1.2 as a way to introduce other values (especially spaces!) into the Name w/o changing the simple definition. The change to UTF8 was introduced with PDF 1.6. However, in NO INSTANCE was the string (after # decoding) to be treated in any encoding except PDFDocEncoding (aka ISO Latin 1) or UTF8. Any other encoding would simply produce incorrect display – and since you don't know anything about the encoding, one can only GUESS (and probably wrongly).

The iText tokenizer is correct and matches that of Acrobat/Reader.

Parse the string and decode any escaping to produce a single string of 8-bit characters, which is considered to be in UTF8.

I see no changes necessary here.

Remember, Names and Strings are DIFFERENT DATA TYPES – you can't treat them the same!

Leonard

WMJ
2011-11-11 01:41:36 UTC
Permalink
Hello,


Java and .NET have different string classes.

The implementation in iText may be correct.

But in iTextSharp it is wrong.


WMJ