Discussion:
[iText-questions] Can't view text segments in certain PDF files
Jake C
2007-03-20 16:44:26 UTC
We use an OCR product to generate a PDF from a TIF, containing the original
image plus hidden text, so that you can search and select the text but see
only the scanned image. We then use Adobe FlashPaper 2 to turn the PDF into
a SWF that can be embedded in a web page. However, the hidden text is
stripped out of the final SWF, so it is no longer searchable. Adobe
considers this a "limitation" (we consider it a "bug"). Most other OCR
software has the same problem as the product we chose, but there is one
whose output converts to SWF just fine. To find out what differs between
the two files, I tried to use the Tree Viewer from iText to examine their
contents. However, when I select the Content node of the file that gets
its text stripped out, I don't see anything. If I use the API to try to
extract the Stream directly, I get a NullPointerException.

So I guess I really have two questions.

1) Is there something wrong with how the PDF is constructed that prevents
us from examining the text content with iText, or is there a bug in iText?

2) Is there a way we can manipulate the PDF from the OCR software we chose
so that it structurally looks like the one that actually keeps its text
when converted to SWF?

I'm attaching a zip file with the two files: 0112_094_no_text_select.pdf,
from our selected OCR product, whose text content we cannot view, and
0112_094_text_select.pdf, from the other product, whose text content we CAN
view and which actually keeps its text in the SWF.

OK, it seems I can't attach a file, or the message gets refused. I've
uploaded it to http://www.sharebigfile.com/file/116699/0112-094-zip.html

Paulo Soares
2007-03-20 18:02:04 UTC
See if it works now.

<< 0112_094_no_text_select_mod.pdf >>

Paulo

Jake C
2007-03-20 18:33:48 UTC
No, there is actually text there now, but not a single character is
alphanumeric. Here is the text that I copied from the PDF and pasted into
Notepad:

¿œœ·Œ»²¬ó·²·¬·¿¬·²¹
«¬·Ž·¬§
ª»²Œ±®ó­«°°Ž·»Œ ¬ž»®³¿Žóž§Œ®¿«Ž·œ­ œ¿Žœ«Ž¿ó
׬
·²
°Ž¿²¬
±³·¬¬»Œô
Œ»ª»Ž±°»Œ
Œ·®»œ¬Ž§
°®»œ»Œ·²¹
«­«¿ŽŽ§
٤
ª»®§
°Ž¿²¬ Œ»­·¹²
¿œœ·Œ»²¬ó·²·¬·¿¬·²¹
»²¹·²»»®·²¹
·­
ª¿Ž«¿ŸŽ»
®·­µó¿­­»­­³»²¬ °®±œ»­­ô
¿
©±«ŽŒ
¿ º±®³¿ŽŽ§ Œ±œ«³»²¬»Œ
ß²¿Ž§­·­
Ú«²œ¬·±² Ûª»²¬
°Ž¿²¬
³·¬·¹¿¬·²¹
·²·¬·¿¬·²¹
¬®¿²­Ž¿¬»Œ
°»®º±®³·²¹
°®±ª·Œ»­
°®»°¿®·²¹
³±®»
Œ»¬¿·Ž»Œ
Ú«²œ¬·±²
¿ Œ·­¬·²œ¬Ž§
°Ž¿²¬
»ª»²¬
°®±ª·Œ»­ ¿
Ÿ¿­»Ž·²»
°»®³·¬­ ¿
¿°°®±¿œž
Ÿ»¬©»»²
³·¬·¹¿¬·²¹
°Ž¿²¬
¿
»ª»²¬«¿ŽŽ§ Œ»œ±³°±­»Œ
±®
«²¿ª¿·Ž¿Ÿ·Ž·¬§
¯«¿²¬·¬¿¬·ª»Ž§ ³»¿­«®»Œò
œ±²­¬®«œ¬·²¹
°®»ª»²¬
œ±²­»¯«»²œ»­ô
®»Ž¿¬·±²­ž·°­
Ÿ»¬©»»²
·²ª»²¬±®§
³¿·²ó
¬¿·²»Œô
¿œœ±³°Ž·­ž»Œò
·²
»Ž·³·²¿¬·²¹
ž»¿¬ó®»³±ª¿Ž
­«œœ»­­º«ŽŽ§ ³¿·²¬¿·²»Œò
¿
¿
ÔÑÝßò
œ±²­·Œ»®»Œ
ïò
îò ݱ²¬¿·²³»²¬ ±ª»®°®»­­«®»
ŸŽ±©Œ±©² Ÿ§
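
A note on the paste above: the characters do not look random. In the sample, each garbled character's code point appears to equal 288 minus the ASCII code of a plain character (for example, "ÔÑÝßò" would read "LOCA." and "°®»°¿®·²¹" would read "preparing"). The following is a minimal Java sketch of that substitution, assuming the relationship holds for the whole paste. It is inferred only from the sample, not from the fonts or encodings inside the PDF, and the names PasteDecoder, OFFSET and decode are made up for illustration. A few characters in the paste (such as œ, Œ, Ž, ž, Ÿ) fall outside the range this mapping covers and are left untouched.

```java
class PasteDecoder {
    // Assumption inferred from the pasted sample only:
    // garbled code point = 288 - ASCII code of the intended character.
    static final int OFFSET = 288;

    static String decode(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char ch = garbled.charAt(i);
            int candidate = OFFSET - ch;
            // Remap only non-ASCII characters whose image is printable ASCII;
            // spaces, line breaks and everything else pass through unchanged.
            if (ch > 0x7F && candidate > 0x20 && candidate < 0x7F) {
                out.append((char) candidate);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two words from the paste, written as escapes to avoid charset issues.
        String preparing = "\u00B0\u00AE\u00BB\u00B0\u00BF\u00AE\u00B7\u00B2\u00B9";
        String loca = "\u00D4\u00D1\u00DD\u00DF\u00F2";
        System.out.println(decode(preparing + " " + loca)); // prints "preparing LOCA."
    }
}
```

If the mapping is right, it suggests the hidden text is stored with a font-specific character encoding rather than being damaged, but that remains a guess until the file itself is inspected.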
Paulo Soares
2007-03-20 18:50:46 UTC
The text pasted from the PDF to the clipboard is correct. It will probably
require more investigation, but it is not related to iText.

Paulo

Jake C
2007-03-20 19:41:56 UTC
What did you do to the original document to create your modified version?
Why can't the TreeViewPDF tool view the Content of either my original
version or your modified version? Is it possible to use iText to make the
structure of the one that doesn't convert to FlashPaper look like the one
that DOES convert to FlashPaper?
Paulo Soares
2007-03-20 22:58:19 UTC
Post by Jake C
What did you do to the original document to create your modified version?
Removed the invisible text rendering.
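
For context: in a PDF content stream the text rendering mode is set with the `Tr` operator, and mode 3 means the glyphs are neither filled nor stroked, which makes the text invisible. The thread does not say how the modified file was actually produced. The following is only a sketch of one possible approach, assuming the page's content stream has already been decompressed into a string by some PDF library. The class and method names are hypothetical, and the regular expression is deliberately naive: it would also rewrite a literal "3 Tr" that happens to sit inside a text string.

```java
import java.util.regex.Pattern;

class RenderModeRewriter {
    // Matches the operand/operator pair "3 Tr" as standalone tokens,
    // so that "13 Tr" or "3 Trx" are not touched.
    private static final Pattern INVISIBLE_MODE =
            Pattern.compile("(?<![\\w.+-])3(\\s+)Tr(?!\\w)");

    // Switches text rendering mode 3 (invisible) to mode 0 (fill),
    // keeping the original whitespace between operand and operator.
    static String makeTextVisible(String contentStream) {
        return INVISIBLE_MODE.matcher(contentStream).replaceAll("0$1Tr");
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 3 Tr (hello) Tj ET";
        System.out.println(makeTextVisible(stream));
        // prints "BT /F1 12 Tf 0 Tr (hello) Tj ET"
    }
}
```

A real tool would also have to write the changed stream back into the file and would normally tokenize the stream properly instead of relying on a pattern match.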
Post by Jake C
Why can't the TreeViewPDF tool view the Content of either my original
version or your modified version?
It probably has limitations. You should look at
http://www.windjack.com/products/pdfcanopener.html.
Post by Jake C
Is it possible to make the structure of one that doesn't convert to
FlashPaper to look like the one that DOES convert to FlashPaper using
iText?
That's something that can't be done without the Flash environment, and it
goes somewhat beyond the scope of this mailing list.

Paulo
Jake C
2007-03-20 23:53:30 UTC
As to that last question, I wasn't asking a FlashPaper question. I wanted to
modify one PDF structurally to look like another PDF. However, since iText
isn't capable of reading all text blocks, I guess the question is moot.
Reply-To: Post all your questions about iText here
To: "Post all your questions about iText here"
Subject: Re: [iText-questions] Can't view text segments in certain PDF
files
Date: Tue, 20 Mar 2007 22:58:19 -0000
----- Original Message -----
Sent: Tuesday, March 20, 2007 7:41 PM
Subject: Re: [iText-questions] Can't view text segments in certain PDF
files
Post by Jake C
What did you do to the original document to create your modified
version?
Removed the invisible text rendering.
Post by Jake C
Why can't the TreeViewPDF tool view the Content of either my original
version or your modified version? Is it possible to make the structure
of
It probably has limitations. You should look at
http://www.windjack.com/products/pdfcanopener.html.
Post by Jake C
one that doesn't convert to FlashPaper to look like the one that DOES
convert to FlashPaper using iText?
That's something that can't be done without the Flash environment and it
goes somewhat above the scope of this mailing list.
Paulo
Post by Jake C
Reply-To: Post all your questions about iText here
To: "Post all your questions about iText here"
Subject: Re: [iText-questions] Can't view text segments in certain PDF
files
Date: Tue, 20 Mar 2007 18:50:46 -0000
The text pasted from the PDF to the clipboard is correct. It will
probably
Post by Jake C
require more investigation but not related to iText.
Paulo
----- Original Message -----
Sent: Tuesday, March 20, 2007 6:33 PM
Subject: Re: [iText-questions] Can't view text segments in certain PDF
files
Post by Jake C
No, there is actually text there now, but not a single one is
alphanumeric.
¿œœ·Œ»²¬ó·²·¬·¿¬·²¹
«¬·Ž·¬§
ª»²Œ±®ó­«°°Ž·»Œ ¬ž»®³¿Žóž§Œ®¿«Ž·œ­ œ¿Žœ«Ž¿ó
׬
·²
°Ž¿²¬
±³·¬¬»Œô
Œ»ª»Ž±°»Œ
Œ·®»œ¬Ž§
°®»œ»Œ·²¹
«­«¿ŽŽ§
٤
ª»®§
°Ž¿²¬ Œ»­·¹²
¿œœ·Œ»²¬ó·²·¬·¿¬·²¹
»²¹·²»»®·²¹
·­
ª¿Ž«¿ŸŽ»
®·­µó¿­­»­­³»²¬ °®±œ»­­ô
¿
©±«ŽŒ
¿ º±®³¿ŽŽ§ Œ±œ«³»²¬»Œ
ß²¿Ž§­·­
Ú«²œ¬·±² Ûª»²¬
°Ž¿²¬
³·¬·¹¿¬·²¹
·²·¬·¿¬·²¹
¬®¿²­Ž¿¬»Œ
°»®º±®³·²¹
°®±ª·Œ»­
°®»°¿®·²¹
³±®»
Œ»¬¿·Ž»Œ
Ú«²œ¬·±²
¿ Œ·­¬·²œ¬Ž§
°Ž¿²¬
»ª»²¬
°®±ª·Œ»­ ¿
Ÿ¿­»Ž·²»
°»®³·¬­ ¿
¿°°®±¿œž
Ÿ»¬©»»²
³·¬·¹¿¬·²¹
°Ž¿²¬
¿
»ª»²¬«¿ŽŽ§ Œ»œ±³°±­»Œ
±®
«²¿ª¿·Ž¿Ÿ·Ž·¬§
¯«¿²¬·¬¿¬·ª»Ž§ ³»¿­«®»Œò
œ±²­¬®«œ¬·²¹
°®»ª»²¬
œ±²­»¯«»²œ»­ô
®»Ž¿¬·±²­ž·°­
Ÿ»¬©»»²
·²ª»²¬±®§
³¿·²ó
¬¿·²»Œô
¿œœ±³°Ž·­ž»Œò
·²
»Ž·³·²¿¬·²¹
ž»¿¬ó®»³±ª¿Ž
­«œœ»­­º«ŽŽ§ ³¿·²¬¿·²»Œò
¿
¿
ÔÑÝßò
œ±²­·Œ»®»Œ
ïò
îò ݱ²¬¿·²³»²¬ ±ª»®°®»­­«®»
ŸŽ±©Œ±©² Ÿ§
Reply-To: Post all your questions about iText here
To: "Post all your questions about iText here"
Subject: Re: [iText-questions] Can't view text segments in certain
PDF
Post by Jake C
Post by Jake C
files
Date: Tue, 20 Mar 2007 18:02:04 -0000
See if it works now.
Paulo
<< 0112_094_no_text_select_mod.pdf >>
Paulo Soares
2007-03-21 00:20:58 UTC
Permalink
You are asking a FlashPaper question. The PDF structure that FlashPaper
requires must first be identified, and then a tool would be needed to
create or recreate that structure. FlashPaper is the limitation here. This
isn't about iText reading text blocks, which it may or may not be able to
do; you have a long way to go before getting there.

Paulo

----- Original Message -----
From: "Jake C" <***@hotmail.com>
To: <itext-***@lists.sourceforge.net>
Sent: Tuesday, March 20, 2007 11:53 PM
Subject: Re: [iText-questions] Can't view text segments in certain PDF files
Post by Jake C
As to that last question, I wasn't asking a FlashPaper question. I wanted to
modify one PDF structurally to look like another PDF. However, since iText
isn't capable of reading all text blocks, I guess the question is moot.
Reply-To: Post all your questions about iText here
To: "Post all your questions about iText here"
Subject: Re: [iText-questions] Can't view text segments in certain PDF files
Date: Tue, 20 Mar 2007 22:58:19 -0000
----- Original Message -----
Sent: Tuesday, March 20, 2007 7:41 PM
Subject: Re: [iText-questions] Can't view text segments in certain PDF files
Post by Jake C
What did you do to the original document to create your modified version?
Removed the invisible text rendering.
Post by Jake C
Why can't the TreeViewPDF tool view the Content of either my original
version or your modified version?
It probably has limitations. You should look at
http://www.windjack.com/products/pdfcanopener.html.
Post by Jake C
Is it possible to make the structure of one that doesn't convert to
FlashPaper look like the one that DOES convert to FlashPaper using iText?
That's something that can't be done without the Flash environment and it
goes somewhat above the scope of this mailing list.
Paulo
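For reference, "invisible text" in a PDF content stream normally means text drawn with text rendering mode 3, set by the operator `3 Tr`; mode 0 is ordinary filled text. The thread does not say exactly how the modified file was produced, so the following is only a sketch of one way to do it, assuming the page content has already been decoded to a string (in iText 2.x, `PdfReader.getPageContent(int)` should return the decoded bytes). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Sketch: switch invisible text (render mode 3) to normal fill (mode 0). */
final class TextRenderModeSketch {

    // Matches the operand 3 followed by the Tr operator, e.g. "3 Tr".
    // The lookbehind skips operands such as "13" and names such as "/F3".
    private static final Pattern INVISIBLE_MODE =
            Pattern.compile("(?<![\\w.])3(\\s+)Tr\\b");

    /**
     * Returns the content stream with every "3 Tr" rewritten to "0 Tr".
     * Limitation: a plain regex cannot tell operators from bytes that happen
     * to look the same inside string literals or inline images.
     */
    static String makeTextVisible(String decodedContent) {
        return INVISIBLE_MODE.matcher(decodedContent).replaceAll("0$1Tr");
    }

    public static void main(String[] args) {
        String hidden = "BT 3 Tr /F1 12 Tf 72 700 Td (scanned words) Tj ET";
        System.out.println(makeTextVisible(hidden));
    }
}
```

Writing the result back into the file, and checking whether FlashPaper then keeps the text, is outside what this sketch covers.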
Reply-To: Post all your questions about iText here
To: "Post all your questions about iText here"
Subject: Re: [iText-questions] Can't view text segments in certain PDF files
Date: Tue, 20 Mar 2007 18:50:46 -0000
The text pasted from the PDF to the clipboard is correct. It will probably
require more investigation, but that is not related to iText.
Paulo