Discussion:
[iText-questions] iText help resources?
Steve Garcia
2016-01-14 05:54:24 UTC
Hi,

I am trying to pull table data out of PDF files that contain non-tabular text as well as tables. I've successfully parsed the non-tabular text using PdfTextExtractor.GetTextFromPage(), but the resulting text stream is empty at each table location.

I'm sure there's a way to do what I need, but I can't find documentation for iText. Any suggestions for learning my way out of this dilemma?

I've attached a sample PDF. The tables are in the latter part of the file.

Thanks,
Steve
mkl
2016-01-14 13:38:29 UTC
Steve,
Post by Steve Garcia
Am trying to pull table data out of PDF files that contain non tabular
text as well as the tables. I've successfully parsed the non tabled text
using PdfTextExtractor.GetTextFromPage(), but the resulting text stream is
empty at each table location.
The text in the tables cannot be extracted without OCR.

The text in the tables is drawn using Type 3 fonts with an ad-hoc encoding,
i.e. the first glyph drawn on the page is encoded as 0, the second
(distinct) glyph as 1, ...

E.g. on page 11 the first text drawn is "B6 Summary (Official Form 6 -
Summary) (12/14)" and is encoded as 00, 01, 02, 03, 04, 05, 05, 06, 07, 08,
02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F, ...
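
To make the encoding scheme concrete, here is a minimal Python sketch of such a "first appearance" encoding. It is only an illustration of the idea described above, not code from iText or from the tool that produced the PDF; the function name adhoc_encode is made up for this example. It reproduces the codes listed for the page 11 heading, and it also shows that two different strings can yield the same code sequence, which is why the codes alone do not reveal the text.

```python
def adhoc_encode(text):
    """Give each distinct glyph the next free code, in order of first use."""
    code_of = {}  # glyph -> code assigned at first appearance
    codes = []
    for glyph in text:
        if glyph not in code_of:
            code_of[glyph] = len(code_of)
        codes.append(code_of[glyph])
    return codes, code_of


title = "B6 Summary (Official Form 6 - Summary) (12/14)"
codes, table = adhoc_encode(title)

# First 22 codes, printed as two-digit hex like in the post:
# 00, 01, 02, 03, 04, 05, 05, 06, 07, 08, 02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F
print(", ".join(f"{c:02X}" for c in codes[:22]))

# Different texts can produce identical code sequences, so without the
# glyph -> character table the codes cannot be turned back into text.
print(adhoc_encode("hello")[0] == adhoc_encode("jevvo")[0])  # True
```

In the PDF only the code sequence and the glyph drawing procedures are present; the table built here as code_of has no counterpart in the file.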

Furthermore, the font has no mapping to Unicode.

Thus, automated text extraction without some kind of OCR is impossible.

Regards, Michael


