Hi
Thanks a lot for your time and suggestions.
I have tried extracting the coordinates along with the text. The columns in the PDF table are right-aligned (as Michael rightly guessed) because the data is numeric. So, instead of (X, Y) I used (X+width, Y) to handle the right alignment. However, (X+width, Y) is not identical for all values in a column. The following is an example of my table in the PDF.
columnA  columnB
 123.45   8901.9
   9.12    72.35
To get the coordinates, I have created a class "TestStrategy" that inherits from the base class "LocationTextExtractionStrategy". The following is the overridden method in TestStrategy from which I get the coordinates.
public override void RenderText(TextRenderInfo renderInfo)
{
    LineSegment segment = renderInfo.GetBaseline();
    // get coordinates
    RectangleJ rect = segment.GetBoundingRectange();
}
Now the rect object has X, Y and Width (Height is always 0). I have also tried segment.GetStartPoint() and segment.GetEndPoint(), but even these values are not identical for all the values in a column.
Please let me know if this is not the correct way to get the coordinates or if I am missing something. A sample or pseudocode would be of great help.
Thanks,
Bhanu Kota.
Thread as follows.
Hi,
I am using iTextSharp to parse a PDF document and extract the content as text. The PDF document I am parsing contains data in tabular format.
I am using PdfTextExtractor.GetTextFromPage() to extract text from a PDF page containing tabular data. The extracted text uses "\n" as the line separator, so I split on "\n" and treat every line as a row. Within each line I split on whitespace and treat each value as a cell value. This has been working great for me.
Now the issue is that if any cell in the tabular data (in the PDF page) is empty, there is no way to tell from the extracted text which column of the table the missing value belongs to.
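To make the problem concrete, here is a minimal sketch of the split-based approach described above. The class name SplitDemo and the sample text are made up for illustration; the real input would be whatever PdfTextExtractor.GetTextFromPage() returns.

```csharp
using System;

static class SplitDemo
{
    // Splits extracted page text into rows (on '\n') and
    // whitespace-separated cells within each row.
    public static string[][] SplitIntoCells(string pageText)
    {
        string[] lines = pageText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var rows = new string[lines.Length][];
        for (int i = 0; i < lines.Length; i++)
        {
            // A null separator array means "split on any whitespace".
            rows[i] = lines[i].Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        }
        return rows;
    }

    static void Main()
    {
        // Hypothetical extracted text: the last row has one empty cell.
        string text = "columnA columnB\n123.45 8901.9\n72.35";
        foreach (string[] row in SplitIntoCells(text))
        {
            Console.WriteLine(row.Length + " cell(s): " + string.Join(" | ", row));
        }
    }
}
```

The last row comes back with a single cell, and nothing in the text says whether "72.35" belongs to columnA or columnB.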
Please let me know if there is any solution for this issue. Also, please let me know whether it is possible to convert a PDF to Excel using iTextSharp.
Thanks in advance.
Thanks,
Bhanu Kota.
anand035 wrote
> What I would do is, while extracting the text, also extract its x
> and y coordinates using a myTextrenderer class that implements the
> RenderListener interface of iText. Then group the text with the same Y
> coordinate to collect the data of a row, or group the text with the same X
> coordinate to collect the data of a column.
>
> With this approach you are not splitting anything on "\n"; instead you
> build collections from the coordinates, so an empty cell would just
> appear as a "" string in your collection
I, too, would propose to implement your own RenderListener and group the
incoming data by their coordinates.
For a generic solution, though, you cannot count on identical x or y
coordinates. If a column is right-aligned or centered, the x coordinates of
the entries of the column most likely won't be identical. And if the entries
of a row are vertically centered or bottom-aligned, their y coordinates can
differ.
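Because the coordinates are rarely identical, grouping usually needs a tolerance. The following is only a sketch under assumptions: TextChunk is a hypothetical container that a render listener would fill from the baseline of each chunk, and the tolerance of 2 units is an arbitrary value that would have to be tuned per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one text chunk captured in a render listener.
class TextChunk
{
    public string Text;
    public float Left, Right, Y; // baseline start x, baseline end x, baseline y

    public TextChunk(string text, float left, float right, float y)
    {
        Text = text; Left = left; Right = right; Y = y;
    }
}

static class RowGrouping
{
    // Groups chunks into rows: a chunk joins the current row while its
    // baseline y is within 'tolerance' of the first chunk of that row.
    public static List<List<TextChunk>> GroupIntoRows(IEnumerable<TextChunk> chunks, float tolerance)
    {
        var rows = new List<List<TextChunk>>();
        float rowY = 0;
        // PDF y coordinates grow upwards, so descending y is top to bottom.
        foreach (TextChunk chunk in chunks.OrderByDescending(c => c.Y))
        {
            if (rows.Count == 0 || Math.Abs(rowY - chunk.Y) > tolerance)
            {
                rows.Add(new List<TextChunk>());
                rowY = chunk.Y;
            }
            rows[rows.Count - 1].Add(chunk);
        }
        // Order the cells of each row from left to right.
        foreach (List<TextChunk> row in rows)
        {
            row.Sort((a, b) => a.Left.CompareTo(b.Left));
        }
        return rows;
    }

    static void Main()
    {
        // Made-up coordinates; note the slightly different y values in row one.
        var chunks = new[]
        {
            new TextChunk("8901.9", 135f, 170f, 700.4f),
            new TextChunk("123.45", 70f, 100f, 700.0f),
            new TextChunk("72.35", 140f, 170f, 685.0f)
        };
        foreach (List<TextChunk> row in GroupIntoRows(chunks, 2f))
        {
            Console.WriteLine(string.Join(" | ", row.Select(c => c.Text)));
        }
    }
}
```

The same idea applies to columns, for example by comparing Right values with a tolerance when the column is right-aligned.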
A solution for generic PDFs, therefore, requires that you either know the
coordinate ranges of columns and rows beforehand and in your listener group
text bits by cell according to those known ranges, or that you do some
intense analysis of the table layout in the PDF, e.g. look for x and y
coordinate ranges along which there is no character data or look for lines
separating the columns.
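One way to approximate the "x ranges without character data" analysis is to merge the horizontal extents of all chunks on the page and treat every remaining gap as a column boundary. This is a rough sketch, not a tested table detector: XRange, ColumnFinder and the minGap threshold are hypothetical names and values, and it assumes that no text spans several columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical horizontal extent of a chunk or of a whole column.
class XRange
{
    public float Left, Right;
    public XRange(float left, float right) { Left = left; Right = right; }
}

static class ColumnFinder
{
    // Merges overlapping (or nearly touching) extents. Each merged range is
    // taken as one column; the gaps between ranges separate the columns.
    public static List<XRange> FindColumnRanges(IEnumerable<XRange> extents, float minGap)
    {
        var columns = new List<XRange>();
        foreach (XRange e in extents.OrderBy(x => x.Left))
        {
            XRange last = columns.Count > 0 ? columns[columns.Count - 1] : null;
            if (last != null && e.Left - last.Right < minGap)
                last.Right = Math.Max(last.Right, e.Right); // same column: extend it
            else
                columns.Add(new XRange(e.Left, e.Right));   // gap found: new column
        }
        return columns;
    }

    // Returns the index of the column containing the midpoint of the chunk, or -1.
    public static int ColumnIndexOf(List<XRange> columns, float left, float right)
    {
        float mid = (left + right) / 2;
        return columns.FindIndex(c => mid >= c.Left && mid <= c.Right);
    }

    static void Main()
    {
        // Made-up extents for the headers and right-aligned values of two columns.
        var extents = new[]
        {
            new XRange(50f, 100f), new XRange(70f, 100f), new XRange(80f, 100f),
            new XRange(120f, 170f), new XRange(135f, 170f), new XRange(140f, 170f)
        };
        List<XRange> columns = FindColumnRanges(extents, 5f);
        // A value from a row whose first cell is empty still maps to column 1.
        Console.WriteLine(ColumnIndexOf(columns, 140f, 170f));
    }
}
```

If the column ranges are known beforehand, FindColumnRanges can be skipped and ColumnIndexOf used directly with the known ranges; cells for which no chunk maps to a column can then be stored as "".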
BTW, there are some PDFs making your task even more difficult. If you have
something like this
A1a2 B3b4
C5c6 D7d8
the text chunks in the page content can be "A1", "a2 B3", "b4", and "C5c6
D7d8" and the proper alignment in the displayed PDF is accomplished by a
word spacing operator (which changes the width of white space characters).
Yes, PDFs built like this do exist in the wild and have even been discussed
on this mailing list.
Therefore, to analyze them correctly, you sometimes even have to split the
text chunks your render listener receives.
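A sketch of such a split, assuming you can obtain the horizontal position of every character in a chunk (in recent iTextSharp 5.x versions TextRenderInfo.GetCharacterRenderInfos() may help here, but check your version). Glyph, ChunkSplitter and the maxGap value are hypothetical and only illustrate the idea of cutting a chunk wherever the visible gap is wider than a normal space.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical per-character record: the character and its horizontal extent.
class Glyph
{
    public char Ch;
    public float Left, Right;
    public Glyph(char ch, float left, float right) { Ch = ch; Left = left; Right = right; }
}

static class ChunkSplitter
{
    // Splits one chunk into parts wherever the distance between two
    // consecutive non-whitespace characters exceeds maxGap.
    public static List<string> SplitOnWideGaps(IEnumerable<Glyph> glyphs, float maxGap)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        float prevRight = 0;
        foreach (Glyph g in glyphs)
        {
            // Whitespace is skipped, so single spaces inside a cell are dropped;
            // only the width a space occupies matters for the gap test.
            if (char.IsWhiteSpace(g.Ch)) continue;
            if (current.Length > 0 && g.Left - prevRight > maxGap)
            {
                parts.Add(current.ToString());
                current.Clear();
            }
            current.Append(g.Ch);
            prevRight = g.Right;
        }
        if (current.Length > 0) parts.Add(current.ToString());
        return parts;
    }

    static void Main()
    {
        // The chunk "a2 B3" with a space stretched by word spacing (made-up values).
        var glyphs = new[]
        {
            new Glyph('a', 30f, 36f), new Glyph('2', 36f, 42f),
            new Glyph(' ', 42f, 100f),
            new Glyph('B', 100f, 106f), new Glyph('3', 106f, 112f)
        };
        Console.WriteLine(string.Join(" | ", SplitOnWideGaps(glyphs, 10f)));
    }
}
```

The resulting parts can then be assigned to cells by their coordinates like any other chunk.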
Regards, Michael
--
View this message in context: http://itext-general.2136553.n4.nabble.com/parse-tabular-data-in-PDF-using-iTextSharp-tp4657013p4657019.html
Sent from the iText - General mailing list archive at Nabble.com.