[iText-questions] parse tabular data in PDF using iTextSharp

Discussion:

bhanu

2012-11-28 10:57:33 UTC

Hi,
I am using iTextSharp to parse a PDF document and extract its content as text. The PDF document I am parsing contains data in tabular format.
I am using PdfTextExtractor.GetTextFromPage() to extract text from a PDF page containing tabular data. The extracted text uses "\n" as the line separator, so I split on "\n" and treat every line as a row. Within each line I split on whitespace and treat each value as a cell value. This has been working great for me.
Now, the issue is that if any cell in the table on the PDF page is empty, there is no way to tell from the extracted text which column the missing value belongs to.
Please let me know if there is any solution for this issue. Also, please let me know whether it is possible to convert a PDF to Excel using iTextSharp.
Thanks in advance.
Thanks,

Bhanu Kota.

anand035

2012-11-28 22:51:03 UTC

What I would do is, while extracting the text, also extract its x and y
coordinates using a myTextrenderer class. Then group the text with the same
Y coordinate to collect the data of one row, or group the text with the same
X coordinate to collect the data by column.

In this approach you are not splitting anything on "\n"; instead you build a
collection of the text pieces using their coordinates, so an empty cell
simply appears as a "" string in your collection.

--
View this message in context: http://itext-general.2136553.n4.nabble.com/parse-tabular-data-in-PDF-using-iTextSharp-tp4657013p4657014.html
Sent from the iText - General mailing list archive at Nabble.com.

mkl

2012-11-29 11:30:42 UTC

bhanu, anand035,

Post by anand035
What I would do is, while extracting the text, also extract its x and y
coordinates using a myTextrenderer class which implements the RenderListener
interface of iText. Then group the text with the same Y coordinate to collect
the data of one row, or group the text with the same X coordinate to collect
the data by column.
In this approach you are not splitting anything on "\n"; instead you build a
collection of the text pieces using their coordinates, so an empty cell
simply appears as a "" string in your collection.

I, too, would propose to implement your own RenderListener and group the
incoming data by their coordinates.

For a generic solution, though, you cannot count on identical x or y
coordinates. If a column is right-aligned or centered, the x coordinates of
the entries of the column most likely won't be identical. And if the entries
of a row are vertically centered or bottom-aligned, their y coordinates can
differ.
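
Because the coordinates are rarely identical, grouping usually has to work with a tolerance. The following is only a minimal sketch of that idea, independent of iTextSharp: TextChunk, RowGrouper and the tolerance value are hypothetical names introduced for illustration, and the coordinates are assumed to have been recorded by your own render listener.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value type: one piece of text plus the coordinates
// a render listener recorded for it (not an iTextSharp class).
public sealed class TextChunk
{
    public string Text { get; }
    public float XStart { get; }
    public float XEnd { get; }
    public float Y { get; }

    public TextChunk(string text, float xStart, float xEnd, float y)
    {
        Text = text; XStart = xStart; XEnd = xEnd; Y = y;
    }
}

public static class RowGrouper
{
    // Groups chunks into rows. A chunk joins the current row when its baseline
    // lies within `tolerance` of the first chunk of that row. Rows are returned
    // top to bottom (PDF y grows upwards), chunks within a row left to right.
    public static List<List<TextChunk>> GroupIntoRows(
        IEnumerable<TextChunk> chunks, float tolerance)
    {
        var rows = new List<List<TextChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= tolerance)
                last.Add(chunk);
            else
                rows.Add(new List<TextChunk> { chunk });
        }
        foreach (var row in rows)
            row.Sort((a, b) => a.XStart.CompareTo(b.XStart));
        return rows;
    }
}
```

A suitable tolerance depends on the font size and line spacing of the document at hand; a value that is too large merges neighbouring rows, so it is worth checking against a few sample pages.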

A solution for generic PDFs, therefore, requires that you either know the
coordinate ranges of columns and rows beforehand and in your listener group
text bits by cell according to those known ranges, or that you do some
intense analysis of the table layout in the PDF, e.g. look for x and y
coordinate ranges along which there is no character data or look for lines
separating the columns.
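
One way to look for x coordinate ranges without character data could be sketched as follows. This is an illustration under assumptions, not an iText feature: ColumnGapFinder, FindGaps and minGap are hypothetical names, and the input is assumed to be the (start x, end x) extents of all text pieces collected from the table area of one page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnGapFinder
{
    // Sweeps over the horizontal extents from left to right and returns every
    // stretch between covered areas that is at least `minGap` wide. Such
    // stretches are candidates for column separators.
    public static List<(float Start, float End)> FindGaps(
        IEnumerable<(float Start, float End)> extents, float minGap)
    {
        var gaps = new List<(float Start, float End)>();
        var sorted = extents.OrderBy(e => e.Start).ToList();
        if (sorted.Count == 0) return gaps;

        float coveredUntil = sorted[0].End;
        foreach (var e in sorted.Skip(1))
        {
            if (e.Start - coveredUntil >= minGap)
                gaps.Add((coveredUntil, e.Start));
            coveredUntil = Math.Max(coveredUntil, e.End);
        }
        return gaps;
    }
}
```

Text that spans several columns (a title line, for example) covers the gaps and hides them, so such lines would have to be excluded before running the sweep.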

BTW, there are some PDFs making your task even more difficult. If you have
something like this

A1a2 B3b4
C5c6 D7d8

the text chunks in the page content can be "A1", "a2 B3", "b4", and "C5c6
D7d8" and the proper alignment in the displayed PDF is accomplished by a
word spacing operator (which changes the width of white space characters).
Yes, PDFs built like this do exist in the wild and have even been discussed
in this mailing list here.

Therefore, to analyze them correctly, you sometimes even have to split the
text chunks your render listener receives.

Regards, Michael

bhanukota

2012-11-30 14:46:05 UTC

Hi Anand & Michael,

Thanks a lot for your time and suggestions.

I have tried extracting the coordinates along with the text. The columns in
the PDF table are right-aligned (as Michael rightly guessed) because the data
is numeric. So, instead of (X, Y) I used (X+width, Y) to deal with the
right-alignment issue. However, (X+width, Y) is not identical for all values
in a column. The following is an example of my table in the PDF:

columnA    columnB
 123.45     8901.9
   9.12      72.35

To get the coordinates, I have created a class "TestStrategy" that inherits
from the base class "LocationTextExtractionStrategy". The following is the
overridden method in TestStrategy from which I get the coordinates.

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment segment = renderInfo.GetBaseline();
        // get coordinates
        RectangleJ rect = segment.GetBoundingRectange();
    }

Now the rect object has X, Y and Width (the height is always 0). I have also
tried segment.GetStartPoint() and segment.GetEndPoint(), but even these values
are not identical for all the values in a column.

Please let me know if this is not the correct way to get the coordinates or if
I am missing something. A sample or pseudocode would be of great help.

Thanks,
Bhanu Kota.

anand035

2012-11-30 16:12:52 UTC

Michael may be able to suggest a more elegant approach, but here is one
solution to your problem as I see it. You don't need the rectangle.

Get the start and end values of the x coordinates of "columnA", "columnB" and
so forth, like this:

    float x1 = renderInfo.GetBaseline().GetStartPoint()[Vector.I1];
    float x2 = renderInfo.GetBaseline().GetEndPoint()[Vector.I1];

If you want, you can get y1 and y2 the same way using index 1 (Vector.I2).

Then, instead of looking for an identical match, use a range match to collect
all values of columnA, i.e. any text whose x1, x2 fall within the range
(x1, x2) of "columnA" is considered a value of columnA.

mkl

2012-11-30 16:59:19 UTC

anand035, Bhanu

Post by anand035
Michael would be able to give more elegant approach.

The approach I mentioned in my earlier message may be interesting for a
generic solution.

If there is a reference row, though, i.e. a row with a wide enough entry in
every column, using it is very appropriate. I would propose a small change.

Post by anand035
I.e. any text whose x1, x2 fall within the range (x1, x2) of "columnA"
is considered a value of columnA

As the column headers probably are shorter than individual values, I would
propose checking instead whether the interval x1, x2 of the text intersects
the interval x1', x2' of "columnA" but not the interval x1", x2" of any
other column header. If some text intersects the intervals of multiple
column headers, some special treatment is required.
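
The intersection check described above might look like the following minimal sketch. ColumnHeader and ColumnMatcher are hypothetical helper types made up for this illustration; the header intervals are assumed to have been collected from the header row beforehand.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: a column header with the x interval of its text.
public sealed class ColumnHeader
{
    public string Name { get; }
    public float X1 { get; }
    public float X2 { get; }

    public ColumnHeader(string name, float x1, float x2)
    {
        Name = name; X1 = x1; X2 = x2;
    }
}

public static class ColumnMatcher
{
    // Returns the names of all headers whose interval [X1, X2] intersects the
    // interval [x1, x2] of the text. Exactly one entry means a clear match,
    // none means the text sits under no header, and several entries mean the
    // text needs the special treatment mentioned above.
    public static List<string> FindColumns(
        IEnumerable<ColumnHeader> headers, float x1, float x2)
    {
        var matches = new List<string>();
        foreach (var h in headers)
        {
            bool intersects = x1 <= h.X2 && h.X1 <= x2;
            if (intersects) matches.Add(h.Name);
        }
        return matches;
    }
}
```

Returning all matches instead of a single name leaves the decision for ambiguous text to the caller, for example splitting the text or assigning it to the header with the largest overlap.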

Regards, Michael

bhanukota

2012-12-04 12:34:34 UTC

Hi,

Thanks a lot, the interval/range approach worked great.

anand035 wrote
I.e. any text whose x1, x2 fall within the range (x1, x2) of "columnA"
is considered a value of columnA.

Now I have one more question. Sorry for posting so many.
How can I read the locale/culture information from a PDF file?
For example, in es-ES (Spanish) the comma is used as the decimal separator,
i.e. 14.25 is written as 14,25. As we read the data from the PDF as strings,
converting 14,25 to double using the en-US (English) locale yields 1425.00,
which is wrong. So, I would like to get the culture/language information from
the PDF itself.

Please let me know if there is a way to get this culture/locale/language
information.

Thanks in advance.

Thanks,
Bhanu Kota.

mkl

2012-12-04 13:32:08 UTC

Bhanu Kota,

Post by bhanukota
How can I read the locale/culture information from a PDF file?

There is no (at least no mandatory) attribute for the Locale or Culture of a
PDF document.
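
Since the document itself cannot be relied on for this, the culture has to come from somewhere else, for instance from knowledge about who produced the file. The following minimal sketch only shows how a caller-supplied culture changes the parsing result; CellNumberParser is a hypothetical helper, and the culture name is an assumption the caller has to make.

```csharp
using System;
using System.Globalization;

public static class CellNumberParser
{
    // Parses a cell value with the culture named by the caller.
    // Nothing is read from the PDF here; the culture name is an input.
    public static double Parse(string cellText, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return double.Parse(cellText, NumberStyles.Number, culture);
    }
}
```

With "es-ES" the comma in "14,25" is read as the decimal separator, whereas with "en-US" it is read as a group separator, which is how the wrong value 1425 from the question comes about.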

Regards, Michael

_______________________________________________
iText-questions mailing list
iText-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php