Read pdf file via VBA

kwvh
3StarLounger
Posts: 308
Joined: 24 Feb 2010, 13:41

Read pdf file via VBA

Post by kwvh »

Does anyone have any experience reading PDFs via VBA? Is it possible? I am getting batches of PDF files, all in the same layout, that contain three pieces of information that I need to gather. Opening them individually is time-consuming, so I am looking for an alternative.

Thanks in advance for your assistance.

Ken
Last edited by HansV on 27 Mar 2011, 12:53, edited 1 time in total.
Reason: to correct typo in subject

HansV
Administrator
Posts: 78474
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Read pdf file via VBA

Post by HansV »

If you have Adobe Reader, you may be able to use the Adobe Acrobat n.0 Type Library, but I don't know whether you can find text in a PDF file - among other things it depends on the PDF file: for example, a scanned document is basically an image that can't be searched.
Unfortunately, the documentation from Adobe is rather esoteric. And since I don't have Adobe Reader myself, I can't create or test code.
Best wishes,
Hans

William
StarLounger
Posts: 79
Joined: 08 Feb 2010, 21:48
Location: Wellington, New Zealand

Re: Read pdf file via VBA

Post by William »

With Word VBA it is possible to open PDF files using "Documents.Open ... Format:=wdOpenFormatText" - which is probably the same as opening them manually using the "Recover Text from Any File" option. The problem is that this doesn't give you much useful text - usually just some metadata, and not much else.

I have used this process to open multiple PDF files and extract link - hyperlink and email - details, but the success of this has been dependent on the application used to create the PDF files (my current creator, Acrobat 9 Professional, doesn't create files that "expose" this information, whereas this information is available in files created using my previous creator - Acrobat 7 Standard).

jscher2000
2StarLounger
Posts: 148
Joined: 26 Dec 2010, 18:17

Re: Read pdf file via VBA

Post by jscher2000 »

William wrote:...the success of this has been dependent on the application used to create the PDF files (my current creator, Acrobat 9 Professional, doesn't create files that "expose" this information, whereas this information is available in files created using my previous creator - Acrobat 7 Standard).
I wonder whether this might be related to adding hidden metadata for full text indexing? See "Adobe Acrobat 9 Standard: Create and manage an index in a PDF" for how to add it (maybe it was added by default in Acrobat 7?).

Guessed
2StarLounger
Posts: 102
Joined: 04 Feb 2010, 22:44
Location: Melbourne Australia

Re: Read pdf file via VBA

Post by Guessed »

Years ago I manipulated PDF files using VBA and it worked reasonably well.

I can't remember the sources I used for my fiddling but these links will give you some useful areas to start looking at...
http://www.adobe.com/content/dam/Adobe/ ... Script.pdf
http://diaryproducts.net/for/programmer ... javascript
http://www.adobe.com/devnet/acrobat/overview.html#IAC
http://www.planetpdf.com/developer/arti ... t&gid=6624
Andrew Lockton
Melbourne Australia

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: Read pdf file via VBA

Post by macropod »

Here's some code I've used:


Public Function ReadAcrobatDocument(strFileName As String) As String
  'Note: A Reference to the Adobe Library must be set in Tools|References!
  Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc
  Dim AcroPDPage As CAcroPDPage, AcroHiliteList As CAcroHiliteList
  Dim AcroTextSelect As CAcroPDTextSelect
  Dim Content As String, i As Long, j As Long
  Set AcroApp = CreateObject("AcroExch.App")
  Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
  ' The second argument is an optional window title; "" leaves the default title.
  If AcroAVDoc.Open(strFileName, "") Then
    ' The following While-Wend loop shouldn't be necessary but timing issues may occur.
    While AcroAVDoc Is Nothing
      Set AcroAVDoc = AcroApp.GetActiveDoc
    Wend
    Set AcroPDDoc = AcroAVDoc.GetPDDoc
    For i = 0 To AcroPDDoc.GetNumPages - 1
      Set AcroPDPage = AcroPDDoc.AcquirePage(i)
      Set AcroHiliteList = CreateObject("AcroExch.HiliteList")
      ' Request up to 9000 text elements, starting at the first one on the page.
      ' Leave the loop (rather than the function) on failure so the cleanup below still runs.
      If AcroHiliteList.Add(0, 9000) <> True Then Exit For
      ' Errors are ignored only here, to cope with protected PDFs that can't be read.
      Set AcroTextSelect = Nothing
      On Error Resume Next
      Set AcroTextSelect = AcroPDPage.CreatePageHilite(AcroHiliteList)
      If Not AcroTextSelect Is Nothing Then
        For j = 0 To AcroTextSelect.GetNumText - 1
          Content = Content & AcroTextSelect.GetText(j)
        Next j
      End If
      On Error GoTo 0
    Next i
    AcroAVDoc.Close True
  End If
  ReadAcrobatDocument = Content
  AcroApp.Exit
  Set AcroTextSelect = Nothing: Set AcroPDDoc = Nothing
  Set AcroAVDoc = Nothing: Set AcroApp = Nothing
End Function
You can then call the function with code like:


Sub Demo()
  ' Note: Tasks and ActiveDocument are Word objects, so run this sub from Word.
  Dim strPDF As String
  ' The block from here to End If, plus the last line in this sub, can help if
  ' you get "ActiveX component can't create object" errors even
  ' though a Reference to Acrobat is set in Tools|References.
  Dim bTask As Boolean
  bTask = True
  If Tasks.Exists(Name:="Adobe Acrobat Professional") = False Then
    bTask = False
    Dim AdobePath As String, WshShell As Object
    Set WshShell = CreateObject("Wscript.shell")
    ' Read the registered "open" command for Acrobat and keep only the part before the first "/" switch.
    AdobePath = WshShell.RegRead("HKEY_CLASSES_ROOT\acrobat\shell\open\command\")
    AdobePath = Trim(Left(AdobePath, InStr(AdobePath, "/") - 1))
    Shell AdobePath, vbHide
  End If
  ' Replace "FilePath & Filename" with the full path and filename of the PDF file to be read.
  strPDF = ReadAcrobatDocument("FilePath & Filename")
  ActiveDocument.Range.InsertAfter strPDF
  If bTask = False Then Tasks.Item("Adobe Acrobat Professional").Close
End Sub
Note: This code is perhaps a little more complicated than you'll find elsewhere because I'm using Acrobat Pro 8 on Windows 7, where it isn't fully supported (Acrobat Pro 9 is the first version fully supported on Windows 7).
Paul Edstein
[Fmr MS MVP - Word]

kwvh
3StarLounger
Posts: 308
Joined: 24 Feb 2010, 13:41

Re: Read pdf file via VBA

Post by kwvh »

Hans, William, jscher, Guessed and Paul,

THANKS! Still struggling with trying to use the various approaches to reading specific lines within the pdf file. The pdf files may be more than one page, but everything I need is in the top 10 lines or so on the first page.

The information I need will always be prefaced with the same string per field needed. For example:
Line 3 "Student #: " would precede the information I need which is the student's number which will always be 8 characters
Line 6 "Home Room #: " would precede the room number which will always be 9 characters
Line 12 "Date of Enrollment:" would always precede the date enrolled which will always be 8 characters.

So I must find a mechanism to search the PDF for these labels and then capture the following XX characters. Is that possible?

Thanks in advance for your ideas.

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: Read pdf file via VBA

Post by macropod »

Hi Ken,

If your PDFs always have the same format, what you should be able to do is to read in the data, then discard however many characters precede the start of what you're interested in, along with however many characters follow the maximum length of what you're interested in, then parse what's left for the data you're interested in. For example, you might use the line:
strPDF = Mid(strPDF, 500, 250)
to disregard anything before the 500th character in the file and anything after the 749th character. That leaves just 250 characters to parse. Some trial and error will be required for the Mid arguments, since the number of characters in the output won't necessarily correspond with what you can see in the PDF.
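
If the labels Ken listed really do appear in the extracted text, a label search may be less fiddly than fixed offsets. The following is only a rough sketch, not tested against his files: ExtractAfterLabel is a made-up helper name, the sample string is invented, and it assumes each label and its value come out of the PDF next to each other, which isn't guaranteed.

```vba
' Hypothetical helper: returns the lngLength characters that follow the
' first occurrence of strLabel in strText, or "" if the label isn't found.
Public Function ExtractAfterLabel(ByVal strText As String, _
                                  ByVal strLabel As String, _
                                  ByVal lngLength As Long) As String
  Dim lngPos As Long
  ' vbTextCompare makes the label search case-insensitive.
  lngPos = InStr(1, strText, strLabel, vbTextCompare)
  If lngPos = 0 Then Exit Function
  ' Skip the label, drop any spaces before the value, keep the fixed-width field.
  ExtractAfterLabel = Left$(LTrim$(Mid$(strText, lngPos + Len(strLabel))), lngLength)
End Function

Sub DemoExtract()
  Dim strPDF As String
  ' Invented sample text standing in for the string returned by ReadAcrobatDocument.
  strPDF = "Student #: 12345678 Home Room #: RM-204-BX Date of Enrollment: 08/15/10"
  Debug.Print ExtractAfterLabel(strPDF, "Student #: ", 8)          ' 12345678
  Debug.Print ExtractAfterLabel(strPDF, "Home Room #: ", 9)        ' RM-204-BX
  Debug.Print ExtractAfterLabel(strPDF, "Date of Enrollment:", 8)  ' 08/15/10
End Sub
```

LTrim$ only removes spaces, so if the extracted text has tabs or line breaks between a label and its value, those would need to be stripped first. You can also apply Mid to strPDF beforehand, as above, to restrict the search to the part of the first page you care about.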

If you have problems doing this, post a sample PDF and we'll see what we can do.
Paul Edstein
[Fmr MS MVP - Word]