Payton Flint's Tech Blog

PowerShell – Extract Text From .PDF Files

Posted on October 1, 2023 by paytonflint

A unique challenge presented itself the other day: extracting text from .PDF files. I found the iText-based PSWritePDF module in the PowerShell Gallery, which offers this capability; however, it proved to be an incomplete solution because many .PDF files (particularly those created with an automated tool, like GhostScript or similar) contain old PostScript Type 3 fonts. My understanding is that each of these fonts may have its own “decoder ring” of sorts. I put a script together to attempt to decode the extracted data using all of the common text encoding types, and had no luck.
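A minimal sketch of what such a brute-force decoding attempt might look like. This is an illustration only, not the script described above: the function name `Get-DecodeCandidates`, the list of encodings, and the sample bytes are all hypothetical.

```powershell
# Decode the same bytes with several common encodings and return one row per attempt.
function Get-DecodeCandidates {
    param (
        [Parameter(Mandatory)] [byte[]] $Bytes,
        # Assumed list of "common" encodings; extend as needed
        [string[]] $EncodingNames = @('us-ascii', 'utf-8', 'utf-16', 'utf-16BE', 'utf-32', 'iso-8859-1')
    )
    foreach ($name in $EncodingNames) {
        $encoding = [System.Text.Encoding]::GetEncoding($name)
        [pscustomobject]@{
            Encoding = $name
            Text     = $encoding.GetString($Bytes)
        }
    }
}

# Hypothetical sample: plain UTF-8 bytes, so at least one candidate is readable.
# Bytes pulled from a Type 3 font would typically be unreadable in every row.
$sampleBytes = [System.Text.Encoding]::UTF8.GetBytes('Invoice 1042')
Get-DecodeCandidates -Bytes $sampleBytes | Format-Table -AutoSize
```

If every row comes back as gibberish, the character codes probably do not follow any standard encoding, which is consistent with the “decoder ring” behavior described above.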

The solution is to use Optical Character Recognition, or OCR, which uses the shape of the text characters to recognize and convert the text. Tesseract-OCR is an excellent tool for this task. However, it does not accept .PDF files, only image file formats. This adds a step: converting the .PDF to an image file, which can be accomplished with GhostScript. I also did not find a way to have Tesseract return the text directly, only to output a .TXT file. Both of these tools provide a command-line interface, so I’ve scripted out these steps, retrieved the text, and automated the deletion of the ephemeral image and text files.
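The two-step workflow can be sketched for a single file as follows. This is a simplified illustration under stated assumptions, not the full script: the helper names `New-GhostscriptArgs` and `New-TesseractArgs` and the path `C:\Scans\invoice.pdf` are hypothetical, and the executable paths are the default install locations used later in this post.

```powershell
# Build the Ghostscript arguments that render a PDF to a PNG image.
function New-GhostscriptArgs {
    param ([string] $PdfPath, [string] $PngPath, [int] $Dpi = 144)
    @('-sDEVICE=pngalpha', "-r$Dpi", '-o', $PngPath, $PdfPath)
}

# Build the Tesseract arguments that OCR an image into <OutputBase>.txt.
function New-TesseractArgs {
    param ([string] $ImagePath, [string] $OutputBase, [string] $Language = 'eng')
    @($ImagePath, $OutputBase, '-l', $Language)
}

$pdf  = 'C:\Scans\invoice.pdf'           # hypothetical input file
$base = $pdf -replace '\.pdf$', ''       # Tesseract appends .txt to this base name
$png  = "$base.png"

$gsArgs   = New-GhostscriptArgs -PdfPath $pdf -PngPath $png
$tessArgs = New-TesseractArgs -ImagePath $png -OutputBase $base

$gsExe   = 'C:\Program Files\gs\gs10.02.0\bin\gswin64c.exe'
$tessExe = 'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Only run when both tools and the input file are actually present.
if ((Test-Path -LiteralPath $gsExe) -and (Test-Path -LiteralPath $tessExe) -and (Test-Path -LiteralPath $pdf)) {
    & $gsExe @gsArgs | Out-Null           # PDF -> temporary PNG
    & $tessExe @tessArgs 2>$null          # PNG -> temporary TXT
    $text = Get-Content -LiteralPath "$base.txt"
    Remove-Item -LiteralPath $png, "$base.txt" -Force
    $text
}
```

Passing the arguments as an array to the call operator lets PowerShell handle quoting for paths that contain spaces. The Tesseract manual also describes `stdout` as a special output base name, which would avoid the temporary .TXT file if it behaves the same way on the Windows build used here; treat that as something to test rather than a given.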

I ran into dependency loop issues when attempting to install these tools with NuGet, and despite reaching out to the owners and offering assistance with remediation of these issues, I received no response. It is worth mentioning that I do not own these tools, and they have their own licensing. You must mind their respective licenses; this script is merely a means to use these tools together to complete a workflow. I scripted the install from their respective sources and included logic to check for their presence. I don’t find it to be a perfect solution, but it does accomplish the task at hand.
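One possible way to make the presence check less dependent on a single hard-coded path is sketched below. The helper name `Find-Tool` is hypothetical and is not part of the script in this post; it checks PATH first and then falls back to a list of candidate locations, which may contain wildcards so that a different GhostScript version folder still matches.

```powershell
# Return the full path of an executable, or nothing if it cannot be found.
function Find-Tool {
    param (
        [Parameter(Mandatory)] [string] $Name,   # executable name, e.g. 'tesseract.exe'
        [string[]] $CandidatePaths = @()         # install locations to try; wildcards allowed
    )
    # Prefer whatever is already reachable through PATH.
    $command = Get-Command -Name $Name -CommandType Application -ErrorAction SilentlyContinue |
        Select-Object -First 1
    if ($command) { return $command.Path }

    # Otherwise return the first candidate location that exists.
    foreach ($pattern in $CandidatePaths) {
        $match = Get-Item -Path $pattern -ErrorAction SilentlyContinue | Select-Object -First 1
        if ($match) { return $match.FullName }
    }
}

$tesseractPath   = Find-Tool -Name 'tesseract.exe' -CandidatePaths "$env:ProgramFiles\Tesseract-OCR\tesseract.exe"
$ghostScriptPath = Find-Tool -Name 'gswin64c.exe'  -CandidatePaths "$env:ProgramFiles\gs\*\bin\gswin64c.exe"

if (-not $tesseractPath)   { Write-Warning 'Tesseract was not found; it would need to be installed first.' }
if (-not $ghostScriptPath) { Write-Warning 'GhostScript was not found; it would need to be installed first.' }
```

With this approach the install step would only run when the function returns nothing, and an upgrade to a newer GhostScript release would not trigger a reinstall of the older version.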

In short, this script can extract and return text from a given set of .PDF files using OCR. Paired with RegEx, this could be a very useful tool for batch processing files.
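As a rough illustration of the RegEx pairing, the sketch below pulls a few fields out of OCR output. The sample lines, field names, and patterns are hypothetical; in practice `$text` would be the value returned by Get-PdfOcrText, and the patterns would be tailored to the documents being processed.

```powershell
# Hypothetical OCR output standing in for the result of Get-PdfOcrText.
$text = @(
    'ACME Corp Invoice'
    'Invoice No: INV-20417'
    'Date: 2023-09-28'
    'Total Due: $1,284.50'
)

# One pattern per field of interest (assumed formats).
$patterns = [ordered]@{
    InvoiceNumber = 'INV-\d+'
    Date          = '\d{4}-\d{2}-\d{2}'
    Total         = '\$[\d,]+\.\d{2}'
}

# Search the whole document at once and keep the first match for each field.
$joined = $text -join "`n"
$fields = [ordered]@{}
foreach ($name in $patterns.Keys) {
    $match = [regex]::Match($joined, $patterns[$name])
    $fields[$name] = if ($match.Success) { $match.Value } else { $null }
}

[pscustomobject]$fields
```

OCR can confuse similar-looking characters (for example `0` and `O`), so patterns that tolerate such substitutions tend to hold up better in batch runs.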

Here is the GitHub link: https://github.com/p8nflnt/SysAdmin-Toolbox/blob/main/Get-PdfOcrText.ps1

And here is the script:

<#
.SYNOPSIS
    Capture text from all single-page .PDF documents in target directory via OCR
    Uses open-source Ghostscript (https://www.ghostscript.com/index.html)
    Uses open-source Tesseract-OCR (https://github.com/tesseract-ocr/tesseract)

.NOTES
    Name: Get-PdfOcrText
    Author: Payton Flint
    Version: 1.0
    DateCreated: 2023-Sep

.LINK
    https://paytonflint.com/powershell-extract-text-from-pdf-files/
    https://github.com/p8nflnt/SysAdmin-Toolbox/blob/main/Get-PdfOcrText.ps1
#>

Function Get-PdfOcrText {
    param (
        $pdfFileStore
    )

    # identify location of script
    #$PSScriptRoot = Split-Path ($MyInvocation.MyCommand.Path) -Parent

    # Update-EnvironmentVariables Function
    # Adds PATH entries written by an installer (Machine/User scope) to the current session.
    # Nothing is written to the output stream, so the function cannot pollute the text
    # returned by Get-PdfOcrText.
    Function Update-EnvironmentVariables {
        # Read the persisted PATH values
        $machinePath = [Environment]::GetEnvironmentVariable('Path', 'Machine')
        $userPath    = [Environment]::GetEnvironmentVariable('Path', 'User')
        # Identify entries that the current session does not have yet
        $current = @($env:Path -split ';' | Where-Object { $_ })
        $fresh   = @(("$machinePath;$userPath" -split ';') | Where-Object { $_ -and ($current -notcontains $_) })
        # Append the new entries to the session PATH
        $env:Path = ($current + $fresh) -join ';'
    } # End of Update-EnvironmentVariables Function

    # specify path to tesseract.exe (TesseractOCR)
    $tesseractPath = "$env:SystemDrive\Program Files\Tesseract-OCR\tesseract.exe"

    # install TesseractOCR if it is not present
    # https://github.com/UB-Mannheim/tesseract/wiki
    If (!(Test-Path -Path $tesseractPath)){
        Write-Host -ForegroundColor Red "`'tesseract.exe`' not present @ $tesseractPath."
        Write-host -ForegroundColor Red "Please install Tesseract or specify new location."

        # specify Tesseract source URL
        $tesseractURL = "https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-5.3.1.20230401.exe"

        # derive file name from url
        $tesseractFileName = $tesseractURL -split '/'
        $tesseractFileName = $tesseractFileName[-1]

        # build destination path
        $tesseractInstallPath = Join-Path $PSScriptRoot $tesseractFileName

        # download .exe from url to script root
        Invoke-WebRequest -Uri $tesseractURL -OutFile $tesseractInstallPath

        # run .exe installer
        Start-Process -FilePath $tesseractInstallPath -Wait

        # remove file when done
        Remove-Item -Path "$tesseractInstallPath"
    }

    # specify path to gswin64c.exe (GhostScript)
    $ghostScriptPath = "$env:SystemDrive\Program Files\gs\gs10.02.0\bin\gswin64c.exe"

    # install GhostScript if it is not present
    # https://ghostscript.readthedocs.io/en/latest/?utm_content=cta-header-link&utm_medium=website&utm_source=ghostscript
    If (!(Test-Path -Path $ghostScriptPath)) {
        $ghostScriptUrl = 'https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs10020/gs10020w64.exe'

        # derive file name from url
        $ghostScriptFileName = $ghostScriptUrl -split '/'
        $ghostScriptFileName = $ghostScriptFileName[-1]

        # build destination path
        $ghostScriptInstallPath = Join-Path $PSScriptRoot $ghostScriptFileName

        # download .exe from url to script root
        Invoke-WebRequest -Uri $ghostScriptUrl -OutFile $ghostScriptInstallPath

        # run .exe installer
        Start-Process -FilePath $ghostScriptInstallPath -Wait

        # Update Environment Variables - gswin64 path is written to $env:PATH on install
        Update-EnvironmentVariables

        # remove file when done
        Remove-Item -Path "$ghostScriptInstallPath"
    }

    # specify ghostscript input file extension
    $gsInFileExt = '.pdf'

    # add wildcard to file extension to get all files of that type
    $gsInFileExt = '*' + $gsInFileExt

    # get all .pdf files from input file store
    $gsInFiles = $pdfFileStore | Get-ChildItem -Filter $gsInFileExt

    # specify ghostscript output file extension
    $gsOutFileExt = '.png'

    # specify tesseract output file extension
    $tessOutFileExt = '.txt'

    # specify language abbreviation for TesseractOCR 
    # refer to Tesseract documentation
    $tessOCRLang = 'eng' # eng = english

    # initialize PDF OCR text array
    $pdfOcrText = @()

    $gsInFiles | ForEach-Object {

        # specify input .pdf file full name for GhostScript
        $gsInFile = $_.FullName

        # specify temp .png output file full name for GhostScript
        $gsOutFile = Join-Path $_.DirectoryName ($_.BaseName + $gsOutFileExt)

        # convert .pdf input file to temp .png output file via GhostScript
        # full, quoted paths are used so that names containing spaces are passed as one argument
        # and the caller's working directory does not need to change
        Start-Process -FilePath $ghostScriptPath -ArgumentList "-sDEVICE=pngalpha -r144 -o `"$gsOutFile`" `"$gsInFile`"" -Wait

        # specify temp .png input file name for TesseractOCR
        $tessInFile = $gsOutFile

        # specify temp .txt output file basename for TesseractOCR
        $tessOutFile = Join-Path $_.DirectoryName $_.BaseName

        # convert temp .png input file to temp .txt output file via TesseractOCR
        Start-Process -FilePath $tesseractPath -ArgumentList "`"$tessInFile`" `"$tessOutFile`" -l $tessOCRLang txt" -Wait

        # specify temp .txt output file full name w/ extension
        $tessOutFile = $tessOutFile + $tessOutFileExt

        # remove temp .png file
        Remove-Item -Path $tessInFile -Force

        # get content from temp .txt file
        $pdfOcrText += Get-Content -Path $tessOutFile

        # remove temp .txt file
        Remove-Item -Path $tessOutFile -Force
    }
    $pdfOcrText
} # End Function Get-PdfOcrText

$pdfFileStore = "<DIRECTORY PATH>"

$text = Get-PdfOcrText $pdfFileStore

$text

© 2022 Payton Flint | The views and opinions expressed on this website belong solely to the author/owner and do not represent the perspectives of any individuals, institutions, or organizations, whether affiliated personally or professionally, unless explicitly stated otherwise. The content and products on this website are provided as-is with no warranties or guaranties, are for informational/demonstrative purposes only, do not constitute professional advice, and are not to be used maliciously. The author/owner is not responsible for any consequences arising from actions taken based on information provided on this website, nor from the use/misuse of products from this site. All trademarks are the property of their respective owners.