Unable to collect all the lines under transactions from a pdf file

Question

Asked by robots.txt on 21 February 2024 at 09:42

I'm trying to extract all the lines under the transactions table from this PDF file. The script I've written only captures the first line under the first and last headers. How can I collect all the transaction lines from that page?

import os
import io
import re
import requests
import pdfplumber

pdf_url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf'

response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        text_content = ""
        for page in pdf.pages:
            text_content += page.extract_text()

pattern = r'(?:iD owner asset transaction Date notification amount cap\.\s*type Date gains >\s*\$200\?\s*|iD owner asset transaction Date notification(?: amount)?\s*type Date\s*)\s*([^\n]+)'
matches = re.findall(pattern, text_content, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match.strip())

Current output:

JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000

For your reference, this is the type of line I'm interested in:

Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000

**Andrej Kesely** · Accepted Answer · 2024-02-21T10:05:14.703000

Perhaps you can use a simpler strategy: find all lines containing $:

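import io

import pdfplumber
import requests

pdf_url = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf"

response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        out = []
        for page in pdf.pages:
            # Fall back to "" in case a page yields no extractable text
            for line in (page.extract_text() or "").splitlines():
                # Every transaction line ends with a dollar amount range
                if "$" in line:
                    # Drop the leading owner code; str.removeprefix needs Python 3.9+
                    out.append(line.removeprefix("JT "))

print(out)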

Prints:

[
    "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
    "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "barrick gold Corporation (AbX) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Eldorado gold Corporation Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "First Trust ISE-Revere Natural gas S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $15,001 - $50,000",
    "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Kinross gold Corporation (KgC) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "North American Palladium, ltd. (PAl) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pan American Silver Corp. (PAAS) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pilot gold, Inc Ordinary Shares (PlgTF) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pinetree Capital ltd Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Rare Element Resources ltd. Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "SPdR S&P International dividend ETF P 07/1/2016 07/1/2016 $1,001 - $15,000",
    "u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000",
    "Yamana gold Inc. Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
]
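
If the individual fields are needed rather than whole lines, each collected line could be split with a regular expression. The following is only a sketch and not part of the original answer: the `parse_line` helper, the `LINE_RE` pattern, and the field names are hypothetical, and the pattern assumes the layout seen in the output above (an optional `JT` owner code, the asset name, a one-letter transaction type, two dates, and a dollar range).

```python
import re

# Hypothetical pattern based on the lines printed above; other owner codes
# or amount formats in other filings would need the pattern extended.
LINE_RE = re.compile(
    r"^(?:(?P<owner>JT)\s+)?"                    # optional owner code
    r"(?P<asset>.+?)\s+"                         # asset name (lazy match)
    r"(?P<type>[A-Z])\s+"                        # one-letter transaction type
    r"(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+"        # transaction date
    r"(?P<notified>\d{1,2}/\d{1,2}/\d{4})\s+"    # notification date
    r"(?P<amount>\$[\d,]+\s*-\s*\$[\d,]+)$"      # amount range
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not a transaction."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


sample = "JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000"
row = parse_line(sample)
print(row)
```

Lines that do not fit the pattern, such as the filing status line, return None, so the helper could also serve as a stricter filter than the `"$" in line` check.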
