I'm trying to extract all the lines under the transactions table from this PDF file. The script I've written only captures the first line under the first and last table headers. How can I collect all the transaction lines from that page?
import os
import io
import re
import requests
import pdfplumber
pdf_url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf'
response = requests.get(pdf_url)
with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        text_content = ""
        for page in pdf.pages:
            text_content += page.extract_text()
pattern = r'(?:iD owner asset transaction Date notification amount cap\.\s*type Date gains >\s*\$200\?\s*|iD owner asset transaction Date notification(?: amount)?\s*type Date\s*)\s*([^\n]+)'
matches = re.findall(pattern, text_content, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match.strip())
Current output:
JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000
For your reference, this is the type of line I'm interested in:
Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
Perhaps you can use a simpler strategy: find all lines that contain a dollar sign ($).
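A minimal sketch of that line-filtering idea, not necessarily the exact code the answer had in mind. The helper name `transaction_lines` and the `ROW` pattern are illustrative. On top of the plain `$` check, the pattern also requires a date, which is an added assumption so that header text such as "gains > $200?" is skipped. The `sample` string is assembled by hand from the header pattern and output shown in the question; it is not real extracted PDF text.

```python
import re

# Assumption: a transaction row contains a date (M/D/YYYY) followed later on
# the same line by a dollar amount. Requiring the date also filters out header
# text such as "gains > $200?", which contains "$" but is not a transaction.
ROW = re.compile(r"\d{1,2}/\d{1,2}/\d{4}.*\$\d")


def transaction_lines(text):
    """Return the stripped lines of `text` that look like transaction rows."""
    return [line.strip() for line in text.splitlines() if ROW.search(line)]


# Hand-built stand-in for `text_content`, using lines quoted in the question.
sample = """\
iD owner asset transaction Date notification amount cap.
type Date gains > $200?
JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000
"""

for row in transaction_lines(sample):
    print(row)
```

With the script from the question, you would pass `text_content` to `transaction_lines` in place of `sample`. If a page has no extractable text, `page.extract_text()` may return `None` in some pdfplumber versions, so accumulating `page.extract_text() or ""` is a safer way to build `text_content`.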