Step-by-Step Guide to Building a PDF Analyzer Application

Developing a PDF Analyzer in Python

This guide walks through building a PDF Analyzer application in Python, which can serve as a reference for a Python assignment involving document management and metadata extraction. The application scans a folder of PDF files, records their metadata, and lets you search that metadata by keyword. Whether you're a student, researcher, or professional, handling PDF documents efficiently can save time, and you can adapt the code shown here to your own needs.

PDF Analyzer Application

This code defines a Python class called `PDFAnalyzer` that uses the `tkinter` library to create a graphical user interface (GUI) application for analyzing PDF files. The application allows users to select a folder containing PDF files, extract metadata from those files, and search for PDFs based on metadata criteria. The main components and their functionality are divided into several blocks as follows:

Block 1: Importing Libraries


```python
from tkinter import *
from tkinter import filedialog as fd, ttk
import pikepdf
import csv
import re
import os
from keywords import extract_keywords
```

In this block, the necessary libraries are imported. `tkinter` is used for creating the GUI, `pikepdf` for working with PDF files, `csv` for handling CSV files, `re` for regular expressions, and `os` for file system operations. Additionally, a custom function `extract_keywords` is imported from a module named `keywords`.
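The `keywords` module itself is not shown in this guide. As a rough idea of what its ranking step might look like, here is a minimal sketch that picks the most frequent words from text that has already been extracted. The function name `keywords_from_text`, the `STOP_WORDS` set, and the `top_n` parameter are assumptions made for illustration; the real `extract_keywords` takes a PDF path and would also need a text-extraction step that is not covered here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def keywords_from_text(text, top_n=5):
    """Return the top_n most frequent words in text, ignoring stop words."""
    # Keep lowercase alphabetic words of three or more letters
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _count in counts.most_common(top_n)]


print(keywords_from_text("PDF metadata and PDF search: metadata for the PDF", top_n=2))
```

Frequency counting is only one simple way to choose keywords; the module used by the application may work differently.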

Block 2: Class Definition


```python
class PDFAnalyzer:
    def __init__(self) -> None:
        # Constructor method
        self.analyzed_pdf_files = []  # List to store analyzed PDFs
        self.search_results = []  # List to store search results

    def analyze_pdf(self, pdf_path):
        # Method to analyze a specific PDF file
        # Extract metadata and add it to analyzed_pdf_files
        pass

    def search_pdf(self, keyword):
        # Method to search for PDFs based on a keyword
        # Populate search_results with matching PDFs
        pass

    def generate_report(self, output_path):
        # Method to generate a summary report of analyzed PDFs
        pass


# Create an instance of the PDFAnalyzer class
pdf_analyzer = PDFAnalyzer()
```

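This block outlines the `PDFAnalyzer` class as a skeleton: the constructor sets up two lists, and the placeholder methods mark where the analysis, search, and reporting logic belong. The GUI setup in Blocks 3 to 8 refers to `self`, so it belongs inside the constructor, and the working methods of the application are `select_dir`, `generate_csv`, and `search`, shown in Blocks 9 to 11.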

Block 3: Initializing the GUI


```python
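# Runs inside PDFAnalyzer.__init__; tkinter is already imported in Block 1
# (a wildcard import is only allowed at module level)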
# Create the root window
self.window = Tk()
# Set window title
self.window.title('PDF Analyzer')
# Set window size
self.window.geometry("700x600")
# Set window background color
self.window.config(background="white")
```

This part initializes the main window for the GUI application using `tkinter`. It sets the title, size, and background color for the window.

Block 4: Creating User Interface Elements


```python
# Create user interface elements (buttons, labels, input fields)
self.btn_select_folder = Button(self.window, text='Select Folder', width=25, command=self.select_dir)
self.label_search_folder = Label(self.window, text="Click the button to browse the Folder containing PDF files")
self.label_analyze_progress = Label(self.window, text="Status")
self.label_search_keyword = Label(self.window, text="Input your keyword...")
self.input_search_keyword = ttk.Entry(self.window)
self.btn_search_keyword = Button(self.window, text="Search", width=25, command=self.search)
```

This section creates various user interface elements, including buttons, labels, and input fields, and sets their properties. These elements are used to interact with the application.

Block 5: Creating PDF File List Table


```python
# Creating a table - pdf_file_list
self.columns = ("No", "url")
self.tree_pdf_files = ttk.Treeview(self.window, columns=self.columns, show="headings")
self.tree_pdf_files.grid(column=0, columnspan=3, row=0)
self.tree_pdf_files.heading("No", text="No")
self.tree_pdf_files.column("#1", width=60)
self.tree_pdf_files.heading("url", text="url")
self.tree_pdf_files.column("#2", width=500)
self.scrollbar = ttk.Scrollbar(
    # The scrollbar must call the table's yview method, not the widget itself
    self.window, orient=VERTICAL, command=self.tree_pdf_files.yview
)
self.tree_pdf_files.configure(yscrollcommand=self.scrollbar.set)
self.scrollbar.grid(column=3, row=0, rowspan=1, sticky=NS)
```

This section sets up a table (using `ttk.Treeview`) to display a list of PDF files in the selected folder. It configures the table's columns, headings, and scrollbar.

Block 6: Creating Search List Table


```python
# Creating a table - search_list
self.tree_search_list = ttk.Treeview(self.window, columns=self.columns, show="headings")
self.tree_search_list.grid(column=0, columnspan=3, row=4)
self.tree_search_list.heading("No", text="No")
self.tree_search_list.column("#1", width=60)
self.tree_search_list.heading("url", text="url")
self.tree_search_list.column("#2", width=500)
self.scrollbar = ttk.Scrollbar(
    # The scrollbar must call the table's yview method, not the widget itself
    self.window, orient=VERTICAL, command=self.tree_search_list.yview
)
self.tree_search_list.configure(yscrollcommand=self.scrollbar.set)
self.scrollbar.grid(column=3, row=4, rowspan=1, sticky=NS)
```

This part is similar to Block 5 but configures a separate table for displaying search results.

Block 7: Placing GUI Elements


```python
# Placing user interface elements in the window
self.label_search_folder.grid(column=0, row=1)
self.btn_select_folder.grid(column=2, row=1)
self.label_analyze_progress.grid(column=0, row=2)
self.label_search_keyword.grid(column=0, row=3)
self.input_search_keyword.grid(column=1, row=3, sticky=NSEW, padx=10)
self.btn_search_keyword.grid(column=2, row=3)
```

This block positions the previously created UI elements in the window, specifying their layout within the application's interface.

Block 8: Window Event Handling


```python
# Window event handling
def on_closing():
    self.window.destroy()

self.window.protocol("WM_DELETE_WINDOW", on_closing)
self.window.mainloop()
```

This block defines an event handler for the window's close button, allowing the application to gracefully exit when the user closes the window.

Block 9: Selecting a Folder


```python
def select_dir(self):
    # Ask the user for a folder; the result is empty if the dialog is cancelled
    folder_path = fd.askdirectory(initialdir="./", title="Select a directory")
    if folder_path:
        self.label_search_folder.config(text=folder_path)
        self.label_analyze_progress.config(text="Analyzing files...")
        self.generate_csv(folder_path)
```

This method is called when the "Select Folder" button is pressed. It opens a file dialog for the user to choose a folder containing PDF files and then triggers the PDF analysis process.
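The folder path returned by the dialog is what the analysis step scans. A GUI-free sketch of that scan is shown below so it can be tried without opening a window. The helper name `find_pdf_files` is hypothetical and not part of the application; it compares the extension case-insensitively so that files ending in `.PDF` are included as well.

```python
import os


def find_pdf_files(folder):
    """Return the sorted paths of all PDF files under folder, including subfolders."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in filenames:
            # Lowercase the name so '.pdf' and '.PDF' both match
            if filename.lower().endswith(".pdf"):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # List the PDFs under the current directory
    for pdf_path in find_pdf_files("."):
        print(pdf_path)
```

Keeping the scan in its own function makes it possible to check which files would be analyzed before any metadata is read.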

Block 10: Generating CSV Metadata


```python
def generate_csv(self, path):
    pdf_file_list = []
    fieldnames = ['Name', 'Title', 'Author', 'CreationDate', 'Keywords', 'Short summary']
    index = 0
    # Clear pdf_file_tree view
    for i in self.tree_pdf_files.get_children():
        self.tree_pdf_files.delete(i)
    # Get lists of names for all PDF files in the folder
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            if filename.lower().endswith('.pdf'):  # Also accept '.PDF'
                pdf_path = os.path.join(dirpath, filename)
                pdf_file_list.append(pdf_path)
                index = index + 1
                self.tree_pdf_files.insert("", END, values=(index, pdf_path))
    self.label_analyze_progress.config(text='Loading files finished.')
    # Save metadata to a CSV file
    with open('metadata.csv', 'w', encoding='UTF8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for pdf_file in pdf_file_list:
            # Start each row with empty fields; 'Short summary' is not filled in here
            meta_info = {
                'Name': pdf_file,
                'Title': '',
                'Author': '',
                'CreationDate': '',
                'Keywords': '',
                'Short summary': ''
            }
            with pikepdf.Pdf.open(pdf_file) as pdf:  # Open the PDF file and close it afterwards
                for key, value in pdf.docinfo.items():  # Make metadata from info
                    key_data = key[1:]  # Drop the leading '/' from keys such as '/Title'
                    # Convert pikepdf values to plain strings before storing them
                    if key_data in fieldnames and str(value) != '':
                        meta_info[key_data] = str(value)
            keywords = extract_keywords(pdf_file)  # Get keywords from the PDF file
            # Join the keywords as "first. second. "
            meta_info['Keywords'] = ''.join(f'{keyword}. ' for keyword in keywords)
            writer.writerow(meta_info)  # Write data to the CSV file
    self.label_analyze_progress.config(text="Analyzing finished.")
```

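This method walks the selected folder, lists every PDF it finds in the table, and writes one row of metadata per file to a CSV file named `metadata.csv`. Each row holds the file path (`Name`), the `Title`, `Author`, and `CreationDate` read from the PDF's document info, and the `Keywords` returned by `extract_keywords`. The `Short summary` column is written to the file but left empty by this code.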
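To make the row-building step easier to follow in isolation, the sketch below maps a plain dictionary of document-info entries onto the CSV columns. It assumes, as the method above does, that the keys carry a leading slash (for example `/Title`). The names `docinfo_to_row` and `rows_to_csv` are hypothetical helpers introduced for illustration, and an ordinary `dict` stands in for the object that `pikepdf` returns.

```python
import csv
import io

FIELDNAMES = ["Name", "Title", "Author", "CreationDate", "Keywords", "Short summary"]


def docinfo_to_row(name, docinfo):
    """Build one CSV row from a mapping of '/Key' -> value pairs."""
    row = {field: "" for field in FIELDNAMES}
    row["Name"] = name
    for key, value in docinfo.items():
        field = key.lstrip("/")  # '/Title' -> 'Title'
        # Copy only known columns, and never overwrite the file name
        if field in FIELDNAMES and field != "Name" and str(value) != "":
            row[field] = str(value)
    return row


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = docinfo_to_row("report.pdf", {"/Title": "Annual Report", "/Producer": "SomeTool"})
print(rows_to_csv([row]))
```

Entries that do not correspond to a column, such as `/Producer` here, are dropped, which mirrors the `key_data in fieldnames` check in the method.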

Block 11: Searching for PDFs


```python
def search(self):
    search_text = self.input_search_keyword.get()  # Get search text
    index = 0
    # Escape the input so it is matched literally, ignoring case
    pattern = re.compile(re.escape(search_text), re.IGNORECASE)
    for i in self.tree_search_list.get_children():
        self.tree_search_list.delete(i)
    with open('metadata.csv', 'r', encoding='UTF8', newline='') as f:  # Open the CSV file containing metadata
        csv_reader = csv.reader(f)
        next(csv_reader, None)  # Skip the header row
        for line in csv_reader:
            # Find text in the name, title, author, and keywords columns
            if any(pattern.search(line[col]) for col in (0, 1, 2, 4)):
                index = index + 1
                self.tree_search_list.insert("", END, values=(index, line[0]))
```

This method is called when the "Search" button is pressed. It searches for PDF files in the metadata CSV file that match the user-provided search criteria (keywords). Matching files are displayed in the search results table.
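The same lookup can be written without the GUI, which makes it easy to try against a `metadata.csv` file directly. The following is a minimal sketch under the assumption that the file uses the column names written by Block 10; the function name `search_metadata` and the `SEARCH_FIELDS` tuple are hypothetical. It reads rows by column name with `csv.DictReader`, so the header row is never treated as data.

```python
import csv

# Columns consulted by the search (assumed to match the CSV header)
SEARCH_FIELDS = ("Name", "Title", "Author", "Keywords")


def search_metadata(csv_path, text):
    """Return the Name of every row whose searched fields contain text (case-insensitive)."""
    needle = text.strip().lower()
    if not needle:
        return []  # An empty query matches nothing in this sketch
    with open(csv_path, "r", encoding="UTF8", newline="") as f:
        return [
            row["Name"]
            for row in csv.DictReader(f)
            if any(needle in (row.get(field) or "").lower() for field in SEARCH_FIELDS)
        ]
```

For example, `search_metadata('metadata.csv', 'smith')` would return the paths of all PDFs whose name, title, author, or keywords contain "smith". Plain substring matching is used here instead of a regular expression, since the search text is meant to be taken literally.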

Block 12: Application Entry Point


```python
# Start the application only when the file is run as a script
if __name__ == "__main__":
    pdfAnalyzer = PDFAnalyzer()
```

Finally, an instance of the `PDFAnalyzer` class is created, which initiates the GUI application when the script is executed.

Conclusion

The PDF Analyzer simplifies the management of PDF documents: it lists the PDFs in a folder, saves their metadata to a CSV file, and lets you search that metadata by keyword. That makes it useful for students, researchers, professionals, and anyone who works with many PDF files, because it reduces the time spent organizing and locating documents and leaves more time for the work and research itself.
