Read Metadata from PDF using Java

PDF metadata holds key details about a document, including its title, author, creation and modification dates, and keywords. Extracting it is useful in a range of applications, from document management systems to data analysis and automation tasks. This article explains how to read metadata from a PDF using Java, with a step-by-step breakdown of the procedure and sample code that illustrates it.

Steps to Read Metadata from PDF using Java

  1. Set up your IDE to use GroupDocs.Metadata for Java to extract metadata from PDF files
  2. Instantiate a Metadata object, passing the PDF file path to its constructor
  3. Define the specifications (rules) used to filter the metadata properties
  4. Pass a specification as the search condition to the Metadata.findProperties method
  5. Iterate through the returned properties and read the name and value of each one
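
The filter-then-iterate flow in the steps above can be tried out without the library. The following is a minimal stdlib-only sketch, not the GroupDocs API: the class PropertyFilterSketch, its findProperties helper, and the sample property names and values are all hypothetical, and java.time.LocalDate stands in for the library's date values. It only illustrates how a condition is built, combined with and, passed to a search, and how the hits are then looped over.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Stdlib-only stand-in for the filter-then-iterate flow; NOT the GroupDocs API.
class PropertyFilterSketch {

    // Keeps only the entries accepted by the condition, preserving insertion order.
    static Map<String, Object> findProperties(Map<String, Object> all,
                                              Predicate<Map.Entry<String, Object>> condition) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : all.entrySet()) {
            if (condition.test(entry)) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    // Hypothetical sample data playing the role of a document's metadata.
    static Map<String, Object> sampleProperties() {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("Title", "Quarterly report");
        props.put("Author", "Jane Doe");
        props.put("CreatedDate", LocalDate.of(2023, 5, 17));
        props.put("ModifiedDate", LocalDate.of(2024, 1, 9));
        return props;
    }

    // Condition: the value is a date.
    static Predicate<Map.Entry<String, Object>> isDate() {
        return entry -> entry.getValue() instanceof LocalDate;
    }

    // Condition: the value is a date that falls in the given year.
    static Predicate<Map.Entry<String, Object>> inYear(int year) {
        return entry -> entry.getValue() instanceof LocalDate
                && ((LocalDate) entry.getValue()).getYear() == year;
    }

    public static void main(String[] args) {
        // Combine two conditions, run the search, then iterate over the hits.
        Map<String, Object> hits = findProperties(sampleProperties(), isDate().and(inYear(2024)));
        for (Map.Entry<String, Object> hit : hits.entrySet()) {
            System.out.printf("Property name: %s, Property value: %s%n", hit.getKey(), hit.getValue());
        }
    }
}
```

With the sample data, only the ModifiedDate entry passes both conditions. In the real library the same roles are played by specification objects and Metadata.findProperties, as the code later in this article shows.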

The extracted properties, such as title, author, creation and modification dates, and keywords, can feed document management systems, data analysis, and automated workflows. You can follow these instructions on Windows, macOS, or Linux, as long as Java is installed. Apart from the GroupDocs.Metadata library itself, no additional software is required to extract PDF metadata in Java. After configuring the library and adjusting the file paths as needed, you can integrate the following code into your project.

Code to Read Metadata from PDF using Java

import com.groupdocs.metadata.Metadata;
import com.groupdocs.metadata.core.FileFormat;
import com.groupdocs.metadata.core.IReadOnlyList;
import com.groupdocs.metadata.core.MetadataProperty;
import com.groupdocs.metadata.core.MetadataPropertyType;
import com.groupdocs.metadata.licensing.License;
import com.groupdocs.metadata.search.FallsIntoCategorySpecification;
import com.groupdocs.metadata.search.OfTypeSpecification;
import com.groupdocs.metadata.search.Specification;
import com.groupdocs.metadata.tagging.Tags;
import java.util.Calendar;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadMetadataFromPDFUsingJava {

    public static void main(String[] args) {
        // Set the license to avoid the limitations of the Metadata library
        License license = new License();
        license.setLicense("GroupDocs.Metadata.lic");

        // try-with-resources releases the document once processing is done
        try (Metadata metadata = new Metadata("input.pdf")) {
            // Skip files of an unknown format and encrypted documents
            if (metadata.getFileFormat() != FileFormat.Unknown && !metadata.getDocumentInfo().isEncrypted()) {
                System.out.println();

                // Fetch all metadata properties that fall into a particular category
                IReadOnlyList<MetadataProperty> properties = metadata.findProperties(
                        new FallsIntoCategorySpecification(Tags.getContent()));
                System.out.println("The metadata properties describing some characteristics of the file content: title, keywords, language, etc.");
                printProperties(properties);

                // Fetch all properties having a specific type and value
                int year = Calendar.getInstance().get(Calendar.YEAR);
                properties = metadata.findProperties(
                        new OfTypeSpecification(MetadataPropertyType.DateTime).and(new YearMatchSpecification(year)));
                System.out.println("All datetime properties with the year value equal to the current year");
                printProperties(properties);

                // Fetch all properties whose names match the specified regex.
                // The alternation is grouped so that ^ and $ apply to every branch.
                Pattern pattern = Pattern.compile("^(author|company|.+date.*)$", Pattern.CASE_INSENSITIVE);
                properties = metadata.findProperties(new RegexSpecification(pattern));
                System.out.println(String.format("All properties whose names match the following regex: %s", pattern.pattern()));
                printProperties(properties);
            }
        }
    }

    // Print the name and value of every property in the list
    private static void printProperties(IReadOnlyList<MetadataProperty> properties) {
        for (MetadataProperty property : properties) {
            System.out.println(String.format("Property name: %s, Property value: %s", property.getName(), property.getValue()));
        }
    }

    // Define your own specifications to filter metadata properties.
    // The nested classes are static, so no outer instance is needed to create them.

    // Accepts date properties whose year equals the given year
    public static class YearMatchSpecification extends Specification {
        private final int year;

        public YearMatchSpecification(int year) {
            this.year = year;
        }

        @Override
        public boolean isSatisfiedBy(MetadataProperty candidate) {
            Date date = candidate.getValue().toClass(Date.class);
            if (date != null) {
                Calendar calendar = Calendar.getInstance();
                calendar.setTime(date);
                return year == calendar.get(Calendar.YEAR);
            }
            return false;
        }
    }

    // Accepts properties whose names match the given pattern
    public static class RegexSpecification extends Specification {
        private final Pattern pattern;

        public RegexSpecification(Pattern pattern) {
            this.pattern = pattern;
        }

        @Override
        public boolean isSatisfiedBy(MetadataProperty metadataProperty) {
            Matcher matcher = pattern.matcher(metadataProperty.getName());
            return matcher.find();
        }
    }
}
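
A note on the name filter: in a Java regex, alternation binds more loosely than the ^ and $ anchors, so whether the branches are grouped changes which property names match. The sketch below uses only java.util.regex and compares an ungrouped pattern with a grouped one. The class RegexAnchorSketch and the sample property names are hypothetical and chosen only to show the difference.

```java
import java.util.regex.Pattern;

// Compares an ungrouped alternation with a grouped one; the names are illustrative only.
class RegexAnchorSketch {

    // Parsed as three branches: "^author", "company", "(.+date.*)$"
    static final Pattern UNGROUPED =
            Pattern.compile("^author|company|(.+date.*)$", Pattern.CASE_INSENSITIVE);

    // The group makes ^ and $ apply to every branch
    static final Pattern GROUPED =
            Pattern.compile("^(author|company|.+date.*)$", Pattern.CASE_INSENSITIVE);

    // Same matching call as a find-based name filter
    static boolean matches(Pattern pattern, String name) {
        return pattern.matcher(name).find();
    }

    public static void main(String[] args) {
        String[] names = {"Author", "Company", "CreatedDate", "AuthorNotes", "SubCompanyId"};
        for (String name : names) {
            System.out.printf("%-14s ungrouped=%b grouped=%b%n",
                    name, matches(UNGROUPED, name), matches(GROUPED, name));
        }
    }
}
```

Both patterns accept Author, Company, and CreatedDate. Only the ungrouped one also accepts AuthorNotes and SubCompanyId, because its first branch is anchored only at the start and its second branch is not anchored at all. Pick the form that matches the names you actually want to capture.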

In summary, this article has shown how to get the metadata of a PDF in Java. With the GroupDocs.Metadata library, developers can retrieve document titles, author details, creation and modification dates, and keywords from PDF documents, which supports applications for document management, data analysis, and automation. We encourage you to try the code with different PDF files and specifications to explore additional metadata properties.

In a previous article, we presented a tutorial on extracting metadata from PPTX files using Java. For more on this subject, see our guide on how to read metadata from PPTX using Java.
