Extract Text from MHTML using C#

MHTML (MIME HTML) is a web archive format that saves the entire content of a webpage, including text, images, and links, in a single file. Extracting text from MHTML files is useful when working with web content for data analysis, document processing, or automated reporting. This article shows how to extract text from MHTML using C#, giving developers an efficient way to retrieve the relevant information from these files. Before you begin, make sure you have a current version of .NET, an IDE such as Visual Studio, and the GroupDocs.Parser for .NET library.

Steps to Extract Text from MHTML using C#

Set up your development environment by adding the GroupDocs.Parser for .NET library, allowing you to easily extract text from MHTML files
Initialize a Parser object by passing the path to your MHTML file into its constructor
Use the Parser.GetText method to retrieve a TextReader object, which will allow access to the text content
Call the TextReader.ReadToEnd method to extract the full text from the MHTML file

After setting up your environment, MHTML text extraction in C# is a straightforward process. Start by creating a Parser instance with the path to your MHTML file. Use the GetText method to obtain a TextReader object, which gives you access to the file’s text. Finally, call ReadToEnd on the TextReader to extract all the text at once. This approach works well for analyzing extensive web content or automating the conversion of web archives. Once you have set the file paths, the code example in the following section can be integrated into your projects directly.
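ReadToEnd loads the whole text into memory in one call. For very large archives it may be preferable to consume the TextReader line by line instead. The following is a minimal sketch of that idea, not part of the GroupDocs sample: `LineByLineDemo` and `CountNonEmptyLines` are hypothetical names introduced for illustration, and a `StringReader` stands in for the reader that the parser would return.

```csharp
using System;
using System.IO;

internal static class LineByLineDemo
{
    // Hypothetical helper: counts the non-empty lines that a TextReader yields
    public static int CountNonEmptyLines(TextReader reader)
    {
        int count = 0;
        string line;

        // ReadLine returns null once the end of the text is reached
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length > 0)
            {
                count++;
            }
        }

        return count;
    }

    private static void Main()
    {
        // Assumption: StringReader stands in for the TextReader obtained from the parser
        using (TextReader reader = new StringReader("Title\n\nFirst paragraph\nSecond paragraph"))
        {
            Console.WriteLine(CountNonEmptyLines(reader));
        }
    }
}
```

Because the helper accepts any TextReader, the same loop should work unchanged on a reader obtained from an MHTML file, provided that reader is not null.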

Code to Extract Text from MHTML using C#

	using GroupDocs.Parser;
	using System;
	using System.IO;

	namespace ExtractTextfromMHTMLusingCSharp
	{
	    internal class Program
	    {
	        static void Main(string[] args)
	        {
	            // Set the license to avoid the limitations of the Parser library
	            License lic = new License();
	            lic.SetLicense(@"GroupDocs.Parser.lic");

	            // Instantiate the Parser class with the path to the MHTML file
	            using (Parser parser = new Parser("input.mhtml"))
	            {
	                // Retrieve the document text into the reader
	                using (TextReader reader = parser.GetText())
	                {
	                    // Output the text from the document.
	                    // If text extraction is not supported,
	                    // the reader will be null
	                    Console.WriteLine(reader == null
	                        ? "Text extraction isn't supported"
	                        : reader.ReadToEnd());
	                    Console.ReadLine();
	                }
	            }
	        }
	    }
	}


You can read text from MHTML files in C# on Windows, macOS, and Linux. No additional software is required beyond .NET and the GroupDocs.Parser library. Text extraction is a valuable technique for developers working with web content or building document automation tools. Whether you are dealing with large-scale data scraping, content analysis, or archiving, extracting text from MHTML files programmatically can streamline your workflow and extend the capabilities of your applications.

Earlier, we published a guide on extracting text from TXT files. For more detail, see our full tutorial on how to extract text from TXT using C#.
