Amazing Python scripts: PDF to Word, doc to docx, Word to HTML, and other format conversions


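Preface

Many people have looked into converting PDF to Word and found that the tools are paid, and expensive. If you know Python, though, these conversions stop being a problem.

Converting PDF files to Word files

Converting Word files to PDF files

doc to docx

docx to HTML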

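1. What is Tika?

Tika is a content analysis toolkit. It ships with a comprehensive set of parser classes, can parse nearly all common file formats, and returns a file's metadata and content as structured information, so it works well as a general-purpose parsing tool. It is especially useful for the data collection and processing stages of a search engine. Tika began as a subproject of Apache Lucene (it is now a top-level Apache project); in Lucene applications, Tika can extract content from large batches of documents to build an index, which is convenient and easy to use. The Apache Tika toolkit automatically detects the type of a document (Word, PPT, XML, CSV and so on) and extracts its metadata and text. Tika integrates existing document parsing libraries behind one uniform interface, which makes parsing different document types simpler. It is useful for search engine indexing, content analysis, format conversion and similar tasks.

Tika architecture

Application programmers can easily integrate Tika into their applications. Tika also provides a command-line interface and a graphical user interface, which make it friendly to use. Tika's architecture is made up of four important modules:

Language detection mechanism.
MIME detection mechanism.
Parser interface.
Tika Facade class.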

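Language detection mechanism

Whenever a text document is passed to Tika, it detects the language the document is written in. It accepts documents that carry no language annotation and records the detected language in the document's metadata. To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, together with a language identification repository that contains the algorithms for detecting the language of a given text. Internally, Tika uses an N-gram algorithm for language detection.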
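To make the N-gram idea concrete, here is a minimal sketch in plain Java. It is not Tika's implementation: the class name TrigramLanguageGuesser, the tiny training samples and the "count shared trigrams" score are all assumptions chosen for illustration, and real detectors use far larger, weighted profiles.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy character-trigram language guesser (hypothetical, for illustration only).
final class TrigramLanguageGuesser {
    // language code -> set of trigrams seen in that language's sample text
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    // Split text into overlapping 3-character windows.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Build a profile for one language from a sample text.
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Return the language whose profile shares the most trigrams with the text.
    String guess(String text) {
        Set<String> grams = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            Set<String> shared = new HashSet<>(grams);
            shared.retainAll(entry.getValue());
            if (shared.size() > bestScore) {
                bestScore = shared.size();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

A caller would train one profile per language and then call guess on the text to classify; the detected code could then be stored in the document's metadata, as described above.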

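MIME detection mechanism

Tika can detect a document's type according to the MIME standard. The default MIME type detection is done with org.apache.tika.mime.MimeTypes, which uses the org.apache.tika.detect.Detector interface for most content type detection. Internally, Tika combines several techniques, such as file name globs, content-type hints, magic bytes, and character encodings.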
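Of the techniques listed, magic bytes are the easiest to show in isolation. The sketch below is plain Java and only illustrates the idea; it is not how Tika is implemented, and the class name MagicByteSniffer and the returned type strings are assumptions. Tika's own detection is driven by a much larger MIME database.

```java
import java.nio.charset.StandardCharsets;

// Illustrative magic-byte sniffing (hypothetical helper, not part of Tika).
final class MagicByteSniffer {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP local file header; docx, xlsx and pptx files are ZIP containers
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file header used by legacy .doc, .xls and .ppt files
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // True if data begins with the given signature.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Map the first bytes of a file to a coarse media type.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, OLE2)) return "application/x-ole-storage";
        return "application/octet-stream";
    }
}
```

Magic bytes alone cannot tell a .docx from any other ZIP archive, which is one reason a detector combines them with file name globs and other hints.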

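Parser interface

The Parser interface in the org.apache.tika.parser package is Tika's main interface for parsing documents. It extracts text and metadata from a document and presents them in a uniform way to external users, and it is also the interface implemented by anyone who wants to write a parser plugin. Tika supports a large number of file formats through concrete parser classes, one per document type. These format-specific classes either implement the parsing logic directly or delegate to an external parser library.

Tika Facade class

Using the Tika facade class is the simplest and most direct way to call Tika from Java, and it follows the facade design pattern. The facade class is Tika, in the org.apache.tika package of the Tika API. It implements the basic use cases and acts as a broker in front of the library: it hides the underlying complexity, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, and gives users one simple interface.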

2. Code project

Goal

Convert a Word document to HTML

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>tika</artifactId>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.17</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>

    </dependencies>
</project>

controller

package com.et.tika.controller;

import com.et.tika.convertor.WordToHtmlConverter;
import com.et.tika.dto.ConvertedDocumentDTO;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.util.HashMap;
import java.util.Map;

@RestController
@Slf4j
public class HelloWorldController {
    @RequestMapping("/hello")
    public Map<String, Object> showHelloWorld(){
        Map<String, Object> map = new HashMap<>();
        map.put("msg", "HelloWorld");
        return map;
    }
    @Autowired
    WordToHtmlConverter converter;



    /**
     * Transforms the Word document into HTML document and returns the transformed document.
     *
     * @return  The content of the uploaded document as HTML.
     */
    @RequestMapping(value = "/api/word-to-html", method = RequestMethod.POST)
    public ConvertedDocumentDTO convertWordDocumentIntoHtmlDocument(@RequestParam(value = "file", required = true) MultipartFile wordDocument) {
        log.info("Converting word document into HTML document");

        ConvertedDocumentDTO htmlDocument = converter.convertWordDocumentIntoHtml(wordDocument);

        log.info("Converted word document into HTML document.");
        log.trace("The created HTML markup looks as follows: {}", htmlDocument);

        return htmlDocument;
    }
}

WordToHtmlConverter

package com.et.tika.convertor;


import com.et.tika.dto.ConvertedDocumentDTO;
import com.et.tika.exception.DocumentConversionException;
import lombok.extern.slf4j.Slf4j;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.springframework.stereotype.Component;
import org.springframework.web.multipart.MultipartFile;
import org.xml.sax.SAXException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

/**
 *
 */
@Component
@Slf4j
public class WordToHtmlConverter {


    /**
     * Converts a .docx document into HTML markup. This code
     * is based on <a href="http://stackoverflow.com/a/9053258/313554">this StackOverflow</a> answer.
     *
     * @param wordDocument  The converted .docx document.
     * @return
     */
    public ConvertedDocumentDTO convertWordDocumentIntoHtml(MultipartFile wordDocument) {
        log.info("Converting word document: {} into HTML", wordDocument.getOriginalFilename());
        // try-with-resources closes the upload stream once parsing finishes
        try (InputStream input = wordDocument.getInputStream()) {
            Parser parser = new OOXMLParser();

            StringWriter sw = new StringWriter();
            SAXTransformerFactory factory = (SAXTransformerFactory)
                    SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            handler.setResult(new StreamResult(sw));

            Metadata metadata = new Metadata();
            metadata.add(Metadata.CONTENT_TYPE, "text/html;charset=utf-8");
            parser.parse(input, handler, metadata, new ParseContext());
            return new ConvertedDocumentDTO(wordDocument.getOriginalFilename(), sw.toString());
        }
        catch (IOException | SAXException | TransformerException | TikaException ex) {
            log.error("Conversion failed because an exception was thrown", ex);
            throw new DocumentConversionException(ex.getMessage(), ex);
        }
    }
}

dto

package com.et.tika.dto;

import org.apache.commons.lang.builder.ToStringBuilder;

/**
 *
 */
public  class ConvertedDocumentDTO {

    private final String contentAsHtml;
    private final String filename;

    public ConvertedDocumentDTO(String filename, String contentAsHtml) {
        this.contentAsHtml = contentAsHtml;
        this.filename = filename;
    }

    public String getContentAsHtml() {
        return contentAsHtml;
    }

    public String getFilename() {
        return filename;
    }

    @Override
    public String toString() {
        return new ToStringBuilder(this)
                .append("filename", this.filename)
                .append("contentAsHtml", this.contentAsHtml)
                .toString();
    }
}

Custom exception

package com.et.tika.exception;

/**
 *
 */
public final class DocumentConversionException extends RuntimeException {

    public DocumentConversionException(String message, Exception ex) {
        super(message, ex);
    }
}

The code above covers only the key parts; see the repository below for the full code.

Code repository

https://github.com/Harries/springboot-demo

3. Testing

Start the Spring Boot application

Test the Word-to-HTML conversion

4. References

https://tika.apache.org/
http://www.liuhaihua.cn/archives/710679.html

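I. Preface

Online document preview can be implemented in several ways. The previous article, "Online Document Preview, New Edition (1): Implementing Online Preview by Converting Files to Images", covered converting documents to images. Another option is to convert them to PDF and let the front end display them with plugins such as pdf.js or pdfobject.js. This article covers a third option: converting documents to HTML.

The code below uses Aspose, POI + PDFBox, and Spire in turn to convert Word, PDF, and Excel files to HTML.

1. Aspose

Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports and SSRS. Thousands of organizations in dozens of countries have used Aspose components to create, edit, convert or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD and other file formats. Note that Aspose is a commercial product: without a license, the exported files are covered in watermarks (respect copyright and stay away from cracked versions).

Add the following dependencies to the project's pom file: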

        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-words</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-pdf</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-cells</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-slides</artifactId>
            <version>23.1</version>
        </dependency>

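2. POI + PDFBox

Aspose and Spire are easy to use, but both are commercial components, so an approach based on open-source libraries is provided here as well.

POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. Apache POI gives Java programs the ability to read and write Microsoft Office file formats.

Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert and manipulate PDF documents.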

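Add the following dependencies to the project's pom file. The POI and PDFBox snippets in this article also use XHTMLConverter, PDFDomTree and ImgUtil, which none of these artifacts provide; they appear to come from xdocreport, pdf2dom and Hutool respectively, so those libraries would need to be added as well.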

		<dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.4</version>
        </dependency>
		<dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-excelant</artifactId>
            <version>5.2.0</version>
        </dependency>

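3. Spire

Spire is a professional Office programming component that covers reading, writing, editing and viewing Word, Excel, PPT, PDF and other files. Spire offers a free edition, but it can only export the first 3 pages or the first 500 rows; hitting either limit triggers the restriction. Going beyond these limits requires the paid edition (respect copyright and stay away from cracked versions). The free edition is used for the demonstrations here.

Before adding the Spire dependency to the pom, add its Maven repository: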

		<repository>
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
        </repository>

Then add the following dependency to the project's pom file:

Free edition:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office.free</artifactId>
            <version>5.3.1</version>
        </dependency>

Paid edition:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office</artifactId>
            <version>5.3.1</version>
        </dependency>

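II. Converting files to an HTML string

1. Word to HTML string

1.1 Using Aspose

public static String wordToHtmlStr(String wordPath) {
        // try-with-resources closes the in-memory stream
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true); // embed images so the string is self-contained
            // Save as HTML; a bare doc.toString() does not produce HTML markup
            doc.save(out, options);
            return out.toString("UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }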

Result:

1.2 Using POI

public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        String htmlStr = null;
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            htmlStr = word2007ToHtmlStr(wordPath);
        } else if (ext.equals(".doc")){
            htmlStr = word2003ToHtmlStr(wordPath);
        } else {
            throw new RuntimeException("Unsupported file format");
        }
        return htmlStr;
    }

    public String word2007ToHtmlStr(String wordPath) throws IOException {
        // Collect the output in an in-memory stream
        try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
            word2007ToHtmlOutputStream(wordPath, out);
            return out.toString();
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // A negative ratio effectively turns off POI's zip-bomb check; only do this for trusted files
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the caller's stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }


    private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Transform document to string
        StringWriter writer = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
        return writer.toString();
    }

private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        InputStream input = Files.newInputStream(Paths.get(wordPath));
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            System.out.println(pictureType);
            if (PictureType.UNKNOWN.equals(pictureType)) {
                return null;
            }
            BufferedImage bufferedImage = ImgUtil.toImage(content);
            String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
            // For Word files with pictures, embed each picture in the page as a base64 data URI
            return "data:;base64," + base64Img;
        });
        // Convert the Word document
        wordToHtmlConverter.processDocument(wordDocument);
        return wordToHtmlConverter.getDocument();
    }

1.3 Using Spire

 public String wordToHtmlStr(String wordPath) throws IOException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Document document = new Document();
            document.loadFromFile(wordPath);
            document.saveToFile(outputStream, FileFormat.Html);
            return outputStream.toString();
        }
    }

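2. PDF to HTML string

2.1 Using Aspose

Note: the snippet given here uses PDFBox classes (PDDocument, PDFDomTree), like the one in 2.2, rather than the Aspose API.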

public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

Result:

2.2 Using POI + PDFBox

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

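2.3 Using Spire

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            PdfDocument pdf = new PdfDocument();
            pdf.loadFromFile(pdfPath);
            // Write the converted HTML into the in-memory stream; saveToStream is
            // assumed to be available here as it is for Spire.XLS in 3.3
            pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
            return outputStream.toString();
        }
    }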

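3. Excel to HTML string

3.1 Using Aspose

Note: the snippet given here uses POI classes (XSSFWorkbook, DataFormatter), like the one in 3.2, rather than the Aspose API.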

public static String excelToHtmlStr(String excelPath) throws Exception {
        FileInputStream fileInputStream = new FileInputStream(excelPath);
        Workbook workbook = new XSSFWorkbook(fileInputStream);
        DataFormatter dataFormatter = new DataFormatter();
        FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
        Sheet sheet = workbook.getSheetAt(0);
        StringBuilder htmlStringBuilder = new StringBuilder();
        htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
        htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
        htmlStringBuilder.append("</head><body><table>");
        for (Row row : sheet) {
            htmlStringBuilder.append("<tr>");
            for (Cell cell : row) {
                CellType cellType = cell.getCellType();
                if (cellType == CellType.FORMULA) {
                    formulaEvaluator.evaluateFormulaCell(cell);
                    cellType = cell.getCachedFormulaResultType();
                }
                String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
            }
            htmlStringBuilder.append("</tr>");
        }
        htmlStringBuilder.append("</table></body></html>");
        return htmlStringBuilder.toString();
    }

The returned HTML string:

<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>序號</td><td>姓名</td><td>性別</td><td>聯系方式</td><td>地址</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr></table></body></html>

3.2 Using POI + PDFBox

public String excelToHtmlStr(String excelPath) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }
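The loop above appends each cell's text straight into a td element, so a cell containing characters such as < or & would end up as broken or unintended markup. One way to guard against that is to escape the value first. The helper below is a minimal sketch; the name HtmlEscaper is an assumption and it is not part of POI.

```java
// Minimal HTML text escaping for values placed inside <td>...</td>
// (hypothetical helper, not provided by POI).
final class HtmlEscaper {
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this in place, the append call in the loop would pass HtmlEscaper.escape(cellValue) instead of cellValue.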

3.3 Using Spire

public String excelToHtmlStr(String excelPath) throws Exception {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Workbook workbook = new Workbook();
            workbook.loadFromFile(excelPath);
            workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
            return outputStream.toString();
        }
    }

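III. Converting files to HTML and generating an HTML file

Sometimes returning an HTML string is not enough and an HTML file has to be generated. A low-effort way to do this is to write the string to the target path with the FileUtils utility class from the org.apache.commons.io package:

Example of writing an HTML string to an HTML file with FileUtils:

First add the dependency to the pom: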

		<dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.8.0</version>
        </dependency>

The code:

String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\書籍\\電子書\\小說\\歷史小說\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
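If pulling in commons-io only for this one call is not desirable, the same step can be done with the JDK alone. The following is a sketch under that assumption; the class name HtmlFileWriter is hypothetical. It writes the string as UTF-8 and creates missing parent directories first.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// JDK-only alternative to FileUtils.write (hypothetical helper).
final class HtmlFileWriter {
    // Write the HTML string as UTF-8, creating parent directories if needed.
    static void write(Path target, String html) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }
}
```

A call such as HtmlFileWriter.write(Paths.get("D:\\test\\doc.html"), htmlStr) would then replace the FileUtils line. Fixing the charset explicitly keeps the file consistent with a UTF-8 meta tag regardless of the platform default.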

Alternatively, the code above can be adjusted to generate the HTML file directly, as follows:

1. Word to HTML file

The original Word file:

1.1 Using Aspose

public static void wordToHtml(String wordPath, String htmlPath) {
        File sourceFile = new File(wordPath);
        String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
        // try-with-resources closes the output stream even if the conversion fails
        try (FileOutputStream os = new FileOutputStream(new File(path))) { // target HTML file
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true); // embed images in the HTML
            options.setExportRelativeFontSize(true);
            doc.save(os, options);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Result after converting to HTML:

1.2 Using POI + PDFBox

public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            word2007ToHtml(wordPath, htmlPath);
        } else if (ext.equals(".doc")){
            word2003ToHtml(wordPath, htmlPath);
        } else {
            throw new RuntimeException("Unsupported file format");
        }
    }

    public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        //try(OutputStream out = Files.newOutputStream(Paths.get(path))){
        try(FileOutputStream out = new FileOutputStream(htmlPath)){
            word2007ToHtmlOutputStream(wordPath, out);
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // A negative ratio effectively turns off POI's zip-bomb check; only do this for trusted files
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the caller's stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

    public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Write the HTML file to htmlPath

        try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
        }
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        InputStream input = Files.newInputStream(Paths.get(wordPath));
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            System.out.println(pictureType);
            if (PictureType.UNKNOWN.equals(pictureType)) {
                return null;
            }
            BufferedImage bufferedImage = ImgUtil.toImage(content);
            String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
            // For Word files with pictures, embed each picture in the page as a base64 data URI
            return "data:;base64," + base64Img;
        });
        // Convert the Word document
        wordToHtmlConverter.processDocument(wordDocument);
        return wordToHtmlConverter.getDocument();
    }

Result after converting to HTML:

1.3 Using Spire

public void wordToHtml(String wordPath, String htmlPath) {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        Document document = new Document();
        document.loadFromFile(wordPath);
        document.saveToFile(htmlPath, FileFormat.Html);
    }

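Result after converting to HTML:

Because the free edition is used, page and word-count limits apply; choose the paid edition if you need full functionality. PS: this time the first 50 pages of the 76-page document were converted successfully.

2. PDF to HTML file

The original image-based PDF:

The original text-based PDF:

2.1 Using Aspose

Note: the snippet given here uses PDFBox classes (PDDocument, PDFDomTree), like the one in 2.2, rather than the Aspose API.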

public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        File file = new File(pdfPath);
        String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.2 Using POI + PDFBox

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.3 Using Spire

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
    }

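Result for the image-based PDF:
Because the free edition is used, only the first three pages come out correctly. Choose the paid edition if you need more than three pages.

Result for the text-based PDF:

The conversion failed with an error: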

java.lang.NullPointerException
	at com.spire.pdf.PdfPageWidget.spr┢?(Unknown Source)
	at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.spr???—(Unknown Source)
	at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr︻┎?—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr┻┑?—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
	at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘?—(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr??(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.spr┦?(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr┞?(Unknown Source)
	at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)

3. Excel to HTML file

The original Excel file:

3.1 Using Aspose

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook(excelPath);
        com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
        workbook.save(htmlPath, options);
    }

Result after converting to HTML:

3.2 Using POI

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
            String htmlStr = excelToHtmlStr(excelPath);
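            // Encode explicitly as UTF-8 instead of relying on the platform default charset
            byte[] bytes = htmlStr.getBytes(java.nio.charset.StandardCharsets.UTF_8);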
            fileOutputStream.write(bytes);
        }
    }


    public String excelToHtmlStr(String excelPath) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

Result after converting to HTML:

3.3 Using Spire

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook();
        workbook.loadFromFile(excelPath);
        workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
    }

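Result after converting to HTML:

IV. Summary

As the results above show, the HTML output is not ideal: many style details are not reproduced. This is because this kind of conversion usually aims to produce simple, clean HTML from the semantic information in the document while ignoring other details, so complex styling such as centering, first-line indents, fonts, text size and color is dropped. For example, any paragraph with the Heading 1 style is converted to an h1 element rather than an attempt to reproduce the heading's exact appearance. The HTML therefore often looks different from the original document, and for more complex documents the conversion is unlikely to be perfect. If your documents only use simple styles, or you do not care much about styling, this approach is still worth a try.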
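The Heading 1 example can be sketched as a simple lookup from style name to HTML element. This is only an illustration of the semantic-mapping idea, not the code of any of the libraries above; the class name SemanticStyleMapper and the three mapped styles are assumptions, and the text is assumed to be HTML-escaped already.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative style-name-to-element mapping (hypothetical, not library code).
final class SemanticStyleMapper {
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("Heading 1", "h1");
        TAGS.put("Heading 2", "h2");
        TAGS.put("Heading 3", "h3");
    }

    // Map a paragraph's style name to an HTML element; unknown styles become <p>.
    static String toHtml(String styleName, String text) {
        String tag = TAGS.getOrDefault(styleName, "p");
        return "<" + tag + ">" + text + "</" + tag + ">";
    }
}
```

Everything that is not in the lookup, including fonts, colors and alignment, simply falls through to a plain paragraph, which is why the converted page looks cleaner but less faithful than the source document.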

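PS: For better fidelity, this article can be combined with the previous one, "Online Document Preview (1): Implementing Online Preview by Converting txt, Word and PDF to Images": render the document content as images (most likely several), put all the images into one HTML page, use HTML + CSS to lay them out, and return that HTML. The open-source component kkFileView takes this approach.

How kkFileView displays a document:

The HTML returned by kkFileView shows that it converts each page of a file (except txt files) to an image, embeds those images in one HTML page, and returns that page to the user.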
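The "images in one HTML page" idea can be sketched as follows. This is not kkFileView's code: the class name ImagePageBuilder is hypothetical, the pages are assumed to be PNG images already rendered by one of the libraries from the previous article, and inlining them as base64 data URIs is just one possible way to reference them (serving them by URL is another).

```java
import java.util.Base64;
import java.util.List;

// Builds a single HTML page from pre-rendered page images (hypothetical helper).
final class ImagePageBuilder {
    // Stack the given PNG page images vertically, each inlined as a data URI.
    static String build(List<byte[]> pngPages) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset=\"utf-8\">")
            .append("<style>img{display:block;margin:0 auto 8px;max-width:100%;}</style>")
            .append("</head><body>");
        for (byte[] png : pngPages) {
            html.append("<img src=\"data:image/png;base64,")
                .append(Base64.getEncoder().encodeToString(png))
                .append("\">");
        }
        return html.append("</body></html>").toString();
    }
}
```

Inlining keeps the response self-contained, at the cost of a larger page; for long documents, image URLs with lazy loading are likely the better trade-off.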
