Why build this tool
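Recently I had a requirement where the front end uploads a Word document. Once the upload finishes, the user should be able to view the document directly in the browser and reference it in a rich-text editor. That means what the back end stores in the database must be HTML, so the Word document has to be converted to HTML.
After some research, I found two ways to handle the Word-to-HTML conversion. Option 1 is to convert on the front end. Option 2 is to upload the Word document to the back end, convert it there, and return the result to the front end. Feedback on option 1 mentions many problems, so I did not use it. The final decision was for the back end to convert the document to HTML and return it to the front end for preview; once the user has previewed it and confirmed the formatting is fine, the HTML is saved on the back end. (Word documents can contain many complex elements such as images, Visio diagrams, and tables, which may cause formatting problems when converted to HTML, so a preview step is needed.)
Plain text in a Word document is not much of a problem; the main work is handling non-text elements such as images, videos, and tables. For this article I only handled images: the back end extracts the images from the Word document (the imported jars already provide this capability), uploads them to the server, and puts the resulting absolute URLs into the HTML, so the HTML returned to the front end can be previewed directly.
The Maven dependencies are as follows: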
<poi-scratchpad.version>3.14</poi-scratchpad.version>
<poi-ooxml.version>3.14</poi-ooxml.version>
<xdocreport.version>1.0.6</xdocreport.version>
<poi-ooxml-schemas.version>3.14</poi-ooxml-schemas.version>
<ooxml-schemas.version>1.3</ooxml-schemas.version>
<jsoup.version>1.11.3</jsoup.version>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>${poi-scratchpad.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>${poi-ooxml.version}</version>
</dependency>
<dependency>
<groupId>fr.opensagres.xdocreport</groupId>
<artifactId>xdocreport</artifactId>
<version>${xdocreport.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml-schemas</artifactId>
<version>${poi-ooxml-schemas.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>ooxml-schemas</artifactId>
<version>${ooxml-schemas.version}</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>${jsoup.version}</version>
</dependency>
Word-to-HTML conversion works differently for Word 2003 (.doc) and Word 2007 (.docx) because the two file formats differ. The utility class is as follows:
Usage:
public String uploadSourceNews(MultipartFile file) {
    String fileName = file.getOriginalFilename();
    // Guard against a missing file name or a name without an extension
    int dotIndex = fileName == null ? -1 : fileName.lastIndexOf(".");
    String suffixName = dotIndex < 0 ? "" : fileName.substring(dotIndex);
    if (!".doc".equals(suffixName) && !".docx".equals(suffixName)) {
        throw new UploadFileFormatException();
    }
    // Extracted images are stored under a per-month directory (yyyyMM)
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMM");
    String dateDir = formatter.format(LocalDate.now());
    String directory = imageDir + "/" + dateDir + "/";
    String content = null;
    try (InputStream inputStream = file.getInputStream()) {
        // suffixName includes the leading dot, so compare against ".doc", not "doc"
        if (".doc".equals(suffixName)) {
            content = wordToHtmlUtil.Word2003ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
        } else {
            content = wordToHtmlUtil.Word2007ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
        }
    } catch (Exception ex) {
        logger.error("word to html exception, detail:", ex);
        return null;
    }
    return content;
}
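The upload method above chooses a converter from the file name's extension, which a client can set to anything. As a complement, the sketch below guesses the format from the first bytes of the upload. The class and method names (`WordFormatSniffer`, `detect`) are hypothetical and not part of the original utility class. Note that the signatures only distinguish an OLE2 container from a ZIP package, so an .xls or .xlsx file would match as well.

```java
final class WordFormatSniffer {

    enum WordFormat { DOC, DOCX, UNKNOWN }

    // OLE2 compound file signature, used by legacy binary .doc files
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; a .docx file is a ZIP package
    private static final byte[] ZIP_SIGNATURE = {0x50, 0x4B, 0x03, 0x04};

    // Classifies the leading bytes of an uploaded file
    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_SIGNATURE)) {
            return WordFormat.DOC;
        }
        if (startsWith(header, ZIP_SIGNATURE)) {
            return WordFormat.DOCX;
        }
        return WordFormat.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In practice the caller would read the first 8 bytes of the stream (for example through a `BufferedInputStream` with `mark`/`reset`) and pass them to `detect` before handing the stream to the converter.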
Some notes on the doc and docx storage formats:
DOCX is an XML-based word processing file format developed by Microsoft. DOCX files differ from DOC files in that a DOCX file stores its data as separate files and folders inside a compressed package, whereas a DOC file is a single binary file. Versions of Microsoft Office earlier than Office 2007 do not support DOCX files, because those versions save documents in the binary DOC format.
You may ask: the file clearly ends in .docx, so how is it an XML format?
Simple: pick any docx file and open it with an archive tool, and you will see a directory structure (for example `[Content_Types].xml`, `_rels/`, `docProps/`, and `word/`).
So what looks like a single document is actually a compressed archive.
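The same check can be done in code. The following is a minimal sketch that lists the entries of a docx package using only `java.util.zip` from the JDK; the class name `DocxEntryLister` is made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxEntryLister {

    // Reads the stream as a ZIP archive and returns the entry names in archive order
    static List<String> listEntries(InputStream docxStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docxStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Calling `listEntries(Files.newInputStream(Paths.get("some.docx")))` on a real document should print names such as `word/document.xml`, which holds the main body text.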
Reference:
https://www.cnblogs.com/ct-csu/p/8178932.html
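The previous article, "Online document preview, new edition (1): implementing online preview by converting files to images", covered converting documents to images. Online preview can also be implemented by converting documents to PDF and rendering them on the front end with plugins such as pdf.js or pdfobject.js, or, as this article describes, by converting documents to HTML.
The code below shows how to convert Word, PDF, and Excel files to HTML using aspose, the open-source libraries POI and pdfbox, and spire.
Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports, and SSRS. Thousands of organizations in dozens of countries use Aspose components to create, edit, convert, or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD, and other file formats. Note that Aspose is a commercial product; without a license, exported files are covered in watermarks (respect copyright and stay away from cracked versions).
Add the following dependencies to the project's pom file: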
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-words</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-pdf</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-cells</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-slides</artifactId>
<version>23.1</version>
</dependency>
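Although aspose and spire are easy to use, both are commercial components, so an approach based on open-source libraries is provided here as well.
Apache POI is a free, open-source, cross-platform Java API from the Apache Software Foundation that lets Java programs read and write Microsoft Office file formats.
Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert, and manipulate PDF documents.
Add the following dependencies to the project's pom file: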
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-excelant</artifactId>
<version>5.2.0</version>
</dependency>
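Spire is a professional Office programming component covering reading, writing, editing, and viewing of Word, Excel, PPT, PDF, and other files. Spire offers a free version, but it can only export the first 3 pages and the first 500 rows; hitting either limit triggers the restriction. Going beyond these limits requires the paid version (respect copyright and stay away from cracked versions). The free version is used here for demonstration.
Before adding the spire dependency to the pom, you first need to add its Maven repository: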
<repository>
<id>com.e-iceblue</id>
<name>e-iceblue</name>
<url>https://repo.e-iceblue.cn/repository/maven-public/</url>
</repository>
Then add the following dependency to the project's pom file.
Free version:
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.office.free</artifactId>
<version>5.3.1</version>
</dependency>
Paid version:
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.office</artifactId>
<version>5.3.1</version>
</dependency>
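// Aspose.Words implementation
public static String wordToHtmlStr(String wordPath) {
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        Document doc = new Document(wordPath); // wordPath is the Word document to convert
        // Save as HTML into memory; images are embedded as Base64 so the string is self-contained
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.setExportImagesAsBase64(true);
        doc.save(out, options);
        return out.toString("UTF-8");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}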
Result:
public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
String htmlStr = null;
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
htmlStr = word2007ToHtmlStr(wordPath);
} else if (ext.equals(".doc")){
htmlStr = word2003ToHtmlStr(wordPath);
} else {
throw new RuntimeException("Incorrect file format");
}
return htmlStr;
}
public String word2007ToHtmlStr(String wordPath) throws IOException {
    // Collect the converted XHTML in an in-memory output stream
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        word2007ToHtmlOutputStream(wordPath, out);
        return out.toString();
    }
}
private void word2007ToHtmlOutputStream(String wordPath, OutputStream out) throws IOException {
    // A negative ratio relaxes POI's zip-bomb protection; only do this for trusted input
    ZipSecureFile.setMinInflateRatio(-1.0d);
    // try-with-resources closes the input stream and the document when done
    try (InputStream in = Files.newInputStream(Paths.get(wordPath));
         XWPFDocument document = new XWPFDocument(in)) {
        // Embed images as Base64 so the HTML does not depend on external files
        XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
        XHTMLConverter.getInstance().convert(document, out, options);
    }
}
private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// Transform document to string
StringWriter writer = new StringWriter();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
return writer.toString();
}
private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
InputStream input = Files.newInputStream(Paths.get(wordPath));
HWPFDocument wordDocument = new HWPFDocument(input);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder()
.newDocument());
wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
System.out.println(pictureType);
if (PictureType.UNKNOWN.equals(pictureType)) {
return null;
}
BufferedImage bufferedImage = ImgUtil.toImage(content);
String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
// For Word files with pictures, encode each picture as Base64 so everything lives in a single page
StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
return sb.toString();
});
// Parse the Word document
wordToHtmlConverter.processDocument(wordDocument);
return wordToHtmlConverter.getDocument();
}
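The picture manager above returns a URI of the form `data:;base64,...`, which leaves the media type empty. Below is a minimal JDK-only sketch that builds a data URI with an explicit MIME type. The class `DataUriBuilder` and its extension-to-MIME table are illustrative assumptions, not part of POI or of the original code.

```java
import java.util.Base64;
import java.util.Locale;

final class DataUriBuilder {

    // Maps a few common image extensions to MIME types; unknown types fall back to a generic one
    static String mimeTypeFor(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "png":
                return "image/png";
            case "jpg":
            case "jpeg":
                return "image/jpeg";
            case "gif":
                return "image/gif";
            case "bmp":
                return "image/bmp";
            default:
                return "application/octet-stream";
        }
    }

    // Builds "data:<mime>;base64,<payload>" from the raw picture bytes
    static String toDataUri(byte[] content, String extension) {
        String base64 = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeTypeFor(extension) + ";base64," + base64;
    }
}
```

Inside the picture manager, this could replace the manual `StringBuilder` step by passing the raw `content` bytes and `pictureType.getExtension()`.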
public String wordToHtmlStr(String wordPath) throws IOException {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(outputStream, FileFormat.Html);
return outputStream.toString();
}
}
public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
Result:
public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
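public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
    try (ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        // Write the HTML into the in-memory stream; without this step the method returns an empty string
        pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
        return outputStream.toString();
    }
}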
public static String excelToHtmlStr(String excelPath) throws Exception {
FileInputStream fileInputStream = new FileInputStream(excelPath);
Workbook workbook = new XSSFWorkbook(fileInputStream);
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
htmlStringBuilder.append("</head><body><table>");
for (Row row : sheet) {
htmlStringBuilder.append("<tr>");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
}
htmlStringBuilder.append("</tr>");
}
htmlStringBuilder.append("</table></body></html>");
return htmlStringBuilder.toString();
}
The returned HTML string:
<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>序號</td><td>姓名</td><td>性別</td><td>聯系方式</td><td>地址</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr></table></body></html>
public String excelToHtmlStr(String excelPath) throws Exception {
try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
htmlStringBuilder.append("</head><body><table>");
for (Row row : sheet) {
htmlStringBuilder.append("<tr>");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
}
htmlStringBuilder.append("</tr>");
}
htmlStringBuilder.append("</table></body></html>");
return htmlStringBuilder.toString();
}
}
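The Excel converters above append formatted cell values to the HTML as-is, so a cell containing `<` or `&` could break the generated markup. A minimal sketch of escaping values before building a row follows; `HtmlTableBuilder`, `escapeHtml`, and `toRow` are hypothetical names introduced only for this example.

```java
import java.util.List;

final class HtmlTableBuilder {

    // Escapes the characters that have special meaning in HTML text and attribute values
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one <tr> element from already formatted cell values
    static String toRow(List<String> cells) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (String cell : cells) {
            sb.append("<td>").append(escapeHtml(cell)).append("</td>");
        }
        return sb.append("</tr>").toString();
    }
}
```

In the POI loop, the result of `dataFormatter.formatCellValue(...)` would be passed through `escapeHtml` before being appended.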
public String excelToHtmlStr(String excelPath) throws Exception {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
return outputStream.toString();
}
}
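Sometimes returning an HTML string is not enough and an HTML file needs to be generated. What then? A low-effort approach is to write the string to the target path with the FileUtils utility class from the org.apache.commons.io package:
First add the dependency to the pom: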
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.8.0</version>
</dependency>
The relevant code:
String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\書籍\\電子書\\小說\\歷史小說\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
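If adding commons-io is not desirable, the JDK's `java.nio.file.Files` can do the same job. The following is a minimal sketch; the class name `HtmlFileWriter` is made up for illustration, and it assumes the HTML should be stored as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlFileWriter {

    // Writes the HTML string as UTF-8, creating missing parent directories first
    static Path writeHtml(Path target, String html) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }
}
```

Usage would look like `HtmlFileWriter.writeHtml(Paths.get("D:\\test\\doc.html"), htmlStr);`.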
Alternatively, the code above can be adjusted to generate an HTML file directly. The adjusted code is as follows:
Original Word file:
// Aspose.Words implementation
public static void wordToHtml(String wordPath, String htmlPath) {
    File sourceFile = new File(wordPath);
    String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
    File file = new File(path); // Target HTML file
    // try-with-resources closes the output stream even if the conversion fails
    try (FileOutputStream os = new FileOutputStream(file)) {
        Document doc = new Document(wordPath); // wordPath is the Word document to convert
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.setExportImagesAsBase64(true);
        options.setExportRelativeFontSize(true);
        doc.save(os, options);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Result after converting to HTML:
public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
word2007ToHtml(wordPath, htmlPath);
} else if (ext.equals(".doc")){
word2003ToHtml(wordPath, htmlPath);
} else {
throw new RuntimeException("Incorrect file format");
}
}
public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
//try(OutputStream out = Files.newOutputStream(Paths.get(path))){
try(FileOutputStream out = new FileOutputStream(htmlPath)){
word2007ToHtmlOutputStream(wordPath, out);
}
}
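private void word2007ToHtmlOutputStream(String wordPath, OutputStream out) throws IOException {
    // A negative ratio relaxes POI's zip-bomb protection; only do this for trusted input
    ZipSecureFile.setMinInflateRatio(-1.0d);
    // try-with-resources closes the input stream and the document when done
    try (InputStream in = Files.newInputStream(Paths.get(wordPath));
         XWPFDocument document = new XWPFDocument(in)) {
        // Embed images as Base64 so the HTML does not depend on external files
        XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
        XHTMLConverter.getInstance().convert(document, out, options);
    }
}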
public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// Write the HTML to the target file path
try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(outStream);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer serializer = factory.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
}
}
private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
InputStream input = Files.newInputStream(Paths.get(wordPath));
HWPFDocument wordDocument = new HWPFDocument(input);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder()
.newDocument());
wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
System.out.println(pictureType);
if (PictureType.UNKNOWN.equals(pictureType)) {
return null;
}
BufferedImage bufferedImage = ImgUtil.toImage(content);
String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
// For Word files with pictures, encode each picture as Base64 so everything lives in a single page
StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
return sb.toString();
});
// Parse the Word document
wordToHtmlConverter.processDocument(wordDocument);
return wordToHtmlConverter.getDocument();
}
Result after converting to HTML:
public void wordToHtml(String wordPath, String htmlPath) {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(htmlPath, FileFormat.Html);
}
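Result after converting to HTML:
Because the free version is used, page and word-count limits apply; if you need the full functionality, consider the paid version. PS: this time, surprisingly, the first 50 pages of a 76-page document were converted successfully.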
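Several of the file-writing methods call `FileUtil.getNewFileFullPath`, which is not shown in this article. Judging from the inline path logic in the Aspose `wordToHtml` method, it probably derives the output path from the source file name. The sketch below is a guess at that behavior under that assumption; `OutputPaths.newFileFullPath` is a hypothetical stand-in, not the author's helper.

```java
import java.nio.file.Paths;

final class OutputPaths {

    // Returns <targetDir>/<source base name>.<newExtension>
    static String newFileFullPath(String sourcePath, String targetDir, String newExtension) {
        String fileName = Paths.get(sourcePath).getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        // Keep the whole name when there is no extension (or the name starts with a dot)
        String baseName = dot > 0 ? fileName.substring(0, dot) : fileName;
        return Paths.get(targetDir, baseName + "." + newExtension).toString();
    }
}
```

Unlike the inline `substring(0, lastIndexOf("."))` version, this variant does not throw when the source file name has no extension.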
Original image-based PDF:
Original text-based PDF:
public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
File file = new File(pdfPath);
String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
Result for the image-based PDF:
Result for the text-based PDF:
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
Result for the image-based PDF:
Result for the text-based PDF:
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PdfDocument pdf = new PdfDocument();
pdf.loadFromFile(pdfPath);
pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
}
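Result for the image-based PDF:
Because the free version is used, only the first three pages come out correctly. If you need more than three pages, consider the paid version.
Result for the text-based PDF:
It threw an error and could not be converted: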
java.lang.NullPointerException
at com.spire.pdf.PdfPageWidget.spr┢?(Unknown Source)
at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
at com.spire.pdf.PdfPageBase.spr???—(Unknown Source)
at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr︻┎?—(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr┻┑?—(Unknown Source)
at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘?—(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr??(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.spr┦?(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr┞?(Unknown Source)
at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)
Original Excel file:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook(excelPath);
com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
workbook.save(htmlPath, options);
}
Result after converting to HTML:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
String htmlStr = excelToHtmlStr(excelPath);
byte[] bytes = htmlStr.getBytes(java.nio.charset.StandardCharsets.UTF_8); // explicit charset instead of the platform default
fileOutputStream.write(bytes);
}
}
public String excelToHtmlStr(String excelPath) throws Exception {
try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
htmlStringBuilder.append("</head><body><table>");
for (Row row : sheet) {
htmlStringBuilder.append("<tr>");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
}
htmlStringBuilder.append("</tr>");
}
htmlStringBuilder.append("</table></body></html>");
return htmlStringBuilder.toString();
}
}
Result after converting to HTML:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
}
Result after converting to HTML:
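From the results above we can see that the HTML output is not ideal: many style details are not reproduced. This is because such conversions usually aim to produce simple, clean HTML from the document's semantic information while ignoring other details, so complex styling such as centering, first-line indentation, fonts, text size, and color is dropped during conversion. For example, any paragraph with the Heading 1 style is converted to an h1 element rather than an attempt to reproduce the heading's exact styling. As a result, the HTML usually looks different from the original document, and for more complex documents the conversion is unlikely to be perfect. If your documents use only simple styles, or you do not care much about styling, this approach is still worth a try.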
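To make the "semantic" idea concrete, here is a toy sketch of such a mapping. It is not how any of the libraries above are implemented; the class `SemanticStyleMapper` and its style table are assumptions chosen for illustration, and the text is assumed to be HTML-escaped already.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SemanticStyleMapper {

    // Paragraph style name -> HTML tag; styles that are not listed become plain paragraphs
    private static final Map<String, String> STYLE_TO_TAG = new LinkedHashMap<>();

    static {
        STYLE_TO_TAG.put("Heading 1", "h1");
        STYLE_TO_TAG.put("Heading 2", "h2");
        STYLE_TO_TAG.put("Heading 3", "h3");
    }

    // Keeps only the semantic role of the style; fonts, colors, and alignment are discarded
    static String toHtml(String styleName, String escapedText) {
        String tag = STYLE_TO_TAG.getOrDefault(styleName, "p");
        return "<" + tag + ">" + escapedText + "</" + tag + ">";
    }
}
```

A paragraph styled "Heading 1" becomes an `h1` element, while a paragraph with a purely visual style falls through to `p` and loses its formatting, which is the fidelity gap described above.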
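PS: For better display fidelity, you can combine this article with the previous one, "Online document preview (1): implementing online preview by converting txt, word, and pdf to images": render the document's content as images (most likely several), put all of them into a single HTML page, use HTML and CSS to keep the layout and display the images, and return that HTML. The open-source component kkFileView uses this approach.
kkFileView's display looks like this:
The figure below shows the HTML returned by kkFileView. From the HTML we can see that kkFileView converts each page of a file (except txt files) into an image, embeds these images in one HTML page, and returns that page to the user.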
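The image-based approach can be sketched in a few lines. The code below is not kkFileView's implementation; it only illustrates wrapping a list of already generated page images in one HTML page. `ImagePageBuilder` and the CSS rules are assumptions made for this example.

```java
import java.util.List;

final class ImagePageBuilder {

    // Builds one HTML page that shows each page image below the previous one
    static String buildPage(List<String> imageUrls) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset=\"UTF-8\">");
        html.append("<style>img { display: block; margin: 0 auto 16px; max-width: 100%; }</style>");
        html.append("</head><body>");
        for (String url : imageUrls) {
            // Escape characters that would break out of the src attribute
            String safeUrl = url.replace("&", "&amp;").replace("\"", "&quot;");
            html.append("<img src=\"").append(safeUrl).append("\" alt=\"page\">");
        }
        html.append("</body></html>");
        return html.toString();
    }
}
```

The URLs would come from whatever step renders the document pages to images, as described in the previous article.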