Why build this tool
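Recently I had a requirement where the front end uploads a manuscript in Word format. After the upload, the user should be able to view the manuscript directly in the browser and quote it directly in a rich-text editor. What the back end saves to the database therefore has to be HTML, which means converting Word to HTML.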
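After some research, there are two ways to handle the Word-to-HTML conversion. Option 1: the front end performs the conversion. Option 2: the Word document is uploaded to the back end, which converts it and returns the result to the front end. Feedback on option 1 reported many problems, so I did not adopt front-end conversion. The final decision was for the back end to convert the document to HTML and return it to the front end for preview; once the user has previewed it and confirmed that the formatting is fine, the HTML is saved to the back end. The preview step is needed because Word involves many complex elements (images, Visio diagrams, tables, and so on), and converting them to HTML can produce many formatting problems.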
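Plain text in a Word document is not much of a problem; the main difficulty is non-text elements such as images, videos, and tables. This article handles only images, as follows: the back end extracts the images from the Word document (the imported jar already provides this capability), uploads them to the server, obtains their absolute URLs, and puts those URLs into the HTML. The HTML returned to the front end can then be previewed directly.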
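The image step described above can be sketched as follows, with the local file system standing in for the image server. The class name `ImageStore`, the `store` method, and the URL prefix are illustrative assumptions, not the author's utility; a real deployment would upload to its own storage and should sanitize `fileName` first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: saves one extracted picture and returns the URL to place in <img src>.
final class ImageStore {
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

    static String store(Path root, String urlPrefix, String fileName, byte[] content, LocalDate today)
            throws IOException {
        // Group pictures by month, e.g. <root>/202403/pic1.png
        String dateDir = MONTH.format(today);
        Path dir = root.resolve(dateDir);
        Files.createDirectories(dir);
        Files.write(dir.resolve(fileName), content);
        // The returned absolute URL mirrors the storage layout
        return urlPrefix + "/" + dateDir + "/" + fileName;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("images");
        String url = store(root, "https://img.example.com", "pic1.png", new byte[] {1, 2, 3}, LocalDate.now());
        System.out.println("<img src=\"" + url + "\"/>");
    }
}
```

The converter's picture callback would call something like `store` for each picture and use the returned URL as the `src` attribute.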
The Maven properties and dependencies are as follows:
<poi-scratchpad.version>3.14</poi-scratchpad.version>
<poi-ooxml.version>3.14</poi-ooxml.version>
<xdocreport.version>1.0.6</xdocreport.version>
<poi-ooxml-schemas.version>3.14</poi-ooxml-schemas.version>
<ooxml-schemas.version>1.3</ooxml-schemas.version>
<jsoup.version>1.11.3</jsoup.version>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>${poi-scratchpad.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>${poi-ooxml.version}</version>
</dependency>
<dependency>
<groupId>fr.opensagres.xdocreport</groupId>
<artifactId>xdocreport</artifactId>
<version>${xdocreport.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml-schemas</artifactId>
<version>${poi-ooxml-schemas.version}</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>ooxml-schemas</artifactId>
<version>${ooxml-schemas.version}</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>${jsoup.version}</version>
</dependency>
Word 2003 and Word 2007 files are converted to HTML in different ways because their file formats differ. The utility class is as follows:
It is used as follows:
public String uploadSourceNews(MultipartFile file) {
    String fileName = file.getOriginalFilename();
    // Reject a missing name or a name without an extension
    if (fileName == null || !fileName.contains(".")) {
        throw new UploadFileFormatException();
    }
    // suffixName keeps the leading dot, e.g. ".doc"
    String suffixName = fileName.substring(fileName.lastIndexOf("."));
    if (!".doc".equals(suffixName) && !".docx".equals(suffixName)) {
        throw new UploadFileFormatException();
    }
    // Images are stored under a per-month directory such as 202403/
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMM");
    String dateDir = formatter.format(LocalDate.now());
    String directory = imageDir + "/" + dateDir + "/";
    String content = null;
    try (InputStream inputStream = file.getInputStream()) {
        // .doc (Word 2003 binary) and .docx (Word 2007+) need different converters
        if (".doc".equals(suffixName)) {
            content = wordToHtmlUtil.Word2003ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
        } else {
            content = wordToHtmlUtil.Word2007ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
        }
    } catch (Exception ex) {
        logger.error("word to html exception, detail:", ex);
        return null;
    }
    return content;
}
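Extension checks like the one above are easy to get subtly wrong (leading dot or not, upper-case names, names without any extension). A minimal sketch of a normalizing helper follows; `WordExt`, `of`, and `isWord` are hypothetical names, not part of the author's code.

```java
import java.util.Locale;

// Hypothetical helper: extracts a lower-case extension without the dot, or "" if there is none.
final class WordExt {
    static String of(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        // No dot at all, or a trailing dot, means no usable extension
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isWord(String fileName) {
        String ext = of(fileName);
        return ext.equals("doc") || ext.equals("docx");
    }

    public static void main(String[] args) {
        System.out.println(of("Report.DOCX"));   // docx
        System.out.println(isWord("notes.txt")); // false
    }
}
```

Comparing against a dot-free, lower-cased extension keeps every later comparison in one consistent form.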
Some background on how doc and docx files are stored:
DOCX is an XML-based word-processing format developed by Microsoft. DOCX files differ from DOC files in that they store data as separate compressed files and folders. Versions of Microsoft Office earlier than Office 2007 do not support DOCX files, because DOCX is XML-based, whereas those earlier versions save a DOC file as a single binary file.
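Because the two containers differ, they can also be told apart by their first bytes rather than by file extension. The sketch below checks for the OLE2 compound-file signature used by binary Office files and the ZIP local-header signature used by OOXML files. `ContainerSniffer` is a hypothetical name, and the check identifies only the container: OLE2 is also used by .xls and .ppt, and ZIP by .xlsx and any ordinary archive.

```java
import java.util.Arrays;

// Hypothetical helper: identifies the container type from the leading bytes of a file.
final class ContainerSniffer {
    // OLE2 compound file signature (binary .doc, .xls, .ppt)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature "PK\3\4" (.docx, .xlsx, .pptx, plain archives)
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00})); // ZIP
    }
}
```

Reading the first eight bytes of an upload and passing them to `sniff` is enough; a renamed file is then still routed by its real container.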
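You may ask: the file name clearly ends in .docx, so how can it be XML?
The answer is simple: pick any docx file, right-click it, and open it with an archive tool. You will see a directory structure like this:
So what you took for a single document is in fact a compressed archive.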
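The same inspection can be done in code with the JDK's ZIP classes. The sketch below lists the entry names of an archive; to stay self-contained it builds a tiny stand-in archive with two docx-like entry names rather than reading a real Word file. `DocxEntries` and `sampleArchive` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxEntries {
    // Lists the entry names of any ZIP container, which is what a .docx file is.
    static List<String> list(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Builds a stand-in archive with docx-like entry names (not a valid Word document).
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            for (String name : new String[] {"[Content_Types].xml", "word/document.xml"}) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(list(sampleArchive())); // [[Content_Types].xml, word/document.xml]
    }
}
```

For a real file, pass `Files.readAllBytes(path)` to `list` instead of the sample archive.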
Reference:
https://www.cnblogs.com/ct-csu/p/8178932.html
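Besides converting documents to images, as described in the previous article "Online Document Preview, New Edition (1): Previewing by Converting Files to Images", online preview can also be implemented by converting documents to PDF and rendering them on the front end with plugins such as pdf.js or pdfobject.js, or by converting documents to HTML, which is the subject of this article.
The code below provides implementations based on Aspose, the open-source POI and PDFBox libraries, and Spire for converting Word, PDF, and Excel files to HTML.
Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports, and SSRS. Thousands of organizations in dozens of countries have used Aspose components to create, edit, convert, or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD, and other file formats. Note that Aspose is a commercial product; without a license, exported files are watermarked (respect copyright and stay away from cracked versions).
Add the following dependencies to the project's pom file: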
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-words</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-pdf</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-cells</artifactId>
<version>23.1</version>
</dependency>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-slides</artifactId>
<version>23.1</version>
</dependency>
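Aspose and Spire are convenient, but both are commercial components, so an approach based on open-source libraries is provided here as well.
POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. Apache POI gives Java programs the ability to read and write files in Microsoft Office formats.
Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert, and manipulate PDF documents.
Add the following dependencies to the project's pom file. Note that the code below also uses classes these dependencies do not provide: PDFDomTree (from the pdf2dom library), XHTMLConverter (from the xdocreport XWPF-to-XHTML converter), and ImgUtil (from Hutool), so those libraries need to be on the classpath too.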
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.4</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>5.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-excelant</artifactId>
<version>5.2.0</version>
</dependency>
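Spire is a professional Office programming component that covers reading, writing, editing, and viewing Word, Excel, PPT, PDF, and other files. Spire offers a free edition, but it exports only the first 3 pages and only the first 500 rows; reaching either limit triggers the restriction. Going beyond these limits requires the paid edition (respect copyright and stay away from cracked versions). The free edition is used for the demonstration here.
Before adding the Spire dependency to the pom, first add its Maven repository: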
<repository>
<id>com.e-iceblue</id>
<name>e-iceblue</name>
<url>https://repo.e-iceblue.cn/repository/maven-public/</url>
</repository>
Then add the following dependency to the project's pom file.
Free edition:
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.office.free</artifactId>
<version>5.3.1</version>
</dependency>
Paid edition:
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.office</artifactId>
<version>5.3.1</version>
</dependency>
public static String wordToHtmlStr(String wordPath) {
    try {
        Document doc = new Document(wordPath); // wordPath is the Word document to convert
        // Export the whole document as an HTML string (SaveFormat is com.aspose.words.SaveFormat)
        return doc.toString(SaveFormat.HTML);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
Verification result:
public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
String htmlStr = null;
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
htmlStr = word2007ToHtmlStr(wordPath);
} else if (ext.equals(".doc")){
htmlStr = word2003ToHtmlStr(wordPath);
} else {
throw new RuntimeException("Unsupported file format");
}
return htmlStr;
}
public String word2007ToHtmlStr(String wordPath) throws IOException {
    // Convert into an in-memory output stream and return its contents as a string
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        word2007ToHtmlOutputStream(wordPath, out);
        return out.toString();
    }
}
private void word2007ToHtmlOutputStream(String wordPath, OutputStream out) throws IOException {
    // Caution: a negative ratio effectively disables POI's zip-bomb check; use only with trusted files
    ZipSecureFile.setMinInflateRatio(-1.0d);
    try (InputStream in = Files.newInputStream(Paths.get(wordPath));
         XWPFDocument document = new XWPFDocument(in)) {
        // Embed images as Base64 so the generated HTML is self-contained
        XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
        XHTMLConverter.getInstance().convert(document, out, options);
    }
}
private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// Transform document to string
StringWriter writer = new StringWriter();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
return writer.toString();
}
private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
    try (InputStream input = Files.newInputStream(Paths.get(wordPath))) {
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            // Skip pictures whose type POI cannot identify
            if (PictureType.UNKNOWN.equals(pictureType)) {
                return null;
            }
            BufferedImage bufferedImage = ImgUtil.toImage(content);
            String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
            // Embed each picture as a Base64 data URI so everything lives in a single page
            return "data:;base64," + base64Img;
        });
        // Parse the Word document
        wordToHtmlConverter.processDocument(wordDocument);
        return wordToHtmlConverter.getDocument();
    }
}
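The picture manager above emits `data:;base64,` with no media type, which leaves the browser to guess the image format. A minimal sketch of building a data URI with an explicit media type, using only the JDK, follows. `DataUri` and its extension-to-type mapping are assumptions for illustration and cover only a few common formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

// Hypothetical helper: builds a data URI that carries an explicit media type.
final class DataUri {
    static String of(String extension, byte[] content) {
        String mime;
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "jpg":
            case "jpeg":
                mime = "image/jpeg";
                break;
            case "png":
                mime = "image/png";
                break;
            case "gif":
                mime = "image/gif";
                break;
            case "bmp":
                mime = "image/bmp";
                break;
            default:
                // Unknown extension: fall back to a generic binary type
                mime = "application/octet-stream";
        }
        return "data:" + mime + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        System.out.println(of("png", "hi".getBytes(StandardCharsets.US_ASCII))); // data:image/png;base64,aGk=
    }
}
```

Inside the picture callback, the raw `content` bytes and `pictureType.getExtension()` could be passed straight to such a helper, which would also avoid decoding and re-encoding the image.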
public String wordToHtmlStr(String wordPath) throws IOException {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(outputStream, FileFormat.Html);
return outputStream.toString();
}
}
public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
Verification result:
public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new StringWriter();
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
return writer.toString();
}
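public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
    try (ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        // Write the HTML into the in-memory stream; without this step the result is an empty string
        pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
        pdf.close();
        return outputStream.toString();
    }
}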
public static String excelToHtmlStr(String excelPath) throws Exception {
FileInputStream fileInputStream = new FileInputStream(excelPath);
Workbook workbook = new XSSFWorkbook(fileInputStream);
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
htmlStringBuilder.append("</head><body><table>");
for (Row row : sheet) {
htmlStringBuilder.append("<tr>");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
}
htmlStringBuilder.append("</tr>");
}
htmlStringBuilder.append("</table></body></html>");
// Release the workbook and the underlying file stream
workbook.close();
fileInputStream.close();
return htmlStringBuilder.toString();
}
The returned HTML string:
<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>No.</td><td>Name</td><td>Gender</td><td>Contact</td><td>Address</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr></table></body></html>
public String excelToHtmlStr(String excelPath) throws Exception {
// WorkbookFactory detects .xls vs .xlsx and opens the file itself, so no separate stream is needed
try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
StringBuilder htmlStringBuilder = new StringBuilder();
htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
htmlStringBuilder.append("</head><body><table>");
for (Row row : sheet) {
htmlStringBuilder.append("<tr>");
for (Cell cell : row) {
CellType cellType = cell.getCellType();
if (cellType == CellType.FORMULA) {
formulaEvaluator.evaluateFormulaCell(cell);
cellType = cell.getCachedFormulaResultType();
}
String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
}
htmlStringBuilder.append("</tr>");
}
htmlStringBuilder.append("</table></body></html>");
return htmlStringBuilder.toString();
}
}
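The loop above writes cell text into `<td>` elements unescaped, so a cell containing `<` or `&` would corrupt the markup or inject tags. A minimal escaping sketch using only the JDK follows; `HtmlEscape` is a hypothetical helper, and a library escaper could be substituted.

```java
// Hypothetical helper: escapes the characters that are significant in HTML text and attributes.
final class HtmlEscape {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps an escaped cell value in a table cell.
    static String cell(String value) {
        return "<td>" + escape(value) + "</td>";
    }

    public static void main(String[] args) {
        System.out.println(cell("a<b & c")); // <td>a&lt;b &amp; c</td>
    }
}
```

In the Excel loop, `append("<td>").append(cellValue).append("</td>")` would become `append(HtmlEscape.cell(cellValue))`.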
public String excelToHtmlStr(String excelPath) throws Exception {
try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
return outputStream.toString();
}
}
Sometimes an HTML string is not enough and an actual HTML file is needed. A low-effort way to get one is to write the string to the target path with the FileUtils class from the org.apache.commons.io package.
First add the dependency to the pom:
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.8.0</version>
</dependency>
The relevant code:
// The source is a .doc file, so the Word converter is the one to call
String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\書籍\\電子書\\小說\\歷史小說\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
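If adding commons-io is not desirable, the same write can be done with the JDK alone. The sketch below is an alternative under that assumption, not the author's code; `HtmlFileWriter` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes an HTML string to a file as UTF-8, creating parent directories.
final class HtmlFileWriter {
    static Path write(Path target, String html) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // An explicit charset avoids depending on the platform default encoding
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("html-out");
        Path written = write(dir.resolve("sub").resolve("doc.html"), "<html><body>preview</body></html>");
        System.out.println(written);
    }
}
```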
Alternatively, the code above can be adjusted to generate the HTML file directly. The adjusted code is as follows:
Original Word file:
public static void wordToHtml(String wordPath, String htmlPath) {
    File sourceFile = new File(wordPath);
    String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
    // Create the target HTML file; try-with-resources closes the stream
    try (FileOutputStream os = new FileOutputStream(new File(path))) {
        Document doc = new Document(wordPath); // wordPath is the Word document to convert
        HtmlSaveOptions options = new HtmlSaveOptions();
        options.setExportImagesAsBase64(true);   // inline images as Base64
        options.setExportRelativeFontSize(true); // use relative font sizes
        doc.save(os, options);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Result after conversion to HTML:
public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
String ext = wordPath.substring(wordPath.lastIndexOf("."));
if (ext.equals(".docx")) {
word2007ToHtml(wordPath, htmlPath);
} else if (ext.equals(".doc")){
word2003ToHtml(wordPath, htmlPath);
} else {
throw new RuntimeException("Unsupported file format");
}
}
public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
//try(OutputStream out = Files.newOutputStream(Paths.get(path))){
try(FileOutputStream out = new FileOutputStream(htmlPath)){
word2007ToHtmlOutputStream(wordPath, out);
}
}
public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
// Write the HTML file to htmlPath
try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(outStream);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer serializer = factory.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
}
}
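The listing above calls `FileUtil.getNewFileFullPath`, which the article does not show. The sketch below is a guess at its behavior based on how it is called: keep the source file's base name, swap the extension, and place the result in the target directory. The class name `FileUtilSketch` and the body are assumptions.

```java
import java.io.File;

// Hypothetical reconstruction of the path helper used by the conversion methods.
final class FileUtilSketch {
    static String getNewFileFullPath(String sourcePath, String targetDir, String ext) {
        String name = new File(sourcePath).getName();
        int dot = name.lastIndexOf('.');
        // Keep the whole name if there is no extension (or the name starts with a dot)
        String base = dot > 0 ? name.substring(0, dot) : name;
        return targetDir + File.separator + base + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(getNewFileFullPath("/data/in/report.docx", "/data/out", "html"));
    }
}
```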
Result after conversion to HTML:
public void wordToHtml(String wordPath, String htmlPath) {
htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
Document document = new Document();
document.loadFromFile(wordPath);
document.saveToFile(htmlPath, FileFormat.Html);
}
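Result after conversion to HTML:
Because the free edition is used, page and word-count limits apply; choose the paid edition if you need full functionality. PS: this time, the first 50 pages of the 76-page document were, surprisingly, converted successfully.
Original image-based PDF file:
Original text-based PDF file: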
public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
File file = new File(pdfPath);
String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
Verification result for the image-based PDF:
Verification result for the text-based PDF:
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PDDocument document = PDDocument.load(new File(pdfPath));
Writer writer = new PrintWriter(path, "UTF-8");
new PDFDomTree().writeText(document, writer);
writer.close();
document.close();
}
Verification result for the image-based PDF:
Original text-based PDF file:
public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
PdfDocument pdf = new PdfDocument();
pdf.loadFromFile(pdfPath);
pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
}
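Verification result for the image-based PDF:
Because the free edition is used, only the first three pages come out correctly. Choose the paid edition if you need more than three pages.
Original text-based PDF file:
This one threw an error and could not be converted: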
java.lang.NullPointerException
at com.spire.pdf.PdfPageWidget.spr┢?(Unknown Source)
at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
at com.spire.pdf.PdfPageBase.spr???—(Unknown Source)
at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr︻┎?—(Unknown Source)
at com.spire.pdf.general.PdfDestination.spr┻┑?—(Unknown Source)
at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘?—(Unknown Source)
at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr??(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.spr┦?(Unknown Source)
at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
at com.spire.pdf.PdfDocumentBase.spr┞?(Unknown Source)
at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)
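Since a back end can fail outright on some inputs, as the trace above shows, one option is to try several converters in order and use the first that succeeds. The sketch below illustrates that pattern with stand-in lambdas; `FallbackConverter` and its `Converter` interface are hypothetical, and real entries would wrap the Aspose, PDFBox, or Spire methods from this article.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: tries converters in order and returns the first successful result.
final class FallbackConverter {
    interface Converter {
        String convert(String path) throws Exception;
    }

    static String convertWithFallback(String path, List<Converter> converters) throws Exception {
        Exception last = null;
        for (Converter c : converters) {
            try {
                return c.convert(path);
            } catch (Exception e) {
                // Also catches runtime failures such as NullPointerException
                last = e;
            }
        }
        throw new Exception("all converters failed for " + path, last);
    }

    public static void main(String[] args) throws Exception {
        Converter broken = p -> { throw new NullPointerException("simulated failure"); };
        Converter working = p -> "<html><body>" + p + "</body></html>";
        System.out.println(convertWithFallback("a.pdf", Arrays.asList(broken, working)));
        // prints <html><body>a.pdf</body></html>
    }
}
```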
Original Excel file:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook(excelPath);
com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
workbook.save(htmlPath, options);
}
Result after conversion to HTML:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
    String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
    try (FileOutputStream fileOutputStream = new FileOutputStream(path)) {
        String htmlStr = excelToHtmlStr(excelPath);
        // Encode explicitly as UTF-8 (java.nio.charset.StandardCharsets) instead of the platform default
        byte[] bytes = htmlStr.getBytes(StandardCharsets.UTF_8);
        fileOutputStream.write(bytes);
    }
}
Result after conversion to HTML:
public void excelToHtml(String excelPath, String htmlPath) throws Exception {
htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
Workbook workbook = new Workbook();
workbook.loadFromFile(excelPath);
workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
}
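Result after conversion to HTML:
The results above show that the HTML output is not ideal: many style details are not reproduced. This is because such converters usually aim to produce simple, clean HTML from the document's semantic information while ignoring other details, so complex styling such as centering, first-line indentation, fonts, text size, and color is dropped during conversion. For example, any paragraph with the Heading 1 style is converted to an h1 element instead of reproducing the heading's exact appearance. The HTML therefore often looks different from the original document, and for complex documents the conversion is unlikely to be perfect. If your documents use only simple styles, or you do not care much about styling, this approach is still worth a try.
PS: for better visual fidelity, combine this article with the previous one, "Online Document Preview (1): Previewing by Converting txt, Word, and PDF to Images". That is, render the document's content to images (most likely several), put all of them into one HTML page, use HTML and CSS to preserve the layout and display the images, and return that HTML. The open-source component kkFileView takes this approach.
kkFileView renders the document as follows:
The figure below shows the HTML returned by kkFileView. From the HTML we can see that kkFileView converts each page of a file (txt files excepted) to an image, embeds those images in a single HTML page, and returns that page to the user.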
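The image-per-page idea can be sketched as a small page builder. This is an illustration of the approach described above, not kkFileView's actual code; `ImagePage`, the CSS, and the example URLs are assumptions, and the title and URLs are assumed to be trusted or already escaped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps one image per document page in a single HTML page.
final class ImagePage {
    static String build(String title, List<String> imageUrls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html><html><head><meta charset=\"UTF-8\"><title>")
          .append(title)
          .append("</title><style>img{display:block;width:100%;margin:0 auto 8px;}</style></head><body>");
        for (int i = 0; i < imageUrls.size(); i++) {
            // One <img> per rendered page, stacked vertically by the CSS above
            sb.append("<img src=\"").append(imageUrls.get(i))
              .append("\" alt=\"page ").append(i + 1).append("\">");
        }
        return sb.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        System.out.println(build("preview", Arrays.asList("p1.png", "p2.png")));
    }
}
```

The image URLs would come from the document-to-image conversion covered in the previous article.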