Problem description
How can I read a .doc file in Java without the text coming out garbled?
Solution 1:
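If you do not need to extract the images, you can use the method below. Use the first half for Word 2003 files and the commented-out second half for Word 2007 files.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    // Place this method inside any class.
    public static void testWord1() {
        // Backslashes in a Java string literal must be escaped: "c:\\a.doc"
        try (InputStream is = new FileInputStream(new File("c:\\a.doc"))) {
            // Word 2003: images are not read
            WordExtractor ex = new WordExtractor(is);
            String text2003 = ex.getText().trim();
            System.out.println(text2003);
            // Word 2007: images are not read, and table data is placed at the end of the string
            // (the file must actually be in the Word 2007 format)
            // OPCPackage opcPackage = POIXMLDocument.openPackage("c:\\a.doc");
            // POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
            // String text2007 = extractor.getText();
            // System.out.println(text2007);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }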
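Choosing between the two halves can be done from the file content instead of the file name. The following is only a sketch under stated assumptions: Word 2003 files are OLE2 compound documents that start with the bytes D0 CF 11 E0 A1 B1 1A E1, and Word 2007 files are ZIP archives that start with "PK" 03 04. These signatures are shared with other formats (for example .xls and plain .zip), so a match only suggests which POI extractor to try. The names WordFormatSniffer, detect and detectFile are made up for this example and are not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class WordFormatSniffer {
    /** Container format guessed from the first bytes of the file. */
    public enum Format { OLE2_DOC, OOXML_DOCX, UNKNOWN }

    // Signature of an OLE2 compound document (used by Word 2003 .doc files).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Signature of a ZIP local file header (used by Word 2007 .docx files).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Guesses the format from the leading bytes of a file. */
    public static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2_DOC;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.OOXML_DOCX;
        }
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and guesses its format. */
    public static Format detectFile(String path) throws IOException {
        byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
        }
        return detect(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(args[0] + " -> " + detectFile(args[0]));
    }
}
```

With this, OLE2_DOC would point to the WordExtractor half and OOXML_DOCX to the XWPFWordExtractor half.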
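Solution 2:
That is not much use: it cannot read tables or images, and it does not even recover the most basic formatting.
Solution 3:
This is clearly the kind of question you can answer by checking the API docs!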
Solution 4:
Use POI:

    package org.apache.poi.hwpf;

    import org.apache.poi.hwpf.model.FileInformationBlock;
    import org.apache.poi.poifs.filesystem.DocumentEntry;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;
    import org.apache.poi.POIDataSamples;

    public final class HWPFDocFixture
    {
        public static final String DEFAULT_TEST_FILE = "test.doc";

        public byte[] _tableStream;
        public byte[] _mainStream;
        public FileInformationBlock _fib;
        private String _testFile;

        public HWPFDocFixture(Object obj, String testFile)
        {
            _testFile = testFile;
        }

        public void setUp()
        {
            try
            {
                POIFSFileSystem filesystem = new POIFSFileSystem(
                        POIDataSamples.getDocumentInstance().openResourceAsStream(_testFile));

                DocumentEntry documentProps =
                        (DocumentEntry) filesystem.getRoot().getEntry("WordDocument");
                _mainStream = new byte[documentProps.getSize()];
                filesystem.createDocumentInputStream("WordDocument").read(_mainStream);

                // use the fib to determine the name of the table stream.
                _fib = new FileInformationBlock(_mainStream);

                String name = "0Table";
                if (_fib.getFibBase().isFWhichTblStm())
                {
                    name = "1Table";
                }

                // read in the table stream.
                DocumentEntry tableProps =
                        (DocumentEntry) filesystem.getRoot().getEntry(name);
                _tableStream = new byte[tableProps.getSize()];
                filesystem.createDocumentInputStream(name).read(_tableStream);

                _fib.fillVariableFields(_mainStream, _tableStream);
            }
            catch (Throwable t)
            {
                t.printStackTrace();
            }
        }

        public void tearDown()
        {
        }

    }
Solution 5:
http://download.csdn.net/detail/hcs371239924/3761147
Solution 6:
If the document contains only text, with no images, tables and so on, you can use the method below.
First download jacob: http://sourceforge.net/project/showfiles.php?group_id=109543&package_id=118368
You need to put jacob-1.15-M4-x86.dll in system32 and in the JDK's bin directory.
The code first converts the Word document to txt; you then read the text from the txt file.

    import com.jacob.activeX.ActiveXComponent;
    import com.jacob.com.Dispatch;
    import com.jacob.com.Variant;

    public class WordReader1 {
        public static void extractDoc(String inputFile, String outputFile) {
            boolean flag = false;
            // Start the Word application
            ActiveXComponent app = new ActiveXComponent("Word.Application");
            try {
                // Keep Word invisible
                app.setProperty("Visible", new Variant(false));
                // Open the Word file
                Dispatch doc1 = app.getProperty("Documents").toDispatch();
                Dispatch doc2 = Dispatch.invoke(doc1, "Open", Dispatch.Method,
                        new Object[] { inputFile, new Variant(false), new Variant(true) },
                        new int[1]).toDispatch();
                // Save to a temporary file in txt format
                Dispatch.invoke(doc2, "SaveAs", Dispatch.Method,
                        new Object[] { outputFile, new Variant(7) }, new int[1]);
                // Close the Word document
                Variant f = new Variant(false);
                Dispatch.call(doc2, "Close", f);
                flag = true;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                app.invoke("Quit", new Variant[] {});
            }
            if (flag) {
                System.out.println("Transformed Successfully");
            } else {
                System.out.println("Transform Failed");
            }
        }

        public static void main(String[] args) {
            WordReader1.extractDoc("c:/a.doc", "c:/a.txt");
        }
    }
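The reading step is not shown, and it is where garbling can come back in. As far as I know, the value 7 passed to SaveAs is Word's Unicode text format, so the .txt file is expected to be UTF-16 with a byte order mark; reading it with the platform default charset would then garble it. A minimal sketch under that assumption follows. WordTxtReader, decode and readFile are names invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordTxtReader {
    /**
     * Decodes bytes that are assumed to be UTF-16 text.
     * Java's UTF-16 decoder uses a leading byte order mark to pick the
     * byte order and drops it; without a mark it assumes big-endian.
     */
    public static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    /** Reads the whole txt file and decodes it as UTF-16. */
    public static String readFile(String path) throws IOException {
        return decode(Files.readAllBytes(Paths.get(path)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordTxtReader <file.txt>");
            return;
        }
        System.out.println(readFile(args[0]));
    }
}
```

If the output still looks wrong, the file was probably saved in a different encoding, and the charset passed to the String constructor has to match it.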
Solution 7:
Set the encoding in POI.
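The reply does not say which setting is meant. One place where an encoding can certainly be chosen is on output: WordExtractor.getText() returns an ordinary Java String, and writing that string to a file or console with the platform default charset can produce garbled text. The sketch below assumes UTF-8 is the encoding you want; Utf8Output and printUtf8 are illustrative names, not POI API.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class Utf8Output {
    /**
     * Writes text to the stream as UTF-8, regardless of the platform
     * default charset. The stream is flushed but left open.
     */
    public static void printUtf8(String text, OutputStream out) {
        PrintWriter writer =
                new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8), true);
        writer.print(text);
        writer.flush();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by WordExtractor.getText().
        String extracted = "\u4e2d\u6587 text";
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        printUtf8(extracted, buffer);
        System.out.println(buffer.size() + " bytes written as UTF-8");
    }
}
```

The same method works with a FileOutputStream when the text should go to a file.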