I would like to ask if anyone knows a way to repair a partly overwritten Word document. The header has also been overwritten, and I see only garbled characters when viewing it.
Any suggestions?
Do you know which version of Word the document was generated by? I.e., is it in the old binary format, or the newer zip/xml format (Office 2007/2010)?
Joakim
For the .DOC (not .DOCX) it may be OK just to search for and extract the text data.
HOWEVER, you should be aware that when such documents are edited, the original text remains, and the modifications are stored elsewhere in the document. Only when a full save is done is everything put into the single text block.
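A minimal sketch of the "search for and extract the text" idea, in Python. It assumes the text in the fragment is stored either as 8-bit characters or as UTF-16LE, and it only looks for printable ASCII runs, so accented or non-Latin text would need a wider character class. The function name and the minimum run length are made up for illustration. It knows nothing about the .DOC structure, so it will return superseded text along with current text, in file order.

```python
import re


def extract_text_runs(blob, min_chars=8):
    """Return (offset, text) for each printable run found in blob.

    Hypothetical helper: a 'strings'-style scan, not a .DOC parser.
    """
    runs = []
    # Runs of printable ASCII (plus tab) stored one byte per character.
    for m in re.finditer(rb"[\x20-\x7e\t]{%d,}" % min_chars, blob):
        runs.append((m.start(), m.group().decode("ascii")))
    # The same characters stored as UTF-16LE (each followed by a 0x00 byte).
    for m in re.finditer(rb"(?:[\x20-\x7e\t]\x00){%d,}" % min_chars, blob):
        runs.append((m.start(), m.group().decode("utf-16-le")))
    return sorted(runs)
```

Typical use would be `extract_text_runs(open(path, "rb").read())` on the damaged file or on a carved fragment.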
If it is a docx fragment, then it may also be possible to reconstruct parts of the document. The individual xml parts of the docx are basically attached after each other in the zip. They are all by default compressed by deflate method, and because of that it may be possible to decompress those parts by raw method, or attempt repair if the compressed parts are corrupt.
(I am currently working on a tiny app to extract such parts and attempt decompression if found..)
Joakim
mscotgrove: Yes, I did have a look at a corrupt Word 2003 document and the plain text could be seen. But with Word 2007, no plain text is visible if the header has been overwritten.
joakims: Can I ask what methods you use to extract the individual xml files and decompress them to get the data?
I do this manually. To recover 1 xml part, all you need is the local file header inside the zip with the compressed data attached immediately after. The signature is 0x504b0304 (the bytes 50 4B 03 04, i.e. "PK" followed by 03 04). You do not strictly need any information from the "zip central directory file header" or the "zip end of central directory record". If parts of the local file header are also damaged, you can make a good guess where the actual compressed data is located, as the internal file name is always right before the compressed data (unless the "Extra field" is in use). The end of the compressed data is then always followed by 1 of the 4 known zip structure signatures. The internal file name is always visible in clear text and easy to spot. The first is usually [Content_Types].xml. When you have isolated 1 such chunk, you can:
1. Recreate a dummy zip structure around it and unpack with standard tools.
2. Attempt decompression by inflate (it was compressed by deflate) using raw method (advanced feature of the zlib library with little documentation). This way you can feed it with raw compressed data, ie without any headers and footers around (knowledge of name, size, checksum etc).
But if the compressed data is corrupt, you may need to recreate a dummy zip structure around it and let some zip repair software work on it.
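Option 2 above can be sketched in Python roughly like this. It is only an illustration, not the application mentioned in this thread: the function name is made up, it assumes the 30-byte local file header is intact, and it assumes the entry uses the deflate method (8). The negative window size passed to zlib is what selects raw mode. The sizes and checksum in the header are ignored, so a missing central directory does not matter.

```python
import struct
import zlib

LOCAL_HEADER_SIG = b"PK\x03\x04"  # the bytes 50 4B 03 04


def inflate_local_entry(blob, offset=0):
    """Raw-inflate the zip entry whose local file header starts at offset.

    Hypothetical helper. Returns (internal_name, data, complete), where
    complete is False if the deflate stream ended prematurely.
    """
    # Fixed 30-byte local file header, all fields little-endian.
    (sig, _version, _flags, method, _time, _date,
     _crc32, _csize, _usize, name_len, extra_len) = struct.unpack_from(
        "<4s5H3L2H", blob, offset)
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("no local file header signature at this offset")
    if method != 8:
        raise ValueError("entry does not use deflate (method %d)" % method)
    name_start = offset + 30
    name = blob[name_start:name_start + name_len].decode("cp437", "replace")
    # Compressed data follows the file name and the optional extra field.
    data_start = name_start + name_len + extra_len
    # Negative wbits means raw deflate: no zlib/gzip header or trailer.
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    # Decompression stops at the end of the deflate stream; whatever
    # follows (the next zip structure) ends up in inflater.unused_data.
    data = inflater.decompress(blob[data_start:])
    return name, data, inflater.eof
```

If the header itself is damaged, the same `zlib.decompressobj(-zlib.MAX_WBITS)` call can be pointed at a guessed start offset instead. If the compressed data is corrupt, zlib will usually raise `zlib.error`, and the zip repair route is the fallback.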
Since this is tedious, I have been working on an application to automate this process. That is, feed it a chunk of binary data (slack space, free space, etc.) and let it rip out any zip fragments it finds (originally meant to be specialized for Office 2007/2010).
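To give an idea of the first step of such a tool, here is a rough Python sketch that scans a binary blob for zip structure signatures. It is not the application described above. The four signatures listed are an assumption about which four are meant (local file header, central directory file header, end of central directory record, data descriptor), and the names are made up for illustration. The same byte patterns can occur by chance inside compressed data, so hits are only candidates.

```python
import re

# Assumed set of the four zip structure signatures.
ZIP_SIGNATURES = {
    b"PK\x03\x04": "local file header",
    b"PK\x01\x02": "central directory file header",
    b"PK\x05\x06": "end of central directory record",
    b"PK\x07\x08": "data descriptor",
}
_SIG_RE = re.compile(rb"PK(?:\x03\x04|\x01\x02|\x05\x06|\x07\x08)")


def find_zip_structures(blob):
    """Return (offset, kind) for every candidate zip signature in blob."""
    return [(m.start(), ZIP_SIGNATURES[m.group()])
            for m in _SIG_RE.finditer(blob)]
```

Each "local file header" hit is a candidate start of a recoverable part, and the next hit after it gives an upper bound for where its compressed data ends.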
Btw, what tools are you guys using to do this?
Joakim
How do you know that the document has only been partly overwritten?
One advantage of .docx files is that they are compressed, and often the files are very short, maybe less than 32K for a typical document of just a few pages with no graphics. This means that any overwriting can very easily destroy the complete file.
A .docx file (as stated above) is just a ZIP file with about 10 sections. Can you see the headers for these sections in the zip file? In my experience, it is not very common for just the header to be overwritten.
To help determine what is in the file, rename a good .docx to .zip and view it with WinZip (or similar). Also look at it with a hex viewer and you will see each section, with a text header. The final part of the file has a directory of all the sections. Learn about Zip files and you will be a long way down any recovery path.
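If Python is at hand, the same inspection of a good .docx can be done with the standard zipfile module, without renaming the file. This is a small sketch and the function name is made up. It reads the directory at the end of the file, so it only works on an intact file, not on the damaged one.

```python
import zipfile


def list_docx_parts(path_or_file):
    """Return (name, header_offset, compressed_size, size) per section.

    Hypothetical helper; accepts a file name or an open binary file.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        return [(info.filename, info.header_offset,
                 info.compress_size, info.file_size)
                for info in zf.infolist()]
```

The header offsets show where each section's text header starts, which can be compared against what the hex viewer shows.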
Here are some places to start
PK Zip (base format for Office 2007/2010 files)
http//
http//
OLE2 (base format for Office 2003 and earlier files)
http//
http//
Hi joakims
According to your reply, if this kind of problem occurs, two things matter:
1. How to locate where the actual contents of the Word 2007 document reside.
2. How to isolate this chunk and restore it into actual characters.
I then ran a test: I created a Word 2007 document, put an English article into it and saved it. I then opened it with WinHex, and I know where the article resides: document.xml. So I tried copying the chunk from right after "word/document.xml" to right before the beginning of the next xml part, which in my test is "word/theme/theme1.xml". That gave me the binary version of the article, but I failed to restore it into the article.
I know where the contents begin because I viewed the file in EnCase (View File Structure), and when I click document.xml it shows the correct English contents.