Repairing word document

Page 1 / 2 Next

General (Technical, Procedural, Software, Hardware etc.)

Last Post by joakims 14 years ago

12 Posts

4 Users

0 Reactions

1,210 Views

Georgefan

(@georgefan)

Eminent Member

Joined: 14 years ago

Posts: 27

Topic starter 15/06/2011 1:26 pm

I would like to ask if anyone knows a way to repair a partly overwritten Word document. The header has also been overwritten, and I see only garbled characters when I view it.
Any suggestions?

joakims

(@joakims)

Estimable Member

Joined: 15 years ago

Posts: 224

15/06/2011 1:35 pm

Do you know which version of Word the document was created in? I.e., is it in the old binary format, or the newer zip/xml format (Office 2007/2010)?

Joakim

mscotgrove

(@mscotgrove)

Prominent Member

Joined: 17 years ago

Posts: 940

15/06/2011 6:48 pm

For the .DOC (not .DOCX) it may be OK just to search for and extract the text data.

HOWEVER, you should be aware that when such documents are edited, the original text remains, and the modifications are stored elsewhere in the document. Only when a full save is done is everything put into the single text block.
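
A rough sketch of the "search for and extract the text data" idea, in Python. It only assumes that text in a damaged .DOC shows up as runs of printable 8-bit characters or of 16-bit little-endian characters; the function name and the minimum run length are made up for illustration. As noted above, text left over from earlier edits may turn up alongside the current text.

```python
import re

def extract_text_runs(data: bytes, min_chars: int = 6) -> list[str]:
    """Return printable ASCII and UTF-16LE runs of at least min_chars characters."""
    # 8-bit runs: printable ASCII plus tab.
    ascii_pat = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_chars)
    # 16-bit runs: a printable ASCII character followed by a zero byte.
    utf16_pat = re.compile(rb"(?:[\x20-\x7e\t]\x00){%d,}" % min_chars)

    runs = []
    for m in ascii_pat.finditer(data):
        runs.append((m.start(), m.group().decode("ascii")))
    for m in utf16_pat.finditer(data):
        runs.append((m.start(), m.group().decode("utf-16-le")))

    runs.sort()  # report runs in the order they appear in the file
    return [text for _, text in runs]
```

Typical use would be `extract_text_runs(open("damaged.doc", "rb").read())`, where the file name is only a placeholder.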

joakims

(@joakims)

Estimable Member

Joined: 15 years ago

Posts: 224

15/06/2011 9:12 pm

If it is a docx fragment, then it may also be possible to reconstruct parts of the document. The individual xml parts of the docx are basically stored one after another in the zip. By default they are all compressed with the deflate method, so it may be possible to decompress those parts with a raw inflate, or to attempt a repair if the compressed parts are corrupt.
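
To see that layout on an undamaged file, something like the following Python sketch could be used. It relies only on the standard zipfile module; `make_sample` builds a small stand-in archive (not a real docx) purely so the example can run on its own, and both function names are hypothetical.

```python
import io
import zipfile

def list_parts(zip_bytes: bytes):
    """Return (name, header_offset, compress_type, compress_size) per entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            (info.filename, info.header_offset, info.compress_type, info.compress_size)
            for info in zf.infolist()
        ]

def make_sample() -> bytes:
    """Build a tiny stand-in archive with docx-like part names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>" * 50)
        zf.writestr("word/document.xml", "<w:document>" + "text " * 200 + "</w:document>")
    return buf.getvalue()

if __name__ == "__main__":
    for name, offset, method, size in list_parts(make_sample()):
        # method 8 is deflate; offset is where the part's local file header starts
        print(f"{offset:6d}  method={method}  csize={size:5d}  {name}")
```

Running the same `list_parts` on the bytes of a known-good .docx should show the parts following each other by offset, each with compression method 8 (deflate).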

(I am currently working on a tiny app to extract such parts and attempt decompression if any are found.)

Joakim

Georgefan

(@georgefan)

Eminent Member

Joined: 14 years ago

Posts: 27

Topic starter 16/06/2011 12:44 pm

mscotgrove: Yes, I did have a look at a corrupt Word 2003 document, and the plain text could be seen. But with Word 2007, no plain text is shown if the header has been overwritten.
joakims: Can I ask what methods you use to extract the individual xml files and decompress them to get the data?

joakims

(@joakims)

Estimable Member

Joined: 15 years ago

Posts: 224

16/06/2011 2:59 pm

I do this manually. To recover one xml part, all you need is the local file header inside the zip with the compressed data attached immediately after it. The signature is 0x504b0304. You do not strictly need any information from the "zip central directory file header" or the "zip end of central directory record". If parts of the local file header are also damaged, you can make a good guess at where the actual compressed data is located, as the internal file name always comes right before the compressed data (unless the "Extra field" is in use). The end of the compressed data is then always followed by one of the 4 known zip structure signatures. The internal file name is always visible in clear text and easy to spot. The first is usually [Content_Types].xml. When you have isolated one such chunk, you can:

1. Recreate a dummy zip structure around it and unpack with standard tools.

2. Attempt decompression by inflate (it was compressed by deflate) using the raw method (an advanced feature of the zlib library with little documentation). This way you can feed it raw compressed data, i.e. without any of the surrounding headers and footers (name, size, checksum etc.).

But if the compressed data is corrupt, you may need to recreate a dummy zip structure around it and let some zip repair software work on it.
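
Option 2 above can be sketched in Python, whose zlib module exposes the raw mode through a negative wbits value. This is only an illustration, not the tool described in this thread; `inflate_raw` and its `step` parameter are made-up names. Feeding the data in small blocks means that whatever was decoded before a corrupt spot is kept.

```python
import zlib

def inflate_raw(chunk: bytes, step: int = 256):
    """Inflate a headerless deflate stream.

    Returns (output, complete): output is whatever could be decoded,
    complete is True only if the end of the deflate stream was reached.
    """
    d = zlib.decompressobj(-15)  # negative wbits: raw deflate, no zlib/gzip wrapper
    out = bytearray()
    try:
        for i in range(0, len(chunk), step):
            out += d.decompress(chunk[i:i + step])
            if d.eof:  # stream ended; anything after it is not part of this entry
                break
    except zlib.error:
        # Corrupt data: keep what was decoded so far.
        return bytes(out), False
    return bytes(out), d.eof
```

Because a deflate stream marks its own end, the chunk passed in may run past the compressed data (for example into the next zip signature) without harm.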

Since this is tedious, I have been working on an application to automate the process. That is, you feed it a piece of binary data (slack space, free space, etc.) and let it rip out any zip fragments it finds (originally meant to be specialized for Office 2007/2010).
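
A minimal sketch of what such automation might look like in Python, assuming the 30-byte local file header layout from the PKWARE application note. It is not the application mentioned above; the function names are invented here, and it makes no attempt to repair damaged headers or corrupt streams.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"                 # 0x504b0304 as it appears on disk
HEADER = struct.Struct("<4sHHHHHIIIHH")   # fixed 30-byte local file header

def carve_zip_entries(blob: bytes):
    """Yield (offset, name, data, crc_ok) for each entry that can be read."""
    pos = blob.find(LOCAL_SIG)
    while pos != -1:
        entry = _try_entry(blob, pos)
        if entry is not None:
            yield entry
        pos = blob.find(LOCAL_SIG, pos + 4)

def _try_entry(blob: bytes, pos: int):
    if pos + HEADER.size > len(blob):
        return None
    (_sig, _ver, flags, method, _time, _date,
     crc, csize, _usize, name_len, extra_len) = HEADER.unpack_from(blob, pos)
    name_start = pos + HEADER.size
    data_start = name_start + name_len + extra_len
    if data_start > len(blob):
        return None
    name = blob[name_start:name_start + name_len].decode("utf-8", "replace")
    if method == 8:      # deflate: let the stream find its own end
        try:
            data = zlib.decompressobj(-15).decompress(blob[data_start:])
        except zlib.error:
            return None
    elif method == 0:    # stored: rely on the size field
        data = blob[data_start:data_start + csize]
    else:
        return None
    # With flag bit 3 set, the CRC lives in a trailing data descriptor instead.
    crc_ok = None if flags & 0x08 else zlib.crc32(data) == crc
    return pos, name, data, crc_ok
```

A `crc_ok` of False suggests a truncated or partly overwritten entry; the data returned in that case is whatever inflate produced before giving up.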

Btw, what tools are you guys using to do this?

Joakim

mscotgrove

(@mscotgrove)

Prominent Member

Joined: 17 years ago

Posts: 940

16/06/2011 6:44 pm

How do you know that the document has only been partly overwritten?

One advantage of .docx files is that they are compressed, and the files are often very short, maybe less than 32K for a typical document of just a few pages with no graphics. This means that any overwriting can very easily wipe out the complete file.

A .docx file (as stated above) is just a ZIP file with about 10 sections. Can you see the headers for these sections in the zip file? In my experience, it is not very common for just the header to be overwritten.

To help determine what is in the file, rename a good .docx to .zip and view it with WinZip (or similar). Also look at it with a hex viewer and you will see each section, with a text header. The final part of the file has a directory of all the sections. Learn about Zip files and you will be a long way down any recovery path.
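
The hex-viewer exercise can also be approximated with a few lines of Python that list where the zip structure signatures occur. The four signatures below are the common ones from the ZIP specification; treating them as the relevant set is an assumption, and `signature_map` is a made-up name.

```python
SIGNATURES = {
    b"PK\x03\x04": "local file header",
    b"PK\x01\x02": "central directory file header",
    b"PK\x05\x06": "end of central directory record",
    b"PK\x07\x08": "data descriptor",
}

def signature_map(data: bytes):
    """Return a sorted list of (offset, label) for every signature found."""
    hits = []
    for sig, label in SIGNATURES.items():
        pos = data.find(sig)
        while pos != -1:
            hits.append((pos, label))
            pos = data.find(sig, pos + 1)
    return sorted(hits)
```

On a good file, `signature_map(open("good.docx", "rb").read())` (file name is a placeholder) should show the local file headers first, then the central directory entries, then the end record. Comparing that with the damaged file shows which sections survive.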

ForensicRob

(@forensicrob)

Eminent Member

Joined: 20 years ago

Posts: 26

16/06/2011 7:29 pm

Here are some places to start

PK Zip (base format for Office 2007/2010 files)
http://en.wikipedia.org/wiki/ZIP_(file_format)
http://www.pkware.com/support/zip-app-note

OLE2 (base format for Office 2003 and earlier files)
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
http://en.wikipedia.org/wiki/Object_Linking_and_Embedding

Georgefan

(@georgefan)

Eminent Member

Joined: 14 years ago

Posts: 27

Topic starter 18/06/2011 8:36 am

Hi joakims,
According to your reply, if this kind of problem occurs, two things matter:
1. How to locate where the actual contents of the Word 2007 document reside.
2. How to isolate this chunk and restore it to readable text.

I then ran a test: I created a Word 2007 document, put an English article into it and saved it. I then opened it with WinHex, and I know where the article resides: document.xml. So I tried copying the chunk from right after "word/document.xml" to just before the start of the next xml part, which in my test is "word/theme/theme1.xml". That gave me the binary version of the article, but I failed to restore it to the article text.

Georgefan

(@georgefan)

Eminent Member

Joined: 14 years ago

Posts: 27

Topic starter 18/06/2011 8:38 am

I know where the contents begin because I viewed the file in EnCase (View File Structure), and when I click document.xml it shows the correct English contents.
