mft2csv - NTFS systemfile extracter and $MFT decoder
That 'small discovery' was what my earlier post about Fixup was all about. I did say that it would bite you at some stage if you didn't do it. It is not unusual for mft records to extend past the 0x1fe-0x1ff point, but I don't think I have seen any where the bytes at 0x3fe-0x3ff make any difference.
Having now discovered for yourself that you really do need to do the fixup before processing the mft record, you will undoubtedly find that it will also resolve the problem with filenames that have bad unicode characters!
Those bytes at 0x1fe-0x1ff can have a dramatic effect on things like long filenames, times and dataruns when they are not corrected.
I'm sorry to keep harping on about this, but it really is important. I thought from my first post about Fixup that you had understood, but re-reading your last post makes it clear that you didn't. So let's be unequivocal.
When you read an mft record from disk, you do not get the real record. What you get is a modified version. It has been modified so that it has an inbuilt integrity check. You need to undo those modifications before you process the record. If you don't undo the modifications, you are not processing the real record. The undo process is called Fixup. It consists of putting two pairs of bytes back into their proper position. The two pairs of bytes are in the update sequence array and their proper positions are the last two bytes in each sector of the mft record.
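The undo step described above can be sketched in a few lines. This is a minimal illustration in Python, assuming a standard FILE record header (update sequence array offset at byte 4, entry count at byte 6) and 512-byte sectors; the function name is mine.

```python
import struct

SECTOR_SIZE = 512

def apply_fixup(record: bytearray) -> bytearray:
    """Undo the update-sequence (Fixup) modifications in an MFT FILE record."""
    if record[:4] != b"FILE":
        raise ValueError("not a FILE record")
    usa_offset, usa_count = struct.unpack_from("<HH", record, 4)
    usn = record[usa_offset:usa_offset + 2]        # update sequence number
    for i in range(1, usa_count):                  # one array entry per sector
        sector_end = i * SECTOR_SIZE - 2
        # integrity check: the last two bytes of each sector must match the USN
        if record[sector_end:sector_end + 2] != usn:
            raise ValueError("fixup mismatch - record is torn")
        # put the real pair of bytes back into its proper position
        fix = usa_offset + 2 * i
        record[sector_end:sector_end + 2] = record[fix:fix + 2]
    return record
```

For a normal 1024-byte record the array holds three words (the USN plus two saved pairs), so the loop runs twice, restoring bytes 0x1FE-0x1FF and 0x3FE-0x3FF.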
You have already seen examples of what happens when you don't do Fixup. It causes dataruns to go haywire, filenames to have bad unicode, resident data to contain bad values, etc, etc, etc.
He he, I now feel like a moron.. Anyways, I have moved over to reassembling runs with compression (including decompressing them), and it's going really well. But this time I will keep my mouth shut a little longer. :)
A question concerning ntfs compression.
I have resolved the runs, meaning I can extract the raw compressed data of the file. I have also identified the WinAPI RtlDecompressBuffer as being able to handle at least parts of the compressed data. But I'm facing the issue that only the first 8192 bytes are decompressed, so I'm wondering:
1. There is a better api or method to use?
2. More modification to the reassembled compressed parts are needed?
Anybody familiar with this?
Edit: Turns out decompression works just fine if applied to each individual run, so reassembly must be done on the decompressed chunks.
I'm not familiar with the api so cannot help there.
With the 8192 bytes though, was the cluster size 512 bytes? If it was, then it probably is ok.
The compression works on a block-by-block basis. A compression block is 16 clusters. The compression flag is always 4, and 2 to the power of 4 is 16. The fact that the flag is set doesn't actually mean that the file is compressed. It only means that the file is scheduled to be compressed.
If the file is resident, it is never compressed. A non-resident file is first split into 16-cluster compression blocks. Each block is then examined and if compression saves at least one cluster, then it is compressed. Otherwise it is not compressed. This means that the compressed file can have blocks which are compressed and some which are not compressed. The first two bytes in the compressed block tell you which is which.
To determine whether a file is compressed, you need to split the data run into 16-cluster subsets. A compressed subset would be of the form (x data clusters) plus (y sparse clusters) where x+y=16. Note that after decompression, the x data clusters will expand to 16 clusters so the sparse clusters are not actually used.
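The 16-cluster test above can be sketched as follows. This assumes the data run has already been decoded into a list of (lcn, length) pairs, with lcn set to None for sparse runs; the representation and function name are mine, not from the tool.

```python
COMPRESSION_UNIT = 16  # clusters: the flag value is 4, and 2^4 = 16

def classify_units(runs):
    """Classify each 16-cluster compression unit of a decoded run list."""
    # flatten the runs into one flag per cluster (True = data, False = sparse)
    clusters = []
    for lcn, length in runs:
        clusters.extend([lcn is not None] * length)
    units = []
    for i in range(0, len(clusters), COMPRESSION_UNIT):
        unit = clusters[i:i + COMPRESSION_UNIT]
        data = sum(unit)
        if data == len(unit):
            units.append("uncompressed")   # 16 data clusters: stored as-is
        elif data == 0:
            units.append("sparse")         # a hole, no data at all
        else:
            units.append("compressed")     # x data + y sparse, x + y = 16
    return units
```

For example, a run list of 10 data clusters, 6 sparse clusters, then 16 data clusters yields one compressed unit followed by one uncompressed unit.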
Hope this helps.
Learning from past mistakes, I will not be too bombastic in my statements. Anyway, I just uploaded a new version of the extracter with preliminary support for compressed files. It has been tested OK on a few files using the same API as mentioned before.
Thanks for pointing out that the flag only indicates that a file is, at least, scheduled to be compressed, and not necessarily already compressed, with the magic 16 clusters as a big clue. I think the existing solution is very close to that, but it will probably fail for files with an uncompressed run and the compressed flag set simultaneously.
One question though:
When you say the first two bytes in the compressed block tell you which is which, what is the actual meaning/interpretation of these two bytes? I noticed some patterns but did not identify the difference. If I understand you right, there are therefore two ways of identifying whether a block is compressed or uncompressed:
a. Looking at the runs and their clusters.
b. Looking at the first 2 bytes of a given data block (non-sparsed run).
I'm stretching my memory here.
I don't think there are two ways to determine compression or not. You only look at the first two bytes when the compression flag is set and when the data run shows 16-cluster sets containing sparse data, i.e. you need both. Otherwise the first two bytes are just data.
When the data run indicates compression though, the first two bytes, little endian format of course, are treated as 4 bits plus 12 bits. If the highest bit is set, then we actually have compression. My experience, mainly with XP, is that the 16 bit word is either &HBxxx or &H3FFF for a cluster size of 4096. The &HB indicates compression, the &H3 indicates no compression.
The remaining 12 bits indicate the length of the compressed sub-block minus 3 bytes. The length includes the first two bytes. So &H3FFF gives an uncompressed block of &H1002 bytes. That is, two more than the uncompressed length. Also for a full-size compressed block, the (xxx +3) bytes will expand to 16 clusters.
One other comment in regard to the &HB (ie 1011 in bits) is that no-one seems to know what the lower two set bits mean. It seems they are always set.
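The header interpretation described above can be sketched in a couple of lines. This is a minimal illustration, assuming the two bytes are read little-endian as stated; the function name is mine.

```python
import struct

def parse_subblock_header(data: bytes, offset: int = 0):
    """Interpret the 16-bit header at the start of a compressed sub-block."""
    (hdr,) = struct.unpack_from("<H", data, offset)
    # low 12 bits hold (sub-block length - 3); the length includes the header
    length = (hdr & 0x0FFF) + 3
    # top bit set (&HBxxx) means actually compressed; &H3FFF means stored raw
    compressed = bool(hdr & 0x8000)
    return length, compressed
```

So the &H3FFF case gives (0x1002, False): an uncompressed 4096-byte block plus its two header bytes.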
I knew I should not have relied on my memory to talk about compression. I didn't quite get it right.
The 16 clusters are called a 'compression unit' and the compression is actually done on a 'compression block' which is always 4096 bytes (or less if end of file).
So for a small disk with cluster size 512 bytes, a compression unit is only two blocks. A large disk with cluster size 4096 bytes has 16 blocks.
The 16 bit header is always &HBxxx or &H3FFF, irrespective of cluster size.
Hope this clarifies everything.
Thank you once more for your explanations! They are much appreciated.
I just made some updates
Added fixups. However, the record size is hardcoded to 1024 bytes.
Do you know the formula for getting record size?
NTFS Systemfile extracter v1.7
I am a bit stuck with the extraction of compressed data. I believe I have understood most of it now, but I am facing weird issues with the extracted data. For instance, data is correctly extracted up until a certain run, but after that the data appears corrupted: a single random byte is inserted into the extracted data at arbitrary locations in several places. I simply do not understand what is going on, and I don't have much time to investigate this issue further, so I am posting the most current version in case someone is curious and wants to look at it. The relevant code for extracting the compressed data is at around lines 1060 and 1100. I also acknowledge that the implementation of runs could have been done differently, and possibly better, with smarter array handling.
I've never seen any size other than 1024 bytes. Maybe some earlier versions of NTFS, prior to XP, were different.
The formula uses the signed byte at offset &H40 in the boot record. This is the number of clusters per MFT record. A negative value indicates that the record size is less than a cluster, and the formula is then 2^(-1*value). The normal value is &HF6, i.e. -10, so 2^(-1*-10)=1024.
Having said that, some lateral thinking makes it a bit more obvious. It's probably a safe assumption that for any drive or image, the MFT record size is fixed. We also know that Fixup is done on a sector-by-sector basis. If you look at the MFT record for $Mft, the size of the Fixup array (aka Update Sequence Array) is at offset &H6. This is always one more than the number of sectors to be fixed up! So for the usual value of 3, the record must be 2 sectors long, i.e. 1024 bytes. You can put this into perspective by looking at an INDX record: its Fixup array size is usually 9, so its size is 8 sectors, i.e. 4096 bytes.
The latter approach is good particularly when the boot record is missing or damaged.
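Both approaches above can be sketched briefly. This assumes a raw boot sector and a raw $MFT record as bytes, with the offsets discussed above; the function names are mine.

```python
import struct

def record_size_from_boot(boot: bytes, bytes_per_cluster: int) -> int:
    """MFT record size from the signed byte at offset 0x40 of the boot sector."""
    v = struct.unpack_from("<b", boot, 0x40)[0]
    # negative n means the record is smaller than a cluster: size = 2^(-n)
    # (&HF6 = -10 gives 2^10 = 1024); positive means clusters per record
    return 2 ** (-v) if v < 0 else v * bytes_per_cluster

def record_size_from_fixup(record: bytes, sector_size: int = 512) -> int:
    """MFT record size from the fixup array size at offset 6 of the record."""
    # the update sequence array count is always sectors-in-record + 1
    usa_count = struct.unpack_from("<H", record, 6)[0]
    return (usa_count - 1) * sector_size
```

The second function needs no boot sector at all, which is what makes it useful when the boot record is missing or damaged.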
By accident the wrong version of mft2csv was uploaded last time, with an option to detect record slack. That was never really implemented, because I believed it was too time-consuming to process and would give lots of false positives. But while at it, would it make sense to implement an option to detect such slack?
The way I see it, each and every byte between the attribute end marker (0xFFFFFFFF) and offset 0x3FE of the record must be compared against 0x00. And if implemented, would it make sense to dump slack data into a subfolder, using a naming convention like [IndexNumber]_[FileName].bin or similar?
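The check proposed above could be sketched like this. Note this is a naive illustration of the idea, not the tool's code: it takes the first 0xFFFFFFFF it finds, whereas a robust version would walk the attribute headers to locate the real end marker, since that byte pattern can also occur inside attribute data.

```python
def find_record_slack(record: bytes) -> bytes:
    """Return non-zero slack between the attribute end marker and 0x3FE."""
    end = record.find(b"\xFF\xFF\xFF\xFF")  # naive: first match only
    if end == -1:
        return b""
    # stop at 0x3FE so the two fixup bytes at the end are not counted
    slack = record[end + 4:0x3FE]
    # any non-zero byte here is candidate slack data worth dumping
    return slack if any(slack) else b""
```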
Added a console application named MFTRCRD that dumps the same information as mft2csv, but to the console. Much faster when just looking at stuff for one particular file at a time, like when you're testing and experimenting.
I'm looking into your fixup explanation. Your example is for 1024-byte MFT records and 512 bytes per sector. Is it fair to assume that the number of fixup words (in your example 3) will always be 'number of sectors in an MFT record' + 1?
PS. We're discussing NTFS in thread
Feel free to chime in.
Oops, I was staring at page one and had no idea two more pages of the discussion followed :-)
Quickly scanned the content, and I now see mention of fixup for INDX blocks as well?
Do I have to correct those blocks too before processing them?