JPEG carving/identifying/recovering

joachimm

(@joachimm)

Estimable Member

Joined: 18 years ago

Posts: 181

Topic starter 27/09/2014 12:56 pm [#12209]

For context, this post originates from this thread:
http://www.forensicfocus.com/Forums/viewtopic/t=12127/

Actually, It works with identifying JPGE file when Block begins
- 0xff, 0xd8, 0xff, 0xe0
- 0xff, 0xd8, 0xff, 0xo1
- 0r 0xff, 0xd8, 0xff, 0xfe

This is incorrect. Please check the JPEG format specification: according to the spec a JPEG should start with 0xff, 0xd8. The bytes that follow are common, but other values are possible.


jaclaz

(@jaclaz)

Illustrious Member

Joined: 19 years ago

Posts: 5133

27/09/2014 4:41 pm

This is incorrect. Please check the JPEG format specification: according to the spec a JPEG should start with 0xff, 0xd8. The bytes that follow are common, but other values are possible.

Just for the record, that info is stated in the "introductory" page of photorec
http://www.cgsecurity.org/wiki/PhotoRec#How_PhotoRec_works
and on the "developers" page
http://www.cgsecurity.org/wiki/Developers
and has been posted verbatim (but with a typo) by EvaMendis.

The pattern used in Photorec is definitely that one
http://git.cgsecurity.org/cgit/testdisk/tree/src/file_jpg.c
Most probably it derives from "observation of wild files in nature"?

CNWrecovery
http://www.cnwrecovery.com/manual/DataCarving.html
seemingly uses the same approach (but limited to FFD8FFE0 and FFD8FFE1).

Possibly to avoid false positives?

A generic pattern like FFD8 might produce too many results
https://www.ocf.berkeley.edu/~fricke/projects/jpegrescue/

Trid's XML definition
http://mark0.net/soft-tridscan-e.html
uses FFD8FF instead (which is possibly a "good compromise")?

jaclaz


joachimm

(@joachimm)

Estimable Member

Joined: 18 years ago

Posts: 181

Topic starter 27/09/2014 6:00 pm

Just for the record that info is stated in the "introductory" page of photorec

Taken out of context, the documentation might give you the idea that you are correct, but if you read on:

If PhotoRec has already started to recover a file, it stops its recovery, checks the consistency of the file when possible and starts to save the new file (which it determined from the signature it found).

Also if you look at the source code you see that photorec does much more to determine if it's dealing with a JPEG than the documentation indicates.

The generic pattern like FFD8 might provide too many results

Yes, if the byte pattern is your only criterion, the signal-to-noise ratio is poor.
But photorec also uses block alignment and format validation, which makes it produce higher quality results. Alas, this technique is not suitable for every file system.

use instead FFD8FF (which possibly it is a "good compromise")

This is indeed the longest unique byte signature of the start of a JPEG that conforms to the specification.
This does not mean you cannot use a longer signature if you know what you're looking for. For context, it is not uncommon to see JPEG files that start with 0xff, 0xd8, 0xff, 0xe[2-9].
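To make the block-alignment point concrete, here is a minimal Python sketch that only looks at the start of each block for the three-byte FFD8FF signature and reports the fourth byte it finds. The function name and the 512-byte default block size are assumptions made for illustration; this is not photorec's code, and a real carver would follow up with format validation.

```python
from typing import Iterator, Tuple

# SOI marker (ff d8) followed by the 0xff that opens the next marker.
JPEG_PREFIX = b"\xff\xd8\xff"


def find_block_aligned_candidates(
    image: bytes, block_size: int = 512
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, fourth_byte) for every block that starts with FFD8FF.

    Only block-aligned offsets are inspected, so a JPEG embedded in the
    middle of a block (e.g. a thumbnail inside another file) is skipped.
    """
    for offset in range(0, len(image), block_size):
        head = image[offset:offset + 4]
        if len(head) == 4 and head.startswith(JPEG_PREFIX):
            yield offset, head[3]


if __name__ == "__main__":
    # Synthetic three-block "disk image" used only for demonstration.
    demo = bytearray(512 * 3)
    demo[0:4] = bytes.fromhex("ffd8ffe2")        # aligned, fourth byte E2
    demo[522:526] = bytes.fromhex("ffd8ffe0")    # not aligned to 512
    demo[1024:1028] = bytes.fromhex("ffd8ffe0")  # aligned, fourth byte E0
    for offset, fourth in find_block_aligned_candidates(bytes(demo)):
        print(f"candidate at offset {offset}, fourth byte 0x{fourth:02X}")
```

Lowering block_size increases recall (more embedded images are seen) at the cost of more false positives, which is the trade-off discussed above.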


jaclaz

(@jaclaz)

Illustrious Member

Joined: 19 years ago

Posts: 5133

27/09/2014 8:21 pm

Also if you look at the source code you see that photorec does much more to determine if it's dealing with a JPEG than the documentation indicates.

Well, I was trying to be more accurate than the previous poster, and unless there are further "overrides" in other parts of the source code, this still seems pretty much accurate to me:

The pattern used in Photorec is definitely that one
http://git.cgsecurity.org/cgit/testdisk/tree/src/file_jpg.c

It seems to me like the patterns used are

static const unsigned char jpg_header_app0[4]= { 0xff,0xd8,0xff,0xe0};
static const unsigned char jpg_header_app1[4]= { 0xff,0xd8,0xff,0xe1};
static const unsigned char jpg_header_app12[4]= { 0xff,0xd8,0xff,0xec};
static const unsigned char jpg_header_com[4]= { 0xff,0xd8,0xff,0xfe};

The rest are (seemingly?) "further checks", performed ONCE the file header has been recognized as per above.

To make sure, I ran photorec several times on a FAT12 floppy image to which I had written (and then deleted) a .jpg image, hex-editing the fourth byte each time.
Photorec found it when the fourth byte was E0, E1, EC and FE, BUT it failed to recover it with fourth byte E2, E3 and E9 (I did not test other values).

And while I don't doubt in the least that the "proper" way is the three bytes FFD8FF (as TriD uses, BTW), I was merely stating the fact that Testdisk does check for 4 bytes and that the fourth byte must be any of E0, E1, EC or FE in order for the file to be recognized and recovered, which is consistent with the provided quotes.
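The difference between the two approaches can be sketched in a few lines of Python. The set of fourth bytes is taken from the four header arrays quoted above; the function names are made up for this illustration and the code only mimics the header comparison, not any of the further checks photorec performs.

```python
# Fourth bytes of the four header patterns quoted from file_jpg.c above.
FOUR_BYTE_SET = {0xE0, 0xE1, 0xEC, 0xFE}
THREE_BYTE_PREFIX = b"\xff\xd8\xff"


def matches_three_byte_prefix(block: bytes) -> bool:
    """True if the block starts with FFD8FF, whatever the fourth byte is."""
    return block[:3] == THREE_BYTE_PREFIX


def matches_four_byte_header(block: bytes) -> bool:
    """True if the block starts with FFD8FF plus one of the listed bytes."""
    return (
        len(block) >= 4
        and matches_three_byte_prefix(block)
        and block[3] in FOUR_BYTE_SET
    )


if __name__ == "__main__":
    # The fourth-byte values that were hex-edited in the floppy image test.
    for fourth in (0xE0, 0xE1, 0xEC, 0xFE, 0xE2, 0xE3, 0xE9):
        block = THREE_BYTE_PREFIX + bytes([fourth])
        print(
            f"{fourth:02X}: three-byte={matches_three_byte_prefix(block)} "
            f"four-byte={matches_four_byte_header(block)}"
        )
```

With these definitions the values E2, E3 and E9 pass the three-byte test but fail the four-byte one, which matches the outcome of the test described above.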

As such, the documentation (in or out of context) seems to reflect accurately what the tool actually does (which does not mean that the approach used is the "right" one, I was ONLY reporting what patterns were used in a few tools).

You should contact Christophe Grenier about the "missing" patterns or about the approach photorec actually uses being incorrect.

jaclaz


joachimm

(@joachimm)

Estimable Member

Joined: 18 years ago

Posts: 181

Topic starter 27/09/2014 11:10 pm

Photorec found it when the fourth byte was E0, E1 EC and FE, BUT it failed to recover with fourth byte E2, E3 and E9. (did not test other values)

Thx for testing. This is a very good objective approach to validate tooling and how it is working 😉

No idea why the author strayed from the spec here. I looked up my notes on the allowed first segments after the start of image (ff d8); signatures are represented as binary string expressions:

application segment "\xff[\xe3-\xef]"

Table segments
"\xff\xc4" # Define Huffman table (DHT)
"\xff\xcc" # Arithmetic coding condition table (DAC)
"\xff\xdb" # Define quantization table (DQT)

Reserved segments
"\xff\xc8" # Start of Frame (JPG) (Reserved for JPEG extensions)
"\xff[\xf0-\xfd]" # Reserved for JPEG extensions
"\xff\xfe" # Comment (COM)
"\xff[\x02-\xbf]" # Reserved

And while I don't doubt in the least that the "proper" way is the three bytes FFD8FF (as TriD uses, BTW), I was merely stating the fact that Testdisk does check for 4 bytes and that the fourth byte must be any of E0, E1, EC or FE in order for the file to be recognized and recovered, which is consistent with the provided quotes.

I assume you mean photorec here instead of testdisk. As indicated, there is more to it.
To repeat: the signature must be block-aligned as well, and photorec does format validation, which matters when there is fragmentation, e.g. caused by the file system itself. This has implications for when to use the tool and when not to. So photorec might not find carve-able files if the situation is not favorable.

In the revit proof of concept carver the sequence "ff d8" was sufficient, since file format validation is done; this should suffice for photorec as well. No idea why it was implemented in this manner in photorec.
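As an illustration of why a two-byte signature can be enough when it is followed by format validation, here is a minimal Python sketch that walks the marker segments after "ff d8" and gives up as soon as the structure stops making sense. It is a simplified assumption of what such validation can look like, not the revit or photorec implementation: it does not handle fill bytes or standalone markers, and it stops at the start of scan (SOS) without checking the entropy-coded data.

```python
import struct
from typing import List, Optional, Tuple

SOI = b"\xff\xd8"
SOS = 0xDA  # start of scan: compressed image data follows this segment


def walk_header_segments(data: bytes) -> Optional[List[Tuple[int, int]]]:
    """Return [(marker, length), ...] from SOI up to and including SOS.

    Returns None when the data does not look like a JPEG header, e.g. a
    missing 0xff marker prefix, an impossible segment length, or data
    that ends before an SOS segment is reached.
    """
    if data[:2] != SOI:
        return None
    pos = 2
    segments: List[Tuple[int, int]] = []
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None
        marker = data[pos + 1]
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            return None
        segments.append((marker, length))
        if marker == SOS:
            return segments
        pos += 2 + length
    return None


if __name__ == "__main__":
    sample = (
        SOI
        + b"\xff\xe0\x00\x04ab"  # APP0 segment with 2 payload bytes
        + b"\xff\xdb\x00\x03x"   # DQT segment with 1 payload byte
        + b"\xff\xda\x00\x02"    # SOS segment with empty payload
        + b"\x00\xff\xd9"        # stand-in scan data and EOI
    )
    print(walk_header_segments(sample))
    print(walk_header_segments(SOI + b"\x00\x00\x00\x00"))
```

A random "ff d8" hit in unallocated space will almost always fail such a walk within a few bytes, which is what keeps the false positive rate manageable.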

As such the documentation (in or out of context) seems like reflecting accurately what the tool actually does (which does not mean that the approach used is the "right" one, I was ONLY reporting what patterns were used in a few tools).

No worries, the remark regarding the documentation is largely to point out the missing important line that follows the highlighted section.

You should contact Christophe Grenier about the "missing" patterns or about the approach photorec actually uses being incorrect.

I can drop him a mail asking him to improve the JPEG format support and to follow the spec.

I think this is a nice illustration of assumptions about tools 😉 To be verbose: I'm NOT of the opinion that the technique photorec uses is incorrect (in the strict sense of the word). IMO the cons described are a side effect of the technique used. I agree that the JPEG format support can be improved.

There are pros and cons to the techniques tools use, and you (in general) as the user will need to know the favorable and unfavorable circumstances. Carving and recovery are particularly tricky matters because sometimes recall matters, sometimes precision.

My remark about the incorrectness of the EvaMendis post concerns several aspects.

The photorec wiki gives it as an example

For example, PhotoRec identifies a JPEG file when a block begins with

the EvaMendis post

Actually, It works with identifying JPGE file when Block begins

The typos aside, I hope you can see the semantic difference. I'm missing the reasoning for how "for example" became "actually". And as you and I have pointed out, there is significantly more to it than the poster indicates, which IMO is a nice example of why it is important to look under the hood and do cross checking 😉


jaclaz

(@jaclaz)

Illustrious Member

Joined: 19 years ago

Posts: 5133

28/09/2014 1:19 am

I assume you mean photorec here instead of testdisk.

Yep, my bad, I meant Photorec and not Testdisk, of course.

My personal opinion is, as said, that the right way is to check for the three bytes FFD8FF (and to be more "flexible" about the fourth byte). Considering the additional checks that Photorec performs (as you pointed out), i.e. block alignment and, I may add, the "footer" check, that should be enough to avoid most "false positives".

We have to take into account, however, that different tools may have a different use, even when they rely on the same "function".

Photorec is essentially a Photo Recovery tool and not properly a "forensic" (or "pure") carver, so it makes a lot of sense that it has a "beginning of block" check (which, independently of the three- or four-byte header patterns, will exclude a number of images embedded in other files, including most "preview images" or "thumbnails" inserted in the EXIF data).

TriD, being a "file identifier", has the "advantage" that it needs no such check (since what you feed it is an actual file and not a "random address on a disk image"); moreover, its output is a "probable" file type.

For the record, the "file" *nix utility seemingly uses the much more "generic" two-byte pattern FFD8
http://darwinsys.com/file/
https://github.com/file/file/blob/master/magic/Magdir/jpeg

About the semantics, to be picky (as I am), the "proper" description should probably have been something *like*

As an example, for JPEG images, Photorec first checks whether the four bytes at the beginning of a block are any of FFD8FFE0, FFD8FFE1, FFD8FFEC or FFD8FFFE, and IF any of these conditions is met, it tentatively identifies the block as the beginning of a JPEG image and then makes a number of further checks to make sure that the block belongs to a valid JPEG image, to determine the size of the image, etc. in order to actually recover the file.

And we have to note how here
http://www.cgsecurity.org/wiki/PhotoRec#How_PhotoRec_works
the text is

For example, PhotoRec identifies a JPEG file when a block begins with

While here
http://www.cgsecurity.org/wiki/Developers
it is

If the file format specifications aren't available, compare several samples to identify constant fields. In example, PhotoRec identifies a JPEG file when a block begins with

Possibly the distinction/misunderstanding is between "identify" as in "tentatively identify" and "identify" as in "identify and recover without further checks".

But yes, we are both on the same side when it comes to "assumptions" and how frequent they are.

jaclaz


jaclaz

(@jaclaz)

Illustrious Member

Joined: 19 years ago

Posts: 5133

29/09/2014 9:47 pm

Probably it's just me, and it is well possible that I am particularly unlucky, but it is strange how every single time I touch a can, it pops open 😯 and a zillion worms get loose out of it.

While still pertaining to the carving approach, this is slightly bent towards data recovery, but still IMNHO intriguing.

Test.

Taking a small JPEG (a "normal" JFIF one with header FFD8FFE0) named Base_hexE0.jpg, I made 256 copies of it, named modded_0x00.jpg to modded_0xFF.jpg, hex-editing the 4th byte of each to the value in its name.

Then I ran jpegsnoop
http://www.impulseadventure.com/photo/jpeg-snoop.html
in batch mode on the whole set of 256 images from modded_0x00.jpg to modded_0xFF.jpg.

A large number of these modded images were considered "non-valid" JPEGs by the tool, which stopped scanning as soon as, past the SOI "FFD8", it found an invalid 3rd and 4th byte.

A few images crashed the tool.

Results
Values that produced a "valid" log (i.e. that continued the parsing after the header)
01, C4, C8, CC, DE, E0-FE

Values that produced an "invalid" log (i.e. that stopped the parsing after the header)
00, 02-BF, D0-DD, DF, FF

Values that crashed jpegsnoop
C0-C3, C5-C7, C9-CB, CD-CF

ALL the images that crashed jpegsnoop are (I would say obviously) NOT viewable.

Now the "interesting part".

Among the "valid" values, were "normally" seen in an Explorer window in "preview mode" AND could be double clicked and displayed correctly with Microsoft Photo Editor (on XP SP2) ONLY
01, CC, E0-EF, FE
Whilst these were NOT viewable ?
C4, C8, DE, F0-FD
(but jpegsnoop did display them fine)

BUT among the "invalid" values (that jpegsnoop could NOT display), these were viewable in Explorer/Photo Editor as above 😯
00,D0-D7,DC, FF
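The value lists above are easy to mistype, so here is a small Python sketch that expands them and checks that the three jpegsnoop groups cover all 256 values exactly once, then derives the two groups where jpegsnoop and the Microsoft viewers disagree. The helper name expand is made up for this illustration; the value lists are copied from the results above.

```python
from typing import Set


def expand(spec: str) -> Set[int]:
    """Expand a list such as "01, C4, E0-FE" into a set of byte values."""
    values: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            values.update(range(int(low, 16), int(high, 16) + 1))
        else:
            values.add(int(part, 16))
    return values


# jpegsnoop outcome per fourth-byte value, as reported above.
snoop_valid = expand("01, C4, C8, CC, DE, E0-FE")
snoop_invalid = expand("00, 02-BF, D0-DD, DF, FF")
snoop_crash = expand("C0-C3, C5-C7, C9-CB, CD-CF")

# Values that Explorer / Photo Editor displayed, from both groups.
viewable = expand("01, CC, E0-EF, FE") | expand("00, D0-D7, DC, FF")

# The three groups must partition the 256 possible byte values.
assert snoop_valid | snoop_invalid | snoop_crash == set(range(256))
assert len(snoop_valid) + len(snoop_invalid) + len(snoop_crash) == 256

# Parsed by jpegsnoop but not displayed by the Microsoft tools.
only_snoop = snoop_valid - viewable
# Rejected by jpegsnoop but displayed by the Microsoft tools.
only_viewer = snoop_invalid & viewable

if __name__ == "__main__":
    print("only jpegsnoop:", " ".join(f"{v:02X}" for v in sorted(only_snoop)))
    print("only MS tools: ", " ".join(f"{v:02X}" for v in sorted(only_viewer)))
```

The first derived group reproduces the "NOT viewable" list (C4, C8, DE, F0-FD) and the second the list of "invalid" values that were viewable anyway.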

I made a small batch to replicate.
In the same directory where you put and run it, you need a "base image" (I suggest a small one) called Base_hexE0.jpg and HExAlter
http://kuwanger.net/misc/hexalter.shtml

If you invoke the batch with the /ALL parameter it will create the 256 JPEGs in the same directory, whilst if you invoke it without parameters it will create the images already divided into three subdirectories: Valid, Invalid and Crash.

jaclaz

@ECHO OFF
SETLOCAL ENABLEEXTENSIONS
SET Base=Base_hexE0.jpg

REM With /ALL, create all 256 images in the current directory
IF NOT %1.==/ALL. GOTO :makesets
FOR /L %%? IN (0,1,255) DO (
call :changeHex %%?
)
GOTO :EOF

:makesets
REM Valid
SET TargetDir=Valid
IF EXIST .\%TargetDir% RD /S /Q .\%TargetDir%
MD .\%TargetDir%
FOR %%? in (
   01
            C4          C8          CC
                                          DE
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE
) DO CALL :make_jpgs %%?

REM Invalid
SET TargetDir=Invalid
IF EXIST .\%TargetDir% RD /S /Q .\%TargetDir%
MD .\%TargetDir%
FOR %%? in (
00    02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD    DF
                                             FF
) DO CALL :make_jpgs %%?

REM Crash
SET TargetDir=Crash
IF EXIST .\%TargetDir% RD /S /Q .\%TargetDir%
MD .\%TargetDir%
FOR %%? in (
C0 C1 C2 C3    C5 C6 C7    C9 CA CB    CD CE CF
) DO CALL :make_jpgs %%?
GOTO :EOF

:make_jpgs
REM Copy the base image into the target directory and patch the byte at offset 3
ECHO %1
copy %base% .\%TargetDir%\modded0x%1.jpg>nul
hexalter .\%TargetDir%\modded0x%1.jpg 3=0x%1
GOTO :EOF

:changehex
REM Convert the decimal argument to hex through the exit code, keep the last two digits
CMD /C EXIT /B %1
SET "Line=%=ExitCode%"
SET "Line_hex=0x%Line:~-2%"
ECHO copy %base% modded%Line_hex%.jpg
copy %base% modded%Line_hex%.jpg>nul
hexalter modded%Line_hex%.jpg 3=%Line_hex%
GOTO :EOF
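For those without HExAlter or a Windows command prompt at hand, the /ALL mode of the batch can be approximated with a few lines of Python. This is a sketch under assumptions: the function name is made up, the output directory is a parameter rather than the current directory, and the file names follow the modded0xNN.jpg pattern the batch writes.

```python
from pathlib import Path
from typing import List


def make_modded_copies(base: Path, out_dir: Path, offset: int = 3) -> List[Path]:
    """Write 256 copies of base, each with the byte at offset replaced.

    Offset 3 is the fourth byte, i.e. the byte after FF D8 FF in a
    normal JFIF file. Returns the list of files that were created.
    """
    original = base.read_bytes()
    if len(original) <= offset:
        raise ValueError("base image is too short to patch at this offset")
    out_dir.mkdir(parents=True, exist_ok=True)
    created: List[Path] = []
    for value in range(256):
        data = bytearray(original)
        data[offset] = value
        target = out_dir / f"modded0x{value:02X}.jpg"
        target.write_bytes(bytes(data))
        created.append(target)
    return created


if __name__ == "__main__":
    # Expects Base_hexE0.jpg in the current directory, like the batch does.
    files = make_modded_copies(Path("Base_hexE0.jpg"), Path("modded"))
    print(f"created {len(files)} files")
```

Sorting the results into Valid, Invalid and Crash subdirectories, as the batch does without /ALL, would only need the three value lists from the results above.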


mscotgrove

(@mscotgrove)

Prominent Member

Joined: 18 years ago

Posts: 940

29/09/2014 10:43 pm

I accept that the spec can include more than 0xff 0xe0, and 0xff 0xe1. However, in the real world I have never seen any other values. On what devices do more values exist?


jaclaz

(@jaclaz)

Illustrious Member

Joined: 19 years ago

Posts: 5133

29/09/2014 11:34 pm

I accept that the spec can include more than 0xff 0xe0, and 0xff 0xe1. However, in the real world I have never seen any other values. On what devices do more values exist?

As I see it, the issue (or non-issue, as it is more a personal opinion than anything else) is this: even though "in nature" (i.e. every single piece of software ever made only creates those?) the only existing values are E0 and E1, plus maybe EC and FE (i.e. the ones photorec uses), there are other possible (if you want, "artificial") values that can be parsed "correctly" (or at least "displayed") by the "normal" Microsoft tools and/or by a "dedicated" but "common enough" tool.

This could imply that some carving/parsing tools may miss them altogether (when parsing/carving - say - unallocated).

I would presume that jpegsnoop has a more "correct" or "strict" parsing mechanism, and added "self-healing" capabilities, being a tool aimed at the recovery of corrupted jpg's, so I do understand how it may be able to display the images with C4, C8, DE, F0-FD which MS standard programs do not display.

I find it more curious that the MS standard tools could display the ones with values 00, D0-D7, DC, FF, which are "invalid" according to jpegsnoop.

With reference to the snippet joachimm posted:
C4 belongs to "Table segments"
C8 and F0-FD belong to "Reserved segments"

All the rest, though "outside" of the specifications mentioned by joachimm, do display in one tool or the other.
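To look up which group a given fourth byte falls into, a small Python helper like the following can be used. The marker names are the code assignments from the JPEG specification (ITU-T T.81) as I remember them, so treat the table as an assumption and verify it against the spec before relying on it; the function name is made up. Note that it uses the full APPn range E0-EF.

```python
def marker_name(code: int) -> str:
    """Return the JPEG marker name for the byte that follows 0xff."""
    fixed = {
        0x01: "TEM", 0xC4: "DHT", 0xC8: "JPG", 0xCC: "DAC",
        0xD8: "SOI", 0xD9: "EOI", 0xDA: "SOS", 0xDB: "DQT",
        0xDC: "DNL", 0xDD: "DRI", 0xDE: "DHP", 0xDF: "EXP",
        0xFE: "COM",
    }
    if code in fixed:
        return fixed[code]
    if 0x02 <= code <= 0xBF:
        return "RES"                    # reserved
    if 0xC0 <= code <= 0xCF:
        return f"SOF{code - 0xC0}"      # start of frame variants
    if 0xD0 <= code <= 0xD7:
        return f"RST{code - 0xD0}"      # restart markers
    if 0xE0 <= code <= 0xEF:
        return f"APP{code - 0xE0}"      # application segments
    if 0xF0 <= code <= 0xFD:
        return f"JPG{code - 0xF0}"      # reserved for JPEG extensions
    return "invalid"                    # 0x00 and 0xFF are not marker codes


if __name__ == "__main__":
    for code in (0x00, 0x01, 0xC4, 0xC8, 0xD0, 0xDC, 0xDE, 0xE2, 0xF0, 0xFF):
        print(f"{code:02X}: {marker_name(code)}")
```

Seen this way, the values that crashed jpegsnoop are exactly the start of frame markers, which would make a parser expect frame parameters where the base image has its JFIF data.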

As I see it, such "malformed" images may well be "overlooked" by recovery/carving tools while being, when it comes to "practical" effects (i.e. viewing them), "good enough" JPEGs.

jaclaz


joachimm

(@joachimm)

Estimable Member

Joined: 18 years ago

Posts: 181

Topic starter 29/09/2014 11:38 pm

I accept that the spec can include more than 0xff 0xe0, and 0xff 0xe1. However, in the real world I have never seen any other values. On what devices do more values exist?

This is a typical black swan statement: I've never seen it, therefore it does not exist. And if you're not looking for them, you'll not find them either.

I've seen various different values for the first segment, e.g. in files created by Adobe applications. Again, if you know that you are looking for a JPEG that comes from a camera or application using certain signatures, that is perfectly fine, as long as this is a conscious decision and not an assumption.
