This is a typical black swan statement: I've never seen it, therefore it does not exist. And if you're not looking for them, you'll not find them either.
This statement is true if applied only to data carving. However, it is not true when doing signature testing on good/deleted files on a working disk.
If I do a (non-carving) file recovery and a jpg/jpeg file does not have a valid signature, I check it. It is in this mode that I have not come across other values - though it is not a perfect process, and I may have missed some.
Another area to be aware of is where, say, jpgs have been renamed as .dat files in an attempt to hide them. If they were non-0xe0/0xe1 files, the hiding might work, as signature recognition would not detect them as jpgs.
This statement is true if applied only to data carving. However, it is not true when doing signature testing on good/deleted files on a working disk.
It applies to your use case as well. How do you know your sample set is representative?
If you have a disk with all JPEGs from the same application, it will likely yield the same results.
As long as all swans are white, it will reinforce your thinking model.
I've done a lot of file format analysis; take it from me, if there can be an edge case, sooner or later there will be a sample that confirms it.
jaclaz, interesting finds. I'll take the time to read them a bit more carefully. FYI I've pinged C. Grenier on the matter; let's see if he agrees.
I wrote a JPEG raw recovery program a long, long time ago (likely 15 years ago) and, if I remember well, to filter out most of the corrupted images I kept reading chunk after chunk until I found unexpected data at the end of a chunk, in which case I marked the JPEG as corrupted (so no stream type checking whatsoever).
If the two bytes after the last chunk were FF D9 I considered the file as good and saved it.
I used only the first two bytes to recognize the header and all seemed to work alright (I still had to do additional cleanup on the recovered files with some command line tools though).
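In today's terms, a minimal sketch of that approach (a reconstruction of the idea described above, not the original program; the function name is made up) could look like this - walk the marker segments after SOI and accept the file only if the stream ends with the EOI marker FF D9:

import struct

SOI = b"\xff\xd8"
EOI = b"\xff\xd9"
SOS = 0xDA  # start of scan: entropy-coded data follows

def looks_like_good_jpeg(data: bytes) -> bool:
    # Rough sketch of the chunk-by-chunk check described above.
    if not data.startswith(SOI):
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # unexpected data where a marker should be: treat as corrupted
        marker = data[pos + 1]
        if marker == SOS:
            # entropy-coded data follows; accept only if the file ends with EOI
            return data[-2:] == EOI
        # other segments carry a 2-byte big-endian length that includes itself
        # (standalone markers such as RSTn would need special-casing in a fuller parser)
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + seg_len
    return False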
Another area to be aware of is where, say, jpgs have been renamed as .dat files in an attempt to hide them. If they were non-0xe0/0xe1 files, the hiding might work, as signature recognition would not detect them as jpgs.
Allow me to disagree. 😯
If you pass a ".dat" file through (say) the mentioned TrID, it will give you a "correct" identification as ".jpg" with 75% confidence, actually higher than for the "base" E0 file:
TrID/32 - File Identifier v2.10 - (C) 2003-11 By M.Pontello
Definitions found 5387
Analyzing...
File .\DATs\Base_hexE0.dat
50.0% (.JPG) JFIF JPEG Bitmap (4003/3)
File .\DATs\modded0xDE.dat
75.0% (.JPG) JPEG Bitmap (3000/1)
File .\DATs\modded0xFF.dat
75.0% (.JPG) JPEG Bitmap (3000/1)
Then you try viewing it with a "common" viewer (still to remain within the tested tools, MS Photo Editor), BUT if it is one of those values that the tool does not display (let's say DE), what will you do?
Test it with jpegsnoop anyway?
Or vice versa, you test it with a "dedicated" program (again for the sake of the example, jpegsnoop), BUT the file has value FF?
JPEGsnoop 1.7.2 by Calvin Hass
-------------------------------------
Filename [D\partitionview\vidma04\Photorec_test\256_jpegs\DATs\Base_hexE0.dat]
Filesize [36316] Bytes
Start Offset 0x00000000
Marker SOI (xFFD8)
OFFSET 0x00000000
Marker APP0 (xFFE0)
OFFSET 0x00000002
Length = 16
Identifier = [JFIF]
version = [1.1]
density = 72 x 72 DPI (dots per inch)
thumbnail = 0 x 0
Marker DQT (xFFDB)
Define a Quantization Table.
OFFSET 0x00000014
Table length = 67
....
JPEGsnoop 1.7.2 by Calvin Hass
-------------------------------------
Filename [D\partitionview\vidma04\Photorec_test\256_jpegs\DATs\modded0xDE.dat]
Filesize [36316] Bytes
Start Offset 0x00000000
Marker SOI (xFFD8)
OFFSET 0x00000000
OFFSET 0x00000002
Header length = 16
Skipping unsupported marker
Marker DQT (xFFDB)
Define a Quantization Table.
OFFSET 0x00000014
Table length = 67
....
JPEGsnoop 1.7.2 by Calvin Hass
-------------------------------------
Filename [D\partitionview\vidma04\Photorec_test\256_jpegs\DATs\modded0xFF.dat]
Filesize [36316] Bytes
Start Offset 0x00000000
Marker SOI (xFFD8)
OFFSET 0x00000000
Skipped 1 marker pad bytes
OFFSET 0x00000003
WARNING Unknown marker [0xFF00], stopping decode
Use [Img Search Fwd/Rev] to locate other valid embedded JPEGs
@joachimm
Good, I made a post on jpegsnoop's page to make Calvin Hass also aware of the matter.
jaclaz
I think initially one has to work on what is most likely. If 99.5% of jpegs are standard*, then they will be picked up with any extension.
If the files are among the 0.5% non-standard, then there should be a warning, because files called .jpg do not open, validate, or match a signature.
At this point, I would dive into a hex editor and see what is what.
With pressure on the speed of forensic examinations, is 99.5% acceptable, or do we need 99.999%? Hence my question: how common are non-e0/e1 files?
A disk under investigation is very likely to have many files of the same type. Thus, if non-e0/e1 files are present, they should be spotted.
*percentage figure is my guess - am I a long way off?
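A minimal sketch of that warning idea (my own illustration, assuming only the common two-value e0/e1 signature discussed in this thread; the function and variable names are made up):

import os

COMMON_JPEG_HEADERS = (b"\xff\xd8\xff\xe0", b"\xff\xd8\xff\xe1")

def triage(folder):
    # Warn when the extension and the common JPEG signature disagree.
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            head = f.read(4)
        jpg_name = name.lower().endswith((".jpg", ".jpeg"))
        jpg_sig = head in COMMON_JPEG_HEADERS
        if jpg_name and not jpg_sig:
            print(f"WARNING: {name} is named like a JPEG but lacks the e0/e1 signature")
        elif jpg_sig and not jpg_name:
            print(f"NOTE: {name} carries a JPEG signature under a different extension")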
I think initially one has to work on what is most likely. If 99.5% of jpegs are standard*, then they will be picked up with any extension.
If the files are among the 0.5% non-standard, then there should be a warning, because files called .jpg do not open, validate, or match a signature.
What is standard? Most occurring in your sample set? Who defines the sample set? Do you include all the cameras in your sample set? What if a very exotic camera is relevant?
With pressure on the speed of forensic examinations, is 99.5% acceptable, or do we need 99.999%? Hence my question: how common are non-e0/e1 files?
Common or uncommon is relative. And what relevance does it have? If your tool catches the edge cases as well, that's a nice addition, isn't it?
The relevance to your case will depend. In most cases where the graphical content of a JPEG is relevant, maybe even a 50% success rate is sufficient. In case you're looking for that one specially crafted JPEG file used to compromise your image web service, you very likely want 99.999%.
But if you don't know what you're looking for and don't know how relevant it is, how can you justify not looking for it?
(no need to answer, this is a philosophical question 😉 )
With pressure on the speed of forensic examinations, is 99.5% acceptable, or do we need 99.999%? Hence my question: how common are non-e0/e1 files?
I understand your point of view, but I see it from a different standpoint.
Is it "better" to cover 99.51% than 99.50% of possibilities?
If it is, how much does this increase cost (in terms of *whatever*, be it time, money, or images of little furry creatures used for tests)?
If nothing or next to nothing, why not do it?
Also, is digital forensics - and specifically digital forensics on criminal cases - the ONLY field where correctly identifying a JPEG and displaying it matters?
Wouldn't data recovery, malware analysis, and more or less a larger set of other computer-related fields take advantage of knowing about these peculiarities?
More generally, I guess we succeeded in at least moving the item from the group of the unknown unknowns 😯 to that of the known unknowns 😉
jaclaz
I thought I'd chime in with a few thoughts based on my experience developing JPEGsnoop and recovering damaged photos.
Attempting to identify "JPEG" images from a file header signature will run into a number of issues as has already been raised in this thread.
It's important to note that there are really three general standards at play with "JPEG" images:
- JPEG (ITU-T.81)
- JFIF
- EXIF
Of these, the JPEG format is the most permissive, being the superset. JFIF and EXIF leverage "JPEG" but reduce the flexibility to a subset (e.g. to address interoperability). Please note that the following is just a quick list of some considerations with respect to file carving - I am certainly no expert and therefore it is by no means exhaustive. Plus it is quite likely that I've made an error in overlooking a detail )
Some of the issues we could run into:
- 1) It is possible to encode "viewable" JPEG images meeting the spec of JPEG and file header requirements of JFIF but not EXIF
- 2) It is possible to encode "viewable" JPEG images meeting the spec of JPEG and file header requirements of EXIF but not JFIF
- 3) It is possible to encode "viewable" JPEG images meeting the spec of JPEG without meeting either JFIF or EXIF requirements.
- 4) It is possible to encode "viewable" JPEG images not meeting the spec of JPEG and similarly not meeting either JFIF or EXIF requirements.
By the "spec" of JPEG, I mean the use of valid marker encapsulation, valid marker values and a sequencing that meets the "Flow of Compressed Data syntax" and "Flow of Marker segments".
When decoding a file that deviates from the "spec" (#4), it is up to the implementation to decide how resilient or sensitive it may be. Some invalid markers could be ignored, provided that the necessary tables (ie. other markers) are present before we get to the scan data. That is why JPEGsnoop and Windows may give different results.
By my read of the specifications, we have a few different "valid" file headers for #1, #2 & #3:
- 1) JFIF 0xFFD8, 0xFFE0
- 2) EXIF 0xFFD8, 0xFFE1
- 3) JPEG 0xFFD8, {0xFFDB, 0xFFC4, 0xFFCC, 0xFFDD, 0xFFFE, 0xFFE0..0xFFEF, 0xFFC0..0xFFC7, 0xFFC9..0xFFCF, 0xFFDE, 0xFFD9}
Note that the file header list for #3 above may be overly permissive (I'd need to double-check that there are not further restrictions documented elsewhere).
So, if we are data carving to identify all possible images that fall under #1,#2,#3 then we could theoretically include all the values above. Although these may all be valid markers per the flow, some of them are probably unrealistic/incorrect to have at the start of the file.
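As a rough illustration (my own sketch, not code from any of the tools discussed, and the function name is made up), the permissive header test for #1/#2/#3 could be expressed as a set of allowed second markers after SOI:

JFIF_SECOND = {0xE0}
EXIF_SECOND = {0xE1}
JPEG_SECOND = ({0xDB, 0xC4, 0xCC, 0xDD, 0xFE, 0xDE, 0xD9}
               | set(range(0xE0, 0xF0))   # APP0..APP15
               | set(range(0xC0, 0xC8))   # SOF0..SOF7
               | set(range(0xC9, 0xD0)))  # SOF9..SOF15 (incl. DAC)

def classify_header(data: bytes) -> str:
    # Coarse classification based only on the first four bytes.
    if len(data) < 4 or data[0:2] != b"\xff\xd8" or data[2] != 0xFF:
        return "not a JPEG header"
    second = data[3]
    if second in JFIF_SECOND:
        return "JFIF-style header (FFD8 FFE0)"
    if second in EXIF_SECOND:
        return "EXIF-style header (FFD8 FFE1)"
    if second in JPEG_SECOND:
        return "plain JPEG (ITU-T.81) header"
    return "FFD8 followed by an unexpected marker"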
Given that non-image files could also have random data that aliases with the above marker values (e.g. if the parsing occurred mid-file because of file fragmentation prior to recovery), we should also consider cross-checking other encapsulation/format/sequencing checks in the file after validating the file header. I believe this was also mentioned earlier in the thread. In other words, one could have a permissive header check, but then apply further tests for validity to weed out the false positives. I could follow up in another post with more details/ideas.
Conversely, if you want a data carving utility to detect images that have been encoded improperly or corrupted (i.e. #4), one can't rely on the subset of valid markers alone. Instead, one would need to look for further indicators or heuristics. For example, one could look for the scan segment (0xFFDA) and then look for stuff bytes (0xFF00), or one of many other methods.
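For instance, a purely illustrative (hypothetical) version of that stuff-byte heuristic: once the SOS marker is found, any FF byte inside the scan data should only be followed by 00 (stuffing), a restart marker D0-D7, or the EOI D9; anything else is a strong hint of corruption or of a false positive:

def scan_data_plausible(data: bytes) -> bool:
    # Heuristic sketch only: a single-scan baseline JPEG is assumed;
    # progressive images with multiple SOS segments would need a fuller parser.
    sos = data.find(b"\xff\xda")
    if sos < 0:
        return False
    seg_len = int.from_bytes(data[sos + 2:sos + 4], "big")  # SOS header length
    pos = sos + 2 + seg_len
    while pos < len(data) - 1:
        if data[pos] == 0xFF:
            nxt = data[pos + 1]
            if nxt == 0xD9:                      # EOI reached: looks plausible
                return True
            if nxt != 0x00 and not (0xD0 <= nxt <= 0xD7):
                return False                     # unexpected marker inside scan data
            pos += 2
        else:
            pos += 1
    return False                                 # ran out of data before EOI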
If it's helpful, I could try performing a search across my database of images from the web (100k+) and see what file headers were actually used.
Calvin
Calvin, I would love to know if my wild guess that 99.5% of images are 0xe0/0xe1 is actually anywhere near correct. Your suggested scan might help here.
One type of file I skip when carving is jpeg headers within an AVI file; these have the string "AVI" starting in the 6th byte (after an e0 or e1). This was probably determined by false positives rather than detailed analysis of the spec!
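For what it's worth, a tiny sketch of that filter (hypothetical, with offsets following the description above; the function name is made up):

def is_avi_embedded_jpeg(header: bytes) -> bool:
    # Skip MJPEG frames carved out of AVI files: SOI, APP0/APP1, then "AVI".
    return (header[0:2] == b"\xff\xd8"
            and header[2] == 0xFF
            and header[3] in (0xE0, 0xE1)
            and header[6:9] == b"AVI")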
I thought I'd chime in with a few thoughts based on my experience developing JPEGsnoop and recovering damaged photos.
Which I am pretty sure will be useful/interesting; happy you joined, Calvin. )
By my read of the specifications, we have a few different "valid" file headers for #1, #2 & #3:
- 1) JFIF 0xFFD8, 0xFFE0
- 2) EXIF 0xFFD8, 0xFFE1
- 3) JPEG 0xFFD8, {0xFFDB, 0xFFC4, 0xFFCC, 0xFFDD, 0xFFFE, 0xFFE0..0xFFEF, 0xFFC0..0xFFC7, 0xFFC9..0xFFCF, 0xFFDE, 0xFFD9}
Very good, and definitely a non-trifling widening from the two-value set E0/E1 and from the four-value set E0/E1/EC/FE.
(I believe there are some repeated values in #3)
Your list of "valid according to specifications" is similar to the one joachimm posted:
application segment "\xff[\xe3-\xef]"
Table segments
"\xff\xc4" # Define Huffmann table (DHT)
"\xff\xcc" # Arithmetic coding condition table (DAC)
"\xff\xdb" # Define quantization table (DQT)Reserved segments
"\xff\xc8" # Start of Frame (JPG) (Reserved for JPEG extensions)
"\xff[\xf0-\xfd]" # Reserved for JPEG extensions
"\xff\xfe" # Comment (COM)
"\xff[\x02-\xbf]" # Reserved
(though not exactly the same one).
And still, the test showed how jpegsnoop will also "display" some of them (though "skipping" the unknown marker):
Values that produced a "valid" log (i.e. that continued the parsing after the header)
01, C4, C8, CC, DE, E0-FE
And will "choke" on
Values that crashed jpegsnoop
C0-C3, C5-C7, C9-CB, CD-CF
And
Values that produced an "invalid" log (i.e. that stopped the parsing after the header)
00, 02-BF, D0-DD, DF, FF
Now, I do understand how the "crash" and some of the "invalid" values are due to the "further" checks that jpegsnoop (which is obviously not a carver, nor a filetype identifier) performs, but it would be nice if the program would not crash and would have (say) an option (or whatever) to force the skipping.
As well, I would like it if there could be an option (or something) to allow "correcting" the values that are now "invalid" but that can be displayed by "common" viewers.
BUT among the "invalid" values (that jpegsnoop could NOT display), these were viewable in Explorer/Photo Editor as above
00, D0-D7, DC, FF
One of the probabilities that mscotgrove would not take into consideration, representing, say, 0.00000000000001% of cases: let's say that someone has an image that displays correctly in (still say) Photo Editor and that has E0 as its fourth byte.
After some time - for whatever reasons - *something* changes the fourth byte to 00.
The user continues to have the image display fine in Photo Editor.
Then *something else* changes another byte, making the image no longer display correctly in Photo Editor.
Right now, if the user gets jpegsnoop to attempt to recover the image, he will never be able to go past the initial fourth byte, which is not the actual "issue".
Someone will notice how the three people (joachimm, Calvin and myself) who posted about these values managed to find three different notations for the same data, so that it is very difficult to compare the values. 😯
I propose the use of a "more graphical" notation, such as the "table format" I used in my half-@§§ed batches.
What joachimm seemingly posted (please double check):
C4 C8 CC
DB
E0 E1 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE
What Calvin/Impulse seemingly posted (please double check):
C0 C1 C2 C3 C4 C5 C6 C7 C9 CA CB CC CD CE CF
D9 DB DD DE
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
FE
jaclaz