
Publication of Hachoir project version 1.0

(@haypo)

Hachoir is a framework for binary file manipulation: file format recognition, metadata extraction, searching for files in any binary stream (forensics), viewing file content in a human-readable representation, etc. It is composed of several components:

Programs
· hachoir-metadata: fault-tolerant metadata extraction;
· hachoir-subfile: search for subfiles in a disk image or any other binary stream;
· hachoir-urwid, hachoir-wx, hachoir-gtk, hachoir-http: user interfaces to view file content (curses, wxPython, pygtk, web+ajax).

Modules
· hachoir-core: library to split binary data into a field tree;
· hachoir-parser: collection of 70 file format parsers;
· hachoir-regex: regular expression optimization/manipulation and pattern matching (used by hachoir-subfile).

· Hachoir project website
· List of supported file formats (jpeg, ttf, exe, rar, ogg, ntfs, ole2, torrent, …)
· Examples of metadata extraction
· hachoir-wx screenshots

Hachoir works on any operating system and only depends on Python (2.4+). Packages are available for Debian, Mandriva, Gentoo, Arch and FreeBSD.

The goal of hachoir-core is to ease the writing of binary parsers. It takes care of endianness, has bit resolution (for addresses and sizes), and uses only Unicode for text. It gives a nice API to the programmer (see the parsers' source code): each field is an object. A parser is lazy: a field's value, display string, description, etc. are computed on demand (when the program asks for them). So it's possible to parse very complex structures and huge files (60 GB or more is not a problem).
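
To give an idea of what "lazy" and "bit resolution" mean in practice, here is a minimal sketch in plain Python 3. It does not use the hachoir-core API: the LazyBits class, its attributes and the two-byte sample are made up for illustration, and big-endian bit order is assumed.

```python
class LazyBits:
    """Hypothetical lazy field: stores only its location, decodes on first access."""

    decode_count = 0  # number of fields that have actually been decoded

    def __init__(self, name, data, address, size):
        self.name = name
        self._data = data
        self.address = address  # position in bits from the start of the data
        self.size = size        # size in bits
        self._value = None

    @property
    def value(self):
        if self._value is None:
            # Read only the bytes that overlap the field, not the whole input.
            first_byte = self.address // 8
            last_byte = (self.address + self.size + 7) // 8
            chunk = self._data[first_byte:last_byte]
            whole = int.from_bytes(chunk, "big")
            shift = len(chunk) * 8 - (self.address % 8) - self.size
            self._value = (whole >> shift) & ((1 << self.size) - 1)
            LazyBits.decode_count += 1
        return self._value


# Made-up layout: 1-bit flag, 3-bit kind, 12-bit level.
data = bytes([0b10110100, 0xFF])
fields = {
    "flag": LazyBits("flag", data, address=0, size=1),
    "kind": LazyBits("kind", data, address=1, size=3),
    "level": LazyBits("level", data, address=4, size=12),
}

# Building the fields decoded nothing; only "kind" is decoded here.
print(fields["kind"].value, LazyBits.decode_count)
```

Creating a field costs almost nothing because it only records an address and a size, which is the property that lets a lazy parser describe huge files without reading them entirely.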

hachoir-core and hachoir-metadata are "fault tolerant": on a parser/extractor error or a file error (truncated or damaged file), the program doesn't stop but continues to the next valid state. This makes it possible to extract information from badly damaged files.
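
The "continue to the next valid state" idea can be sketched as follows. This is not Hachoir code: the record format (a 0xAA marker, a length byte, then the payload) and the parse_records function are invented for the example, and resynchronising on the next marker is just one possible recovery strategy.

```python
MAGIC = 0xAA  # hypothetical record marker, not a real file format


def parse_records(data):
    """Return (records, errors) and never raise on damaged input."""
    records, errors = [], []
    pos = 0
    while pos < len(data):
        try:
            if data[pos] != MAGIC:
                raise ValueError("bad marker at offset %d" % pos)
            length = data[pos + 1]  # raises IndexError if the file ends here
            payload = data[pos + 2:pos + 2 + length]
            if len(payload) != length:
                raise ValueError("truncated record at offset %d" % pos)
            records.append(payload)
            pos += 2 + length
        except (ValueError, IndexError) as err:
            errors.append(str(err))
            # Recover: jump to the next marker, or stop if there is none.
            next_pos = data.find(bytes([MAGIC]), pos + 1)
            if next_pos == -1:
                break
            pos = next_pos
    return records, errors


# Two valid records, two garbage bytes in between, and a truncated last record.
damaged = b"\xaa\x02hi" + b"\x00\x00" + b"\xaa\x03abc" + b"\xaa\x05xy"
records, errors = parse_records(damaged)
print(records)
print(errors)
```

Errors are collected instead of raised, so the caller still gets every record that could be read and can decide what to do with the error list.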

hachoir-metadata creates a dictionary with typed values: a track number is an integer, a creation date is a datetime.datetime object, etc., and all text is stored as Unicode strings. The API allows easy reuse of extracted data.
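
As an illustration of why typed values are convenient, here is a small sketch in plain Python 3. It does not call hachoir-metadata: the to_typed function, the key names and the raw byte strings are assumptions made up for the example.

```python
from datetime import datetime


def to_typed(raw):
    """Convert raw byte strings (hypothetical input) into typed values."""
    return {
        "track_number": int(raw["track_number"].decode("ascii")),
        "creation_date": datetime.strptime(
            raw["creation_date"].decode("ascii"), "%Y-%m-%d %H:%M:%S"
        ),
        "title": raw["title"].decode("utf-8"),  # text is kept as Unicode
    }


# Made-up raw values, as they might be stored in a file header.
raw = {
    "track_number": b"7",
    "creation_date": b"2007-04-26 21:15:00",
    "title": b"Caf\xc3\xa9 del Mar",
}
meta = to_typed(raw)

# Typed values can be used directly, without parsing strings again.
print(meta["track_number"] + 1)
print(meta["creation_date"].year)
print(meta["title"])
```

With a dictionary of plain strings, every program reusing the data would have to parse numbers and dates again; with typed values it can compare, sort or format them directly.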

The source code has good code coverage from automated tests (lots of test cases). Fuzzing is sometimes used to find more bugs.

Some experimental programs exist, like hachoir-strip, a program to remove personal information (author name, timestamp, copyright, etc.) from a picture, movie, sound, archive, etc. Another example, swf_extract.py, extracts pictures and sounds from a SWF (Flash) document.

Victor Stinner aka haypo

PS: I tried to post this text as a news item but it was detected as spam!? Error message: "SpamGuard has blocked this email from being sent".


   