slithering along a file with python

The ‘file’ command is a nice tool. It has a database of filetypes and “magic” numbers which correspond to offsets and values within a file and are used to hazard a guess as to what type of file it is. On my system, the /usr/share/file/magic database has 13474 lines in it. Quite a bit of knowledge about filetypes at your fingertips!
To use it, simply run:
$ file <targetfile>
Example:
$ file /pictures/nice.jpg
/pictures/nice.jpg: JPEG image data, JFIF standard 1.02
or
$ file ./unknown
./unknown: VMS Alpha executable
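To make the idea of "magic" numbers concrete, here is a minimal sketch of the same kind of lookup done by hand. The table and the guess_type() helper are hypothetical names invented for this illustration, not part of the "file" command or its database, and the real database holds far more (and far more elaborate) rules than these four well-known signatures.

```python
# Hypothetical mini "magic" table: (offset, expected bytes, description).
# Only a handful of well-known signatures; the real database is much richer.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"\xff\xd8\xff", "JPEG image data"),
    (0, b"GIF8", "GIF image data"),
    (257, b"ustar", "POSIX tar archive"),
]


def guess_type(data):
    """Return the description of the first matching signature, else 'data'."""
    for offset, magic_bytes, description in SIGNATURES:
        if data[offset:offset + len(magic_bytes)] == magic_bytes:
            return description
    return "data"


if __name__ == "__main__":
    print(guess_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
    print(guess_type(b"hello world"))
```

Each rule is just "these bytes at this offset", which is why a short run of unrelated bytes can match a rule by accident.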
What happens when we are dealing with “unknown” file types that the “file” command’s database may not describe accurately? Or when a file contains many other files within it that we can’t easily get to? We can attempt to peer inside an unknown container file and find out what other kinds of files it is made of by sliding along the file and comparing the data at every offset against the magic database.
Luckily, there is a Python binding for the “magic” database.
# apt-get install python-magic
And a handy example is included in /usr/share/doc/python-magic/examples/example.py.
Excellent. This is just what we need. Our algorithm is simple. Loop over each offset in the file and see what python-magic thinks it is. Interesting offsets can then be identified and extracted for further analysis.
Here’s a quick one-off python script to do just that:
-------------------------- BEGIN magicslide.py
#!/usr/bin/env python
"""
%s <filename>

<filename> will be checked at each offset to see what the magic
database from the "file" command thinks it is.
Entries that return 'data' will be filtered because they are boring.
"""
import magic
import os
import sys


def usage():
    sys.stdout.write(__doc__ % os.path.basename(sys.argv[0]))
    sys.exit(0)


def analyze(ms, buffer):
    # ask the magic db what this chunk of bytes looks like
    return ms.buffer(buffer)


def output(offset, s):
    sys.stdout.write("%08x:%s\n" % (offset, s))


try:
    filename = sys.argv[1]
except IndexError:
    usage()

try:
    f = open(filename, "rb")  # binary mode: we want the raw bytes
except IOError:
    sys.stderr.write("could not open %s\n" % filename)
    sys.exit(1)

filedata = f.read()
f.close()
totallen = len(filedata)
buffsize = 4096  # a nice big chunk of file

# load the magic db
ms = magic.open(magic.MAGIC_NONE)
ms.load()

# slide along the file one byte at a time
for offset in range(0, totallen):
    end_offset = min(offset + buffsize + 1, totallen)
    kind = analyze(ms, filedata[offset:end_offset])
    if kind != 'data':
        output(offset, kind)
--------------------------------------- END magicslide.py
Sample output looks like:
0001047c:Hitachi SH big-endian COFF executable, not stripped
00010493:PCX ver. 2.5 image data
000104a8:MIPSEB MIPS-III ECOFF executable not stripped - version 255.26
000104b2:\012- 8086 relocatable (Microsoft)
000104b8:PCX ver. 2.5 image data
000104bd:MPEG ADTS, layer I, v1, 32 kBits, 32 kHz, Monaural
000104c1:MPEG ADTS, layer I, v1, 448 kBits, 32 kHz, Stereo
000104c8:DBase 3 data file
000104cc:LANalyzer capture file
000104e0:PCX ver. 2.5 image data
000104e8:shell archive or script for antique kernel text
000104ef:PCX ver. 2.5 image data
000104f6:MPEG-4 LOAS
00010508:AmigaOS bitmap font
0001050c:PCX ver. 2.5 image data
00010514:shell archive or script for antique kernel text
0001051c:MIPSEB MIPS-III ECOFF executable not stripped - version 0.10
00010522:MPEG-4 LOAS
00010530:Hitachi SH big-endian COFF executable, stripped
00010538:DBase 3 data file
0001053c:PCX ver. 2.5 image data
00010544:shell archive or script for antique kernel text
00010549:MPEG ADTS, layer I, v1, 32 kBits, 32 kHz, Stereo
00010560:DBase 3 data file
Well, it’s still pretty messy and some of the guesses will be wrong, but it’s more than we had to go on before for our analysis of this unknown file type. There are obvious false positives here, but images such as JPEGs and PNGs can probably be readily identified in the file of interest.
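Once an offset looks interesting, one way to follow up is to cross-check it against a strict signature and carve the candidate out for a closer look. The sketch below does that for PNG only, assuming the embedded image is stored contiguously and uncompressed in the container. The helper names find_all() and carve_pngs() are made up for this example and are not part of magicslide.py or python-magic.

```python
# PNG files start with a fixed 8-byte signature and end with an IEND chunk,
# whose type field is followed by a constant 4-byte CRC.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
PNG_END = b"IEND\xaeB`\x82"


def find_all(data, needle):
    """Yield every offset at which needle occurs in data."""
    pos = data.find(needle)
    while pos != -1:
        yield pos
        pos = data.find(needle, pos + 1)


def carve_pngs(data):
    """Return (offset, bytes) for each span that starts with the PNG
    signature and runs to the end of the next IEND trailer."""
    carved = []
    for start in find_all(data, PNG_MAGIC):
        end = data.find(PNG_END, start)
        if end != -1:
            carved.append((start, data[start:end + len(PNG_END)]))
    return carved


if __name__ == "__main__":
    # Synthetic container: junk, a fake "PNG" (not a valid image), more junk.
    fake_png = PNG_MAGIC + b"...chunks..." + PNG_END
    blob = b"junk" * 4 + fake_png + b"more junk"
    for offset, png in carve_pngs(blob):
        print("%08x: %d bytes" % (offset, len(png)))
```

The carved bytes are only candidates: nothing here validates chunk lengths or checksums, so each result should still be opened with a real image tool before being trusted.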
# aa