Wednesday, September 14, 2011

Decompressing Archive Files

I've been decompressing archive files recently.  It's a lot of fun; it makes me feel like a detective.  I started after I read this blog post (http://briancarper.net/blog/475/) about extracting resources from a Final Fantasy PSP game.  I had wanted to do this sort of thing when I was in high school, but I didn't even know where to start.  I wondered how I'd fare now after years working as a programmer.

I followed that blog's links for a while and came across this useful guide (http://www.xentax.com/downloads/multiex/Definitive_Guide_To_Exploring_File_Formats_MW_2004.pdf) and this great forum (http://forum.xentax.com/).  After reading the guide and following the Quake example, I felt I was ready to try an archive on my own.  It really wasn't very hard, just time-consuming.  I found that to be generally true for all the different archives I ended up working on.

Decompressing an archive is reverse engineering how the data is compressed, so most of the challenge is in detecting patterns and guessing at data structure... which is pretty fun.  Things go better with good tools and more experience, but there's always a lot of trial and error.  Success can rely heavily on luck.  Sometimes the archives can just be too confusing or sometimes the encryption is too hard to reverse.  I haven't come across too many "impossible" examples, but I haven't done too many archives either.

I decided to document (with some side comments in parentheses) how I solved my first solo archive to hopefully help people who are just starting out themselves.  This isn't the best or most efficient way of doing it (in fact, it may not even be completely correct), but it should be a good starting point to understand how it can be done.  I did this on a Windows XP machine over a (long) evening.

I chose the game Half Minute Hero, which is a sprite-based PSP game.  It's a great game, so I suggest you buy a copy.  You can get it on Xbox Live too.  It's one of my favorite games, although that's probably due to my increasing ADD tendencies.  It's a mashup of a puzzle game and a JRPG, as well as a sort of deconstruction and parody of NES/SNES JRPGs.


My goal was to rip the sprites from the PSP version of Half Minute Hero.  This is how I did it.


I started off by obtaining an image of the game.  (If you can't rip one yourself, you can probably easily find one online.  If you get a CSO, find a tool to decompress it to an ISO.  You can mount the ISO just as you would any image: You can burn it to a CD or use Virtual Drive Software, such as Daemon Tools Lite.)

I mounted the ISO using Daemon Tools and looked through the contents via the Windows file explorer.  I found Half Minute Hero contains all its High Res Images and Music in obvious files (Music is in the AT3 format, which can be converted to MP3 using this tool), but the sprite resources weren't out in the open.  I used a hex editor to look for image data in the files with extensions I didn't recognize.  I quickly found filenames with BMP extensions in the RESARC.JRZ file.

(The archive file that contains the sprite resources is RESARC.JRZ.  You can download it here if you'd like to play along: RESARC.JRZ.  You can't do much else with it besides extract files.)

Next, I opened up RESARC.JRZ in Hex Workshop and looked around for clues that would help me figure out how to decompress the data.  (When I work on an archive, the first thing I do is run a google search on the first few bytes, which often identifies the file format.  Sometimes the file format has already been solved and I can just download a script to decompress it.  If I can't identify the file format or find an existing decompressor, the next thing I do is look at the bytes that could be ASCII characters.  If there are bytes that are obviously filenames, I play around with the bytes preceding and following the filename bytes, looking for bytes that might be file offsets.)  After looking/playing around for a while, I determined the parts of the file structure I cared about.

This is what I found:  The RESARC.JRZ archive is split into a list of all the files followed by the compressed file data itself.  The file list starts at offset 0x006D.  Each file entry has a 28-byte header, followed by a variable-length ASCII file name, followed by a 00 byte (the name length stored in the header counts that 00).  The header structure seemed to be:

     2 bytes unknown
     2 bytes for the number of bytes in the file name
     4 bytes unknown
     4 bytes unknown
     4 bytes for the file size
     4 bytes for the extracted file size
     4 bytes unknown
     4 bytes for the file offset

As a sample, here are the last two listed files in the file list section:

     - PARTY01.BMP (Starting at offset 0xE42F) -
     00 00
     0C 00 (Number of Bytes for the Filename)
     00 00 00 00
     00 00 00 00
     92 0A 00 00 (File Size A)
     36 04 01 00 (Extracted File Size A)
     00 00 00 00
     A9 B1 3B 00 (File Offset A)
     50 41 52 54 59 30 31 2E 42 4D 50 00 (File Name + 00)

     - PARTY01_HIT.BMP (Starting at offset 0xE455) -
     00 00
     10 00 (Number of Bytes for the Filename)
     00 00 00 00
     00 00 00 00
     DF 08 00 00 (File Size B)
     36 04 01 00 (Extracted File Size B)
     00 00 00 00
     3B BC 3B 00 (File Offset B)
     50 41 52 54 59 30 31 5F 48 49 54 2E 42 4D 50 00 (File Name + 00)

Doing the math, 0x3BB1A9 (File Offset A) + 0x0A92 (File Size A) = 0x3BBC3B (File Offset B), so my analysis seemed like it could be correct.
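(The same check can be done in code.  Here's a minimal Python sketch, not part of my original workflow, that parses the two sample entries above using the header layout I guessed.  The field names such as packed_size and unpacked_size are made up for this example, and the unknown fields are simply skipped.)

```python
import struct

# Guessed layout: 28-byte little-endian header, then the file name.
# H = 2-byte value, I = 4-byte value. Field names are hypothetical.
ENTRY_HEADER = struct.Struct("<HHIIIIII")

def parse_entry(buf, pos):
    """Parse one file-list entry at buf[pos:]; return (entry, next_pos)."""
    (_unk0, name_len, _unk1, _unk2,
     packed_size, unpacked_size, _unk3, offset) = ENTRY_HEADER.unpack_from(buf, pos)
    name_start = pos + ENTRY_HEADER.size
    raw_name = buf[name_start:name_start + name_len]  # name_len includes the 00
    entry = {
        "name": raw_name.rstrip(b"\x00").decode("ascii"),
        "packed_size": packed_size,
        "unpacked_size": unpacked_size,
        "offset": offset,
    }
    return entry, name_start + name_len

# The two sample entries from the post, byte for byte.
sample = bytes.fromhex(
    "00 00 0C 00 00 00 00 00 00 00 00 00 92 0A 00 00 "
    "36 04 01 00 00 00 00 00 A9 B1 3B 00 "
    "50 41 52 54 59 30 31 2E 42 4D 50 00 "
    "00 00 10 00 00 00 00 00 00 00 00 00 DF 08 00 00 "
    "36 04 01 00 00 00 00 00 3B BC 3B 00 "
    "50 41 52 54 59 30 31 5F 48 49 54 2E 42 4D 50 00"
)

entry_a, next_pos = parse_entry(sample, 0)
entry_b, end_pos = parse_entry(sample, next_pos)

# If the layout guess is right, file A's data ends exactly where file B's begins.
print(entry_a["name"], hex(entry_a["offset"]), hex(entry_a["packed_size"]))
print(entry_b["name"], hex(entry_b["offset"]))
print(entry_a["offset"] + entry_a["packed_size"] == entry_b["offset"])
```

(If the last line prints True for neighboring entries throughout the list, that's decent evidence the size and offset fields are in the right places.)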

(I should make a note here.  You'll often come across little-endian (http://en.wikipedia.org/wiki/Endianness) byte order, which means the least significant byte of a multi-byte value comes "first", stored at the lowest address.  This is the case for this archive.  For example, the bytes for File Offset A are [A9 B1 3B 00], but the value of File Offset A, a 4-byte long, is actually 0x003BB1A9.  It's pretty obvious which reading is right when you compare the two candidate addresses: 0xA9B13B00 is out of range for this file, while 0x003BB1A9 is not.)
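(To make the byte order concrete, here's a small Python sketch that reads the four bytes of File Offset A both ways.  It only uses the standard library; the variable names are just for illustration.)

```python
import struct

raw = bytes([0xA9, 0xB1, 0x3B, 0x00])  # File Offset A as stored in the archive

little = int.from_bytes(raw, "little")  # least significant byte first
big = int.from_bytes(raw, "big")        # most significant byte first

print(hex(little))  # 0x3bb1a9
print(hex(big))     # 0xa9b13b00

# struct gives the same answer; "<I" means little-endian unsigned 4-byte value.
same_as_struct = struct.unpack("<I", raw)[0] == little
```

(Only the little-endian reading lands inside the archive, which is the sanity check described above.)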

Now that I knew how to find the location of the file data, the next step was to recreate the files.  To do that, I tried to correctly decompress one of the listed BMP files.

I copied the contents of byte location 0x3BB1A9 through 0x3BBC3B to a new file named PARTY01.BMP.  It didn't open as a BMP.  I noticed the data started with [08 00 00 00] instead of the expected [42 4D] (all BMP files should start with those two bytes, which is ASCII for "BM"), so first I simply tried adding a BMP header.  The file still didn't open.  Because it didn't work, it either meant I created the file incorrectly, added the header incorrectly, or the file was compressed/encrypted.  I decided to assume the file was compressed and work from there.

I noticed all the BMP files I'd tried to extract so far started with [08 00 00 00].  Reading more bytes in each file, I noticed all the files contained the hex bytes [78 DA] at location 0xC.  After some googling, I found out that those bytes usually indicated the start of a zlib compressed stream.
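(Rather than eyeballing the hex for [78 DA], you could scan for anything that looks like a zlib header.  This is a rough Python sketch based on the zlib format rules: the low nibble of the first byte is 8 for deflate, and the two bytes read as a big-endian number are a multiple of 31.  The function names are my own, and a match is only a candidate, since two random bytes can pass this test by accident.  The test data is synthetic, not from the game.)

```python
import zlib

def looks_like_zlib_header(two_bytes):
    """True if two bytes pass the basic zlib header checks."""
    if len(two_bytes) < 2:
        return False
    cmf, flg = two_bytes[0], two_bytes[1]
    is_deflate = (cmf & 0x0F) == 8        # compression method 8 = deflate
    window_ok = (cmf >> 4) <= 7           # window size field must be 0..7
    check_ok = ((cmf << 8) | flg) % 31 == 0
    return is_deflate and window_ok and check_ok

def find_zlib_candidates(data):
    """Return every offset where a zlib stream might start."""
    return [i for i in range(len(data) - 1)
            if looks_like_zlib_header(data[i:i + 2])]

# Synthetic stand-in for an extracted file: 12 bytes of header, then zlib data.
fake_file = b"\x08\x00\x00\x00" + bytes(8) + zlib.compress(b"BM" + bytes(200), 9)
candidates = find_zlib_candidates(fake_file)
print(hex(candidates[0]))
```

(On this fake file the first candidate is at 0xC, matching what I saw in the real files.  Later candidates inside the compressed data can be false positives.)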

After some more googling I found aluigi's offzip tool, which unpacks the zip (zlib/gzip/deflate) data contained in any type of file.  I tested it out on PARTY01.BMP.  The offzip program requires an offset.  The [78 DA] bytes that identified the zipped file started at 0xC, so I used that.  I executed the command [offzip "input/PARTY01.BMP" "output/PARTY01.BMP" 0xC] and checked the resulting file.  I successfully opened it up in MS Paint as a BMP, but it had a bunch of garbage at the top.  Still, it opened as a recognizable BMP and I figured this was good enough; maybe that was just how the game chose to represent BMPs.


At this point I wrote a script to decompress all the archive files contained by RESARC.JRZ.  I used aluigi's QuickBMS.  The syntax was simple enough.  'get DUMMY short' pulls the next two bytes into a variable named DUMMY, whereas 'get DUMMY long' pulls the next four bytes into a variable named DUMMY.  The BMS script is run via a program called quickbms.  When executed, it asks for the archive, the BMS script, and the output directory.  This is the BMS script:
  goto 0x6D                      # Jump to the file list section

  for i = 0 < 15000
      get DUMMY short
      get FILENAMELENGTH short
      get DUMMY long
      get DUMMY long
      get FILESIZE long
      get DUMMY long             # Extracted file size (not used here)
      get DUMMY long
      get OFFSET long
      getdstring NAME FILENAMELENGTH
      log NAME OFFSET FILESIZE   # This creates the file in the directory you specified.
  next i
The script worked and I had all the compressed files.  Then I wrote a batch script to call offzip on all the compressed files:
  cd "C:\Extract\HMH"
  for %%i in (input/*.pck) do offzip "input/%%i" "output/%%i" 0xC
After running that, I found that many compressed BMP files failed to decompress into viewable images.  So I pored over the files in Hex Workshop again.  This time I noticed not all the files began with [08 00 00 00], that [78 DA] wasn't always at offset 0xC, and that [78 DA] sometimes appeared several times throughout the file.  I also looked more closely at the offzip options, noticing its options (-a) for multiple sets of zip data.  It clicked for me that there were multiple sets of zip data in the compressed files.  (It wasn't immediately obvious because I hadn't dealt with much file compression before.)

So, as a test, I executed the command [offzip -a "input/3_300LOGO.BMP" "output/3_300LOGO.BMP" 0].  It generated multiple .dat files.  I merged all the .dat files the command produced (by simply copying and pasting the contents of each file, in order, into one file) and, sure enough, I had a valid BMP file.
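(Here's roughly what that scan-and-merge step looks like as a Python sketch.  To be clear, this is not how offzip works internally and not what I ran; it's a brute-force illustration that tries to inflate at every offset, keeps each stream that decompresses completely, and joins the results in order.  The function name is made up, and the test data is synthetic.)

```python
import zlib

def inflate_all_streams(data):
    """Find every complete zlib stream in data and concatenate their output."""
    out = bytearray()
    pos = 0
    while pos < len(data) - 1:
        decompressor = zlib.decompressobj()
        try:
            chunk = decompressor.decompress(data[pos:])
        except zlib.error:
            pos += 1          # not a zlib stream here, try the next byte
            continue
        if not decompressor.eof:
            pos += 1          # started like a stream but never finished
            continue
        out += chunk
        # Skip past the bytes this stream consumed; the rest is unused_data.
        pos = len(data) - len(decompressor.unused_data)
    return bytes(out)

# Synthetic file: a small header, one stream, some filler, a second stream.
part1 = bytes(range(256))
part2 = b"tail" * 10
fake_file = (b"\x08\x00\x00\x00" + bytes(8) + zlib.compress(part1, 9)
             + b"\x00\x01junk" + zlib.compress(part2, 9))
merged = inflate_all_streams(fake_file)
print(len(merged))
```

(Trying every offset is slow on big files, but it makes the idea clear: whatever sits between the streams is skipped, and only the inflated data is kept.)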

So I rewrote my batch file.  (Instead of using a batch script, I could have continued using BMS, but I still wasn't that familiar with it.)  For each compressed file, I would run the 'offzip -a' command to generate .dat files.  Then I would concatenate all the .dat files into one file.  Then I would delete all the .dat files:

After running that, I finally realized there was still a problem with how I was decompressing the files, because I noticed some BMP files didn't have garbled tops.  I now knew there were multiple sets of zip data in each file.  BMP files normally store their pixel rows bottom-up, so the top of the image comes from the end of the file, and a garbled top probably meant data was missing from the end.  It was obvious that I was somehow skipping the last set of zip data.
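(A quick Python sketch of that reasoning: for a bottom-up BMP, it works out how many rows at the top of the picture have no data when the pixel data is cut short.  The 256x256, 8-bit example is my guess at what these sprites sheets are, based on the extracted size 0x010436 being 54 bytes of headers plus a 1024-byte palette plus 0x10000 bytes of pixels.  The 0x2000 missing bytes are invented for the example.)

```python
def missing_top_rows(width, height, bits_per_pixel, pixel_bytes_present):
    """Rows at the top of a bottom-up BMP that have no complete pixel data."""
    # Each BMP row is padded to a multiple of 4 bytes.
    row_size = ((width * bits_per_pixel + 31) // 32) * 4
    complete_rows = pixel_bytes_present // row_size
    return max(height - complete_rows, 0)

full_pixels = 0x010436 - 54 - 1024   # assumed: 54-byte headers + 256-color palette
print(hex(full_pixels))              # 0x10000

# Pretend the last zip data set (0x2000 bytes of pixels) was skipped.
lost = missing_top_rows(256, 256, 8, full_pixels - 0x2000)
print(lost)                          # 32
```

(So losing the tail of the file shows up as a band of junk across the top of the image, which is exactly what I was seeing.)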

Poring over the offzip options again, I eventually decided to try the -m option, which specifies the size of the zip block.  Using -m 16 found one more zip data set than -m 32.  I updated my script and ran it again.  There were a lot more valid BMP files, but some files, such as CARAVAN.BMP, were still not right.  After some more manual trial and error, I found that I needed -m 8.
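(In hindsight, the file list already holds something to check against.  If the field I labeled "extracted file size" really is the decompressed size, comparing it with the length of each merged output would flag incomplete files without opening them one by one.  This Python sketch assumes that interpretation; the function name and the sample data are made up.)

```python
def find_size_mismatches(expected_sizes, outputs):
    """Return {name: (expected, actual)} for outputs with the wrong length."""
    mismatches = {}
    for name, expected in expected_sizes.items():
        actual = len(outputs.get(name, b""))
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches

# Sizes as read from the file list (both sample entries say 0x010436).
expected_sizes = {"PARTY01.BMP": 0x010436, "PARTY01_HIT.BMP": 0x010436}

# Stand-in outputs: one complete, one missing its final 0x2000 bytes.
outputs = {
    "PARTY01.BMP": bytes(0x010436),
    "PARTY01_HIT.BMP": bytes(0x010436 - 0x2000),
}

report = find_size_mismatches(expected_sizes, outputs)
for name, (expected, actual) in report.items():
    print(name, "expected", expected, "got", actual)
```

(A check like this would have pointed straight at files like CARAVAN.BMP instead of leaving me to spot them by eye.)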

I created my final batch script:
  cd "C:\Extract\HMH"
  for %%F in (input/*.*) do call :offzip "%%F"
  goto :eof

  :offzip
  set "FILEVAR=%~1"
  rem Extract every zip data set in this file to output\*.dat
  offzip -a -m 8 "input/%FILEVAR%" "output" 0
  cd output
  rem Start with an empty file, then append each .dat chunk to it
  type nul > "%FILEVAR%"
  for %%D in (*.dat) do copy /b "%FILEVAR%"+"%%D" "%FILEVAR%"
  del *.dat
  cd ..
  goto :eof
(By the way, if you're using Windows 7, you need to run this script with Administrator privileges.  If not, the copy command will fail and the output won't be written.  For more information, read this blog post.)

And everything came out right.  Exciting!  I browsed through some of the other data, reading the strings and such.  Everything looked as expected now.  So the steps I took were:

1. Get the Half Minute Hero ISO, mount it, and find the RESARC.JRZ archive.
2. Run the BMS script on the RESARC.JRZ archive to get all the zlib compressed files.
3. Run the Batch Script on the directory all the compressed files are stored in.

Once again, you can download all the files you need to do this from this link.

(And once again, disclaimer, I don't really know what I'm doing.  I'm just trying what makes sense and it's seemed to work so far.)

After getting this far, there's lots more that can be done than getting a bunch of images.  You can hunt for data, like game text scripts. Game mods can be made by editing the decompressed files and then repackaging the archive.  Converters can be created to change program specific audio and visual files to more standard formats.  Those things are outside of the scope this blog entry will cover though, and really outside of what I've personally tried.  But if you've made it this far, you're probably ready to do stuff with the data you've extracted.  If so, Good Luck!

(As a side note, the Japanese version of Half Minute Hero 2 has been released, so if you want a simple archive to decompress, you can find the ISO and try it to get the images.  The file structure is really similar to Half Minute Hero, so it's a good archive to start off with.)
