
Extracting strings from an image file

You can use perldoc to read this document: perldoc README.

Last update: 2006-04-05

Server/Client parts

The server divides the image file into parts, using a maximum size for each part. The parts overlap: for example, the first part is the first megabyte plus some bytes, the second part is the second megabyte plus some bytes, and so on.

The overlap should be equal to the maximum length of passwords extracted.
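As a minimal sketch (not the actual server code, and assuming a part size of one megabyte; all names are illustrative), the byte range of part i could be computed like this:

        #include <stddef.h>

        #define PART_SIZE (1024 * 1024)    /* maximum part size: 1 megabyte */

        /* Compute offset and length of part i of an image of img_size
           bytes; parts overlap by max_len, the maximum password length. */
        size_t part_range(size_t i, size_t img_size, size_t max_len,
                          size_t *offset)
        {
          size_t len = PART_SIZE + max_len; /* one megabyte plus some bytes */

          *offset = i * PART_SIZE;
          if (*offset >= img_size)
            return 0;                       /* no such part */
          if (*offset + len > img_size)
            len = img_size - *offset;       /* the last part is shorter */
          return len;
        }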

The client downloads the proper part of the image and gives its name to the worker via a charset description file, which is also generated by the server.

The server needs to define an EXTRACT type of charset. It should carry settings for what to extract (the type of characters) and in which way (skipping, terminating; see below).

The charset should contain neither the minimum and maximum length, nor the actual file from which to extract. These settings will be specified on a job-by-job basis, since they typically differ for each job.

The job's start and end settings will serve as the min and max length, and the image file name must be specified additionally.

Worker and Workerframe

The worker reads the image file and extracts passwords out of it, and then tries each of the extracted passwords in turn.

The exact implementation is transparent, i.e. the usage of dofunction() stays the same: for each extracted string, dofunction() is called to check it. Where the string comes from is of no concern to the checking code. This keeps generation and checking separated.

Password extraction

The extraction runs in bursts, i.e. first all strings of length N are extracted and tried, and only then are strings of length N+1 extracted.

For each length from min to max (where min is the minimum length requested, and max the maximum length requested from us to extract), the same schedule is used: first extract all strings, then check them.

This has many advantages for the code, and simplifies it a lot. Details on why follow further down.
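As a rough sketch of that schedule (run_stage() is a hypothetical helper standing in for one extract-then-check burst, not the real API):

        #include <stddef.h>

        void run_stage(size_t N);   /* hypothetical: extract all strings of
                                       length N, then check each of them */

        /* The per-length schedule: one extract-then-check burst per length. */
        void run_schedule(size_t min, size_t max)
        {
          size_t N;

          for (N = min; N <= max; N++)
            run_stage(N);
        }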

Extracting at each position

For each byte position in the image file, we extract the string of length N at the current position. Some problems arise: what to do if the string at the current position is longer (it usually will be), or shorter (as at the end of the current image part), or contains invalid/unwanted characters.

Extracting the same length

The first observation is that when extracting strings of length N, we never need to extract strings that are shorter or longer than N.

The longer ones will be done later, so we need not be concerned about them now.

The shorter passwords fall into two categories:

Smaller than min

Passwords smaller than the min length need not be extracted at all.

Smaller than N

Passwords that are smaller than N, but of at least the min length, were already tried when we extracted passwords of length N-1, N-2 etc., so we do not need to extract them again.

This means that at stage N we will only extract passwords of exactly length N. Thus we know that all extracted strings have the same length, which simplifies the code greatly.

Skipping or Termination characters

The question remains what to do with characters that cannot be part of an extracted string. We have two options here:

terminator

The character in question terminates the current extracted string. If the string is shorter than N, we simply drop it. As an optimization, we can also skip ahead to the position right after the terminator, since any string between the current position and the terminator would be even shorter than the current one, and thus also shorter than N.

skip

We simply skip the character in question and append the next valid character we find.

Both options have advantages. The terminator cuts down unwanted strings, while the skip also catches strings that are in (simple) Unicode, or broken down on line breaks, or other boundaries.
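The following sketch (not the actual worker code; all names are illustrative) shows both options in one routine:

        #include <stddef.h>

        static char is_valid[256];   /* filled from the SIMPLE charset */

        /* Try to extract a string of exactly N valid characters starting
           at pos. Returns 1 on success. With skip_invalids set, invalid
           bytes are skipped; otherwise they terminate (drop) the string. */
        int extract_at(const unsigned char *img, size_t img_len, size_t pos,
                       size_t N, unsigned char *out, int skip_invalids)
        {
          size_t have = 0;

          while (pos < img_len && have < N)
          {
            if (is_valid[img[pos]])
              out[have++] = img[pos];
            else if (!skip_invalids)
              return 0;              /* terminator: string too short, drop */
            pos++;                   /* skip mode: step over the invalid byte */
          }
          return have == N;          /* 1 if a full-length string was found */
        }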

Also, the characters considered invalid/valid are changeable, so that we can extract strings of certain qualities. We use a SIMPLE charset to describe the list of characters that are valid. Anything else is then dropped or used as a terminator.

Extracting at odd/even positions

There are some simple Unicode strings that follow the pattern of:

        0x00 0x41 0x00 0x42 0x00 0x43   .A.B.C etc

While these will be extracted when the extract_skip_invalids option is set to 1 (see skip above), this will also generate a lot of garbage strings. To extract only these specific strings, but not every other random collection of characters from the image, we introduce another option called extract_even_odd. Setting it to 1 causes the extraction code to do the following three steps (sketched in code below):

extract at every position
extract at every even position
extract at every odd position

This means the maximum number of strings extracted will be doubled; however, in practice not all strings are valid (they may be too short, or doubles of already extracted strings), so only somewhat more strings will be extracted.
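A sketch of how these three passes could look (extract_pass() and extract_stage() are illustrative names, not the real API):

        #include <stddef.h>

        /* One extraction pass with a given start offset and a stride
           between the collected bytes. */
        static void extract_pass(const unsigned char *img, size_t img_len,
                                 size_t N, size_t start, size_t stride)
        {
          size_t pos;

          for (pos = start; pos + (N - 1) * stride < img_len; pos += stride)
          {
            /* collect img[pos], img[pos + stride], ...,
               img[pos + (N - 1) * stride] as one candidate string and
               hand it to the collection/checking stage */
          }
        }

        /* With extract_even_odd set to 1, stage N runs three passes: */
        void extract_stage(const unsigned char *img, size_t img_len, size_t N)
        {
          extract_pass(img, img_len, N, 0, 1);   /* every position */
          extract_pass(img, img_len, N, 0, 2);   /* every even position */
          extract_pass(img, img_len, N, 1, 2);   /* every odd position */
        }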

While this can be combined with extract_skip_invalids, doing so is not recommended, in order to cut down the number of unwanted strings.

Weeding out doubles

It does not make sense to check the same password twice, so we must weed out doubles. A first, simple check is whether the currently extracted password is the same as the previous one; such doubles can be skipped immediately.

This improves performance if there are long stretches of the same pattern (if the pattern length is divisible by N) or of the same character, since these will be dropped immediately.
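A minimal sketch of this check (keep_string() is an illustrative name):

        #include <string.h>

        /* Keep a freshly extracted string only if it differs from the
           previous one; all strings at stage N have the same length N. */
        int keep_string(const unsigned char *cur, const unsigned char *prev,
                        size_t N)
        {
          return prev == NULL || memcmp(cur, prev, N) != 0;
        }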

After extracting all possible strings, minus some of the doubles, we will still have some doubles left. We weed these out effectively by sorting all strings, and then skipping over the double ones.

The doubles need not be removed from the list, since they occur right after each other (after the sorting), so before calling dofunction, we could just check that we call it with a new, different password.

Not all doubles will be skipped, however. The reason is that we can fit only so many strings into memory and sort them (which is necessary to skip doubles). If our memory is exhausted, we clear the list and start over with an empty list, appending newly extracted strings to it.

And in the next list we might find some strings that were already checked in the list before.

However, the occasional double check is still much faster than attempting to create one big, huge sorted list of all strings ever seen, especially if the file is very big. The reason is that keeping an entire huge list of all unique strings ever seen is impossible without unlimited memory and/or swapping strings to disk. This is where previous, already existing solutions fail, so we only keep a finite number of strings and accept that doubles occur.

Mathematically speaking, building a list of strings twice and checking them takes twice as long, but merging these two lists and sorting them as one takes more than twice as long due to the properties of the sort algorithm, which is typically O(N * log N): sorting two lists of N strings each costs about 2 * N * log N, while sorting one list of 2 * N strings costs 2 * N * log (2 * N), which is more.

This also affects string extraction while saving strings to a file; see the topic "Dictionary Extraction" for more information.

To weed out doubles effectively, we gather as many strings as we can fit into memory, i.e. more than a limit of X.

Then we sort them, and remove the doubles.

If we still have more than X (now unique) strings, we put them through dofunction() and empty the list.

If we have fewer than X (now unique) strings, we extract some more until we again have more than X strings, then sort them, weed out the doubles, and check again against our limit of X.

For performance reasons we note that after the first sort we have a list of strings that is already sorted (Y strings), with another list of unsorted strings appended (M strings), so that Y + M > X.

Instead of sorting the entire list and then weeding out the doubles, we just need to sort the last M strings, and then merge the two sorted lists into one big sorted list of strings, from which we can then weed out the doubles.
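A sketch of this sort-and-merge step, using the C library's qsort() rather than the worker's own mergesort (all names are illustrative):

        #include <stdlib.h>
        #include <string.h>

        static size_t cur_len;        /* the string length N at this stage */

        static int cmp_str(const void *a, const void *b)
        {
          return memcmp(a, b, cur_len);
        }

        /* The first y strings in list are already sorted; sort only the m
           newly appended strings, then merge the two sorted runs via tmp
           back into list. All strings have length cur_len. */
        void sort_and_merge(unsigned char *list, size_t y, size_t m,
                            unsigned char *tmp)
        {
          size_t N = cur_len, i = 0, j = 0, k = 0;
          unsigned char *a = list, *b = list + y * N;

          qsort(b, m, N, cmp_str);    /* sort only the new tail */
          while (i < y && j < m)      /* standard two-way merge */
          {
            unsigned char *src = (memcmp(a + i * N, b + j * N, N) <= 0)
                                   ? a + (i++) * N : b + (j++) * N;
            memcpy(tmp + (k++) * N, src, N);
          }
          memcpy(tmp + k * N, a + i * N, (y - i) * N);
          k += y - i;
          memcpy(tmp + k * N, b + j * N, (m - j) * N);
          k += m - j;
          memcpy(list, tmp, k * N);   /* one big sorted list */
        }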

Caching

The weeding out of doubles gets rid of duplicate passwords in the current part of the image. It will not, however, catch passwords that occur in several parts of the image, since these parts are likely given to different clients.

It is not clear whether it is worthwhile to cache some information, since only a few parts of the entire image will be distributed to any one client. To catch all doubles we would need to share a cache across all clients. It is also unclear which strings to cache and which not, and it is certainly not possible to simply cache all of them due to memory and processing time constraints.

For simplicity's sake, caching will not be done for now.

Implementation specific details

The ssPWD struct

The code that extracts the passwords, sorts them, and then calls dofunction() for each of the non-double ones patches the proper information into the ssPWD struct before calling dofunction().

Fortunately, the length does not change (it is always N); only the password and the overflow need to be changed. For simplicity's sake, overflow can be set to N, indicating that the entire password changed. While this is not quite true when walking a sorted list of passwords, the overflow is just a minor optimization for dofunction(), which is not strictly necessary and not used that often.

From the dofunction's point of view, the extraction is invisible and transparent.

Memory allocation

We know the length of each string at each stage (N), and we know how many different strings we can have at most: namely as many as there are positions (bytes) in the image file. Thus we can allocate a block of fixed size for all possible extracted passwords. While this wastes a bit of memory, it simplifies the code a lot, and also saves a lot of time otherwise spent reallocating the memory block. But see below in Notes.

In practice we allocate two blocks of memory:

read-buffer string list

This list stores the passwords extracted from the read buffer.

string list

This list stores all strings extracted so far. The strings from the current read buffer are appended to this list, which is sorted after it reaches its limit. After sorting, we weed out doubles and check the fill rate of this list against our limit. If the list is (nearly) full, we check each string via dofunction() (see also Notes on mutating case and forward/reverse); otherwise we collect more strings.
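As a sketch, the bookkeeping for these two blocks might look like this (all names are illustrative):

        #include <stddef.h>

        struct extract_buffers
        {
          unsigned char *read_list;    /* strings from the current read buffer */
          size_t         read_cnt;
          unsigned char *string_list;  /* all strings gathered so far */
          size_t         string_cnt;
          size_t         limit;        /* X: flush through dofunction()
                                          when (nearly) full */
        };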

Notes

padding

For a great (?) speedup, the extracted strings should probably be padded to a multiple of 4 bytes, to avoid alignment problems on more modern CPUs. This would waste a bit of memory, but not much if the strings are longer (than 12 characters).
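A sketch of the padding calculation:

        #include <stddef.h>

        /* Round the per-string slot size up to the next multiple of 4. */
        size_t padded_len(size_t N)
        {
          return (N + 3) & ~(size_t)3;
        }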

index-sorting

Instead of shuffling the extracted strings around, we could use a 32-bit index for each one, and upon sorting just shuffle these indices around. It is not clear whether this is actually faster, although it should be.
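A sketch of index-sorting, here with the C library's qsort() rather than the worker's own mergesort (all names are illustrative):

        #include <stdlib.h>
        #include <string.h>

        static const unsigned char *str_base;  /* start of the string block */
        static size_t str_len;                 /* the string length N */

        static int cmp_idx(const void *a, const void *b)
        {
          size_t ia = *(const unsigned int *)a;
          size_t ib = *(const unsigned int *)b;

          return memcmp(str_base + ia * str_len,
                        str_base + ib * str_len, str_len);
        }

        /* Sort 32-bit indices instead of moving the strings themselves. */
        void sort_indices(unsigned int *idx, size_t cnt,
                          const unsigned char *strings, size_t N)
        {
          str_base = strings;
          str_len  = N;
          qsort(idx, cnt, sizeof(unsigned int), cmp_idx);
        }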

memory waste/footprint

The worst memory usage occurs at the max stage. It is not clear whether it makes sense to reduce the memory footprint for stages min .. max-1, because at max we will need the memory anyway. Maybe allocating the memory for the max stage up front saves us memory re-allocations for each stage. This would, however, not speed up the code greatly, since a couple of malloc()s (from min to max) are no concern.

Sorting

Sorting is done by a slightly-optimized mergesort, which was improved by having a special case for last - first == 1. We also use hand-rolled copy and compare routines, which are faster than memcpy/memcmp for small ranges.
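As an illustration (not the actual routines), a hand-rolled compare for short fixed-length strings can look like this; for small N the plain byte loop avoids the overhead of a memcmp() call:

        #include <stddef.h>

        static int cmp_small(const unsigned char *a, const unsigned char *b,
                             size_t N)
        {
          size_t i;

          for (i = 0; i < N; i++)
          {
            if (a[i] != b[i])
              return a[i] < b[i] ? -1 : 1;
          }
          return 0;
        }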

Speed

On a PIII 500 MHz, the empty check makes about 135000 pwds/s, about 30% more than previously. This depends greatly on the number of strings extracted and the amount of doubles among them. The sorting could still be a bit faster, so as not to spend more time extracting/sorting than checking, especially if the check is (very) fast. For very slow checks, the sorting time does not matter.

Mutations/Rules

The strings are extracted and tried as-they-are.

In addition to this, the strings can be tried in different stages (forward, reversed) and each of these stages can be mutated (lower case, UPPER case, MiXeD case etc) just like with dictionaries.

tests

Some initial tests and benchmarks with artificially created images of various sizes up to one megabyte have been run. The formal tests with known results all pass, but we need more tests with more specialized extraction sets.

Dictionary Extraction

The workerframe contains a sample worker called extract. With this worker it is possible and very easy to extract a dictionary from a file, image or device.

You just need to write a config file like this one:

        # Config file for extract
        charset_id=300
        # or use image_file=/dev/sda1
        image_file=FILE.IMG
        start=6
        end=20
        target=30
        extractset_id=15
        extract_skip_invalids=0
        extract_even_odd=1
        extract_to=extracted.txt
        extract_as_hex=0
        # disable any case mutation
        dictionary_mutations=0
        # only forward
        dictionary_stages=0

image_file is the file to extract strings from. extract_to is the file to write them to. If extract_as_hex is set to 0, plain ASCII will be written; otherwise the passwords will be written in hex. target is irrelevant here, so just set it to 30. extractset_id is the ID of the charset specifying which characters are valid; set 15 means anything except 0x00 and 0x0a.

It does not make much sense to extract strings shorter than 6 characters.

The charsets are defined in the file charsets.def, found in the worker dir.

When the option extract_skip_invalids is set to 1, a lot more strings will be extracted, but these are not necessarily useful. A much more useful option is extract_even_odd, and it is recommended to set it to 1. This will additionally extract strings from every other byte, at even and odd positions.

After compiling extract.c, put extract into a directory together with the text from above saved as extract.cfg, as well as the charsets.def file. Then call:

        ./extract extract.cfg

After it finishes, you will have a list of all extracted strings, partially sorted and most of the doubles removed. Not all doubles are removed, because it would take too much time and memory (or disk space) to create a list with only unique strings. It might be faster just to use the produced list than to really weed out all doubles.

If you must remove all doubles at whatever the cost, you can use something like this:

        sort -u extracted.txt >sorted.txt

Enjoy!

Extracting with Case Mutations

If you set the two config entries as follows:

        dictionary_mutations=2047
        dictionary_stages=3

each extracted string will, in addition to being tried as-it-is, be tried in two different stages (forward and reversed), and each of these in ten different case spellings. This means the number of tried strings goes from 1 to 21, roughly twenty times as many.

Please see the doc about dictionary mutations for the exact list of mutations.

PLAN

The current working plan is as follows:

Workerframe

Index-sorting could be implemented, and we need to check whether weeding out doubles inline is faster or not. Also, a merge_sort that weeds out doubles as it goes along could bring speed. Or not, so check it.

In addition, ultra-slow check algorithms could benefit from a larger list, because potentially more doubles could be identified and removed prior to feeding the strings to the check.

Server

The server needs to adjust the speed of a client if it is too slow, but not if it is too fast, because the client might not have had enough passwords to extract from the image file. Or maybe let the client re-transmit how many passwords it extracted and checked (and check that this is not more than the image part can contain).

The image needs to be split up into parts having a max size (1 Megabyte?).

Wrapup

Check everything working together. Celebrate. Party. Rest.

AUTHOR

(c) Bundesamt fuer Sicherheit in der Informationstechnik 1998-2007

The DiCoP Workerframe falls under the BSD license.

See the file LICENSE or http://www.bsi.de/ for more information.

Contact

Address:

        BSI
        Referat 123
        Godesberger Allee 185-189
        53175 Bonn
        Germany

email: dicop @ bsi.bund.de (for public key see dicop.asc or web site)