ProDOS Attribute Preservation
This document describes how NuLib2 preserves file types and identifies resource forks and disk images when such things aren't handled by the filesystem.
The overriding goal is to provide a way to preserve filetypes and auxtypes when extracting files to "typeless" filesystems like those supported by UNIX or Windows. A secondary goal is to make the preservation attractive. As it turns out, these goals tend to conflict.
First, a simple example of a ProDOS text file named "fubar". Here's a trivial way of preserving the file type when extracting the file from an archive:
    Archive     : FUBAR TXT $0000
    Extract to  : FUBAR.TXT
When adding files to the archive, we'd just do the opposite:
    Original    : FUBAR.TXT
    Rearchive to: FUBAR TXT $0000
This works out pretty well under Windows, since "fubar.txt" is recognized with the correct file type. (It might get confused by the carriage returns, but that's a different problem.) If we happened to find a file called "fubar.txt" that didn't come from an archive, we still do the right thing, and store it as a file with type "TXT". All well and good.
Now suppose we have an auxtype that we don't want to lose. We have to make things a little more ugly.
    Archive     : FUBAR TXT $0100
    Extract to  : FUBAR.TXT#0100
This isn't going to open with a double-click under Win95, but at least we're not losing the type.
Now imagine we have something that doesn't use a standard type, like:
    Archive     : FUBAR LBR $8002
    Extract to  : FUBAR.SHK
    Rearchive to: FUBAR LBR $8002
We happen to know that $E0 (LBR) with auxtype of $8002 is a ShrinkIt archive. So, when we extract it, instead of making it FUBAR.LBR#8002, we change it to FUBAR.SHK. When we archive such a file, we apply the same process in reverse. We don't *have* to do this, but it certainly makes the results more attractive, and would allow a Windows-based ShrinkIt application to identify the file.
Now things start to get a little ugly. Suppose, like most ShrinkIt archives, it already ends with ".SHK"? Now we have:
    Archive     : FUBAR.SHK LBR $8002
    Extract to  : FUBAR.SHK.SHK
    Rearchive to: FUBAR.SHK LBR $8002
This is annoying, but it won't stop anything from working (unless the file extension is too long!). The alternative would be to realize that there's already a ".SHK" extension on the file, and not add another one, but then when we went to rearchive it we'd end up with something different:
    Archive     : FUBAR.SHK LBR $8002
    Extract to  : FUBAR.SHK
    Rearchive to: FUBAR LBR $8002
We've lost the file extension. For a ShrinkIt archive this wouldn't be so bad, but for a library or executable launched with a hardcoded path ("foo.s16") it could be fatal.
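The trade-off between the two choices can be seen in a small sketch. The mapping table and both function names are hypothetical and exist only for illustration; this is not how NuLib2 is implemented.

```python
# Hypothetical table: (file type, auxtype) -> host extension.
SPECIAL_EXT = {(0xE0, 0x8002): ".SHK"}   # LBR/$8002 is a ShrinkIt archive

def extract_name(name, file_type, aux_type, skip_duplicate):
    """Pick the host filename for an archive entry with a known special type."""
    ext = SPECIAL_EXT[(file_type, aux_type)]
    if skip_duplicate and name.upper().endswith(ext):
        return name                      # avoid producing FUBAR.SHK.SHK
    return name + ext

def rearchive_name(host_name):
    """Reverse the mapping: strip a known extension and restore the type."""
    for types, ext in SPECIAL_EXT.items():
        if host_name.upper().endswith(ext):
            return host_name[:-len(ext)], types
    return host_name, (0x00, 0x0000)     # unknown extension: NON/$0000
```

Always appending the extension round-trips "FUBAR.SHK" correctly at the cost of the ugly "FUBAR.SHK.SHK"; skipping the duplicate gives a nicer host name but comes back from rearchive_name() as "FUBAR".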
In some cases we just want to be "nice" and put file types on things
that weren't extracted from a ShrinkIt archive. For example, suppose
we're archiving a bunch of source code ("foo.c" and "foo.h"). We can
give them specific file types, e.g. the APW "SRC" type $b0/$000a. We
can't convert back from those types though, since *.c and *.h are
both $b0/$000a. With .txt files we could strip off ".txt" and give them
a unique type, but with source files we have to leave ".c" and ".h" on
them.
The situation gets more confusing when we re-extract the files from the new archive. If their types are NON/$0000, then they will get extracted as "foo.c" and "foo.h". If we were nice and gave them file types, then when we extracted them from the new archive they'd come out with preserved file types, named "foo.c.SRC#000a" and "foo.h.SRC#000a". We may actually make things more ugly by trying to be nice!
There are also cases where we may want to be "mean" and lose information, such as when extracting a BIN file called "foo.gif" or "foo.jpg". In most cases, these are GIF or JPEG images that should not have type information appended. Storing the file as "foo.gif.BIN" is counterproductive if we want to use the file, but it's the right thing to do if we want to re-archive the files in the same way that we extracted them.
One other bit of difficulty arises if the archiver application gets
updated. Maybe a file type was misnamed, so what used to be type "AST"
becomes "AJT". Now, when we try to add "FUBAR.AST#0100", we don't recognize
the file type. To avoid problems recognizing file types written by older
versions of NuLib2, we always want to use the numeric file type values. However,
this prevents us from ever being able to double-click on an extracted file in
Windows, unless we set up mappings for the numeric types (e.g. associate
"$04" with the same thing ".TXT" uses).
Bill North gave me some interesting ideas about how to preserve the file type and still keep extension-oriented operating systems like Windows happy. The format proposed below is based largely on his ideas.
There are three levels of file type preservation:
Extracting "fubar", type=TXT, auxtype=$0000
    none:     fubar
    basic:    fubar#040000
    extended: fubar#040000.txt
Extracting "fubar.txt", type=TXT, auxtype=$0000
    none:     fubar.txt
    basic:    fubar.txt#040000
    extended: fubar.txt#040000.txt
Extracting "fubar.doc", type=TXT, auxtype=$0000
    none:     fubar.doc
    basic:    fubar.doc#040000
    extended: fubar.doc#040000.txt
Extracting "fubar.doc", type=BIN, auxtype=$0000
    none:     fubar.doc
    basic:    fubar.doc#060000
    extended: fubar.doc#060000.doc
Extracting "fubar", type=S16, auxtype=$0100
    none:     fubar
    basic:    fubar#b30100
    extended: fubar#b30100.s16
Extracting "fubar.gif", type=BIN, auxtype=$2000
    none:     fubar.gif
    basic:    fubar.gif#062000
    extended: fubar.gif#062000.gif
Extracting "fubar.c", type=SRC, auxtype=$000a
    none:     fubar.c
    basic:    fubar.c#b0000a
    extended: fubar.c#b0000a.c
Extracting "fubar", type=LBR, auxtype=$8002
    none:     fubar
    basic:    fubar#e08002
    extended: fubar#e08002.lbr
Extracting "fubar.shk", type=LBR, auxtype=$8002
    none:     fubar.shk
    basic:    fubar.shk#e08002
    extended: fubar.shk#e08002.shk
Adding file "fubar"
    none:     fubar/NON/$0000
    basic:    fubar/NON/$0000
    extended: (same as basic)
Adding file "fubar.txt"
    none:     fubar.txt/NON/$0000
    basic:    fubar.txt/NON/$0000
    extended: fubar.txt/TXT/$0000
Adding file "fubar#B30100"
    none:     fubar#B30100/NON/$0000
    basic:    fubar/S16/$0100
    extended: (same as basic)
Adding file "fubar.c"
    none:     fubar.c/NON/$0000
    basic:    fubar.c/NON/$0000
    extended: fubar.c/SRC/$000a
Adding file "fubar.gif"
    none:     fubar.gif/NON/$0000
    basic:    fubar.gif/NON/$0000
    extended: fubar.gif/PNT/$8006
Adding file "fubar.gif#060000.txt"
    none:     fubar.gif#060000/NON/$0000
    basic:    fubar.gif/BIN/$0000
    extended: (same as basic)
Adding file "fubar.shk#045678.s16-wahoo"
    none:     fubar.shk/TXT/$5678
    basic:    fubar.shk/TXT/$5678
    extended: (same as basic)
Files extracted in either "basic" or "extended" mode can be re-added in "basic" mode. Files extracted in "none" mode shouldn't be re-added if you care about file types. Files that didn't originate from a NuFX archive, such as text files or source code on disk, can be added in "extended" mode if you'd like to have NuLib2 guess at their file types.
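As a rough sketch of the naming scheme, the "#TTAAAA" tag can be built and parsed as follows. The function names are hypothetical, the caller is assumed to choose the "extended" extension itself, and the fork and disk flags discussed later are not handled here.

```python
import re

# "#" + 2 hex digits of file type + 4 hex digits of auxtype, optionally
# followed by an "extended" mode extension that is ignored when parsing.
_TAG = re.compile(r"#([0-9a-fA-F]{2})([0-9a-fA-F]{4})(?:\.[^#/]*)?$")

def preserved_name(name, file_type, aux_type, ext=None):
    """Build the "basic" form, or the "extended" form when ext is given."""
    tagged = "%s#%02x%04x" % (name, file_type, aux_type)
    return tagged + "." + ext if ext else tagged

def parse_preserved(name):
    """Split a preserved name into (original name, file type, auxtype).

    A name without a tag is returned unchanged as NON/$0000.
    """
    m = _TAG.search(name)
    if m is None:
        return name, 0x00, 0x0000
    return name[:m.start()], int(m.group(1), 16), int(m.group(2), 16)
```

For example, parse_preserved("fubar.gif#060000.txt") yields ("fubar.gif", 0x06, 0x0000), matching the "basic" result shown above.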
Because GS/OS supports the HFS filesystem, we may have items in an archive that have full Macintosh HFS types rather than ProDOS types. If the file type is larger than 0xff, or the auxtype is larger than 0xffff, then the type will be a 16-digit hex value (#1234567812345678) instead of the usual 6-digit value. This may strain the limits on some filesystems, so preserving the types of Mac files may not be practical everywhere.
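The choice between the two tag widths might look like the sketch below. The function name is hypothetical, and splitting the 16 digits into 8 for the file type and 8 for the auxtype is an assumption; the text above only gives the total width.

```python
def type_tag(file_type, aux_type):
    """Return the '#...' suffix for a ProDOS or HFS type pair."""
    if file_type > 0xFF or aux_type > 0xFFFF:
        # Assumed layout: 8 hex digits of file type, then 8 of auxtype.
        return "#%08x%08x" % (file_type, aux_type)
    return "#%02x%04x" % (file_type, aux_type)
```

The long form costs 17 characters per name including the '#', which is where the filename length pressure comes from.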
Filesystems don't generally allow every possible byte value to be included in a filename. The typical UNIX filesystem is very forgiving, but it won't allow '/' or '\0'. Win32 won't accept \/:*?"<>| . If we are to preserve the filenames as well as the filetypes, we have to provide a way to include special characters. ProDOS only uses A-Z, 0-9, and '.', so preserving special characters may not be possible.
Some filesystems, such as MS-DOS and ISO-9660 (level 1), restrict the filename format as well as the character set, e.g. names limited to "8.3" form. It's not generally possible to preserve complex names on such systems, so we don't even try. Hybrid CD-ROMs can be created with Joliet, Rock Ridge, and HFS filenames, so the appropriate target system can see the correct name. (Of course, stuff written to a CD-ROM should be inside an SHK archive anyway, not expanded into separate files.)
In the "none" preservation mode, filenames will be converted into something acceptable for the target filesystem. No effort will be made to create something that can be converted back. When files are added in the "none" mode, no conversion will take place.
In "basic" and "extended" modes, characters invalid on the current filesystem will be written as "%xx", where "xx" is the two-digit hex value for the character. If the '%' character appears in a filename, it will be stored as "%%". The "%00" sequence, added in some unusual circumstances, should be removed entirely rather than converted to '\0'.
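The escaping rules above can be sketched as a pair of functions. The names are hypothetical, the default invalid set is the Win32 list quoted earlier, and leaving a malformed escape in place is an assumption rather than documented behavior.

```python
WIN32_INVALID = set('\\/:*?"<>|')
_HEX = set("0123456789abcdefABCDEF")

def escape_name(name, invalid=WIN32_INVALID):
    """Write invalid characters as %xx and a literal '%' as '%%'."""
    out = []
    for ch in name:
        if ch == "%":
            out.append("%%")
        elif ch in invalid:
            out.append("%%%02x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

def unescape_name(name):
    """Reverse escape_name(); a '%00' sequence is dropped entirely."""
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        digits = name[i + 1:i + 3]
        if ch != "%":
            out.append(ch)
            i += 1
        elif name[i + 1:i + 2] == "%":
            out.append("%")
            i += 2
        elif len(digits) == 2 and all(d in _HEX for d in digits):
            value = int(digits, 16)
            if value != 0:               # "%00" is removed, not turned into NUL
                out.append(chr(value))
            i += 3
        else:
            out.append(ch)               # malformed escape: keep '%' (assumption)
            i += 1
    return "".join(out)
```

With these rules, "foo/bar" becomes "foo%2fbar" on extraction and is restored when the file is added again.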
Character preservation shouldn't often be necessary, unless the files were archived from an HFS or UNIX volume, and the archive creator used characters like "/" or "*". Win32, HFS, and UNIX can all handle the short names and restricted set of characters that ProDOS filesystems support.
Another situation where filenames can be twisted is when they are too
long to fit on a filesystem. The character escaping and addition of type
information can make a filename much longer than it was originally, so
a name that was kinda long before will be really long when it's extracted.
In the "none" mode, filenames will be truncated silently. In the "basic" and "extended" modes, an error will be returned, and you will be given the opportunity to skip or rename the file.
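A minimal sketch of that policy follows. The function name, the exception class, and the max_len parameter are all hypothetical; the real limit depends on the target filesystem.

```python
class NameTooLongError(Exception):
    """Raised so the caller can offer to skip or rename the file."""

def fit_name(name, max_len, mode):
    """Apply the length policy for the given preservation mode."""
    if len(name) <= max_len:
        return name
    if mode == "none":
        return name[:max_len]            # truncate silently
    raise NameTooLongError(name)         # "basic" and "extended"
```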
Another problem area has to do with the path separators. Consider a file
named "foo/bar" in a folder called "subdir" on an HFS
volume. It would be archived as "subdir:foo/bar". When
extracted to a UNIX volume, you would get a file called "foo%2fbar" in
"subdir". When added back to an archive, however, if '/' is used
as the path separator, you would get "subdir/foo/bar", which is not
what was intended. Similar examples can be created for other pathname
separators.
In general, restoring a filename to its original status requires encoding not only the special characters but also the path separators. Ideally the gunk added to the filename would include some indication of the separator used, either an enumerated value or a two-digit hex ASCII value. In practice, ':' is illegal on all Apple II filesystems (except DOS 3.3) as well as Win32, so using it as the default path separator should work well. Only files created on a UNIX system will have problems, and these can be screened (replacing ':' with, say, 'X').
Since NuLib2 isn't intended to be a general-purpose file archiver, there's not much need to support all possible UNIX filenames. There's little advantage to adding an additional character to every filename for this rare case.
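The separator problem described above can be reproduced in a few lines. Both function names are hypothetical, and the sketch escapes only the local separator and '%' to keep it short.

```python
import re

def to_local_path(stored, stored_sep=":", local_sep="/"):
    """Convert an archive path to a local one, escaping each component."""
    escaped_sep = "%%%02x" % ord(local_sep)
    parts = stored.split(stored_sep)
    return local_sep.join(
        p.replace("%", "%%").replace(local_sep, escaped_sep) for p in parts
    )

def _unescape(part):
    # "%%" becomes "%", "%xx" becomes the character with that hex value.
    return re.sub(
        r"%(%|[0-9a-fA-F]{2})",
        lambda m: "%" if m.group(1) == "%" else chr(int(m.group(1), 16)),
        part,
    )

def to_stored_path(local, local_sep="/", stored_sep=":"):
    """Convert a local path back to an archive path."""
    return stored_sep.join(_unescape(p) for p in local.split(local_sep))
```

Storing with ':' gives back "subdir:foo/bar", two components as intended. Storing with '/' gives "subdir/foo/bar", which now reads as three components.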
A forked file "FINDER.SYS16" with filetype S16/$0100 would be extracted into "FINDER.SYS16#b30100" and "FINDER.SYS16#b30100r". The "r" is added in both "extended" and "basic" modes, but as with everything else is unused in "none" mode. In "none" mode this used to result in "file already exists, overwrite?" messages when the resource fork was extracted, because both the data and resource forks would be written to "FINDER.SYS16". The current version of NuLib2 appends the rather obvious "_rsrc_" to resource forks in "none" mode.
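The fork naming can be sketched as below. The function name is hypothetical, and only "none" and "basic" are covered because the text does not say where the 'r' falls relative to an "extended" mode extension.

```python
def fork_filename(name, file_type, aux_type, resource_fork, mode):
    """Return the host filename for one fork of an archive entry."""
    if mode == "none":
        return name + "_rsrc_" if resource_fork else name
    if mode != "basic":
        raise ValueError("only 'none' and 'basic' are sketched here")
    tagged = "%s#%02x%04x" % (name, file_type, aux_type)
    return tagged + "r" if resource_fork else tagged
```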
The earlier discussion on file type preservation has meaning for disk
archive preservation as well. In general, people don't combine file and
disk archives, or have more than one disk image in an archive, but there's
nothing in the NuFX format that prevents it. It is useful to transparently
handle disk images as well.
The trouble is with identifying disk image files as such. Formats with unique extensions, such as 2IMG (.2MG) are fairly safe, but a raw disk image entitled "system.raw" could be confused with other forms of data. This can make it tricky to do the right thing.
The presence of an explicit "this file is a disk" option, which treats all files as disk images no matter what they're called, guarantees that we can always do *something* useful with a disk image file. Even when this option isn't being used, we can identify .2MG files by the extension and (to be rigorous) the file contents. Extracting and re-adding a .2MG file multiple times shouldn't result in any degradation, unless we try to convert the sector interleave from DOS to ProDOS, but even that is a reversible transformation.
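A check along those lines might look like this sketch. The function name is hypothetical, and the "2IMG" signature in the first four bytes comes from the 2IMG file format itself rather than from this document.

```python
def looks_like_2mg(filename, header):
    """Check the extension and the leading bytes of a candidate disk image.

    header holds the first few bytes of the file, already read by the caller.
    """
    has_ext = filename.lower().endswith(".2mg")
    has_magic = header[:4] == b"2IMG"
    return has_ext and has_magic
```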
The explicit flag for a disk image works similarly to the flag for a resource fork. After the type info, which for a disk is always $00 with the number of blocks in the auxtype, we add 'i'. A 5.25" disk image stored as "SYSTEM" would be extracted in "none" mode as "SYSTEM", and in "basic" or "extended" mode as "SYSTEM#000118i".
No flag is added for a data fork. If a flag were added, it probably wouldn't be 'd', since that could be confused with "disk" and also happens to be a valid hexadecimal digit.
Comments are another special case. Preserving archive comments requires extracting them into separate files. NuLib2 doesn't currently do this, but if it were to do so the file would look like "SYSTEM#0000c8n", where 0x00c8 is the pre-allocated size for the comment thread. I'm using 'n' as the comment designator (for "note") because 'c' is a valid hexadecimal digit.
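Putting the flags together, a "basic" mode name could be classified as in the sketch below. The function name and the kind labels are hypothetical, only the 6-digit tag is handled, and the 'n' flag is included even though NuLib2 doesn't currently extract comments.

```python
import re

# "#TTAAAA" followed by an optional one-letter flag at the end of the name.
_TAG = re.compile(r"#([0-9a-fA-F]{2})([0-9a-fA-F]{4})([rin]?)$")
_KINDS = {"": "data", "r": "resource", "i": "disk", "n": "comment"}

def classify(name):
    """Return (base name, file type, auxtype, kind) for a preserved name."""
    m = _TAG.search(name)
    if m is None:
        return name, None, None, "data"  # untagged: plain data fork
    return (
        name[:m.start()],
        int(m.group(1), 16),
        int(m.group(2), 16),
        _KINDS[m.group(3)],
    )
```

For "SYSTEM#000118i" the auxtype comes back as 0x0118, the 280 blocks of a 5.25" disk.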
This document is Copyright © 2000-2003 by Andy McFadden. All Rights Reserved.
The latest version can be found on the NuLib web site at http://www.nulib.com/.