NuLib2's ProDOS Attribute Preservation

NuLib2's ProDOS Attribute Preservation - By Andy McFadden - Last revised 2003/02/08

This document describes how NuLib2 preserves file types and identifies resource forks and disk images when such things aren't handled by the filesystem.

File Type Preservation

The overriding goal is to provide a way to preserve filetypes and auxtypes when extracting files to "typeless" filesystems like those supported by UNIX or Windows. A secondary goal is to make the preservation attractive. As it turns out, these goals tend to conflict.

First, a simple example of a ProDOS text file named "fubar". Here's a trivial way of preserving the file type when extracting the file from an archive:

Archive      :  FUBAR           TXT $0000
Extract to   :  FUBAR.TXT

When adding files to the archive, we'd just do the opposite:

Original     :  FUBAR.TXT
Rearchive to :  FUBAR           TXT $0000

This works out pretty well under Windows, since "fubar.txt" is recognized with the correct file type. (It might get confused by the carriage returns, but that's a different problem.) If we happened to find a file called "fubar.txt" that didn't come from an archive, we still do the right thing, and store it as a file with type "TXT". All well and good.

Now suppose we have an auxtype that we don't want to lose. We have to make things a little more ugly.

Archive      :  FUBAR           TXT $0100
Extract to   :  FUBAR.TXT#0100

This isn't going to open with a double-click under Win95, but at least we're not losing the type.

Now imagine we have something that doesn't use a standard type, like:

Archive      :  FUBAR           LBR $8002
Extract to   :  FUBAR.SHK
Rearchive to :  FUBAR           LBR $8002

We happen to know that $E0 (LBR) with auxtype of $8002 is a ShrinkIt archive. So, when we extract it, instead of making it FUBAR.LBR#8002, we change it to FUBAR.SHK. When we archive such a file, we apply the same process in reverse. We don't *have* to do this, but it certainly makes the results more attractive, and would allow a Windows-based ShrinkIt application to identify the file.

Now things start to get a little ugly. Suppose, like most ShrinkIt archives, it already ends with ".SHK"? Now we have:

Archive      :  FUBAR.SHK       LBR $8002
Extract to   :  FUBAR.SHK.SHK
Rearchive to :  FUBAR.SHK       LBR $8002

This is annoying, but it won't stop anything from working (unless the file extension is too long!). The alternative would be to realize that there's already a ".SHK" extension on the file, and not add another one, but then when we went to rearchive it we'd end up with something different:

Archive      :  FUBAR.SHK       LBR $8002
Extract to   :  FUBAR.SHK
Rearchive to :  FUBAR           LBR $8002

We've lost the file extension. For a ShrinkIt archive this wouldn't be so bad, but for a library or executable launched with a hardcoded path ("foo.s16") it could be fatal.

In some cases we just want to be "nice" and put file types on things that weren't extracted from a ShrinkIt archive. For example, suppose we're archiving a bunch of source code ("foo.c" and "foo.h"). We can give them specific file types, e.g. the APW "SRC" type $b0/$000a. We can't convert back from those types though, since *.c and *.h are both $b0/$000a. With .txt files we could strip off ".txt" and give them a unique type, but with source files we have to leave ".c" and ".h" on them.

The situation gets more confusing when we re-extract the files from the new archive. If their types are NON/$0000, then they will get extracted as "foo.c" and "foo.h". If we were nice and gave them file types, then when we extracted them from the new archive they'd come out with preserved file types, named "foo.c.SRC#000a" and "foo.h.SRC#000a". We may actually make things more ugly by trying to be nice!

There are also cases where we may want to be "mean" and lose information, such as when extracting a BIN file called "foo.gif" or "foo.jpg". In most cases, these are GIF or JPEG images that should not have type information appended. Storing the file as "foo.gif.BIN" is counterproductive if we want to use the file, but it's the right thing to do if we want to re-archive the files in the same way that we extracted them.

One other bit of difficulty arises if the archiver application gets updated. Maybe a file type was misnamed, so what used to be type "AST" becomes "AJT". Now, when we try to add "FUBAR.AST#0100", we don't recognize the file type. To avoid problems recognizing file types written by older versions of NuLib2, we always want to use the numeric file type values. However, this prevents us from ever being able to double-click on an extracted file in Windows, unless we set up mappings for the numeric types (e.g. associate "$04" with the same thing ".TXT" uses).

Bill North gave me some interesting ideas about how to preserve the file type and still keep extension-oriented operating systems like Windows happy. The format proposed below is based largely on his ideas.

There are three levels of file type preservation:

None (equivalent to the original NuLib):
  When extracting, no file type information is stored in the name extension.
  When adding, file type information in the extension is ignored (in fact, it's regarded as part of the filename).

Basic (preserves reliably):
  When extracting, all files have their type and auxtype appended at the end of the filename, in hexadecimal. "fubar.txt" becomes "fubar.txt#040000". Resource forks and disk images are annotated with single-letter codes.
  When adding in "basic" mode, all files are checked for file type information, and (if found) everything after the last '#' is removed. If a full type isn't found ("foo.c"), the file is added as NON/$0000. Care is taken to treat files like "blah#123" and "foo#040000xyz" as typeless, so we don't get confused by files that legitimately have a '#' in the filename.

Extended (preserves reliably, works better with Windows):
  When extracting, this works like "basic", but a redundant file extension is added to the filename. "fubar.txt" becomes "fubar.txt#040000.txt". Special care is taken to preserve existing extensions, so "foo.c" would become "foo.c#b0000a.c", not "foo.c#b0000a.src". If no extension is present on the original, and no ProDOS three-letter extension is known (e.g. $f7), then no redundant extension is added. Type TXT is special-cased, so text files are always ".TXT".
  Adding of preserved files works like "basic" mode, where everything after the last '#' is removed. The redundant file extension is simply ignored. If a file was not preserved, but it has a file extension, an attempt is made to determine the file type based solely on the extension (e.g. "fubar.jpeg" gets stored as BIN rather than NON).
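The extraction side of these three levels can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not NuLib2's actual code: the function names and the tiny TYPE_NAMES table are invented for the example, and HFS-sized types and the single-letter codes are left out.

```python
# Hypothetical, abbreviated table of ProDOS three-letter type names
# (assumption: a real table would cover many more types).
TYPE_NAMES = {0x04: "txt", 0x06: "bin", 0xB0: "src", 0xB3: "s16", 0xE0: "lbr"}


def redundant_extension(name, file_type):
    """Pick the extra extension that "extended" mode appends, or None."""
    if file_type == 0x04:
        return "txt"                      # TXT is special-cased
    base, dot, ext = name.rpartition(".")
    if dot and base and ext:
        return ext                        # keep the extension already present
    return TYPE_NAMES.get(file_type)      # None if no three-letter name is known


def preserved_name(name, file_type, aux_type, mode):
    """Build the extracted name for mode "none", "basic", or "extended"."""
    if mode == "none":
        return name
    result = f"{name}#{file_type:02x}{aux_type:04x}"
    if mode == "extended":
        ext = redundant_extension(name, file_type)
        if ext is not None:
            result += "." + ext
    return result
```

With this sketch, preserved_name("fubar.shk", 0xE0, 0x8002, "extended") yields "fubar.shk#e08002.shk", and a type with no known name and no existing extension (say $f7 on "fubar") gets no redundant extension at all.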

Examples

Extracting "fubar", type=TXT, auxtype=$0000
  none:     fubar
  basic:    fubar#040000
  extended: fubar#040000.txt

Extracting "fubar.txt", type=TXT, auxtype=$0000
  none:     fubar.txt
  basic:    fubar.txt#040000
  extended: fubar.txt#040000.txt
  
Extracting "fubar.doc", type=TXT, auxtype=$0000
  none:     fubar.doc
  basic:    fubar.doc#040000
  extended: fubar.doc#040000.txt

Extracting "fubar.doc", type=BIN, auxtype=$0000
  none:     fubar.doc
  basic:    fubar.doc#060000
  extended: fubar.doc#060000.doc

Extracting "fubar", type=S16, auxtype=$0100
  none:     fubar
  basic:    fubar#b30100
  extended: fubar#b30100.s16

Extracting "fubar.gif", type=BIN, auxtype=$2000
  none:     fubar.gif
  basic:    fubar.gif#062000
  extended: fubar.gif#062000.gif

Extracting "fubar.c", type=SRC, auxtype=$000a
  none:     fubar.c
  basic:    fubar.c#b0000a
  extended: fubar.c#b0000a.c

Extracting "fubar", type=LBR, auxtype=$8002
  none:     fubar
  basic:    fubar#e08002
  extended: fubar#e08002.lbr

Extracting "fubar.shk", type=LBR, auxtype=$8002
  none:     fubar.shk
  basic:    fubar.shk#e08002
  extended: fubar.shk#e08002.shk

Adding file "fubar"
  none:     fubar/NON/$0000
  basic:    fubar/NON/$0000
  extended: (same as basic)

Adding file "fubar.txt"
  none:     fubar.txt/NON/$0000
  basic:    fubar.txt/NON/$0000
  extended: fubar.txt/TXT/$0000

Adding file "fubar#B30100"
  none:     fubar#B30100/NON/$0000
  basic:    fubar/S16/$0100
  extended: (same as basic)

Adding file "fubar.c"
  none:     fubar.c/NON/$0000
  basic:    fubar.c/NON/$0000
  extended: fubar.c/SRC/$000a

Adding file "fubar.gif"
  none:     fubar.gif/NON/$0000
  basic:    fubar.gif/NON/$0000
  extended: fubar.gif/PNT/$8006

Adding file "fubar.gif#060000.txt"
  none:     fubar.gif#060000.txt/NON/$0000
  basic:    fubar.gif/BIN/$0000
  extended: (same as basic)

Adding file "fubar.shk#045678.s16-wahoo"
  none:     fubar.shk#045678.s16-wahoo/NON/$0000
  basic:    fubar.shk/TXT/$5678
  extended: (same as basic)

Files extracted in either "basic" or "extended" mode can be re-added in "basic" mode. Files extracted in "none" mode shouldn't be re-added if you care about file types. Files that didn't originate from a NuFX archive, such as text files or source code on disk, can be added in "extended" mode if you'd like to have NuLib2 guess at their file types.
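The "adding" half of the scheme can be sketched the same way. Again this is a hedged illustration rather than NuLib2's implementation: parse_for_add and the small EXTENSION_GUESSES table are hypothetical, 16-digit HFS types are not handled, and the optional 'r' and 'i' codes it accepts are the resource fork and disk image flags described later in this document.

```python
import re

# Six hex digits, an optional single-letter code ('r' or 'i'), then either
# the end of the name or a redundant ".ext" that is ignored.
_SUFFIX = re.compile(r"([0-9A-Fa-f]{6})([ri]?)(?:\..*)?")

# Hypothetical, abbreviated extension guesses for "extended" mode.
EXTENSION_GUESSES = {"txt": (0x04, 0x0000), "c": (0xB0, 0x000A), "h": (0xB0, 0x000A)}


def parse_for_add(name, mode):
    """Return (stored_name, file_type, aux_type, code) for a file being added."""
    if mode != "none":
        # Only the text after the LAST '#' is considered.
        base, hash_mark, tail = name.rpartition("#")
        match = _SUFFIX.fullmatch(tail) if hash_mark and base else None
        if match:
            digits, code = match.group(1), match.group(2)
            return base, int(digits[:2], 16), int(digits[2:], 16), code
    if mode == "extended":
        # Not a preserved name: guess from the extension, keeping the name.
        _, dot, ext = name.rpartition(".")
        if dot and ext.lower() in EXTENSION_GUESSES:
            file_type, aux_type = EXTENSION_GUESSES[ext.lower()]
            return name, file_type, aux_type, ""
    return name, 0x00, 0x0000, ""         # NON/$0000
```

Because the whole tail after the '#' has to match, names like "blah#123" and "foo#040000xyz" fall through and are stored unchanged as NON/$0000, while "fubar.shk#045678.s16-wahoo" is stored as "fubar.shk" with type $04 and auxtype $5678.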

Because GS/OS supports the HFS filesystem, we may have items in an archive that have full Macintosh HFS types rather than ProDOS types. If the file type is larger than 0xff, or the auxtype is larger than 0xffff, then the type will be a 16-digit hex value (#1234567812345678) instead of the usual 6-digit value. This may strain the limits on some filesystems, so preserving the types of Mac files may not be practical everywhere.
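A small sketch of how the suffix might widen for HFS-sized values follows. The function name is made up, and the split of the 16 digits into 8 digits of type followed by 8 digits of auxtype is an assumption; the text above only fixes the total width.

```python
def type_suffix(file_type, aux_type):
    """Return the '#...' suffix, widening to 16 hex digits for HFS-sized values."""
    if file_type > 0xFF or aux_type > 0xFFFF:
        # Assumption: 8 digits of type followed by 8 digits of auxtype.
        return f"#{file_type:08x}{aux_type:08x}"
    return f"#{file_type:02x}{aux_type:04x}"
```

An ordinary ProDOS text file still gets "#040000", while a type and auxtype of 0x12345678 each produce "#1234567812345678", ten characters longer than the ProDOS form.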

Special Characters and Long Names

Filesystems don't generally allow every possible byte value to be included in a filename. The typical UNIX filesystem is very forgiving, but it won't allow '/' or '\0'. Win32 won't accept any of \ / : * ? " < > |. If we are to preserve the filenames as well as the filetypes, we have to provide a way to include special characters. ProDOS only uses A-Z, 0-9, and '.', so preserving special characters may not be possible.

Some filesystems, such as MS-DOS and ISO-9660 (level 1), restrict the filename format as well as the character set, e.g. names limited to "8.3" form. It's not generally possible to preserve complex names on such systems, so we don't even try. Hybrid CD-ROMs can be created with Joliet, Rock Ridge, and HFS filenames, so the appropriate target system can see the correct name. (Of course, stuff written to a CD-ROM should be inside an SHK archive anyway, not expanded into separate files.)

In the "none" preservation mode, filenames will be converted into something acceptable for the target filesystem. No effort will be made to create something that can be converted back. When files are added in the "none" mode, no conversion will take place.

In "basic" and "extended" modes, characters invalid on the current filesystem will be written as "%xx", where "xx" is the two-digit hex value for the character. If the '%' character appears in a filename, it will be stored as "%%". The "%00" sequence, added in some unusual circumstances, should be removed entirely rather than converted to '\0'.
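The escaping rule is simple enough to sketch directly. The names below are invented for illustration, the invalid-character sets are the ones listed earlier in this section, and the sketch assumes every character fits in two hex digits (code points below 0x100).

```python
import re

WIN32_INVALID = set('\\/:*?"<>|')   # characters Win32 rejects, per the list above
UNIX_INVALID = set("/\0")


def escape_name(name, invalid):
    """Escape characters the target filesystem rejects, and '%' itself."""
    out = []
    for ch in name:
        if ch == "%":
            out.append("%%")
        elif ch in invalid:
            out.append(f"%{ord(ch):02x}")
        else:
            out.append(ch)
    return "".join(out)


def unescape_name(name):
    """Reverse escape_name(); "%00" is dropped rather than turned into NUL."""
    def replace(match):
        token = match.group(1)
        if token == "%":
            return "%"
        value = int(token, 16)
        return "" if value == 0 else chr(value)
    return re.sub(r"%(%|[0-9A-Fa-f]{2})", replace, name)
```

For example, escape_name("foo/bar", UNIX_INVALID) gives "foo%2fbar", and unescape_name() turns it back into "foo/bar". A '%' that isn't followed by another '%' or two hex digits is left alone.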

Character preservation shouldn't often be necessary, unless the files were archived from an HFS or UNIX volume, and the archive creator used characters like "/" or "*". Win32, HFS, and UNIX can all handle the short names and restricted set of characters that ProDOS filesystems support.

Another situation where filenames can be twisted is when they are too long to fit on a filesystem. The character escaping and addition of type information can make a filename much longer than it was originally, so a name that was kinda long before will be really long when it's extracted.

In the "none" mode, filenames will be truncated silently. In the "basic" and "extended" modes, an error will be returned, and you will be given the opportunity to skip or rename the file.

Another problem area has to do with the path separators. Consider a file named "foo/bar" in a folder called "subdir" on an HFS volume. It would be archived as "subdir:foo/bar". When extracted to a UNIX volume, you would get a file called "foo%2fbar" in "subdir". When added back to an archive, however, if '/' is used as the path separator, you would get "subdir/foo/bar", which is not what was intended. Similar examples can be created for other pathname separators.

In general, restoring a filename to its original status requires encoding not only the special characters but also the path separators. Ideally the gunk added to the filename would include some indication of the separator in use, either an enumerated value or a two-digit hex ASCII value. In practice, ':' is illegal on all Apple II filesystems (except DOS 3.3) as well as Win32, so using it as the default path separator should work well. Only files created on a UNIX system will have problems, and these can be screened (replacing ':' with, say, 'X').

Since NuLib2 isn't intended to be a general-purpose file archiver, there's not much need to support all possible UNIX filenames. There's little advantage to adding an additional character to every filename for this rare case.
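The "subdir:foo/bar" example can be made concrete with a short sketch. The helper names are hypothetical, only '/' and '%' are escaped (a UNIX target is assumed), and the screening of ':' in UNIX-created names is left out.

```python
import re


def _escape(component):
    # Minimal escaping for a UNIX target: '%' doubles, '/' becomes "%2f".
    return component.replace("%", "%%").replace("/", "%2f")


def _unescape(component):
    # "%%" becomes '%'; "%xx" becomes the character with that hex value.
    return re.sub(
        r"%(%|[0-9A-Fa-f]{2})",
        lambda m: "%" if m.group(1) == "%" else chr(int(m.group(1), 16)),
        component,
    )


def archive_to_unix(path, archive_sep=":"):
    """Turn an archive pathname into a UNIX-relative path."""
    return "/".join(_escape(part) for part in path.split(archive_sep))


def unix_to_archive(path, archive_sep=":"):
    """Turn a UNIX-relative path back into an archive pathname."""
    return archive_sep.join(_unescape(part) for part in path.split("/"))
```

With ':' as the archive separator the round trip is clean: "subdir:foo/bar" extracts to "subdir/foo%2fbar" and is re-added as "subdir:foo/bar". Passing archive_sep="/" to unix_to_archive() reproduces the problem described above, since the result "subdir/foo/bar" now looks like a file two directories deep.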

Resource Forks, Disk Images, and Comments

A forked file "FINDER.SYS16" with filetype S16/$0100 would be extracted into "FINDER.SYS16#b30100" and "FINDER.SYS16#b30100r". The "r" is added in both "extended" and "basic" modes but, as with everything else, it is unused in "none" mode. This used to result in "file already exists, overwrite?" messages when the resource fork was extracted, because the data and resource forks were both written to "FINDER.SYS16". The current version of NuLib2 appends the rather obvious "_rsrc_" to resource forks in "none" mode.

The earlier discussion on file type preservation has meaning for disk archive preservation as well. In general, people don't combine file and disk archives, or have more than one disk image in an archive, but there's nothing in the NuFX format that prevents it. It is useful to transparently handle disk images as well.

The trouble is with identifying disk image files as such. Formats with unique extensions, such as 2IMG (.2MG) are fairly safe, but a raw disk image entitled "system.raw" could be confused with other forms of data. This can make it tricky to do the right thing.

The presence of an explicit "this file is a disk" option, which treats all files as disk images no matter what they're called, guarantees that we can always do *something* useful with a disk image file. Even when this option isn't being used, we can identify .2MG files by the extension and (to be rigorous) the file contents. Extracting and re-adding a .2MG file multiple times shouldn't result in any degradation, unless we try to convert the sector interleave from DOS to ProDOS, but even that is a reversible transformation.

The explicit flag for a disk image works similarly to the flag for a resource fork. After the type info, which for a disk is always $00 with the number of blocks in the auxtype, we add 'i'. A 5.25" disk image stored as "SYSTEM" would be extracted in "none" mode as "SYSTEM", and in "basic" or "extended" mode as "SYSTEM#000118i".

No flag is added for a data fork. If a flag were added, it probably wouldn't be 'd', since that could be confused with "disk" and also happens to be a valid hexadecimal digit.
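The fork and disk image naming can be sketched as follows. The function names are invented, the redundant extension of "extended" mode is omitted for brevity, and the block count is assumed to fit in the four-digit auxtype.

```python
BLOCK_SIZE = 512  # bytes per ProDOS block


def fork_name(name, file_type, aux_type, fork, mode):
    """Name for one fork of a file; fork is "data" or "rsrc"."""
    if mode == "none":
        return name + "_rsrc_" if fork == "rsrc" else name
    code = "r" if fork == "rsrc" else ""   # no code for the data fork
    return f"{name}#{file_type:02x}{aux_type:04x}{code}"


def disk_image_name(name, image_bytes, mode):
    """Name for a disk image; type is $00 and the auxtype is the block count."""
    if mode == "none":
        return name
    blocks = image_bytes // BLOCK_SIZE
    return f"{name}#00{blocks:04x}i"
```

A 140K 5.25" image holds 280 blocks, or $0118, so disk_image_name("SYSTEM", 140 * 1024, "basic") gives "SYSTEM#000118i", matching the example above.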

Comments are another special case. Preserving archive comments requires extracting them into separate files. NuLib2 doesn't currently do this, but if it were to do so the file would look like "SYSTEM#0000c8n", where 0x00c8 is the pre-allocated size for the comment thread. I'm using 'n' as the comment designator (for "note") because 'c' is a valid hexadecimal digit.

The latest version can be found on the NuLib web site at http://www.nulib.com/.