Concatenate HTML Files

See the download page to obtain this program.

Description

This script combines a number of HTML files into one. The beginning of the first file (up to and including <body ...>) is used for all the files since only their bodies are concatenated. An optional divider followed by the label of a file is used between files.
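The behaviour described above can be illustrated with a minimal Python sketch. This is not the Perl source of htmlcat; the function names, the `<hr>` divider and the way the label is rendered are assumptions made only for illustration. It keeps the first file's header up to and including the `<body ...>` line, appends only the bodies of the files, and writes a divider plus label before each file after the first.

```python
import html
import re

# Body tags are assumed to sit on lines of their own.
BODY_OPEN = re.compile(r"^\s*<body\b[^>]*>\s*$", re.IGNORECASE)
BODY_CLOSE = re.compile(r"^\s*</body\s*>\s*$", re.IGNORECASE)


def split_html(text):
    """Return (head_lines, body_lines) for one HTML document.

    head_lines runs up to and including the <body ...> line; body_lines is
    everything between that line and the </body> line. Anything after
    </body> is dropped.
    """
    head, body = [], []
    state = "head"
    for line in text.splitlines():
        if state == "head":
            head.append(line)
            if BODY_OPEN.match(line):
                state = "body"
        elif state == "body":
            if BODY_CLOSE.match(line):
                state = "done"
            else:
                body.append(line)
    return head, body


def concatenate(docs, divider=None):
    """Concatenate (label, html_text) pairs into one HTML document.

    The divider markup and the <p> wrapper for the label are hypothetical;
    the real script defines its own divider.
    """
    out = []
    for position, (label, text) in enumerate(docs):
        head, body = split_html(text)
        if position == 0:
            out.extend(head)  # only the first file's header is kept
        elif divider is not None:
            out.append(divider)
            out.append("<p>" + html.escape(label) + "</p>")
        out.extend(body)
    out.extend(["</body>", "</html>"])
    return "\n".join(out) + "\n"
```

For example, `concatenate([("a.html", text_a), ("b.html", text_b)], divider="<hr>")` would yield one document whose title and `<body ...>` attributes come from a.html only.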

Note the following limitations. Some of these are fixable, but the author has not worked on the code for a long time.

Run the script from the highest level directory in which HTML files are to be concatenated. If a parent directory of this is used, the cross-references may be wrong.
The code has been developed and tested in Unix-like environments (various flavours of Unix, and Cygwin on Windows). Use on MS Windows may cause the following problems. Drive prefixes should be avoided in file names, as they will be embedded in anchors and so will not work correctly. Backslashes in file names will also cause problems, as forward slashes are used in the generated references to files in child directories.
The code relies on the calling shell to expand wildcard filenames like '*.html'. This is automatic in a Unix shell, but does not happen at a DOS prompt. For the latter it is therefore necessary to list files explicitly.
The original files must conform to HTML conventions. If necessary use htmlfix first to correct major problems.
<body ...> and </body> must be on a line of their own. Any other information on these lines will be lost.
In anchors, href="..." and name="..." must not be split across lines.
Any material after "</body>" (such as HTML comments) will be lost.
The script might get confused by a directory index file that is a symbolic link, or by references to files in remote directories (though it does its best).
If the concatenated HTML file is moved, remember to move any other local files (e.g. images) to the same relative location (e.g. the same directory).
For use with a frame-based collection of files, exclude the frameset definition file from the list of inputs and probably start with a contents file.
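Two of the limitations above (body tags on lines of their own, and href/name attributes not split across lines) can be checked before running the script. The following Python sketch is one possible pre-flight check and is not part of htmlcat; the function name and messages are hypothetical, and only double-quoted attribute values are considered.

```python
import re

# Any opening or closing body tag.
BODY_TAG = re.compile(r"</?body\b[^>]*>", re.IGNORECASE)
# An href= or name= attribute whose value is missing or unfinished at end of line.
SPLIT_ATTR = re.compile(r'\b(?:href|name)\s*=\s*(?:"[^"]*)?$', re.IGNORECASE)


def check_html(text):
    """Return a list of (line_number, message) for lines likely to cause trouble."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = BODY_TAG.search(line)
        if match and line.strip() != match.group(0):
            problems.append((number, "body tag shares its line with other text"))
        if SPLIT_ATTR.search(line):
            problems.append((number, "href or name attribute is split across lines"))
    return problems
```

An empty result does not guarantee that a file is valid HTML; it only means that these two particular patterns were not found.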

Options

The command line options are:

-d: print divider between concatenated files
-h: print usage as help
-o file: name output file (this will be ignored if present in the input list, e.g. due to giving *.html)
-s: sort input files into case-insensitive alphabetical order (putting the index file first if necessary, and removing the file it points to from the inputs if it is a symbolic link)
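The ordering that -s describes can be sketched in Python as follows. This is an illustration rather than the script's own code: the index filename `index.html` is an assumption (the real name is set in the script's customise subroutine), and the sketch only treats an index file in the current directory as special.

```python
import os

INDEX_NAME = "index.html"  # assumed directory index filename


def sort_inputs(paths, index_name=INDEX_NAME):
    """Sort paths case-insensitively, with the index file moved to the front.

    If the index file is a symbolic link, the file it points to is removed
    from the list so that its content is not included twice.
    """
    ordered = sorted(paths, key=str.lower)
    index = next((p for p in ordered if p.lower() == index_name.lower()), None)
    if index is None:
        return ordered
    rest = [p for p in ordered if p != index]
    if os.path.islink(index):
        target = os.path.realpath(index)
        rest = [p for p in rest if os.path.realpath(p) != target]
    return [index] + rest
```

With inputs `Zeta.html`, `index.html`, `alpha.html` and `Beta.html`, the sketch places `index.html` first and the remaining files in the order alpha, Beta, Zeta.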

Usage

Run on one or more HTML files. Warning messages are sent to standard error. Examples of usage are:

htmlcat -o some.html def.html res.html: concatenate def.html and res.html to some.html
htmlcat -d -o all.html *.html: concatenate all HTML files to all.html with dividers between them
htmlcat -s -o out.html *.html: sort then concatenate all HTML files to out.html
htmlcat *.html > /tmp/all.html: concatenate all HTML files to standard output (here /tmp/all.html); for this method, do not create a concatenated file in the same directory or the script will run indefinitely on its own output!
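Where the calling shell does not expand wildcards (for example at a DOS prompt), the explicit file list can be built by a small helper. The Python sketch below is one hypothetical way to do this and is not supplied with htmlcat. It expands the patterns, leaves out the intended output file so that the script is never fed its own output, and converts backslashes to forward slashes.

```python
import glob
import os


def expand_inputs(patterns, output=None):
    """Expand wildcard patterns into an explicit, de-duplicated file list.

    A pattern that matches nothing is passed through unchanged, as a Unix
    shell would do. The output file, if given, is excluded from the result.
    """
    skip = os.path.abspath(output) if output else None
    files = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)) or [pattern]:
            if skip is not None and os.path.abspath(path) == skip:
                continue  # never feed the output file back in as an input
            path = path.replace(os.sep, "/")  # forward slashes only
            if path not in files:
                files.append(path)
    return files
```

The resulting list can then be appended to the command line after the options, for instance after `-o all.html`.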

The only things likely to need changing for installation are the directory index filename and the nature of a file divider (see the customise subroutine in the code). Change the first line of the script according to where Perl is located. Although tested with Perl 5, the script may work with only minor changes for Perl 4.

Licence

htmlcat is free software, distributed under the GNU General Public License Version 2. You may re-distribute this software provided you preserve this README file. The contents of this package may be used freely for non-commercial purposes provided this README file and copyright notices are retained. Copyright remains with the author. No warranties are given as to the accuracy or suitability of this package.

History

First public version Ken Turner, 21st November 1998

Last Update: 13th May 2010
URL: https://www.cs.stir.ac.uk/~kjt/software/web/htmlcat.html