Manual Reference Pages - extract (1)

NAME

extract - extract character ranges or tokens from text files.

Synopsis
Description
Options
Examples
See Also
License
Copyright
Acknowledgements
Authors

SYNOPSIS

extract [ -h -? -help --help --? ]
extract [options...] <inputfile >outputfile

DESCRIPTION

extract reads a text file from stdin, extracts a range of rows and columns (character positions), and sends them to stdout. It can also operate on tokens instead of character columns, or remove the selected range instead of emitting it.
extract is much simpler to use than awk or perl and is sufficient for most column/row extraction tasks.
extract may be obtained from:
ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/
Use of extract is subject to the License terms.

OPTIONS

-all
Emit, unprocessed, the text rows outside the range specified with -sr, -er, -nr. (Default is not to emit these rows.)

-bs
Add a backslash (the Unix escape character) before any character other than a letter, digit, underscore, period, or slash. This applies only within a field: in token mode, for instance, a token range [1,3] has backslashes added to characters within each token but not to the delimiters between tokens. To work around that limitation use [dv\\:1,3]. (Default is not to add backslashes.)
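
The escaping rule described for -bs can be sketched in Python as follows. This is only an illustration of the rule as stated, not the program's source; the function name add_backslashes is hypothetical, and the sketch assumes that "letter" means an ASCII letter.

```python
import re

# Characters the -bs description leaves untouched: ASCII letters, digits,
# underscore, period, and slash (ASCII-only is an assumption of this sketch).
# Every other character gets a backslash in front of it.
_NEEDS_ESCAPE = re.compile(r"([^A-Za-z0-9_./])")


def add_backslashes(field: str) -> str:
    """Return `field` with a backslash before each character outside the safe set."""
    return _NEEDS_ESCAPE.sub(r"\\\1", field)
```

For example, add_backslashes("my file.txt") puts a backslash before the space and leaves the other characters alone.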

-cols format
Specify in detail the format of the output line. With the other command line options a single column is selected and those options are applied to it (subject to the logical changes indicated by -rm or -is). When -cols is used, the other command line options instead supply the default values for all column fields, and multiple column fields (indicated by [] brackets within format) may be specified. Static strings may be placed between column fields. These static strings may contain any symbol and escaped characters (\char), and may use [[ and ]] to represent [ and ] (which would otherwise be interpreted as the limits of a column field). Within a column field a colon (:) separated set of options is allowed. The characters [ and ] are not allowed within a column field, but all other characters are, and escapes may be used to include colons. Arbitrary combinations of static strings and column fields may be employed, freely mixing token and character mode columns and emitting columns in any order, including emitting a single column multiple times. Typically format must be quoted or escaped on the command line so that the shell does not mangle it before passing it to the program. The options for a column field are:

+ set_as = match command line specifications
p default = match program defaults (overrides -pd, -jl, -cu, etc.)
- disable = disable options
If employed as a single character it applies to all settings and must be the first option within a column field. As a suffix these may be applied singly to each of the -cols options.

mt/mc/m-/mp/m+ token mode/character mode/disable/default/set_as. Also sets the delimit state in some instances to match the command line, but this may be overridden again by a subsequent :d*: clause in the same column field. (overrides -mt/-mc)

jl/jr/jc/j-/jp/j+ justify left/right/center/disable/default/set_as (overrides -j*)

cu/cl/c-/cp/c+ case upper/lower/disable/default/set_as (overrides -c*)

bs/c-/cp/c+ backslashes apply (as needed)/disable/default/set_as (overrides -c*)

dt/dvN/d-/dp/d+ emit delimit from token/with char N/disable/default/set_as. Restriction: the delimit character N must be escaped if it is a colon or a backslash, i.e. \: and \\. (overrides -d*)

pd###/pd-/pdp/pd+ pad with ### spaces/disable/default/set_as (overrides -pd or -fw)

fw###/fw-/fwp/fwd field width to ### spaces/disable/default/set_as (overrides -pd or -fw)

rsSTR/rs-/rsp/rsd replacement string is STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rs)

[c] [s,e] [s,] [,e] range values for single column, column range (start,end), open-ended (start, or range), and tail (offset, count) ranges. The single range given in each column field is used instead of the one specified by -sc. The range values must be the final option in a column field. Both the s and e values may be positive or negative. Positive values are column/token positions measured from the front of the line; negative values are column/token positions measured from the end of the line. Mixing the two, as in [10,-10], is possible but can generate fatal errors if the line is too short or has too few tokens to satisfy the range.
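
How positive and negative positions in an [s,e] range can be mapped onto a line is sketched below. This is a hedged illustration, not the program's implementation: resolve_range is a hypothetical name, and the sketch treats every range that cannot be satisfied as an error, which may be stricter than what extract itself does.

```python
def resolve_range(length: int, start: int = 1, end: int = -1) -> slice:
    """Map a 1-based [start,end] pair onto a Python slice over `length` items.

    Positive positions count from the front of the line (1 is the first item);
    negative positions count from the end (-1 is the last item).
    """
    def absolute(pos: int) -> int:
        if pos == 0:
            raise ValueError("positions are numbered from 1, or from -1 at the end")
        return pos if pos > 0 else length + pos + 1

    first, last = absolute(start), absolute(end)
    # Assumption of this sketch: any range that does not fit is an error.
    if not (1 <= first <= last <= length):
        raise ValueError(f"range [{start},{end}] does not fit {length} items")
    return slice(first - 1, last)
```

With a 10-character line, resolve_range(10, 2, 4) selects characters 2 through 4 and resolve_range(10, -3, -1) selects the last three. A mixed range such as [10,-10] fails on a 12-character line because position -10 resolves to column 3, which lies before column 10.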

-cu -cl
In selected characters/tokens change case to upper or lower. (Default is to leave case unmodified.)

-dbg
Emit state and parsing information as each input line is processed. (Default is to not emit this information.)

-dl delimiter_string
Change the delimiters used to define tokens. Typically delimiter_string must be quoted or escaped on the command line so that the shell does not interpret it. Use \t for tab, \\ for \, and \19 for the character with value 19. (Default string is space,colon,tab.)
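
The escape forms accepted in delimiter_string could be decoded along the following lines. This is a sketch under stated assumptions: decode_delimiters is a hypothetical name, the numeric form is assumed to be decimal, and digits are assumed to be consumed greedily; the program's actual parsing rules may differ.

```python
DIGITS = "0123456789"


def decode_delimiters(spec: str) -> str:
    """Expand \\t, \\\\ and \\NNN (decimal, an assumption) in a delimiter string."""
    out = []
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch != "\\":
            out.append(ch)          # ordinary character: taken literally
            i += 1
            continue
        i += 1                      # skip the backslash itself
        if i >= len(spec):
            raise ValueError("dangling backslash at end of delimiter string")
        if spec[i] == "t":
            out.append("\t")
            i += 1
        elif spec[i] == "\\":
            out.append("\\")
            i += 1
        elif spec[i] in DIGITS:
            j = i
            while j < len(spec) and spec[j] in DIGITS:
                j += 1
            out.append(chr(int(spec[i:j])))   # character with that value
            i = j
        else:
            raise ValueError(f"unsupported escape \\{spec[i]}")
    return "".join(out)
```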

-dt
When tokens are emitted followed by delimiters, use the delimiter that ended the current token in the input. (Default.) See also -d- and -dv.

-dq -dqs
While parsing tokens ignore delimiters within double quotes. -dq returns the token with the surrounding double quotes, -dqs returns the token without the quotes. (Default is to recognize delimiters no matter where they occur.)

-dv delimit_character
When tokens are emitted followed by delimiters, use delimit_character as the delimiter. (Default is -dt.)

-d-
Do not emit a delimiter following a token. This is most often used in combination with the -s, -pd/-fw, and -j* switches. (Default is -dt; see also -dv.)

-ec end_column
The last character column to select. (Defaults to -1, the last column.)

-er end_row
The last text row to process. (Defaults to the final row in the file.)

-fw number_of_characters
Specifies in number_of_characters the field width. The input field is either padded or truncated as required. When fields are processed they are padded, then justified, then the character cases adjusted. See also -pd. (Default is 0 - no change to field sizes.)

-h -help --help -? --??
Print the help message. (Default - do not print help message.)

-i
Emit version, copyright, license, and contact information. (Default - do not emit information.)

-in input_file
Read input from the specified file. (Default is to read from stdin.)

-is
Modify the indicated character or token range in situ and emit it together with the unmodified surrounding region. This option may not be used with -rm or -cols. (Default is to emit only the selected character/token range.)

-jl -jc -jr
Justify field left, center, or right. (Default is to not change justification.)

-mc
Process lines as character columns. See also -mt. (Default.)

-mt
Process lines as tokens. In this mode -sc, -ec, and -nc values refer to token numbers. (Default is character columns = -mc.)

If a single token is emitted then no delimiter is emitted with it. However, two or more tokens are emitted as:
token1 delim1 token2 delim2 token3 etc. tokenN
where delim1 is the first delimiter following token1. When -s is also used, delim1 will be the only delimiter after token1; if -s is not specified, there may be other delimiters after delim1, and these will not be emitted. The last token emitted is not followed by a delimiter.

-nc number_of_columns
Number of columns to select. Do not specify both -nc and -ec.

-nr number_of_rows
Number of text rows to process, starting from start_row (-sr). Do not specify both -nr and -er.
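
The relation between a count (-nc or -nr) and the corresponding end position (-ec or -er) is simple arithmetic, sketched here for positive start values. The function name last_from_count is hypothetical.

```python
def last_from_count(first: int, count: int) -> int:
    """Return the last position of a run of `count` items starting at `first`."""
    if first < 1 or count < 1:
        raise ValueError("this sketch handles positive starts and counts only")
    return first + count - 1
```

So -sc 5 -nc 10 selects columns 5 through 14, the same range as -sc 5 -ec 14.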

-out output_file
Write output to the specified file. (Default is to write to stdout.)

-pd number_of_characters
Specifies the number_of_characters (spaces) to be added to the right side of the field. When fields are processed they are padded, then justified, then the character cases adjusted. See also -fw. (Default is 0 - no padding.)
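
The processing order given for -pd and -fw (pad, then justify, then adjust case) can be sketched as below. Several details are assumptions of this sketch rather than documented behaviour: format_field is a hypothetical name, a field width takes precedence over padding when both are given, and justification moves the text with its surrounding spaces stripped inside the padded field.

```python
from typing import Optional


def format_field(text: str, pad: int = 0, width: int = 0,
                 justify: Optional[str] = None,
                 case: Optional[str] = None) -> str:
    """Apply padding, then justification, then case, in that order."""
    # 1. Pad: force the field to `width` (pad or truncate), or add `pad` spaces.
    if width > 0:
        text = text[:width].ljust(width)
    elif pad > 0:
        text = text + " " * pad
    # 2. Justify: reposition the content inside the field produced by step 1.
    if justify is not None:
        size = len(text)
        core = text.strip(" ")
        if justify == "l":
            text = core.ljust(size)
        elif justify == "c":
            text = core.center(size)
        elif justify == "r":
            text = core.rjust(size)
        else:
            raise ValueError("justify must be 'l', 'c', or 'r'")
    # 3. Case: upper or lower, applied last.
    if case == "u":
        text = text.upper()
    elif case == "l":
        text = text.lower()
    return text
```

Under these assumptions, a field width of 14 with right justification turns "1234" into ten spaces followed by "1234", which is what lines up a numeric column.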

-rm
Remove the selected character columns/tokens instead of emitting them. This option may not be used with -is or -cols. (Default is to emit only the selected character/token range.)

-rs replacement_string
replacement_string is substituted for empty fields. Typically employed to insert NA or 0 in a tab delimited file in which unspecified values were left as empty fields. (Default is to leave empty fields empty.)

-s
Emit a token for each delimiter encountered. When -s is specified tokens may consist of empty strings. This mode is for use with delimited data as from a spreadsheet. (Default is to emit one token for each run of delimiters.)
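
The difference between the default token mode and -s, and the token1 delim1 token2 ... tokenN output form described under -mt, can be sketched as follows. The names tokenize and emit_tokens are hypothetical, and the handling of leading delimiters (skipped in default mode) and of a trailing delimiter (followed by an empty token under -s) are assumptions of this sketch.

```python
from typing import List, Tuple

DEFAULT_DELIMS = " :\t"   # space, colon, tab


def tokenize(line: str, delims: str = DEFAULT_DELIMS,
             single: bool = False) -> List[Tuple[str, str]]:
    """Split `line` into (token, first_delimiter_after_token) pairs.

    single=False: a run of delimiters separates two tokens (default mode).
    single=True:  every delimiter ends a token, so tokens may be empty (-s).
    """
    pairs = []
    token = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch not in delims:
            token.append(ch)
            i += 1
        elif single:
            pairs.append(("".join(token), ch))
            token = []
            i += 1
        else:
            if token:                       # leading delimiters yield no token
                pairs.append(("".join(token), ch))
                token = []
            while i < len(line) and line[i] in delims:
                i += 1                      # the rest of the run is dropped
    if token or single:
        pairs.append(("".join(token), ""))  # last token has no delimiter
    return pairs


def emit_tokens(pairs: List[Tuple[str, str]], start: int, end: int) -> str:
    """Join tokens start..end (1-based, inclusive); no delimiter after the last."""
    chosen = pairs[start - 1:end]
    out = []
    for k, (tok, delim) in enumerate(chosen):
        out.append(tok)
        if k < len(chosen) - 1:
            out.append(delim)
    return "".join(out)
```

With the delimiters ":,;" the line "x:y,,z;w" yields four tokens in default mode (x, y, z, w) and five with single=True, because the empty string between the two commas then counts as a token.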

-sc start_column
The first character column to select. Columns are numbered from 1. Negative values are allowed and represent columns measured from the end of the line, where -1 is the last column. (Default start_column=1.)

-sr start_row
The first text row (line of text) to process. Rows are numbered from 1. (Default start_row=1.)
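
Row selection with -sr and -er, together with the effect of -all, can be sketched as a small generator. The names select_rows and process are hypothetical; process stands in for whatever column or token handling is applied to rows inside the range.

```python
from typing import Callable, Iterable, Iterator, Optional


def select_rows(lines: Iterable[str], process: Callable[[str], str],
                start_row: int = 1, end_row: Optional[int] = None,
                emit_all: bool = False) -> Iterator[str]:
    """Yield processed rows inside [start_row, end_row] (1-based, inclusive).

    Rows outside the range are dropped unless emit_all is set, in which case
    they are passed through unchanged, as described for -all.
    """
    for number, line in enumerate(lines, start=1):
        in_range = number >= start_row and (end_row is None or number <= end_row)
        if in_range:
            yield process(line)
        elif emit_all:
            yield line
```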

-wl widest_line
Widest input line in characters. (Default widest_line=16000.)

-xc maXimum_Columns
Maximum number of column fields ([] in -cols) and/or tokens that may be referenced. (Default maXimum_columns=8192.)

EXAMPLES

% extract
% extract -h
List the command line options.
% extract -sc 50 <infile.txt >outfile.txt
Extract characters 50 to end of row for every line in infile.txt and write them to outfile.txt.
% extract -sr 4 -sc 5 -ec 10 <infile.txt >outfile.txt
Extract characters 5-10 from rows 4 to end of infile.txt and write them to outfile.txt.
% extract -sc 5 -nc 10 <infile.txt >outfile.txt
Extract characters 5-14 from all rows in infile.txt and write them to outfile.txt.
% extract -sc 2 -ec 3 -mt -dl ':,;' <infile.txt >outfile.txt
Extract the 2nd and 3rd tokens delimited by one or more :,; characters from each row in infile.txt and write them to outfile.txt.
% extract -sr 4 -er 40 -sc 2 -ec 3 -mt -dl ':,;' -s -all -rm <infile.txt >outfile.txt
Process infile.txt as follows:
1. Emit verbatim rows 1 through 3.
2. For rows 4 through 40 emit the 1st, and 4th through Nth tokens delimited by a single :,; character.
3. Emit verbatim rows 41 to the final row in the file.
% cd / ; du -k | extract -cols '[jr:fw14:1] [2]' -mt
Lists the size of all directories on a Unix system with the size field right justified so that the columns all line up.
% ls -al | extract -cols '[ch:1,32][fw14:jr:5] [6] [fw2:7] [jr:fw5:8] [9]' -mt -dl ' '
Straighten the columns in a directory listing on a Unix system with large files.
% extract -cols 'foo[cu:lj:fw20:3,5]blah[-:ch:10,30]er[1]' -mt -fw30 <infile.txt
Process each line of infile.txt as follows:
1. Emit "foo".
2. Emit tokens 3,4, and 5 upper cased in a 20 character field, left justified.
3. Emit "blah".
4. Emit characters 10 through 30.
5. Emit "er".
6. Emit column 1 in a field of width 30.

LICENSE

You may run this program on any platform. You may redistribute the source code of this program subject to the condition that you do not first modify it in any way. You may distribute binary versions of this program so long as they were compiled from unmodified source code. There is no charge for using this software. You may not charge others for the use of this software.

COPYRIGHT

Copyright (C) 2002 David Mathog and Caltech.

ACKNOWLEDGEMENTS

This program was inspired by Pat Rankin's EXTRACT utility for VMS.

AUTHORS

David Mathog, Biology Division, Caltech <mathog@caltech.edu>

extract (1)

21 Feb 2002
