tdb - A Text Database Utility Version 0.1 tdb - A Text Database Utility 1. Overview tdb is a report writing utility base on reading text "database" files. It is a non relational system. tdb requires a metadata description of the layout of the text files. This file is called the FIELDs file; it serves as the "data dictionary". A program file is also required, depending upon which of tdb's two modes (select or report) is used. Note: square brackets `[]' are used to indicate something is optional, and ellipses (...) are used to indicate that the item(s) preceding the ellipsis may be (optionally) repeated multiple times. 2. FIELD Files A FIELD file serves the most basic purpose of the data dictionary in an RDBMS - that is, it describes the data. There are three FIELD types: int, float, and string. FIELDs are declared on one line using the type and the name, separated by "white space" (one or more space and/or tab characters). FIELD files may contain comments (anything after a '#' character). Here is an example of a FIELD file with four FIELDs: string name # student name int department # department number int course # course number float score # score for one exam, out of 100 # +-+ yet more inane commentary +-+ 3. Data Files and Data Directories 3.1. Data Files Data files consist of text records. Each record consist of FIELDs of text separated by "white space" (by default), or a delimiter character specified on the command line (-d ''). String FIELDs may contain delimiter character(s) by surrounding them with the quote character (default double quote): "Tux T. Penguin" A different quote character may be specified on the command line, e.g., "-q '|'" 3.1.1. White Space Delimited Data Files This is the default delimiting style. There are two valid formats for data records: all the FIELDs are on one line, or FIELDs split across two lines. The first FIELD of a data record must start at the beginning of a line. Lines starting with "white space" are a continuation of a previous line. As an example of the simplest case, here is valid a data record for the preceding FIELD file: 1991-2015 -1- James S. Crook tdb - A Text Database Utility Version 0.1 "Tux T. Penguin" 42 101 88.5 Data records beginning with "white space" are considered to be continuations of the most recent previous record that started at the beginning of the line. For example, all of the following sample data files represent exactly the same data. Example 1: "Tux T. Penguin" 42 101 88.5 "Tux T. Penguin" 42 104 92.4 "Tux T. Penguin" 24 52 64.2 Example 2: "Tux T. Penguin" 42 101 88.5 42 104 92.4 24 52 64.2 Example 3: "Tux T. Penguin" 42 101 88.5 104 92.4 "Tux T. Penguin" 24 52 64.2 "White space" delimited data files may also contain comments. All text after a comment character (`#') is ignored. The comment character may be included as data by surrounding it with quote characters (by default, double quote: "#"). 3.1.2. Command Line Specified Single Character Delimited Data Files The FIELDs are delimited by a single character, specified on the command line. For example, a CSV (Comma Separated Value) file could be used by invoking '-d ,' on the command line. Comments are not allowed in single character delimited data files. Quoting text is the same for both formats of data file. 3.2. Data Directories Whenever a tdb command line data argument is a directory, tdb traverses that directory tree, opening all files in turn. Each file is processed in exactly the same manner as if it were specified individually (see above). This permits large numbers of data files to be stored in a hierarchical manner, so there are not too many files in any one directory. 4. Modes of Operation tdb operates in two modes, select mode and report mode (described below). Each mode requires a program file, but they are different. However, both select mode and report mode read the same data file format. The syntax of tdb programs is similar to "C" in some respects. 4.1. Select Mode Select mode is used to select (choose) or reject (exclude) individual data records. Select mode programs consist of one or more select or reject EXECUTABLE 1991-2015 -2- James S. Crook tdb - A Text Database Utility Version 0.1 STATEMENTs. Each of the select program statements is processed in order, and the first one to cause a rejection (or, equivalently, a non selection) causes that record to be rejected. If none of the statements causes such a rejection, the record is selected. If a record is selected, it is written out (output) exactly as it was read (including being split into two lines, where appropriate). 4.1.1. Select Mode EXECUTABLE STATEMENTs The syntax of the two select mode statements are: select statement: select(EXPR); reject statement: reject(EXPR); A sample select mode program: select(department == 42); reject(course > 101); 4.2. Report Mode Report mode is used to calculate and generate (hopefully) useful output. Report mode programs consist of exactly the following format: 1. zero or more join statements 2. exactly one aggregate statement 3. zero or more function definitions 4. one or more EXECUTABLE STATEMENTs 4.2.1. Join Statements tdb report programs may optionally have join statement(s) as the first statement(s) in the program. join statement: join(type NEW_FIELD: FIELD|STRING [,FIELD|STRING...]); The join statement creates a new FIELD (NEW_FIELD, above) by concatenating one or more existing FIELDs and/or STRINGs. Joined FIELDs behave exactly as FIELDs defined in the FIELDs file, except they may not be used as FIELD arguments in other join statements. 4.2.2. Aggregate Statements After the optional join statement(s), every tdb report program must contain exactly one aggregate statement. tdb "aggregates" (sorts and collects into buckets) the data only for the FIELD(s) that are listed in the aggregate statement. Aggregating on FIELD(s) that are not referenced in the report program does not produce different output, but it wastes computing resources (processing cycles and memory) - resulting in longer execution times. aggregate statement: aggregate(FIELD [,FIELD...]); Altering the order of the FIELDs in the aggregate statement may also affect program efficiency. Use the -v option in report mode to view tdb internal 1991-2015 -3- James S. Crook tdb - A Text Database Utility Version 0.1 statistics. 4.2.3. Function Definitions function definition: function function_name([parameter,...]) { [EXECUTABLE STATEMENTs] } Note: The return statement (see below) must only be used within function definitions when they need to "return" (exit the function) and optionally supply a value (an expression) to the caller. 4.2.4. Report Mode EXECUTABLE STATEMENTs compound statement: { EXECUTABLE STATEMENT [EXECUTABLE STATEMENT(s)] } foreach statement: foreach (REF_FIELD: [FIELD...]) EXECUTABLE STATEMENT The foreach statement iterates through each valid value for REF_FIELD which is represented in the (aggregated) bucket(s) in the optional FIELD(s). If there are no optional FIELDs, foreach loops once for each REF_FIELD value. Unless the sort command (see below) has been used, the values are in ascending sorted order of the REF_FIELD values. See also: first and last statements. while statements: while (EXPR) EXECUTABLE STATEMENT do-while statement: do EXECUTABLE STATEMENT while (EXPR); if statement: if (EXPR) EXECUTABLE STATEMENT if-else statement: if (EXPR) EXECUTABLE STATEMENT else EXECUTABLE STATEMENT sort statement: sort(REF_FIELD: EXPR [,reverse]); first statement: first(REF_FIELD: INTEGER_EXPR); last statement: last(REF_FIELD: INTEGER_EXPR); The sort statement (re)arranges the processing order the REF_FIELD value(s) for any subsequent foreach statement(s) for REF_FIELD. The EXPR argument of the 1991-2015 -4- James S. Crook tdb - A Text Database Utility Version 0.1 sort statement can (and usually does) reference REF_FIELD - even though it is not within a foreach(REF_FIELD:...) statement. The first and last statements cause subsequent foreach statement(s), for the specified FIELD (REF_FIELD), to only process the first (or last) N values. printf statement: printf(STRING_EXPR [,EXPR...]); sprintf statement: sprintf(VARIABLE, STRING_EXPR [,EXPR...]); assign statement: VARIABLE ASSIGNMENT_OPERATOR EXPR; system statement: system(STRING_EXPR); return statement: return [EXPR]; (See Function Definitions) expression statement: EXPR; trap statement: trap(INTEGER_EXPR, function_name([EXPR,...])); The trap statement is used to invoke (call) a function whenever a the linenumber variable (see below) hits a particular value. This is typically used for page headers and footers, but can be used for any line number in the output. 5. FIELDs FIELDs can be any valid identifier except a reserved word. 6. INTEGER CONSTANTs Integer constants are comprised of one or more digits (0-9). 7. FLOATING POINT CONSTANTs Floating point constants are best illustrated by example: 23.45 -.1 -0.1 3. 3.0000 .7e5 0.7e5 .7E5 0.7E5 .7e+5 0.7e+5 .7E+5 0.7E+5 .33e-5 0.33e-5 .33E-5 0.33E-5 -98.76e5 -98.76E5 -98.76e+5 -98.76E+5 8. STRING CONSTANTs String constants are any characters enclosed by double quotes (`"'). There are several characters that can be included in a string constant by preceding them with a backslash character (`\'). These are: \n new line 1991-2015 -5- James S. Crook tdb - A Text Database Utility Version 0.1 \t tab \r return \f form feed \b backspace \" double quote \\ backslash The following are example of string constants: "Hello there" (without new line at the end) "Hello there\n" (with a new line at the end) "\tMy name is Fifo" (with a tab at the beginning) 9. Operators 9.1. ARITHMETIC_OPERATORs + addition - subtraction * multiplication / division % modulus ^ exponentiation 9.2. RELATIONAL_OPERATORs < less than > greater than <= less than or equal to >= greater than or equal to == equal to != not equal to 9.3. LOGICAL_OPERATORs && logical and || logical or ! logical negation 9.4. Operator Precedence This list of operator precedences is organized as follows: operators on the same line have equal precedence. Operators at the top of the list have the highest precedence. + - ! (unary) (TYPE) (type casting) * / % ^ + - (binary) < <= > >= == != && || 1991-2015 -6- James S. Crook tdb - A Text Database Utility Version 0.1 10. ASSIGNMENT_OPERATORs = assignment += add assignment -= subtract assignment *= multiplication assignment /= divide assignment %= mod assignment ^= exponentiation assignment For example: a += b means: a = (a) + (b) 11. VARIABLEs IDENTIFIER (scalar variable) IDENTIFIER [EXPR [,EXPR...]] (array variable) Array indexes may be integers, strings or floats. (However, using floating point numbers for indexes is not recommended; it should be avoided if possible for obvious reasons!) There is no limit to the dimension of array variables - that is - one can have arrayvar[10,integervar,42,"January",...66,stringvar,19]. 12. EXPRs INTEGER CONSTANT FLOATING POINT CONSTANT STRING CONSTANT VARIABLE FIELD (EXPR) EXPR RELATIONAL_OPERATOR EXPR EXPR LOGICAL_OPERATOR EXPR EXPR ARITHMETIC_OPERATOR EXPR tdb intrinsic FUNCTION function_name([EXPR,...]) 13. tdb Intrinsic Functions 13.1. Select and Report Mode Intrinsic Functions Some of the tdb intrinsic functions can be used in both select mode and report mode. These functions are listed below. tdb Intrinsic Functions Expr type returned _______________________________________ __________________ sin(FLOAT_EXPR) FLOAT cos(FLOAT_EXPR) FLOAT tan(FLOAT_EXPR) FLOAT asin(FLOAT_EXPR) FLOAT acos(FLOAT_EXPR) FLOAT atan(FLOAT_EXPR) FLOAT log(FLOAT_EXPR) FLOAT log10(FLOAT_EXPR) FLOAT exp(FLOAT_EXPR) FLOAT 1991-2015 -7- James S. Crook tdb - A Text Database Utility Version 0.1 atoi(STRING_EXPR) INTEGER atof(STRING_EXPR) FLOAT patmatch(STRING_EXPR, STRING_EXPR) INTEGER strlen(STRING_EXPR) INTEGER indexstr(STRING_EXPR, STRING_EXPR) INTEGER substr(STRING_EXPR, INTEGER_EXPR, INTEGER_EXPR) STRING lookup(STRING_EXPR, STRING_EXPR, [INTEGER_EXPR]) STRING reportdate() INTEGER reporttime() INTEGER formatdate(STRING_EXPR, INTEGER_EXPR) STRING formattime(STRING_EXPR, INTEGER_EXPR) STRING 13.1.1. The String Intrinsic Functions The patmatch function searches its second argument using the pattern specified by its first. regcomp and regexec are used to do this. (See their manual page for details.) It returns a 1 if there is a match, otherwise 0. The strlen function returns the length (in bytes) of its string argument. The indexstr function searches its second argument for the string specified by its first argument. This is not a pattern match; it's a simple string find. If the string is found, indexstr returns the character position of the first occurrence. If the string is not found, it returns 0. The substr function returns the part of its first string argument specified by its second and third index integer arguments, the indexes. These index arguments must be between 1 and strlen(string), inclusive, and the second argument must be less than or equal to the third. The lookup function is used to reference string patterns in a text file. The first argument is the string to search for - the pattern. The second argument is the name of a file to search. This function searches through the file for a line beginning with the pattern. If no such line exists, lookup returns the pattern. If it does find one, lookup returns the N'th argument on that line, where N is the optional integer third argument - the index. Using an index of 1 returns the pattern - a waste of system resources. If no index (third) argument is used, it is assumed to be 2. 13.1.2. The Date and Time Intrinsic Functions The reportdate function returns the current date, as read from the system. The value is an INTEGER, the format is yyyymmdd. The reporttime returns the number of seconds that have elapsed since midnight, as read from the system. This is an INTEGER value. The formatdate and formattime functions use the following format codes. (Note, these are from strftime.) %d : two digit day of the month (dd) %m : two digit month of the year (mm) %y : two digit year (yy) %Y : for digit year (yyyy) %H : two digit hour (HH) %M : two digit minute (MM) 1991-2015 -8- James S. Crook tdb - A Text Database Utility Version 0.1 %S : two digit second (SS) %a : three character weekday abbreviation (e.g., "Mon") %A : full weekday name (e.g., "Monday") %b, %h: three character month abbreviation (e.g., "Jan") %B : full month name (e.g., "January") %j : the day number of the year (e.g., 001-366) %U : the week number of year (e.g., 01-52) Sun 1st day of week %W : the week number of year (e.g., 01-52) Mon 1st day of week %% : the literal '%' character The formatdate function is used to transform date values (like those returned by reportdate) into another text date string. It is called with two arguments, first the format string, and second, the (INTEGER) date to be formatted. Two sample format strings are: "%d/%m/%y" for dd/mm/yy, or "%m-%d" for mm-dd. The formattime function is used to transform time values (like those returned by reporttime) into another text time string. For example, the format string "%H:%M:%S" produces HH:MM:SS. 13.1.3. The Need Intrinsic Function The need function is used when it is desirable to prevent a group of output lines from being split across a page break. It takes the required number of lines as its argument. 13.2. Report Mode Intrinsic Functions Some of the tdb intrinsic functions make no sense in select mode. The following functions may only be used in report mode. Report Mode Intrinsic Function Expr type returned _______________________________________ __________________ count([FIELD,...]) INTEGER number(REF_FIELD: [FIELD,...]) INTEGER sum(REF_FIELD: [FIELD,...]) REF_FIELD's type sumsquared(REF_FIELD: [FIELD,...]) REF_FIELD's type need(INTEGER_EXPR) INTEGER Note: only the FIELD(s) specified in the aggregate statement can be referenced in a tdb report program. Much of the explanation below is centered on data records and data "buckets". For the sake of simplicity, assume each physical data input line corresponds to one (logical) record - that is, a row. That is, the fact that two data input lines can map to one (logical) record is not important here. Each time an input record is read, tdb selects the FIELD(s) (columns) of interest (the ones in the aggregate statement, in report mode) and places them into appropriate data "buckets". A bucket is simply a place where information about FIELD values is stored. For example, a FIELD named fruit might have values of "apple", "banana", "cherry", and "quince". If the aggregate statement contained this field, then a bucket would exist for each of these four values. 1991-2015 -9- James S. Crook tdb - A Text Database Utility Version 0.1 The examples below refer to the following FIELD and data files: FIELD file data file ------------ -------------- string fruit apple 3.00 float price banana 1.00 banana 3.00 cherry 3.00 cherry 4.00 cherry 5.00 quince 10.00 quince 10.00 13.2.1. The Count Intrinsic Function The count function returns the number of record(s) that fell into the appropriate bucket. This function requires zero or more FIELD argument(s), and must be called from within a foreach loop for every one of the FIELD argument(s). For example, if it is called with three arguments, there must be three nested foreach loops with the same FIELDs. The call `count()' (with no arguments) returns the total number of records. Since it has no arguments, it may be called from anywhere in a program; that is, it need not be within any foreach loop. For the example data above, count(fruit) would return the following values for apple: 1, banana: 2, cherry: 3, and quince: 2. Also, count() is 8. 13.2.2. The Number Intrinsic Function The number function calculates the number of bucket(s) of a particular type for the reference FIELD (REF_FIELD). Both the REF_FIELD and the `:' are required. Like count, number must be called from within a foreach loop for all of the optional FIELDs. REF_FIELD must not also appear in the optional FIELD list. For the above data, number(fruit:) is 4, and number(price:) is 5. 13.2.3. The Sum Intrinsic Function The sum function calculates the (sum) total of all the individual values for the reference FIELD (REF_FIELD) for the specified bucket. REF_FIELD must be a numeric type (INTEGER or FLOAT). Both the REF_FIELD and the `:' are required. This function must also be called from within a foreach loop for all of the optional FIELDs. REF_FIELD must not also appear in the optional FIELD list. For the example data above, sum(price:) would return 39.0. 13.2.4. The Sumsquared Intrinsic Function The sumsquared function calculates the (sum) total of the squares of all of the individual numeric data values for the reference FIELD (REF_FIELD) for the specified bucket. REF_FIELD must be a numeric type (INTEGER or FLOAT). Both the REF_FIELD and the `:' are required. This function must also be called from within a foreach loop for all of the optional FIELDs. 1991-2015 -10- James S. Crook tdb - A Text Database Utility Version 0.1 The primary purpose of this function is to facilitate statistical analysis, such as standard deviation. For the example data above, sumsquared(price:) would return 269.0. 14. tdb Reserved Variables tdb maintains three INTEGER variables automatically in report mode. These are listed below (followed by their default initial values): - pagelength (66) - pagenumber (1) - linenumber (0) These variables are updated whenever tdb produces a line of output. It should never be necessary to assign linenumber. Note, the "line number" is relative to the top of the current page, and the numbers range from 0 to pagelength-1 inclusive. If it is desired for the first page of the report to start from some number other than 1, pagenumber can be set accordingly. Similarly, if the length of the page is not 66 lines, pagelength can be set to some other legal value. Do not assign any of these variables to non integer or non positive values, otherwise, undesirable results will very likely occur. 15. Include Files It is possible to include one file in another in the same manner as with the C Preprocessor (CPP) `#include' mechanism. To do this, one uses the following construction: #include "filename" Note that the `#' must be in the first column, and there must be no space between the `#' and `include'. 16. Warnings 16.1. The Maximum Number of FIELDs in an Aggregate Statement For now, the maximum number of FIELDs that can be used in any one aggregate statement is the number of bits in the C language type long - 32 or 64 in most cases. Note that this restriction does not apply to the number of FIELDs in the "database", nor does it apply to the number of FIELDs that can be referenced in select mode. 16.2. Trap Statements and Function Arguments If the trap statement is used within a user defined function, do NOT use any of that of that function's parameters in either of trap's arguments (either directly or indirectly). This is due to the fact that the trap function can be called at any time, and the same arguments that were on the stack when the trap was 'set' are unlikely to still be there when the trap function is invoked. Some examples may make this clear: 1991-2015 -11- James S. Crook tdb - A Text Database Utility Version 0.1 function myfunc (integerparam, stringparam) { trap(0, pageheader("hi")); /* OK, no use of parameters */ globalvar = integerparam + 2; trap(globalvar, pageheader("hi")); /* OK, no use of parameters */ trap(integerparam, pageheader("hi")); /* ERROR, direct use */ trap(0, pageheader(stringparam)); /* ERROR, direct use */ /* ERROR, indirect use */ trap(myotherfunc(10*integerparam/2), pageheader("hi")); } 17. Copyright and License Copyright (C) 1991-2015 James S. Crook Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front- Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". 1991-2015 -12- James S. Crook