Some Recent Customer Solutions

A Typical Customer Brief

Develop a small, elegant, extremely robust, inexpensive, easily supportable and fully automated solution. Avoid the initial purchase price, annual software maintenance fees and unexpected/unwanted vendor changes, and retain total control of the software.

Also, when external problems occur, the solution must "self-right" after short/medium-term interruptions and provide automated notifications of long-term interruptions.

Constraints like these might seem impossible to satisfy all at once. Yes, it is possible!

Some Recent Customer Solutions

  1. A suite of shell scripts to back up a SAP ERP system, located in a primary DC with no local backup facility, to a remote DR DC - over a modest network link. This is fully automated and sends notification emails in the event of a problem. (Linux/MaxDB).

    For this customer, CPU cycles are cheap and network bandwidth is expensive, so we trade more of the former for less of the latter. Other objectives are to get the backups off site ASAP and to cope with network interruptions by transferring many smaller files rather than one huge file.

    • splitcompress: Asynchronously splits and compresses "large" (~130GiB, so fairly small by SAP standards) full and incremental SAP ERP DB backup files into compressed chunks - N chunks at a time - using as many available processors as desired. Checksums are calculated at every stage.
    • offsite_dirs: Copies the compressed chunks and their checksums off site as soon as they are complete.
    • offsite_files: Copies all the other files that are required to restore a SAP ERP system (including DB transaction logs) off site shortly after they are complete.
    • verifycksum: Verifies that all the files have arrived at the DR site intact. All the compressed chunks are uncompressed and concatenated, and the result is checksummed and compared with the original checksum from the primary site.
  2. EDI bureau interface: A shell script that polls the EDI bureau at specified times to up/download business documents. Because the bureau's systems are quite slow, the script forks N "threads" (sub-shells) that up/download simultaneously - thus achieving the required throughput. Rogue suppliers occasionally generating up to hundreds of duplicate orders was a major problem for the business, so the script also detects and quarantines duplicate orders so they never reach the ERP system.
  3. Third party logistics (3PL) interface: A shell script to up/download business documents of various types (e.g., IDoc, XML and CSV). Depending on the document type, various actions are required, such as: changing line terminations, creating a file containing a list of files (for ERP to read and process), triggering a SAP event, sending emails, etc.
  4. A suite of shell scripts to keep Oracle DR DBs (for SAP ERP systems) in warm standby: As above, one of the design constraints was that the network link is fairly modest, so a very high level of compression was required. Another benefit is that weeks or even months of compressed Oracle offline redo logs can be stored off site at the DR/test DC. This is very handy for recovering test and sandbox systems. (Linux/Oracle)

    Oracle's Data Guard was not used because: (1) DG was too new when we started; (2) we have control! - e.g., the choice of compression utility, the priority at which to run it, etc.; (3) DG only replicates Oracle, and there are other files that must be replicated too; (4) no Oracle changes are required on either the primary or standby DB; and (5) these scripts are extremely robust and reliable.

    • Oracle DB log shipping: This runs on the production (primary) servers. Shortly after every offline redo log file is created, a checksum is calculated, the log file is compressed, and then both the compressed file and the checksum are sent off site (as described above). Unix nice and lzip are used to control resource consumption on the production (primary) systems. (Originally, gzip was used, but when the network link became saturated, it was trivial to replace it with lzip - which squishes much better!) This script is also used as part of the backup strategy for development systems.

      Development systems' DB redo logs are sent off site in exactly the same way as the production logs, so development system backups need only be run nightly or even weekly - with no extra risk of data loss.

    • Oracle DB log recover: This runs on the DR (off-site) servers. As soon as a redo log file has reached a (business-specified) age, it is uncompressed, the checksum is compared to ensure the file has not been damaged, and then it's applied to the standby DB.
    • Synchronization of other files: This is just a list of target directories to rsync from the production (primary) servers to the DR (off-site) servers. (All the real work is done by rsync.)
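The split/compress/checksum pipeline from solution 1 can be sketched in a few lines of portable shell. Everything below is illustrative: the paths, the chunk size and the use of gzip (standing in for lzip) are assumptions, not the customer's actual splitcompress script.

```shell
#!/bin/sh
# Illustrative sketch only: a splitcompress-style "split, compress in
# parallel, checksum everything" pipeline. gzip stands in for lzip, and
# all paths and sizes are hypothetical.
set -eu

BACKUP=/tmp/demo_backup.dat     # stand-in for a large DB backup file
STAGE=/tmp/demo_stage           # chunks are staged here before going off site
CHUNK_SIZE=1M                   # real chunks would be far larger
JOBS=4                          # compress this many chunks at a time

rm -rf "$STAGE"
mkdir -p "$STAGE"
dd if=/dev/urandom of="$BACKUP" bs=1M count=8 2>/dev/null

# Checksum the whole backup first, so the DR site can verify the reassembly.
sha256sum "$BACKUP" | awk '{print $1}' > "$STAGE/backup.sha256"

# Split into fixed-size chunks, then compress JOBS of them at a time at low
# priority - trading cheap CPU cycles for expensive network bandwidth.
split -b "$CHUNK_SIZE" -d "$BACKUP" "$STAGE/chunk."
ls "$STAGE"/chunk.* | xargs -n 1 -P "$JOBS" nice gzip

# Every compressed chunk also gets its own checksum for in-flight checks.
for f in "$STAGE"/chunk.*.gz; do
    sha256sum "$f" > "$f.sha256"
done

echo "staged $(ls "$STAGE"/chunk.*.gz | wc -l) compressed chunks"
# An rsync of "$STAGE" to the DR host (over ssh) would follow here.
```

Running the compressions through xargs -P spends idle CPU cycles to shrink the payload on the wire, which is exactly the trade-off described above; shipping many small chunks also means a network interruption only costs a partial chunk, not a whole backup.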
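The "N sub-shell threads" technique from the EDI interface (solution 2) is a standard shell idiom: start background jobs and throttle them with wait. Below is a minimal sketch; download_document, the document IDs and the paths are hypothetical stand-ins for the real bureau transfers.

```shell
#!/bin/sh
# Illustrative sketch of forking N "threads" (sub-shells) to overlap slow
# transfers. download_document is a hypothetical stand-in for the real
# up/download to the EDI bureau.
set -eu

JOBS=4                # number of concurrent workers
OUT=/tmp/demo_edi     # downloaded documents land here

rm -rf "$OUT"
mkdir -p "$OUT"

download_document() {
    sleep 1                                     # simulate a slow bureau
    echo "payload for document $1" > "$OUT/doc.$1"
}

# Start workers in the background, at most JOBS at a time.
i=0
for doc in 01 02 03 04 05 06 07 08; do
    download_document "$doc" &
    i=$((i + 1))
    [ $((i % JOBS)) -eq 0 ] && wait     # throttle: wait for each batch
done
wait                                    # and for any stragglers

echo "fetched $(ls "$OUT"/doc.* | wc -l) documents"
```

Waiting batch-by-batch is the simplest throttle: these eight one-second "transfers" finish in roughly two seconds instead of eight. A real script would also need per-worker error handling and, as in solution 2, duplicate detection before anything reaches the ERP system.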
Notes:
  1. PKI is used for authentication and encryption.
  2. Passphrases for automated secure communication are managed by ssh-agent.
  3. Most of the shell scripts are initiated by cron or SAP batch jobs.
  4. Extensive use is made of utilities such as: rsync, lzip, xz, awk, sed, the cksum utilities (cksum, sha256sum, etc.) and, of course, findfiles (developed by YOSJ Staff)!
  5. "Obvious" solutions such as SAN replication were eliminated due to the additional initial and ongoing costs - i.e., the SAN software and the requirement of a much higher bandwidth network link between the DCs.
  6. CapEx can inexplicably be an order of magnitude more difficult to obtain than OpEx - even if it's the better option!
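As an end-to-end illustration of the checksum discipline running through these solutions - the verifycksum step of solution 1 and the cksum utilities in note 4 - the sketch below compresses two chunks, reassembles them and compares checksums. All names and paths are hypothetical, and gzip again stands in for lzip.

```shell
#!/bin/sh
# Illustrative sketch of verifycksum-style checking at the DR site: the
# chunks are uncompressed and concatenated, and the result's checksum is
# compared with the one computed at the primary site. Hypothetical paths.
set -eu

STAGE=/tmp/demo_verify
rm -rf "$STAGE"
mkdir -p "$STAGE"

# Simulate what the primary site sent: a checksum of the original data
# plus the data itself as two compressed chunks.
printf 'first half.'  > "$STAGE/part0"
printf 'second half.' > "$STAGE/part1"
cat "$STAGE/part0" "$STAGE/part1" | sha256sum | awk '{print $1}' \
    > "$STAGE/original.sha256"
gzip -c "$STAGE/part0" > "$STAGE/chunk.00.gz"
gzip -c "$STAGE/part1" > "$STAGE/chunk.01.gz"
rm "$STAGE/part0" "$STAGE/part1"

# DR side: uncompress the chunks in order, checksum the reassembled stream
# and compare it with the primary site's checksum.
reassembled=$(gzip -dc "$STAGE"/chunk.*.gz | sha256sum | awk '{print $1}')

if [ "$reassembled" = "$(cat "$STAGE/original.sha256")" ]; then
    echo "VERIFY OK"
else
    echo "VERIFY FAILED" >&2
    exit 1
fi
```

Because the comparison is against a checksum taken before splitting and compression, it catches corruption introduced at any stage: splitting, compression, transfer or reassembly.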