Jul 10

Muscular sed

In my previous two sed articles, I offered examples that demonstrated how sed works, but very few of these examples actually did anything particularly useful. In this final sed article, it’s time to change that pattern and put sed to good use. I’ll show you several excellent examples that not only demonstrate the power of sed, but also do some really neat (and handy) things. For example, in the second half of the article, I’ll show you how I designed a sed script that converts a .QIF file from Intuit’s Quicken financial program into a nicely formatted text file. Before doing that, we’ll take a look at some less complicated yet useful sed scripts.


Text translation

Our first practical script converts UNIX-style text to DOS/Windows format. As you probably know, DOS/Windows-based text files have a CR (carriage return) and LF (line feed) at the end of each line, while UNIX text has only a line feed. There may be times when you need to move some UNIX text to a Windows system, and this script will perform the necessary format conversion for you.

$ sed -e 's/$/\r/' myunix.txt > mydos.txt

In this script, the ‘$’ regular expression will match the end of the line, and the ‘\r’ tells sed to insert a carriage return right before it. Insert a carriage return before a line feed, and presto, a CR/LF ends each line. Please note that the ‘\r’ will be replaced with a CR only when using GNU sed 3.02.80 or later.

I can’t tell you how many times I’ve downloaded some example script or C code, only to find that it’s in DOS/Windows format. While many programs don’t mind DOS/Windows format CR/LF text files, several programs definitely do — the most notable being bash, which chokes as soon as it encounters a carriage return. The following sed invocation will convert DOS/Windows format text to trusty UNIX format:

$ sed -e 's/.$//' mydos.txt > myunix.txt

The way this script works is simple: our substitution regular expression matches the last character on the line, which happens to be a carriage return. We replace it with nothing, causing it to be deleted from the output entirely. If you use this script and notice that the last character of every line of the output has been deleted, you’ve specified a text file that’s already in UNIX format. No need for that!
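To try both conversions without touching real files, here is a minimal sketch. It assumes GNU sed (for the ‘\r’ escape) and uses throwaway file names in a temporary directory. It also uses ‘s/\r$//’, a variant that is not part of the article’s script: it strips only a trailing carriage return, so running it on text that is already in UNIX format leaves the text alone.

```shell
# Scratch directory; the file names are only for illustration.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/myunix.txt"

# UNIX -> DOS: put a CR in front of every line feed (GNU sed understands \r).
sed -e 's/$/\r/' "$workdir/myunix.txt" > "$workdir/mydos.txt"

# DOS -> UNIX: delete a trailing CR, and nothing but a CR.
sed -e 's/\r$//' "$workdir/mydos.txt" > "$workdir/roundtrip.txt"

# Running the same command on text that is already UNIX format is harmless.
sed -e 's/\r$//' "$workdir/roundtrip.txt" > "$workdir/again.txt"

# Each line gains exactly one byte in DOS format.
unix_bytes=$(wc -c < "$workdir/myunix.txt")
dos_bytes=$(wc -c < "$workdir/mydos.txt")
echo "UNIX: $unix_bytes bytes, DOS: $dos_bytes bytes"
```

Comparing byte counts is a quick way to confirm that a carriage return was added to, or removed from, every line.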


Reversing lines

Here’s another handy little script. This one will reverse lines in a file, similar to the "tac" command that’s included with most Linux distributions. The name "tac" may be a bit misleading, because "tac" doesn’t reverse the position of characters on the line (left and right), but rather the position of lines in the file (up and down). Tacing the following file:

foo

bar

oni

...produces the following output:

oni

bar

foo

We can do the same thing with the following sed script:

$ sed -e '1!G;h;$!d' forward.txt > backward.txt

You’ll find this sed script useful if you’re logged in to a FreeBSD system, which doesn’t happen to have a "tac" command. While handy, it’s also a good idea to know why this script does what it does. Let’s dissect it.


Reversal explained

First, this script contains three separate sed commands, separated by semicolons: ‘1!G’, ‘h’ and ‘$!d’. Now, it’s time to get a good understanding of the addresses used for the first and third commands. If the first command were ‘1G’, the ‘G’ command would be applied only to the first line. However, there is an additional ‘!’ character, which negates the address, meaning that the ‘G’ command will apply to all but the first line. For the ‘$!d’ command, we have a similar situation. If the command were ‘$d’, it would apply the ‘d’ command to only the last line in the file (the ‘$’ address is a simple way of specifying the last line). However, with the ‘!’, ‘$!d’ will apply the ‘d’ command to all but the last line. Now, all we need to do is understand what the commands themselves do.

When we execute our line reversal script on the text file above, the first command that gets executed is ‘h’. This command tells sed to copy the contents of the pattern space (the buffer that holds the current line being worked on) to the hold space (a temporary buffer). Then, the ‘d’ command is executed, which deletes "foo" from the pattern space, so it doesn’t get printed after all the commands are executed for this line.

Now, line two. After "bar" is read into the pattern space, the ‘G’ command is executed, which appends the contents of the hold space ("foo\n") to the pattern space ("bar\n"), resulting in "bar\nfoo\n" in our pattern space. The ‘h’ command puts this back in the hold space for safekeeping, and ‘d’ deletes the line from the pattern space so that it isn’t printed.

For the last "oni" line, the same steps are repeated, except that the contents of the pattern space aren’t deleted (due to the ‘$!’ before the ‘d’), and the contents of the pattern space (three lines) are printed to stdout.
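The walk-through can be checked with a small sketch. The second command is my own variation, not part of the original script: it drops the final ‘d’ and uses the ‘l’ command with ‘-n’ to print the pattern space at the end of every cycle, with embedded newlines shown as ‘\n’ and the end of the buffer marked by ‘$’. Strictly speaking, sed strips the trailing newline when it reads a line, so the trace shows "foo" rather than "foo\n".

```shell
# The three-line sample from above, reversed.
reversed=$(printf 'foo\nbar\noni\n' | sed -e '1!G;h;$!d')
printf '%s\n' "$reversed"

# Same 'G' and 'h' commands, but print the pattern space after each line
# instead of deleting it, to watch the reversed text build up.
trace=$(printf 'foo\nbar\noni\n' | sed -n -e '1!G;h;l')
printf '%s\n' "$trace"
```

The last line of the trace is the whole file in reverse order, which is exactly what the real script lets through on the final line.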

Now, it’s time to do some powerful data conversion with sed.


sed QIF magic

For the last few weeks, I’ve been thinking about purchasing a copy of Quicken to balance my bank accounts. Quicken is a very nice financial program, and would certainly perform the job with flying colors. But, after thinking about it, I decided that I could easily write some software that would balance my checkbook. After all, I reasoned, I’m a software developer!

I developed a nice little checkbook balancing program (using awk) that calculates my balance by parsing a text file containing all my transactions. After a bit of tweaking, I improved it so that I could keep track of different credit and debit categories, just like Quicken can. But, there was one more feature I wanted to add. I recently switched my accounts to a bank that has an online Web account interface. One day, I noticed that my bank’s Web site allowed me to download my account information in Quicken’s .QIF format. In very little time, I decided that it would be really neat if I could convert this information into text format.


A tale of two formats

Before we look at the QIF format, here’s what my checkbook.txt format looks like:

28 Aug 2000     food    -       -       Y     Supermarket             30.94

25 Aug 2000     watr    -       103     Y     Check 103               52.86

In my file, all fields are separated by one or more tabs, with one transaction per line. After the date, the next field lists the type of expense (or "-" if this is an income item). The third field lists the type of income (or "-" if this is an expense item). Then, there’s a check number field (again, "-" if empty), a transaction cleared field ("Y" or "N"), a comment and a dollar amount. Now, we’re ready to take a look at the QIF format. When I viewed my downloaded QIF file in a text viewer, this is what I saw:

!Type:Bank

D08/28/2000

T-8.15

N

PCHECKCARD SUPERMARKET

^

D08/28/2000

T-8.25

N

PCHECKCARD PUNJAB RESTAURANT

^

D08/28/2000

T-17.17

N

PCHECKCARD SUPERMARKET

After scanning the file, it wasn’t very hard to figure out the format. Ignoring the first line, the format is as follows:

D<date>

T<transaction amount>

N<check number>

P<description>

^ (this is the field separator)

Starting the process

When you’re tackling a significant sed project like this, don’t get discouraged — sed allows you to gradually massage the data into its final form. As you progress, you can continue to refine your sed script until your output appears exactly as intended. You don’t need to get it exactly right on the first try.

To start off, I created a file called "qiftrans.sed", and started massaging the data:

1d

/^^/d

s/[[:cntrl:]]//g

The first ‘1d’ command deletes the first line, and the second command removes those pesky ‘^’ characters from the output. The last line removes any control characters that may exist in the file. Since I’m dealing with a foreign file format, I want to eliminate the risk of encountering any control characters along the way. So far, so good. Now, it’s time to add some processing punch to this basic script:

1d

/^^/d

s/[[:cntrl:]]//g

/^D/ {

	s/^D\(.*\)/\1\tOUTY\tINNY\t/

        s/^01/Jan/

        s/^02/Feb/

        s/^03/Mar/

        s/^04/Apr/

        s/^05/May/

        s/^06/Jun/

        s/^07/Jul/

        s/^08/Aug/

        s/^09/Sep/

        s/^10/Oct/

        s/^11/Nov/

        s/^12/Dec/

        s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3: 

}

First, I add a ‘/^D/’ address so that sed will only begin processing when it encounters the first character of the QIF date field, ‘D’. All of the commands in the curly braces will execute in order as soon as sed reads such a line into its pattern space.

The first line in the curly braces will transform a line that looks like:

D08/28/2000

into one that looks like this:

08/28/2000	OUTY	INNY

Of course, this format isn’t perfect right now, but that’s OK. We’ll gradually refine the contents of the pattern space as we go. The next 12 lines have the net effect of transforming the month to a three-letter format, and the line after them removes the two slashes from the date and puts the day in front of the month. We end up with this line:

28 Aug 2000	OUTY	INNY

The OUTY and INNY fields are serving as placeholders and will get replaced later. I can’t specify them just yet, because if the dollar amount is negative, I’ll want to set OUTY and INNY to "misc" and "-", but if the dollar amount is positive, I’ll want to change them to "-" and "inco" respectively. Since the dollar amount hasn’t been read yet, I need to use placeholders for the time being.
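Here is a minimal sketch of the date handling on its own, run against a single QIF date line. It assumes GNU sed (for the ‘\t’ escape in the replacement) and keeps only the one month substitution that the sample needs; the variable names are mine. The final ‘l’ command is there only to make the tabs visible.

```shell
# A cut-down version of the /^D/ block: tag the line, name the month,
# then swap day and month while dropping the slashes.
datecmds='/^D/ {
s/^D\(.*\)/\1\tOUTY\tINNY\t/
s/^08/Aug/
s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:
}'

dateline=$(printf 'D08/28/2000\n' | sed -e "$datecmds")

# Show the result with tabs printed as \t and the end of line as $.
printf '%s\n' "$dateline" | sed -n -e 'l'
```

Because ‘.*’ is greedy, the third region swallows everything after the second slash, placeholders included, which is why they survive the reshuffle untouched.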


Refinement

Now, it’s time for some further refinement:

1d 

/^^/d

s/[[:cntrl:]]//g 

/^D/ { 

        s/^D\(.*\)/\1\tOUTY\tINNY\t/ 

        s/^01/Jan/ 

        s/^02/Feb/ 

        s/^03/Mar/ 

        s/^04/Apr/ 

        s/^05/May/ 

        s/^06/Jun/ 

        s/^07/Jul/ 

        s/^08/Aug/ 

        s/^09/Sep/ 

        s/^10/Oct/ 

        s/^11/Nov/ 

        s/^12/Dec/ 

        s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3: 

        N 

        N 

        N 

        s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/ 

        s/NUMNUM/-/ 

        s/NUM\([0-9]*\)NUM/\1/ 

        s/\([0-9]\),/\1/ 

}

The next seven lines are a bit complicated, so we’ll cover them in detail. First, we have three ‘N’ commands in a row. The ‘N’ command tells sed to read in the next line in the input and append it to our current pattern space. The three ‘N’ commands cause the next three lines to be appended to our current pattern space buffer, and now our line looks like this:

28 Aug 2000	OUTY	INNY	\nT-8.15\nN\nPCHECKCARD SUPERMARKET

Sed’s pattern space got ugly — we need to remove the extra newlines and perform some additional formatting. To do this, we’ll use the substitution command. The pattern we want to match is:

'\nT.*\nN.*\nP.*'

This will match a newline, followed by a ‘T’, followed by zero or more characters, followed by a newline, followed by an ‘N’, followed by any number of characters and a newline, followed by a ‘P’, followed by any number of characters. Phew! This regexp will match the entire contents of the three lines we just appended to the pattern space. But we want to reformat this region, not replace it entirely. The dollar amount, check number (if any) and description need to reappear in our replacement string. To do this, we surround those "interesting parts" with backslashed parentheses, so that we can refer to them in our replacement string (using ‘\1’, ‘\2’, and ‘\3’ to tell sed where to insert them). Here is the final command:

s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/ 

This command transforms our line into:


28 Aug 2000  OUTY  INNY  NUMNUM    Y	   CHECKCARD SUPERMARKET	 AMT-8.15AMT

While this line is getting better, there are a few things that at first glance appear a bit…er…interesting. The first is that silly "NUMNUM" string — what purpose does that serve? You’ll find out as you inspect the next two lines of the sed script, which will replace "NUMNUM" with a "-", while "NUM"<number>"NUM" will be replaced with <number>. As you can see, surrounding the check number with a silly tag allows us to conveniently insert a "-" if the field is empty.
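The two ideas in this section, joining lines with ‘N’ and tagging a possibly empty field, can be tried separately. This is a small sketch rather than a piece of the production script; the variable names are mine, and ‘l’ is used only to display the embedded newlines.

```shell
# Three N commands pull the T, N and P lines into the pattern space.
joined=$(printf 'D08/28/2000\nT-8.15\nN\nPCHECKCARD SUPERMARKET\n' | sed -n -e '/^D/ {
N
N
N
l
}')
printf '%s\n' "$joined"

# The NUM tags: an empty check number becomes "-", a real one loses its tags.
tags=$(printf 'NUMNUM\nNUM103NUM\n' | sed -e 's/NUMNUM/-/' -e 's/NUM\([0-9]*\)NUM/\1/')
printf '%s\n' "$tags"
```

Without the surrounding tags there would be nothing to match when the check number is empty, so there would be no place to drop in the "-".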


Finishing touches

The last line removes a comma following a number. This converts dollar amounts like "3,231.00" to "3231.00", which is the format I use. Now, it’s time to take a look at the final, production script:

1d

/^^/d

s/[[:cntrl:]]//g

/^D/ {

	s/^D\(.*\)/\1\tOUTY\tINNY\t/

	s/^01/Jan/

	s/^02/Feb/

	s/^03/Mar/

	s/^04/Apr/

	s/^05/May/

	s/^06/Jun/

	s/^07/Jul/

	s/^08/Aug/

	s/^09/Sep/

	s/^10/Oct/

	s/^11/Nov/

	s/^12/Dec/

	s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:

	N

	N

	N

	s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/

	s/NUMNUM/-/

	s/NUM\([0-9]*\)NUM/\1/

	s/\([0-9]\),/\1/

	/AMT-[0-9]*.[0-9]*AMT/b fixnegs

	s/AMT\(.*\)AMT/\1/

	s/OUTY/-/

	s/INNY/inco/

	b done

:fixnegs

	s/AMT-\(.*\)AMT/\1/

	s/OUTY/misc/

	s/INNY/-/

:done

}

The additional ten lines use substitution and some branching functionality to perfect the output. We’ll want to take a look at this line first:

        /AMT-[0-9]*.[0-9]*AMT/b fixnegs 

This line contains a branch command, which is of the format "/regexp/b label". If the pattern space matches the regexp, sed will branch to the fixnegs label. You should be able to easily spot this label, which appears as ":fixnegs" in the code. If the regexp doesn’t match, processing continues as normal with the next command.

Now that you understand the workings of the command itself, let’s take a look at the branches. If you look at the branch regular expression, you’ll see that it will match the string ‘AMT’, followed by a ‘-’, followed by any number of digits, a ‘.’, any number of digits and ‘AMT’. As I’m sure you’ve figured out, this regexp deals specifically with a negative dollar amount. Earlier, we surrounded our dollar amount with ‘AMT’ strings so we could easily find it later. Because the regexp only matches dollar amounts that begin with a ‘-’, our branch will only happen if we happen to be dealing with a debit. If we are dealing with a debit, OUTY should be set to ‘misc’, INNY should be set to ‘-’, and the negative sign in front of the debit amount should be removed. If you follow the code, you’ll see that this is exactly what happens. If the branch isn’t executed, OUTY gets replaced with ‘-’, and INNY gets replaced with ‘inco’. We’re finished! Our output line is now perfect:

28 Aug 2000	misc	-	-       Y     CHECKCARD SUPERMARKET  8.15
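To watch the production script handle both branches, here is a sketch that saves it to a file and runs it over a three-record sample. It assumes GNU sed (for ‘\t’ and for labels inside a block). The first record comes from the QIF listing above; the check and the deposit are made-up records added so that the check number, comma removal and income paths all get exercised.

```shell
workdir=$(mktemp -d)

# The final script from the article, saved as qiftrans.sed.
cat > "$workdir/qiftrans.sed" <<'EOF'
1d
/^^/d
s/[[:cntrl:]]//g
/^D/ {
s/^D\(.*\)/\1\tOUTY\tINNY\t/
s/^01/Jan/
s/^02/Feb/
s/^03/Mar/
s/^04/Apr/
s/^05/May/
s/^06/Jun/
s/^07/Jul/
s/^08/Aug/
s/^09/Sep/
s/^10/Oct/
s/^11/Nov/
s/^12/Dec/
s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:
N
N
N
s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/
s/NUMNUM/-/
s/NUM\([0-9]*\)NUM/\1/
s/\([0-9]\),/\1/
/AMT-[0-9]*.[0-9]*AMT/b fixnegs
s/AMT\(.*\)AMT/\1/
s/OUTY/-/
s/INNY/inco/
b done
:fixnegs
s/AMT-\(.*\)AMT/\1/
s/OUTY/misc/
s/INNY/-/
:done
}
EOF

# Sample input: one record from the article plus two hypothetical ones.
cat > "$workdir/sample.qif" <<'EOF'
!Type:Bank
D08/28/2000
T-8.15
N
PCHECKCARD SUPERMARKET
^
D08/25/2000
T-52.86
N103
PCheck 103
^
D08/24/2000
T1,500.00
N
PPAYCHECK
^
EOF

result=$(sed -f "$workdir/qiftrans.sed" "$workdir/sample.qif")
printf '%s\n' "$result"
```

The two debits come out tagged "misc" with the minus sign stripped, and the deposit comes out tagged "inco" with its comma removed.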

Don’t get confuSed

As you can see, converting data using sed isn’t all that hard, as long as you approach the problem incrementally. Don’t try to do everything with a single sed command, or all at once. Instead, gradually work your way toward the goal, and continue to enhance your sed script until your output looks just the way you want it to. Sed packs a lot of punch, and I hope that you’ve become very familiar with its inner workings and that you’ll continue to grow in your sed mastery!

Jul 09

Sed is a very useful (but often forgotten) UNIX stream editor. It’s ideal for batch-editing files or for creating shell scripts to modify existing files in powerful ways. This article builds on my previous article introducing sed.

Substitution!

Let’s look at one of sed’s most useful commands, the substitution command. Using it, we can replace a particular string or matched regular expression with another string. Here’s an example of the most basic use of this command:

$ sed -e 's/foo/bar/' myfile.txt

The above command will output the contents of myfile.txt to stdout, with the first occurrence of ‘foo’ (if any) on each line replaced with the string ‘bar’. Please note that I said first occurrence on each line, though this is normally not what you want. Normally, when I do a string replacement, I want to perform it globally. That is, I want to replace all occurrences on every line, as follows:

$ sed -e 's/foo/bar/g' myfile.txt

The additional ‘g’ option after the last slash tells sed to perform a global replace.
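The difference is easiest to see on a line with several matches. A minimal sketch, using made-up input from printf instead of myfile.txt:

```shell
# Without 'g', only the first match on the line is replaced.
first_only=$(printf 'foo foo foo\n' | sed -e 's/foo/bar/')

# With 'g', every match on the line is replaced.
global=$(printf 'foo foo foo\n' | sed -e 's/foo/bar/g')

echo "$first_only"
echo "$global"
```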

Here are a few other things you should know about the ‘s///’ substitution command. First, it is a command, and a command only; there are no addresses specified in any of the above examples. This means that the ‘s///’ command can also be used with addresses to control what lines it will be applied to, as follows:

$ sed -e '1,10s/enchantment/entrapment/g' myfile2.txt

The above example will cause all occurrences of the phrase ‘enchantment’ to be replaced with the phrase ‘entrapment’, but only on lines one through ten, inclusive.

$ sed -e '/^$/,/^END/s/hills/mountains/g' myfile3.txt

This example will swap ‘hills’ for ‘mountains’, but only on blocks of text beginning with a blank line, and ending with a line beginning with the three characters ‘END’, inclusive.
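A small sketch of the regular expression range, with sample text of my own. Only the "hills" between the blank line and the END line is changed; the ones before and after the block are left alone.

```shell
# Line 1 is outside the block, lines 2-4 are inside it, line 5 is after it.
ranged=$(printf 'hills\n\nhills\nEND of block\nhills\n' |
    sed -e '/^$/,/^END/s/hills/mountains/g')
printf '%s\n' "$ranged"
```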

Another nice thing about the ‘s///’ command is that we have a lot of options when it comes to those ‘/’ separators. If we’re performing string substitution and the regular expression or replacement string has a lot of slashes in it, we can change the separator by specifying a different character after the ‘s’. For example, this will replace all occurrences of /usr/local with /usr:

$ sed -e 's:/usr/local:/usr:g' mylist.txt

In this example, we’re using the colon as a separator. If you ever need to specify the separator character in the regular expression, put a backslash before it.
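Both points can be checked quickly. The input strings below are made up for the sketch:

```shell
# Colon as the separator, so the slashes in the paths need no escaping.
paths=$(printf '/usr/local/bin /usr/local/lib\n' | sed -e 's:/usr/local:/usr:g')
echo "$paths"

# A literal colon inside the regular expression must be backslashed.
pair=$(printf 'key:value\n' | sed -e 's:key\:value:pair:')
echo "$pair"
```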


Regexp snafus

Up until now, we’ve only performed simple string substitution. While this is handy, we can also match a regular expression. For example, the following sed command will match a phrase beginning with ‘<’ and ending with ‘>’, and containing any number of characters in between. This phrase will be deleted (replaced with an empty string):

$ sed -e 's/<.*>//g' myfile.html

This is a good first attempt at a sed script that will remove HTML tags from a file, but it won’t work well, due to a regular expression quirk. The reason? When sed tries to match the regular expression on a line, it finds the longest match on the line. This wasn’t an issue in my previous sed article, because we were using the ‘d’ and ‘p’ commands, which would delete or print the entire line anyway. But when we use the ‘s///’ command, it definitely makes a big difference, because the entire portion that the regular expression matches will be replaced with the target string, or in this case, deleted. This means that the above example will turn the following line:

<b>This</b> is what <b>I</b> meant.

into this:

meant.

rather than this, which is what we wanted to do:

This is what I meant.

Fortunately, there is an easy way to fix this. Instead of typing in a regular expression that says "a ‘<’ character followed by any number of characters, and ending with a ‘>’ character", we just need to type in a regexp that says "a ‘<’ character followed by any number of non-‘>’ characters, and ending with a ‘>’ character". This will have the effect of matching the shortest possible match, rather than the longest possible one. The new command looks like this:

$ sed -e 's/<[^>]*>//g' myfile.html

In the above example, the ‘[^>]’ specifies a "non-‘>’" character, and the ‘*’ after it completes this expression to mean "zero or more non-‘>’ characters". Test this command on a few sample html files, pipe them to more, and review their results.
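Here are both versions side by side on the sample line, as a quick sketch. Note that the greedy version leaves the space that preceded "meant." in place, since that space sits after the last ‘>’.

```shell
line='<b>This</b> is what <b>I</b> meant.'

# Greedy: everything from the first '<' to the last '>' is removed.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Bracket expression: each tag is matched and removed on its own.
fixed=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf '[%s]\n' "$greedy"
printf '[%s]\n' "$fixed"
```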


More character matching

The ‘[ ]’ regular expression syntax has some additional options. To specify a range of characters, you can use a ‘-’ as long as it isn’t in the first or last position, as follows:

'[a-x]*'

This will match zero or more characters, as long as all of them are ‘a’, ‘b’, ‘c’ ... ‘v’, ‘w’, ‘x’. In addition, the ‘[:space:]’ character class is available for matching whitespace. Here’s a fairly complete list of available character classes:

Character class    Description
[:alnum:]          Alphanumeric [a-z A-Z 0-9]
[:alpha:]          Alphabetic [a-z A-Z]
[:blank:]          Spaces or tabs
[:cntrl:]          Any control characters
[:digit:]          Numeric digits [0-9]
[:graph:]          Any visible characters (no whitespace)
[:lower:]          Lower-case [a-z]
[:print:]          Non-control characters
[:punct:]          Punctuation characters
[:space:]          Whitespace
[:upper:]          Upper-case [A-Z]
[:xdigit:]         Hex digits [0-9 a-f A-F]

It’s advantageous to use character classes whenever possible, because they adapt better to non-English-speaking locales (including accented characters when necessary, etc.).
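One detail worth seeing in practice: a character class goes inside a bracket expression, so the brackets end up doubled, as in ‘[[:space:]]’. A minimal sketch with made-up input:

```shell
# Squeeze every run of spaces and tabs down to a single space.
squeezed=$(printf 'a  b\t\tc\n' | sed -e 's/[[:space:]][[:space:]]*/ /g')
echo "$squeezed"

# Mask every digit with a '#'.
masked=$(printf 'Room 101, Floor 3\n' | sed -e 's/[[:digit:]]/#/g')
echo "$masked"
```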


Advanced substitution stuff

We’ve looked at how to perform simple and even reasonably complex straight substitutions, but sed can do even more. We can actually refer to either parts of or the entire matched regular expression, and use these parts to construct the replacement string. As an example, let’s say you were replying to a message. The following example would prefix each line with the phrase "ralph said: ":

$ sed -e 's/.*/ralph said: &/' origmsg.txt

The output will look like this:

ralph said: Hiya Jim,

ralph said:

ralph said: I sure like this sed stuff!

ralph said:

In this example, we use the ‘&’ character in the replacement string, which tells sed to insert the entire matched regular expression. So, whatever was matched by ‘.*’ (the largest group of zero or more characters on the line, or the entire line) can be inserted anywhere in the replacement string, even multiple times. This is great, but sed is even more powerful.
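A short sketch of ‘&’ in action, using a three-line message fed in with printf instead of origmsg.txt. The second command is my own addition to show ‘&’ being used twice in one replacement.

```shell
# Prefix every line, including the empty one, with the quoted author.
quoted=$(printf 'Hiya Jim,\n\nI sure like this sed stuff!\n' |
    sed -e 's/.*/ralph said: &/')
printf '%s\n' "$quoted"

# '&' may appear more than once in the replacement string.
doubled=$(printf 'echo\n' | sed -e 's/.*/& &/')
echo "$doubled"
```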


Those wonderful backslashed parentheses

Even better than ‘&’, the ‘s///’ command allows us to define regions in our regular expression, and we can refer to these specific regions in our replacement string. As an example, let’s say we have a file that contains the following text:

foo bar oni

eeny meeny miny

larry curly moe

jimmy the weasel

Now, let’s say we wanted to write a sed script that would replace "eeny meeny miny" with "Victor eeny-meeny Von miny", etc. To do this, first we would write a regular expression that would match the three strings, separated by spaces:

'.* .* .*'

There. Now, we will define regions by inserting backslashed parentheses around each region of interest:

'\(.*\) \(.*\) \(.*\)'

This regular expression will work the same as our first one, except that it will define three logical regions that we can refer to in our replacement string. Here’s the final script:

$ sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/' myfile.txt

As you can see, we refer to each parentheses-delimited region by typing ‘\x’, where x is the number of the region, starting at one. Output is as follows:

Victor foo-bar Von oni

Victor eeny-meeny Von miny

Victor larry-curly Von moe

Victor jimmy-the Von weasel

As you become more familiar with sed, you will be able to perform fairly powerful text processing with a minimum of effort. You may want to think about how you’d have approached this problem using your favorite scripting language — could you have easily fit the solution in one line?
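The sketch below reruns the example on the four sample lines. It then tries a four-word line, which is my own addition, to show a consequence of greedy matching: the first region grabs as much as it can, so the extra word ends up in ‘\1’.

```shell
names=$(printf 'foo bar oni\neeny meeny miny\nlarry curly moe\njimmy the weasel\n' |
    sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')
printf '%s\n' "$names"

# With four words, \1 becomes "a b", \2 becomes "c" and \3 becomes "d".
four=$(printf 'a b c d\n' | sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')
echo "$four"
```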


Mixing things up

As we begin creating more complex sed scripts, we need the ability to enter more than one command. There are several ways to do this. First, we can use semicolons between the commands. For example, this series of commands uses the ‘=’ command, which tells sed to print the line number, as well as the ‘p’ command, which explicitly tells sed to print the line (since we’re in ‘-n’ mode):

$ sed -n -e '=;p' myfile.txt

Whenever two or more commands are specified, each command is applied (in order) to every line in the file. In the above example, first the ‘=’ command is applied to line 1, and then the ‘p’ command is applied. Then, sed proceeds to line 2, and repeats the process. While the semicolon is handy, there are instances where it won’t work. Another alternative is to use two -e options to specify two separate commands:

$ sed -n -e '=' -e 'p' myfile.txt

However, when we get to the more complex append and insert commands, even multiple ‘-e’ options won’t help us. For complex multiline scripts, the best way is to put your commands in a separate file. Then, reference this script file with the -f option:

$ sed -n -f mycommands.sed myfile.txt

This method, although arguably less convenient, will always work.
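All three ways of passing the same two commands should give identical output. A quick sketch to confirm it, using a temporary directory and a two-line input of my own:

```shell
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/myfile.txt"

# 1. Semicolon-separated commands in one -e option.
semi=$(sed -n -e '=;p' "$workdir/myfile.txt")

# 2. One -e option per command.
multi=$(sed -n -e '=' -e 'p' "$workdir/myfile.txt")

# 3. One command per line in a script file, loaded with -f.
printf '=\np\n' > "$workdir/mycommands.sed"
fromfile=$(sed -n -f "$workdir/mycommands.sed" "$workdir/myfile.txt")

printf '%s\n' "$semi"
```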


Multiple commands for one address

Sometimes, you may want to specify multiple commands that will apply to a single address. This comes in especially handy when you are performing lots of ‘s///’ substitutions to transform words or syntax in the source file. To perform multiple commands per address, enter your sed commands in a file, and use the ‘{ }’ characters to group commands, as follows:

1,20{

	s/[Ll]inux/GNU\/Linux/g

	s/samba/Samba/g

	s/posix/POSIX/g

}

The above example will apply three substitution commands to lines 1 through 20, inclusive. You can also use regular expression addresses, or a combination of the two:

1,/^END/{

        s/[Ll]inux/GNU\/Linux/g 

        s/samba/Samba/g 

        s/posix/POSIX/g 

	p

}

This example will apply all the commands between ‘{ }’ to the lines starting at 1 and up to a line beginning with the letters "END", or the end of file if "END" is not found in the source file.
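Here is a sketch of the grouped form in a script file, run against three made-up lines. The file name fixnames.sed is hypothetical, and the ‘p’ command from the second example is left out so that each line is printed only once. The line after END shows that the group stops applying once the range has closed.

```shell
workdir=$(mktemp -d)

# Group three substitutions under one address range.
cat > "$workdir/fixnames.sed" <<'EOF'
1,/^END/{
s/[Ll]inux/GNU\/Linux/g
s/samba/Samba/g
s/posix/POSIX/g
}
EOF

grouped=$(printf 'linux and samba\nEND\nlinux and posix\n' |
    sed -f "$workdir/fixnames.sed")
printf '%s\n' "$grouped"
```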


Append, insert, and change line

Now that we’re writing sed scripts in separate files, we can take advantage of the append, insert, and change line commands. These commands will insert a line after the current line, insert a line before the current line, or replace the current line in the pattern space. They can also be used to insert multiple lines into the output. The insert line command is used as follows:

i\
This line will be inserted before each line

If you don’t specify an address for this command, it will be applied to each line and produce output that looks like this:

This line will be inserted before each line

line 1 here

This line will be inserted before each line

line 2 here

This line will be inserted before each line

line 3 here

This line will be inserted before each line

line 4 here

If you’d like to insert multiple lines before the current line, you can add additional lines by appending a backslash to the previous line, like so:

i\
insert this line\
and this one\
and this one\
and, uh, this one too.

The append command works similarly, but will insert a line or lines after the current line in the pattern space. It’s used as follows:

a\
insert this line after each line.  Thanks! :)

On the other hand, the "change line" command will actually replace the current line in the pattern space, and is used as follows:

c\
You're history, original line! Muhahaha!

Because the append, insert, and change line commands need to be entered on multiple lines, you’ll want to type them into sed script files and tell sed to read them by using the ‘-f’ option. Using the other methods to pass commands to sed will result in problems.
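The three commands can be combined with addresses in one script file. In this sketch, the file name edits.sed, the addresses and the sample text are all my own: a header goes before line 1, a footer goes after the last line, and any line starting with "old" is replaced outright.

```shell
workdir=$(mktemp -d)

# The quoted heredoc keeps the backslashes and the '$' address literal.
cat > "$workdir/edits.sed" <<'EOF'
1i\
HEADER
/^old/c\
replaced line
$a\
FOOTER
EOF

edited=$(printf 'old text\nkeep me\n' | sed -f "$workdir/edits.sed")
printf '%s\n' "$edited"
```

Note that the text to be inserted sits on its own line after the backslash, with nothing in between.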


Next time

Next time, in the final article of this series on sed, I’ll show you lots of excellent real-world examples of using sed for many different kinds of tasks. Not only will I show you what the scripts do, but why they do what they do. After you’re done, you’ll have additional excellent ideas of how to use sed in your various projects. I’ll see you then!

Jul 07

Pick an editor

In the UNIX world, we have a lot of options when it comes to editing files. Think of it — vi, emacs, and jed come to mind, as well as many others. We all have our favorite editor (along with our favorite keybindings) that we have come to know and love. With our trusty editor, we are ready to tackle any number of UNIX-related administration or programming tasks with ease.

While interactive editors are great, they do have limitations. Though their interactive nature can be a strength, it can also be a weakness. Consider a situation where you need to perform similar types of changes on a group of files. You could instinctively fire up your favorite editor and perform a bunch of mundane, repetitive, and time-consuming edits by hand. But there’s a better way.


Enter sed

It would be nice if we could automate the process of making edits to files, so that we could "batch" edit files, or even write scripts with the ability to perform sophisticated changes to existing files. Fortunately for us, for these types of situations, there is a better way — and the better way is called "sed".

sed is a lightweight stream editor that’s included with nearly all UNIX flavors, including Linux. sed has a lot of nice features. First of all, it’s very lightweight, typically many times smaller than your favorite scripting language. Secondly, because sed is a stream editor, it can perform edits to data it receives from stdin, such as from a pipeline. So, you don’t need to have the data to be edited stored in a file on disk. Because data can just as easily be piped to sed, it’s very easy to use sed as part of a long, complex pipeline in a powerful shell script. Try doing that with your favorite editor.


GNU sed

Fortunately for us Linux users, one of the nicest versions of sed out there happens to be GNU sed, which is currently at version 3.02. Every Linux distribution has GNU sed, or at least should. GNU sed is popular not only because its sources are freely distributable, but because it happens to have a lot of handy, time-saving extensions to the POSIX sed standard. GNU sed also doesn’t suffer from many of the limitations that earlier and proprietary versions of sed had, such as a limited line length — GNU sed handles lines of any length with ease.


The newest GNU sed

While researching this article, I noticed that several online sed aficionados made reference to a GNU sed 3.02a. Strangely, I couldn’t find sed 3.02a on ftp.gnu.org, so I had to go look for it elsewhere. I found it at alpha.gnu.org, in /pub/sed. I happily downloaded it, compiled it, and installed it, only to find minutes later that the most recent version of sed is 3.02.80, and you can find its sources right next to those for 3.02a, at alpha.gnu.org. After getting GNU sed 3.02.80 installed, I was finally ready to go.

The right sed

In this series, we will be using GNU sed 3.02.80. Some (but very few) of the most advanced examples you’ll find in my upcoming, follow-on articles in this series will not work with GNU sed 3.02 or 3.02a. If you’re using a non-GNU sed, your results may vary. Why not take some time to install GNU sed 3.02.80 now? Then, not only will you be ready for the rest of the series, but you’ll also be able to use arguably the best sed in existence!


Sed examples

Sed works by performing any number of user-specified editing operations ("commands") on the input data. Sed is line-based, so the commands are performed on each line in order. And, sed writes its results to standard output (stdout); it doesn’t modify any input files.

Let’s look at some examples. The first several are going to be a bit weird because I’m using them to illustrate how sed works rather than to perform any useful task. However, if you’re new to sed, it’s very important that you understand them. Here’s our first example:

$ sed -e 'd' /etc/services

If you type this command, you’ll get absolutely no output. Now, what happened? In this example, we called sed with one editing command, ‘d’. Sed opened the /etc/services file, read a line into its pattern buffer, performed our editing command ("delete line"), and then printed the pattern buffer (which was empty). It then repeated these steps for each successive line. This produced no output, because the "d" command zapped every single line in the pattern buffer!

There are a couple of things to notice in this example. First, /etc/services was not modified at all. This is because, again, sed only reads from the file you specify on the command line, using it as input — it doesn’t try to modify the file. The second thing to notice is that sed is line-oriented. The ‘d’ command didn’t simply tell sed to delete all incoming data in one fell swoop. Instead, sed read each line of /etc/services one by one into its internal buffer, called the pattern buffer. Once a line was read into the pattern buffer, it performed the ‘d’ command and printed the contents of the pattern buffer (nothing in this example). Later, I’ll show you how to use address ranges to control which lines a command is applied to — but in the absence of addresses, a command is applied to all lines.

The third thing to notice is the use of single quotes to surround the ‘d’ command. It’s a good idea to get into the habit of using single quotes to surround your sed commands, so that shell expansion is disabled.
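If you'd rather not experiment on /etc/services, here's a minimal sketch of the same experiment on a throwaway file. The file name sample.txt and its contents are made up for illustration; the point is simply that ‘d’ produces no output while the input file stays intact.

```shell
# Build a small throwaway input file (name and contents are made up).
tmpdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$tmpdir/sample.txt"

# 'd' deletes every line from the pattern buffer, so nothing is printed.
deleted=$(sed -e 'd' "$tmpdir/sample.txt")
echo "output of d: [$deleted]"

# The input file itself is untouched: it still has three lines.
remaining=$(wc -l < "$tmpdir/sample.txt" | tr -d ' ')
echo "lines still in file: $remaining"

rm -rf "$tmpdir"
```

The command is wrapped in single quotes here too, so the shell passes it to sed unchanged.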


Another sed example

Here’s an example of how to use sed to remove the first line of the /etc/services file from our output stream:

$ sed -e '1d' /etc/services | more

As you can see, this command is very similar to our first ‘d’ command, except that it is preceded by a ‘1’. If you guessed that the ‘1’ refers to line number one, you’re right. While in our first example, we used ‘d’ by itself, this time we use the ‘d’ command preceded by an optional numerical address. By using addresses, you can tell sed to perform edits only on a particular line or lines.


Address ranges

Now, let’s look at how to specify an address range. In this example, sed will delete lines 1-10 of the output:

$ sed -e '1,10d' /etc/services | more

When we separate two addresses by a comma, sed will apply the following command to the range that starts with the first address, and ends with the second address. In this example, the ‘d’ command was applied to lines 1-10, inclusive. All other lines were left alone and printed as usual.
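To see single-line and range addresses side by side, here's a small sketch that feeds sed twelve numbered lines instead of a real file. The variable names are my own and are only there so the results can be inspected.

```shell
# Twelve numbered lines stand in for a real file.
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12)

# Delete only the first line; lines 2-12 remain.
without_first=$(printf '%s\n' "$numbers" | sed -e '1d')

# Delete lines 1 through 10, inclusive; only 11 and 12 remain.
without_range=$(printf '%s\n' "$numbers" | sed -e '1,10d')

echo "$without_range"
```

The last command should print just the lines 11 and 12, since everything inside the range was deleted.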


Addresses with regular expressions

Now, it’s time for a more useful example. Let’s say you wanted to view the contents of your /etc/services file, but you aren’t interested in viewing any of the included comments. As you know, you can place comments in your /etc/services file by starting the line with the ‘#’ character. To avoid comments, we’d like sed to delete lines that start with a ‘#’. Here’s how to do it:

$ sed -e '/^#/d' /etc/services | more

Try this example and see what happens. You’ll notice that sed performs its desired task with flying colors. Now, let’s figure out what happened.

To understand the ‘/^#/d’ command, we first need to dissect it. First, let’s remove the ‘d’ — we’re using the same delete line command that we’ve used previously. The new addition is the ‘/^#/’ part, which is a new kind of regular expression address. Regular expression addresses are always surrounded by slashes. They specify a pattern, and the command that immediately follows a regular expression address will only be applied to a line if it happens to match this particular pattern.

So, ‘/^#/’ is a regular expression. But what does it do? Obviously, this would be a good time for a regular expression refresher.


Regular expression refresher

We can use regular expressions to express patterns that we may find in the text. If you’ve ever used the ‘*’ character on the shell command line, you’ve used something that’s similar, but not identical to, regular expressions. Here are the special characters that you can use in regular expressions:

Character Description
^ Matches the beginning of the line
$ Matches the end of the line
. Matches any single character
* Will match zero or more occurrences of the previous character
[ ] Matches any one of the characters inside the [ ]

Probably the best way to get your feet wet with regular expressions is to see a few examples. All of these examples will be accepted by sed as valid addresses to appear on the left side of a command. Here are a few:

Regular expression Description
/./ Will match any line that contains at least one character
/../ Will match any line that contains at least two characters
/^#/ Will match any line that begins with a ‘#’
/^$/ Will match all blank lines
/}$/ Will match any line that ends with ‘}’ (no spaces)
/} *$/ Will match any line ending with ‘}’ followed by zero or more spaces
/[abc]/ Will match any line that contains a lowercase ‘a’, ‘b’, or ‘c’
/^[abc]/ Will match any line that begins with an ‘a’, ‘b’, or ‘c’

I encourage you to try several of these examples. Take some time to get familiar with regular expressions, and try a few regular expressions of your own creation. You can use a regexp this way:

$ sed -e '/regexp/d' /path/to/my/test/file | more

This will cause sed to delete any matching lines. However, it may be easier to get familiar with regular expressions by telling sed to print regexp matches, and delete non-matches, rather than the other way around. This can be done with the following command:

$ sed -n -e '/regexp/p' /path/to/my/test/file | more

Note the new ‘-n’ option, which tells sed to not print the pattern space unless explicitly commanded to do so. You’ll also notice that we’ve replaced the ‘d’ command with the ‘p’ command, which as you might guess, explicitly commands sed to print the pattern space. Voila, now only matches will be printed.
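Here's a short sketch that puts the delete and print forms next to each other. The five sample lines are a made-up stand-in for a file like /etc/services, so you can predict exactly what each regular expression address should select.

```shell
# Made-up sample text: two comments, two entries and one blank line.
sample=$(printf '%s\n' '# network services' 'ftp 21/tcp' '' 'ssh 22/tcp' '# end')

# Delete the comment lines; everything else is printed as usual.
no_comments=$(printf '%s\n' "$sample" | sed -e '/^#/d')

# Print only the matches: -n silences default printing, p prints explicitly.
only_comments=$(printf '%s\n' "$sample" | sed -n -e '/^#/p')

# Count the blank lines by printing only lines that match /^$/.
blank_count=$(printf '%s\n' "$sample" | sed -n -e '/^$/p' | wc -l | tr -d ' ')

echo "$only_comments"
echo "blank lines: $blank_count"
```

Swapping ‘/^#/’ for any other expression from the table is an easy way to check your understanding of it.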


More on addresses

Up till now, we’ve taken a look at line addresses, line range addresses, and regexp addresses. But there are even more possibilities. We can specify two regular expressions separated by a comma, and sed will match all lines starting from the first line that matches the first regular expression, up to and including the line that matches the second regular expression. For example, the following command will print out a block of text that begins with a line containing "BEGIN", and ending with a line that contains "END":

$ sed -n -e '/BEGIN/,/END/p' /my/test/file | more

If "BEGIN" isn’t found, no data will be printed. And, if "BEGIN" is found, but no "END" is found on any line below it, all subsequent lines will be printed. This happens because of sed’s stream-oriented nature — it doesn’t know whether or not an "END" will appear.
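The sketch below illustrates both cases with made-up input: one complete BEGIN/END block, followed by a second "BEGIN" that never gets its "END".

```shell
# Made-up input: a complete block, a stray line, then an unterminated block.
input=$(printf '%s\n' 'header' 'BEGIN' 'inside' 'END' 'between' 'BEGIN' 'tail')

# Print only the lines that fall inside a BEGIN..END range.
block=$(printf '%s\n' "$input" | sed -n -e '/BEGIN/,/END/p')

echo "$block"
```

This should print BEGIN, inside, END, BEGIN and tail. The "header" and "between" lines fall outside any range, while the second range runs to the end of the input because no closing "END" ever shows up.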


C source example

If you want to print out only the main() function in a C source file, you could type:

$ sed -n -e '/main[[:space:]]*(/,/^}/p' sourcefile.c | more

This command has two regular expressions, ‘/main[[:space:]]*(/’ and ‘/^}/’, and one command, ‘p’. The first regular expression will match the string "main" followed by any number of spaces or tabs, followed by an open parenthesis. This should match the start of your average ANSI C main() declaration.

In this particular regular expression, we encounter the ‘[[:space:]]’ character class. This is simply a special keyword that tells sed to match a whitespace character such as a TAB or a space. If you wanted, instead of typing ‘[[:space:]]’, you could have typed ‘[’, then a literal space, then Control-V, then a literal tab and a ‘]’. The Control-V tells bash that you want to insert a "real" tab rather than perform command completion. It’s clearer, especially in scripts, to use the ‘[[:space:]]’ character class.

OK, now on to the second regexp. ‘/^}/’ will match a ‘}’ character that appears at the beginning of a new line. If your code is formatted nicely, this will match the closing brace of your main() function. If it’s not, it won’t, which is one of the tricky things about performing pattern matching.

The ‘p’ command does what it always does, explicitly telling sed to print out the line, since we are in ‘-n’ quiet mode. Try running the command on a C source file — it should output the entire main() { } block, including the initial "main()" and the closing ‘}’.
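If you don't have a C file handy, here's a sketch that writes a tiny one and runs the command on it. The file hello.c and its helper() function are invented for this example; note that helper() also ends with a ‘}’ at the start of a line, but it comes before the range opens, so it's ignored.

```shell
tmpdir=$(mktemp -d)

# A tiny made-up C file with one function before main().
cat > "$tmpdir/hello.c" <<'EOF'
#include <stdio.h>

static int helper(void)
{
    return 0;
}

int main (void)
{
    printf("hello\n");
    return helper();
}
EOF

# Print from the line containing "main (" through the next line starting with }.
main_block=$(sed -n -e '/main[[:space:]]*(/,/^}/p' "$tmpdir/hello.c")
echo "$main_block"

rm -rf "$tmpdir"
```

The output should be the five lines of main(), from "int main (void)" down to its closing brace.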


Next time

Now that we’ve touched on the basics, we’ll be picking up the pace for the next two articles. If you’re in the mood for some meatier sed material, be patient — it’s coming! In the meantime, you might want to check out the following sed and regular expression resources.

Jul 05

Formatting output

While awk’s print statement does do the job most of the time, sometimes more is needed. For those times, awk offers two good old friends called printf() and sprintf(). Yes, these functions, like so many other awk parts, are identical to their C counterparts. printf() will print a formatted string to stdout, while sprintf() returns a formatted string that can be assigned to a variable. If you’re not familiar with printf() and sprintf(), an introductory C text will quickly get you up to speed on these two essential printing functions. You can view the printf() man page by typing "man 3 printf" on your Linux system.

Here’s some sample awk sprintf() and printf() code. As you can see, everything looks almost identical to C.

x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=sprintf("%s-%d",b,x)
print myout

This code will print:

Jim got a 83 on the last test
foo-1
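For a version you can paste straight into a shell, here's a small sketch that runs similar code from a BEGIN block. The width and padding specifiers (%03d, %-5s, %6.2f) are my own additions to show that the usual C format modifiers work too.

```shell
report=$(awk 'BEGIN {
	x = 1
	b = "foo"
	# sprintf() returns the formatted string instead of printing it
	myout = sprintf("%s-%03d", b, x)
	print myout
	# printf() prints directly; %-5s left-justifies, %6.2f pads to 6 columns
	printf("%-5s|%6.2f|\n", "Jim", 83.5)
}')

echo "$report"
```

This should print "foo-001" on the first line and "Jim  | 83.50|" on the second.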

String functions

Awk has a plethora of string functions, and that’s a good thing. In awk, you really need string functions, since you can’t treat a string as an array of characters as you can in other languages like C, C++, and Python. For example, if you execute the following code:

mystring="How are you doing today?"
print mystring[3]

You’ll receive an error that looks something like this:

awk: string.gawk:59: fatal: attempt to use scalar as array

Oh, well. While not as convenient as Python’s sequence types, awk’s string functions get the job done. Let’s take a look at them.

First, we have the basic length() function, which returns the length of a string. Here’s how to use it:

print length(mystring)

This code will print the value:

24

OK, let’s keep going. The next string function is called index, and will return the position of the occurrence of a substring in another string, or it will return 0 if the string isn’t found. Using mystring, we can call it this way:

print index(mystring,"you")

Awk prints:

9

We move on to two more easy functions, tolower() and toupper(). As you might guess, these functions will return the string with all characters converted to lowercase or uppercase respectively. Notice that tolower() and toupper() return the new string, and don’t modify the original. This code:

print tolower(mystring)
print toupper(mystring)
print mystring

...will produce this output:

how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?

So far so good, but how exactly do we select a substring or even a single character from a string? That’s where substr() comes in. Here’s how to call substr():

mysub=substr(mystring,startpos,maxlen)

mystring should be either a string variable or a literal string from which you’d like to extract a substring. startpos should be set to the starting character position, and maxlen should contain the maximum length of the string you’d like to extract. Notice that I said maximum length; if length(mystring) is shorter than startpos+maxlen, your result will be truncated. substr() won’t modify the original string, but returns the substring instead. Here’s an example:

print substr(mystring,9,3)

Awk will print:

you

If you regularly program in a language that uses array indices to access parts of a string (and who doesn’t), make a mental note that substr() is your awk substitute. You’ll need to use it to extract single characters and substrings; because awk is a string-based language, you’ll be using it often.

Now, we move on to some meatier functions, the first of which is called match(). match() is a lot like index(), except instead of searching for a substring like index() does, it searches for a regular expression. The match() function will return the starting position of the match, or zero if no match is found. In addition, match() will set two variables called RSTART and RLENGTH. RSTART contains the return value (the location of the first match), and RLENGTH specifies its span in characters (or -1 if no match was found). Using RSTART, RLENGTH, substr(), and a small loop, you can easily iterate through every match in your string. Here’s an example match() call:

print match(mystring,/you/), RSTART, RLENGTH

Awk will print:

9 9 3
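Here's one way the "small loop" mentioned above might look. It's only a sketch: the regular expression /o[a-z]/ and the variable names rest and offset are my own choices, and the loop assumes a pattern that can never match an empty string (otherwise it would not advance).

```shell
matches=$(awk 'BEGIN {
	mystring = "How are you doing today?"
	rest = mystring
	offset = 0
	# Each pass finds the leftmost match in the part not yet scanned
	while (match(rest, /o[a-z]/)) {
		# RSTART is relative to rest, so add what was already consumed
		print offset + RSTART, substr(rest, RSTART, RLENGTH)
		offset += RSTART + RLENGTH - 1
		rest = substr(rest, RSTART + RLENGTH)
	}
}')

echo "$matches"
```

Each output line shows the position of a match in the original string followed by the matched text: 2 ow, 10 ou, 14 oi and 20 od.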

String substitution

Now, we’re going to look at a couple of string substitution functions, sub() and gsub(). These guys differ slightly from the functions we’ve looked at so far in that they actually modify the original string. Here’s a template that shows how to call sub():

sub(regexp,replstring,mystring)

When you call sub(), it’ll find the first sequence of characters in mystring that matches regexp, and it’ll replace that sequence with replstring. sub() and gsub() have identical arguments; the only way they differ is that sub() will replace the first regexp match (if any), and gsub() will perform a global replace, swapping out all matches in the string. Here’s an example sub() and gsub() call:

sub(/o/,"O",mystring)
print mystring
mystring="How are you doing today?"
gsub(/o/,"O",mystring)
print mystring

We had to reset mystring to its original value because the first sub() call modified mystring directly. When executed, this code will cause awk to output:

HOw are you doing today?
HOw are yOu dOing tOday?

Of course, more complex regular expressions are possible. I’ll leave it up to you to test out some complicated regexps.

We wrap up our string function coverage by introducing you to a function called split(). split()’s job is to "chop up" a string and place the various parts into an integer-indexed array. Here’s an example split() call:

numelements=split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",mymonths,",")

When calling split(), the first argument contains the literal string or string variable to be chopped. In the second argument, you should specify the name of the array that split() will stuff the chopped parts into. In the third argument, specify the separator that will be used to chop the strings up. When split() returns, it’ll return the number of string elements that were split. split() assigns each one to an array index starting with one, so the following code:

print mymonths[1],mymonths[numelements]

...will print:

Jan Dec
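Because split() returns the number of pieces, it pairs naturally with a counting loop. Here's a brief sketch using a made-up date string; the array name datebits is just for illustration.

```shell
parts=$(awk 'BEGIN {
	# Split a made-up date string on spaces
	n = split("23 Aug 2000", datebits, " ")
	print n, datebits[2]
	# The return value makes a numeric loop over the pieces easy
	for (i = 1; i <= n; i++)
		print i, datebits[i]
}')

echo "$parts"
```

The first line of output should be "3 Aug", followed by each index and its piece in order.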

Special string forms

A quick note — when calling length(), sub(), or gsub(), you can drop the last argument and awk will apply the function call to $0 (the entire current line). To print the length of each line in a file, use this awk script:


{
	print length()
}

Financial fun

A few weeks ago, I decided to write my own checkbook balancing program in awk. I decided that I’d like to have a simple tab-delimited text file into which I can enter my most recent deposits and withdrawals. The idea was to hand this data to an awk script that would automatically add up all the amounts and tell me my balance. Here’s how I decided to record all my transactions into my "ASCII checkbook":

23 Aug 2000	food	-	-	Y	Jimmy's Buffet		30.25

Every field in this file is separated by one or more tabs. After the date (field 1, $1), there are two fields called "expense category" and "income category". When I’m entering an expense like on the above line, I put a four-letter nickname in the exp field, and a "-" (blank entry) in the inc field. This signifies that this particular item is a "food expense" :) Here’s what a deposit looks like:

23 Aug 2000	-	inco	-	Y	Boss Man		2001.00

In this case, I put a "-" (blank) in the exp category, and put "inco" in the inc category. "inco" is my nickname for generic (paycheck-style) income. Using category nicknames allows me to generate a breakdown of my income and expenditures by category. As for the rest of the record, the other fields are fairly self-explanatory. The cleared? field ("Y" or "N") records whether the transaction has been posted to my account; beyond that, there’s a transaction description, and a positive dollar amount.

The algorithm used to compute the current balance isn’t too hard. Awk simply needs to read in each line, one by one. If an expense category is listed but there is no income category (it’s "-"), then this item is a debit. If an income category is listed, but no expense category ("-") is there, then the dollar amount is a credit. And, if there is both an expense and income category listed, then this amount is a "category transfer"; that is, the dollar amount will be subtracted from the expense category and added to the income category. Again, all these categories are virtual, but are very useful for tracking income and expenditures, as well as for budgeting.


The code

Time to look at the code. We’ll start off with the first line, the BEGIN block and a function definition:

balance, part 1

#!/usr/bin/env awk -f
BEGIN {
	FS="\t+"
	months="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
}

function monthdigit(mymonth) {
	return (index(months,mymonth)+3)/4
}

Adding the first "#!…" line to any awk script will allow it to be directly executed from the shell, provided that you "chmod +x myscript" first. (On systems such as Linux that hand everything after the interpreter path to it as a single argument, "env awk -f" may fail; in that case, point the line directly at your awk binary, for example "#!/usr/bin/awk -f".) The remaining lines define our BEGIN block, which gets executed before awk starts processing our checkbook file. We set FS (the field separator) to "\t+", which tells awk that the fields will be separated by one or more tabs. In addition, we define a string called months that’s used by our monthdigit() function, which appears next.

The last three lines show you how to define your own awk function. The format is simple — type "function", then the function name, and then the parameters separated by commas, inside parentheses. After this, a "{ }" code block contains the code that you’d like this function to execute. All functions can access global variables (like our months variable). In addition, awk provides a "return" statement that allows the function to return a value, and operates similarly to the "return" found in C, Python, and other languages. This particular function converts a month name in a 3-letter string format into its numeric equivalent. For example, this:

print monthdigit("Mar")

...will print this:

3
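The arithmetic in monthdigit() works because every month name plus its trailing space takes up four characters in the months string. Here's a small sketch that checks a few values, including a made-up name ("Foo") that isn't in the list, to show what happens when index() returns 0.

```shell
digits=$(awk '
function monthdigit(mymonth) {
	return (index(months, mymonth) + 3) / 4
}
BEGIN {
	months = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
	# "Jan" starts at position 1, "Feb" at 5, "Mar" at 9, and so on
	print monthdigit("Jan"), monthdigit("Aug"), monthdigit("Dec")
	# A name that is not in the list makes index() return 0, giving 0.75
	print monthdigit("Foo")
}')

echo "$digits"
```

This should print "1 8 12" followed by "0.75", so a non-integer result is a handy sign that the month name wasn't recognized.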

Now, let’s move on to some more functions.


Financial functions

Here are three more functions that perform the bookkeeping for us. Our main code block, which we’ll see soon, will process each line of the checkbook file sequentially, calling one of these functions so that the appropriate transactions are recorded in an awk array. There are three basic kinds of transactions, credit (doincome), debit (doexpense) and transfer (dotransfer). You’ll notice that all three functions accept one argument, called mybalance. mybalance is a placeholder for a two-dimensional array, which we’ll pass in as an argument. Up until now, we haven’t dealt with two-dimensional arrays; however, as you can see below, the syntax is quite simple. Just separate each dimension with a comma, and you’re in business.

We’ll record information into "mybalance" as follows. The first dimension of the array ranges from 0 to 12, and specifies the month, or zero for the entire year. Our second dimension is a four-letter category, like "food" or "inco"; this is the actual category we’re dealing with. So, to find the entire year’s balance for the food category, you’d look in mybalance[0,"food"]. To find June’s income, you’d look in mybalance[6,"inco"].
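Under the hood, awk doesn't have true two-dimensional arrays: the comma-separated subscripts are joined into a single string key using the built-in SUBSEP variable. Here's a short sketch of what that means in practice; the sample values are made up.

```shell
lookup=$(awk 'BEGIN {
	mybalance[0, "food"] = -30.25
	mybalance[6, "inco"] = 2001
	# The comma joins the subscripts with SUBSEP into one string key
	for (key in mybalance) {
		split(key, dims, SUBSEP)
		if (dims[1] == 6)
			print dims[2], mybalance[key]
	}
	# Membership test for a two-part subscript
	if ((0, "food") in mybalance)
		print "food", mybalance[0, "food"]
}')

echo "$lookup"
```

This should print "inco 2001" and then "food -30.25". Splitting a key on SUBSEP is how you recover the individual dimensions when looping with "for (key in array)".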

balance, part 2


function doincome(mybalance) {
	mybalance[curmonth,$3] += amount
	mybalance[0,$3] += amount
}

function doexpense(mybalance) {
	mybalance[curmonth,$2] -= amount
	mybalance[0,$2] -= amount
}

function dotransfer(mybalance) {
	mybalance[0,$2] -= amount
	mybalance[curmonth,$2] -= amount
	mybalance[0,$3] += amount
	mybalance[curmonth,$3] += amount
}

When doincome() or any of the other functions are called, we record the transaction in two places — mybalance[0,category] and mybalance[curmonth, category], the entire year’s category balance and the current month’s category balance, respectively. This allows us to easily generate either an annual or monthly breakdown of income/expenditures later on.

If you look at these functions, you’ll notice that the array referenced by mybalance is passed in by reference. In addition, we also refer to several global variables: curmonth, which holds the numeric value of the month of the current record, $2 (the expense category), $3 (the income category), and amount ($7, the dollar amount). When doincome() and friends are called, all these variables have already been set correctly for the current record (line) being processed.


The main block

Here’s the main code block that contains the code that parses each line of input data. Remember, because we have set FS correctly, we can refer to the first field as $1, the second field as $2, etc. When doincome() and friends are called, the functions can access the current values of curmonth, $2, $3 and amount from inside the function. Take a look at the code and meet me on the other side for an explanation.

balance, part 3


{
	curmonth=monthdigit(substr($1,4,3))
	amount=$7

	#record all the categories encountered
	if ( $2 != "-" )
		globcat[$2]="yes"
	if ( $3 != "-" )
		globcat[$3]="yes"

	#tally up the transaction properly
	if ( $2 == "-" ) {
		if ( $3 == "-" ) {
			print "Error: inc and exp fields are both blank!"
			exit 1
		} else {
			#this is income
			doincome(balance)
			if ( $5 == "Y" )
				doincome(balance2)
		}
	} else if ( $3 == "-" ) {
		#this is an expense
		doexpense(balance)
		if ( $5 == "Y" )
			doexpense(balance2)
	} else {
		#this is a transfer
		dotransfer(balance)
		if ( $5 == "Y" )
			dotransfer(balance2)
	}
}

In the main block, the first two lines set curmonth to an integer between 1 and 12, and set amount to field 7 (to make the code easier to understand). Then, we have four interesting lines, where we write values into an array called globcat. globcat, or the global categories array, is used to record all those categories encountered in the file — "inco", "misc", "food", "util", etc. For example, if $2 == "inco", we set globcat["inco"] to "yes". Later on, we can iterate through our list of categories with a simple "for (x in globcat)" loop.

On the next twenty or so lines, we analyze fields $2 and $3, and record the transaction appropriately. If $2=="-" and $3!="-", we have some income, so we call doincome(). If the situation is reversed, we call doexpense(); and if both $2 and $3 contain categories, we call dotransfer(). Each time, we pass the "balance" array to these functions so that the appropriate data is recorded there.

You’ll also notice several lines that say "if ( $5 == "Y" ), record that same transaction in balance2". What exactly are we doing here? You’ll recall that $5 contains either a "Y" or a "N", and records whether the transaction has been posted to the account. Because we record the transaction to balance2 only if the transaction has been posted, balance2 will contain the actual account balance, while "balance" will contain all transactions, whether they have been posted or not. You can use balance2 to verify your data entry (since it should match with your current account balance according to your bank), and use "balance" to make sure that you don’t overdraw your account (since it will take into account any checks you have written that have not yet been cashed).


Generating the report

After the main block repeatedly processes each input record, we now have a fairly comprehensive record of debits and credits broken down by category and by month. Now, all we need to do is define an END block that will generate a report, in this case a modest one:

END {
	bal=0
	bal2=0
	for (x in globcat) {
		bal=bal+balance[0,x]
		bal2=bal2+balance2[0,x]
	}
	printf("Your available funds: %10.2f\n", bal)
	printf("Your account balance: %10.2f\n", bal2)
}

This report prints out a summary that looks something like this:

Your available funds:    1174.22
Your account balance:    2399.33

In our END block, we used the "for (x in globcat)" construct to iterate through every category, tallying up a master balance based on all the transactions recorded. We actually tally up two balances, one for available funds, and another for the account balance. To execute the program and process your own financial goodies that you’ve entered into a file called "mycheckbook.txt", put all the above code into a text file called "balance", "chmod +x balance", and then type "./balance mycheckbook.txt". The balance script will then add up all your transactions and print out a two-line balance summary for you.
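If you'd like to try the idea before typing in the whole program, here's a condensed sketch that you can paste into a shell. It is not the full script: transfers and the "both fields blank" error check are left out to keep it short, the three transactions are made up, and it runs the program with "awk -f" from a temporary directory rather than as an executable.

```shell
tmpdir=$(mktemp -d)

# Three made-up, tab-separated transactions; the last one has not cleared yet.
printf "23 Aug 2000\tfood\t-\t-\tY\tJimmy's Buffet\t\t30.25\n"  > "$tmpdir/mycheckbook.txt"
printf "23 Aug 2000\t-\tinco\t-\tY\tBoss Man\t\t2001.00\n"     >> "$tmpdir/mycheckbook.txt"
printf "24 Aug 2000\tutil\t-\t-\tN\tPower Co\t\t100.00\n"      >> "$tmpdir/mycheckbook.txt"

# Condensed version of the program (no transfers, no error check).
cat > "$tmpdir/balance.awk" <<'EOF'
BEGIN {
	FS = "\t+"
	months = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
}
function monthdigit(mymonth) { return (index(months, mymonth) + 3) / 4 }
function doincome(mybalance) {
	mybalance[curmonth, $3] += amount
	mybalance[0, $3] += amount
}
function doexpense(mybalance) {
	mybalance[curmonth, $2] -= amount
	mybalance[0, $2] -= amount
}
{
	curmonth = monthdigit(substr($1, 4, 3))
	amount = $7
	if ($2 != "-") globcat[$2] = "yes"
	if ($3 != "-") globcat[$3] = "yes"
	if ($2 == "-") {
		doincome(balance)
		if ($5 == "Y") doincome(balance2)
	} else {
		doexpense(balance)
		if ($5 == "Y") doexpense(balance2)
	}
}
END {
	for (x in globcat) {
		bal += balance[0, x]
		bal2 += balance2[0, x]
	}
	printf("Your available funds: %10.2f\n", bal)
	printf("Your account balance: %10.2f\n", bal2)
}
EOF

report=$(awk -f "$tmpdir/balance.awk" "$tmpdir/mycheckbook.txt")
echo "$report"

rm -rf "$tmpdir"
```

With this data, the available funds should come to 1870.75 (all three transactions) and the account balance to 1970.75 (only the two cleared ones).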


Upgrades

I use a more advanced version of this program to manage my personal and business finances. My version (which I couldn’t include here due to space limitations) prints out a monthly breakdown of income and expenses, including annual totals, net income and a bunch of other stuff. Even better, it outputs the data in HTML format, so that I can view it in a Web browser :) If you find this program useful, I encourage you to add these features to this script. You won’t need to configure it to record any additional information; all the information you need is already in balance and balance2. Just upgrade the END block, and you’re in business!

I hope you’ve enjoyed this series. For more information on awk, check out the resources listed below.

Jul 04

Multi-line records

Awk is an excellent tool for reading in and processing structured data, such as the system’s /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".

By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won’t do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.

As an example, let’s look at how we’d handle the task of processing an address list of Federal Witness Protection Program participants:

Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we’d like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:

BEGIN {
	FS="\n"
	RS=""
}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let’s look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.

BEGIN {
	FS="\n"
	RS=""
}
{
	print $1 ", " $2 ", " $3
}

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing "awk -f address.awk address.txt". This code produces the following output:

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890

OFS and ORS

In address.awk’s print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it’s a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.

print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:

Hello there Jim!

This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here’s a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:

BEGIN {
	FS="\n"
	RS=""
	OFS=", "
}
{
	print $1, $2, $3
}

Awk also has a special variable called ORS, the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the string that’s automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
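Here's a sketch that sets OFS and ORS together on the two sample addresses. The " | " separator and the double-spaced output are arbitrary choices of mine, picked only to make the effect of each variable easy to spot.

```shell
# Two three-line addresses separated by a blank line, as RS="" expects.
addresses=$(printf '%s\n' 'Jimmy the Weasel' '100 Pleasant Drive' 'San Francisco, CA 12345' '' 'Big Tony' '200 Incognito Ave.' 'Suburbia, WA 67890')

# OFS goes between the fields of each print; ORS goes after each print.
joined=$(printf '%s\n' "$addresses" |
	awk 'BEGIN { FS = "\n"; RS = ""; OFS = " | "; ORS = "\n\n" } { print $1, $2, $3 }')

echo "$joined"
```

Each address should come out on one line with " | " between its fields, and a blank line should separate the two records.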


Multi-line to tabbed

Let’s say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:

Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

To handle situations like this, it would be good if our code took the number of fields per record into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here’s some code that does what we want:

BEGIN {
    FS="\n"
    RS=""
    ORS=""
} 

{
        x=1
        while ( x<NF ) {
                print $x "\t"
                x++
        }
        print $NF "\n"
} 

First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which will cause the print statement to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".

In the main code block, we create a variable called x that holds the number of the current field that we’re processing. Initially, it’s set to 1. Then, we use a while loop (an awk looping construct identical to that found in the C language) to iterate through all but the last field, printing the field and a tab character. Finally, we print the last field and a literal newline; again, since ORS is set to "", print won’t output newlines for us. Program output looks like this, which is exactly what we wanted:

Our intended output. Not pretty, but tab delimited for easy import into a spreadsheet

Jimmy the Weasel        100 Pleasant Drive      San Francisco, CA 12345
Big Tony        200 Incognito Ave.      Suburbia, WA 67890
Cousin Vinnie   Vinnie's Auto Shop      300 City Alley  Sosueme, OR 76543
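Since tabs are hard to see on screen, here's a sketch that runs the same logic and then counts the tab-separated columns in each output line. I've written the loop as a for loop purely for compactness; it's meant to behave the same as the while loop above.

```shell
# One three-line and one four-line address, separated by a blank line.
tabbed=$(printf '%s\n' 'Big Tony' '200 Incognito Ave.' 'Suburbia, WA 67890' '' 'Cousin Vinnie' "Vinnie's Auto Shop" '300 City Alley' 'Sosueme, OR 76543' |
awk 'BEGIN { FS = "\n"; RS = ""; ORS = "" }
{
	# Print every field but the last, each followed by a tab
	for (x = 1; x < NF; x++)
		print $x "\t"
	# The last field ends the output line
	print $NF "\n"
}')

# Count the tab-separated columns on each output line.
columns=$(printf '%s\n' "$tabbed" | awk -F '\t' '{ print NF }')
echo "$columns"
```

The column counts should be 3 and 4, confirming that the fourth line of the longer address is no longer thrown away.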

Looping constructs

We’ve already seen awk’s while loop construct, which is identical to its C counterpart. Awk also has a "do…while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It’s similar to "repeat…until" loops that can be found in other languages. Here’s an example:

do…while example

{
	count=1
	do {
		print "I get printed at least once no matter what"
	} while ( count != 1 )
}

Because the condition is evaluated after the code block, a "do…while" loop, unlike a normal while loop, will always execute at least once. On the other hand, a normal while loop will never execute if its condition is false when the loop is first encountered.

for loops

Awk allows you to create for loops, which like while loops are identical to their C counterparts:

for ( initial assignment; comparison; increment ) {
	code block
}

Here’s a quick example:

for ( x = 1; x <= 4; x++ ) {
	print "iteration",x
}

This snippet will print:

iteration 1
iteration 2
iteration 3
iteration 4

Break and continue

Again, just like C, awk provides break and continue statements. These statements provide better control over awk’s various looping constructs. Here’s a code snippet that desperately needs a break statement:

while (1) {
	print "forever and ever..."
}

Because 1 is always true, this while loop runs forever. Here’s a loop that only executes ten times:

x=1
while(1) {
	print "iteration",x
	if ( x == 10 ) {
		break
	}
	x++
}

Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop’s code block.
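
The word "innermost" matters once loops are nested. The following sketch (the variable names i, j, and out are my own, chosen only for illustration) shows that break leaves the inner loop while the outer loop carries on:

```shell
# break only leaves the loop it appears in; the outer loop keeps going.
out=$(awk 'BEGIN {
	for ( i = 1; i <= 2; i++ ) {
		for ( j = 1; j <= 3; j++ ) {
			if ( j == 2 ) {
				break	# ends the inner j loop only
			}
			print "i=" i " j=" j
		}
	}
}')
printf '%s\n' "$out"
```

Each pass of the outer loop prints its j=1 line and then abandons the inner loop, so you should see one line for i=1 and one for i=2.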

The continue statement complements break, and works like this:

x=1
while (1) {
	if ( x == 4 ) {
		x++
		continue
	}
	print "iteration",x
	if ( x > 20 ) {
		break
	}
	x++
}

This code will print "iteration 1" through "iteration 21", except for "iteration 4". If x equals 4, x is incremented and the continue statement is called, which immediately causes awk to start the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here’s an equivalent for loop:

for ( x=1; x<=21; x++ ) {
	if ( x == 4 ) {
		continue
	}
	print "iteration",x
}

It wasn’t necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.


Arrays

You’ll be pleased to know that awk has arrays. However, under awk, it’s customary to start array indices at 1, rather than 0:

myarray[1]="jim"
myarray[2]=456

When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.

Iterating over arrays

Once an array is defined, awk gives us a handy mechanism to iterate over its elements, as follows:

for ( x in myarray ) {
	print myarray[x]
}

This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop’s code block once after each assignment. While this is a very handy awk feature, it does have one drawback: when awk cycles through the array indices, it doesn’t follow any particular order. That means that there’s no way for us to know whether the output of the above code will be:

jim
456

or

456
jim

To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates — you never know what you’re going to get. This has something to do with the "stringiness" of awk arrays, which we’ll now take a look at.
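
Before we get to that, one practical note: when the order does matter and the indices are consecutive integers, a common workaround is to skip the "in" form and count through the indices with an ordinary for loop. This is only a sketch; the counter n is an assumption of mine that the script has to maintain by hand.

```shell
# Walk the array in index order with a counting loop instead of "for (x in ...)".
out=$(awk 'BEGIN {
	myarray[1] = "jim"
	myarray[2] = 456
	n = 2	# number of elements, tracked by hand
	for ( x = 1; x <= n; x++ ) {
		print myarray[x]
	}
}')
printf '%s\n' "$out"
```

Because the loop visits index 1 and then index 2, "jim" is always printed before 456.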


Array index stringiness

In my previous article, I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:

a="1"
b="2"
c=a+b+3

After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk’s "stringy" nature is pretty intriguing — you may wonder what happens if we use string indexes for arrays. For instance, take the following code:

myarr["1"]="Mr. Whipple"
print myarr["1"]

As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?

myarr["1"]="Mr. Whipple"
print myarr[1]

Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!

After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:

myarr["name"]="Mr. Whipple"
print myarr["name"]

Not only does this code not raise an error, but it’s functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn’t limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we’re using associative arrays. Technically, awk isn’t doing anything different behind the scenes than when we use an integer index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call ‘em associative arrays: it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)
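
String indexes are what make awk so handy for tallying things. Here is a small sketch that counts how many accounts use each login shell; the three passwd-style lines are made-up sample data, and count and out are illustrative names.

```shell
# Tally login shells, using the shell path itself as the array index.
out=$(printf '%s\n' \
	'root:x:0:0:root:/root:/bin/bash' \
	'bin:x:1:1:bin:/bin:/sbin/nologin' \
	'jim:x:1000:1000::/home/jim:/bin/bash' |
awk -F":" '
	{ count[$7]++ }	# field 7 (the login shell) is the array index
	END {
		print "/bin/bash " count["/bin/bash"]
		print "/sbin/nologin " count["/sbin/nologin"]
	}')
printf '%s\n' "$out"
```

The END block looks up two known keys directly rather than looping with "in", so the output order stays predictable.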


Array tools

When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren’t required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.

First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:

delete fooarray[1]

And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:

if ( 1 in fooarray ) {
	print "Ayep!  It's there."
} else {
	print "Nope!  Can't find it."
}
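
Putting the two tools together, this short sketch checks for an element before and after deleting it. The messages and the variable out are illustrative only.

```shell
# Test for an element with "in", delete it, then test again.
out=$(awk 'BEGIN {
	fooarray[1] = "first"
	if ( 1 in fooarray ) print "before delete: present"
	delete fooarray[1]
	if ( !( 1 in fooarray ) ) print "after delete: gone"
}')
printf '%s\n' "$out"
```

Using "in" for the test is worth the habit: merely referring to fooarray[1] in an expression would create the element as a side effect.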

Next time

We’ve covered a lot of ground in this article. Next time, I’ll round out your awk knowledge by showing you how to use awk’s math and string functions and how to create your own functions. I’ll also walk you through the creation of a checkbook balancing program. Until then, I encourage you to write some of your own awk programs, and to check out the following resources.

Jul 03

In defense of awk

In this series of articles, I’m going to turn you into a proficient awk coder. I’ll admit, awk doesn’t have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it’s capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).

Sure, awk doesn’t have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk’s syntax is familiar, and borrows some of the best parts of languages like C, Python, and bash (although, technically, awk was created before both Python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.


The first awk

Let’s go ahead and start playing around with awk to see how it works. At the command line, enter the following command:

$ awk '{ print }' /etc/passwd 

You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd. Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.

Here is another awk example that does exactly the same thing:

$ awk '{ print $0 }' /etc/passwd 

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you’d like, you can create an awk program that will output data totally unrelated to the input data. Here’s an example:

$ awk '{ print "" }' /etc/passwd 

Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you’ll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here’s another example:

$ awk '{ print "hiya" }' /etc/passwd 

Running this script will fill your screen with hiya’s. :)


Multiple fields

Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:

$ awk -F":" '{ print $1 }' /etc/passwd 

Above, when we called awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here’s another example:

$ awk -F":" '{ print $1 $3 }' /etc/passwd 

Here’s an excerpt of the output from this script:

halt7
operator11
root0
shutdown6
sync5
bin1
....etc. 

As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it’s not perfect: there aren’t any spaces between the two output fields! If you’re used to programming in bash or Python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:

$ awk -F":" '{ print $1 " " $3 }' /etc/passwd 

When you call print this way, it’ll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:

$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd 

This will cause the output to be:

username: halt     uid:7
username: operator uid:11
username: root     uid:0
username: shutdown uid:6
username: sync     uid:5
username: bin      uid:1
....etc. 
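
A related idiom that the commands above don’t use: if you separate the arguments to print with a comma, awk places its output field separator (the OFS variable, a single space by default) between them. The sketch below feeds awk two made-up passwd-style lines so that the result doesn’t depend on your system; out is just an illustrative variable name.

```shell
# A comma between print arguments inserts OFS (a space by default).
out=$(printf '%s\n' \
	'root:x:0:0:root:/root:/bin/bash' \
	'halt:x:7:0:halt:/sbin:/sbin/halt' |
awk -F":" '{ print $1, $3 }')
printf '%s\n' "$out"
```

This prints each username and uid separated by a space, with no explicit " " string needed.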

External scripts

Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you’ll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:

$ awk -f myscript.awk myfile.in 

Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:

BEGIN {
        FS=":"
}
{ print $1 } 

The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It’s generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We’ll cover the FS variable in more detail later in this article.


The BEGIN and END blocks

Normally, awk executes each block of your script’s code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it’s an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you’ll reference later in the program.

Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
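
Here is a minimal sketch that uses all three kinds of block at once: BEGIN prints a heading, the main block runs per line, and END prints a summary. The sample input, the heading text, and the counter n are assumptions made up for the illustration.

```shell
# BEGIN runs before any input, the main block per line, END after the last line.
out=$(printf '%s\n' 'root:x:0:0' 'bin:x:1:1' |
awk '
	BEGIN { FS = ":"; print "USER" }	# runs once, before any input
	{ print $1; n++ }			# runs for every input line
	END { print "total: " n }		# runs once, after the last line
')
printf '%s\n' "$out"
```

The heading appears first, then one username per input line, then the total.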


Regular expressions and blocks

Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here’s an example script that outputs only those lines that contain the character sequence foo:

/foo/ { print } 

Of course, you can use more complicated regular expressions. Here’s a script that will print only lines that contain a floating point number:

/[0-9]+\.[0-9]*/ { print } 

Expressions and blocks

There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:

$1 == "fred" { print $3 }

Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They’re used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here’s an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:

$5 ~ /root/ { print $3 } 

Conditional statements

Awk also offers very nice C-like if statements. If you’d like, you could rewrite the previous script using an if statement:

{
  if ( $5 ~ /root/ ) {
          print $3
  }
} 

Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.

Here’s a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:

{
  if ( $1 == "foo" ) {
           if ( $2 == "foo" ) {
                    print "uno"
           } else {
                    print "one"
           }
  } else if ($1 == "bar" ) {
           print "two"
  } else {
           print "three"
  }
} 

Using if statements, we can also transform this code:

! /matchme/ { print $1 $3 $4 }

to this:

{
  if ( $0 !~ /matchme/ ) {
          print $1 $3 $4
  }
} 

Both scripts will output only those lines that don’t contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.

Awk also allows the use of boolean operators "||" (for "logical or") and "&&" (for "logical and") to allow the creation of more complex boolean expressions:

( $1 == "foo" ) && ( $2 == "bar" ) { print } 

This example will print only those lines where field one equals foo and field two equals bar.
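
The boolean operators combine freely with the comparison and matching operators covered earlier. As a sketch, the following prints the names of accounts whose uid is at least 1000 and whose shell doesn’t match nologin. The three input lines are invented sample data, and the uid threshold of 1000 is only an assumption about where regular accounts start on a given system.

```shell
# Combine a numeric comparison and a "does not match" test with &&.
out=$(printf '%s\n' \
	'root:x:0:0:root:/root:/bin/bash' \
	'jim:x:1000:1000::/home/jim:/bin/bash' \
	'svc:x:1001:1001::/var/svc:/sbin/nologin' |
awk -F":" '( $3 >= 1000 ) && ( $7 !~ /nologin/ ) { print $1 }')
printf '%s\n' "$out"
```

Only jim passes both tests: root fails the uid comparison and svc fails the shell test.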


Numeric variables!

So far, we’ve either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it’s very easy to write a script that counts the number of blank lines in a file. Here’s one that does just that:

BEGIN { x=0 }
/^$/  { x=x+1 }
END   { print "I found " x " blank lines. :) " } 

In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
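
For comparison, here is a more compact sketch of the same counter, run against a few made-up lines. It relies on the fact that an awk variable that has never been assigned acts as zero in arithmetic, so the BEGIN block can be dropped; the x + 0 in the END block is there so a plain 0 is printed even when no blank lines are found.

```shell
# Count blank lines; x starts out as zero without being initialized.
out=$(printf 'one\n\ntwo\n\n\nthree\n' |
awk '/^$/ { x++ } END { print x + 0 }')
printf '%s\n' "$out"
```

The sample input contains three empty lines, so the script prints 3.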


Stringy variables

One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:

x="1.01"
# We just set x to contain the *string* "1.01"
x=x+1
# We just added one to a *string*
print x
# Incidentally, these are comments :)  

Awk will output:

2.01

Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn’t be able to do this in bash or Python. First of all, bash doesn’t support floating point arithmetic. And, while bash has "stringy" variables, they aren’t "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $(( )) construct. If we were using Python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn’t difficult, it’s still an additional step. With awk, it’s all automatic, and that makes our code nice and clean. If we wanted to square and add one to the first field in each input line, we would use this script:

{ print ($1^2)+1 } 

If you do a little experimenting, you’ll find that if a particular variable doesn’t contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
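
Here is that little experiment as a sketch you can paste into a shell. The variable names x, y, and out are illustrative.

```shell
# Arithmetic on a numeric string versus a string with no number in it.
out=$(awk 'BEGIN {
	x = "1.01"; print x + 1	# numeric string: 2.01
	y = "abc";  print y + 1	# no leading number, treated as 0: 1
}')
printf '%s\n' "$out"
```

The first line of output is 2.01 and the second is 1, since "abc" counts as zero.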


Lots of operators

Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.

These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that’s not all — we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
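
A quick sketch that exercises a few of these operators in a BEGIN block (all variable names are arbitrary):

```shell
# Try out the exponent, modulo, and add assignment operators, plus ++.
out=$(awk 'BEGIN {
	a = 3;  a ^= 2; print a	# 3 squared
	b = 10; b %= 4; print b	# remainder of 10 / 4
	c = 5;  c += 3; print c
	i = 1;  print i++	# post-increment prints the old value
	print ++i		# pre-increment prints the new value
}')
printf '%s\n' "$out"
```

The output is 9, 2, 8, 1, and 3, one value per line.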


Field separators

Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We’ve already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.

The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you’re processing fields separated by one or more tabs, you’ll want to set FS like so:

FS="\t+" 

Above, we use the special "+" regular expression character, which means "one or more of the previous character".

If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:

FS="[[:space:]]+" 

While this assignment will do the trick, it’s not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!

Complex regular expressions are no problem. Even if your fields are separated by the word "foo," followed by three digits, the following regular expression will allow your data to be parsed properly:

FS="foo[0-9][0-9][0-9]" 
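
To check that a regular expression separator behaves the way you expect, it helps to feed awk a single test line. In this sketch the input string is invented for the purpose, and NF (covered in the next section) reports how many fields awk found.

```shell
# Split one line on "foo" followed by three digits.
out=$(printf 'alphafoo123betafoo456gamma\n' |
awk 'BEGIN { FS = "foo[0-9][0-9][0-9]" } { print NF; print $2 }')
printf '%s\n' "$out"
```

Awk finds three fields (alpha, beta, and gamma) and prints the second one.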

Number of fields

The next two variables we’re going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:

NF == 3 { print "this particular record has three fields: " $0 } 

Of course, you can also use the NF variable in conditional statements, as follows:

{
  if ( NF > 2 ) {
          print $1 " " $2 ":" $3
  }
} 
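
Since NF holds a number, $NF refers to the last field of the current record, however many fields it has. A short sketch with two made-up input lines:

```shell
# Print the field count and the last field of each line.
out=$(printf 'a b c\nd e\n' | awk '{ print NF ": " $NF }')
printf '%s\n' "$out"
```

The first line has three fields ending in c, the second has two ending in e.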

Record number

The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we’ve been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:

(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" } 

Another example:

{
  #skip header
  if ( NR > 10 ) {
          print "ok, now for the real information!"
  }
} 
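
As a concrete sketch of the header-skipping idea, here is a run against three invented lines where only the first is a header. NR appears both in the pattern and in the output.

```shell
# Skip a one-line header, then print the record number and first field.
out=$(printf 'NAME UID\nroot 0\nbin 1\n' |
awk 'NR > 1 { print NR ": " $1 }')
printf '%s\n' "$out"
```

The header is record 1, so output starts at record 2.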

Awk provides additional variables that can be used for a variety of purposes. We’ll cover more of these variables in later articles. We’ve come to the end of our initial exploration of awk. As the series continues, I’ll demonstrate more advanced awk functionality, and we’ll end the series with a real-world awk application. In the meantime, if you’re eager to learn more, check out the resources listed below.

Jun 28

Much like a vernacular, the universe of UNIX tools changes almost perpetually. New tools crop up frequently, while others are eternally modernized and adapted to suit emerging best practices. Certain tools are used commonly; others are used more infrequently. Some tools are perennial; occasionally, some are obsoleted outright. To speak UNIX fluently, you have to keep up with the "lingo."

Table 1 lists 11 of the significant packages previously discussed in the Speaking UNIX series.

Table 1. Prominent UNIX tools

Name      Purpose
Cygwin    A UNIX-like shell and build environment for the Windows® operating system
fish      A highly interactive shell with automatic expansion and colored syntax for command names, options, and file names
locate    Build and search a database of all files
rename    Rename large collections of files en masse
rsync     Efficiently synchronize files and directories, locally and remotely
Screen    Create and manage virtual, persistent consoles
Squirrel  A cross-platform scripting shell
tac       Print input in reverse order, last line first (tac is the reverse of cat)
type      Reveal whether a command is an alias, an executable, a shell built-in, or a script
wget      Download files using the command line
zsh       An advanced shell featuring automatic completion, advanced redirection operands, and advanced substitutions

This month, let’s look at 10 more utilities and applications that expand or improve on an existing or better-known UNIX package. The list runs a wide gamut, from a universal archive translator to a high-speed Web server.

In some cases, depending on your flavor of UNIX, you will have to install a new software package. You can build from source as instructed, or you can save time and effort if your package-management software provides an equivalent binary bundle. For example, if you use a Debian flavor of Linux®, many of the utilities mentioned this month can be installed directly using apt-get.


Find a command with apropos

UNIX has so many commands, it is easy to forget the name of a utility—especially if you do not use the tool frequently. If you find yourself scratching your head trying to recall a name, run apropos (or the equivalent man -k). For example, if you’re hunting for a calculator, simply type apropos calculator:

$ apropos calculator
bc (1)        - An arbitrary precision calculator language
dc (1)        - An arbitrary precision calculator

Both bc and dc are command-line calculators.

Each UNIX manual page has a short description, and apropos searches the corpus of descriptions for instances of the specified keyword. The keyword can be a literal, such as calculator, or a regular expression, such as calc*. If you use the latter form, be sure to wrap the expression in quotation marks ("") to prevent the shell from interpreting special characters:

$ apropos "calcu*"
allcm (1)     - force the most important Computer-Modern-fonts to be calculated
allec (1)     - force the most important Computer-Modern-fonts to be calculated
allneeded (1) - force the calculation of all fonts now needed
bc (1)        - An arbitrary precision calculator language
dc (1)        - An arbitrary precision calculator

Run a calculation on the command line

As shown above, dc is a capable calculator found on most UNIX systems. If you run dc without arguments, you enter interactive mode, where you can write and evaluate Reverse Polish Notation (RPN) expressions:

$ dc
5
6
*
10
/
p
3

However, you can do all that work right on the command line. Specify the -e option and provide an expression to evaluate. Again, wrap the expression in quotation marks to prevent interpolation by the shell:

$ dc -e "5 6 * 10 / p"
3

Find processes with pgrep

How many times have you hunted for a process with ps aux | grep ...? Countless times, probably. Sure, it works, but there is a much more effective way to search for processes. Try pgrep.

As an example, this command finds all instantiations of strike’s login shell (where strike is the name of a user):

$ pgrep -l -u strike zsh 
10331 zsh
10966 zsh

The pgrep command provides options to filter processes by user name (the -u shown), process group, group, and more. A companion utility, pkill, takes all the options of pgrep and accepts a signal to send to all processes that match the given criteria.

For instance, the command pkill -9 -u strike zsh is the equivalent of pgrep -u strike zsh | xargs kill -9.


Generate secure passwords with pwgen

Virtually every important subsystem in UNIX requires its own password. To wit, e-mail, remote login, and superuser privileges all require a password—preferably disparate and each difficult to guess or derive using an automated attack. Moreover, if you want to develop scripts to generate accounts, you want a reliable source of random, secure passwords.

The pwgen utility is a small program that generates gobs of passwords. You can tailor the passwords to be memorable or secure, and you can specify whether to include numbers, symbols, vowels, and capital letters.

Many UNIX systems have pwgen. If not, it is simple to build:

$ # As of March 2009, the latest version is 2.06
$ wget http://voxel.dl.sourceforge.net/sourceforge/\
  pwgen/pwgen-2.06.tar.gz
$ tar xzf pwgen-2.06.tar.gz
$ cd pwgen-2.06
$ ./configure && make && sudo make install

Here are some sample uses:

  • Print a collection of easy-to-recall passwords:
    $ pwgen -C
    ue2Ahnga Soom0Lu0 Hie8aiph gei9mooD eiXeex7N
    Wid4Ueng taShee3v Ja3shii8 iNg0viSh iegh5ouF
    ...
    zoo8Ahzu Iefev0ch MoVu4Pae goh1Ak6m EiJup5ei 
  • Generate a single, secure password:
    $ pwgen -s -1
    oYvy9WWa
  • Generate a single, secure password with no ambiguous, or easily confused, characters and at least one non-alphanumeric character:
    $ pwgen -s -B -1 -y
    7gEqT_V[

To see all the available options, type pwgen --help.


Watch many files with multitail

Whether you're a developer debugging new code or a systems administrator monitoring a system, you often have to keep an eye on many things at once. If you're a developer, you might watch a debug log and stdout to track down a bug; if you're an administrator, you might want to police activity to intercede as necessary. Usually, both tasks require oodles of windows tiled on screen to keep a watchful eye—perhaps tail in one window, less in another window, and a command prompt in yet another.

If you have to monitor several files at once, consider multitail. As its name implies, this utility divides a console window into multiple sections, one section per log file. Even better, multitail can colorize well-known formats (and you can define custom color schemes, too) and can merge multiple files into a single stream.

To build multitail, download the source, unpack it, and run make. (The options in the distribution's generic makefile should suffice for most UNIX systems. If the make fails, look in the topmost directory for a makefile specific to your system.)

$ # As this article was written, the latest version of multitail was 5.2.2
$ wget http://www.vanheusden.com/multitail/multitail-5.2.2.tgz
$ tar xzf multitail-5.2.2.tgz
$ cd multitail-5.2.2
$ make
$ sudo make install

Here are some uses of multitail to consider:

  • To watch a list of log files in the same window, launch the utility with a list of file names, as in multitail /var/log/apache2/{access,error}.log.
  • To watch a pair of files in the same window and buffer everything that's read, use the -I option to merge the named file into another, as in multitail -M 0 /var/log/apache/access.log -I /var/log/apache/error.log. Here, the Apache error log and access log are interleaved. -M 0 records all incoming data; you can see the buffer at any time by pressing the B key.
  • You can also mix and match commands and files. To watch a log file and monitor the output of ping, try multitail logfile -l "ping 192.168.1.3". This creates two views in the same console: One view shows the contents of logfile, while the other shows the ongoing output of ping 192.168.1.3.

In addition to command-line options, multitail provides a collection of interactive commands to affect the current state of the display. For instance, press the A key in the display to add a new log file. The B key displays the save buffer. The Q key quits multitail. See the man page for multitail for the complete list of commands.


Compress and extract almost anything with 7zip

Between Windows and UNIX alone, there are dozens of popular archive formats. Windows has long had .zip and .cab, for instance, while UNIX has had .tar, .cpio, and .gz. UNIX and its variants also employ .rpm, .deb, and .dmg. All these formats are commonly found online, making for something of a Babel of bits.

To save or extract data in any particular format, you could install a bevy of specialized utilities, or you can install 7zip, a kind of universal translator that can compress and extract virtually any archive. Further, 7zip also proffers its own format, featuring a high compression ratio, gigantic capacity reaching into terabytes, and strong data encryption.

To build 7zip, download the source for p7zip, a port of 7zip to UNIX, from its project page on SourceForge. Unpack the tarball, change to the source directory, and run make. (Like multitail, the generic makefile should suffice; if not, choose from one of the specialized makefiles provided.)

$ wget http://voxel.dl.sourceforge.net/sourceforge/p7zip/\
  p7zip_4.65_src_all.tar.bz2
$ tar xjf p7zip_4.65_src_all.tar.bz2
$ cd p7zip_4.65
$ make
$ sudo make install

The build produces and installs the utility 7za. Type 7za with no arguments to see a list of available commands and options. Each command is a letter—akin to tar—such as a to add a file to the archive and x to extract.

To try the utility, create an archive of the p7zip source itself in a variety of formats, and extract each archive with 7za:

$ zip -r p7.zip p7zip_4.65
$ 7za -ozip x p7.zip
$ tar cvf p7.tar p7zip_4.65
$ 7za -otar x p7.tar
$ bzip2 p7.tar
$ 7za -so x p7.tar.bz2 | tar tf -

In order from top to bottom, 7za extracted a .zip, .tar, and .bz2 archive. In the last command, 7za decompressed the .bz2 archive and wrote the resulting tar stream to stdout, where tar cataloged the files (the t option lists an archive's contents rather than extracting them). Like tar, 7za can be the source or destination of a pipe (|), making it easy to combine with other utilities.


View compressed files with zcat

Per-disk capacity now exceeds a terabyte, but a disk can nonetheless fill up quickly with large data files, lengthy log files, images, and media files such as movies. To conserve space, many files can be compressed to a fraction of their original size. For example, an Apache log file, which is simply text, can shrink to one-tenth of its original size.

Although compression saves disk space, it can add effort. If you need to analyze a compressed Apache log file, for instance, you must decompress it, process the data, then re-compress it. If you have a great number of log files, which is typical if you keep records to establish trends, the overhead can become excessive.

Luckily, the gzip suite includes a number of utilities to process compressed files in situ. The utilities zcat, zgrep, zless, and zdiff, among others, serve the same purpose as cat, grep, less, and diff, respectively, but operate on compressed files.

Here, two source files are compressed with gzip and compared with zdiff:

$ cat old
This
is
Monday.
$ cat new
This
is
Tuesday.
$ gzip old new
$ zdiff -c old.gz new.gz
*** -	2009-03-30 22:26:34.518217647 +0000
--- /tmp/new.10874	2009-03-30 22:26:34.000000000 +0000
***************
*** 1,3 ****
  This
  is
! Monday.
--- 1,3 ----
  This
  is
! Tuesday.
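
The other utilities follow the same pattern. As a rough sketch, assuming a made-up three-line log named access.log (the addresses and requests are invented for illustration), zgrep and zcat can answer common questions about a compressed log without unpacking it first:

```shell
# make_sample_log DIR: write a tiny, invented access log into DIR and gzip it.
make_sample_log() {
    printf '%s\n' \
        '10.0.0.1 - - [30/Mar/2009:22:26:34 +0000] "GET / HTTP/1.1" 200 512' \
        '10.0.0.2 - - [30/Mar/2009:22:26:35 +0000] "GET /missing HTTP/1.1" 404 128' \
        '10.0.0.1 - - [30/Mar/2009:22:26:36 +0000] "GET /gone HTTP/1.1" 404 128' \
        > "$1/access.log"
    gzip -f "$1/access.log"
}

# count_404 FILE.gz: count "not found" responses in the compressed log.
count_404() {
    zgrep -c '" 404 ' "$1"
}

# top_clients FILE.gz: requests per client address, busiest first.
# Reading from stdin keeps zcat from guessing at the file suffix.
top_clients() {
    zcat < "$1" | awk '{ print $1 }' | sort | uniq -c | sort -rn
}
```

The log stays compressed on disk the whole time; only the pipeline sees the expanded text.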

Surf the Web, conquer the Internet, make world peace with cURL

A prior Speaking UNIX column recommended wget to download files directly from the command line. Ideal for shell scripts, wget is great for those times when you do not have ready access to a Web browser. For example, if you are trying to install new software on a remote server, wget can be a real life-saver.

If you like wget, then you'll love cURL. Like wget, cURL can download files, but it can also post data to a Web page form, upload a file via the File Transfer Protocol (FTP), work through a proxy, set Hypertext Transfer Protocol (HTTP) headers, and a whole lot more. In many ways, cURL is a command-line surrogate for the browser and other clients. Thus, it has many potential applications.

The cURL utility is readily built using the tried-and-true ./configure && make && sudo make install process. Download, extract, and proceed:

$ wget http://curl.haxx.se/download/curl-7.19.4.tar.gz
$ tar xzf curl-7.19.4.tar.gz
$ cd curl-7.19.4
$ ./configure && make && sudo make install

The cURL utility has so many options, it's best to read over its lengthy man page. Here are some common cURL uses:

  • To download a file—say, the cURL tarball itself—use:
    $ curl -o curl.tgz http://curl.haxx.se/download/curl-7.19.4.tar.gz

    Unlike wget, cURL emits what it downloads to stdout. Use the -o option to save the download to a named file.

  • To download a number of files, you can provide a sequence, a set, or both. A sequence is a range of numbers in brackets ([]); a set is a comma-delimited list in braces ({}). For example, the following command would download all files named parta.html, partb.html, and partc.html from the directories named archive1996/vol1 through archive1999/vol4, inclusive, for a total of 48 files.
    $ curl "http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html" \
      -o "archive#1_vol#2_part#3.html"

    When a sequence or set is specified, you can provide the -o option with a template, where #1 is replaced with the current value of the first sequence or set, #2 is a placeholder for the second, and so on. As an alternative you can also provide -O to keep each file name intact.

  • To upload a suite of images to a server, use the -T option:
    $ curl -T "img[1-1000].png" ftp://ftp.example.com/upload/

    Here, the pattern img[1-1000].png is enclosed in quotation marks to prevent the shell from interpreting it. This command uploads img1.png through img1000.png to the named server and path.

  • You can even use cURL to look up words in the dictionary:
    $ curl dict://dict.org/d:stalwart
    220 miranda.org dictd 1.9.15/rf on Linux 2.6.26-bpo.1-686
        <auth.mime> <400549.18119.1238445667@miranda.org>
    250 ok
    150 1 definitions retrieved
    151 "Stalwart" gcide "The Collaborative International Dictionary of English v.0.48"
    Stalwart \Stal"wart\ (st[o^]l"w[~e]rt or st[add]l"-; 277),
    Stalworth \Stal"worth\ (-w[~e]rth), a. [OE. stalworth, AS.
       staelwyr[eth] serviceable, probably originally, good at
       stealing, or worth stealing or taking, and afterwards
       extended to other causes of estimation. See {Steal}, v. t.,
       {Worth}, a.]
       Brave; bold; strong; redoubted; daring; vehement; violent. "A
       stalwart tiller of the soil." --Prof. Wilson.
       [1913 Webster]
    
             Fair man he was and wise, stalworth and bold. --R. of
                                                      Brunne.
       [1913 Webster]
    
       Note: Stalworth is now disused, or but little used, stalwart
             having taken its place.
             [1913 Webster]
    .
    250 ok [d/m/c = 1/0/20; 0.000r 0.000u 0.000s]
    221 bye [d/m/c = 0/0/0; 0.000r 0.000u 0.000s]

    Replace the word stalwart with the word you'd like to define.
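
To see what the -o template in the multi-file download produces, it can help to enumerate the names by hand. The sketch below only mimics cURL's #1, #2, and #3 substitution with plain shell loops; it contacts no server, and the function name expand_template is made up for illustration.

```shell
# expand_template: print the local file names that the template
# "archive#1_vol#2_part#3.html" yields for the ranges in the example.
# This imitates the substitution with shell loops; it downloads nothing.
expand_template() {
    year=1996
    while [ "$year" -le 1999 ]; do          # first sequence  -> #1
        vol=1
        while [ "$vol" -le 4 ]; do          # second sequence -> #2
            for part in a b c; do           # the set         -> #3
                echo "archive${year}_vol${vol}_part${part}.html"
            done
            vol=$((vol + 1))
        done
        year=$((year + 1))
    done
}
```

Running expand_template | wc -l confirms the count of 48 files: four years times four volumes times three parts.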

In addition to its command-line personality, all of cURL's capabilities are available from a library aptly named libcurl. Many programming languages include interfaces to libcurl to automate tasks such as transmitting a file via FTP. For example, this PHP snippet uses libcurl to deposit a file uploaded via a form to an FTP server:

<?php
  ...
  $ch = curl_init();
  $localfile = $_FILES['upload']['tmp_name'];
  $fp = fopen($localfile, 'r');
  curl_setopt($ch, CURLOPT_URL,
      'ftp://ftp_login:password@ftp.domain.com/'.$_FILES['upload']['name']);
  curl_setopt($ch, CURLOPT_UPLOAD, 1);
  curl_setopt($ch, CURLOPT_INFILE, $fp);
  curl_setopt($ch, CURLOPT_INFILESIZE, filesize($localfile));
  curl_exec ($ch);
  $error_no = curl_errno($ch);
  curl_close ($ch);
  ...
?>

If you have to automate any sort of Web access, consider cURL.


SQLite: A database for most occasions

UNIX offers a slew of databases—many of them open source, some for general application, and some highly specialized. Most databases, though, tend to be large, independent applications—MySQL, for example, requires a separate installation, some configuration, and its own daemon—and may be overkill for a large class of software. Consider an address book accessory for the desktop: Is it appropriate to deploy MySQL to persist names and phone numbers? Probably not.

And what if the application is intended to run on a very small device or on a modest computer? Such hardware may not be suited to multiprocessing, a large memory footprint, or significant demands on physical storage. Certainly, an embedded database is an alternative. Typically, an embedded database is packaged as a library and is linked directly to application code. Such a solution makes the application independent of an external service, albeit at a cost: Queries aren't typically expressed in Structured Query Language (SQL).

SQLite combines the best of all worlds: The software is tiny, you can embed it in virtually any application, and you can query your data with vanilla SQL. PHP bundles SQLite, Ruby on Rails uses it as the default database for new applications, and the Apple iPhone uses it for on-device storage.

To build SQLite, download the source amalgamation (a single file combining all the source) from the SQLite download page, extract it, and run ./configure && make && sudo make install.

$ # As of March 2009, the latest version was 3.6.11.
$ wget http://www.sqlite.org/sqlite-amalgamation-3.6.11.tar.gz
$ tar xzf sqlite-amalgamation-3.6.11.tar.gz
$ cd sqlite-3.6.11
$ ./configure && make 
$ sudo make install

The build produces a library and associated application programming interface (API) header files as well as a stand-alone command-line utility named sqlite3 that's useful for exploring features. To create a database, launch sqlite3 with the name of the database. You can even place SQL right on the command line, which is great for scripting:

$ sqlite3 comics.db "CREATE TABLE issues \
  (issue INT PRIMARY KEY, \
  title TEXT NOT NULL)"
$ sqlite3 comics.db "INSERT INTO issues (issue, title) \
  VALUES (1, 'Amazing Adventures')"
$ sqlite3 comics.db "SELECT * FROM issues"
1|Amazing Adventures

The first command creates the database (if it does not exist already) as well as a table with two columns: an issue number and a title. The middle command inserts a row, and the final command shows the contents of the table.
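
For anything longer than a statement or two, the sqlite3 shell also reads SQL from standard input, so a script can feed it a here-document. The following is a sketch under a few assumptions: the function names load_issues and list_issues and the second comic title are invented for illustration, and IF NOT EXISTS and INSERT OR REPLACE are used only so the function can be run repeatedly.

```shell
# load_issues DB: create the table if needed and insert two rows in one
# transaction. The second title is made up for illustration.
load_issues() {
    sqlite3 "$1" <<'SQL'
CREATE TABLE IF NOT EXISTS issues (issue INT PRIMARY KEY, title TEXT NOT NULL);
BEGIN;
INSERT OR REPLACE INTO issues (issue, title) VALUES (1, 'Amazing Adventures');
INSERT OR REPLACE INTO issues (issue, title) VALUES (2, 'Astounding Tales');
COMMIT;
SQL
}

# list_issues DB: print the rows in the shell's default issue|title form.
list_issues() {
    sqlite3 "$1" "SELECT issue, title FROM issues ORDER BY issue"
}
```

Wrapping the inserts in BEGIN and COMMIT means either both rows are stored or neither is.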

SQLite offers triggers, logging, and sequences. SQLite is also typeless, unless you specify a type. For example, the issues table works fine when it is dropped and re-created without types:

$ sqlite3 comics.db "DROP TABLE issues"
$ sqlite3 comics.db "create table issues (issue primary key, title)"
$ sqlite3 comics.db "INSERT INTO issues (issue, title) \
  VALUES (1, 'Amazing Adventures')"
$ sqlite3 comics.db "SELECT * FROM issues"
1|Amazing Adventures

Lack of type is considered a feature, not a bug, and has many applications.


Grab XAMPP, an off-the-shelf Web stack

If you want to use your UNIX machine as a Web server, you have oodles of choices to compose a Web stack. Of course, there's the Apache HTTP Server, MySQL, Perl, PHP, Python, and Ruby on Rails, and this article recommends some components you may not have heard of previously, including SQLite and lighttpd.

But building a stack from scratch isn't everyone's cup of tea. Configuring Apache and other software packages to interoperate can be maddening at times, and you may not want the onus of maintaining the source yourself, recompiling each time a new patch is issued. For those good reasons, you might opt for an off-the-shelf stack. Just install and go!

XAMPP is one of many pre-packaged Web stacks you can find online. It includes Apache and compatible builds of MySQL, PHP, and Perl. A version of XAMPP is available for Linux, Sun Solaris, Windows, and Mac OS X. You download XAMPP, extract it, and start:

$ # The latest version for Linux was 1.7
$ wget -O xampp-linux-1.7.tar.gz \
  "http://www.apachefriends.org/download.php?xampp-linux-1.7.tar.gz"
$ sudo tar xzf xampp-linux-1.7.tar.gz -C /opt
$ sudo /opt/lampp/lampp start
Starting XAMPP 1.7...
LAMPP: Starting Apache...
LAMPP: Starting MySQL...
LAMPP started.

The second command extracts the XAMPP distribution and places it directly in /opt (thus the need to preface the command with sudo). If you want to locate XAMPP elsewhere, change the argument to -C. The last command launches Apache and MySQL, the two daemons required to serve a Web site. To test the installation, simply point your browser to http://localhost. You should see something like Figure 1.

Figure 1. The XAMPP stack start page

Click Status to see how things are operating. XAMPP provides phpMyAdmin and webalizer to create and manage MySQL databases on the server and measure Web traffic, respectively.

By the way, XAMPP also provides the entire source code to the stack, so you can apply customizations or add to the stack if you need to. If nothing else, the XAMPP source code reveals how to build a stack, if you want to eventually tackle or customize the process yourself.


Go small with the lighttpd server

XAMPP and many bundles like it package the Apache HTTP Server. Apache is certainly capable—by most measures, it still powers the majority of sites worldwide—and an enormous number of extensions is available to add wholesale subsystems and integrate tightly with programming languages.

But Apache isn't the only Web server available, and in some cases, it isn't preferable. A complex Apache instance can require an immense memory footprint, which limits throughput. Further, even a small Apache instance may consume more resources than a simple site justifies.

"Security, speed, compliance, and flexibility" describe lighttpd (pronounced "lighty"), a small and very efficient alternative to Apache. Better yet, the lighttpd configuration file isn't the morass that Apache's is.

Building lighttpd from scratch is a little more involved, because it depends on other libraries. At a minimum, you need the development version (the version that includes the header files) of the Perl Compatible Regular Expressions (PCRE) library and the Zlib compression library. After you've installed those libraries (or built the libraries from scratch), compiling lighttpd is straightforward:

$ # Lighttpd requires libpcre3-dev and zlib1g-dev
$ wget http://www.lighttpd.net/download/lighttpd-1.4.22.tar.gz
$ tar xzf lighttpd-1.4.22.tar.gz
$ cd lighttpd-1.4.22
$ ./configure && make && sudo make install

Next, you must create a configuration. A minimal configuration sets the document root, server port, a few Multipurpose Internet Mail Extensions (MIME) types, and the default user and group for the daemon:

server.document-root = "/var/www/lighttpd/host1"
server.groupname = "www"
server.port = 3000
server.username = "www" 

mimetype.assign = (
  ".html" => "text/html",
  ".txt" => "text/plain",
  ".jpg" => "image/jpeg",
  ".png" => "image/png"
)

static-file.exclude-extensions = ( ".fcgi", ".php", ".rb", "~", ".inc" )
index-file.names = ( "index.html" )

Assuming that you saved the text to a file named /opt/etc/lighttpd.conf, you start lighttpd with lighttpd -D -f /opt/etc/lighttpd.conf.

Like Apache, lighttpd can serve virtual hosts. All it takes is three lines, using a conditional:

$HTTP["host"] == "www2.example.org" {
  server.document-root = "/var/www/lighttpd/host2"
}

Here, if the host is named www2.example.org, an alternate document root is used.

Lighttpd is especially adept at managing large numbers of parallel requests. You can readily mix lighttpd with Rails, PHP, and more.


Better, smarter, faster

Yet another "Speaking UNIX" draws to a close. Break out those keyboards, fire up the Wi-Fi, and start downloading!

May 19

I wrote this script to improve the security of the operating system, and I hope you can help me make it more efficient.
The script's primary functions are to stop unwanted services, modify core security-related parameters, and modify the parameters of other applications.

OS: CentOS 4/5
Code:

#!/bin/bash

###################################################################################
#    Security Script for RedHat Linux
#    Author:jason
#    Date:2009/05/20
#
##################################################################################

#-----------------------------Define Variable-------------------------------------
LANG=C
DATETIME=`date +%Y%m%d-%H%M%S`
SERVICES=(autofs firstboot cups gpm nfs nfslock xfs netfs sendmail yum-updatesd restorecond mcstrans avahi-daemon anacron)
MD5SUM=(ps netstat ls last w ifconfig tcpdump iptraf top swatch nice lastb md5sum name)
IPV6=$(ifconfig | grep "inet6")
Filename=`ifconfig -a |grep inet |grep -v "127.0.0.1" |awk '{print $2}'| head -1 | awk -F":" '{ print $2}'`-$DATETIME-md5
BKDir=/var/ikerbk

#----------------------------Create report/back Directory-------------------------
mkdir -p "$BKDir"

#----------------------------Modify Default Language------------------------------
echo -n "modify env_LANG"
if [ -f /etc/sysconfig/i18n ]; then
cp /etc/sysconfig/i18n $BKDir/$DATETIME\_i18n
Lang=`grep "^LANG=" /etc/sysconfig/i18n`
Lang1=`grep "^SUPPORTED="        /etc/sysconfig/i18n`
Lang2=`grep "^SYSFONT="  /etc/sysconfig/i18n`
        if [ -z "$Lang" ]; then
        sed -i '1i\LANG="en_US.UTF-8"' /etc/sysconfig/i18n
        echo " : insert [OK]"
        else
        sed -i 's/^LANG=.*/LANG="en_US.UTF-8"/g' /etc/sysconfig/i18n
        echo " : modify [OK]"
        fi

        if [ -z "$Lang1" ]; then
        sed -i '1a\SUPPORTED="en_US.UTF-8:en_US:en"' /etc/sysconfig/i18n
        echo "SUPPORTED insert [OK]"
        else
        sed -i 's/^SUPPORTED=.*/SUPPORTED="en_US.UTF-8:en_US:en"/g' /etc/sysconfig/i18n
        echo "SUPPORTED modify [OK]"
        fi

        if [ -z "$Lang2" ]; then
        sed -i '1a\SYSFONT="latarcyrheb-sun16"' /etc/sysconfig/i18n
        echo "SYSFONT insert [OK]"
        else
        sed -i 's/^SYSFONT=.*/SYSFONT="latarcyrheb-sun16"/g' /etc/sysconfig/i18n
        echo "SYSFONT modify [OK]"
        fi
else
        echo " : File /etc/sysconfig/i18n not exist [False]"
fi

#-----------------------------SSH Protocol 2--------------------------------------
echo -n "change sshd <Protocol 2>"
if [ -f /etc/ssh/sshd_config ] ; then
cp /etc/ssh/sshd_config $BKDir/$DATETIME-sshd_config
Proto=`sed -n '/^Protocol/p' /etc/ssh/sshd_config`
Proto1=`sed -n '/^Protocol/p' /etc/ssh/sshd_config | awk '{ print $2 }'`
        if [ -z "$Proto" ]; then
        sed -i '1i\Protocol 2' /etc/ssh/sshd_config
        echo "  [OK]"
        elif [ "$Proto1" != "2" ]; then
        # Match the directive itself instead of pasting the old line into the pattern.
        sed -i 's/^Protocol.*/Protocol 2/' /etc/ssh/sshd_config
        echo "  [OK]"
        else
        echo "  already [OK]"
        fi
else
        echo "  :File /etc/ssh/sshd_config not exist [False]"
fi

#-----------------------------Stop Unused Services--------------------------------
for x in "${SERVICES[@]}"; do
    # Query one service by name so that "nfs" does not also match "nfslock".
    state1=`chkconfig --list "$x" 2>/dev/null | awk '{print substr($5,3,5)}'`
    if [ "$state1" == "on" ]; then
        service $x stop
        chkconfig --level 3 $x off
    else
        echo "$x state is stop [OK]"
    fi
done

#-----------------------------Force Password Length-------------------------------
echo -n "change <password> length"
if [ -f /etc/login.defs ]; then
cp /etc/login.defs $BKDir/$DATETIME\_login.defs
        sed -i 's/PASS_MIN_LEN.*5/PASS_MIN_LEN  8/' /etc/login.defs
        echo "   [OK]"
else
        echo " :File /etc/login.defs not exist [False]"
fi

#----------------------------Define SSH Session TIMEOUT---------------------------
echo -n "modify Histsize and TMOUT"
if [ -f /etc/profile ]; then
cp /etc/profile $BKDir/$DATETIME\_profile
        sed -i 's/HISTSIZE=.*/HISTSIZE=128/' /etc/profile
        echo "  [OK]"

        Timeout=`grep "TMOUT=" /etc/profile`
        if [ -z "$Timeout" ] ; then
        echo "TMOUT=900" >> /etc/profile
        else
        sed -i 's/.*TMOUT=.*/TMOUT=300/g' /etc/profile
        fi
else
        echo "  :File /etc/profile not exist [False]"
fi

#-----------------------------Check tmp Directory Sticky Bit----------------------
if [ -d /tmp/ ]; then
echo -n "modify /tmp/ +t"
chmod +t /tmp/
echo  " [OK]"
else
        mkdir /tmp &&   chmod 777 /tmp && chmod +t /tmp
        echo "  [mkdir /tmp]"
fi

#-----------------------------Close tty4/5/6--------------------------------------
echo -n "modify Control-Alt-Delete"
if [ -f /etc/inittab ]; then
cp /etc/inittab  $BKDir/$DATETIME\_inittab
sed -i  's/^\(ca::ctrlaltdel:\/sbin\/shutdown.*\)/#\1/g' /etc/inittab
sed -i  's/^\(4:2345:respawn.*\)/#\1/g' /etc/inittab
sed -i  's/^\(5:2345:respawn.*\)/#\1/g' /etc/inittab
sed -i  's/^\(6:2345:respawn.*\)/#\1/g' /etc/inittab
        echo " : Control-Alt-Delete AND tty-456 [OK]"
else
        echo "file /etc/inittab NOT EXIST"
fi

#-----------------------------Clean Console Information---------------------------
echo -n "Clean boot information"
Check=`sed -n '/issue.net/p' /etc/rc.local`
if [ -f /etc/issue -a -f /etc/issue.net ]; then
    echo "" >  /etc/issue
    echo "" >  /etc/issue.net
    if [ -z "$Check" ]; then
        echo 'echo "" >  /etc/issue'    >> /etc/rc.local
        echo 'echo "" >  /etc/issue.net'        >> /etc/rc.local
    fi
    echo    "   [OK]"
else
        echo "  :File /etc/issue or /etc/issue.net not exist [False]"
fi

#----------------------------Close IPV6-------------------------------------------
if [ -n "$IPV6" ]; then
        if [ -f /etc/sysconfig/network -a -f /etc/modprobe.conf ]; then
        cp /etc/sysconfig/network $BKDir/$DATETIME\_network
        cp /etc/modprobe.conf   $BKDir/$DATETIME\_modprobe.conf
                Netipv6=`grep "^NETWORKING_IPV6=yes" /etc/sysconfig/network`
                echo -n "modify ipv6 clean"
                if [ -z "$Netipv6" ]; then
                        echo "  already [OK]"
                else
                        sed -i 's/^NETWORKING_IPV6=yes/NETWORKING_IPV6=no/g' /etc/sysconfig/network
                        echo "  [OK]"
                fi
                Ipv6mod=`sed -n  '/^alias.*ipv6.*off/p' /etc/modprobe.conf`
                echo -n "modify ipv6_mod clean"
                if [ -z "$Ipv6mod" ]; then
                 echo "
alias net-pf-10 off
alias ipv6 off"  >> /etc/modprobe.conf
                echo "  [OK]"
                else
                echo "  IPV6 mod already [OK]"
                fi
        else
                echo "File /etc/sysconfig/network or /etc/modprobe.conf not exist [False]"
        fi
else
        echo "IPV6 not support [OK]"
fi

#-----------------Protect File passwd/shadow/group/gshadow/services---------------
echo -n "modify passwd_file +i "
chattr +i /etc/passwd
chattr +i /etc/shadow
chattr +i /etc/group
chattr +i /etc/gshadow
chattr +i /etc/services
echo    "  [OK]"

#------------------------------Clean Command History------------------------------
echo -n "modify bash_history"
if [ -f /root/.bash_logout ]; then
        LOGOUT=`grep "rm -f" /root/.bash_logout`
        if  [ -z "$LOGOUT" ] ; then
        sed -i '/clear/i \rm -f  $HOME/.bash_history' /root/.bash_logout
        echo "    [OK]"
        else
        echo "  Already [OK]"
        fi
else
        echo "  :File /root/.bash_logout not exist [False]"
fi

#-----------------------------Group wheel su root---------------------------------
echo -n "modify su root"
if [ -f /etc/pam.d/su ]; then
cp /etc/pam.d/su $BKDir/$DATETIME\_su
        sed -i 's/.*pam_wheel.so use_uid$/auth           required        pam_wheel.so use_uid/' /etc/pam.d/su
        echo "  [OK]"
else
        echo "  :File /etc/pam.d/su not exist [False]"
fi

#------------------------Modify Kernel Parameters About Security------------------
echo "modify /etc/sysctl.conf"
if [ -f /etc/sysctl.conf ]; then
cp /etc/sysctl.conf $BKDir/$DATETIME\_sysctl.conf
Net=(net.ipv4.ip_forward
net.ipv4.conf.all.accept_source_route
net.ipv4.conf.all.accept_redirects
net.ipv4.tcp_syncookies
net.ipv4.conf.all.log_martians
net.ipv4.icmp_echo_ignore_broadcasts
net.ipv4.icmp_ignore_bogus_error_responses
net.ipv4.conf.all.rp_filter)

# Set key $1 to value $2: append the line if the key is missing,
# rewrite it if the current value differs, otherwise leave it alone.
set_sysctl() {
        Line=`sed -n "/^$1/p" /etc/sysctl.conf`
        Value=`echo "$Line" | awk -F"=" '{ print $2 }' | sed 's/ //g'`
        if [ -z "$Line" ]; then
                echo "$1 = $2" >> /etc/sysctl.conf
        elif [ "$Value" != "$2" ]; then
                sed -i "s/^$1.*/$1 = $2/g" /etc/sysctl.conf
        fi
        echo "$1 is [OK]"
}

# The first three parameters must be 0; the remaining five must be 1.
for i in "${Net[@]:0:3}"; do
        set_sysctl "$i" 0
done

for i in "${Net[@]:3}"; do
        set_sysctl "$i" 1
done

else
        echo ":File /etc/sysctl.conf not exist [False]"
fi

sysctl -p >> $BKDir/$Filename
init q
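
Most sections of the script repeat one pattern: back up a file, then either rewrite an existing setting or add it if it is missing. A minimal sketch of that pattern as one reusable function follows. The name set_option, the KEY=VALUE line format, and the .bak suffix are assumptions made for illustration and do not appear in the script above; try it on a scratch copy before pointing it at anything under /etc.

```shell
# set_option FILE KEY VALUE: make sure FILE contains the line KEY=VALUE.
# An existing "KEY=..." line is rewritten; otherwise the line is appended.
# A timestamped copy of FILE is saved next to it first.
# Limitation: KEY and VALUE must not contain "|" or other sed metacharacters.
set_option() {
    file=$1
    key=$2
    value=$3
    cp "$file" "$file.bak.$(date +%Y%m%d-%H%M%S)" || return 1
    if grep -q "^$key=" "$file"; then
        # Edit through a temporary file and copy the result back, which keeps
        # the original file's owner and permissions and avoids GNU-only sed -i.
        sed "s|^$key=.*|$key=$value|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        echo "$key=$value" >> "$file"
    fi
}
```

Because the function checks before it writes, running it twice leaves a single KEY=VALUE line, so the whole script can be re-run safely.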

May 15

rsync is an open source utility that provides fast incremental file transfer. rsync is freely available under the GNU General Public License and is currently being maintained by Wayne Davison.

Here is a script that performs incremental backups with rsync. I hope it helps you.
Tested on: CentOS 4 and 5, Red Hat AS 4 and 5

#!/bin/sh

#########################################################
# Script to do incremental rsync backups
# Adapted from script found on the rsync.samba.org
# Jason 3/24/2002
# This script is freely distributed under the GPL
#########################################################

##################################
# Configure These Options
##################################

###################################
# mail address for status updates
#  - This is used to email you a status report
###################################
MAILADDR=your_mail_address_here

###################################
# HOSTNAME
#  - This is also used for reporting
###################################
HOSTNAME=your_hostname_here

###################################
# directory to backup
#  - This is the path to the directory you want to archive
###################################
BACKUPDIR=directory_you_want_to_backup

###################################
# excludes file - contains one wildcard pattern per line of files to exclude
#  - This is a rsync exclude file.  See the rsync man page and/or the
#    example_exclude_file
###################################
EXCLUDES=example_exclude_file

###################################
# root directory for backup stuff
###################################
ARCHIVEROOT=directory_to_backup_to

#########################################
# From here on out, you probably don’t  #
#   want to change anything unless you  #
#   know what you’re doing.             #
#########################################

# directory which holds our current datastore
CURRENT=main

# directory which we save incremental changes to
INCREMENTDIR=`date +%Y-%m-%d`

# options to pass to rsync
OPTIONS="--force --ignore-errors --delete --delete-excluded \
--exclude-from=$EXCLUDES --backup --backup-dir=$ARCHIVEROOT/$INCREMENTDIR -av"

export PATH=$PATH:/bin:/usr/bin:/usr/local/bin

# make sure our backup tree exists
install -d $ARCHIVEROOT/$CURRENT

# our actual rsyncing function
do_rsync()
{
   rsync $OPTIONS $BACKUPDIR $ARCHIVEROOT/$CURRENT
}

# our post rsync accounting function
do_accounting()
{
   echo "Backup Accounting for Day $INCREMENTDIR on $HOSTNAME:">/tmp/rsync_script_tmpfile
   echo >> /tmp/rsync_script_tmpfile
   echo "################################################">>/tmp/rsync_script_tmpfile
   du -s $ARCHIVEROOT/* >> /tmp/rsync_script_tmpfile
   echo "Mail -s \"$HOSTNAME Backup Report\" $MAILADDR < /tmp/rsync_script_tmpfile"
   # the subject must be quoted and must come before the recipient address
   Mail -s "$HOSTNAME Backup Report" $MAILADDR < /tmp/rsync_script_tmpfile
   echo "rm /tmp/rsync_script_tmpfile"
   rm /tmp/rsync_script_tmpfile
}

# some error handling and/or run our backup and accounting
if [ -f $EXCLUDES ]; then
    if [ -d $BACKUPDIR ]; then
        # now the actual transfer
        do_rsync && do_accounting
    else
        echo "can't find $BACKUPDIR"; exit 1
    fi
else
    echo "can't find $EXCLUDES"; exit 1
fi
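
The increments come from two rsync options working together: --backup keeps the old copy of any file that is about to be replaced or deleted, and --backup-dir says where to put it. A small sketch of just that mechanism, using scratch directories, looks like this. The function name backup_once and its STAMP argument are made up for illustration; STAMP plays the role of the script's date-based INCREMENTDIR, and ARCHIVEROOT should be an absolute path.

```shell
# backup_once SRC ARCHIVEROOT STAMP: mirror SRC into ARCHIVEROOT/main and
# move anything that would be overwritten or deleted into ARCHIVEROOT/STAMP.
# Pass ARCHIVEROOT as an absolute path so --backup-dir is unambiguous.
backup_once() {
    src=$1
    root=$2
    stamp=$3
    mkdir -p "$root/main"
    rsync -a --delete --backup --backup-dir="$root/$stamp" "$src/" "$root/main/"
}
```

After a first run, main holds a full copy and no increment directory exists yet. After a file changes and the function runs again with a new STAMP, main holds the new version and ARCHIVEROOT/STAMP holds the version it replaced.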

Jan 05
What is Deny_Password_Crack?

Deny_Password_Crack is a simple script that parses /var/log/secure to find all login attempts and separates the failed attempts from the successful ones. It is intended to be run by Linux system administrators to help thwart SSH server attacks.

If you’ve ever looked at your ssh log, you may be alarmed to see how many hackers attempted to gain access to your server. Hopefully, none of them were successful. Wouldn’t it be better to automatically prevent such an attacker from continuing to try to gain entry into your system?
Deny_Password_Crack attempts to address the above.
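
To give a feel for the kind of parsing involved, here is a rough sketch that counts failed password attempts per source address. It is not the code of Deny_Password_Crack: the function name failed_by_host is made up, and it assumes the usual OpenSSH wording, in which a failed attempt is logged as "Failed password for USER from ADDRESS port ...".

```shell
# failed_by_host LOGFILE: print "count address" for every source address that
# produced "Failed password" lines from sshd, busiest first.
# Illustrative parser only; it is not taken from Deny_Password_Crack.
failed_by_host() {
    awk '/sshd/ && /Failed password/ {
             addr = ""
             # The address is the field after the last word "from" on the line.
             for (i = 1; i < NF; i++)
                 if ($i == "from") addr = $(i + 1)
             if (addr != "") count[addr]++
         }
         END { for (host in count) print count[host], host }' "$1" |
        sort -rn
}
```

Run against /var/log/secure as root, failed_by_host lists the noisiest addresses first, which is the raw material a blocking script needs.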

Where can I download Deny_Password_Crack from?

Deny_Password_Crack is available for download from here.

How do I configure cron for Deny_Password_Crack use?

Presumably, you will need to run deny_password_crack as root, so you must first become root. Once you are logged in as root, run the following command:

# crontab -e

The above command will launch the crontab editor. To launch deny_password_crack every 10 minutes you would then add the following line to the crontab:

*/10 * * * * /path/deny_password_crack.sh

For more information regarding the crontab format please see the crontab man page (man 5 crontab).

Will Deny_Password_Crack support  VSFTPD?

No, but I plan to add this feature in the next version. If you want it, subscribe to my blog's RSS feed for updates.

Will Deny_Password_Crack work with FreeBSD?

No, but I plan to add this feature in the next version. If you want it, subscribe to my blog's RSS feed for updates.

Need help?

If Deny_Password_Crack is unable to correctly parse your SSH server log when you run it, please email me (jason#goitworld.com, replacing # with @) the following information:

  1. SSH log entry showing a successful login
  2. SSH log entry showing a failed attempt of a valid user account (eg. root)
  3. SSH log entry showing a failed attempt of a non-existent user account (eg. blah)

I will try to respond to each support request that I receive. If I am able to help you I will be very glad.
