Counting lines of code on the command line

Yes, I know, counting lines of code is an evil way to assess a code base, but I still find it interesting in many cases. Here is a simple bash function to count the lines of code in files that have specific file extensions:

function loc
{
    if [ "$#" -lt 1 ]
    then
        local path="."
        local search_pattern=".*"
    else
        local path=$1
        shift

        if [ "$#" -lt 1 ]
        then
            local search_pattern=".*"
        else
            local search_pattern=".*/\(.*\.$1\)"

            shift

            for extension in "$@"
            do
                search_pattern="$search_pattern\|\(.*\.$extension\)"
            done
        fi
    fi

    find "$path" -type f -regex "$search_pattern" -print0 | wc -l --files0-from=- | sort -n
}

If you add this function to the .bashrc file in your home directory, you can type loc in a terminal to count the lines of code. If you do not provide any arguments when calling the function, all files in the current working directory (and, recursively, in its sub-directories) are counted. You can, however, specify a directory to search in as well as a list of file extensions to filter the files:

# count all files in the current directory (and recursively in the sub-directories)
loc
# count files in the directory called 'src' (and its sub-directories)
loc src/
# as above, but count lines of files ending with '.java' or '.py' only
loc src/ java py
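
For comparison, here is a rough Python sketch of the same idea. It is not a translation of the bash function: the name count_lines and its parameters are made up for this example, and it filters by file name suffix instead of building a regular expression for find.

```python
import os


def count_lines(path=".", extensions=()):
    """Return (line_count, file_path) pairs, sorted by line count."""
    suffixes = tuple("." + extension for extension in extensions)
    results = []
    for directory, _subdirs, files in os.walk(path):
        for name in files:
            # without any extensions, every file is counted
            if suffixes and not name.endswith(suffixes):
                continue
            file_path = os.path.join(directory, name)
            with open(file_path, "rb") as handle:
                # count newline characters, which is what `wc -l` does
                lines = sum(
                    chunk.count(b"\n")
                    for chunk in iter(lambda: handle.read(65536), b"")
                )
            results.append((lines, file_path))
    return sorted(results)
```

Calling count_lines("src", ["java", "py"]) would correspond to loc src/ java py, except that no grand total is computed.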

My bash prompt

If you are working a lot on the Linux command line like I do, you probably want to have a nice, fancy looking bash prompt that shows you some more information than the default one. Today, I am going to show you the prompt I am currently using. I got the ideas for it from different sources all over the internet, but I decided to implement it (and comment it!) on my own. If you want to give it a try, just copy the following lines to the .bashrc file in your home directory:

function update_prompt {
    # get information to be displayed in prompt
    hostname=$(hostname | tr -d '\n')
    username=$(whoami | tr -d '\n')
    working_directory=$(pwd | tr -d '\n' | sed "s:^$HOME:~:")
    date_time=$(date "+%H:%M")

    # compute size of prompt and number of fill characters
    local terminal_width=${COLUMNS}
    local promptsize=$(echo -n "--( $working_directory )--( $username @ $hostname )--" | wc -c)
    local fillsize=$(($terminal_width-$promptsize))

    fill=""

    # check if we have to truncate the current working directory
    if [ "$fillsize" -lt "0" ]
    then
        # working directory is too long to be fully displayed
        # -> cut off leading characters
        local cut_position=$((3-$fillsize))
        local length_working_directory=$(echo -n "$working_directory" | wc -c)
        working_directory="...${working_directory:cut_position:length_working_directory}"
    else
        # working directory is short enough to be fully displayed
        # -> create enough fill characters to align working directory to the right
        local fill_characters=""
        while [ "$fillsize" -gt "0" ]
        do
            fill="$fill-"
            fillsize=$(($fillsize-1))
        done
    fi

    local col_none="\[\033[0m\]"
    
    local col_yellow="\[\033[1;33m\]"
    local col_red="\[\033[0;31m\]"
    local col_green="\[\033[0;32m\]"

    local col_light_blue="\[\033[1;34m\]"
    local col_light_gray="\[\033[1;37m\]"
    local col_light_purple="\[\033[1;35m\]"
    local col_light_green="\[\033[1;32m\]"
    local col_light_turquois="\[\033[1;36m\]"

    PS1="$col_yellow--( $col_light_turquois$working_directory$col_yellow )-${fill}-( $col_red$username $col_yellow@ $col_light_purple$hostname$col_yellow )--\n$col_yellow--( $col_green$date_time$col_yellow )--> $col_none"
}

PROMPT_COMMAND=update_prompt

As you might have noticed, this is a two-line prompt where the first line scales to fit the width of the terminal. It tells you the current working directory, the username and hostname, and the current time. The cool thing is that the working directory is truncated if it is too long to be fully displayed. This is what it looks like:

Note how the working directory is truncated in the second line.
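
The width computation is easier to follow without the color codes. The following Python sketch mirrors the fill and truncation logic of update_prompt for the first line only; the function name first_prompt_line is invented for this example, and colors as well as the second line are left out.

```python
def first_prompt_line(working_directory, username, hostname, terminal_width):
    """Build the uncolored first prompt line for a given terminal width."""
    layout = "--( {wd} )-{fill}-( {user} @ {host} )--"
    # length of the line before any fill characters are added
    unfilled = layout.format(wd=working_directory, fill="", user=username, host=hostname)
    fill_size = terminal_width - len(unfilled)

    fill = ""
    if fill_size < 0:
        # too long: drop leading characters and make room for the "..."
        cut_position = 3 - fill_size
        working_directory = "..." + working_directory[cut_position:]
    else:
        # short enough: pad with dashes up to the terminal width
        fill = "-" * fill_size

    return layout.format(wd=working_directory, fill=fill, user=username, host=hostname)
```

As long as the directory name has enough characters to cut, the returned line is exactly terminal_width characters wide in both cases.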

Static/Dynamic Typing? Strong/Weak Typing?

In a book I am currently reading (I do not want to name the book here, because I do not want to discredit it; but it doesn't matter anyway) I stumbled upon the following paragraph about typing in programming languages:

The terms strong and weak typing are sometimes used to refer to statically typed and dynamically typed languages respectively.

Although the two are mixed up quite often, this is just plain wrong. These are totally different concepts (well, "totally" is probably a bit exaggerated, but I'm trying to make a point here), so let me clear things up.

Static and Dynamic Typing

Let's consider static and dynamic typing first, because the differences are quite easy to see. In a statically typed language like Java, the type of a variable (as indicated in the variable's declaration) is fixed, so the variable may only hold values of this specific type. For example, consider the following Java code:

int foo = 13;

The variable foo is declared of type int, so foo can only hold values of this type. The following snippet is therefore illegal and produces a compiler error:

int foo = 13;
foo = "hello, world!";   // type 'String' does not match foo's declared type

In a dynamically typed language like Python, however, a variable's type may change; it depends on the value the variable holds at a specific position in the code. For example, the following snippet is perfectly valid Python code:

foo = 13    # foo is now of type 'int'...
foo = "hello, world!"    # ... and now it is of type 'str'

If you insert type(foo) after each assignment above, you will see that the type of foo changes depending on the value it is currently holding.
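
Spelled out, the experiment could look like this (the names type_before and type_after are only there to keep both results around):

```python
foo = 13
type_before = type(foo)    # the type of the value foo holds right now

foo = "hello, world!"
type_after = type(foo)     # same variable, different type

print(type_before, type_after)   # <class 'int'> <class 'str'>
```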

Summing up, you can say that in a statically typed language the types are bound to variables, whereas in a dynamically typed language they are bound to the values. Both concepts have their own advantages and disadvantages. I myself definitely favor statically typed languages over dynamically typed ones, because if variables have a static type many errors can be detected at compile time, whereas with dynamic typing they are not detected until run time. However, programs written in dynamically typed languages tend to be more concise and less verbose.

By the way: not having to type a variable's type name does not mean that the language is dynamically typed; some programming languages like Scala use type inference (where possible) to deduce the type of a variable automatically. For example, the compiler is able to find out that foo has to be of type String in this Scala code snippet:

var foo = "hello, world!"

However, Scala is a statically typed language like Java, so foo may only hold String values here.

Strong and Weak typing

Unfortunately, there is no single perfect definition of strong and weak typing, and it is a gradual scale rather than a black-and-white classification. I will try to explain the differences by giving an example that shows how strongly and weakly typed languages differ. In general, a more strongly typed language makes it harder to "bypass" the type system, i.e. to use operations on "wrong" data types (there are other criteria for a language to be considered strongly or weakly typed, but I think this one is the most important). Therefore, strongly typed languages are considered to provide higher type safety than weakly typed languages.

We will now look at a code snippet written in C, a statically typed language that is considered weakly typed (see my point here?). If you write the following:

int foo = 13;
float bar = 3.;
float bazz = foo + bar;

the value of bazz becomes 16.0, as you would expect. The compiler added an implicit type conversion to convert the int value stored in foo to a corresponding temporary value of type float. However, if you write the following lines of code:

int foo = 13;
float bar = 3.;
float bazz = *((float *)(&foo)) + bar;

the value of bazz is no longer the 16.0 you would expect. So, what did I do here? By writing ((float *)(&foo)) I took the address of foo and reinterpreted it as the address of a value of type float (using the explicit cast (float *)). Dereferencing this address with the * operator treats the value at this memory location as a float, although it is actually an int (the bit sequence at this location represents the int value 13, which is not the representation of the float value 13). On a typical platform with 32-bit int and IEEE 754 float, these bits denote a tiny float of roughly 1.8e-44, so bazz ends up as 3.0 instead of 16.0. Strictly speaking, the C standard even leaves the behavior of this access undefined, so a compiler is free to produce something else entirely.
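
If you want to look at the reinterpreted value without relying on a C compiler, Python's struct module can replay the experiment. This is only a sketch under the assumption of a 32-bit int, an IEEE 754 single-precision float and little-endian byte order, which matches most desktop platforms but is not guaranteed by C.

```python
import struct

# the four bytes that store the 32-bit int value 13 (little-endian)
raw_bytes = struct.pack("<i", 13)

# read the very same bytes as a single-precision float
as_float = struct.unpack("<f", raw_bytes)[0]

print(raw_bytes.hex())   # 0d000000
print(as_float)          # a tiny positive number, roughly 1.8e-44
print(as_float == 13.0)  # False
```

The bytes are identical in both cases; only the type used to read them differs.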

This example might look a bit contrived, but it shows an important point: a weakly typed language allows you to circumvent type safety, whether it makes sense or not. A more strongly typed language prevents you from doing so. For example, in Java a type cast is only possible if source and target type are in an inheritance relationship (or convertible in the case of primitive types), and a cast results in a run-time exception if a type safety violation is detected.
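
Python, the dynamically typed language from above, is commonly placed on the strong side of this scale, which underlines that the two classifications are independent. A small sketch (the variable name result is arbitrary):

```python
try:
    # mixing str and int is rejected instead of being silently reinterpreted
    result = "13" + 3
except TypeError as error:
    result = None
    print("rejected:", error)

# an explicit conversion is required to make the intent clear
print(int("13") + 3)   # 16
```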

tl;dr

Never mix up static/dynamic typing with strong/weak typing; they are different things. A language is considered statically typed if variables have a fixed type and may therefore only hold values of this specific type, whereas in a dynamically typed language a variable's type may change. On the other hand, strongly typed languages provide higher type safety than weakly typed languages by restricting the ways in which you can access the values in memory.

Re-design of my web site

Today, I re-designed parts of my web site. The page now uses a single-column layout and I replaced the blue header by a gray one.

Generating graphs for git statistics

Sometimes, it might be interesting to look at some statistics for a git repository, e.g. the number of commits per day or the number of inserted and deleted lines. Looking at endless rows of numbers, however, is often not very satisfying, and a graphical output showing this information would be much better.

Here are two small bash scripts I wrote for doing exactly that - producing nice graphs showing some key statistics for a given git repository. These scripts call git log with certain parameters to generate the data to be displayed visually. git's output is then processed by some standard Unix tools like awk to generate a data file, which in turn is fed to gnuplot to generate a PDF plot. For a project I am currently working on, these are the resulting plots generated by these two scripts:

The first script generates a plot showing the number of commits per day. Note that the plot's x axis contains major tics for weeks (beginning on Mondays) and minor tics for single days.

#!/bin/bash

# get path to git repository
git_path='./.git'
if [[ $# -ge 1 ]]; then
    git_path="$1/.git"
fi

# check if git repository exists
if [ ! -d "$git_path" ]; then
    echo "there is no git repository in $git_path"
    exit 1
fi

# generate the data needed for the plot
# (sort first: uniq -c only merges adjacent lines; newest date stays on top)
git_data=$(git --git-dir="$git_path" log --date=short --all --pretty=format:'%ad' | sort -r | uniq -c | awk '{print $2 " " $1}')

# get the date of the monday before the first commit
# we will use this to separate weeks in the generated plot
day_first_commit=$(echo "$git_data" | tail -n1 | awk '{print $1}')
monday_before_first_commit=$(date -d "$day_first_commit -$(date -d $day_first_commit +%u) days + 1 day" "+%Y-%m-%d")

# pipe data to a gnuplot script to produce a pdf plot
echo "$git_data" | gnuplot -e "\
    set xdata time;\
    set timefmt '%Y-%m-%d';\
    set boxwidth 86400;\
    set terminal pdf enhanced font 'TexGyreSchola,9';\
    set xtics format '%b %d' rotate by 55 right;\
    set style fill solid border lc rgb '#D47400';\
    set xrange [\"$monday_before_first_commit\":];\
    set xtics \"$monday_before_first_commit\", 604800 scale 3, 1 nomirror;\
    set tics front;\
    set yrange [0:];\
    set grid lc rgb '#666666';\
    plot '<cat' using (timecolumn(1)+24*60*60/2):2 with boxes title '' lc rgb '#FFB823';"

The second script shows the number of inserted and deleted lines per day, boxes for both numbers being "stacked" such that the total number of changes can be seen. Again, major tics mark the beginning of weeks, whereas minor tics mark single days.

#!/bin/bash

# get path to git repository
git_path='./.git'
if [[ $# -ge 1 ]]; then
    git_path="$1/.git"
fi

# check if git repository exists
if [ ! -d "$git_path" ]; then
    echo "there is no git repository in $git_path"
    exit 1
fi

# generate the data needed for the plot
git_data=$(git --git-dir="$git_path" log --all --date=short --numstat -C --format=format:'%ad')$'\n'

# sum up all inserted and deleted lines for each commit
# after this step git_data contains one row per commit of the following format: "date insertions deletions"
read_date=true
git_data=$(echo "$git_data" | while read line ; do
    # check if the current line belongs to the same commit as the line before (i.e. line is not empty)
    if [ -n "$line" ] ; then

        # check if we have to read the date (in the first line for each commit)
        if [ "$read_date" = true ] ; then
            # read date
            date=$line
            read_date=false
        else
            # sum up number of inserted and deleted lines
            insertions=$((insertions+$(echo $line | cut -d' ' -f1)))
            deletions=$((deletions+$(echo $line | cut -d' ' -f2)))
        fi

    else
        # log for commit done -> output and reset variables
        echo "$date $insertions $deletions"
        insertions=0
        deletions=0
        read_date=true
    fi
done)

# get the date of the monday before the first commit
# we will use this to separate weeks in the generated plot
day_first_commit=$(echo -e "$git_data" | tail -n1 | awk '{print $1}')
monday_before_first_commit=$(date -d "$day_first_commit -$(date -d $day_first_commit +%u) days + 1 day" "+%Y-%m-%d")

# sum up all inserted and deleted lines for each _day_
git_data=$(echo "$git_data" | awk '{day_insert[$1]+=$2; day_delete[$1]+=$3} END { for (d in day_insert) print d " " day_insert[d] " " day_delete[d] }')

# write data to a temporary file such that it can be read by gnuplot multiple times
temporary_file=".DATA_FILE.dat"
echo "$git_data" > $temporary_file

echo "$git_data" | gnuplot -e "\
    set xdata time;\
    set timefmt '%Y-%m-%d';\
    set boxwidth 86400;\
    set terminal pdf enhanced font 'TexGyreSchola,9';\
    set xtics format '%b %d' rotate by 55 right;\
    set style fill solid border lc rgb '#282828';\
    set xrange [\"$monday_before_first_commit\":];\
    set xtics \"$monday_before_first_commit\", 604800 scale 3, 1 nomirror;\
    set tics front;\
    set yrange [0:];\
    set grid lc rgb '#666666';\
    plot '$temporary_file' using (timecolumn(1)+24*60*60/2):(\$2+\$3) with boxes title '' lc rgb '#FD2106', '$temporary_file' using (timecolumn(1)+24*60*60/2):2 with boxes title '' lc rgb '#79BA00';"

# remove the temporary file
rm $temporary_file

When running these scripts, a path to a git repository can be specified. If no argument is given, the current working directory is analyzed. The script writes a PDF file to stdout, so you might want to redirect the output to a file or pipe it into a PDF viewer:

$ ./git_commits.sh path_to_repo > stats.pdf
$ ./git_commits.sh path_to_repo | zathura -
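
The per-day aggregation of the second script can also be sketched in Python. The functions below are illustrative and not part of the scripts; they assume the layout the bash loop expects (a date line, then tab-separated numstat rows, then an empty line), and unlike the bash loop they skip binary files, which git reports with "-" instead of numbers.

```python
from collections import defaultdict
from datetime import date, timedelta


def changes_per_day(log_text):
    """Sum insertions and deletions per day from `git log --numstat` text."""
    totals = defaultdict(lambda: [0, 0])
    current_day = None
    for line in log_text.splitlines():
        if not line.strip():
            current_day = None            # empty line: this commit is done
        elif current_day is None:
            current_day = line.strip()    # first line of a commit is its date
        else:
            added, deleted, _path = line.split("\t", 2)
            if added != "-":              # "-" marks a binary file
                totals[current_day][0] += int(added)
                totals[current_day][1] += int(deleted)
    return {day: tuple(counts) for day, counts in totals.items()}


def monday_on_or_before(day):
    """Return the Monday of the week containing the ISO date `day`."""
    parsed = date.fromisoformat(day)
    return (parsed - timedelta(days=parsed.weekday())).isoformat()
```

monday_on_or_before plays the role of the nested date -d call: a Monday maps to itself, any other day to the Monday before it.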

How to draw a B-Tree using Dot

B-Trees are a really cool data structure. Basically, B-Trees are a special form of search trees, where a node contains multiple keys. A B-Tree has the following properties (where g is a parameter that has to be chosen):

Recently, I needed to draw B-Trees for a presentation. I have often used the Dot tool from the GraphViz package to draw graphs and trees. Unfortunately, Dot does not support drawing B-Trees directly. After some googling and some experiments, however, I found a solution which produces pretty neat results:

digraph {
    graph [margin=0, splines=line];
    edge [penwidth=2];
    node [shape = record, margin="0.03,1.2", penwidth=2, style=filled, fillcolor=white];

    node0[label = "<f0> &bull; | &nbsp;2&nbsp; | <f1> &bull; | &nbsp;6&nbsp; | <f2> &bull; | &nbsp;15&nbsp; | <f3> &bull;"];
    node1[label = "<f0> &bull; | &nbsp;0&nbsp; | <f1> &bull; | &nbsp;1&nbsp; | <f2> &bull;"];
    node2[label = "<f0> &bull; | &nbsp;3&nbsp; | <f1> &bull; | &nbsp;4&nbsp; | <f2> &bull;"];
    node3[label = "<f0> &bull; | &nbsp;7&nbsp; | <f1> &bull; | &nbsp;8&nbsp; | <f2> &bull; | &nbsp;9&nbsp; | <f3> &bull; | &nbsp;12&nbsp; | <f4> &bull;"];
    node4[label = "<f0> &bull; | &nbsp;16&nbsp; | <f1> &bull; | &nbsp;17&nbsp; | <f2> &bull;"];

    node0:f0 -> node1;
    node0:f1 -> node2;
    node0:f2 -> node3;
    node0:f3 -> node4;
}

The code shown above produces the following output:

I think this looks pretty good, and the cool thing is that you can generate the Dot description automatically if you have a working implementation of a B-Tree.
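
As a minimal sketch of such a generator, assume a B-Tree node simply stores a list of keys and a list of child nodes. The Node class and the to_dot function are made up for this example and would have to be adapted to a real implementation; the node margin attribute is left out for brevity.

```python
class Node:
    """Hypothetical B-Tree node: sorted keys plus child nodes (empty for leaves)."""

    def __init__(self, keys, children=()):
        self.keys = list(keys)
        self.children = list(children)


def to_dot(root):
    """Return a Dot description with one record-shaped node per B-Tree node."""
    lines = [
        "digraph {",
        "    graph [margin=0, splines=line];",
        "    edge [penwidth=2];",
        "    node [shape = record, penwidth=2, style=filled, fillcolor=white];",
    ]
    edges = []
    names = []

    def visit(node):
        name = "node{}".format(len(names))
        names.append(name)
        # alternate pointer fields (<f0>, <f1>, ...) and keys
        parts = ["<f0> &bull;"]
        for index, key in enumerate(node.keys, start=1):
            parts.append("&nbsp;{}&nbsp;".format(key))
            parts.append("<f{}> &bull;".format(index))
        label = " | ".join(parts)
        lines.append('    {}[label = "{}"];'.format(name, label))
        for index, child in enumerate(node.children):
            edges.append("    {}:f{} -> {};".format(name, index, visit(child)))
        return name

    visit(root)
    return "\n".join(lines + edges + ["}"])
```

Building the example tree from above as Node([2, 6, 15], [Node([0, 1]), Node([3, 4]), Node([7, 8, 9, 12]), Node([16, 17])]) and passing it to to_dot yields the same node labels and edges as the hand-written description.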

Checking winning numbers like a CS student

Most web pages are written primarily for people rather than machines. Extracting information from a web page so that it can be processed further by an application is therefore quite tedious in most cases. However, if you are able to extract the right parts of an HTML page, you can build pretty cool things that make your life easier (well, sort of...).

For several years now, it has been kind of a family tradition to buy Advent calendars (for those of you who have never heard of this: basically, such a calendar has 24 little doors hiding small pieces of chocolate as a "countdown" until Christmas) from a charity organization in cooperation with the city of Schwabach (the town I grew up in). The cool thing about this special one is that each calendar has a unique winning number printed on it. Each day about ten numbers are drawn, and you can win pretty cool stuff if you have the right number on your calendar (btw: all of these prizes are contributed by local companies).

The winning numbers that were drawn are published in the local newspaper and on the charity organization's web page. Since I do not receive the newspaper myself and since I did not want to check the web page manually each and every day, I decided to write a small bash script that checks the winning numbers automatically. I wanted the script to notify me via email if one of our numbers was drawn, and therefore it was necessary to extract the winning numbers along with some other information (the prize, where you can pick it up, and so on) from the web site.

For those of you who are interested in my script, this is what it looks like:

#!/bin/bash

for limit in '0' '5' '10' '15' '20' '25' ; do
    wget -q -O - "http://www.lions-schwabach.de/index.php?option=com_content&view=category&layout=blog&id=45&Itemid=169&limitstart=$limit" | tr -d '\n\r' | grep -oP "<tr><td height.*?ff0000;\">[0-9]*<\/span>.*?</tr>" --color=never | sed 's/  */ /g' | sed 's/\&amp;/\&/g' | awk -F "</td>" 'function extr(str) { match(str, />[^<>]*<\/span>/); return substr(str, RSTART+1, RLENGTH-8); } {printf("%4d : %s von %s (Wert: %s Euro)\n", extr($2), extr($3), extr($5), extr($4));}'
done

Well, that's pretty nasty, isn't it? Of course I know that this script is "write-only" and that you could solve this problem much more easily and cleanly using more sophisticated tools. However, it was quite fun to write this script, and isn't this what it is all about?

I do not want to go into much detail here, but basically the script consists of a loop (which is necessary because the winning numbers are distributed among multiple pages) and inside the loop a single HTML page is downloaded via wget and processed by some standard tools like awk, grep and sed to extract the desired information.
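
To illustrate the extraction step with one of those more sophisticated tools, here is a small Python sketch. Note that SAMPLE_HTML is an invented stand-in and not the real markup of the page, and that the assumed cell order (number, prize, value, sponsor) is only derived from the printf call in the script.

```python
import re

# invented example row plus a header row; the real page looks different
SAMPLE_HTML = (
    "<table>\n"
    "<tr><td>Number</td></tr>\n"
    "<tr><td><span>0042</span></td><td><span>Voucher A &amp; B</span></td>"
    "<td><span>50</span></td><td><span>Example Ltd</span></td></tr>\n"
    "</table>"
)


def extract_rows(html):
    """Return (number, prize, value, sponsor) tuples found in table rows."""
    # join everything into one line, like `tr -d '\n\r'` in the script
    html = html.replace("\n", "").replace("\r", "")
    rows = []
    for row in re.findall(r"<tr>(.*?)</tr>", html):
        cells = re.findall(r"<span[^>]*>([^<>]*)</span>", row)
        # keep only rows that start with a winning number
        if len(cells) == 4 and cells[0].isdigit():
            number, prize, value, sponsor = cells
            rows.append((int(number), prize.replace("&amp;", "&"), value, sponsor))
    return rows


for number, prize, value, sponsor in extract_rows(SAMPLE_HTML):
    print("%4d : %s von %s (Wert: %s Euro)" % (number, prize, sponsor, value))
```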

I wrapped this snippet in a few more lines of code that filter the resulting list for our own winning numbers and compare the new list to the old one to find out if one of us has won a prize recently (in which case an email is sent to my address). This script is executed once every day using a cron job running on my Raspberry Pi. Now I only have to wait for the first email to come...

Responsive Design

My web site has a Responsive Design now, i.e. the layout of the page should now adapt to the screen size. Try resizing your browser window or using a mobile device. If the size of the screen is under a certain threshold, the layout changes and the sidebar on the left hand side disappears.

Relaunch of my web site

As several times before, I recently relaunched my web site. There are two major changes:

I am really looking forward to publishing articles on this newly designed and implemented web site.