Cide by Cide: July 2014

Yes, I know, choosing this blog's name wasn't my finest moment.

A couple of weeks ago I had to create a process to cross-validate files. Each line in those files is identified by a key (which can be repeated), and we have to sum all the values for each key in each file and compare those sums.

So, I whipped up a Ruby script in a couple of hours, and it took me a few more hours to refine it. As a side note, the "Refinement To Do" list has been growing, but my priorities lie elsewhere, so this will stay at "good enough" for now.

Then, when I got home, I decided to replicate that Ruby script in C++. I didn't set out to change or improve the design (although I did end up changing one detail), just to replicate it.

And I was pleasantly surprised.

It also took me a couple of hours.
The C++ code was not noticeably "larger" than its Ruby counterpart.

Yes, you read it right. Not only was my "C++ productivity" on par with my "Ruby productivity", but also the resulting code was very similar, both in size and in complexity.

Let's take a look at it, shall we?

Important: The Ruby code, as presented in this post, may not be legal Ruby code, i.e., if copied and pasted into a script, it may cause errors. My concern here was readability, not correctness.

Paperwork, please

While the Ruby script needed no "paperwork", the C++ program needed some.

This was the only non-standard header I included:

#include <boost/algorithm/string.hpp>

using boost::split;
using boost::is_any_of;
using boost::replace_first_copy;

And these were the definitions I had in C++:

using Total = double;

Total const VALUE_DIFF = 0.0009;
Total const ZERO_DIFF = 0.0001;

using CardEntries = map<int, Total>;
using CardEntry = pair<int, Total>;
using SplitContainer = vector<string>;

#define NOOP

Support structures

We created a structure to gather the data necessary to process the files. Here it is in Ruby.

class TrafficFile
  attr_reader(:description, :filename,
              :key_position, :value_position, :calc_value)

  def initialize(description, filename, key_position,
                 value_position = 0, &calc_value)
    @description = description
    @filename = filename
    @key_position = key_position
    @value_position = value_position
    @calc_value = calc_value
  end
end

The key position is the number of the field in the file that contains the key. The value position is the number of the field that contains the value to add. Why do we also have a code block to calculate the value? Because it turned out that one of the validations had to be performed by adding three fields, instead of just one, so I decided to add the code block.

It would've been easier to turn value_position into an array, but I wanted to experiment with code blocks. So I did what I usually do in these situations - I set myself a deadline (in this case, 30 minutes); if I don't have something ready when the time comes, I switch to my alternative - usually simpler - design.
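For comparison, here is a minimal sketch of what the simpler "array of positions" alternative could look like. It is written in C++ rather than Ruby, the name SumFields is hypothetical, and it assumes the fields use a decimal comma, as the files in this post do. It is not code from the original script.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the original script): adds up the fields found
// at the given positions instead of calling a code block.
double SumFields(const std::vector<std::string>& fields,
                 const std::vector<std::size_t>& positions)
{
    double total = 0.0;
    for (std::size_t p : positions)
    {
        std::string text = fields.at(p);   // at() throws if p is out of range
        // Assumption: values use a decimal comma; this replaces every comma.
        std::replace(text.begin(), text.end(), ',', '.');
        total += std::stod(text);
    }
    return total;
}
```

With this design, the single-field validations would pass one position and the three-field validation would pass three. The script kept the code block instead, and so does the C++ structure that follows.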

And here is its counterpart in C++. I've eliminated the value_position/calc_value dichotomy and decided to use lambdas for every case.

template <typename T>
struct TrafficFile
{
    TrafficFile(string desc, string fn, short kp, T cv) :
        description{desc}, filename{fn}, key_position{kp},
        calc_value{cv} {}

    string description;
    string filename;
    short key_position;
    T calc_value;
};

Note: Yes, I'm passing objects by value (e.g., the strings above). Whenever I don't foresee an actual benefit in passing by reference, I'll stick to pass-by-value.
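As an aside, pass-by-value combines naturally with std::move when the parameter ends up stored in a member. The sketch below is a variant of the structure above under that assumption. The name TrafficFileMoved is hypothetical, and this is not how the original program was written.

```cpp
#include <string>
#include <utility>

// Illustrative variant (hypothetical name): parameters are still taken by
// value, then moved into the members so each string is not copied again.
template <typename T>
struct TrafficFileMoved
{
    TrafficFileMoved(std::string desc, std::string fn, short kp, T cv) :
        description{std::move(desc)}, filename{std::move(fn)},
        key_position{kp}, calc_value{std::move(cv)} {}

    std::string description;
    std::string filename;
    short key_position;
    T calc_value;
};
```

For a handful of short strings the difference is unlikely to be measurable, so the plain version is a perfectly reasonable choice here.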

Little Helpers

The filenames have a date, in the format MMYYYY. We extract this from the name of one particular file, which serves as an index to all the other files.

Here's the Ruby method.

def get_file_date(filename)
  return filename.split('_')[8].split('.')[0]
end

And here's the equivalent in C++, which is pretty much... well, equivalent. It's not a one-liner, but other than that, it's exactly the same - two splits.

string GetFileDate(string filename)
{
    SplitContainer c;
    split(c, filename, is_any_of("_"));
    split(c, c[8], is_any_of("."));
    return c[0];
}
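If Boost is not available, the same two splits can be approximated with the standard library alone. This is only a sketch: FieldAt and GetFileDateStd are hypothetical names, and the filename in the usage note is invented for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the zero-based index-th field of text,
// splitting on sep. Unlike boost::split, an empty trailing field is not
// reported, so asking for it throws.
std::string FieldAt(const std::string& text, char sep, std::size_t index)
{
    std::istringstream in{text};
    std::string field;
    for (std::size_t i = 0; std::getline(in, field, sep); ++i)
    {
        if (i == index)
            return field;
    }
    throw std::out_of_range{"field index not found"};
}

// Same idea as GetFileDate above: ninth '_' field, then the part before '.'.
std::string GetFileDateStd(const std::string& filename)
{
    return FieldAt(FieldAt(filename, '_', 8), '.', 0);
}
```

For a made-up name such as "a_b_c_d_e_f_g_h_072014.csv", GetFileDateStd returns "072014". A name with fewer than nine fields throws instead of reading past the end of the container.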

Main

Now, the entry point. This is where we have the greatest number of differences between Ruby and C++, because the C++ version always passes a lambda to calculate the total. Other than that, it's similar.

We're creating an object that contains all the data to describe each file being validated and to gather the data for that validation.

Here it is in Ruby.

file_date = get_file_date(ARGV[0])

tel_resumo =
  TrafficFile.new("Telemóvel (resumo)",
                  "#{file_date}_tlm.csv", 3, 15)
resumo =
  TrafficFile.new("Resumo", "#{file_date}_rsm.csv", 8, 13)

compare(tel_resumo, resumo)

detalhe =
  TrafficFile.new("Detalhe", "#{file_date}_det.csv", 4, 22)
tel_detalhe =
  TrafficFile.new("Telemóvel (detalhe)",
                  "#{file_date}_tlm.csv", 3) do |fields|
    fields[4].gsub(',', '.').to_f() +
    fields[8].gsub(',', '.').to_f() +
    fields[31].gsub(',', '.').to_f()
  end

compare(detalhe, tel_detalhe)

And here it is in C++.

int main(int argc, char *argv[])
{
    string file_date = GetFileDate(string{argv[1]});

    auto cv_tr = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[15], ",", "."));};
    TrafficFile<decltype(cv_tr)>
        tel_resumo{"Telemóvel (resumo)",
        file_date + "_tlm.csv", 3, cv_tr};

    auto cv_r = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[13], ",", "."));};
    TrafficFile<decltype(cv_r)>
        resumo{"Resumo", file_date + "_rsm.csv", 8, cv_r};

    Compare(tel_resumo, resumo);

    auto cv_d = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[22], ",", "."));};
    TrafficFile<decltype(cv_d)>
        detalhe{"Detalhe", file_date + "_det.csv", 4, cv_d};

    auto cv_td = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[4], ",", "."))
            + stod(replace_first_copy(c[8], ",", "."))
            + stod(replace_first_copy(c[31], ",", "."));};
    TrafficFile<decltype(cv_td)>
        tel_detalhe{"Telemóvel (detalhe)",
        file_date + "_tlm.csv", 3, cv_td};

    Compare(detalhe, tel_detalhe);
}
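Three of the four lambdas above differ only in the field position, so one possible refinement is a small factory that builds them. The sketch below is an assumption of how that might look, not part of the original program: ParseDecimal and FieldValue are hypothetical names, and it returns a std::function for simplicity, which adds a little indirection compared with a bare lambda.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using SplitContainer = std::vector<std::string>;

// Hypothetical helper: parses a number written with a decimal comma ("12,5").
double ParseDecimal(std::string text)
{
    std::replace(text.begin(), text.end(), ',', '.');
    return std::stod(text);
}

// Hypothetical factory: builds the "read one field" callable for a position.
std::function<double(SplitContainer const&)> FieldValue(std::size_t position)
{
    return [position](SplitContainer const& c)
        { return ParseDecimal(c.at(position)); };
}
```

With it, a declaration such as auto cv_tr = FieldValue(15); would replace the hand-written single-field lambda, while the three-field case would still be written out by hand.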

Validate

The validation is performed by comparing files in sets of two. For each file, we load the pair (key, total) into a container, and then we compare the totals for each key. Since there is no guarantee that every key is present in both files, when we find a key that exists in both containers, we remove that key from both.

This is the function that performs that comparison. We output every key whose totals differ between the two files.

In Ruby.

def compare(first, second)
  first_data =
    get_unified_totals(first.filename, first.key_position,
                       first.value_position, &first.calc_value)
  second_data =
    get_unified_totals(second.filename, second.key_position,
                       second.value_position, &second.calc_value)

  first_data.each() do |key, value|
    if second_data.has_key?(key)
      if (value - second_data[key]).abs() > 0.0009
        puts("ERRO! #{key} tem valores incoerentes: #{value}" +
             " e #{second_data[key]}")
      end

      first_data.delete(key)
      second_data.delete(key)
    end
  end

  check_remaining(first_data)
  check_remaining(second_data)
end

In C++.

template <typename T1, typename T2>
void Compare(T1 first, T2 second)
{
    CardEntries first_data =
        GetUnifiedTotals(first.filename, first.key_position,
            first.calc_value);
    CardEntries second_data =
        GetUnifiedTotals(second.filename, second.key_position,
            second.calc_value);

    for (auto it = first_data.cbegin();
        it != first_data.cend(); NOOP )
    {
        auto f = second_data.find(it->first);
        if (f != second_data.end())
        {
            if (fabs(it->second - f->second) > VALUE_DIFF)
            {
                cout << "ERRO! " << it->first
                    << " tem valores incoerentes: "
                    << it->second << " e " << f->second << " ("
                    << fabs(it->second - f->second)
                    << ")" << endl;
            }

            first_data.erase(it++);
            second_data.erase(f);
        }
        else
        {
            ++it;
        }
    }

    CheckRemaining(first_data);
    CheckRemaining(second_data);
}
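The loop above removes elements while iterating by using erase(it++). Since C++11, std::map::erase also returns the iterator that follows the removed element, which some readers find easier to follow. Here is a minimal sketch of the same matching step in that style. The names Totals and RemoveCommonKeys are hypothetical, and the function returns the mismatched keys instead of printing them so that it is easy to test.

```cpp
#include <cmath>
#include <map>
#include <vector>

using Totals = std::map<int, double>;

// Sketch: removes every key present in both maps and returns the keys whose
// totals differ by more than tolerance.
std::vector<int> RemoveCommonKeys(Totals& first, Totals& second,
                                  double tolerance)
{
    std::vector<int> mismatched;
    for (auto it = first.begin(); it != first.end(); )
    {
        auto found = second.find(it->first);
        if (found == second.end())
        {
            ++it;   // key only exists in first: keep it and move on
            continue;
        }
        if (std::fabs(it->second - found->second) > tolerance)
            mismatched.push_back(it->first);

        it = first.erase(it);   // erase returns the next valid iterator
        second.erase(found);
    }
    return mismatched;
}
```

After the call, each map holds only the keys that were missing from the other one, which is exactly what the next step needs.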

Since we remove all keys that exist in both containers, in the end, each container will have only the keys that didn't exist in the other container. We then use another function to validate these keys, which should all have a 0 total.

Here's Ruby.

def check_remaining(data_set)
  data_set.each() do |key, value|
    if (value - 0).abs() > 0.0001
      puts("AVISO! #{key} tem valor: #{value}")
    end
  end
end

Here's C++. Once again, note the code similarity, how the C++ code isn't any more complex than the Ruby code.

void CheckRemaining(CardEntries const& data_set)
{
    for (auto& item : data_set)
    {
        if (fabs(item.second - 0) > ZERO_DIFF)
        {
            cout << "AVISO! " << item.first
                << " tem valor: " << item.second << endl;
        }
    }
}

Adding up

This is the function that reads a file and creates a container with the key and its total, which is the sum of all the values for all the occurrences of the key. The file is a CSV, but since I'm on Ruby 1.8.7, I preferred not to use Ruby's CSV lib(s); after all, I just needed to split the line on a ";", so I did it "manually".

def get_unified_totals(filename, key_position, value_position = 0)
  totals = Hash.new(0)

  File.open(filename) do |file|
    # skip header
    file.gets()

    while line = file.gets()
      a = line.split(';')
      if block_given?()
        totals[a[key_position]] += yield(a)
      else
        totals[a[key_position]] +=
          a[value_position].gsub(',', '.').to_f()
      end
    end
  end

  return totals
end

The C++ version is a bit simpler, because we always pass a lambda.

template <typename T>
CardEntries GetUnifiedTotals(string filename, short key_position,
    T calc_value)
{
    ifstream inFile{filename};
    string line;
    getline(inFile, line, inFile.widen('\n')); // Skip header

    CardEntries totals;
    SplitContainer c;
    while(getline(inFile, line, inFile.widen('\n')))
    {
        split(c, line, is_any_of(";"));
        totals[stoi(c[key_position])] += calc_value(c);
    }

    return totals;
}
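The function above is tied to a file on disk and to Boost's split. As a sketch of the same "sum per key" idea that is easier to try out in isolation, the version below reads from any std::istream and splits with std::getline. SumByKey is a hypothetical name, and skipping empty lines is an added assumption that the original does not make.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Sketch: reads ';'-separated lines from in, skips the header line, and adds
// calc_value(fields) to the running total of the key at key_position.
template <typename T>
std::map<int, double> SumByKey(std::istream& in, std::size_t key_position,
                               T calc_value)
{
    std::string line;
    std::getline(in, line);   // skip header

    std::map<int, double> totals;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;         // assumption: blank lines are ignored

        std::vector<std::string> fields;
        std::istringstream row{line};
        for (std::string field; std::getline(row, field, ';'); )
            fields.push_back(field);

        // operator[] creates a missing key with total 0.0, like Hash.new(0).
        totals[std::stoi(fields.at(key_position))] += calc_value(fields);
    }
    return totals;
}
```

Passing a std::istringstream with a few sample lines is enough to exercise it, and passing a std::ifstream gives the file-based behaviour back.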

TL;DR

When you have a task to do, just pick the language you're most familiar/comfortable with, unless you have a very good reason not to. The much-famed "productivity gap" may not be as large as is usually advertised.

One thing I read often is that a good programmer is productive/competent/idiomatic with any language. I don't doubt it, although I do struggle a bit to accept some quirks (e.g., Ruby's "no return" policy).

However, I believe we all have a natural affinity towards some language(s)/style(s); that's our comfort zone, and that's where we'll have a tendency to be most productive. I'll write a bit more about this in my next post.

Cide by Cide

Sunday 6 July 2014

C++ and Ruby - Side by side