Sunday, 24 August 2014

Productivity - Cakes and lies

In my "C++ and Ruby - Side by side" post, I mentioned having a natural affinity towards a programming language. Today, I'll elaborate on that, using something I've worked on recently to illustrate.
 
Important note: Nothing I write below is an indictment/endorsement of this or that language. What I'm going for is the exact opposite - whenever you read a book/article about a language (any language) that promises greater productivity just by switching to that language, you should be suspicious.
 

The problem

I've never liked the way IDEs (in my case, Visual Studio/Qt Creator) define their project structures, and I've been trying to set up my own structure, with the goal of supporting multiple tools.
 
Unfortunately, the tools prefer their own particular structure arrangements, and aren't very cooperative with anything that strays from their comfort zone.
 
So, in the end, I've settled for a two-step procedure - I create the projects using the IDEs and then I run a process I created to move the project to my structure. The moving part is easy, and I could do it manually. The tinkering with the project settings files is more tricky, and is what ended up giving me the final push to look at automating the whole thing.
 
I've had three attempts at creating this process.
 

Perl

The first version of this process was a Perl script. Why Perl? Well, because... scripting language... well-suited to small hacks... higher productivity... y'know?
 
It was easy to create a linear script. However, a few weeks later, when I wanted to introduce some changes, it was also easy to get lost navigating around said script, trying to get a grasp of what I had done.
 
So, I went through several redesign/refactor iterations, trying to move from a linear script to something a bit more structured.
 
I finally settled on a version that lasted several months, with Modern::Perl and Moose as foundations. As I used this script, I became aware of several weaknesses in my original design. So, a few weeks ago, I decided to review my design, and proceeded to change the script. And, again, I was lost. Even more so than in the linear script, in fact.
 
I've had to review all the scripts/modules I created in order to understand again what I had done at that time (docs? Come on, it's a simple script to create some directories, copy a few files, edit a couple of those, and init git). And, since I had to do that, I've decided to have another go at it, but this time in...
 

Ruby

Why Ruby? Well, because... scripting language... well-suited to small hacks... higher productivity... y'know?
 
It began well. I removed some syntactic quirks, especially where OO-ness was concerned. Strange as it may seem, a clean language presents a greater potential for a clear design. Maybe it's just me, maybe I'm easily distracted by these syntactic quirks, which I admit should not be quite so important.
 
Long story short, it began well, but... I started having more and more difficulty implementing my design. I finally decided to try...
 

C++

Yes, predictable, I know...
 
Why? Because I've decided that maybe "lower productivity" was worth a shot.
 
And it went smoothly. Not "easily", not "simply"; but definitely smoothly. I've actually finished the process with the design I wanted. I can actually navigate around the code, easily finding what I'm looking for.
 
What was I aiming at? Here, look at my "main()":
 
void Run(int argc, char *argv[])
{
    AppOptions<ConfigProjectOptions>
        ao{argc, argv, "Opções ProjectConfig"};
 
    if (ao.HaveShownHelp())
    {
        return;
    }
 
    ConfigProjectOptions const& opt = ao.GetOptions();
    string project_name = opt.GetProjectName();
 
    // All objects are validated on construction.
    ProjectDirectory prj_dir(PROJ_PRJ, project_name, 
        STRUCT_PRJ, SHOULD_NOT_EXIST);
    ProjectDirectory bld_dir(PROJ_BLD, project_name, 
        STRUCT_BLD, SHOULD_NOT_EXIST);
    ProjectDirectory stg_dir(PROJ_STG, project_name, 
        STRUCT_STG, SHOULD_EXIST);
 
    Project<QtcStgValidator, QtcCopier, QtcProjectConfigUpdater>
        qtc_project{project_name, prj_dir, bld_dir, stg_dir};
    Project<MsvcStgValidator, MsvcCopier, MsvcProjectConfigUpdater>
        msvc_project(project_name, prj_dir, bld_dir, stg_dir);
 
    // Everything is valid, get user confirmation.
    if (!UserConfirms(project_name))
    {
        return;
    }
 
    prj_dir.CreateStructure();
    bld_dir.CreateStructure();
 
    if (opt.WantGit())
    {
        ConfigureGit(prj_dir.GetProjectHomeDir());
    }
 
    if (opt.IsQtcProject())
    {
        qtc_project.Copy();
    }
 
    if (opt.IsMsvcProject())
    {
        msvc_project.Copy();
    }
}

This is what I was after all along, but was unable to achieve either with Perl or Ruby. It's as clean as it gets, with the main classes clearly identified, based on the operations that I need to do, and with support classes implementing policies that actually take care of the different ways things are done.
 
The code itself is quite simple (this is a trivial program, after all), and you can find it here.
 
I probably could have achieved the same with Ruby, but this is what I'm talking about when I say "natural affinity". With C++, this code structure flowed naturally; with Ruby, not so much.
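
To make the "support classes implementing policies" idea concrete, here's a minimal sketch of the technique (policy-based design). It is not the actual program: DemoValidator, DemoCopier, DemoUpdater and their member functions are made-up names for illustration, and this Project only takes a name, while the real one also takes the three directories.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policies: each one supplies a single static operation.
// Names and bodies are illustrative assumptions, not the real classes.
struct DemoValidator
{
    static bool IsValid(std::string const& name) { return !name.empty(); }
};

struct DemoCopier
{
    static std::string Copy(std::string const& name)
    {
        return "copied " + name;
    }
};

struct DemoUpdater
{
    static std::string Update(std::string const& name)
    {
        return "updated " + name;
    }
};

// The host class fixes the order of the steps; the policies decide how
// each step is carried out.
template <typename Validator, typename Copier, typename Updater>
class Project
{
public:
    explicit Project(std::string name) : name_{std::move(name)}
    {
        // Validate on construction, so an invalid project never exists.
        if (!Validator::IsValid(name_))
        {
            throw std::invalid_argument{"invalid project name"};
        }
    }

    // Runs the copy step, then the config update step, and returns a
    // description of what each policy did.
    std::vector<std::string> Copy() const
    {
        return {Copier::Copy(name_), Updater::Update(name_)};
    }

private:
    std::string name_;
};
```

Supporting another tool would then mean writing three new policy classes and instantiating Project with them, without touching the host class.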
 

What's all this about, then?

I'm repeating myself, but I'll say it anyway.
 
Don't trust productivity promises at face value, especially for quick hacks/trivial programs, where everyone says shell/scripting languages are the best choice. Sometimes, the cake is actually a lie.
 
If all you want is to get a count of a particular string/regex in a log file, grep is the way to go. But say you need to do some manipulation on the results - e.g., the customer finds out he doesn't need a simple total count, but rather a list of totals for each key (e.g., customer ID). Suddenly, you're reading man pages/docs and searching the web, finding awk/perl/whatever "solutions" that don't quite give you what you want; so, you fiddle with those solutions and read some more man pages/docs.
 
And, then, you look at the clock, see how much time has passed and say - if I had fired up MS Access, I'd probably have written a little VBA, finished this and moved on (yes, "MS Access" and "VBA" are just examples, not endorsements).
 
At the end of the day, just because it's the best option for someone else, doesn't mean it's the best option for you.
 

Saturday, 16 August 2014

Still here, still going

I'm back, after another long absence.
 
I went on vacation, which meant the weeks before were spent on the pre-vacation rush, where you work like crazy to leave things in a state that can then be managed by the rest of the team.
 
After a few days winding down, I've started coding again, mainly tackling code puzzles in C++. I've also been refactoring (and redesigning) the small program I was going to use for the "next post" I promised in my last post (no, it's not forgotten). And what else?
 

Building GCC

I'll kick this off by saying: Hats off to the folks behind GCC! Well done! I've built it a couple of times, with different configurations, on a CentOS VM, and it was totally effortless, just fire-and-forget. This time, instead of downloading and building each prerequisite by itself, I decided to use all the automation available. So, I reread the docs more carefully, and found this (which I had missed the first time around):
Alternatively, after extracting the GCC source archive, simply run the ./contrib/download_prerequisites script in the GCC source directory. That will download the support libraries and create symlinks, causing them to be built automatically as part of the GCC build process. Set GRAPHITE_LOOP_OPT=yes in the script if you want to build GCC with the Graphite loop optimizations.

GRAPHITE_LOOP_OPT=yes is the default, but it doesn't hurt to check it before running the script.
 
I've also noted that it is now possible for all prerequisites to be built together with GCC. A few months back, when I first did it, this was possible only for GMP, MPFR, and MPC (or so the docs said, at the time).
 

Linux distros

After playing around with a CentOS VM, I've decided I needed something else.
 
What did I need?
 
Something more "bleeding edge", that gave me simple access to more up-to-date software. Simple as in "no need to set up every repo known to man" and, at the same time, as in "no need to manually configure every damn piece of software on the system". After some research, I chose Fedora 20.
 
Why did I need it?
 
I'm going to start taking a closer look at some open-source software and, while I've become quite comfortable with building OSS on Windows, I'd rather just install it from a repository.
 
Couldn't I get by with CentOS?
 
Not really. I've decided to use FileZilla as a pilot for this, and build it. Even after adding some non-official repositories on CentOS (e.g., EPEL), it was still a pain to get all the necessary packages in the required versions. On Fedora? I was running make in a matter of minutes.
 
I may have to go for a memory upgrade, since I didn't design my system specs with virtualization in mind. But, for now, I'll make do without it.
 
I did have to install a different desktop. Not only did I find GNOME Shell (Fedora's default) non-intuitive, but it was also resource-consuming. Fedora responded a lot slower than CentOS, and both VMs had the same characteristics. I switched to MATE, and it's much more responsive.
 
I understand the need for a unified desktop experience across all devices, and I accept it brings an advantage both to the (average) user and the developer. However, not only am I perfectly capable of dealing with different paradigms on different devices, I actually prefer it that way. AFAIC, it makes sense that different devices require different experiences. On a desktop, GNOME Shell doesn't make sense for me; just like Unity, on Ubuntu, didn't; same for Windows 8 (although, to its credit, MS is making corrections). But with Linux we can, at least, switch desktops.
 
Anyway...
 
I expect to be absent again for a few weeks, since I'm going to enter the post-vacation rush, where you work like crazy to pick up everything that was left behind while on vacation.
 

Sunday, 6 July 2014

C++ and Ruby - Side by side

Yes, I know, choosing this blog's name wasn't my finest moment.
 
A couple of weeks ago I had to create a process to cross-validate files. Each line in those files is identified by a key (which can be repeated), and we have to sum all the values for each key in each file and compare those sums.
 
So, I whipped up a Ruby script in a couple of hours, and it took me a few more hours to refine it. As a side note, the "Refinement To Do" list has been growing, but my priorities lie elsewhere, so this will stay at "good enough" for now.
 
Then, when I got home, I decided to replicate that Ruby script in C++. I didn't go for changing/improving the design (although I did end up changing one detail), just replicating it.
 
And I was pleasantly surprised.
  • It also took me a couple of hours.
  • The C++ code was not noticeably "larger" than its Ruby counterpart.
 
Yes, you read it right. Not only was my "C++ productivity" on par with my "Ruby productivity", but also the resulting code was very similar, both in size and in complexity.
 
Let's take a look at it, shall we?
 
Important: The Ruby code, as presented in this post, may not be legal Ruby code, i.e., if copy/pasted into a script, it may cause errors. My concern here was readability, not correctness.
 

Paperwork, please

While the Ruby script needed no "paperwork", the C++ program needed some.

This was the only non-standard header I included:
 
#include <boost/algorithm/string.hpp>
using boost::split;
using boost::is_any_of;
using boost::replace_first_copy;

And these were the definitions I had in C++:
 
using Total = double;
Total const VALUE_DIFF = 0.0009;
Total const ZERO_DIFF = 0.0001;
 
using CardEntries = map<int, Total>;
using CardEntry = pair<int, Total>;
using SplitContainer = vector<string>;
 
#define NOOP
 

Support structures

We created a structure to gather the data necessary to process the files. Here it is in Ruby.
 
class TrafficFile
  attr_reader(:description, :filename,
    :key_position, :value_position, :calc_value)
 
  def initialize(description, filename, key_position,
    value_position = 0, &calc_value
  )
    @description = description
    @filename = filename
    @key_position = key_position
    @value_position = value_position
    @calc_value = calc_value
  end
end

The key position is the field number on the file containing the key. The value position is the field number on the file containing the value to add. Why do we also have a code block to calculate the value? Because it turned out that one of the validations had to be performed by adding three fields, instead of just one, so I decided to add the code block.
 
It would've been easier to turn value_position into an array, but I wanted to experiment with code blocks. So, I've done what I usually do in these situations - I set myself a deadline (in this case, 30 minutes); if I don't have something ready when the time comes, I switch to my alternative - usually, simpler - design.
 
And here is its counterpart in C++. I've eliminated the value_position/calc_value dichotomy, and decided to use lambdas for every case.
 
template <typename T>
struct TrafficFile
{
    TrafficFile(string desc, string fn, short kp, T cv) :
        description{desc}, filename{fn}, key_position{kp},
        calc_value{cv} {}
 
    string description;
    string filename;
    short key_position;
    T calc_value;
}; 
 
Note: Yes, I'm passing objects by value (e.g., the strings above). Whenever I don't foresee an actual benefit in passing by reference, I'll stick to pass-by-value.
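
As a side sketch (a variation, not what the program above does): when parameters are taken by value, they can be moved into the members, and a small factory function lets the compiler deduce the lambda's type. MakeTrafficFile is a hypothetical helper name, introduced here only for illustration.

```cpp
#include <string>
#include <utility>

template <typename T>
struct TrafficFile
{
    // Parameters taken by value are moved into the members, so a caller
    // passing a temporary does not pay for an extra copy.
    TrafficFile(std::string desc, std::string fn, short kp, T cv) :
        description{std::move(desc)}, filename{std::move(fn)},
        key_position{kp}, calc_value{std::move(cv)} {}

    std::string description;
    std::string filename;
    short key_position;
    T calc_value;
};

// Hypothetical helper (an assumption, not part of the original program):
// deduces T from the callable, so the call site needs no decltype.
template <typename T>
TrafficFile<T> MakeTrafficFile(std::string desc, std::string fn,
    short kp, T cv)
{
    return TrafficFile<T>{std::move(desc), std::move(fn), kp,
        std::move(cv)};
}
```

With this, a declaration could read auto resumo = MakeTrafficFile("Resumo", file_date + "_rsm.csv", 8, cv_r); instead of spelling out TrafficFile<decltype(cv_r)>.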
 

Little Helpers

The filenames have a date, in the format MMYYYY. We extract this from the name of one particular file, which serves as an index to all the other files.
 
Here's the Ruby method.
 
def get_file_date(filename)
  return filename.split('_')[8].split('.')[0]
end
 
And here's the equivalent in C++, which is pretty much... well, equivalent. It's not a one-liner, but other than that, it's exactly the same - two splits.
 
string GetFileDate(string filename)
{
    SplitContainer c;
    split(c, filename, is_any_of("_"));
    split(c, c[8], is_any_of("."));
    return c[0];
}
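
For readers without Boost at hand, here's a rough standard-library-only sketch of the same two splits. SplitOn is a hypothetical helper, and the filename in the usage note below is made up, since the real naming scheme isn't shown here; the only assumption carried over from the code above is that the date sits in the ninth "_"-separated field.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Splits text on a single delimiter character, keeping empty fields.
std::vector<std::string> SplitOn(std::string const& text, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        auto const end = text.find(delim, start);
        if (end == std::string::npos)
        {
            // No more delimiters: the rest of the text is the last field.
            fields.push_back(text.substr(start));
            return fields;
        }
        fields.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

std::string GetFileDate(std::string const& filename)
{
    // The ninth field (index 8) holds "<date>.<extension>". at() throws
    // std::out_of_range if the name has fewer fields than expected.
    auto const parts = SplitOn(filename, '_');
    return SplitOn(parts.at(8), '.').at(0);
}
```

With a made-up name such as "a_b_c_d_e_f_g_h_082014.csv", GetFileDate returns "082014". Unlike operator[], at() turns a malformed filename into an exception instead of undefined behaviour.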
 

Main 

Now, the entry point. This is where we have the greatest number of differences between Ruby and C++, because we're always passing a lambda to calculate the total in the C++ version. Other than that, it's similar.
 
We're creating an object that contains all the data to describe each file being validated and to gather the data for that validation.
 
Here it is in Ruby.
 
file_date = get_file_date(ARGV[0])
 
tel_resumo =
  TrafficFile.new("Telemóvel (resumo)",
    "#{file_date}_tlm.csv", 3, 15)
resumo =
  TrafficFile.new("Resumo", "#{file_date}_rsm.csv", 8, 13)
compare(tel_resumo, resumo)
 
detalhe =
  TrafficFile.new("Detalhe", "#{file_date}_det.csv", 4, 22)
tel_detalhe =
  TrafficFile.new("Telemóvel (detalhe)",
    "#{file_date}_tlm.csv", 3) do |fields|
    fields[4].gsub(',', '.').to_f() +
    fields[8].gsub(',', '.').to_f() +
    fields[31].gsub(',', '.').to_f()
  end

compare(detalhe, tel_detalhe)

 
And here it is in C++.
 
int main(int argc, char *argv[])
{
    string file_date = GetFileDate(string{argv[1]});
 
    auto cv_tr = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[15], ",", "."));};
    TrafficFile<decltype(cv_tr)> 
        tel_resumo{"Telemóvel (resumo)", 
        file_date + "_tlm.csv", 3, cv_tr};
 
    auto cv_r = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[13], ",", "."));};
    TrafficFile<decltype(cv_r)> 
        resumo{"Resumo", file_date + "_rsm.csv", 8, cv_r};
 
    Compare(tel_resumo, resumo);
 
    auto cv_d = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[22], ",", "."));};
    TrafficFile<decltype(cv_d)> 
        detalhe{"Detalhe", file_date + "_det.csv", 4, cv_d};
 
    auto cv_td = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[4], ",", "."))
            + stod(replace_first_copy(c[8], ",", "."))
            + stod(replace_first_copy(c[31], ",", "."));};
    TrafficFile<decltype(cv_td)>
        tel_detalhe{"Telemóvel (detalhe)", 
        file_date + "_tlm.csv", 3, cv_td};
 
    Compare(detalhe, tel_detalhe);
}
 

Validate

The validation is performed by comparing files in sets of two. For each file, we load the pair (key, total) into a container, and then we compare the totals for each key. Since there is no guarantee that every key is present on both files, when we find a key that exists on both containers, we remove that key from both containers.
 
This is the function that performs that comparison. We output every key whose totals differ between the two files.
 
In Ruby.
 
def compare(first, second)
  first_data =
    get_unified_totals(first.filename, first.key_position,
      first.value_position, &first.calc_value)
  second_data =
    get_unified_totals(second.filename, second.key_position,
      second.value_position, &second.calc_value)
 
  first_data.each() do |key, value|
    if second_data.has_key?(key)
      if (value - second_data[key]).abs() > 0.0009
        puts("ERRO! #{key} tem valores incoerentes: #{value}" +
          " e #{second_data[key]}")
      end

      first_data.delete(key)
      second_data.delete(key)
    end
  end
 
  check_remaining(first_data)
  check_remaining(second_data)
end

In C++. 
 

template <typename T1, typename T2>
void Compare(T1 first, T2 second)
{
    CardEntries first_data = 
        GetUnifiedTotals(first.filename, first.key_position, 
            first.calc_value);
    CardEntries second_data =
        GetUnifiedTotals(second.filename, second.key_position, 
            second.calc_value);
 
    for (auto it = first_data.cbegin(); 
        it != first_data.cend(); NOOP )
    {
        auto f = second_data.find(it->first);

        if (f != second_data.end())
        {
            if (fabs(it->second - f->second) > VALUE_DIFF)
            {
                cout << "ERRO! " << it->first 
                    << " tem valores incoerentes: "
                    << it->second << " e " << f->second << " (" 
                    << fabs(it->second - f->second)
                    << ")" << endl;
            }
 

            first_data.erase(it++);
            second_data.erase(f);
        }
        else
        {
            ++it;
        }
    }
 
    CheckRemaining(first_data);
    CheckRemaining(second_data);
} 
 
Since we remove all keys that exist on both containers, in the end, each container will have only the keys that didn't exist on the other container. We then use another function to validate these keys, which should all have a 0 total.
 
Here's Ruby.
 
def check_remaining(data_set)
  data_set.each() do |key, value|
    if (value - 0).abs() > 0.0001
      puts("AVISO! #{key} tem valor: #{value}")
    end
  end
end
 
Here's C++. Once again, note the code similarity, how the C++ code isn't any more complex than the Ruby code.
 
void CheckRemaining(CardEntries const& data_set)
{
    for (auto& item : data_set)
    {
        if (fabs(item.second - 0) > ZERO_DIFF)
        {
            cout << "AVISO! " << item.first 
                << " tem valor: " << item.second << endl;
        }
    }
}
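
To try the matching step without any files, here's a small self-contained sketch that works on in-memory maps. It is a variation on Compare above, not a copy: MatchAndRemove is a hypothetical name, it returns its messages instead of printing them, and it uses the iterator returned by erase (available since C++11) instead of erase(it++).

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

using Total = double;
using CardEntries = std::map<int, Total>;

// Removes every key present in both maps and returns one message for
// each pair of totals that differs by more than tolerance.
std::vector<std::string> MatchAndRemove(CardEntries& first,
    CardEntries& second, Total tolerance)
{
    std::vector<std::string> mismatches;

    for (auto it = first.begin(); it != first.end(); )
    {
        auto const found = second.find(it->first);
        if (found == second.end())
        {
            ++it; // Key only in first: keep it for the later check.
            continue;
        }

        if (std::fabs(it->second - found->second) > tolerance)
        {
            mismatches.push_back("key " + std::to_string(it->first));
        }

        second.erase(found);
        it = first.erase(it); // erase returns the next valid iterator.
    }

    return mismatches;
}
```

Given first = {1: 10.0, 2: 5.0, 3: 0.0} and second = {1: 10.0, 2: 5.5, 4: 7.0}, only key 2 is reported, and the maps end up holding just key 3 and key 4 respectively, which is exactly what the "remaining" check then inspects.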
 

Adding up

This is the function that reads a file and creates a container with the key and its total, which is the sum of all the values for all the occurrences of the key. The file is a CSV, but since I'm on Ruby 1.8.7, I preferred not to use Ruby's CSV lib(s); after all, I just needed to split the line on a ";", so I did it "manually".
 
def get_unified_totals(filename, key_position, value_position = 0)
  totals = Hash.new(0)
 
  File.open(filename) do |file|
    # skip header
    file.gets()
   
    while line = file.gets()
      a = line.split(';')
      if block_given?()
        totals[a[key_position]] += yield(a)
      else
        totals[a[key_position]] +=
          a[value_position].gsub(',', '.').to_f()
      end
    end
  end
 
  return totals
end
 
The C++ version is a bit simpler, because we always pass a lambda.
 
template <typename T>
CardEntries GetUnifiedTotals(string filename, short key_position,
    T calc_value)
{
    ifstream inFile{filename};
 
    string line;
    getline(inFile, line, inFile.widen('\n')); // Skip header
 
    CardEntries totals;
    SplitContainer c;
    while(getline(inFile, line, inFile.widen('\n')))
    {
        split(c, line, is_any_of(";"));
        totals[stoi(c[key_position])] += calc_value(c);
    }
 
    return totals;
} 
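
The summing step can also be exercised without a file on disk. The sketch below is an assumption-laden variation on GetUnifiedTotals: it reads from any std::istream, splits with std::getline instead of Boost, and SumByKey, SplitFields and ParseDecimalComma are hypothetical names used only here.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Total = double;
using CardEntries = std::map<int, Total>;
using SplitContainer = std::vector<std::string>;

// Splits a line on ';'. An empty trailing field is dropped, which is
// good enough for this sketch.
SplitContainer SplitFields(std::string const& line)
{
    SplitContainer fields;
    std::istringstream in{line};
    std::string field;
    while (std::getline(in, field, ';'))
    {
        fields.push_back(field);
    }
    return fields;
}

// Converts a decimal-comma number such as "1,5" to a double.
Total ParseDecimalComma(std::string text)
{
    auto const pos = text.find(',');
    if (pos != std::string::npos)
    {
        text[pos] = '.';
    }
    return std::stod(text);
}

// Adds up calc_value(fields) for every occurrence of each key.
template <typename T>
CardEntries SumByKey(std::istream& in, short key_position, T calc_value)
{
    std::string line;
    std::getline(in, line); // Skip header

    CardEntries totals;
    while (std::getline(in, line))
    {
        SplitContainer const c = SplitFields(line);
        totals[std::stoi(c.at(key_position))] += calc_value(c);
    }
    return totals;
}
```

Feeding it a std::istringstream holding "key;value\n7;1,5\n8;2,0\n7;0,25\n", with a lambda that returns ParseDecimalComma(c.at(1)), gives a total of 1.75 for key 7 and 2.0 for key 8.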
 

TL;DR

When you have a task to do, just pick the language you're most familiar/comfortable with, unless you have a very good reason not to. The much-famed "productivity gap" may not be as large as is usually advertised.
 
One thing I read often is that a good programmer is productive/competent/idiomatic with any language. I don't doubt it, although I do struggle a bit to accept some quirks (e.g., Ruby's "no return" policy).
 
However, I believe we all have a natural affinity towards some language(s)/style(s); that's our comfort zone, and that's where we'll have a tendency to be most productive. I'll write a bit more about this in my next post.