Sunday, 6 July 2014

C++ and Ruby - Side by side

Yes, I know, choosing this blog's name wasn't my finest moment.
 
A couple of weeks ago I had to create a process to cross-validate files. Each line in those files is identified by a key (which can be repeated), and we have to sum all the values for each key in each file and compare those sums.
 
So, I whipped up a Ruby script in a couple of hours, and it took me a few more hours to refine it. As a side note, the "Refinement To Do" list has been growing, but my priorities lie elsewhere, so this will stay at "good enough" for now.
 
Then, when I got home, I've decided to replicate that Ruby script in C++. I didn't go for changing/improving the design (although I did end up changing one detail), just replicating it.
 
And I was pleasantly surprised.
  • It also took me a couple of hours.
  • The C++ code was not noticeably "larger" than its Ruby counterpart.
 
Yes, you read it right. Not only was my "C++ productivity" on par with my "Ruby productivity", but also the resulting code was very similar, both in size and in complexity.
 
Let's take a look at it, shall we?
 
Important: The Ruby code, as presented in this post, may not be legal Ruby code, i.e., if copy/pasted on a script, may cause errors. My concern here was readability, not correctness.
 

Paperwork, please

While the Ruby script needed no "paperwork", the C++ program needed some.

This was the only non-standard header I included:
 
#include <boost/algorithm/string.hpp>
using boost::split;
using boost::is_any_of;
using boost::replace_first_copy;

And these were the definitions I had in C++:
 
using Total = double;
Total const VALUE_DIFF = 0.0009;
Total const ZERO_DIFF = 0.0001;
 
using CardEntries = map<int, Total>;
using CardEntry = pair<int, Total>;
using SplitContainer = vector<string>;
 
#define NOOP
 

Support structures

We created a structure to gather the data necessary to process the files. Here it is in Ruby.
 
class TrafficFile
  attr_reader(:description, :filename,
    :key_position, :value_position, :calc_value)
 
  def initialize(description, filename, key_position,
    value_position = 0, &calc_value
  )
    @description = description
    @filename = filename
    @key_position = key_position
    @value_position = value_position
    @calc_value = calc_value
  end
end

The key position is the field number on the file containing the key. The value position is the field number on the file containing the value to add. Why do we also have a code block to calculate the value? Because it turned out that one of the validations had to be performed by adding three fields, instead of just one, so I decided to add the code block.
 
It would've been easier to turn value_position into an array, but I wanted to experiment with code blocks. So, I've done what I usually do in these situations - I set myself a deadline (in this case, 30 minutes); if I don't have something ready when the time comes, I switch to my alternative - usually, simpler - design.
 
And here is its counterpart in C++. I've eliminated de value_position/calc_value dichotomy, and decided to use lambdas for every case.
 
template <typename T>
struct TrafficFile
{
    TrafficFile(string desc, string fn, short kp, T cv) :
        description{desc}, filename{fn}, key_position{kp},
        calc_value{cv} {}
 
    string description;
    string filename;
    short key_position;
    T calc_value;
}; 
 
Note: Yes, I'm passing objects by value (e.g., the strings above). Whenever I don't foresee an actual benefit in passing by reference, I'll stick to pass-by-value.
 

Little Helpers

The filenames have a date, in the format MMYYYY. We extract this from the name of one particular file, which serves as an index to all the other files.
 
Here's the Ruby method.
 
def get_file_date(filename)
  return filename.split('_')[8].split('.')[0]
end
 
And here's the equivalent in C++, which is pretty much... well, equivalent. It's not a one-liner, but other than that, it's exactly the same - two splits.
 
string GetFileDate(string filename)
{
    SplitContainer c;
    split(c, filename, is_any_of("_"));
    split(c, c[8], is_any_of("."));
    return c[0];
}
 

Main 

Now, the entry point. This is where we have the most number of differences between Ruby and C++, because we're always passing a lambda to calculate the total in the C++ version. Other than that, it's similar.
 
We're creating an object that contains all the data to describe each file being validated and to gather the data for that validation.
 
Here it is in Ruby.
 
file_date = get_file_date(ARGV[0])
 
tel_resumo =
  TrafficFile.new("Telemóvel (resumo)",
    "#{file_date}_tlm.csv", 3, 15)
resumo =
  TrafficFile.new("Resumo", "#{file_date}_rsm.csv", 8, 13)
compare(tel_resumo, resumo)
 
detalhe =
  TrafficFile.new("Detalhe", "#{file_date}_det.csv", 4, 22)
tel_detalhe =
  TrafficFile.new("Telemóvel (detalhe)",
    "#{file_date}_tlm.csv", 3)
  do |fields|
    fields[4].gsub(',', '.').to_f() +
    fields[8].gsub(',', '.').to_f() +
    fields[31].gsub(',', '.').to_f()
  end

compare(detalhe, tel_detalhe)

 
And here it is in C++.
 
int main(int argc, char *argv[])
{
    string file_date = GetFileDate(string{argv[1]});
 
    auto cv_tr = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[15], ",", "."));};
    TrafficFile<decltype(cv_tr)> 
        tel_resumo{"Telemóvel (resumo)", 
        file_date + "_tlm.csv", 3, cv_tr};
 
    auto cv_r = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[13], ",", "."));};
    TrafficFile<decltype(cv_r)> 
        resumo{"Resumo", file_date + "_rsm.csv", 8, cv_r};
 
    Compare(tel_resumo, resumo);
 
    auto cv_d = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[22], ",", "."));};
    TrafficFile<decltype(cv_d)> 
        detalhe{"Detalhe", file_date + "_det.csv", 4, cv_d};
 
    auto cv_td = [] (SplitContainer const& c)
        {return stod(replace_first_copy(c[4], ",", "."))
            + stod(replace_first_copy(c[8], ",", "."))
            + stod(replace_first_copy(c[31], ",", "."));};
    TrafficFile<decltype(cv_td)>
        tel_detalhe{"Telemóvel (detalhe)", 
        file_date + "_tlm.csv", 3, cv_td};
 
    Compare(detalhe, tel_detalhe);
}
 

Validate

The validation is performed by comparing files in sets of two. For each file, we load the pair (key, total) into a container, and then we compare the totals for each key. Since there is no guarantee that every key is present on both files, when we find a key that exists on both containers, we remove that key from both containers.
 
This is the function that performs that comparison. We output every key that has different totals in both files.
 
In Ruby.
 
def compare(first, second)
  first_data =
    get_unified_totals(first.filename, first.key_position,
      first.value_position, &first.calc_value)
  second_data =
    get_unified_totals(second.filename, second.key_position,
      second.value_position, &second.calc_value)
 
  first_data.each() do |key, value|
    if second_data.has_key?(key)
      if (value - second_data[key]).abs() > 0.0009
        puts("ERRO! #{key} tem valores incoerentes: #{value}" +
          " e #{second_data[key]}")
      end

      first_data.delete(key)
      second_data.delete(key)
    end
  end
 
  check_remaining(first_data)
  check_remaining(second_data)
end

In C++. 
 

template <typename T1, typename T2>
void Compare(T1 first, T2 second)
{
    CardEntries first_data = 
        GetUnifiedTotals(first.filename, first.key_position, 
            first.calc_value);
    CardEntries second_data =
        GetUnifiedTotals(second.filename, second.key_position, 
            second.calc_value);
 
    for (auto it = first_data.cbegin(); 
        it != first_data.cend(); NOOP )
    {
        auto f = second_data.find(it->first);

        if (f != second_data.end())
        {
            if (fabs(it->second - f->second) > VALUE_DIFF)
            {
                cout << "ERRO! " << it->first 
                    << " tem valores incoerentes: "
                    << it->second << " e " << f->second << " (" 
                    << fabs(it->second - f->second)
                    << ")" << endl;
            }
 

            first_data.erase(it++);
            second_data.erase(f);
        }
        else
        {
            ++it;
        }
    }
 
    CheckRemaining(first_data);
    CheckRemaining(second_data);
} 
 
Since we remove all keys that exist on both containers, in the end, each container will have only the keys that didn't exist on the other container. We then use another function to validate these keys, which should all have a 0 total.
 
Here's Ruby.
 
def check_remaining(data_set)
  data_set.each() do |key, value|
    if (value - 0).abs() > 0.0001
      puts("AVISO! #{key} tem valor: #{value}")
    end
  end
end
 
Here's C++. Once again, note the code similarity, how the C++ code isn't any more complex than the Ruby code.
 
void CheckRemaining(CardEntries const& data_set)
{
    for (auto& item : data_set)
    {
        if (fabs(item.second - 0) > ZERO_DIFF)
        {
            cout << "AVISO! " << item.first 
                << " tem valor: " << item.second << endl;
        }
    }
}
 

Adding up

This is the function that reads a file and creates a container with the key and its total, which is the sum of all the values for all the occurrences of the key. The file is a CSV, but since I'm on Ruby 1.8.7, I preferred not to use Ruby's CSV lib(s); after all, I just needed to split the line on a ";", so I did it "manually".
 
def get_unified_totals(filename, key_position, value_position = 0)
  totals = Hash.new(0)
 
  File.open(filename) do |file|
    # skip header
    file.gets()
   
    while line = file.gets()
      a = line.split(';')
      if block_given?()
        totals[a[key_position]] += yield(a)
      else
        totals[a[key_position]] +=
          a[value_position].gsub(',', '.').to_f()
      end
    end
  end
 
  return totals
end
 
The C++ version is a bit simpler, because we always pass a lambda.
 
template <typename T>
CardEntries GetUnifiedTotals(string filename, short key_position,
    T calc_value)
{
    ifstream inFile{filename};
 
    string line;
    getline(inFile, line, inFile.widen('\n')); // Skip header
 
    CardEntries totals;
    SplitContainer c;
    while(getline(inFile, line, inFile.widen('\n')))
    {
        split(c, line, is_any_of(";"));
        totals[stoi(c[key_position])] += calc_value(c);
    }
 
    return totals;
} 
 

TL;DR

When you have a task to do, just pick the language you're most familiar/comfortable with, unless you have a very good reason not to. The much-famed "productivity gap" may not be as large as is usually advertised.
 
One thing I read often is that a good programmer is productive/competent/idiomatic with any language. I don't doubt it, although I do struggle a bit to accept some quirks (e.g., Ruby's "no return" policy).
 
However, I believe we all have a natural affinity towards some language(s)/style(s); that's our comfort zone, and that's where we'll have a tendency to be most productive. I'll write a bit more about this on my next post.

 

Tuesday, 10 June 2014

Processing command line options in Ruby

The story so far

I've been setting up a scripting environment at work, as I've mentioned previously. After considering Perl, Python, and Ruby, we chose the latter.
 
Ruby itself was installed from RPMs by the sysadmins. Then, I've installed Ruby Gems locally, and all the gems are also locally installed. This minimizes our dependency on the sysadmins, i.e., we don't have to open a request every time we need a new gem installed. The server has no internet connection, so we download the gems and run a local install, which means sorting out dependencies manually. So far, it has been going smoothly, and I'm quite impressed by the way gem management is handled.
 
One of our top goals is working with Informix DBs, and I'm happy to report that the Ruby Informix gem installed without a single hiccup (did I already mention I'm impressed?), and worked right out of the box (because I'd like to leave that quite clear - I'm impressed).
 
This move to Ruby was caused, in part, by a server migration, where we decided to take the opportunity to simplify our batch environment. We currently have a mix of Java and shell scripts, with a liberal sprinkling of awk and some Perl. We'll keep the Java processes as-is, but we're planning on moving the scripts to Ruby.
 
The first goal is migrating everything to the new servers, making only the necessary changes in order to keep things working with a new set of system requirements (e.g., replacing FTP with SFTP). With time, we'll be converting all the scripting logic to Ruby.
 
So, I've been diving into Ruby, in a mix of planning and prototyping, trying to get a grasp of the language's basics, and creating some building blocks for future scripts.
 

Ruby & Command line options - GetoptLong

I'm currently working on a component to process command line arguments, similar (in goal, if not in design) to what I've done here.
 
As I looked at how this is done in Ruby, the first option I found was GetoptLong. I set up the options and began writing the code to do something with it. And I immediately thought "I need something else". Why?
 
Well, when you have a class that lacks a method to get an option by name, forcing you to loop through all the options with a switch/case that says "If the current option is this, then do that", I have to wonder - does anyone actually think this is elegant?
 
Yes, I know. I could build such a method myself. After all, Ruby allows me to add methods to existing classes, so I could add a getter that took a name as argument, and returned the value of the option with that name or nil/exception if it didn't exist. But I couldn't believe there was no alternative. I couldn't be the only one thinking "This is sub-par design", there had to be a better way.
 

Ruby & Command line options - OptionParser

There is. It's called OptionParser. It's definitely better, and it seems quite powerful and complete. However, I've yet to find an example/tutorial that shows me how to do a couple of things that are dead simple in, e.g., Boost.Program_options:
 
The first is setting up a required option, making the option processing throw an exception if it's not present. Notice how easy it is with Boost:
 
("fich,f", po::value<string>(&fileName)->required(), "Ficheiro a processar")
 
OptionParser has a concept called "mandatory", but it means "If you include this option in the command line, you must specify its value". However, if you don't include it, no exception is thrown.
 
The second is setting up a default value directly in the options definition, instead of having to do it in some other code, even if it's local to the options definition. Once again, courtesy of Boost:
 
("validnc,c", po::bool_switch(&validateFieldNr)->default_value(false),
    "Valida nr. de campos por linha e termina")

And I already have a pet peeve with OptionParser. Suppose you have this:

options[:delimiter] = ';'
opts.on('-d', '--delimiter', 'Delimitador utilizado no ficheiro de dados') do |delim|
  options[:delimiter] = delim
end

Looking at this, you'd think "I have a default delimiter, ';', and I have an option to specify a different delimiter on the command line". Seems logical, right? Well, not quite. If you specify this option on the command line, options[:delimiter] will be set to true, and your delimiter will be left in ARGV, i.e., it won't be consumed by OptionParser.
 
Here's what you need, instead:
 
options[:delimiter] = ';'
opts.on('-d', '--delimiter delim', 'Delimitador utilizado no ficheiro de dados') do |delim|
  options[:delimiter] = delim
end
 
By the way, you don't need to use "delim" in '--delimiter delim', you can use anything, as long as it's there. Fortunately, I wasn't the first one to get bitten by this little detail, and I quickly found someone else wondering why were all his values true or false.
 
While I can't help but wonder at the reasoning that led to "This looks like a good idea", my ignorance on the language/class means I'll just live with it, since I don't feel competent to do better.
 

Pet Peeve - Documentation

Yes, this is my pet peeve. Docs. OptionsParser's minimal example is too minimal to be useful, and the complete example is so crowded, that details like this get lost in the noise; not that there's anything wrong with that, that's what a complete example is for, it's the minimal example that shouldn't be quite so minimal.
 
I believe a rule like "If you want your options to have a value, you have to put something - anything - following it on the long form string" should be made more visible in the documentation. Actually, most of the Ruby documentation I've found so far is quite minimalistic/optimistic. Tutorials like Boost's (yes, I'm using Boost as an example of good documentation; no, I never thought that day would come), or troubleshooting sections are absent from most of the docs I've found.
 
E.g., take a look at make_switch and see what you can make of it. Then, see the  complete example. Finally, look at this tutorial. It's only after I read this tutorial that I understood the semantics of "mandatory" and "optional" in this context, and gained a better understanding of how OptionParser works.
 
And going back to the "optimistic" bit I mentioned above, here's what I mean - sentences like these (which are from the tutorial) should be in the class's docs:
  • Switches that take a parameter only need to state the parameter name in the long form of the switch.
  • While your option string can define the parameter to be called "a,b,c", OptionParser will blindly allow any number of elements in the list. So, if you need a specific number of elements, be sure to check the array length yourself.

Another example - if you go to rubygems.org, you have the installation instructions, which end with this:
For more details and other options, see:  
ruby setup.rb --help
And, after a few hours of dealing with issues on Ruby Gems setup, I had to ask - why doesn't it end with this:
For more details and other options, see <this link for online docs, where we explain said details and options with more depth than we ever will in a help message>

Then, we have the guides. Filled with useful info, no denying that. However, when I needed to learn about things like GEM_HOME and RUBYLIB, I had to go elsewhere. While we could argue the latter is not actually Gem specific, I still believe it should be there, close to the sections regarding Ruby Gems installation and setup. Not all of us can just go gem update or ruby setup.rb with "you may need admin/root". Some of us don't even have an internet connection to begin with.
 

TL;DR

As I dive into Ruby, I like most of what I see, especially after having learnt to appreciate duck typing, available with C++ templates, and the total decoupling it provides.
 
So far, I've met a few design choices that baffle me, and the docs are definitely a weak point. Fortunately, there's this internet thingie where people are more than willing to share their knowledge.
 
As for processing command line arguments, I've settled for this "pattern":
options = {}
 
options[:delimiter] = ';'
opts.on('-d', '--delimiter delim', 
'Delimitador utilizado no ficheiro de dados') do |delim|
  options[:delimiter] = delim
end
 
options[:title] = false
opts.on('-t', '--title', 
'Indica se o ficheiro de dados tem header de título') do
  options[:title] = true
end
 
options[:settings] = nil
opts.on('-s', '--settings sf', 'Ficheiro de configurações') do |sf|
  options[:settings] = sf
end
...
if options[:settings] == nil
  puts('Tem que indicar o ficheiro a processar')
  usage()
  exit
end
 

Thursday, 29 May 2014

What is performance?

I've restarted watching Alex Stepanov's A9 Lectures, "Efficient Programming with Components", and if there's one thing I find essential in what he says is that, despite all the "programming recipes" we find and use, it's important that we think. I've recently been reminded of this need to think, concerning optimization.
 
We all have our recipes. As far as optimization is concerned, my recipe is:
  • Go for readability first. Optimize as needed.
  • "Using C++ == you should extract every single ounce of performance available, because otherwise you'd be more productive in any other programming language" is not a dogma. And, as far as productivity goes, it's not even necessarily true.

A simple scenario

Suppose we have a list of char[] and we want to find duplicates on said list. In its simplest form, testing two arrays for equality means traversing both, comparing each element until we reach the end of either array, or find a difference.
 
We can consider several additions to this algorithm, in order to optimise it.
 

Test for self

We could start by testing for self-comparison. Stepanov has a point when he says this is not an optimization, because most of the time we will not be comparing an element (in our case, a char[]) to itself. This brings us to one of those cases where we need to think. Maybe not on equality, since that is trivial; but let's consider assignment.

When defining assignment, should we test for self-assignment? All over the web, the recipes sound a very loud "YES". And yet, most of the time, that test will evaluate to false. The question we should ask ourselves is - if I perform self-assignment, what will happen?

Will I release/duplicate resources? Will I invalidate a class's invariants? Then, self-assignment must be prevented, and the test is required. Not because of optimization, but because a particular class requires it. If not, will it be terribly expensive? In this case, it may actually be an optimization. Will we be self-assigning some floats? How expensive is that? OTOH, how expensive is the test? Is it even worth thinking about, or should we just follow the recipe and put it in there by default?

As with everything concerning optimization, there's no clear-cut answer. Even though I'm inclined to agree with Stepanov on this, I'd probably include the test for self-assignment by default, and consider removing it if, after profiling my code, I could make the case that removing it would benefit performance.
 

Test for size

Back to our quest for duplicates on a list, we could test for different sizes. Suppose we store each of our elements in a simple structure like this:
 
struct SimpleArray
{
    size_t size;
    char* arr;
};

Testing for size means not having to traverse the array, so that's a good thing. Not for the traversing itself, that would be trivial, unless our arrays were longer than the processor's cache lines, but because we'd avoid going to memory again to read *arr (arr itself should be on the processor's cache). That is a good thing.
 
Except when it isn't.
 
What if you know your data set is composed of arrays that have fixed length? Again, you may measure and conclude the impact of this test is negligible. Or it may mean the difference between respecting a performance requirement or not. In this particular case, if I had to optimise, I'd skip the structure and work directly on the array, because the size would be irrelevant.
 

A simple scenario - what we don't know...

We need to think, in order to make decisions. And for that, we need data.
  • What is our process doing? If we're invoking services across a network, I doubt either a test for self-comparison or a size test on fixed size arrays will have much of an impact.
  • What is the actual weight of these trivial operations? Under our expected workload? Under our worst-case scenario? Under 4x our worst-case scenario? When another process on our machine suddenly goes haywire and gobbles up resources like crazy?
  • Is our process critical? Does it have strict performance requirements, with associated penalties?
 

The big picture

We sometimes have to deal with performance issues, where all the data we have to perform an analysis is "This damn thing is too slow". Based on this treasure trove of data, we have to consider: Our application; the application servers; the web servers; the database servers; the middleware servers; the servers hosting the services we consume (usually, via the middleware servers); and all the myriad of network equipment allowing all these nodes to communicate.
 
And I don't need more than two hands to count the cases where performance-related issues were caused by application code. And, most of those cases were caused by poor design, rather than by poor implementation - such as an app we received where an operation that failed was placed back into the queue for retrying; but a) instead of applying a delay before retrying, it was placed in the queue for immediate reprocessing; and b) it was placed back into the queue even if the error was not recoverable (e.g., a telephone number with 5 digits). As you might imagine, much fun (and quasi-DOS attacks) ensued.
 
So, let's get back to our title...
 

What is performance?

When we think of performance, we envision CPU cycles and cache hits/misses; or that we're processing X lines per second from this file, and we need to process 4X; or sending data via the network, and can we really afford to use SOAP, or should we choose a light-weight protocol (or even roll our own).
 
Suppose we have Platform A, which invokes a service on Platform B, via a Middleware Server. The service on PlatB sends back something which we then use to call another service on PlatB, which gives us the actual result.
 
The straightforward implementation looks like this:
 
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare (return) -> Plat A
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare (return) -> Plat A

On invocation #1, PlatA gets the something required to get the result, and on invocation #2 gets the actual result.
 
Now, looking at this, we have the obvious optimization:
 
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare
MWare (invoke) -> PlatB -> (return) -> MWare (return) -> Plat A
 
In fact, not only have we just saved a few seconds from the whole process, but we also abstracted PlatA from PlatB's two-call design. Brilliant, right?
 
Well...
 
Let's suppose we're responsible for PlatA and we receive a complaint from a costumer, saying his data is all wrong and he won't pay his invoice until we sort it out. Our analysis concludes we're getting wrong data from PlatB. We contact their support team, but in order to perform their analysis they require the something they sent us for that particular invocation. So, now we have to contact the MWare support team, ask for that piece of data, and then send it to PlatB's team. And, during all this time, the costumer's waiting; and our invoice is not getting paid.
 
"Hold on!" you say "That's not related to process performance, that's an operational problem". Yes, it is. And that's irrelevant, unless you're developing in some sort of hermit's vacuum with no contact with the organization for which you develop. Is that the case? I didn't think so.
 
Yes, the fact that we may have to wait 1-2 hours before we even begin to actually analyse the cause of the costumer's complaint is an operational problem. But that doesn't make it - and its consequences - any less real. And the potential of this operational problem has been fulfilled by our optimization.
 
Let's look at the gains and losses of this optimization:
  • We've cut a few seconds off each invocation. Let's be optimistic and assume few == 10.
  • We've added 1-2 hours to the response time, in case of costumer's complaints.
If we look at the "recipe" that says "don't optimize for the uncommon case", everything looks fine. However, different cases have different weights, and if there's one particular case where you want to be as fast as you can is when you've got an unsatisfied customer waiting for an answer (and saying he won't pay you until he gets one).
 
So much for recipes...
 
So, all things considered, what is performance? I don't have a clear-cut answer to this question, but I believe it goes beyond "how many requests per second can we handle".