Thursday, 29 May 2014

What is performance?

I've restarted watching Alex Stepanov's A9 Lectures, "Efficient Programming with Components", and if there's one thing I find essential in what he says, it's that, despite all the "programming recipes" we find and use, it's important that we think. I've recently been reminded of this need to think, concerning optimization.
We all have our recipes. As far as optimization is concerned, my recipe is:
  • Go for readability first. Optimize as needed.
  • "Using C++ == you should extract every single ounce of performance available, because otherwise you'd be more productive in any other programming language" is not a dogma. And, as far as productivity goes, it's not even necessarily true.

A simple scenario

Suppose we have a list of char[] and we want to find duplicates on said list. In its simplest form, testing two arrays for equality means traversing both, comparing each element until we reach the end of either array, or find a difference.
We can consider several additions to this algorithm, in order to optimize it.
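As a concrete sketch of that baseline (my own illustrative code, assuming plain pointers plus explicit lengths):

```cpp
#include <cstddef>

// Baseline equality test: walk both arrays until we reach the end of
// either, or find a difference.
bool equal_arrays(const char* a, std::size_t len_a,
                  const char* b, std::size_t len_b)
{
    std::size_t i = 0;
    for (; i < len_a && i < len_b; ++i)
        if (a[i] != b[i])
            return false;               // found a difference
    return i == len_a && i == len_b;    // equal only if both ends reached
}
```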

Test for self

We could start by testing for self-comparison. Stepanov has a point when he says this is not an optimization, because most of the time we will not be comparing an element (in our case, a char[]) to itself. This brings us to one of those cases where we need to think. Maybe not on equality, since that is trivial; but let's consider assignment.

When defining assignment, should we test for self-assignment? All over the web, the recipes sound a very loud "YES". And yet, most of the time, that test will evaluate to false. The question we should ask ourselves is - if I perform self-assignment, what will happen?

Will I release/duplicate resources? Will I invalidate a class's invariants? Then, self-assignment must be prevented, and the test is required. Not because of optimization, but because a particular class requires it. If not, will it be terribly expensive? In this case, it may actually be an optimization. Will we be self-assigning some floats? How expensive is that? OTOH, how expensive is the test? Is it even worth thinking about, or should we just follow the recipe and put it in there by default?

As with everything concerning optimization, there's no clear-cut answer. Even though I'm inclined to agree with Stepanov on this, I'd probably include the test for self-assignment by default, and consider removing it if, after profiling my code, I could make the case that removing it would benefit performance.
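To make the trade-off concrete, here's a sketch over a hypothetical resource-owning class (mine, not from the lectures). With copy-before-delete ordering, self-assignment is already safe without the test, so the test becomes exactly the contested "optimization" discussed above:

```cpp
#include <cstddef>
#include <cstring>

class Buffer
{
public:
    explicit Buffer(const char* s)
        : size_(std::strlen(s)), data_(new char[size_ + 1])
    { std::memcpy(data_, s, size_ + 1); }

    Buffer(const Buffer& other)
        : size_(other.size_), data_(new char[other.size_ + 1])
    { std::memcpy(data_, other.data_, size_ + 1); }

    Buffer& operator=(const Buffer& other)
    {
        if (this == &other)     // the contested test: correctness doesn't
            return *this;       // require it here, it only skips the copy
        char* fresh = new char[other.size_ + 1];
        std::memcpy(fresh, other.data_, other.size_ + 1);
        delete[] data_;         // safe even for self-assignment: copy first
        data_ = fresh;
        size_ = other.size_;
        return *this;
    }

    ~Buffer() { delete[] data_; }

    const char* c_str() const { return data_; }

private:
    std::size_t size_;
    char* data_;
};
```

Had we written the delete-then-copy version instead, the test would stop being an optimization and become a correctness requirement.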

Test for size

Back to our quest for duplicates on a list, we could test for different sizes. Suppose we store each of our elements in a simple structure like this:
struct SimpleArray
{
    size_t size;
    char* arr;
};

Testing for size means not having to traverse the arrays, and that's a good thing. Not because of the traversal itself (that would be trivial, unless our arrays were longer than the processor's cache lines), but because we'd avoid going to memory again to read *arr (arr itself should already be in the processor's cache). That is a good thing.
Except when it isn't.
What if you know your data set is composed of arrays that have fixed length? Again, you may measure and conclude the impact of this test is negligible. Or it may mean the difference between respecting a performance requirement or not. In this particular case, if I had to optimize, I'd skip the structure and work directly on the array, because the size would be irrelevant.
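A sketch of the equality test with the size short-circuit (the implementation is mine, assuming the SimpleArray above):

```cpp
#include <cstddef>
#include <cstring>

struct SimpleArray
{
    std::size_t size;
    char* arr;
};

// If sizes differ, we return without ever dereferencing arr, so we never
// pay the extra trip to memory discussed above.
bool operator==(const SimpleArray& a, const SimpleArray& b)
{
    if (a.size != b.size)
        return false;                           // no need to read *arr
    return std::memcmp(a.arr, b.arr, a.size) == 0;
}
```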

A simple scenario - what we don't know...

We need to think, in order to make decisions. And for that, we need data.
  • What is our process doing? If we're invoking services across a network, I doubt either a test for self-comparison or a size test on fixed size arrays will have much of an impact.
  • What is the actual weight of these trivial operations? Under our expected workload? Under our worst-case scenario? Under 4x our worst-case scenario? When another process on our machine suddenly goes haywire and gobbles up resources like crazy?
  • Is our process critical? Does it have strict performance requirements, with associated penalties?

The big picture

We sometimes have to deal with performance issues, where all the data we have to perform an analysis is "This damn thing is too slow". Based on this treasure trove of data, we have to consider: Our application; the application servers; the web servers; the database servers; the middleware servers; the servers hosting the services we consume (usually, via the middleware servers); and all the myriad of network equipment allowing all these nodes to communicate.
And I don't need more than two hands to count the cases where performance-related issues were caused by application code. And, most of those cases were caused by poor design, rather than by poor implementation - such as an app we received where an operation that failed was placed back into the queue for retrying; but a) instead of applying a delay before retrying, it was placed in the queue for immediate reprocessing; and b) it was placed back into the queue even if the error was not recoverable (e.g., a telephone number with 5 digits). As you might imagine, much fun (and quasi-DOS attacks) ensued.
So, let's get back to our title...

What is performance?

When we think of performance, we envision CPU cycles and cache hits/misses; or that we're processing X lines per second from this file, and we need to process 4X; or sending data via the network, and can we really afford to use SOAP, or should we choose a lightweight protocol (or even roll our own).
Suppose we have Platform A, which invokes a service on Platform B, via a Middleware Server. The service on PlatB sends back something which we then use to call another service on PlatB, which gives us the actual result.
The straightforward implementation looks like this:
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare (return) -> PlatA
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare (return) -> PlatA

On invocation #1, PlatA gets the something required to get the result, and on invocation #2 gets the actual result.
Now, looking at this, we have the obvious optimization:
PlatA (invoke) -> MWare (invoke) -> PlatB -> (return) -> MWare
MWare (invoke) -> PlatB -> (return) -> MWare (return) -> PlatA
In fact, not only have we just saved a few seconds from the whole process, but we also abstracted PlatA from PlatB's two-call design. Brilliant, right?
Let's suppose we're responsible for PlatA and we receive a complaint from a customer, saying his data is all wrong and he won't pay his invoice until we sort it out. Our analysis concludes we're getting wrong data from PlatB. We contact their support team, but in order to perform their analysis they require the something they sent us for that particular invocation. So, now we have to contact the MWare support team, ask for that piece of data, and then send it to PlatB's team. And, during all this time, the customer's waiting; and our invoice is not getting paid.
"Hold on!" you say "That's not related to process performance, that's an operational problem". Yes, it is. And that's irrelevant, unless you're developing in some sort of hermit's vacuum with no contact with the organization for which you develop. Is that the case? I didn't think so.
Yes, the fact that we may have to wait 1-2 hours before we even begin to actually analyse the cause of the customer's complaint is an operational problem. But that doesn't make it (and its consequences) any less real. And it was our optimization that turned this operational problem into a real possibility.
Let's look at the gains and losses of this optimization:
  • We've cut a few seconds off each invocation. Let's be optimistic and assume few == 10.
  • We've added 1-2 hours to the response time, in case of customer complaints.
If we look at the "recipe" that says "don't optimize for the uncommon case", everything looks fine. However, different cases have different weights, and if there's one particular case where you want to be as fast as you can, it's when you've got an unsatisfied customer waiting for an answer (and saying he won't pay you until he gets one).
So much for recipes...
So, all things considered, what is performance? I don't have a clear-cut answer to this question, but I believe it goes beyond "how many requests per second can we handle".

Thursday, 22 May 2014

Logger Abstraction (PoC)

I've finally uploaded my logging macros to github, along with a sample program (you must have either Boost Log or Poco installed to run it). In this post, I'll detail its design.
My original requirements were prompted by a particular "use-case" - adding reusable classes to an application and allowing those classes to use the application's logging implementation, instead of whatever implementation was originally used. So, in this scenario, my requirements translate to: 1) The reusable classes must be used with no changes; and 2) The application developer should only need to create a header file with a list of well-defined macros that will invoke the app's logging implementation, thus causing the reusable classes to invoke that same implementation.
I've divided this into several header files, keeping in mind requirement #2. Let's take a look at these header files, then. 

Macro Overloading - macro_overload_base.h

This is the foundation of it all. It's a group of macros that allow overloading based on the number of arguments, up to a limit of 16.
My first design required the distinction between zero arguments, one argument, and more than one argument. However, the complexity of correctly detecting zero arguments was more than I was willing to accept, so I've worked around that requirement. Detecting invocations with one argument proved to be good enough, and the resulting macros, although still ugly, are a lot simpler.
I've been all over the web while searching for this, and quite a few times I was a bit out of my depth (which was one of the reasons I decided to work around the zero-arguments requirement). These were my starting points, in case you're interested.
Detecting the number of arguments is up to the PCBASE__MOVERLOAD_SELECT_VALUE macro. In order to make it work, you must call it with the correct list of values. PCBASE__MOVERLOAD_ONE_ARG_OR_MORE calls it like this

because we just want to know if we have one argument or more than one argument.


The argument-counting macro, on the other hand, calls it with the full list of values, because we need to know the exact number of arguments. PCBASE__MOVERLOAD_FOR_EACH does something similar, but passes as arguments the names of the macros that will apply an action to each argument. E.g.,
PCBASE__MOVERLOAD_FOR_EACH(<<, "[", __FILE__, ":", __LINE__, "] ", 
    "Blimey! I didn't expect the Spanish Inquisition!", 
    chiefWeapons, mill.Trouble(), 42);
<< "[" << __FILE__ << ":" << __LINE__ << "] " 
    << "Blimey! I didn't expect the Spanish Inquisition!" 
    << chiefWeapons << mill.Trouble() << 42
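As an illustration of the trick behind the selection macro (with made-up names; the real PCBASE__MOVERLOAD_* macros go up to 16 arguments), here's a 4-argument version: the selector expands to its fifth argument, and padding the call with a descending value list makes that fifth slot hold the argument count.

```cpp
// MO_SELECT expands to whatever lands in its fifth parameter slot.
#define MO_SELECT(a1, a2, a3, a4, value, ...) value

// Full descending list: the fifth slot ends up holding the exact count.
#define MO_COUNT_ARGS(...) MO_SELECT(__VA_ARGS__, 4, 3, 2, 1, 0)

// Same value in every slot except the last: we only distinguish
// one argument (1) from more than one (2).
#define MO_ONE_ARG_OR_MORE(...) MO_SELECT(__VA_ARGS__, 2, 2, 2, 1, 0)
```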
You may have noticed the macro names look quite ugly. Since there's no namespace partitioning with macros, I've decided to make these names as ugly as I possibly can, to minimize name clashing.
Now, let's move up one layer.

Logging Interface Type Abstraction - li_concat.h / li_outop.h

Here, we build on macro_overload_base.h to create the macros that will receive the logging arguments and call the user-supplied macros (USMs), which will, in turn, call the logging implementation.
There is an abstraction leak here. Somewhere in this "macro chain", we have to invoke the USMs. This means we have a contact point, similar to this:
#define PCBASE__LAMOVERLOAD_LOG_IMPL_1(level, ...) \
#define PCBASE__LAMOVERLOAD_LOG_IMPL_2(level, x, ...) \
On the left-hand side we have the names from the logging abstraction macros, on the right-hand side we have the USMs.
I considered three locations for this contact point:
  1. Place it in the logging abstraction headers, i.e., in li_concat.h and li_outop.h. I rejected this idea because it would require the user to edit an extra header file, thus going against requirement #2.
  2. Place it in the user-supplied header.
  3. Create a separate header just for these macros.
While I don't like the fact that we'll have references to "abstraction names" in the user-supplied header, it seemed the best alternative, so I went with alternative #2.
li_outop.h is quite simple - for each argument, prepend a "<<" to it. li_concat.h is more complex, because we can't add a "+" to a single argument.
In fact, li_outop.h is so simple it could almost be dispensed with; however, I couldn't come up with a design clean enough without it. Besides, keeping it maintains the parallelism between the designs for these two interfaces, concatenation and stream.
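A hypothetical two-argument cut of the two styles makes the difference visible (the real headers generate these forms for any argument count via the FOR_EACH machinery):

```cpp
#include <sstream>
#include <string>

// li_outop.h style: simply prepend << to every argument.
#define LI_OUT_2(a, b) << (a) << (b)

// li_concat.h style: + is binary, so the single-argument case must not
// emit any + at all ("a" + is not a valid expression).
#define LI_CAT_1(a) (a)
#define LI_CAT_2(a, b) (a) + (b)
```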
NOTE: Concatenation is working, but it doesn't actually respect requirement #1 (read why here), because it's not automatically converting its arguments to string.

Logging Interface Type Selection - log_interface_type.h

This is where we define the interface type, which can be one of:
  • Comma. Function call interface with several arguments, such as we would have in a printf()-like output function. I have not implemented this.
  • Stream output operator (operator<<).
  • Concatenation (using +). Function call interface with only one argument, created from the concatenation of several arguments. Since it's a binary operator, it requires a more complex implementation, because it needs to distinguish between two different cases: one argument, which must not involve any concatenation ("a" + is not a valid expression); and more than one argument, which must be concatenated.
This header file will also #include the correct header file for the interface type chosen. For now, it's a choice of either li_concat.h or li_outop.h. Both are described above.
We use it by adding a #define to the project file/makefile, defining which interface type we want. E.g., on Qt Creator, to use the stream output operator, we'd do something like this in the .pro file:


The default type (if we haven't #defined PCBLUESY__LOGINTERFACETYPE anywhere) is the stream output operator.

For now, I'll leave this header specific to the application, i.e., I'll have to create a new header for each application. I suspect this is not the best design, but I'll wait until I've used it a few times to see how it turns out.

Logging implementation - e.g., poco_log.h or boost_log.h

This is where we define the macros that will directly invoke the required functionality in the logging implementation (e.g., Poco or Boost).

The way I've set it up, we have a macro to get the logger (e.g., for Boost Log):

    src::severity_logger<boost::log::trivial::severity_level> \

Then, we have the top level logging macro, i.e., the macro that will get used in all the logging statements in the code:

#define PCBLUESY__LOG(level, ...) \

These are the only two macros used in the application code. We then have the actual logging macros, i.e., the macros that call the logging implementation. Again, example for Boost Log:

    boost::log::trivial::severity_level::trace) __VA_ARGS__

Why the PCBLUESY__ repetition? This is the way the macro is used in the code:
    "Blimey! I didn't expect the Spanish Inquisition ERROR!", \
    chiefWeapons, mill.Trouble(), 42);
I'm using PCBLUESY__ERROR as logging level in order to distinguish it from any other *ERROR* defined "out there". Since these names are defined in this header and won't be used anywhere else, I figured a little ugliness would be harmless, and could actually be useful.
How do we use this? I've set up an example on my "SimpleSampleTuts" github. You'll need Boost Log and/or Poco C++ to see it in action. And if you use any other logging lib, provided it has an interface compatible with the ones discussed here, you should be able to create a header like poco_log.h/boost_log.h and start using it.
One final detail - initializing the logger. If you use a different implementation, you'll have to add that code, too.

Monday, 12 May 2014

Logging - The argument for the stream interface

Of all the things mentioned in my "resurface" post, I've been dedicating attention to my "logging macro front-end". To recap, the goal is to abstract the user code from the logging implementation used, thus allowing us to substitute another logging implementation with no changes to the code.
I want something like this:
    "Blimey! I didn't expect the Spanish Inquisition!", 
    chiefWeapons, mill.Trouble(), 42);

This is then "translated" into whatever interface the logging implementation provides for logging. So, with Boost Log, this could become something like this:
BOOST_LOG_SEV(someLog, boost::log::trivial::severity_level::error) 
    << __FILE__ << __LINE__ 
    << "Blimey! I didn't expect the Spanish Inquisition!" 
    << chiefWeapons << mill.Trouble() << 42;
Or, using Poco Logger's stream interface:
if (someLogRef.rdbuf()->logger().error())
    someLogRef.error() << __FILE__ << __LINE__
        << "Blimey! I didn't expect the Spanish Inquisition!" 
        << chiefWeapons << mill.Trouble() << 42 << endl;
else (void) 0;
Since I prefer a stream interface, that's where I began my work. Then, I moved on to what I call the "concatenation interface", where everything is concatenated into a single string. This was my first attempt, a few months ago, when I was using Poco Logger but wasn't aware of Poco LogStream. The reason I settled on this was that I'm not a fan of the "format string" interface (i.e., printf()-like).
So, starting from our logging macro:
    "Blimey! I didn't expect the Spanish Inquisition!", 
    chiefWeapons, mill.Trouble(), 42);

we'd have something similar to this:
poco_error(someLogRef, __FILE__ + __LINE__ 
    + "Blimey! I didn't expect the Spanish Inquisition!" 
    + chiefWeapons + mill.Trouble() + 42);

And while I always preferred the stream interface, as I worked more on concatenation, I became aware it was not just a matter of preference; the stream interface is vastly superior. What do I mean by "superior"?
Let's look at this:
someStream << __FILE__ << __LINE__
    << "Blimey! I didn't expect the Spanish Inquisition!" 
    << chiefWeapons << mill.Trouble() << 42
__FILE__ and "Blimey! etc..." are char* (I'll ignore constness here), and work right out of the box. Ditto for __LINE__ and 42. So, our wildcards here are mill.Trouble() and chiefWeapons; for the sake of argument, let's assume mill.Trouble() returns a float and chiefWeapons is a container that defines its own operator<<(). This means close to 84% of our logging line works with no work required on our part. Our only additional work is defining operator<<() for whatever type chiefWeapons happens to be. And we probably would define it anyway, since output to a stream is always a handy feature, IMHO.
Now, let's look at this:

__FILE__ + __LINE__
    + "Blimey! I didn't expect the Spanish Inquisition!" 
    + chiefWeapons + mill.Trouble() + 42
This is supposed to be concatenation; concatenation assumes some string type. Since none of these arguments is a string type, we'd need to convert them. That's not difficult, but the question is - where would the conversion occur?

We don't want to place it in the original logging line, because we want it to be interface-agnostic. However, that's the only place where we know the type of each argument; we certainly couldn't place it in our macro mechanism, because macro parameters have no type.

We could create a family of template functions for this, and solve our problem through specialization - if we pass a string type, just return the string itself (yes, I'm ignoring the several string types in C++ libs and the necessity to copy those into a single type); if we pass a char*, build a string with it; if we pass an int, call to_string() (or similar); etc, meaning, every type we use would need a way to convert to string. And, while we might not require a template specialization for each type (e.g., we could use to_string() for int, long, or float), we'd be pretty close to that mark.
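Here's a sketch of what that family might look like (log_str is a made-up name; every type used in a logging statement needs a matching overload):

```cpp
#include <string>

// Core set: strings pass through, char* and numerics are converted.
inline std::string log_str(const std::string& s) { return s; }
inline std::string log_str(const char* s)        { return std::string(s); }
inline std::string log_str(int v)                { return std::to_string(v); }
inline std::string log_str(long v)               { return std::to_string(v); }
inline std::string log_str(double v)             { return std::to_string(v); }

// The concatenation interface would then wrap every argument, e.g.:
//   log_str(__FILE__) + log_str(__LINE__) + log_str(chiefWeapons) + ...
```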

So, assuming this could be pulled off, with a proper design, it's still more work than taking advantage of a group of core types that have an already-functioning operator<<(), and only adding this to other types that require it.

Then, there is the question of performance - unless we use some mechanism like QStringBuilder, concatenation will be much more expensive than streaming output.
Finally, there is one last point that makes me prefer the stream interface, one that I've come upon as my logging usage became more "complex" - no conversions necessary. Conversions are one of the sources of problems in C/C++, and while we can mark them as explicit, I prefer sticking to a simple rule of defining no unnecessary conversions. If I only need a conversion to string when I'm outputting to log, then it is an unnecessary conversion.
So, where does this leave my "logging macro front-end"? I'm going to finish my work and publish it with only the stream interface functional. The concatenation interface will be semi-functional, meaning no provision for conversion to string. I realize this is mostly self-defeating, since it means I probably won't use it. But I have two good options with operator<<() at the moment, so I'll stick with those, for the time being.

Sunday, 4 May 2014

Single-line comments that want to be multi-line

Update: While working on this code today, I had some "mystical" results, and ran a rebuild instead of the usual build. Et voilà, gcc presented me with this: "warning: multi-line comment [-Wcomment]". So, in all fairness, this issue isn't as much of an issue as it appeared at first.

So, we have our good-ole multi-line comment, aka, C comment:

/* I am a long comment, spanning a lot of lines,
I just go on and on until I start foaming at the
mouth and falling over backwards, and...
I say, I seem to have drifted a bit, haven't I...? */

Then, we have our heavy-duty single-line comment, aka, C++ comment (actually, since C99, this has been a bit of a misnomer): 

// What do we mean by no?
// What do we mean by yes?
// What do we mean, allowing backslash \
line continuation to work on single-line comments? \

Oh, yes, this will happily compile. Let's see it in action, in a terribly contrived example:

int main()
{
    // We don't pass any argument to CountFiles()
    // because we always start at \
    int tot_files = 0;
    tot_files = CountFiles();
}

So, our intrepid - and rather uncognizant - developer documents the design decision to always start counting files at the current drive's root. Developing on the Windows side of C++, and being eco-minded, he decides to save 3 characters by replacing "root" with "\".

The compiler (in this case, gcc) rewards him with this beautiful message: "error: 'tot_files' was not declared in this scope".

In all fairness, VC++ correctly flags the continuation lines as comments, and the "syntax artists" responsible for painting our IDEs in all those lovely colours correctly paint those lines as comments. Qt Creator? Not so much, as you can readily see from the code snippet above. Not that I blame them, really; I imagine they have more important things to do.

Reading about it, I learned this behaviour occurs because line splicing (i.e., removing the backslash and the newline, thus joining the lines together) happens in an earlier translation phase than comment removal.

I find this a bit strange. It would be more sensible for the first step to be comment removal. Which would have the beneficial side-effect of getting rid of this brilliant rule (in the "brilliant bean" sense, I mean) .

Yes, I know, there are more important things to do.

Thursday, 1 May 2014

It's Been A Long Time

Actually, the song is called "Rock and Roll", but it has been a long time.
It all started in late September, as I've mentioned in a previous post. We rolled out a massive project, and spent quite some time picking up the pieces. Shortly thereafter, I was involved in a new massive project, and it kept me busier than usual until late January. During this time, I had little energy left to pursue my personal projects. On the other hand, I did pick up the guitar again, which I hadn't done in months. And, while there haven't been any more massive projects since that time, there has been enough research to keep my schedule busy.
Still, I haven't been idle on the personal project front. And, in some cases, there has been some overlap - lessons learned on the personal front that carry over to the professional, and vice-versa. So, what has happened during this long time? Actually, it's still happening, it's a Work in Progress.


I've started working on programming puzzles, and I quickly became aware that my math knowledge is lacking, since the vast majority of such puzzles are math-based. I began working on this, but I have a very long road ahead. It's been progressing slowly.


I've also realized my knowledge of C, and how it interacts with C++, is also lacking. So, I'm - slowly - starting to learn some of C's particularities. It's making me realize, among other things, how much I really like C++.

Systems knowledge

I've been extending my systems knowledge. This is the area where most of the above-mentioned overlap happens. Due to requirements on one of the projects (mentioned above), I began with the workings of certificates, DNS, and HTTP requests. During the project, it was all very task-oriented, i.e., getting the knowledge necessary to perform specific tasks. Now, I'm going for a more systematic approach.

Build gcc

I've set up a VM with CentOS and built a local gcc installation from scratch (which is actually something a friend had suggested to me a long time ago). I've set up an environment similar to what I have on Windows, with Qt Creator and gcc 4.8, and I've built ICU and Boost. Then, I've decided to take a step back because the method I used to build gcc was quite time-consuming and involved, and I'm now investigating if there's an easier/faster way to do it. Also, from what I can see here, building CLoog together with gcc is now supported (I didn't see this a few months ago, when I built gcc), so I'll definitely give it a try.


I've also started learning more about linking, both static and dynamic, after being bitten by an unexpected versioning issue. As I've said above, I've built ICU and Boost, and used the latest versions available at the time. However, the version of ICU I've built was 52.x, and the version included with Qt was 51.x. I didn't think much of it, trusting I could use symlinks to the same version for both Boost and Qt. No such luck. When I tried using ICU 52.x for both, I got an error from QtCore about an undefined symbol: u_strToLower_51. Puzzled by the version numbering on function names, I searched online and learned that DLL/SO symbol resolution works differently on Unix/Linux and Windows. This means I'll have to plan my building process more carefully. I don't want to carry different versions of the same libs around (especially ICU, which is huge). Since I get Qt pre-built, that will mean using the ICU version that comes with Qt, instead of building my own.


I've been taking on logging again. Not only taking a better look at Boost Log, but also at Poco Logger. I've already seen something I've missed the first time around - Poco Logger has a stream interface; however, I found no way to set the value of __FILE__ and __LINE__ on the format string. So, I've decided to move forward by writing those values out with the message. On Boost Log, there is a way to do it, using custom formatters, but I've decided to use the same "solution" I've used for Poco Logger, i.e., write __FILE__ and __LINE__ in a sort of faux-format before the message. I'll probably implement a proper solution later on; I have an idea of how it may be done on Boost, but not yet on Poco.
And I've decided to get rid of my logging bridge class template and stick with macros. Yes, evil, probably. But... I've been saying from the start I won't get rid of macros because I want __FILE__ and __LINE__, and I want to write it only once, on the macro definition. And my goal of being able to swap logging implementations can be achieved solely with macros. Also, by using variadic macros, I can deal with the case where different implementations have a different number of arguments, which was the reason for using the class template and SFINAE in the first place.

Local scripting environment

Due to work requirements, I've been setting up a scripting environment for a local user, on a Red Hat server. Among other things, I wanted to avoid building from source. And, as an added bonus, the machine has no internet connectivity. At first, I decided to use Perl. And, after much gnashing of teeth (and of many other body parts, some of which I had no idea were capable of gnashing), I gave up. Why? Too many hurdles, and the realization that the only viable alternative would be building from source. So, I turned to Ruby. So far, it's going much better, and I see two main reasons for this:
  1. Ruby, unlike Perl, has not become a "system component". This means we can ask the sysadmins to install it from the RPM repo without fear of breaking anything (e.g., system scripts).
  2. I found the module (gem, in Ruby parlance) management simpler. Installing RubyGems locally with no internet connection was simple, and so was installing gems. I still have to deal with dependencies manually, but even that has been simpler. And there is a possibility that the gem for connecting to Informix DBs may not require native building, which was not the case with Perl.


I've added some of the simple code I've done recently to github, here, including an evolution of the AppOptions class template. The design remains mostly the same (some minor interface changes), and even the implementation changes are minor.
Well, that's it. I'll try to return to a more regular posting schedule.