Introduction

In a previous article[1], we discussed how logs make it possible to identify the weak signals of cyber attacks.

By correlating the events recorded in logs, analysing the behaviour of specific user profiles, enriching logs with external resources and rendering the resulting analyses, it is possible to detect the so-called “weak signals” of attempted or actual breaches (weak because they are indirect and, taken alone, are not evidence of an incident).

That article focused on log analysis in the specific context of attack detection, but it also made clear that many electronic devices generate logs: Windows, Mac or Linux workstations, Android mobile phones, home automation equipment, future connected cars…

This article describes log analysis in a broader sense, with a view to understanding the syntax and semantics of logs and rendering them. To illustrate this “broader case”, we will discuss the Internet of Things[2], since it is the archetype of the proliferation of heterogeneous log formats, to which it is difficult to apply a homogeneous analysis method.

How can we deal with the diversity not only of log formats, but also of the meaning of the data they carry? And how can we generate renderings that clearly illustrate the results of correlating such data?

Log parsing

Highly heterogeneous formats

If you have ever browsed logs[3] generated by the Checkpoint firewall, you know how difficult they are to understand, even though this product is widely used and its log formats are duly documented (or even made public). If you have never analysed such a log, the excellent example published by the SANS Institute will help you grasp the approach.

Now imagine how difficult it is for someone analysing logs or developing an intrusion detection system to understand logs generated in a proprietary binary format! With confidential or incomplete documentation, this task becomes a huge forensic challenge.

Each format has its own parser

It is therefore necessary to develop, for each log type, a parser (or syntactic analyser) able to extract the so-called “key-value” pairs from that log. The keys represent metadata; the pairs are then sent to the analysts’ correlation tools in order to detect, for instance, the weak signals of cyber attacks, as described in our previous article.
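As an illustration of what such a parser does, here is a minimal sketch in Python. It assumes a purely hypothetical text format made of space-separated key=value fields (with optional double quotes around values); the field names and the sample line are invented for the example and do not reproduce the format of any real product.

```python
import re

# Matches key=value or key="quoted value". This layout is an assumed
# example, not the log format of any particular device or vendor.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_key_values(line: str) -> dict[str, str]:
    """Extract the key-value pairs found in one log line."""
    pairs = {}
    for key, quoted, bare in PAIR_RE.findall(line):
        # Exactly one of the two value groups is filled (or both are empty).
        pairs[key] = quoted if quoted else bare
    return pairs


# Hypothetical log line, using documentation-reserved IP addresses.
sample = ('time=2014-06-08T17:42:10Z action=drop src=192.0.2.10 '
          'dst=198.51.100.7 service="ssh" user="X"')
event = parse_key_values(sample)
print(event["action"], event["src"], event["service"])
```

A binary or undocumented format would need a very different parser, which is precisely why one parser per log type ends up being written.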

In this field, there are very few solutions for generalising the design of parsers. Most often, either parsers are developed in-house for very specific log types (and are therefore hard to upgrade), or there are no parsers at all (log analysis is far from being a matter of course for companies)!

Thus, log parsing remains a technical and arduous task, which in practice calls for skills in the development and administration of IT systems. Furthermore, the difficulty of analysing heterogeneous sources is not limited to the variety of data formats.

Indeed, data in the same format may have different meanings: the meaning of a given piece of data is not entirely captured by a table of (metadata, value) pairs. In other words, parsing does not provide a semantic analysis of the data, even though such an analysis is useful for generating indicators from that data!

Semantic analysis of logs

More or less objective data

In the context of the Internet of Things, the meaning of logs becomes even more varied. Let us consider an example of how much the meaning of a piece of data can vary.

Events in the syslog[4] format are rather clear and objective – for instance: “User X connected at time T”, or “Process Z read file A at time T”. The log may be further interpreted, but basically, the formatted data is not biased by an analyst’s judgement.

The situation becomes more complex when you receive, for instance, information about the reputation of an IP address. Of course, we may define a format to convey a structure describing such a reputation (this is discussed in the next section), but the nature of this data, namely the possible malicious history of the IP address, clearly differs from that of a connection notification.

However, our aim is precisely to correlate data of such different natures. The correlation mechanism thus becomes a scientific issue in its own right.

How can we define data rendering standards?

Although this issue is an old one, it is far from outdated. In our previous article, we mentioned the very active work carried out by MITRE[5], in particular on the design of:

a standardised language to represent structured cyber threat information, STIX[6]. This language is in the process of being adopted by several governmental and transnational CERTs[7] and by numerous companies[8];
a structured language to describe “cyber events”: CYBOX[9]. The language is little used, as it is based on STIX (the publishers mentioned in note 8 plan to support the production of documents in the STIX format by the last quarter of 2014[10]);
a language for encoding and communicating high-fidelity information about malware: MAEC[11];
a set of services that enable sharing of cyber threat information: TAXII[12].

These formats will be a major step forward in the exchange of cyber attack data. However, this is just a particular case, where the meaning of exchanged data is only related to cyber attacks… a situation that does not apply to all data exchanged by connected devices!

The issue today is how to present data which have totally different meanings.

Let us imagine a tool able to enhance the logs sent by various devices (whether related to security or not) by using the data collected from social networks, underground forums, websites and other intelligence sources.

This raises several questions, which represent a major challenge for companies dealing with business intelligence and open source intelligence (OSINT):

How can we correlate data of a very different nature? How can we systematically generate relevant and measurable clues from such diverse data?
How can we weight the “objectivity” of the data to be correlated? For example, if we decide to correlate system logs (for instance, notifications of connection failures on the SSH port) with the origin of alleged attacks (typically, countries that analysts systematically suspect, such as China or Russia), we bias the reasoning. This may lead to false positives, or even to unjustified accusations.
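To make the second question more concrete, here is one naive way of weighting objectivity, given purely as a sketch. The Evidence structure, the objectivity scale and every numeric value below are assumptions chosen for the example; the article does not propose this (or any) formula, and choosing the weights is itself an analyst’s judgement.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    description: str
    signal: float       # strength of the indication, between 0 and 1
    objectivity: float  # 1.0 = factual record, lower = analyst judgement


def weighted_score(evidence: list[Evidence]) -> float:
    """Average the signals, each one weighted by its objectivity."""
    total_weight = sum(e.objectivity for e in evidence)
    if total_weight == 0:
        return 0.0
    return sum(e.signal * e.objectivity for e in evidence) / total_weight


# Hypothetical observations: a factual log record and a subjective prior.
observations = [
    Evidence("repeated SSH connection failures", signal=0.8, objectivity=1.0),
    Evidence("source country deemed suspicious", signal=0.9, objectivity=0.1),
]
score = weighted_score(observations)
print(round(score, 3))
```

With these made-up weights, the subjective element barely moves the score; giving it an objectivity of 1.0 would let a prejudice count as much as a recorded fact, which is the bias described above.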

In the field of connected devices, we are facing the same issues.

Let us take the example of correlating logs generated by a connected thermometer[13], refrigerator[14] and smart electricity meter[15]. Since three data types have been defined, any attempt to correlate the data exchanged (typically, the temperature, the fridge content and the power consumption at a given time) amounts to pre-determining the correlation rules between them.

Such rules may include the following: if the temperature increases, the weather is warmer; hence, household members will store more beverages in the fridge and turn the air conditioning on; hence, power consumption will increase by ∆%.

But what if, one day, you forget to switch the radiator on? Or if, another day, you store fruit juices in your cellar because the fridge is full? Or if, one evening, you leave all the lights on because you have friends at home, and hence, your power consumption is abnormally high?

All these are exceptions, or “singularities”, among the observed events, and they make it very hard to answer the question “how can we correlate data of a different nature?”.
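The fragility of a pre-determined rule can be sketched in a few lines of Python. The rule below hard-codes the reasoning “warmer weather, hence higher power consumption”; the coefficient of 2 % per degree and the tolerance of 5 points are arbitrary assumptions made up for the illustration.

```python
def expected_power_increase(temp_delta_c: float,
                            percent_per_degree: float = 2.0) -> float:
    """Pre-determined rule: each extra degree raises consumption by a fixed percentage."""
    return max(temp_delta_c, 0.0) * percent_per_degree


def is_consistent(temp_delta_c: float, observed_increase_pct: float,
                  tolerance_pct: float = 5.0) -> bool:
    """Check whether the observed consumption matches what the rule predicts."""
    expected = expected_power_increase(temp_delta_c)
    return abs(observed_increase_pct - expected) <= tolerance_pct


# Ordinary warm day: 5 degrees warmer, consumption up by 9 %.
print(is_consistent(5.0, 9.0))   # True: the rule holds

# Evening with friends: same temperature, all lights on, consumption up by 40 %.
print(is_consistent(0.0, 40.0))  # False: flagged, although nothing is wrong
```

The second case is a “singularity”: the rule reports an anomaly simply because it knows nothing about guests, cellars or forgotten radiators.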

In fact, we humbly believe that, to date, the only way to generate relevant correlation indicators is to study a given log-generating system in detail and to combine that study with the experience and judgement of an analyst. In our opinion, any overly general solution should be regarded as highly suspicious!

The asynchrony of data

Time synchronisation in networks of connected devices is much more complex than in “traditional” Internet-type networks. In particular, it is difficult to synchronise the various NTP implementations in your refrigerator, watch, iPhone… For instance, communication latencies are extremely hard to manage, because they are specific to each central device to which a group of devices connects[16]. Consequently, even a correlation rule as simple as comparing the times at which events occurred is difficult, if not impossible, to implement.
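The following sketch shows why even a time-based rule is delicate. It assumes that each device’s timestamp comes with a known uncertainty (clock drift plus latency), which is itself an optimistic assumption; the devices, dates and uncertainty values are hypothetical.

```python
from datetime import datetime, timedelta


def may_coincide(t_a: datetime, t_b: datetime,
                 uncertainty_a: timedelta, uncertainty_b: timedelta) -> bool:
    """True if both timestamps could denote the same instant,
    given the clock uncertainty assumed for each device."""
    return abs(t_a - t_b) <= uncertainty_a + uncertainty_b


# Hypothetical events reported by two devices with unsynchronised clocks.
fridge_door_opened = datetime(2014, 6, 8, 12, 0, 0)
power_spike = datetime(2014, 6, 8, 12, 0, 40)

# Fridge clock assumed accurate to 30 s: the 40 s gap exceeds 35 s.
print(may_coincide(fridge_door_opened, power_spike,
                   timedelta(seconds=30), timedelta(seconds=5)))  # False

# Fridge clock assumed accurate to 2 min only: the events may coincide.
print(may_coincide(fridge_door_opened, power_spike,
                   timedelta(minutes=2), timedelta(seconds=5)))   # True
```

The wider the uncertainty, the more unrelated events “coincide”, and the less the rule says anything useful.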

Finally, it is worth noting that, to cut costs, many manufacturers of connected devices have simply decided not to generate logs!

Data rendering

Despite the difficulties related to the syntactic and semantic analysis of logs, we are now able to aggregate and correlate certain types of data! The practical issue is how to render the result of such correlations.

In this respect, we have previously mentioned the need for in-house development, since SIEM[17] solutions have clearly fallen short of their promises when it comes to enrichment with external sources.

Rendering structured data raises two major issues.

Trivially speaking, we live in three dimensions and most often view our data on a 2-D screen (although WebGL implementations in browsers can generate “3-D” renderings, see next paragraph, and pending “augmented reality” tools, which will add a mere third dimension at best).

So we must define renderings that the human eye can perceive and the human brain can process!
Aggregating data and associating them with more or less numerous and complex indicators produces multidimensional vectors (for instance, N pieces of data about an individual). Rendering a 2-D graph in which each node is such a vector would inevitably lead to an inextricable mess.

Furthermore, the issue is not only one of computing power. Indeed, an N-dimensional problem is not necessarily N times more complex than a one-dimensional problem: it is often worse, if not insoluble! Hence the poor performance of the browser libraries mentioned above when displaying 3-D objects (and even worse beyond three dimensions). Data representation is therefore moving from a rather technical issue to a mainly mathematical one.

This challenge is being taken up by a few CEOs of (sometimes French) SMEs, who have moved from academic mathematics to software development for companies. They develop high-level concepts and mainly draw on algebraic geometry, topology and differential analysis[18], i.e. (in the likely event that such terminology is not crystal-clear to you) mathematical theories that make it possible to subdivide elements in multidimensional spaces and to perform a whole array of more or less fun calculations that eventually lead to real-life performance improvements in your freshly compiled software!
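A small piece of arithmetic illustrates why an N-dimensional problem is worse than N times a one-dimensional one. Suppose, as an assumption made only for this example, that each dimension is cut into 10 intervals in order to draw or index the data: the number of cells to handle grows exponentially with the number of dimensions, not linearly.

```python
def grid_cells(bins_per_axis: int, dimensions: int) -> int:
    """Number of cells in a regular grid with the same resolution on every axis."""
    return bins_per_axis ** dimensions


# With 10 intervals per axis (an arbitrary resolution chosen for the example):
for n in (1, 2, 3, 10):
    print(n, grid_cells(10, n))
# Going from 1 to 10 dimensions multiplies the number of cells
# by one billion, not by ten.
```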

Conclusion

Log analysis in the context of the Internet of Things comes up against three difficulties:

the parsing of logs in extremely diverse formats;
the analysis of data whose meaning is more or less subjective and for which it is difficult to establish correlation rules;
the 2-D or 3-D representation of objects that aggregate a large quantity of data, which soon becomes illegible on a screen and for which calculations are impossible without mathematical tricks.

Log analysis thus opens up a great number of exciting research projects in mathematics and computer science, among other disciplines, and will require the definition of new concepts for data rendering, processing and transfer.

Heartfelt thanks to Barbara Louis-Sidney and Jérôme Desbonnet for their proofreading work, to Philippe Saadé for his enlightening mathematical advice, and to Guillaume Rossignol for offering me the opportunity to apply the ideas described in this article in an industrial environment.

Charles Ibrahim, Cybersecurity Analyst, Bull.

@Ibrahimous

[1] https://observatoire-fic.com/detecter-les-signaux-faibles-des-cyberattaques-ou-pourquoi-vous-devriez-analyser-vos-logs-par-charles-ibrahim-bull

[2] For a brief and clear presentation of the Internet of Things, click here

[3] http://ossec-docs.readthedocs.org/en/latest/log_samples/firewalls/checkpoint.html

[4] http://en.wikipedia.org/wiki/Syslog

[5] Not-for-profit organisation financed by and working for the US government. It currently manages three research and development centres, on behalf of: the Department of Defence (DOD Command, Control, Communications and Intelligence FFRDC); the Federal Aviation Administration (Center for Advanced Aviation System Development); and the Internal Revenue Service (Center for Enterprise Modernization).

[6] http://stix.mitre.org/

[7] In particular, the US-CERT, the Siemens CERT, the Advanced Cyber Defence Centre (ACDC)…

[8] For instance: HP, Microsoft, Bromium, Checkpoint, Malcovery, Vorstack, ThreatConnect…

[9] http://cybox.mitre.org/

[10] According to statements read on the STIX mailing list

[11] http://maec.mitre.org/

[12] http://taxii.mitre.org/

[13] For instance: http://connected-objects.fr/2014/06/igrill-thermometre-barbecue/

[14] See http://www.maison-numerique.com/produits.php?id_cat=1

[15] In France, the Linky EDF electricity meter: http://www.erdf.fr/Linky/

[16] See for instance the “Discussion” section of this link

[17] Security Information and Event Management

[18] For those interested, see http://www.math.ens.fr/~debarre/DEA99.pdf, http://www-fourier.ujf-grenoble.fr/~demailly/L3_topologie_B/topologie_nier_iftimie.pdf, and http://www.math.jussieu.fr/~delabrie/CalcDiff/CalcDiff.pdf