Intrusion Learning: An Overview of an Emergent Discipline

The purpose of this article is to provide a definition of intrusion learning, identify its distinctive aspects, and provide recommendations for advancing intrusion learning as a practice domain. The authors define intrusion learning as the collection of online network algorithms that learn from and monitor streaming network data resulting in effective intrusion-detection methods for enabling the security and resiliency of enterprise systems. The network algorithms build on advances in cyber-defensive and cyber-offensive capabilities. Intrusion learning is an emerging domain that draws from machine learning, intrusion detection, and streaming network data. Intrusion learning offers to significantly enhance enterprise security and resiliency through augmented perimeter defense and may mitigate increasing threats facing enterprise perimeter protection. The article will be of interest to researchers, sponsors, and entrepreneurs interested in enhancing enterprise security and resiliency.


Introduction
Intrusion learning offers the potential of significantly improving the security and resiliency of enterprise systems and increasing the enterprise's capability to adapt to adversaries and changes in business environments. This article positions the emerging domain of intrusion learning at the intersection of machine learning, intrusion detection, and streaming network data. Machine learning refers to algorithms that are first trained with reference input to "learn" its specifics and are then deployed on previously unseen input for the actual detection process (Sommer & Paxson, 2010). Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of possible incidents, which are violations or imminent threats of violation of computer security policies, acceptable use policies, or standard security practices (Scarfone & Mell, 2007). By streaming network data, we mean streams of distinct and diverse network events flowing on a network over time. This definition is consistent with the definition of data stream provided by Savvius (2016).
We draw upon the results of a literature review carried out for the purpose of defining intrusion learning. We start with a summary of the literature review and then define intrusion learning, identify its distinctive aspects, and provide recommendations for advancing the emerging discipline. We end with our conclusions.

Literature Review
We performed a systematic narrative review to identify the latest advancements published in the academic literature with respect to machine learning, streaming network data, and intrusion detection. Articles in English-language journals published from 2010 to 2015 in North America and Europe were reviewed. We organized the literature into five themes: i) feature extraction, ii) learning algorithms, iii) clustering, iv) datasets, and v) tools.
"The illiterate of the 21st Century are not those who cannot read and write but those who cannot learn, unlearn and relearn."
Alvin Toffler, writer and futurist, in Powershift

Intrusion Learning: An Overview of an Emergent Discipline
Tony Bailetti, Mahmoud Gad, and Ahmed Shah

Feature extraction
Feature extraction is the process of determining a subset of features from an original set. The intent of feature extraction is to find a combination of original features or data attributes that can better describe the internal structure of the data. The three principal algorithms used for feature extraction are: locality preserving projection (linear projective maps that arise from solving a variational problem that optimally preserves neighbourhood structure), linear discriminant analysis (a method for finding a linear combination of variables that optimally separates classes), and principal component analysis (a linear technique that projects the data along the directions of maximal variance) (Fisher, 1936; He, 2005; Parakash & Surendran, 2013).
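To make the idea of projecting data "along the directions of maximal variance" concrete, the following is a minimal sketch of principal component analysis on a toy table of flow records. It assumes two hypothetical features per flow (packet count and kilobytes) and approximates only the first component by power iteration. The function names and the sample data are illustrative and do not come from the reviewed literature.

```python
import math


def mean_center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of rows that are already mean-centered."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(rows, iterations=200):
    """Approximate the direction of maximal variance by power iteration."""
    cov = covariance_matrix(mean_center(rows))
    dims = len(cov)
    # Arbitrary unit-length start; it must not be orthogonal to the answer.
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance along the current direction
            break
        vec = [x / norm for x in nxt]
    return vec


def project(rows, component):
    """Reduce each mean-centered row to a single score along the component."""
    return [sum(x * w for x, w in zip(row, component)) for row in mean_center(rows)]


# Hypothetical flows: [packet count, kilobytes]; the two features move together.
flows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component = first_principal_component(flows)
scores = project(flows, component)
```

Here the two correlated features collapse into one score per flow. A production system would more likely rely on a numerical library and keep several components, and power iteration with a fixed starting vector can fail when that vector happens to be orthogonal to the dominant direction.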
Intrusion detection systems use feature extraction to determine which features or attributes can assist with detecting malicious traffic (Laxhammar, 2014). We found two feature extraction challenges in the context of streaming network data. First, the dynamically changing nature of the streams results in challenges pertaining to feature evolution (the emergence of new features), concept evolution (new classes evolving in the stream), and concept drift (changes in the underlying concepts) (Momin & Hambir, 2015). Second, data streams are, in principle, of infinite length (Masud et al., 2010). Most existing data stream classification techniques address only the infinite-length and concept-drift problems; concept evolution and feature evolution are ignored. In the face of a dynamic adversary, ignoring concept evolution and feature evolution increases enterprise risk.
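As a rough illustration of concept drift only, the sketch below flags a shift when the mean of the most recent values of one feature moves far from the mean of a reference sample. The class name, the window size, and the threshold are assumptions chosen for the example. The heuristic does not address concept evolution or feature evolution, which the literature above identifies as the neglected problems.

```python
from collections import deque
from statistics import mean, stdev


class MeanShiftDriftDetector:
    """Flags drift when the recent mean departs from a reference mean."""

    def __init__(self, reference, window_size=5, threshold=3.0):
        self.ref_mean = mean(reference)
        self.ref_std = stdev(reference)
        self.window = deque(maxlen=window_size)  # bounded memory
        self.threshold = threshold

    def update(self, value):
        """Consume one value; return True if drift is suspected."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        # Standard error of a window mean under the reference distribution.
        standard_error = self.ref_std / len(self.window) ** 0.5
        shift = abs(mean(self.window) - self.ref_mean)
        return shift > self.threshold * standard_error


# Hypothetical feature: connection requests per second during normal operation.
reference = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
detector = MeanShiftDriftDetector(reference)
stable_flags = [detector.update(v) for v in [10, 9, 11, 10, 10]]
shifted_flags = [detector.update(v) for v in [20, 20, 20, 20, 20]]
```

A detector of this kind reacts to gradual or abrupt changes in a single numeric feature. It says nothing about whether the change is benign or adversarial.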

Learning algorithms
Three emerging machine-learning algorithms play important roles in intrusion learning: active learning, adversarial learning, and conformal prediction. Active learning is a subfield of artificial intelligence and machine learning, and it refers to the study of computer systems that improve with experience and training (Settles, 2012). Adversarial learning refers to the study of effective machine learning techniques against an adversarial opponent (Huang et al., 2011). Conformal prediction refers to hedging individual predictions made by machine learning algorithms with valid measures of confidence (Laxhammar & Falkman, 2011).
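The notion of hedging a prediction with a measure of confidence can be sketched with a simplified inductive conformal anomaly detector. The nonconformity measure used here (absolute distance from the training mean), the class name, and the sample values are assumptions made for illustration, not the measures used in the cited work.

```python
from statistics import mean


def nonconformity(value, center):
    """How strange a value looks: its distance from the training center."""
    return abs(value - center)


class ConformalAnomalyDetector:
    """Assigns each new value a p-value relative to a calibration set."""

    def __init__(self, training, calibration):
        self.center = mean(training)
        self.calibration_scores = [
            nonconformity(v, self.center) for v in calibration
        ]

    def p_value(self, value):
        """Fraction of calibration points at least as strange as `value`."""
        score = nonconformity(value, self.center)
        at_least_as_strange = sum(
            1 for s in self.calibration_scores if s >= score
        )
        return (at_least_as_strange + 1) / (len(self.calibration_scores) + 1)

    def is_anomaly(self, value, epsilon=0.2):
        """Flag the value if its p-value falls below the significance level."""
        return self.p_value(value) < epsilon


# Hypothetical feature: kilobytes transferred per flow under normal conditions.
training = [100, 102, 98, 101, 99]
calibration = [99, 101, 103, 97, 100, 102, 98, 104, 96]
detector = ConformalAnomalyDetector(training, calibration)
typical_p = detector.p_value(100)
extreme_p = detector.p_value(150)
```

The significance level `epsilon` bounds the expected rate of false alarms on data that resembles the calibration set, which is what makes the confidence measure useful when alarms are costly to investigate.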
The presence of an adversary changes the dynamics for learning algorithms. An adversary will attempt to poison or manipulate the data so that the algorithms treat the malicious as benign. This adversarial context has led to research on how algorithms can unlearn poisoned and polluted data (Cao & Yang, 2015).
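One way to picture unlearning is a model whose entire state is a set of counts, so that the contribution of a poisoned sample can be subtracted again. The sketch below is a simplified illustration under that assumption and is not the method of Cao and Yang (2015). The class, the labels, and the token lists are hypothetical.

```python
import math
from collections import Counter


class CountingClassifier:
    """Naive Bayes-style classifier whose state is only sums of counts."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = {}  # label -> Counter of tokens seen with it

    def learn(self, tokens, label):
        self.label_counts[label] += 1
        self.token_counts.setdefault(label, Counter()).update(tokens)

    def unlearn(self, tokens, label):
        """Forget a previously learned sample by subtracting its counts."""
        self.label_counts[label] -= 1
        self.token_counts[label].subtract(tokens)

    def predict(self, tokens):
        """Return the label with the highest smoothed log-probability."""
        vocabulary = set()
        for counts in self.token_counts.values():
            vocabulary.update(t for t, c in counts.items() if c > 0)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            if n <= 0:
                continue  # nothing currently learned for this label
            counts = self.token_counts[label]
            denom = sum(c for c in counts.values() if c > 0) + len(vocabulary) + 1
            score = math.log(n / total)
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((max(counts[token], 0) + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = CountingClassifier()
model.learn(["get", "index"], "benign")
model.learn(["get", "image"], "benign")
model.learn(["post", "shell"], "malicious")
model.learn(["post", "exploit"], "malicious")
```

If an adversary manages to have `["post", "shell"]` learned as benign several times, the prediction for that request flips to benign. Calling `unlearn` with the same samples restores the original counts and, with them, the original prediction, without retraining from scratch.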

Clustering
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning (Jain, 2010). Clustering is used to detect unknown attacks and discover unusual activities or usage patterns in traffic data in real time. The value of clustering comes from discovering groups and structures in the data that, in some way, are similar to each other, without prior knowledge of the data structures.
Data stream algorithms can only read the incoming data once and must do so in the context of having to respond in real time with bounded memory usage. These algorithms can only provide approximate results and must support evolving concepts (Nguyen & Luo, 2013).
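The single-pass, bounded-memory constraint can be illustrated with sequential (online) k-means, which keeps only k centroids and their counts and touches each point exactly once. This is a minimal sketch under the assumption that the first k points are acceptable seeds. The class name and the sample points are hypothetical.

```python
import math


class OnlineKMeans:
    """Single-pass clustering that stores only k centroids and their counts."""

    def __init__(self, k):
        self.k = k
        self.centroids = []
        self.counts = []

    def update(self, point):
        """Consume one point, move the nearest centroid, return its index."""
        if len(self.centroids) < self.k:
            # Seed the centroids with the first k points of the stream.
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        idx = min(
            range(self.k),
            key=lambda i: math.dist(point, self.centroids[i]),
        )
        self.counts[idx] += 1
        rate = 1.0 / self.counts[idx]  # running-mean update
        self.centroids[idx] = [
            c + rate * (x - c) for c, x in zip(self.centroids[idx], point)
        ]
        return idx


# Hypothetical two-feature events arriving one at a time.
stream = [(0, 0), (10, 10), (1, 1), (9, 9), (0, 1), (10, 9)]
clusterer = OnlineKMeans(k=2)
assignments = [clusterer.update(p) for p in stream]
```

The running-mean update weighs all past points equally. Replacing `rate` with a small constant would let centroids follow evolving concepts at the cost of noisier estimates, which is one form of the approximation trade-off described above.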
Because real-time data streams are unbounded, it is only possible to process a portion of the entire data stream, one "window" at a time (Nguyen & Luo, 2013).
Various kinds of window-based algorithms exist. For example, the sliding window algorithm analyzes the most recent data points and is suitable for applications where only the most recent information is of interest. The main disadvantage is that it ignores parts of the data streams.
An adversary could manipulate a sliding window so that malicious activities occur in those parts of the streams being ignored by the algorithm.
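The weakness described above can be sketched with a count-based sliding window over the most recent events. The class, the helper function, the window size, and the event labels are assumptions made for illustration. The example contrasts a burst of probes with the same number of probes spaced further apart than the window is long.

```python
from collections import Counter, deque


class SlidingWindow:
    """Tracks per-source event counts over the most recent `size` events."""

    def __init__(self, size):
        self.size = size
        self.events = deque()
        self.counts = Counter()

    def add(self, source):
        """Record one event and expire the oldest if the window is full."""
        self.events.append(source)
        self.counts[source] += 1
        if len(self.events) > self.size:
            expired = self.events.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]

    def count(self, source):
        return self.counts[source]


def max_window_count(stream, source, size):
    """Largest count for `source` that the window ever observes."""
    window = SlidingWindow(size)
    highest = 0
    for event in stream:
        window.add(event)
        highest = max(highest, window.count(source))
    return highest


# Three probes sent back to back versus three probes spaced six events apart.
burst = ["attacker"] * 3 + ["user"] * 15
slow = (["attacker"] + ["user"] * 5) * 3
burst_peak = max_window_count(burst, "attacker", size=5)
slow_peak = max_window_count(slow, "attacker", size=5)
```

With a window of five events and an alert threshold of three probes, the burst is caught while the slow sequence never exceeds a count of one, even though both streams contain the same number of probes.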

Datasets
A dataset contains network traffic that is used to benchmark the performance of network intrusion algorithms. Datasets may include a combination of malicious traffic, non-malicious traffic, and identified features that can be used for testing. The dataset most commonly used by researchers for intrusion detection dates back to the KDD Cup 1999 (archive.ics.uci.edu/ml/datasets.html). It is surprising that a dataset from 1999 is still commonly used given the significant changes in attack tools, techniques, and data types that have occurred since then.
That the KDD Cup 1999 dataset is still used suggests that developing or accessing contemporary datasets is a major challenge. Privacy rights, confidentiality, and intellectual property are all concerns that impede access to real network data. Though other datasets are available, the reality is that valid contemporary streaming data is unavailable outside of large Internet providers. The absence of new datasets hinders science-based experimentation with new algorithms.

Tools
Many publicly available experiments that apply machine learning to intrusion detection use a tool called massive online analysis (MOA; moa.cms.waikato.ac.nz). MOA is a machine-learning framework that contains real-time stream processing algorithms. It is not customizable for multi-node, scalable, distributed processing.
Scalable and distributable machine-learning processing engines that can process real-time streaming information do exist (e.g., SAMOA; samoa.incubator.apache.org), but they have not been widely found in streaming intrusion-detection machine-learning experiments. We have not determined why this situation exists, though we note that SAMOA is a relatively new Apache project. SAMOA is one of the few open source tools that are specifically designed for distributed and true real-time streaming (Landset et al., 2015). Apache Spark with MLlib also includes a distributed architecture for processing data streams (spark.apache.org).

Defining Intrusion Learning
In this section, we propose a definition of intrusion learning based upon four elements: i) the ultimate outcome of intrusion learning; ii) the target of the ultimate outcome; iii) the mechanism used to deliver the ultimate outcome; and iv) the interdependence between intrusion learning and scientific and technological advances.
We propose the following definition of intrusion learning:

Intrusion learning is the collection of online network algorithms that learn from and monitor streaming network data resulting in effective intrusion detection methods for enabling the security and resiliency of enterprise systems. The network algorithms build on advances in cyber-defensive and cyber-offensive capabilities.
We characterized the elements underpinning this definition as follows:
1. Ultimate outcome: Effective intrusion-detection methods on streaming network data.
2. Target of ultimate outcome: Security and resiliency of enterprise systems is the key target outcome.
3. Mechanism used to deliver ultimate outcome: Online network algorithms that learn from and monitor streaming network data.
4. Interdependence of this mechanism with scientific and technological advances: The mechanisms must build upon advances in both cyber-defensive and cyber-offensive capabilities (e.g., new machine-learning algorithms, new attack vectors), which themselves are informed by multi-disciplinary thinking.

Distinctive Aspects
We believe that there are five distinctive aspects of the intrusion learning domain relative to the machine learning, intrusion detection, and streaming domains:
1. Real-time analysis of streaming network data: Intrusion learning must respond to intrusions in real time. Unlike big data analytics, intrusion learning requires approximations, windowing, and other techniques to produce effective, timely, and scalable analysis of network data (Aggarwal, 2007).
2. High cost of failure: The cost of failure of machine-learning algorithms is much higher for intrusion detection (e.g., loss of intellectual property and brand damage) than for other applications of machine learning such as optical character recognition (Sommer & Paxson, 2010).
3. Adversarial context: Intrusion learning must deal with the existence of talented and determined adversaries. The presence of the adversary requires that intrusion learning evolve with ongoing advances in both cyber-defensive and cyber-offensive capabilities (Cao & Yang, 2015; Corona et al., 2013).
4. Network traffic diversity: Intrusion learning must deal with the variability of network traffic (e.g., bandwidth, load balancing, and connection requests). Traffic diversity complicates the perspective of "normal" and therefore hinders the ability to identify an anomaly (Sommer & Paxson, 2010).
5. Outlier detection: Machine-learning algorithms are better at finding similarities than anomalies. As noted by Sommer and Paxson (2010), "the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system."
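The high cost of failure noted in the list above can be made concrete with a small cost-weighted comparison of two detectors. All figures in this sketch are hypothetical and serve only to show that the detector with fewer total errors is not necessarily the cheaper one once missed attacks are priced higher than false alarms.

```python
def total_cost(missed_attacks, false_alarms, cost_missed_attack, cost_false_alarm):
    """Total cost of a detector's errors under the given unit costs."""
    return missed_attacks * cost_missed_attack + false_alarms * cost_false_alarm


# Hypothetical unit costs: a missed attack is far more expensive than
# an analyst's time spent dismissing a false alarm.
COST_MISSED_ATTACK = 10_000
COST_FALSE_ALARM = 50

# Detector A makes 12 errors in total; detector B makes 50.
cost_a = total_cost(2, 10, COST_MISSED_ATTACK, COST_FALSE_ALARM)
cost_b = total_cost(0, 50, COST_MISSED_ATTACK, COST_FALSE_ALARM)
```

Under these assumed costs, detector B is cheaper despite raising five times as many false alarms, which is why error counts alone are a weak basis for comparing intrusion-detection methods.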


Recommendations
The recommendations that follow are directed at researchers, sponsors, and entrepreneurs interested in intrusion learning:
1. Understand the threat model. For example, researchers must know the cost of missed attacks (Sommer & Paxson, 2010).
2. Learn, unlearn, and relearn. Adversaries will act to mislead algorithms by steering the analyses to recognize the malicious as benign. Effective responses to such attacks need development. Corona and colleagues (2013) examine adversarial attacks against intrusion-detection systems as well as related taxonomies and potential solutions to known issues. This perspective leads to the concept of systems "unlearning" or forgetting what they had incorrectly "learned" (Cao & Yang, 2015).
3. Select a narrow research scope. The objectives of the research must be concrete.
4. Enhance feature extraction. Research should aim to expand the set of extractable features that correlate with malicious traffic. This research could remain at the level of network flow, but richer theories are likely to provide more substantial payoffs.

Conclusion
In this article, we introduced the concept of intrusion learning as a domain that draws from machine learning, intrusion detection, and streaming network data. A key benefit of intrusion learning is that it may significantly enhance enterprise security and resiliency through augmented perimeter defense.
We identified a set of unique attributes and recommendations for advancing intrusion learning. For intrusion learning to meet its objectives of enhanced security and resiliency, these recommendations should not be treated in isolation but should build upon each other: cross-cutting thinking (over machine learning, intrusion detection, and streaming) that focuses upon the distinctive aspects of intrusion learning will enhance progress.
Perhaps our most important recommendation is the development of new datasets that reflect contemporary network data and malware. The absence of such datasets is a significant impediment to the validation of intrusion-learning techniques. Privacy rights, confidentiality, and similar concerns are impeding the development of such datasets. We end this article with a "call to action" to develop such datasets, properly informed by researchers, privacy advocates, policy personnel, and so on, so that societal concerns are addressed.