Deployment

When deploying AKIPS, the following information can help guide your deployment approach (VM or bare metal) and determine the hardware specifications required for your network.

Platform

Recommendation

Specifying a VM or bare metal platform is difficult because every network is different (i.e. number of users, devices, polled MIB objects, and syslog/trap/NetFlow rates). AKIPS recommends starting with a VM installation to determine the resource baseline required for monitoring your infrastructure, then increasing CPU/RAM/storage resources as needed.

As a general rule, we recommend:

a. VM Deployment

  • Commercial grade VM (e.g. VMware)
  • Dedicated CPU cores
  • Ample RAM (50% free)
  • SAN with thick provisioned preallocated storage

OR

b. Physical / Bare Metal

  • Off-the-shelf server (e.g. Cisco, Dell, HP, IBM)
  • Ample RAM (50% free)
  • RAID 1 or 10

NOTE: Before purchasing physical hardware, contact AKIPS support with your intended vendor/model/spec so we can confirm the operating system has the appropriate disk and Ethernet controller driver support.

AKIPS is known to work on the following virtual machine platforms:

  • VMware

  • VirtualBox

  • Hyper-V

  • KVM

 
Minimum recommended platform
Network Size                   Minimum Platform
Small  (50,000 interfaces)     Virtual Machine, 2+ CPU cores, 8 GB RAM, 200 GB disk space
Medium (100,000 interfaces)    Virtual Machine, 4+ CPU cores, 16 GB RAM, 500 GB disk space
Large  (250,000 interfaces)    Virtual Machine, 8+ CPU cores, 32 GB RAM, 1 TB disk space
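
As a rough illustration of the sizing tiers above, the following Python sketch simply restates the table as a lookup. The helper function and its name are hypothetical and not part of AKIPS; the tier boundaries and resource figures come straight from the table.

    # Hypothetical sizing helper that restates the "Minimum recommended platform" table.
    def recommend_platform(interfaces: int) -> dict:
        tiers = [
            (50_000,  {"size": "Small",  "cpu_cores": 2, "ram_gb": 8,  "disk_gb": 200}),
            (100_000, {"size": "Medium", "cpu_cores": 4, "ram_gb": 16, "disk_gb": 500}),
            (250_000, {"size": "Large",  "cpu_cores": 8, "ram_gb": 32, "disk_gb": 1024}),
        ]
        for limit, spec in tiers:
            if interfaces <= limit:
                return spec
        # Beyond the published tiers, contact AKIPS support for sizing advice.
        return {"size": "contact AKIPS support"}

    print(recommend_platform(80_000))   # -> the Medium tier (4+ cores, 16 GB RAM, 500 GB disk)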

System resources

Ping and SNMP polling
Ping/SNMP polling scales extremely well. The poller consumes less than 50% of a single core while monitoring in excess of 15 million MIB objects per minute on commodity hardware, which equates to over 1 million interfaces.
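
To put those figures in perspective, here is a back-of-envelope calculation using only the numbers quoted above (15 million MIB objects per minute across roughly 1 million interfaces):

    # Back-of-envelope arithmetic using only the figures quoted above.
    mib_objects_per_minute = 15_000_000
    interfaces = 1_000_000

    print(mib_objects_per_minute / interfaces)   # ~15 MIB objects per interface each minute
    print(mib_objects_per_minute / 60)           # ~250,000 MIB objects polled per second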
 
Syslog and SNMP trap
Sustained high rates of syslog and SNMP trap messages (e.g. 200+ per second) may require the resources of an additional CPU core. Syslog/trap messages are stored in compressed 10 megabyte chunks, so the higher the syslog/trap volume, the more often the data has to be compressed.
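
As a rough illustration of why high message rates cost CPU, the sketch below estimates how often a 10 megabyte chunk fills up (and therefore needs compressing) at a given message rate. The 300-byte average message size is an illustrative assumption, not an AKIPS figure.

    # Estimate how often a 10 MB syslog/trap chunk fills up and must be compressed.
    # The 10 MB chunk size comes from the text above; the 300-byte average message
    # size is an illustrative assumption.
    CHUNK_BYTES = 10 * 1024 * 1024
    AVG_MESSAGE_BYTES = 300

    def seconds_per_chunk(messages_per_second: float) -> float:
        return CHUNK_BYTES / (messages_per_second * AVG_MESSAGE_BYTES)

    for rate in (50, 200, 1000):
        print(f"{rate:>5} msg/s -> one chunk roughly every {seconds_per_chunk(rate):.0f} seconds")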
 
NetFlow

The AKIPS flow collector and meters were engineered in the expectation of a large number of flow records (e.g. 1 million flows per second) from a small number of flow exporters (e.g. 50 to 100). The software performs as expected in that environment when ample CPU cores and RAM are available.

What was unexpected was customers wanting to send flows from thousands of flow exporters. A flow meter process is started for each flow exporter, which means thousands of concurrent meter processes. This issue is being investigated and will be rectified by allowing a single meter process to handle data from many flow exporters, thereby significantly reducing the number of running processes.

Increasing the specs of a VM

The procedure for increasing CPU, RAM or storage in a VM is simple:

  1. Shut down the VM using the Admin -> System -> System Shutdown menu.
  2. Wait for the VM to shut down completely.
  3. Increase the number of CPU cores, memory size or disk space.
  4. Start up the VM.

The AKIPS startup script will automatically detect the expanded disk space and run the appropriate partition and file system commands.

System performance graphs

AKIPS provides internal system and application performance information under the Admin -> Performance menus. The important things to note are:

 
System graphs
  • Memory usage should be fairly static over a day. Assign enough memory so the memory usage graph generally shows less than 50% usage (a minimal check is sketched after this list). Lots of free memory is always useful because the operating system will consume as much memory as you give it (e.g. for disk caching).
  • CPU load, System Calls, Context Switches, Interrupts and Disk I/O will spike every 80 minutes when the background data processing occurs. This is normal.
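
A minimal sketch of the memory-headroom check described above, using the third-party psutil library. It is an external illustration only, not part of AKIPS, and the 50% threshold simply mirrors the guidance in the list.

    # Minimal external check of the "keep memory usage under roughly 50%" guidance.
    # Requires the third-party psutil library; illustrative only, not an AKIPS tool.
    import psutil

    mem = psutil.virtual_memory()
    print(f"memory in use: {mem.percent:.1f}%")
    if mem.percent > 50:
        print("Consider assigning more RAM to the VM (aim for under 50% usage).")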
 
Poller graphs
  • Ping rate should be constant.
  • SNMP requests should start at second 5 and complete by second 45 (i.e. a 40 second polling window each minute).
  • Poller memory should be constant.
  • Poller CPU usage should be under 50%. In most cases it will be below 10%.
  • Poller Context Switches should be mostly Voluntary. If there are a lot of Involuntary context switches, then additional CPU cores may be required. High levels of involuntary context switches are a sign of process CPU contention (see the sketch below).
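
The sketch below shows one way to compute a voluntary vs involuntary ratio for a process using the third-party psutil library. The PID and the 20% threshold are assumptions for illustration; AKIPS presents the real figures under Admin -> Performance -> Poller.

    # Compute the involuntary context switch share for a process (illustrative only).
    # Requires the third-party psutil library; the PID and 20% threshold are assumptions.
    import psutil

    pid = 1234                                   # hypothetical poller PID
    ctx = psutil.Process(pid).num_ctx_switches()
    total = ctx.voluntary + ctx.involuntary
    ratio = ctx.involuntary / total if total else 0.0
    print(f"involuntary context switches: {ratio:.1%}")
    if ratio > 0.20:
        print("High involuntary share -- possible CPU contention; consider more dedicated cores.")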
 
Database graphs
  • Compression Runtime should be less than 20 minutes. The database compression works on 30 day data blocks. At the end of 30 days, a new block is created and the compression runtime drops. The compression runtime is usually CPU limited, so adding CPU cores should produce a roughly linear decrease in compression runtime, unless the limiting factor is the back-end storage speed.
  • Rotation Runtime should be less than Compression Runtime. The limiting factor will be the storage speed. A database file rotation occurs when a file becomes more than 1% fragmented (both thresholds are illustrated in the sketch below).
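
A simple numeric illustration of the two thresholds above (the 20-minute compression runtime and the 1% fragmentation trigger); the sample figures are hypothetical, and the real values come from the Admin -> Performance graphs.

    # Illustrates the database thresholds described above with hypothetical figures.
    COMPRESSION_LIMIT_MINUTES = 20
    FRAGMENTATION_LIMIT = 0.01                    # 1%

    def check_database(compression_runtime_minutes: float, fragmented_fraction: float) -> None:
        if compression_runtime_minutes > COMPRESSION_LIMIT_MINUTES:
            print("Compression runtime is high -- consider more CPU cores or faster storage.")
        if fragmented_fraction > FRAGMENTATION_LIMIT:
            print("File exceeds 1% fragmentation -- a rotation (repack) will be triggered.")

    check_database(compression_runtime_minutes=12.5, fragmented_fraction=0.004)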

CPU

General notes
  • The number of required CPU cores depends entirely on the size of your configuration (i.e. number of monitored devices, MIB objects, syslog/trap rate, NetFlow exporters and flows/sec).
  • Hyperthreading on modern Xeon and Core i3/i5/i7 CPUs works fine. Leave it turned on.
  • In a VM environment, always assign dedicated CPU cores. Do NOT over provision CPU cores. Over provisioning CPU cores will lead to significant pauses in real-time data polling and processing, and large jumps in time.
 
Number of CPU cores
In general, you want enough free CPU cores to handle user requests without noticeable delays. The Ping/SNMP poller is an extremely efficient single monolithic process, so it will only ever consume a portion of a single CPU core, even when monitoring 1 million+ interfaces. The poller context switch performance graphs (Admin -> Performance -> Poller) are a very good indicator of whether there are enough CPU cores. For smooth operation, you want mostly voluntary context switches, not involuntary ones.
 
CPU clock speed

Comparing raw CPU core clock speeds is fairly meaningless due to differences in core architectures (e.g. number of on-die cores, L1/L2/L3 cache sizes and speeds). AKIPS performs various CPU speed tests for gzip/md5/sha which can be viewed in the Admin -> System -> System Info menu.

The following are some examples:

Benchmark times (seconds):

CPU Model                   GZIP   MD5   SHA
Xeon E5-2683 v3 2.00GHz      1.7   2.9   3.6
Xeon E5-2660 2.20GHz         1.9   4.0   3.7
Xeon E5-2670 v3 2.30GHz      1.4   2.8   2.9
Xeon E5-2630L v2 2.40GHz     1.5   3.0   3.3
Xeon E5-2670 2.60GHz         1.2   2.7   3.3
Xeon E5-4650 2.70GHz         1.3   2.8   3.4
Xeon X5660 2.80GHz           2.6   3.7   4.8
Core i5-2500K 3.30GHz        1.1   2.4   2.9
Core i7-5820K 3.30GHz        1.1   2.2   2.3
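
For a rough comparison on your own hardware, the sketch below times gzip compression and MD5/SHA-256 hashing of an in-memory buffer using Python's standard library. It is not the benchmark AKIPS runs (the 64 MB buffer size is an assumption), so the absolute figures will not match the table above, but relative differences between CPUs are still indicative.

    # Rough single-core timing of gzip compression and MD5/SHA-256 hashing.
    # Not the AKIPS benchmark; the 64 MB random buffer is an assumption, so the
    # absolute numbers will not match the table above.
    import gzip, hashlib, os, time

    data = os.urandom(64 * 1024 * 1024)           # 64 MB of test data

    def timed(label, func):
        start = time.perf_counter()
        func()
        print(f"{label:<8} {time.perf_counter() - start:.2f} s")

    timed("gzip",   lambda: gzip.compress(data, compresslevel=6))
    timed("md5",    lambda: hashlib.md5(data).hexdigest())
    timed("sha256", lambda: hashlib.sha256(data).hexdigest())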

Memory

Memory speed is fairly critical for performance. The Admin -> System -> System Info menu will display the memory speed of your system. A value of 8 Gigabytes/sec or greater is recommended. Older/legacy systems appear to have poor memory speed (e.g. 5 Gigabytes/sec or less).
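
AKIPS reports its own memory speed figure under Admin -> System -> System Info; as a very crude stand-in, the sketch below times large in-memory buffer copies. The buffer size and repeat count are arbitrary assumptions, and Python overhead will understate true memory bandwidth, so treat the result as indicative only.

    # Crude memory-copy throughput estimate (illustrative only; the buffer size and
    # repeat count are arbitrary, and Python overhead understates true bandwidth).
    import time

    SIZE = 256 * 1024 * 1024                      # 256 MB buffer
    REPEATS = 5

    src = bytearray(SIZE)
    start = time.perf_counter()
    for _ in range(REPEATS):
        dst = bytes(src)                          # forces a full copy of the buffer
    elapsed = time.perf_counter() - start

    print(f"~{SIZE * REPEATS / 1e9 / elapsed:.1f} GB/s memory copy throughput")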

Over provisioning memory in a VMware VM works fine because AKIPS loads the necessary kernel module that performs memory ballooning. Memory ballooning allows the guest VM to gracefully hand back unused free memory to the host machine.

Storage

Storage size

UNIX file systems require plenty of spare space so they can write files out sequentially. In a VM it isn’t such an issue because increasing the storage size is trivial. When deploying on physical hardware, it’s best to install enough disk space up front for the entire life cycle of the box (e.g. several terabytes). Disks are cheap. Contact AKIPS support if unsure on disk space requirements.

 
Sequential read / write performance

Databases typically access storage in a random order, but AKIPS databases are arranged so that the majority of read/write I/O is performed sequentially. The large databases are repacked if they become more than 1% fragmented. Good sequential I/O performance is important in large installations.
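
A quick way to gauge sequential write performance of the volume that will hold the AKIPS databases is sketched below. The test path, file size and chunk size are illustrative assumptions; it is not an AKIPS tool.

    # Crude sequential-write throughput test (illustrative only). The path, file
    # size and chunk size are assumptions; run it on the volume that will hold
    # the AKIPS databases, and note it briefly writes 1 GB of data.
    import os, time

    PATH = "/tmp/seq_write_test"                  # hypothetical test file
    CHUNK = b"\0" * (8 * 1024 * 1024)             # 8 MB write chunks
    TOTAL = 1024 * 1024 * 1024                    # 1 GB total

    start = time.perf_counter()
    with open(PATH, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(CHUNK)
            written += len(CHUNK)
        f.flush()
        os.fsync(f.fileno())                      # ensure data reaches the disk
    elapsed = time.perf_counter() - start

    print(f"sequential write: {TOTAL / 1e6 / elapsed:.0f} MB/s")
    os.remove(PATH)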

 
Spindles vs SSD

A modern 2 Tbyte SATA disk typically delivers over 200 Mbytes/sec sequential transfer rates, whereas an SSD typically delivers around 400 Mbytes/sec reads but somewhat slower writes. SSDs use a copy-on-write mechanism in which every write operation must go to a zeroed disk block, and zeroing blocks is the painfully slow part. If no zeroed blocks are available for a write operation, write performance falls off a cliff while unused blocks are zeroed.

Having a large pool of pre-zeroed blocks greatly enhances consistent write performance. The SSD trim feature (turned on by default in AKIPS) allows the operating system to inform the SSD when a disk block can be zeroed. Some SSDs also have a hidden pool of pre-zeroed blocks.

 
DAS vs SAN vs NAS

AKIPS preferred order of storage types:

  • SAN
  • DAS RAID 10
  • DAS RAID 1
  • DAS RAID 0
  • DAS JBOD
  • NAS (thick provisioned)

DAS and SAN provide efficient “block level” storage to the operating system, whereas a NAS is just a “file store” accessed over 10G Ethernet/IP/NFS. A NAS will have significantly higher latency and fragmentation performance issues compared to a SAN/DAS.

 
Thick vs thin provisioning
  • Thick provisioning – storage is preallocated when created (preferred)
  • Thin provisioning – storage is allocated on-the-fly (slow, poor performance)

Do NOT use thin provisioned dynamically allocated storage. It ALWAYS leads to massive database performance problems due to fragmentation. AKIPS reads/writes large sequential database files and expects minimal underlying block level fragmentation and latency.

Using thin provisioned storage is also pointless because AKIPS uses a copy-on-write file system: all disk blocks on the virtual storage will quickly be allocated and consumed, but in a highly fragmented order.