Linux-Based Environment

Goal: identify the faulty server, find the root cause, gather evidence, and either resolve the issue or escalate it.

Assumptions

  • An incident has been declared and the server is suspected to be involved
  • The operations person has login credentials (SSH and password)
  • A ticketing or issue-tracking system exists (Jira, ServiceNow)

Phase 1: Preparation and Initial Assessment

Acknowledge the assigned ticket:

  • Take up the assigned ticket and read through the details
  • From this point on, document every action carried out, every command run, and its timestamp; this record feeds directly into the postmortem

Gather context:

  • Understand the issue presented or associated with the server in the ticket (e.g. "slow server", "API returning 500", "users unable to log in to the domain")
  • Understand the issue reported so you know what to tackle:
    • What errors are presented?
    • What services are hosted on the server?
    • Are any other services on the server experiencing the same issue?
  • Check monitoring dashboards before logging in (Datadog, Grafana, Nagios) and look for obvious anomalies:

    • High CPU/memory usage
    • Latency spikes
    • Error rate increase
    • Health check failures

Phase 2: Basic Connectivity & System Status

Connection test:

  • From your workstation or bastion host, attempt to ping the server
  • Observation: check for packet loss, latency, or reachability

Login attempt:

  • Try logging in using SSH or RDP
  • Observation: successful login? Slow login? Connection refused? Timeout? Authentication error?

Initial system overview once logged in:

  • Check system uptime: uptime
  • Observation: How long has the system been up? Has it rebooted recently? What is the load average?
  • Check logged-in users: who or w
  • Observation: any unexpected user or session?
  • Recent login history: last | head -n 20
  • Observation: check for any unusual logins around the time the issue started
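
A minimal connectivity sketch, assuming a hypothetical hostname (server01.example.com) and the default SSH port:

```bash
#!/usr/bin/env bash
HOST="server01.example.com"   # hypothetical host; replace with the suspect server

# Reachability: 5 ICMP probes; note packet loss and round-trip times
ping -c 5 "$HOST"

# TCP-level check of the SSH port, in case ICMP is filtered on the path
nc -zv -w 5 "$HOST" 22

# Login attempt with a short timeout so a hung sshd does not stall the runbook
ssh -o ConnectTimeout=10 "$HOST" 'uptime; who'
```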

Phase 3: Resource Utilisation Analysis

CPU usage:

  • Check overall and per-process CPU usage: top or htop (if installed)
  • Observation: is CPU usage close to 100%? Which processes are consuming the most CPU?
  • Run vmstat 1 5 for a short period to see trends
  • Observation: look at us, sy, id, and wa; a high wa (I/O wait time) means there is an I/O bottleneck
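
A hedged sketch for capturing CPU evidence non-interactively, which is handy for pasting into the ticket:

```bash
# One batch-mode snapshot of top, trimmed to the header and busiest processes
top -b -n 1 | head -n 20

# The same view via ps, sorted by CPU usage
ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 10
```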

Memory usage:

  • Check memory usage: free -h
  • Observation: is available memory very low? Is swap heavily used?
  • Use top or htop sorted by memory usage (press M in top; in htop press F6 and select PERCENT_MEM)
  • Observation: which processes are consuming a lot of memory?
  • Check OOM (out-of-memory) killer events: dmesg | grep -i oom-killer or journalctl | grep -i oom
  • Observation: has the kernel killed processes due to low memory?

Disk I/O:

  • Check disk utilisation and wait times: iostat -dxz 1 5 (reports disk status every second, five times)
  • Observation: look for high %util and high wait times; which devices are affected?
  • Check per-process I/O: iotop
  • Observation: check for the processes doing the most disk reads and writes
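
A short sketch for gathering memory evidence; the two-hour window is an assumption to adjust to the incident timeline:

```bash
# Current usage and the largest resident processes
free -h
ps -eo pid,user,pmem,rss,comm --sort=-rss | head -n 10

# Kernel-side OOM killer activity within the (assumed) incident window
journalctl -k --since "2 hours ago" | grep -iE "oom|out of memory"
```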

Disk space:

  • Check filesystem usage: df -h
  • Observation: are any critical mounts (/var, /tmp, /) full or almost full (>90-95%)?

Network I/O & connections:

  • Check network interface statistics: ip -s link show or netstat -i
  • Observation: look for high errors and dropped counts
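
If a mount is nearly full, a quick sketch to find the largest offenders, using /var as an assumed example:

```bash
# Confirm how full the mount is
df -h /var

# Largest directories under it; -x stays on this filesystem
du -xh /var 2>/dev/null | sort -rh | head -n 10
```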

  • Check network throughput per interface/process: iftop or nethogs
  • Observation: is a particular interface saturated? Is a process sending abnormally high traffic?
  • Listening ports and established connections: ss -tulnp or netstat -tulpn
  • Observation: are the expected services listening? Are connections in the expected states (ESTABLISHED, TIME_WAIT)? Any suspicious listening port?
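
A small sketch to summarise TCP connection states; a flood of TIME-WAIT or SYN-RECV entries is a useful signal. Port 443 below is an assumed example:

```bash
# Count connections per TCP state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Confirm the expected service is actually listening on its port
ss -tlnp | grep ':443'
```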

Phase 4: Application and Service-Level Checks

Service status:

  • Check the status of the primary apps/services running on the server
  • systemd: systemctl status [servicename]
  • SysVinit: service [servicename] status or /etc/init.d/[servicename] status
  • Observation: is the service active/running? Did the service exit with an error? Check the recent log lines shown by the status command (see the sketch after this list)
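
A hedged sketch for a systemd host, using nginx as an assumed service name:

```bash
SVC="nginx"   # assumed service name; substitute the service under investigation

# State, recent restarts, and the last few log lines in one view
systemctl status "$SVC" --no-pager

# Recent service-specific logs from the journal
journalctl -u "$SVC" --since "1 hour ago" --no-pager | tail -n 50
```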

Process Check:

  • Verify the application process is running: ps aux | grep [process-name] or pgrep -lf [process-name]
  • Observation: is the process running? Are there any zombie processes (state Z)?

Application logs:

  • Locate application-specific logs (/var/log/nginx/error.log, /var/log/app/app.log)
  • Tail the logs for live activity: tail -f /path/to/applogs
  • Search the logs around the time the error occurred: grep -iE "error|warn|fatal|exception|timeout" path/to/log
  • You can scope by time and error string: journalctl --since "1 hour ago" or journalctl --since "1 hour ago" | grep -i failed
  • Observation: are there error messages correlating with the incident start time?

Local functionality test:

  • If it is a web server, try accessing it locally: curl -v http://localhost:port/path
  • If it is a database, try connecting to it with a local client tool
  • Observation: does it work locally? Any slow response? Any errors? (see the sketch after this list)
  • This helps differentiate between an external network issue and an internal server issue
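
A hedged sketch combining the log search and a local functionality test; the log path, port 8080, and /health path are assumptions:

```bash
# Search application logs within the incident window (path is a placeholder)
grep -iE "error|warn|fatal|exception|timeout" /var/log/nginx/error.log | tail -n 50
journalctl --since "1 hour ago" | grep -i failed

# Local HTTP check with a timing breakdown; port and path are placeholders
curl -o /dev/null -s \
  -w 'http: %{http_code}  connect: %{time_connect}s  total: %{time_total}s\n' \
  http://localhost:8080/health
```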

Phase 5: Deeper System & Configuration Checks

Kernel/system log deep dive:

  • Go back to /var/log/messages, /var/log/syslog, or journalctl
  • Look at the timeframe of the incident
  • Search for specific errors related to hardware, storage, network, and kernel modules

Configuration changes:

  • Check whether relevant configuration files in /etc/ have been changed
  • If /etc/ is under git, check the version-control history

Hardware health:

  • Check disk health if suspected: smartctl -a /dev/sdX (requires smartmontools)
  • Check for hardware errors reported by the system (dmesg, vendor-specific tools)
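
A sketch for spotting recent configuration changes and getting a basic disk-health verdict; the two-day window and /dev/sda are assumptions:

```bash
# Config files modified in the last 2 days (window is an assumption)
find /etc -type f -mtime -2 -ls

# If /etc is tracked in git, review recent commits
git -C /etc log --oneline -n 10

# Quick SMART health verdict (requires smartmontools; device is a placeholder)
smartctl -H /dev/sda
```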

Phase 6: Escalation and Resolution

Synthesize findings:

  • Review all the findings gathered
  • Formulate a hypothesis about the root cause (e.g. a network connectivity failure led to the database outage)

Attempt remediation:

  • If the main cause of the issue is clear and within your scope, fix it (e.g. rebooting the server, clearing temporary files)
  • It is advisable to document any change before making it
  • Verify that the fix resolved the issue, and monitor closely

Escalate if necessary:

  • If the issue is beyond your scope, this is the time to escalate it to a manager or another team (e.g. developers, the networking team)
  • Escalation may also be needed when the fix requires a reboot that would affect other services on the same server; such actions may require authorisation
  • When escalating, make sure you have gathered enough data (evidence) to point out the exact issue with the server; do not escalate with a vague diagnosis like "server X is broken" (a sketch for bundling evidence follows this list)
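
A hedged sketch for bundling evidence to attach to the ticket before escalating; the ticket ID and output path are placeholders:

```bash
TICKET="INC-12345"   # hypothetical ticket ID
OUT="/tmp/${TICKET}-$(hostname)-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$OUT"

# Snapshot the key findings gathered during the earlier phases
uptime              > "$OUT/uptime.txt"
free -h             > "$OUT/memory.txt"
df -h               > "$OUT/disk.txt"
ss -tulnp           > "$OUT/sockets.txt"
dmesg | tail -n 200 > "$OUT/dmesg.txt"

tar -czf "$OUT.tar.gz" -C "$(dirname "$OUT")" "$(basename "$OUT")"
echo "Attach $OUT.tar.gz to $TICKET"
```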

Reference: Security Incident Survey Cheat Sheet for Server Administrators