Python Script to Parse Nginx Log Files

In this article, we will learn how to efficiently parse Nginx logs using a Python script. Nginx is a widely used web server known for its performance and reliability. Its logs can provide valuable insights into web traffic and server performance, making them a crucial resource for system administrators and developers. We will explore a Python script that processes Nginx log files, extracting key information and writing it to a CSV file for further analysis.

Prerequisites

  • AWS Account with Ubuntu 24.04 LTS EC2 Instance.
  • Basic knowledge of Python.

Step #1: Installing Python and Nginx

First, update the package list.

sudo apt update

Then install Python and pip.

sudo apt install python3 python3-pip
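
Optionally, confirm the installed versions:

python3 --version
pip3 --version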

Install Nginx on the system.

sudo apt install nginx

Verify the Nginx service status to confirm it is running properly.

sudo systemctl status nginx
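
Optionally, generate a few log entries so the parser has data to work with. Hitting the default Nginx page with curl appends lines to /var/log/nginx/access.log:

curl -s http://localhost/ > /dev/null
tail -n 5 /var/log/nginx/access.log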

Step #2: Write a Python Script to Parse Nginx Logs

First, create a file for the script.

nano parse_nginx_logs.py 

Below is a Python script to parse Nginx logs. This script reads log files, extracts relevant fields, and writes the information to a CSV file.

import argparse
import csv
import re
import glob
from datetime import datetime
import os
import gzip

# Define a regular expression pattern to match a line in an nginx log file
line_format = re.compile(r'(\S+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)"')
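# Example of a matching line (nginx "combined" log format):
# 203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"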

# Define a function to format bytes as a string with a unit
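# For example, format_bytes(1536) returns "1.50 KB".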
def format_bytes(num_bytes):
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if num_bytes < 1024.0:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024.0
    # Fall through for values of one petabyte or more
    return f"{num_bytes:.2f} PB"

# Define a function to process nginx log files
def process_logs(log_path, output_file):

    # Check if the log_path is a directory or a file, and get a list of files to process
    if os.path.isdir(log_path):
        files = glob.glob(log_path + "/*.gz") + glob.glob(log_path + "/*.log")
    elif os.path.isfile(log_path):
        files = [log_path]
    else:
        print("Invalid log path")
        return

    # Define some variables to store summary statistics
    ip_counts = {}
    status_counts = {}
    status403_ips = {}
    referrer_counts = {}
    bytes_sent_total = 0

    # Open the output file and write the header row
    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["IP", "Timestamp", "Method", "URL", "Status", "Bytes Sent", "Referrer", "User Agent"])

        # Loop over each file to process
        for filename in files:

            # Check if the file is gzipped or not, and open it accordingly
            if filename.endswith('.gz'):
                open_fn = gzip.open
            else:
                open_fn = open

            # Open the file and loop over each line
            with open_fn(filename, 'rt', encoding='utf-8') as file:
                for line in file:

                    # Match the line against the regular expression pattern
                    match = line_format.match(line.strip())

                    # If there is a match, extract the relevant fields and write them to the output file
                    if match:
                        ip, date_str, request, status, bytes_sent, referrer, user_agent = match.groups()
                        dt = datetime.strptime(date_str, '%d/%b/%Y:%H:%M:%S %z')
                        try:
                            method, url = request.split()[0], " ".join(request.split()[1:])
                        except IndexError:
                            method, url = request, ''
                        writer.writerow([ip, dt, method, url, status, bytes_sent, referrer, user_agent])

                        # Update summary statistics
                        ip_counts[ip] = ip_counts.get(ip, 0) + 1
                        status_counts[status] = status_counts.get(status, 0) + 1
                        bytes_sent_total += int(bytes_sent)
                        if status == '403':
                            status403_ips[ip] = status403_ips.get(ip, 0) + 1
                        referrer_counts[referrer] = referrer_counts.get(referrer, 0) + 1


    # Print summary stats
    print("\033[1m\033[91mTotal number of log entries:\033[0m", sum(ip_counts.values()))
    print("\033[1m\033[91mNumber of unique IP addresses:\033[0m", len(ip_counts))
    print("\033[1m\033[91mNumber of unique status codes:\033[0m", len(status_counts))
    print("\033[1m\033[91mBytes sent in total:\033[0m", format_bytes(bytes_sent_total))
    print("\033[1m\033[91mTop 10 IP addresses by request count:\033[0m")
    for ip, count in sorted(ip_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"{ip}: {count}")
    print("\033[1m\033[91mTop 10 status codes by count:\033[0m")
    for status, count in sorted(status_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"{status}: {count}")
    print("\033[1m\033[91mTop 20 IP addresses with status code 403:\033[0m")
    for ip, count in sorted(status403_ips.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"{ip}: {count}")
    print("\033[1m\033[91mTop 10 referrers by count:\033[0m")
    for referrer, count in sorted(referrer_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"{referrer}: {count}")   


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Process nginx log files.')
    # Both arguments are required; without them the script would crash on a None path
    parser.add_argument('--log_path', metavar='LOG_PATH', type=str, required=True,
                        help='Path to the log file or directory')
    parser.add_argument('--output_file', metavar='OUTPUT_FILE', type=str, required=True,
                        help='Path to the output CSV file')
    args = parser.parse_args()
    process_logs(args.log_path, args.output_file)

Save the file and exit nano (Ctrl + X, then Y, then Enter).

Execute the script with the desired log path and output file location. On Ubuntu, the log files under /var/log/nginx are typically readable only by root and the adm group, so you may need to run the command with sudo.

python3 parse_nginx_logs.py --log_path /var/log/nginx --output_file output.csv
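
To confirm the CSV was written, preview its first line:

head -n 1 output.csv

This should print the header row the script writes: IP,Timestamp,Method,URL,Status,Bytes Sent,Referrer,User Agent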

How the Code Works:

The script parses Nginx logs through the following steps:

  1. Define the Log Format: It uses a regular expression to match the standard Nginx log format, capturing fields like IP address, timestamp, HTTP request, status code, bytes sent, referrer, and user agent.
  2. Process and Parse Nginx Logs: The process_logs function checks if the provided log path is a file or directory. It then processes each file, including gzipped files, extracts relevant information using the regular expression, and writes the data to a CSV file.
  3. Generate Summary Statistics: The script also calculates and prints summary statistics such as the total number of log entries, unique IP addresses, unique status codes, total bytes sent, and top entries by request count and status code.
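
Once the CSV exists, it can be loaded back into Python for further analysis. Below is a minimal sketch, assuming the script was run as above and produced output.csv in the current directory; the column names match the header row the script writes.

import csv
from collections import Counter

# Count requests per URL using the CSV produced by parse_nginx_logs.py
url_counts = Counter()
with open('output.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        url_counts[row['URL']] += 1

# Print the five most requested URLs
for url, count in url_counts.most_common(5):
    print(f"{url}: {count}")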

Conclusion

Parsing Nginx logs with a Python script can provide valuable insights into web traffic and server performance. This script simplifies the process by extracting the relevant fields from Nginx logs and writing them to a CSV file for easy analysis. With a few steps, you can efficiently analyze your web server logs and gain a deeper understanding of your traffic and server behavior.
