# Data Analysis with Nginx + Fluentd + BigQuery on QueryPie

Both web and app services track on-page data through Google Analytics for site-visit and user-behavior analysis, but it is difficult to expect 100% accuracy from that data. Try googling 'Why isn't Google Analytics accurate?' and you'll find a variety of reasons. Fundamentally, GA works by embedding JavaScript code in the page, so depending on the web browser environment, pages may fail to be tagged properly and session times can't be measured.

In such cases, to analyze page views and user agents more accurately, you can use the web server's log, which records every request. But web server logs are generally stored as files, making them difficult to analyze with a data analysis language such as SQL.

This article will go over how to use Fluentd to send web server logs to BigQuery, and how to use SQL to analyze those logs. Our test environment uses td-agent 3 (the packaged distribution of Fluentd) on CentOS 7.

💬 What is Fluentd?

(Architecture overview: https://www.fluentd.org/architecture)

Fluentd is an open-source data collector / log aggregator that can ingest a wide variety of log data sources, with the aim of unifying logs collected from many sources and systems into a single logging layer. Fluentd also has an output plugin that lets you use BigQuery as the destination for collected logs, so many servers can load logs into BigQuery practically in real time.

Okay, let's start with the installation.

💻 Send Nginx logs to Fluentd

1. Install Fluentd for Log Transfer

# sudo curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent3.sh | sh

2. Run Fluentd

# sudo systemctl start td-agent.service
# sudo systemctl status td-agent.service

If the service started correctly, the status command should report it as active (running).

3. Install the Fluentd BigQuery Plugin

# sudo td-agent-gem install fluent-plugin-bigquery
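
You can confirm the plugin was installed by listing it with td-agent's bundled gem command:

# sudo td-agent-gem list fluent-plugin-bigquery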

4. Create a Dataset in BigQuery

Click CREATE DATASET at the bottom of the BigQuery Console.
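
If you prefer the command line, the bq CLI (installed with the Google Cloud SDK) can create the dataset as well; my_project and nginx_logs below are example names:

# bq mk --dataset my_project:nginx_logs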

5. Create a Table

Select the dataset you just created (found in the navigation window on the left) and click the CREATE TABLE button. Select Empty table from the drop-down menu, enter your desired table name (in our case, access_log), and make sure Edit as Text is on before pasting the schema below.

[
	{
		"name": "agent",
		"type": "STRING"
	},
	{
		"name": "code",
		"type": "STRING"
	},
	{
		"name": "host",
		"type": "STRING"
	},
	{
		"name": "method",
		"type": "STRING"
	},
	{
		"name": "path",
		"type": "STRING"
	},
	{
		"name": "referer",
		"type": "STRING"
	},
	{
		"name": "size",
		"type": "INTEGER"
	},
	{
		"name": "user",
		"type": "STRING"
	},
	{
		"name": "time",
		"type": "INTEGER"
	}
]
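
As with the dataset, the table can also be created with the bq CLI. Save the schema above to a file (access_log_schema.json here is an example name) and pass it to bq mk:

# bq mk --table my_project:nginx_logs.access_log ./access_log_schema.json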

6. Set Up TD-Agent

Delete all the contents of /etc/td-agent/td-agent.conf and paste the following:

<source>
  @type tail
  tag accesslog
  path {NGINX_ACCESS_LOG_PATH}
  pos_file /var/log/td-agent/nginx.pos
  # The apache2 parser matches Nginx's default combined-style log format.
  <parse>
    @type apache2
  </parse>
</source>

<match accesslog>
  @type bigquery_insert
  auth_method json_key
  json_key {GOOGLE SERVICE ACCOUNT JSON KEY}

  project {PROJECT_ID}
  dataset {DATA_SET ID}
  table {TABLE NAME}
  fetch_schema true

  # Inject the event time into each record as a unix timestamp,
  # matching the INTEGER "time" column in the table schema.
  <inject>
    time_key time
    time_type unixtime
  </inject>

  # Buffer records on disk before they are streamed to BigQuery.
  <buffer>
    @type file
    path /var/log/td-agent/buffer/bigquery
  </buffer>
</match>

Replace the placeholders below with your own values:

{NGINX_ACCESS_LOG_PATH} : Nginx access log file path
{GOOGLE SERVICE ACCOUNT JSON KEY} : path to the Google Service Account JSON key file
{PROJECT_ID} : Google Cloud project ID
{DATA_SET ID} : BigQuery dataset ID
{TABLE NAME} : BigQuery table name
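
Once the placeholders are filled in, you can sanity-check the configuration before restarting; td-agent accepts fluentd's --dry-run flag, which parses the config without starting the agent:

# sudo td-agent --dry-run -c /etc/td-agent/td-agent.conf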

Make sure Nginx writes its access log in the default combined-style format below, which is what the apache2 parser expects:

   log_format fluentd '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

7. Restart TD-Agent

# sudo systemctl restart td-agent

8. Check Log
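
td-agent writes its own runtime log to /var/log/td-agent/td-agent.log by default; tail it to confirm records are being read and sent:

# sudo tail -f /var/log/td-agent/td-agent.log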

If you run into an error like unexpected error error_class=Errno::EACCES error="Permission denied @ rb_file_s_stat", chances are it is caused by td-agent lacking read access to the Nginx log files. In this particular case, try changing the user that systemd runs td-agent as to root.

# sudo vi /lib/systemd/system/td-agent.service

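Inside the unit file's [Service] section, change the user and group the service runs as (the packaged td-agent 3 unit sets these to td-agent by default; running as root is the quick fix, though granting the td-agent user read access to the log directory is the more conservative option):

[Service]
User=root
Group=root

Then reload systemd and restart the service:

# sudo systemctl daemon-reload
# sudo systemctl restart td-agent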

✔ Check logs in BigQuery

Once you have confirmed that logs are accumulating, let's use a few simple queries to check how many requests came in per path, agent, and host, and the average HTTP response size.
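
For example, a query like the following (a sketch; the bare table name assumes your dataset is selected as the default, as in the QueryPie queries later in this article) returns the request count and average response size per host:

-- Requests and average response body size per client host
SELECT host, COUNT(*) AS requests, AVG(size) AS avg_response_bytes
FROM access_log
GROUP BY host
ORDER BY requests DESC;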

The logs are piling up nicely here. But viewing all of the query results at once is a bit inconvenient in the web UI.

As you can see, the queries execute normally, but each result must be opened individually by clicking VIEW RESULTS; there is no way to view the data all at once. This is the best the BigQuery web UI can do.

Thankfully, we can analyze BigQuery data through other SQL tools to view the results more comfortably.

📈 Analyze Log Data in QueryPie

Let's try pulling up the data in QueryPie, where the UI is designed to allow easy viewing of multiple query results at the same time.

When we run the same queries in QueryPie, we can view the results of all of them simultaneously and arrange them however we prefer (side by side or stacked).

-- Group by path
SELECT path, COUNT(*) AS count FROM access_log
GROUP BY path
ORDER BY COUNT(*) DESC;

-- Group by user agent
SELECT agent, COUNT(*) AS count FROM access_log
GROUP BY agent
ORDER BY COUNT(*) DESC;

It's also very easy to write queries quickly, because table structures and data are visible at a glance in the object panel. According to Google Cloud, real-time log analysis using Nginx + Fluentd + BigQuery is a useful way to track large-scale logs and debug problems. Once the end-to-end pipeline is deployed, however, it is difficult to review the results within the BigQuery platform itself. Using an SQL tool like QueryPie makes the final analysis phase faster and easier.

📌 Download QueryPie: https://www.querypie.com/download
📌 How to connect BigQuery on QueryPie: https://blog.querypie.com/how-to-connect-to-snowflake-bigquery-and-redshift-via-querypie/