How we migrated our batch-processing log pipeline to a near-real-time streaming system
I’m Manh, a Data Engineer on the Data Platform Team at Classi, a company providing an education service for teachers and students in Japan.
At Classi, we developed a product called “Seneka,” which delivers logs output by services on AWS to BigQuery, supporting internal developers and analysts.
Previously, log delivery was done through batch processing, but this summer we took on the technical challenge of moving to near-real-time processing. In this article, we introduce the architecture before and after the transition, as well as the key points we focused on during the migration.
Seneka’s Old Architecture: Batch Processing
Before discussing the improvements, let’s briefly introduce the previous architecture:
- Logs from applications or load balancers were uploaded to S3.
- A Cloud Composer DAG, which ran hourly, executed the following tasks:
  - Used Storage Transfer Service to transfer data from S3 to Google Cloud Storage (GCS).
  - Inserted data from GCS into a temporary BigQuery table using the Storage Write API.
  - Merged the temporary table data into the data lake table (a sketch of this merge step follows this list).
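To make the batch flow more concrete, here is a minimal sketch of that final merge step, written with the Go BigQuery client. The project, dataset, table, and column names are hypothetical, and the real pipeline runs this as a Cloud Composer DAG task rather than as a standalone program.

```go
// A minimal sketch of the hourly merge step, assuming the Go BigQuery client.
// The project, dataset, table, and column names (my-gcp-project, datalake,
// raw_logs_tmp, raw_logs, log_id, ...) are hypothetical.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Merge the hourly temporary table into the data lake table, keyed on a
	// log identifier so that re-running the batch does not duplicate rows.
	q := client.Query(`
		MERGE datalake.raw_logs AS t
		USING datalake.raw_logs_tmp AS s
		ON t.log_id = s.log_id
		WHEN NOT MATCHED THEN
		  INSERT (log_id, payload, logged_at)
		  VALUES (s.log_id, s.payload, s.logged_at)`)

	job, err := q.Run(ctx)
	if err != nil {
		log.Fatalf("query.Run: %v", err)
	}
	status, err := job.Wait(ctx)
	if err != nil {
		log.Fatalf("job.Wait: %v", err)
	}
	if err := status.Err(); err != nil {
		log.Fatalf("merge job failed: %v", err)
	}
}
```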
This architecture wasn’t entirely bad, but because the process ran hourly, developers had to wait up to 60 minutes for logs to appear in BigQuery. Increasing the batch frequency would have reduced the delay, but the existing structure reprocessed some files unnecessarily, wasting both cost and execution time. Given these limitations, we decided it was worth taking on the challenge of moving to an event-driven architecture.
Seneka’s New Architecture: Event-Driven Processing
The new architecture looks like this:
- Logs from applications or load balancers are uploaded to S3.
- When a file is placed in S3, a PutObject event is triggered, and a notification is sent to SQS.
- On the GCP side, a Storage Transfer Service job is triggered by the SQS message and copies the file to GCS.
- When the file is written to GCS, an object finalized event is published to Pub/Sub.
- A Cloud Function is triggered by the Pub/Sub message (a minimal sketch of such a handler follows this list).
- The method of inserting data into BigQuery remains the same as before.
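To illustrate the event-driven path, here is a minimal sketch of a Pub/Sub-triggered Cloud Function in Go: it parses the GCS object notification and hands the file to BigQuery. The project, dataset, and table names are hypothetical, and for brevity the sketch uses a BigQuery load job, whereas our actual function keeps using the Storage Write API as noted above.

```go
// Package seneka sketches the Pub/Sub-triggered Cloud Function in the new
// event-driven flow. The GCS "object finalized" notification arrives as JSON
// in the Pub/Sub message body; the handler reads the bucket and object name
// and hands the file to BigQuery. Project, dataset, and table names are
// hypothetical, and a load job stands in for the Storage Write API path.
package seneka

import (
	"context"
	"encoding/json"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// PubSubMessage is the payload delivered to a background Cloud Function.
type PubSubMessage struct {
	Data []byte `json:"data"`
}

// gcsEvent holds the fields we need from the GCS object notification.
type gcsEvent struct {
	Bucket string `json:"bucket"`
	Name   string `json:"name"`
}

// LoadLogObject is the Cloud Function entry point.
func LoadLogObject(ctx context.Context, m PubSubMessage) error {
	var ev gcsEvent
	if err := json.Unmarshal(m.Data, &ev); err != nil {
		return fmt.Errorf("unmarshal notification: %w", err)
	}

	client, err := bigquery.NewClient(ctx, "my-gcp-project")
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %w", err)
	}
	defer client.Close()

	// Load the newly finalized object straight into the data lake table,
	// assuming newline-delimited JSON logs.
	gcsRef := bigquery.NewGCSReference(fmt.Sprintf("gs://%s/%s", ev.Bucket, ev.Name))
	gcsRef.SourceFormat = bigquery.JSON

	loader := client.Dataset("datalake").Table("raw_logs").LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteAppend

	job, err := loader.Run(ctx)
	if err != nil {
		return fmt.Errorf("loader.Run: %w", err)
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return fmt.Errorf("job.Wait: %w", err)
	}
	return status.Err()
}
```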
Results
In the old architecture, it took up to 60 minutes for logs to become queryable in BigQuery; with the new architecture, this is reduced to just a few minutes. We measured the time it took for logs from an ELB to be delivered to BigQuery and plotted the results as a histogram.
As the histogram shows, logs are now delivered within at most six minutes. Although we didn’t examine every log source, we observed similar results for other logs, including those from ECS services. The event-driven approach also did not increase costs; in fact, we reduced costs by about 10%.
Key Points We Focused On During the Migration
We took the following approaches to ensure a smooth migration:
Shared Understanding through Design Documents
At Classi, we have a culture of writing design documents.
Before implementing the new system, we conducted thorough research and designed the system based on those findings. We clearly outlined the pros and cons of our proposals and shared our thoughts within the team before finalizing the approach.
We considered the following factors when selecting technologies:
- Service Quotas and Limits: Cloud services have quotas and limits, so we confirmed that the services we chose would not run into them even as our log volume grows.
- Service Cost Structures and Free Tiers: This is a crucial factor. For example, Cloud Functions provides 2 million free requests per month, which made it a cost-effective choice for our event-driven architecture.
Gradual Migration
Migrating everything at once was risky, so we took a gradual approach.
We first applied the new architecture to some buckets in the staging environment to validate the impact. During testing, we noticed that BigQuery’s scan costs had increased significantly, so we carefully analyzed the cause.
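As an illustration of this kind of analysis (not the exact investigation we ran), billed bytes can be broken down per query with BigQuery’s INFORMATION_SCHEMA.JOBS_BY_PROJECT view; the region qualifier, look-back window, and project ID below are examples.

```go
// A sketch of a per-query scan-cost breakdown using BigQuery's
// INFORMATION_SCHEMA.JOBS_BY_PROJECT view. The region qualifier, look-back
// window, and project ID are examples only.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Sum billed bytes per query text over the last 7 days to find the
	// statements responsible for the increased scan cost.
	q := client.Query(
		"SELECT query, COUNT(*) AS runs, SUM(IFNULL(total_bytes_billed, 0)) AS bytes_billed\n" +
			"FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT\n" +
			"WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)\n" +
			"  AND job_type = 'QUERY'\n" +
			"GROUP BY query\n" +
			"ORDER BY bytes_billed DESC\n" +
			"LIMIT 20")

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatalf("query.Read: %v", err)
	}
	for {
		var row struct {
			Query       string `bigquery:"query"`
			Runs        int64  `bigquery:"runs"`
			BytesBilled int64  `bigquery:"bytes_billed"`
		}
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatalf("iterator.Next: %v", err)
		}
		fmt.Printf("%6d runs  %15d bytes billed  %.80s\n", row.Runs, row.BytesBilled, row.Query)
	}
}
```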
Cost Tuning
After deploying to the production environment, we encountered an issue where log generation frequency was higher than expected, leading to increased costs.
Through analysis, we identified three main problems:
- Frequent Retries due to Insufficient Scaling: The initial Cloud Functions scaling limit was exceeded, causing retries and increased costs. We resolved this by increasing the max number of instances and the concurrency per instance.
- Increased Class A Operations in GCS: Our code used the listObjects method, which generated unnecessary Class A operations. We replaced it with getObject and reduced costs (see the sketch after this list).
- Increased Costs from the Data Lineage API: We disabled the Data Lineage API, as it was not needed in this context, reducing costs further.
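For the Class A operations point, the fix boils down to reading the object named in the notification directly instead of listing a prefix to find it. The sketch below assumes the standard Go GCS client (cloud.google.com/go/storage); the listObjects/getObject names above suggest a thin wrapper in our code, so treat the exact calls here as illustrative.

```go
// A sketch of the Class A operation fix, assuming the standard Go GCS client.
// Listing objects under a prefix is a Class A operation on every invocation;
// reading the object named in the notification is a cheaper Class B GET.
package seneka

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/storage"
)

// readFinalizedObject reads the object that triggered the notification.
func readFinalizedObject(ctx context.Context, bucket, name string) ([]byte, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, fmt.Errorf("storage.NewClient: %w", err)
	}
	defer client.Close()

	// Before: client.Bucket(bucket).Objects(ctx, &storage.Query{Prefix: ...})
	// listed objects (Class A) just to locate the file.
	// After: read the named object directly (Class B).
	r, err := client.Bucket(bucket).Object(name).NewReader(ctx)
	if err != nil {
		return nil, fmt.Errorf("NewReader: %w", err)
	}
	defer r.Close()

	return io.ReadAll(r)
}
```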
Fixing Memory Leaks
Four days after the release, we encountered an “Out Of Memory” error. After investigating with Go’s pprof, we found that the BigQuery client’s underlying gRPC connections were not being released properly, causing a memory leak. We fixed this issue, and the system stabilized.
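One common way to fix this kind of leak, sketched below under the assumption that the function uses the Go BigQuery client, is to create the client once per function instance and reuse it across invocations instead of creating a new one on every request.

```go
// A minimal sketch of the fix, assuming the Go BigQuery client
// (cloud.google.com/go/bigquery): each client holds gRPC connections, so
// creating one per invocation without calling Close leaks connections until
// the instance runs out of memory. The project ID is hypothetical.
package seneka

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

// bqClient is created once when the function instance starts and is reused
// across invocations; alternatively, a per-invocation client must be released
// with `defer client.Close()`.
var bqClient *bigquery.Client

func init() {
	var err error
	bqClient, err = bigquery.NewClient(context.Background(), "my-gcp-project")
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
}
```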
Conclusion
In this article, we discussed how we used event-driven processing to significantly improve log delivery speed. There are many solutions available for real-time log analysis, and in the future we may explore services like BigQuery Omni, which allow querying data in place without transferring it.
Overall, this project reduced the time for logs to appear in BigQuery from up to an hour to a few minutes, while keeping costs under control.