Introduction
Serverless computing has gained significant popularity due to its scalability, cost-effectiveness, and ease of deployment. As data volumes grow exponentially, organizations face the challenge of processing and analyzing that data efficiently in serverless environments. Data pipelines play a crucial role in managing and transforming data within serverless architectures.
This comprehensive article provides a detailed taxonomy of data pipeline approaches in serverless computing. We classify these approaches into three primary categories: machine learning-based, heuristic-based, and framework-based. For each approach, we analyze its advantages and disadvantages and survey representative case studies, performance metrics, and evaluation tools. Additionally, we discuss open issues and future research directions in this rapidly evolving field.
Serverless Computing and Data Pipelines
Serverless computing, commonly offered as Function-as-a-Service (FaaS), allows developers to focus on writing code without having to manage or provision servers. This approach provides an efficient and scalable model for running applications. In a serverless architecture, applications are broken down into smaller, independent functions or microservices that are event-driven and executed in ephemeral containers.
Data pipelines play a crucial role in contemporary data-driven environments, serving as the foundation for processing, analysis, and decision-making activities. These pipelines facilitate the seamless and dependable movement of data from diverse sources to designated systems, enabling efficient data processing. In a serverless environment, data pipelines can be designed as a series of functions triggered by events like new data arriving in a storage system.
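To make this concrete, the following is a minimal sketch of one such event-driven pipeline stage, assuming AWS Lambda with an S3 trigger. The handler fires when a new object arrives in a source bucket, applies a placeholder transformation, and writes the result onward; the destination bucket name and the transformation itself are illustrative assumptions, not taken from any surveyed system.

```python
import json
import boto3

s3 = boto3.client("s3")  # created once, reused across warm invocations

DEST_BUCKET = "pipeline-transformed"  # illustrative destination bucket


def handler(event, context):
    """Pipeline stage triggered by an S3 ObjectCreated event."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Extract: read the newly arrived object.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)

        # Transform: placeholder normalization step.
        transformed = [{k.lower(): v for k, v in row.items()} for row in rows]

        # Load: write the result where the next stage's trigger can pick it up.
        s3.put_object(
            Bucket=DEST_BUCKET,
            Key=key,
            Body=json.dumps(transformed).encode("utf-8"),
        )
```

Chaining stages this way keeps each function small and independently scalable: every stage scales with its own event rate rather than with the pipeline as a whole.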
Taxonomy of Data Pipeline Approaches in Serverless Computing
Based on the review of the literature, we have identified three primary approaches for implementing data pipelines in serverless computing environments:
Machine Learning-based Approaches
These approaches leverage machine learning techniques to optimize the performance and efficiency of data pipelines in serverless computing. The machine learning-based approaches can be further categorized into:
Predictive: Utilizes machine learning models to predict and optimize pipeline performance (a minimal sketch of this idea follows the list).
Adaptive: Employs machine learning techniques to dynamically adapt the pipeline based on evolving data patterns.
Unsupervised Learning: Identifies patterns and relationships in the data to automate pipeline tasks like data selection, feature engineering, or anomaly detection.
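To make the predictive category concrete, here is a minimal sketch, not drawn from any surveyed system, of a regression model that predicts a function's runtime from its input size and memory allocation and then chooses the memory configuration with the lowest predicted cost. The profiling data and the per-GB-ms price are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical profiling data: (payload size in MB, memory in MB) -> runtime in ms.
X = np.array([[10, 512], [10, 1024], [50, 512], [50, 1024], [100, 1024], [100, 2048]])
y = np.array([800, 450, 3200, 1700, 3400, 1900])

model = LinearRegression().fit(X, y)


def pick_memory(payload_mb, candidates=(512, 1024, 2048), price_per_gb_ms=1.67e-8):
    """Choose the memory size with the lowest predicted cost (runtime x GB x price)."""
    def cost(mem):
        runtime_ms = float(model.predict([[payload_mb, mem]])[0])
        return runtime_ms * (mem / 1024) * price_per_gb_ms
    return min(candidates, key=cost)


print(pick_memory(75))  # memory size chosen under the fitted model
```

The same pattern generalizes to predicting throughput or cold-start latency and feeding the prediction into a scheduler or configuration tool.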
Heuristic-based Approaches
These approaches rely on predefined rules or thresholds to manage and orchestrate the data pipeline. The heuristic-based approaches can be further categorized into:
Rule-based: Uses predefined rules or heuristics to manage and orchestrate the data pipeline.
Threshold-based: Leverages thresholds or triggers to determine when to scale or modify the pipeline (a sketch follows the list).
Expert Systems: Captures human expertise and decision-making processes into rules or models to automate specific pipeline tasks.
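The threshold-based idea can be sketched in a few lines, with invented queue-depth thresholds: a controller reads a backlog metric and maps it to a desired number of concurrent consumer functions.

```python
# Illustrative scaling rules: map queue depth to a target number of
# concurrent consumer functions; the first matching threshold wins.
SCALE_RULES = [
    (10_000, 100),  # more than 10k queued items -> 100 consumers
    (1_000, 20),
    (100, 5),
    (0, 1),
]


def desired_concurrency(queue_depth: int) -> int:
    """Return the target concurrency for the first threshold the depth exceeds."""
    for threshold, concurrency in SCALE_RULES:
        if queue_depth > threshold:
            return concurrency
    return 0  # nothing queued: scale to zero


assert desired_concurrency(2_500) == 20
assert desired_concurrency(0) == 0
```

In practice the same rule shape is often expressed through platform features such as reserved concurrency limits or queue-based autoscaling policies rather than hand-rolled code.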
Framework-based Approaches
These approaches utilize serverless data pipeline frameworks or big data processing frameworks to orchestrate the pipeline. The framework-based approaches can be further categorized into:
Serverless Data Pipeline Frameworks:
Event-driven: Orchestrates the pipeline based on event triggers (e.g., AWS Lambda, Google Cloud Functions).
Workflow-based: Provides a workflow engine to define and manage the pipeline (e.g., AWS Step Functions, Azure Durable Functions); see the sketch after this list.
Big Data Processing Frameworks:
Batch processing: Leverages batch-oriented big data frameworks (e.g., Apache Spark, Apache Flink).
Stream processing: Utilizes stream-oriented big data frameworks (e.g., Apache Kafka, Apache Storm).
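As an example of the workflow-based style, the following is a minimal Amazon States Language definition for a three-stage extract-transform-load pipeline, expressed as a Python dictionary. The state names and function ARNs are placeholders, not a definitive implementation.

```python
import json

# Placeholder ARN template for the functions implementing each stage.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:{}"

state_machine = {
    "Comment": "Minimal ETL pipeline sketch",
    "StartAt": "Extract",
    "States": {
        "Extract": {"Type": "Task", "Resource": ARN.format("extract"), "Next": "Transform"},
        "Transform": {"Type": "Task", "Resource": ARN.format("transform"), "Next": "Load"},
        "Load": {"Type": "Task", "Resource": ARN.format("load"), "End": True},
    },
}

print(json.dumps(state_machine, indent=2))
```

The workflow engine then handles sequencing, retries, and state passing between stages, which is exactly the orchestration burden the event-driven style leaves to the application.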
Machine Learning-based Approaches
Machine learning-based approaches in data pipelines for serverless computing involve incorporating machine learning algorithms and techniques within the data processing and transformation stages of the pipeline. These approaches leverage serverless computing capabilities to perform machine learning tasks efficiently and effectively.
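A pattern common to many of these systems is embedding model inference directly in a pipeline stage. The sketch below, with an assumed model artifact location, loads a pre-trained scikit-learn model once per container and scores incoming records; caching the model in a module-level variable amortizes the load time across warm invocations.

```python
import io
import boto3
import joblib

s3 = boto3.client("s3")

MODEL_BUCKET = "models-bucket"    # assumed location of the
MODEL_KEY = "fraud/model.joblib"  # serialized model artifact

_model = None  # loaded once per container, reused while the container is warm


def _load_model():
    global _model
    if _model is None:
        blob = s3.get_object(Bucket=MODEL_BUCKET, Key=MODEL_KEY)["Body"].read()
        _model = joblib.load(io.BytesIO(blob))
    return _model


def handler(event, context):
    """Score each incoming record with the pre-trained model."""
    model = _load_model()
    features = [r["features"] for r in event["records"]]
    scores = model.predict(features)
    return {"scores": [float(s) for s in scores]}
```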
Predictive Approaches:
Nesen et al. propose a framework for processing data from multiple modalities, such as text, video, and sensor data, using serverless computing. The framework aims to extract useful knowledge from large amounts of data and process it in a fast and scalable manner, with a focus on public safety solutions.
Rausch et al. introduce Skippy, a container scheduling system that optimizes the placement of serverless edge functions. Skippy addresses limitations of existing serverless platforms in managing data-intensive applications on edge systems by considering factors like data proximity, compute capabilities, and edge/cloud locality.
Adaptive Approaches:
León-Sandoval et al. discuss the use of big data and serverless architecture to monitor and measure the emotional response of the Mexican population to the COVID-19 pandemic. The study utilizes a large dataset of public domain tweets and applies sentiment analysis tools to analyze the changes in sentiment.
Anshuman et al. introduce a system called Smartpick, which combines serverless (SL) and virtual machine (VM) resources to optimize cost and performance in data analytics systems. Smartpick uses machine learning prediction models to determine the optimal configuration of SL and VM instances based on workload characteristics.
Unsupervised Learning Approaches:
Efterpi et al. discuss the use of serverless computing and Function-as-a-Service (FaaS) to facilitate Machine Learning Functions-as-a-Service (MLFaaS). They present an approach for creating composite services, or workflows, of ML tasks within a serverless architecture, allowing data scientists to focus on the complete data path functions required for their analysis.
Rahman et al. discuss the use of serverless computing for big data analytics, focusing on a personalized recommendation system, and show how serverless computing can provide a cost-effective, high-performance solution for processing and analyzing large amounts of data.
Bhattacharjee et al. introduce “Stratum,” a serverless framework designed for the lifecycle management of machine learning-based data analytics tasks. It addresses the challenges of ML model development and deployment by providing an end-to-end platform that can deploy, schedule, and dynamically manage various data analytics tools and services across the cloud, fog, and edge computing environments.
Heuristic-based Approaches
Heuristic-based approaches in data pipelines for serverless computing involve using heuristics or rule-based methods to make decisions and perform data processing and transformation tasks within the pipeline. These approaches leverage defined rules or algorithms to guide the data pipeline’s behavior and achieve specific objectives.
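The rule-based flavor can be illustrated with a small routing function that sends each incoming record down a different pipeline branch according to predefined rules; the record fields and branch names here are invented.

```python
# Predefined routing rules: the first matching predicate wins.
RULES = [
    (lambda r: r.get("size_mb", 0) > 100, "batch-branch"),   # large payloads -> batch path
    (lambda r: r.get("type") == "sensor", "stream-branch"),  # sensor events -> stream path
    (lambda r: True, "default-branch"),                      # catch-all fallback
]


def route(record: dict) -> str:
    """Return the name of the pipeline branch the record should take."""
    for predicate, branch in RULES:
        if predicate(record):
            return branch


assert route({"type": "sensor"}) == "stream-branch"
assert route({"size_mb": 250}) == "batch-branch"
assert route({}) == "default-branch"
```

Because the rules are explicit and deterministic, such pipelines are easy to audit, but any shift in data characteristics requires a manual rule update.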
Rule-based Approaches:
Enes et al. introduce a novel platform for scaling resources in real time for Big Data workloads in serverless environments. The platform dynamically adjusts container resources without requiring restarts, using operating-system-level virtualization.
Kuhlenkamp et al. discuss the evaluation of Function-as-a-Service (FaaS) platforms as a foundation for serverless big data processing. They introduce a Serverless Infrastructure Evaluation Method (SIEM) to understand the impact of automatic infrastructure management on serverless big data applications.
Threshold-based Approaches:
Shivananda et al. explore the use of serverless computing and data pipelines for handling Internet of Things (IoT) data in fog and cloud computing environments. They investigate three different approaches for designing serverless data pipelines and evaluate their performance using real-time fog computing workloads.
Toader et al. introduce a serverless graph-processing system called “Graphless” designed to make graph processing more accessible. Graphless combines the serverless computing paradigm with the data-intensive nature of graph processing through an architectural approach and backend services.
Expert Systems Approaches:
Bian et al. discuss the use of cloud function (CF) services, such as AWS Lambda, as accelerators for elastic data analytics. They propose a hybrid query engine called Pixels-Turbo, which leverages CFs to accelerate processing during workload spikes while using a scalable VM cluster for regular query processing.
Jarachanthan et al. introduce ACTS, an autonomous cost-efficient task orchestration framework for serverless analytics. It addresses the challenges in adapting data analytics applications to the serverless environment by mitigating cold-start latency, reducing state sharing overhead, and optimizing cost efficiency.
Framework-based Approaches
Framework-based approaches in data pipelines for serverless computing refer to the utilization of frameworks or platforms that provide a structured and efficient way to design, develop, and deploy data pipelines.
Serverless Data Pipeline Frameworks:
Mirampalli et al. evaluate two serverless data pipeline approaches, NiFi and Message Queuing Telemetry Transport (MQTT), in fog computing environments. The study focuses on image streaming data and compares the performance of the two approaches in terms of pipeline execution time, memory usage, and CPU usage.
Dehury et al. propose an extension to the TOSCA standard called TOSCAdata, which focuses on modeling data pipeline-based cloud applications. TOSCAdata provides a number of TOSCA models that are independently deployable, schedulable, scalable, and re-usable.
Big Data Processing Frameworks:
Müller et al. present Lambada, a serverless distributed data processing framework designed for data analytics. They explore the suitability of serverless computing for data processing and demonstrate its cost and performance advantages in certain scenarios.
Sedlak et al. propose a novel approach for sharing privacy-sensitive data within federations of independent organizations. The approach combines data meshes and serverless computing to streamline data sharing processes and address the specific requirements of variable data sharing constellations.
Discussion
Each data pipeline approach in serverless computing has its own advantages and disadvantages, and the selection of the appropriate approach depends on the specific requirements of the use case.
Machine Learning-based Approaches:
Advantages: Can handle complex data processing tasks, such as feature extraction and pattern recognition. Improve accuracy and predictive capability through iterative model refinement.
Disadvantages: Require extensive computing resources to process and analyze large datasets. Complexity in determining optimal configurations of serverless and virtual machine instances.
Heuristic-based Approaches:
Advantages: Simpler to implement and able to handle specific data processing requirements that do not call for complex models. Processing is faster and more efficient, since no models need to be trained and deployed.
Disadvantages: More rigid and dependent on predefined rules or thresholds. Coordinating the execution of functions through chained calls can introduce performance bottlenecks.
Framework-based Approaches:
Advantages: Support integration with various data sources and tools for seamless data processing. Offer a wide range of pre-built components and functionalities for data pipeline development.
Disadvantages: Often limit the data pipeline to a specific vendor or platform, potentially creating vendor lock-in. Complexity in managing and orchestrating the execution of multiple functions and workflows.
In terms of complexity, flexibility, and vendor lock-in, the approaches can be compared as follows:
- Complexity vs. Simplicity: Heuristic-based approaches are simpler to implement, while ML-based solutions offer more advanced features and therefore higher complexity.
- Flexibility vs. Rigidity: Machine learning-based pipelines are highly adaptable, while heuristic-based approaches are more rigid and dependent on predefined rules.
- Vendor Lock-in: Framework-based approaches often limit the data pipeline to a specific vendor or platform, while heuristic and ML-based approaches may provide more vendor-agnostic solutions.
The selection of the appropriate data pipeline approach should be based on the specific requirements of the use case, such as the complexity of the data, the need for adaptability, the importance of performance and reliability, and the available resources and expertise within the organization.
Open Issues and Challenges
Despite the advancements in data pipeline approaches for serverless computing, there are several open issues and challenges that warrant further research and investigation:
Scalability: Developing techniques for efficiently scaling data pipelines in serverless computing environments to manage large and complex data sets, including auto-scaling mechanisms, load balancing strategies, and resource allocation algorithms.
Fault Tolerance: Enhancing mechanisms to handle failures, such as automatic retry mechanisms, error handling strategies, and fault detection and recovery techniques, to ensure the robustness and reliability of the pipeline.
Security and Privacy: Investigating methods to ensure data security and privacy in serverless data pipelines, including secure data transfer and storage, encryption techniques, access control mechanisms, and compliance with privacy regulations.
Cost Optimization: Developing approaches to optimize the cost of executing data pipelines in serverless environments, considering factors such as resource allocation, function sizing, and data transfer costs.
Workflow Orchestration: Exploring techniques for managing and orchestrating complex workflows in serverless data pipelines, including workflow specification languages, coordination mechanisms, and task scheduling algorithms.
Real-time Processing: Investigating techniques to enable real-time data processing in serverless data pipelines, including event-driven processing, stream processing, and near real-time analytics.
Hybrid Architectures: Exploring the integration of serverless computing with other computing paradigms, such as edge computing or hybrid cloud approaches, to leverage the strengths of serverless computing for data processing while considering data locality, latency, and data governance requirements.
State Management: Developing efficient state management solutions that integrate seamlessly with serverless architectures; serverless functions are inherently stateless, yet pipelines often require persistent state.
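The state management challenge can be made concrete with a small sketch: because functions are stateless, any progress the pipeline must retain across invocations has to be externalized, for example to a key-value store. The table name and checkpoint schema below are illustrative assumptions.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("pipeline-checkpoints")  # illustrative table name


def load_checkpoint(pipeline_id: str) -> int:
    """Return the last processed offset, or 0 if the pipeline has no state yet."""
    item = table.get_item(Key={"pipeline_id": pipeline_id}).get("Item")
    return int(item["offset"]) if item else 0


def save_checkpoint(pipeline_id: str, offset: int) -> None:
    """Persist progress so the next (stateless) invocation can resume from it."""
    table.put_item(Item={"pipeline_id": pipeline_id, "offset": offset})


def handler(event, context):
    offset = load_checkpoint("ingest")
    # ... process records starting at `offset` ...
    save_checkpoint("ingest", offset + len(event.get("records", [])))
```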
Addressing these open issues and challenges will play a crucial role in enhancing the capabilities and adoption of data pipeline approaches in serverless computing environments.
Conclusion
Serverless computing offers a scalable and cost-effective solution for handling data pipelines, as it eliminates the need for managing and provisioning servers. This article has provided a comprehensive taxonomy of data pipeline approaches in the context of serverless computing, classifying them into three main categories: machine learning-based, heuristic-based, and framework-based.
Each approach has its own advantages and disadvantages, and the selection of the appropriate strategy depends on the specific requirements of the use case. Factors such as complexity, flexibility, and vendor lock-in should be carefully considered when choosing a data pipeline approach.
Furthermore, the article has highlighted several open issues and future research directions in the field of data pipelines in serverless computing, including scalability, fault tolerance, security and privacy, cost optimization, workflow orchestration, real-time processing, hybrid architectures, and state management. Addressing these challenges will be crucial in enhancing the capabilities and adoption of data pipeline approaches in serverless computing environments.
As the demand for efficient data processing and analysis continues to grow, the development of robust and versatile data pipeline approaches in serverless computing will remain a critical area of focus for researchers and practitioners alike.