Data Engineering and Data Science Toolbox: List of Must-Have Tools and Platforms
Data Engineering
- ETL Tools:
- Apache NiFi (Apache Software Foundation)
- Talend (Talend)
- Apache Camel (Apache Software Foundation)
- Informatica PowerCenter (Informatica)
- Microsoft SQL Server Integration Services (SSIS) (Microsoft)
- Apache Spark (Apache Software Foundation) (for data preprocessing)
- Data Warehousing Platforms:
- Amazon Redshift (Amazon Web Services)
- Google BigQuery (Google Cloud Platform)
- Snowflake (Snowflake Inc.)
- Microsoft Azure Synapse Analytics (Microsoft Azure)
- Teradata (Teradata)
- IBM Db2 Warehouse (IBM)
- Data Integration Platforms:
- Apache Kafka (Apache Software Foundation)
- Apache Flume (Apache Software Foundation)
- AWS Glue (Amazon Web Services)
- Apache Sqoop (Apache Software Foundation)
- Talend Data Integration (Talend)
- Data Storage Technologies:
- Hadoop HDFS (Hadoop Distributed File System) (Apache Software Foundation)
- NoSQL databases:
- MongoDB (MongoDB Inc.)
- Apache Cassandra (Apache Software Foundation)
- Redis (Redis Ltd., formerly Redis Labs)
- Apache Hive (Apache Software Foundation)
- Apache HBase (Apache Software Foundation)
- Amazon S3 (Amazon Web Services)
- Google Cloud Storage (Google Cloud Platform)
Data Science
- Data Analysis and Visualization Tools:
- Python (with libraries like Pandas, NumPy) (Python Software Foundation)
- R (with libraries like ggplot2) (R Foundation for Statistical Computing)
- Tableau (Tableau Software)
- Power BI (Microsoft)
- Matplotlib
- Seaborn
- Plotly
- Machine Learning Frameworks:
- TensorFlow (Google)
- PyTorch (Meta, formerly Facebook; now under the PyTorch Foundation)
- scikit-learn (scikit-learn developers)
- Keras (Google)
- XGBoost (DMLC)
- LightGBM (Microsoft)
- Data Science Platforms:
- Jupyter Notebook (Project Jupyter)
- RStudio (Posit, formerly RStudio)
- Google Colab (Google)
- Microsoft Azure Machine Learning Studio (Microsoft Azure)
- IBM Watson Studio (IBM)
- Big Data Analytics Tools:
- Apache Spark (Apache Software Foundation)
- Hadoop (MapReduce) (Apache Software Foundation)
- Databricks (Databricks)
- Statistical Analysis Tools:
- SAS (SAS Institute)
- SPSS (IBM)
- Stata (StataCorp)
- Version Control and Collaboration Tools:
- Git (with hosting platforms such as GitHub, GitLab, Bitbucket) (Software Freedom Conservancy)
- Jira (Atlassian)
- Confluence (Atlassian)
- Data Preprocessing and Cleaning Tools:
- OpenRefine (OpenRefine community)
- Trifacta (Trifacta)
- DataRobot (DataRobot)
- Natural Language Processing (NLP) Tools:
- NLTK (Natural Language Toolkit) (NLTK Project, originally University of Pennsylvania)
- spaCy (Explosion)
- Gensim (Radim Řehůřek, RARE Technologies)
- Computer Vision Tools:
- OpenCV (OpenCV)
- TensorFlow Object Detection API (Google)
- Automated Machine Learning (AutoML) Tools:
- Auto-sklearn (AutoML group, University of Freiburg)
- H2O.ai
- DataRobot
GitOps
- GitOps tools:
- Flux
- Argo CD
- Jenkins X
- Infrastructure as code (IaC) tools:
- Terraform
- AWS CloudFormation
- Pulumi
DevOps
- CI/CD tools:
- Jenkins
- CircleCI
- GitHub Actions
- Container orchestration tools:
- Kubernetes
- Docker Swarm
- Apache Mesos
- Configuration management tools:
- Ansible
- Chef
- Puppet
DataOps
- Data pipeline orchestration tools:
- Apache Airflow
- Luigi
- Prefect
- Data warehousing and data lake platforms:
- Amazon Redshift
- Google BigQuery
- Snowflake
- Azure Synapse Analytics
MLOps
- Machine learning platforms:
- Google Cloud Vertex AI (formerly AI Platform)
- Amazon SageMaker
- Azure Machine Learning
- Version control and collaboration tools:
- Git
- Jira
- Confluence
- Experiment tracking and model monitoring tools:
- Domino Data Lab
- MLflow
- Weights & Biases
AIOps
- Incident management and alerting tools:
- Opsgenie
- PagerDuty
- Splunk On-Call (formerly VictorOps)
- AIOps platforms:
- Splunk IT Service Intelligence (ITSI)
- Dynatrace (Davis AI)
- Datadog (Watchdog)
Infrastructure automation
- IaC tools:
- Terraform
- AWS CloudFormation
- Pulumi
- Cloud management tools:
- AWS Management Console
- Azure Portal
- Google Cloud console
Ops
- Monitoring tools:
- Prometheus
- Grafana
- Zabbix
- Logging tools:
- Elasticsearch
- Logstash
- Kibana