• Proven track record working with HDFS and Unix commands.
• Knowledge of extracting data from different sources, such as DBMS and NoSQL.
• Experience managing Spark on an HDFS cluster.
• Very good knowledge of Spark and Scala.
• Ability to write MapReduce and Spark jobs.
• Experience with open-source technologies used in Big Data analytics, such as Pig, Hive, HBase, and Kafka.
• PySpark knowledge is a must, along with the ability to work with an Impala/Hive data lake.
• Experience extracting text, mainly using OCR; knowledge of Tesseract would fulfill this requirement.
• Willingness to learn, ability to think skeptically about problems and results, and curiosity to explore new techniques and domains.
• Knowledge of design patterns (GoF) would be helpful for developing complex PySpark algorithms.
• Data Extraction from various file formats.
• In-depth knowledge of Unix commands, especially for working with HDFS, Linux (Gentoo), and Spark.
• Hands-on experience with the Hadoop ecosystem is a must.
• Ability to work independently in a quickly evolving environment.
• Familiarity with tools like Team Foundation Server (TFS).
• Ability to analyze data, identify issues such as gaps and inconsistencies, and perform root cause analysis.
• Experience working with customers to identify and clarify requirements.
• Ability to design solutions that are fit for purpose whilst keeping options open for future needs.
• Strong verbal and written communication skills and good customer relationship skills.
Database:
• SQL, NoSQL
• Hive, Impala, SparkSQL
• Connecting R to Hive/Impala
Unix:
• Hadoop admin and Spark admin capabilities
• Advanced Linux commands
• Multithreading and distributed computing (understanding and development knowledge)
Tools/Languages:
• Python (NumPy, pandas, scikit-learn, NLTK)
• Spark/PySpark (SQL, ML, GraphX, Streaming), including developing algorithms that are not available in PySpark
• Notebooks (Jupyter Notebook, Zeppelin, Databricks)
• Data ingestion (Kafka streaming using Spark)
Data management:
• Data extraction (from different formats such as PDF, HTML, JPEG, and so on)
• Data cleaning (producing text formats for data scientists)
• Data validation (text extracted using OCR must be validated and recorded after extraction)
• Data loading (database connections, FTP connections, etc.)
Engineering:
• Code packaging
• ETL and environment configuration (versioning, package installation)