Data Scientist Interview Questions – Explain what precision and recall are.   After a predictive model has been built, the most important question is: how good is it? Does it predict well? Evaluating the model is one of the most important tasks in a data science project; it indicates how good the predictions are. Very often for classification problems we look at metrics called precision and recall,…
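For a quick illustration, here is a minimal sketch of the two metrics computed with scikit-learn (the library choice and the toy labels are my own, not from the post): precision is the share of predicted positives that were truly positive, recall is the share of true positives the model actually found.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground-truth labels and predictions from a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Precision: of the points predicted positive, how many were actually positive
print("Precision:", precision_score(y_true, y_pred))
# Recall: of the actually positive points, how many did the model find
print("Recall:", recall_score(y_true, y_pred))
```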

Read More

Data Science News – handpicked articles, news, and stories from the Data Science world. NEWS   Experts Predict When Artificial Intelligence Will Exceed Human Performance – Artificial intelligence is changing the world and doing it at breakneck speed. The promise is that intelligent machines will be able to do every task better and more cheaply than humans. Rightly or wrongly, one industry after another is falling…

Read More

Predictive model validation and testing. Why evaluate/test a model at all? Evaluating the performance of a model is one of the most important stages in predictive modeling; it indicates how successful the model has been on the dataset. It lets us tune parameters and, in the end, test the tuned model against a fresh cut of data. Below we will look at a few of the most common validation metrics used…
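As a minimal sketch of the idea, assuming scikit-learn and one of its bundled datasets, the snippet below holds out a fresh cut of data and scores the fitted model on it:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hold out a fresh cut of data for testing, fit the model on the rest
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Evaluate the fitted model on data it has never seen
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```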

Read More

Regularization – definition and Python example. In Machine Learning, very often the task is to fit a model to a set of training data and use the fitted model to make predictions or classify new (out-of-sample) data points. Sometimes a model fits the training data very well but does poorly at predicting out-of-sample data points. A model may be too complex…
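A small, hedged illustration, assuming scikit-learn: the same high-degree polynomial model is fit with and without an L2 penalty (ridge), so the out-of-sample scores can be compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a simple underlying curve plus noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A degree-12 polynomial is flexible enough to overfit the training data;
# ridge adds an L2 penalty on the coefficients to rein the model in
unregularized = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=1.0))

for name, model in [("no regularization", unregularized), ("ridge", regularized)]:
    model.fit(X_train, y_train)
    print(name, "test R^2:", model.score(X_test, y_test))
```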

Read More

TensorFlow Tutorial – Introduction. What is TensorFlow? The shortest definition would be: TensorFlow is a general-purpose library for graph-based computation. But there are a variety of other ways to define TensorFlow; for example, Rodolfo Bonnin in his book – Building Machine Learning Projects with TensorFlow – offers a definition like this: “TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in…
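A tiny graph-mode sketch in the TensorFlow 1.x style this tutorial describes (on TensorFlow 2.x the same calls live under tf.compat.v1, with eager execution disabled):

```python
import tensorflow as tf

# Use the classic graph-based execution model
tf.compat.v1.disable_eager_execution()

# Nodes in the graph are operations; edges are the tensors flowing between them
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
total = a + b  # an "add" node wired to the two constant nodes

# Nothing has been computed yet -- a Session executes the graph
with tf.compat.v1.Session() as sess:
    print(sess.run(total))  # 7.0
```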

Read More

Data Science News Digest – handpicked articles, news, and stories from the Data Science world.   NEWS   CUDA 9 Features Revealed – At the GPU Technology Conference, NVIDIA announced CUDA 9, the latest version of CUDA’s powerful parallel computing platform and programming model.   Explaining How End-to-End Deep Learning Steers a Self-Driving Car – As part of a complete software stack for autonomous driving, NVIDIA has created a deep-learning-based…

Read More

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. YARN also extends the…
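As one small, hedged illustration of that single-platform view, the ResourceManager exposes every application it is running over its REST API; the host name below is hypothetical, while the port and endpoint follow YARN's usual defaults.

```python
import requests

# Hypothetical ResourceManager address; 8088 is the usual YARN web UI/REST port
RM = "http://resourcemanager.example.com:8088"

# List the applications currently running on the cluster
resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

for app in (resp.json().get("apps") or {}).get("app", []):
    # Each engine (MapReduce, Spark, Tez, ...) registers its own applicationType,
    # but all of them are scheduled by the same YARN ResourceManager
    print(app["id"], app["applicationType"], app["name"])
```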

Read More

Hadoop Flume was created as an incubator Apache project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other…
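A minimal sketch of feeding events into such a flow, assuming a Flume agent configured with an HTTP source (the host, port, and event contents below are hypothetical):

```python
import json
import requests

# Hypothetical Flume agent whose HTTP source listens on port 44444
FLUME_HTTP_SOURCE = "http://flume-agent.example.com:44444"

# Flume's default JSON handler expects a list of events, each with headers and a body
events = [
    {"headers": {"source": "webapp", "level": "INFO"}, "body": "user signed in"},
    {"headers": {"source": "webapp", "level": "WARN"}, "body": "slow response on /search"},
]

# The source accepts the events; the configured sink delivers them into Hadoop (e.g. HDFS)
resp = requests.post(FLUME_HTTP_SOURCE, data=json.dumps(events),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print("Flume accepted", len(events), "events:", resp.status_code)
```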

Read More

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log,” making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects…
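A minimal pub/sub sketch, assuming the kafka-python client and a broker on localhost (the topic name and messages are made up for illustration):

```python
from kafka import KafkaProducer, KafkaConsumer

# Hypothetical broker and topic; Kafka stores the stream as a distributed, replicated log
BROKER = "localhost:9092"
TOPIC = "clickstream"

# Publish a few messages to the topic (the "pub" side of the pub/sub queue)
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"event-{i}".encode("utf-8"))
producer.flush()

# Subscribe and read the messages back from the beginning of the log (the "sub" side)
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```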

Read More

Hadoop Zookeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, etc. Applications can leverage these services to coordinate distributed processing across large clusters. Name services, group services, synchronization services, configuration management, and more, are available…
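A small sketch of the configuration-management use case, assuming the kazoo Python client (the ensemble address and znode path below are hypothetical):

```python
from kazoo.client import KazooClient

# Hypothetical ZooKeeper ensemble; 2181 is the usual client port
zk = KazooClient(hosts="zookeeper.example.com:2181")
zk.start()

# Store a small piece of shared configuration in the hierarchical namespace
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

# Any process in the cluster can read the same value back and coordinate on it
data, stat = zk.get("/app/config")
print("config:", data.decode("utf-8"), "version:", stat.version)

zk.stop()
```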

Read More