
The Value of Data

Although there is no simple answer to which came first, the chicken or the egg, in machine learning the answer is easy: data came before any function. Machine learning is all about learning from data. The learning algorithm tries to learn a function that either classifies the data into different categories or models the relationship underlying the data.

There are two popular ways in which a machine learning algorithm can be taught to learn a function, and in both cases it needs data.

  • Supervised Learning: We give the algorithm a lot of data with both inputs and outputs, and it learns a function mapping one to the other. In regression problems, the function approximates the relationship between input and output; in classification problems, it assigns each input to a category.

  • Unsupervised Learning: We give the algorithm a lot of input data with no outputs, and it tries to find patterns in the data on its own. When the algorithm groups the data based on similarities between data points, this is called clustering.
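The contrast between the two settings can be sketched on toy 1-D data. This is only an illustration (the helper names are made up, and a real project would use a library such as scikit-learn): the supervised helper fits a line to labelled (input, output) pairs by least squares, while the unsupervised helper groups unlabelled points into two clusters with a tiny k-means loop.

```python
# Minimal sketch: supervised vs. unsupervised learning on toy 1-D data.
# Illustrative only; function names here are invented for this example.

def fit_line(xs, ys):
    """Supervised: learn y = a*x + b from labelled (input, output) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

def cluster_two(points, iters=10):
    """Unsupervised: group unlabelled points into two clusters (1-D k-means)."""
    c1, c2 = min(points), max(points)  # initial centroids
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted((c1, c2))

a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])               # labelled: y = 2x
centroids = cluster_two([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])   # no labels given
```

In the supervised case we hand the algorithm the answers (the `ys`); in the unsupervised case it has only the inputs and must discover the two groups itself.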


In both cases, enough data is needed for the algorithm to learn the function. Most machine learning algorithms are provided with three kinds of data -

  • Training Data: This is the data the algorithm uses to learn the function, from which it then tries to generalize.

  • Validation Data: There is a high chance that the algorithm will overfit the training data and then fail on any other data. Validation data guards against this: by evaluating the model on data it was not trained on, we can compare how accurately it performs on known versus unknown data.

  • Test Data: Once the function is learnt, it is evaluated on the test data. This final check tells us whether the model generalizes to data it has never seen, and thus stands a better chance of generalizing to the future data it must do inference on.
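The three-way split above can be sketched in a few lines. This is a minimal example with an assumed 70/15/15 ratio; the exact proportions vary by project and dataset size, and libraries like scikit-learn provide ready-made utilities for this.

```python
# Minimal sketch of a train/validation/test split (assumed 70/15/15 ratio).
import random

def split_dataset(samples, train=0.70, val=0.15, seed=42):
    shuffled = samples[:]                  # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],                 # fit the model here
            shuffled[n_train:n_train + n_val],  # tune / detect overfitting
            shuffled[n_train + n_val:])         # final generalization check

train_set, val_set, test_set = split_dataset(list(range(100)))
```

Shuffling before splitting matters: if the data is ordered (say, by class), an unshuffled split would give the model an unrepresentative training set.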


Two interesting questions emerge, both having elegant answers.

Question. What decides the amount of data needed for the algorithm to learn a function?
Answer. This comes largely with experience, but the more the merrier. Today's machine learning models such as neural networks are so powerful that if not given enough data, they overfit easily. And once your model is fine-tuned and no further optimization is possible, it will only do better with more data.

Question. What if there are outliers in the data?
Answer. For better performance, such outliers should usually be filtered out of the data; otherwise the algorithm may get confused and learn a function that tries to fit the outliers too. How much this matters depends on how sensitive the particular algorithm is to outliers.

In today's world, where companies with more data are richer than companies with less, it seems data will decide the future.

P.S.: You hardly get any spam emails in your inbox, thanks to the billions of emails your spam filter has been trained on.
