Tagged: learning

  • jkabtech 12:17 pm on November 20, 2017 Permalink | Reply
    Tags: hierarchy, learning   

    Learning a hierarchy 

    October 26, 2017

    We’ve developed a hierarchical reinforcement learning algorithm that learns high-level actions useful for solving a range of tasks, allowing fast solving of tasks requiring thousands of timesteps. Our algorithm, when applied to a set of navigation problems, discovers a set of high-level actions for walking and crawling in different directions, which enables the agent to master new navigation tasks quickly.

    Paper | Code

    Humans solve complicated challenges by breaking them up into small, manageable components. Making pancakes consists of a series of high-level actions, such as measuring flour, whisking eggs, transferring the mixture to the pan, turning the stove on, and so on. Humans are able to learn new tasks rapidly by sequencing together these learned components, even though the task might take millions of low-level actions, i.e., individual muscle contractions.

    On the other hand, today’s reinforcement learning methods operate through brute force search over low-level actions, requiring an enormous number of attempts to solve a new task. These methods become very inefficient at solving tasks that take a large number of timesteps.

    Our solution is based on the idea of hierarchical reinforcement learning, where agents represent complicated behaviors as a short sequence of high-level actions. This lets our agents solve much harder tasks: while the solution might require 2000 low-level actions, the hierarchical policy turns this into a sequence of 10 high-level actions, and it’s much more efficient to search over the 10-step sequence than the 2000-step sequence.
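
    As a rough illustration of that idea (a hedged sketch under gym-style environment conventions, not the algorithm from the paper; all names are hypothetical), a hierarchical policy can be written as a master policy that picks a sub-policy and commits to it for a fixed number of low-level steps:

    ```python
    import numpy as np

    class HierarchicalPolicy:
        """Hedged sketch: a master policy picks one of a few sub-policies (the
        high-level actions) and runs it for `horizon` low-level steps, so a
        2000-step episode is searched over as a ~10-step high-level sequence."""

        def __init__(self, sub_policies, horizon=200):
            self.sub_policies = sub_policies   # callables: observation -> low-level action
            self.horizon = horizon

        def choose_sub_policy(self, obs):
            # Placeholder for a learned master policy; here we choose at random.
            return np.random.randint(len(self.sub_policies))

        def run_episode(self, env, max_steps=2000):
            obs, total_reward, step = env.reset(), 0.0, 0
            while step < max_steps:
                k = self.choose_sub_policy(obs)    # one high-level action ...
                for _ in range(self.horizon):      # ... expands into many low-level actions
                    obs, reward, done, _ = env.step(self.sub_policies[k](obs))
                    total_reward += reward
                    step += 1
                    if done or step >= max_steps:
                        return total_reward
            return total_reward
    ```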

    Meta-Learning Shared Hierarchies

    View the Original article

     
  • jkabtech 8:17 pm on November 6, 2017 Permalink | Reply
    Tags: Academy, learning

    Show HN: Neutron Academy, Google Assistant Powered Learning 

    View the Original article

     
  • jkabtech 9:51 am on July 22, 2017 Permalink | Reply
    Tags: learning   

    When not to use deep learning 

    Jun 16, 2017 · 12 minute read

    I know it’s a weird way to start a blog post on a negative note, but there was a wave of discussion in the last few days that I think serves as a good hook for some topics I’ve been thinking about recently. It all started with a post on the Simply Stats blog by Jeff Leek on the caveats of using deep learning in the small-sample-size regime. In short, he argues that when the sample size is small (which happens a lot in the bio domain), linear models with few parameters perform better than deep nets, even ones with only a modicum of layers and hidden units. He goes on to show that a very simple linear predictor, using the ten most informative features, performs better than a simple deep net when classifying zeros and ones in the MNIST dataset with only 80 or so samples. This prompted Andrew Beam to write a rebuttal in which a properly trained deep net was able to beat the simple linear model, even with very few training samples. This back-and-forth comes at a time when more and more researchers in biomedical informatics are adopting deep learning for various problems. Is the hype real, or are linear models really all we need? The answer, as always, is that it depends. In this post, I want to look at use cases in machine learning where deep learning would not really make sense, as well as tackle preconceptions that I think prevent deep learning from being used effectively, especially by newcomers.
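
    As a rough, hedged sketch of the kind of comparison being debated (not Leek’s or Beam’s actual code, and with arbitrary model choices), one could reproduce the setup in scikit-learn: a top-ten-feature linear model against a small fully connected net, both trained on roughly 80 MNIST zeros and ones.

    ```python
    # Hedged sketch of the small-sample comparison discussed above; not the
    # original authors' code, and the exact scores will vary with the subset.
    from sklearn.datasets import fetch_openml
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    mask = (y == "0") | (y == "1")
    X, y = X[mask] / 255.0, (y[mask] == "1").astype(int)

    # Roughly 80 training samples; everything else is held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=80, random_state=0, stratify=y)

    # Simple linear predictor on the ten most informative pixels.
    linear = make_pipeline(SelectKBest(f_classif, k=10),
                           LogisticRegression(max_iter=1000)).fit(X_train, y_train)

    # Small fully connected net standing in for the "simple deep net".
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                        random_state=0).fit(X_train, y_train)

    print("linear:", linear.score(X_test, y_test))
    print("net:   ", net.score(X_test, y_test))
    ```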

    Breaking deep learning preconceptions

    First, let’s tackle some preconceptions that folks outside the field have that turn out to be half-truths. There are two broad ones, and one that’s a bit more technical, that I’m going to elaborate on. This is somewhat of an extension to Andrew Beam’s excellent “Misconceptions” section in his post.

    Deep learning can really work on small sample sizes

    Deep learning’s claim to fame came in a context with lots of data (remember that the first Google Brain project fed lots of YouTube videos to a deep net), and ever since it has constantly been publicized as complex algorithms running on lots of data. Unfortunately, this big data/deep learning pairing somehow got translated into the converse as well: the myth that deep learning cannot be used in the small-sample regime. If you have just a few samples, reaching for a neural net with a high parameter-per-sample ratio may superficially seem like asking to overfit. However, considering only sample size and dimensionality for a given problem, be it supervised or unsupervised, is modeling the data in a vacuum, without any context. It is probably the case that you have data sources that are related to your problem, or that a domain expert can provide a strong prior, or that the data is structured in a very particular way (e.g. it is encoded in a graph or image). In all of these cases, there’s a chance deep learning makes sense as the method of choice – for example, you can learn useful representations from bigger, related datasets and reuse those representations in your problem. A classic illustration of this is common in natural language processing, where you can learn word embeddings on a large corpus like Wikipedia and then use them as embeddings in a smaller, narrower corpus for a supervised task. In the extreme, you can have a set of neural nets jointly learn a representation and an effective way to reuse that representation on small sets of samples. This is called one-shot learning and has been successfully applied in a number of fields with high-dimensional data, including computer vision and drug discovery.
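
    As a minimal, hedged sketch of the word-embedding transfer idea (the gensim model name and the tiny labelled set below are illustrative placeholders, not a recipe): featurize a small labelled dataset by averaging word vectors pretrained on a large corpus, then fit a simple classifier on top.

    ```python
    # Sketch: transfer pretrained embeddings to a tiny supervised task.
    # Downloading the vectors requires network access on first use.
    import numpy as np
    import gensim.downloader as api
    from sklearn.linear_model import LogisticRegression

    vectors = api.load("glove-wiki-gigaword-50")   # embeddings trained on a large corpus

    def embed(sentence):
        # Average the pretrained vectors of the known words in a sentence.
        words = [w for w in sentence.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0) if words else np.zeros(50)

    # Tiny labelled set standing in for the "smaller, narrower corpus".
    texts = ["great movie", "terrible plot", "loved it", "waste of time"]
    labels = [1, 0, 1, 0]

    X = np.stack([embed(t) for t in texts])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict([embed("really great"), embed("terrible waste")]))
    ```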

    View the Original article

     
  • jkabtech 12:34 am on February 1, 2016 Permalink | Reply
    Tags: learning

    Deep Learning Is Easy – Learn Something Harder 

    inFERENCe

    posts on machine learning, statistics, opinions on things I’m reading in the space

    January 25th, 2016

    Caveat: This post is meant to address people who are completely new to deep learning and are planning an entry into this field. The intention is to help them think critically about the complexity of the field, and to help them tell apart the things that are trivial from the things that are really hard. As I wrote and published this article, I realised it ended up overly provocative, and I’m not a good enough writer to write a thought-provoking post without, well, provoking some people. So please read the article through this lens.

    These days I come across many people who want to get into machine learning/AI, particularly deep learning. Some ask me what the best way is to get started and learn. Clearly, at the speed things are evolving, there seems to be no time for a PhD. Universities are sometimes a bit behind the curve on applications, technology and infrastructure, so is a masters worth doing? A couple of companies now offer residency programmes (extended internships) which supposedly allow you to kickstart a successful career in machine learning without a PhD. What your best option is depends largely on your circumstances, but also on what you want to achieve.

    The general advice I increasingly find myself giving is this: deep learning is too easy. Pick something harder to learn; learning deep neural networks should not be the goal but a side effect.

    Deep learning is powerful exactly because it makes hard things easy.

    The reason deep learning made such a splash is the very fact that it allows us to phrase several previously impossible learning problems as empirical loss minimisation via gradient descent, a conceptually super simple thing. Deep networks deal with natural signals we previously had no easy way of dealing with: images, video, human language, speech, sound. But almost whatever you do in deep learning, at the end of the day it becomes super simple: you combine a couple of basic building blocks and ideas (convolution, pooling, recurrence), and you can do it without overthinking it; if you have enough data, the network will figure it out. Increasingly high-level, declarative frameworks like TensorFlow, Theano, Lasagne, Blocks, Keras, etc. simplify this to the level of building Lego towers.
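
    To make the Lego-tower point concrete, here is a hedged sketch using the Keras API bundled with TensorFlow (layer sizes and input shape are arbitrary placeholders): a working image classifier is little more than a stack of off-the-shelf blocks.

    ```python
    # Sketch of "Lego tower" model building: standard blocks, no custom math.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu"),   # convolution block
        layers.MaxPooling2D(),                     # pooling block
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),    # classifier head
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    ```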

    This is not to say there are no genuinely novel ideas coming out of deep learning, or from using deep learning in more innovative ways, far from it. Generative Adversarial Networks and Variational Autoencoders are brilliant examples that sparked new interest in probabilistic/generative modelling. Understanding why and how those work, and how to generalise and build on them, is really hard – the deep learning bit is easy. Similarly, there is a lot of exciting research on understanding why and how these deep neural networks really work.

    There is also a feeling in the field that the low-hanging fruit for deep learning is disappearing. Building deep neural networks for supervised learning – while still being improved – is now considered boring or solved by many (this is a bold statement and of course far from the truth). The next frontier, unsupervised learning, will certainly benefit from the deep learning toolkit, but it also requires a very different kind of thinking, and familiarity with information theory, probability and geometry. Insights into how to make these methods actually work are unlikely to come in the form of improvements to neural network architectures alone.

    What I’m saying is that by learning deep learning, most people mean learning to use a relatively simple toolbox. But in six months’ time, many, many more people will have those skills. Don’t spend time working on or learning about stuff that retrospectively turns out to be too easy. You might miss your chance to make a real impact with your work and to differentiate your career in the long term. Think about what you really want to be able to learn, pick something harder, and then go work with people who can help you with that.

    What are examples of harder things to learn? Consider what knowledge authors like Ian Goodfellow, Durk Kingma, etc. drew on when they came up with the algorithms mentioned before. Much of the relevant material that is now being rediscovered was actively researched in the early 2000s. Learn classic things like the EM algorithm, variational inference, and unsupervised learning with linear Gaussian systems: PCA, factor analysis, Kalman filtering, slow feature analysis. I can also recommend Aapo Hyvarinen’s work on ICA and pseudolikelihood. You should try to read (and understand) this seminal deep belief network paper.
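
    As a refresher on the flavour of those classic methods, here is a minimal, hedged sketch of the EM algorithm for a toy 1-D mixture of two unit-variance Gaussians (a teaching toy with hand-picked numbers, not a library-grade routine).

    ```python
    # Toy EM for a two-component, unit-variance 1-D Gaussian mixture.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

    mu = np.array([-1.0, 1.0])   # initial component means
    pi = np.array([0.5, 0.5])    # initial mixing weights

    for _ in range(50):
        # E-step: responsibility of each component for each data point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2) * pi
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and mixing weights from responsibilities.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        pi = resp.mean(axis=0)

    print(mu, pi)   # should recover roughly (-2, 3) and (0.5, 0.5)
    ```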

    While deep learning is where most interesting breakthroughs happened recently, it’s worth trying to bet on areas that might gain relevance going forward:

    probabilistic programming and black-box probabilistic inference (with or without deep neural networks). Take a look at Picture, for example, or Josh Tenenbaum’s work on inverse graphics networks. Or the material from this NIPS workshop on black-box inference. To quote my colleague Lucas Theis:

    probabilistic programming could do for Bayesian ML what Theano has done for neural networks

    better/scalable MCMC and variational inference methods, again with or without the use of deep neural networks. There is a lot of recent work on things like this. If we made MCMC as reliable as stochastic gradient descent now is for deep networks, that could mean a resurgence of more explicit Bayesian/probabilistic models and hierarchical graphical models, of which RBMs are just one example.
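
    For readers who have never written a sampler, here is a toy Metropolis-Hastings sketch (a standard-normal target and random-walk proposal chosen purely for illustration); making this kind of machinery reliable at scale is the hard problem being pointed at.

    ```python
    # Toy random-walk Metropolis-Hastings sampler for a standard normal target.
    import numpy as np

    def log_target(theta):
        return -0.5 * theta ** 2          # standard normal, up to a constant

    rng = np.random.default_rng(0)
    theta, samples = 0.0, []
    for _ in range(10000):
        proposal = theta + rng.normal(scale=0.5)      # random-walk proposal
        # Accept with probability min(1, target(proposal)/target(theta)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples.append(theta)

    print(np.mean(samples), np.std(samples))   # roughly 0 and 1
    ```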

    Roughly the same thing happened around the data scientist buzzword some years ago. Initially, using Hadoop, Hive, etc. was a big deal, and several early adopters made a very successful career out of – well – being early adopters. Early on, all you really needed to do was count stuff on smallish distributed clusters, and you quickly accumulated tens of thousands of followers who worshipped you for being a big data pioneer.

    What people did back then seemed like magic at the time, but looking back just a couple of years later it’s trivial: lots of people use Hadoop and Spark now, and tools like Amazon’s Redshift have made things even simpler. Back in the day, your startup could get funded on the premise that your team could use Hive, but unless you used it in some interesting way, that technological advantage evaporated very quickly. At the top of the hype cycle there were data science internships, residential training programs, bootcamps, etc. By the time people graduated from these programs, those skills had been rendered somewhat irrelevant and trivial. What is happening now with deep learning looks very similar.

    In summary, if you are about to get into deep learning, just think about what that means, and try to be more specific. Think about how many other people are in your position right now, and how you are going to make sure the things you learn aren’t the ones that will look super boring in a year’s time.

    The research field of deep learning touches on a lot of interesting, very complex topics from machine learning, statistics, optimisation, geometry and so on. The slice of deep learning most people are likely to come across – the Lego-block-building aspect – is, however, relatively simple and straightforward. If you are completely new to the field, it is important to see beyond this simple surface and pick some of the harder concepts to master.


    View the original article here

     
  • jkabtech 11:13 am on January 19, 2016 Permalink | Reply
    Tags: basketball, learning, predict, scores

    Using machine learning to predict basketball scores 

    By: Scott Clark, PhD

    Here at SigOpt we think a lot about model tuning and building optimization strategies; one of our goals is to help users get the most out of their Machine Learning (ML) models as quickly as possible. When our last hackathon rolled around I was inspired by some recent articles about using machine learning to make sports bets. For my hackathon project I teamed up with our amazing intern George Ke and set out to use a simple algorithm and open data to build a model that could predict the best basketball bets to make. We used SigOpt to tune the features and hyperparameters of this model to make it as profitable as possible, hoping to find a winning combination that could beat the house. Is it possible to use optimized machine learning models to beat Vegas? The short answer is yes; read on to find out how [0].

    Broadly speaking, there are three main challenges before deploying a machine learning model. First, you must Extract the data from somewhere, Transform it into a usable state, and then Load it somewhere you can quickly access it (ETL). This stage often requires a lot of creativity and good old-fashioned hacking. Next, you must apply your domain expertise about the problem to build the features and pick the model that will best solve it. Once you have your data and model you must train and tune the model to get it to the best possible state. This is what we will focus on in this post.

    It is often completely intractable to tune a model with more than a handful of parameters using traditional methods like grid and random search, because of the curse of dimensionality and how resource-intensive the process is. Model tuning is non-intuitive and orthogonal to the domain expertise required for the rest of the ML process, so it is often also prohibitively inefficient to do by hand. However, with the advent of optimization tools like SigOpt to properly tune models, it is now possible for experts in any field to get the most out of their models quickly and easily. While this final stage of model building is sometimes skipped in practice, it can often mean the difference between making money and losing money with your model, as we see below.

    We used one of the simplest possible sports bets you can make in Vegas for our experiment: the Over/Under line. This is a bet that the total number of points scored by both teams in a game will be higher, or lower, than some number that Vegas picks. For example, if Vegas says the sum of scores for a game will be 200.5, the scores total 210, and we bet “over,” then we would win \$100 for every \$110 we bet [1]; otherwise (if we bet “under,” or if the total had come in lower than 200.5 on an “over” bet) we would lose our \$110 stake. On each game we simulated the same \$110 bet (winning \$100 only when we choose correctly). We picked NBA games for the experiment both for the wide availability of open statistics [2] and because over 1,000 games are played per year, giving us many data points with which to train our model.
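
    As a hedged sketch of that payout arithmetic (the function and parameter names below are made up for illustration, not taken from the SigOpt code), a single simulated bet can be scored like this:

    ```python
    # Sketch of the Over/Under payout rule: a $110 stake returns $100 on a win
    # and is lost otherwise (half-point lines cannot push).
    def bet_profit(prediction, vegas_line, total_points, stake=110.0, win=100.0):
        """Return the profit of one simulated Over/Under bet."""
        bet_over = prediction > vegas_line
        won = (total_points > vegas_line) if bet_over else (total_points < vegas_line)
        return win if won else -stake

    print(bet_profit(prediction=205.0, vegas_line=200.5, total_points=210))  # +100
    print(bet_profit(prediction=195.0, vegas_line=200.5, total_points=210))  # -110
    ```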

    We picked a random forest regression model as our algorithm because it is easy to use and has interesting parameters to tune (hyperparameters) [3]. We chose 23 different team-based statistics to build the features of the model [4]. We did not modify the feature set beyond our initial picks, in order to show how model tuning, independent of feature selection, would fare against Vegas. For each of the 23 features we created a slow and a fast moving average for both the home and away team. These fast and slow moving averages have tunable feature parameters, which we use SigOpt to optimize [5]. The averages were calculated both over the total number of games and over a number of games of similar type (home games for the home team, away games for the away team). This led to 184 total features for every game and a total of 7 tunable parameters [3] [5].
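
    A hedged sketch of that feature construction (not SigOpt’s actual code; the window sizes, decay convention, and example numbers below are placeholders) might compute one statistic’s fast and slow averages like this:

    ```python
    # Sketch: fast and slow moving averages of one team statistic, with a
    # tunable decay parameter; in this sketch decay=0 means uniform weights.
    import numpy as np

    def decayed_average(values, window, decay):
        """Average the last `window` values, weighting recent games more as
        `decay` grows."""
        recent = np.asarray(values[-window:], dtype=float)
        weights = np.exp(decay * np.arange(len(recent)))   # oldest -> newest
        return float(np.average(recent, weights=weights))

    points_per_minute = [2.10, 2.25, 1.98, 2.40, 2.31, 2.05, 2.22, 2.35]
    fast = decayed_average(points_per_minute, window=3, decay=0.5)
    slow = decayed_average(points_per_minute, window=8, decay=0.1)
    print(fast, slow)   # two of the 184 features, for one team and statistic
    ```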

    The output of our model is a predicted number of total points scored, given the historical statistics of the two teams playing in a given game. If the model predicts a lower score than the Vegas Over/Under line, then we will bet under; similarly, if the model predicts a higher score, we will bet over. We also let SigOpt tune how “certain” the model needs to be in order for us to make a bet, by only simulating a bet when the difference between our prediction and the Over/Under line is greater than a tunable threshold.
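
    That decision rule is simple enough to sketch directly (again a hypothetical helper for illustration, not the repository’s code):

    ```python
    # Sketch: only bet when the prediction differs from the Vegas line by more
    # than a tunable certainty threshold.
    def decide_bet(predicted_total, vegas_line, threshold):
        if abs(predicted_total - vegas_line) <= threshold:
            return None            # not confident enough, skip this game
        return "over" if predicted_total > vegas_line else "under"

    print(decide_bet(208.3, 200.5, threshold=5.0))   # "over"
    print(decide_bet(202.0, 200.5, threshold=5.0))   # None -> no bet
    ```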

    We used the ‘00-’14 NBA seasons to train our model (training data), and random subsets of the ‘14-’15 season to evaluate it in the tuning phase (test data). For every set of tunable parameters we calculated the average winnings (and the variance of winnings) that we would have achieved over many random subsets of the testing data. Every evaluation took 15 minutes on a high-CPU Linux machine. Note that grid search and random search (traditional approaches to model tuning) would be impractical ways to perform parameter optimization on this problem, because for both methods the number of required evaluations grows very large with the number of parameters [6]. In practice, SigOpt needs a number of evaluations that is only linear in the number of parameters. It is worth noting that even though it requires fewer evaluations, SigOpt also tends to find better results than grid and random search. Figure 1 shows how profitability increases with evaluations as SigOpt tunes the model.


    Figure 1: Over the course of 100 different train and test evaluations, SigOpt was able to tune our model from losing more than \$500 to winning more than \$1,000, on average.  This value was computed on random subsets of the ‘14-’15 test season, which was not used for training.
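
    The objective being optimized can be sketched, under our assumptions, as a function from parameters to average simulated winnings. The sketch below is not SigOpt’s code: it reuses the hypothetical bet_profit helper from above, assumes the data arrays are prepared elsewhere, and borrows only the hyperparameter names from the scikit-learn estimator cited in footnote [3].

    ```python
    # Hedged sketch of the tuning objective: train a random forest on the
    # historical seasons, then average simulated winnings over random subsets
    # of the held-out season. Assumes numpy arrays X_train, y_train, X_test,
    # y_test, and test_lines exist, and reuses bet_profit from the earlier sketch.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def average_winnings(params, X_train, y_train, X_test, y_test, test_lines,
                         n_subsets=10, seed=0):
        model = RandomForestRegressor(
            n_estimators=params["n_estimators"],            # tuned hyperparameter
            min_samples_leaf=params["min_samples_leaf"],    # tuned hyperparameter
            min_samples_split=params["min_samples_split"],  # tuned hyperparameter
            random_state=seed,
        ).fit(X_train, y_train)

        rng = np.random.default_rng(seed)
        winnings = []
        for _ in range(n_subsets):
            idx = rng.choice(len(X_test), size=len(X_test) // 2, replace=False)
            preds = model.predict(X_test[idx])
            profit = sum(
                bet_profit(p, line, total)
                for p, line, total in zip(preds, test_lines[idx], y_test[idx])
                if abs(p - line) > params["threshold"]      # certainty threshold
            )
            winnings.append(profit)
        return float(np.mean(winnings)), float(np.var(winnings))
    ```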

    Once we have used SigOpt to fine-tune the model, we want to see how it performs on a holdout dataset that we have never seen before. This simulates using our model to make bets where the only information available is historical. Since the model was trained and tuned on the ‘00-’15 seasons, we used the first games of the ‘15-’16 season (being played now) to evaluate our tuned model. After simulating 131 total bets over a month, we observe that the SigOpt-tuned model would have made \$1,550 in profit. An untuned version of the same model racked up \$1,020 in losses over the same holdout dataset [7]. Not only does model tuning with SigOpt make a huge difference, but a simple, well-tuned model can beat the house.


    Figure 2: The blue line is cumulative winnings after each day of the SigOpt tuned model. The grey dashed line is the cumulative winnings of the untuned model. The dashed red line is the breakeven line.

    We are releasing all of the code used in this example in our GitHub repo. We were able to use the power of SigOpt optimization to take a relatively simple model and make it beat Vegas. Can you use a more complicated model to get better results? Can you think of more features to add? Does including individual player stats increase accuracy? All of these questions can be explored by forking the repository and using a free trial of SigOpt to tune your model [0].

    [0]: All bets in this blog post were simulated, no actual money was gambled. SigOpt does not advocate sports gambling. Check your local laws to learn if gambling is legal in your area. Make sure you read the SigOpt terms of service before using SigOpt to tune your models.

    [1]: Betting \$110 to win \$100 is part of the edge that Vegas keeps. This keeps a player from breaking even by picking “over” and “under” randomly.

    [2]: NBA stats: http://stats.nba.com, Vegas lines: http://www.nowgoal.net/nba.htm

    [3]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html We will tune the hyperparameters of n_estimators, min_samples_leaf and min_samples_split.

    [4]: 23 different team level features were chosen: points per minute, offensive rebounds per minute, defensive rebounds per minute, steals per minute, blocks per minute, assists per minute, points in paint per minute, second chance points per minute, fast break points per minute, lead changes per minute, times tied per minute, largest lead per game, point differential per game, field goal attempts per minute, field goals made per minute, free throw attempts per minute, free throws made per minute, three point field goals attempted per minute, three point field goals made per minute, first quarter points per game, second quarter points per game, third quarter points per game, and fourth quarter points per game.

    [5]: The feature parameters included the number of games to look back for the slow and fast moving averages, an exponential decay parameter controlling how much the most recent games count towards that average (with a value of 0 indicating linear decay), and the threshold for the difference between our prediction and the Over/Under line required to make a bet.

    [6]: Even a coarse grid of width 5 would require 5^7 = 78,125 evaluations, taking over 800 days to run sequentially. Such a coarse grid would also almost certainly perform poorly compared to the Bayesian approach that SigOpt takes; for examples, see this blog post.

    [7]: The untuned model uses the same random forest implementation (with default hyperparameters), the same features, a fast and slow moving linear average of 1 and 10 games respectively, and a “certainty” threshold of 0.0 points.

    View the original article here

     