WEBVTT 1 00:00:01.650 --> 00:00:02.820 Daniel Bienstock: Okay? 2 00:00:07.653 --> 00:00:08.546 Daniel Bienstock: Alright, 3 00:00:09.500 --> 00:00:12.479 Daniel Bienstock: Before we get started there's a chance 4 00:00:12.540 --> 00:00:14.373 Daniel Bienstock: that I'm going to get 5 00:00:15.290 --> 00:00:21.899 Daniel Bienstock: called by the electricity company. They have to come and do some service, and I'm just waiting for them to call back. 6 00:00:22.660 --> 00:00:30.760 Daniel Bienstock: Okay, let me open this up. Alright. So now let's review where we were last time. We have a bit of ground to cover today. 7 00:00:30.920 --> 00:00:39.189 Daniel Bienstock: We're going to need at least one more lecture on these ML tools, if not 2 more lectures, 8 00:00:39.390 --> 00:00:42.110 Daniel Bienstock: and a quick review. Last lecture 9 00:00:46.650 --> 00:00:52.299 Daniel Bienstock: we saw deep learning. I encourage you to look at the little PDF 10 00:00:52.900 --> 00:00:58.750 Daniel Bienstock: that I had uploaded. This is deep learning using gradient descent 11 00:00:59.556 --> 00:01:03.530 Daniel Bienstock: to minimize mean square error, 12 00:01:04.209 --> 00:01:05.949 Daniel Bienstock: using training data. 13 00:01:06.020 --> 00:01:16.780 Daniel Bienstock: The data, in the example that we considered, consists of pairs: each element of the data consists of a vector and a value. 14 00:01:16.790 --> 00:01:30.179 Daniel Bienstock: And now we are trying to use the vectors to predict the values, and the network will predict the number. And the error is squared. And now, over all the data samples, we take the average of these square errors, the mean square error, 15 00:01:30.586 --> 00:01:35.340 Daniel Bienstock: and the variables in the optimization problem are the network weights. 16 00:01:36.315 --> 00:01:45.330 Daniel Bienstock: And we were using gradient descent; in the community, first order methods are used to try to minimize this function 17 00:01:45.350 --> 00:01:52.839 Daniel Bienstock: and thus build a network, or put weights on the network, and then that gets used to make predictions. 18 00:01:53.620 --> 00:01:58.339 Daniel Bienstock: Then we described how to use a variation, a variation on this, 19 00:01:58.430 --> 00:02:00.789 Daniel Bienstock: to do classification. 20 00:02:00.950 --> 00:02:06.019 Daniel Bienstock: And in classification, what you have is you have a network, 21 00:02:06.643 --> 00:02:12.626 Daniel Bienstock: and the last layer, the last layer of the network, 22 00:02:14.590 --> 00:02:15.910 Daniel Bienstock: outputs 23 00:02:15.920 --> 00:02:21.880 Daniel Bienstock: one number per class that we are trying to classify. 24 00:02:22.050 --> 00:02:40.629 Daniel Bienstock: So, one number per class. So if you're trying to use images as inputs, and then describe what is in the image — it could be different types of animals — and so let's say there are 5 choices, then the last layer of the network will have 5 neurons outputting numbers. 25 00:02:41.029 --> 00:02:57.190 Daniel Bienstock: And then, for example, you can take the maximum of those numbers, and that is the prediction. And the network is trained the same way. More generally, you take these 5 numbers, and out of them you fashion a probability, 26 00:02:57.330 --> 00:03:00.150 Daniel Bienstock: a discrete probability distribution,
27 00:03:00.390 --> 00:03:20.620 Daniel Bienstock: for example, using what is called the softmax: this exponential divided by the sum of the exponentials. So you take the exponential of each of the numbers, you divide by the sum of these 5 exponentials, and that gives you a probability distribution. And now you use that probability distribution to make a stochastic prediction. 28 00:03:21.280 --> 00:03:26.599 Daniel Bienstock: And again, you can use this metric, 29 00:03:26.610 --> 00:03:28.480 Daniel Bienstock: as compared to the truth, 30 00:03:28.530 --> 00:03:33.140 Daniel Bienstock: to set up an optimization problem to train the network. 31 00:03:34.010 --> 00:03:50.859 Daniel Bienstock: Okay? Now, let's move on past all of this. We were discussing the application of interest here, which is how to play a game like chess, or in particular play Go, which is considered more difficult 32 00:03:51.010 --> 00:03:53.710 Daniel Bienstock: than chess. 33 00:03:53.910 --> 00:04:06.304 Daniel Bienstock: And so the first topic that we started to look at in the last lecture is what we call supervised learning of policy networks. So this is very loaded terminology; 34 00:04:06.730 --> 00:04:11.840 Daniel Bienstock: let's see what it means, and then we'll see how it was used 35 00:04:11.930 --> 00:04:30.970 Daniel Bienstock: by the AlphaGo people, by the DeepMind people, to create the first element in this setup that they call AlphaGo. That proved very successful — not the most successful; we'll get to that in a future lecture, maybe next lecture or the one after that. 36 00:04:31.110 --> 00:04:34.280 Daniel Bienstock: But this is supervised learning 37 00:04:35.760 --> 00:04:37.679 Daniel Bienstock: of policy networks. 38 00:04:41.000 --> 00:04:58.939 Daniel Bienstock: Okay, the learning is what we have been discussing before; this is the AI term. Networks is because we have a network. Supervised here is because we have input data which is considered to be a kind of ground truth. 39 00:04:59.140 --> 00:05:05.159 Daniel Bienstock: In the case of the Go-playing algorithms, the supervision 40 00:05:05.814 --> 00:05:11.720 Daniel Bienstock: was provided by data coming from games played by masters of the game. 41 00:05:11.790 --> 00:05:26.460 Daniel Bienstock: And policy is yet another loaded term. Okay, we will be exposed to this term multiple times today, and we'll come to understand what it means. This is a term that predates, to some extent, this AI community. 42 00:05:26.630 --> 00:05:32.699 Daniel Bienstock: And so what is the setup? What is the setup? You know, we have pairs. This is the data that we have. 43 00:05:34.390 --> 00:05:35.900 Daniel Bienstock: We have pairs 44 00:05:36.550 --> 00:05:44.149 Daniel Bienstock: of the form (state, action). And so here, state is the state of a system, 45 00:05:44.460 --> 00:05:48.449 Daniel Bienstock: and the system, for example, could be the board for the game. 46 00:05:49.520 --> 00:05:57.110 Daniel Bienstock: My understanding is, for the Go-playing algorithms, the state was actually a picture 47 00:05:57.210 --> 00:06:03.200 Daniel Bienstock: of the board with the pieces in place. A picture, okay? And a is the action 48 00:06:05.401 --> 00:06:07.870 Daniel Bienstock: taken, let's say, by an expert: 49 00:06:08.530 --> 00:06:13.039 Daniel Bienstock: the correct action to take when the system is in this given state. 50 00:06:14.660 --> 00:06:21.690 Daniel Bienstock: And then, how does this work?
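As an aside, a minimal sketch of the softmax step described above (the exponential divided by the sum of the exponentials) and of making a stochastic prediction from it; this is also how the policy network discussed next turns its last-layer numbers into a distribution. The scores below are made-up illustrative numbers, not values from the lecture.

import math, random

def softmax(scores):
    # Exponentiate each score and divide by the sum of the exponentials.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Say the last layer outputs one number per class (5 classes, as in the animals example).
scores = [1.2, 0.3, -0.8, 2.1, 0.0]
probs = softmax(scores)                                   # a discrete probability distribution
prediction = random.choices(range(5), weights=probs)[0]   # a stochastic prediction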
What is supervised learning of a policy network? So here's our network. 51 00:06:24.210 --> 00:06:40.930 Daniel Bienstock: Okay? And we have all these layers and so on, multiple layers. These are parameters. Let's not call it theta, actually, because they chose not to call it theta. Let's call it sigma, that Greek letter. So we have the state, 52 00:06:42.580 --> 00:06:44.449 Daniel Bienstock: the state coming in. 53 00:06:44.730 --> 00:06:55.579 Daniel Bienstock: And it's going to be mapped as a vector into the input layer. And out of the output layer comes not an action but a probability distribution. 54 00:06:56.500 --> 00:07:01.999 Daniel Bienstock: Okay? The probability that we should take action a given state s, 55 00:07:02.830 --> 00:07:09.489 Daniel Bienstock: and we use the data that we have, the pairs that we have, to provide the training data. 56 00:07:11.270 --> 00:07:18.053 Daniel Bienstock: Okay? And so we run this classification network, 57 00:07:19.580 --> 00:07:21.629 Daniel Bienstock: it makes a prediction 58 00:07:21.660 --> 00:07:23.740 Daniel Bienstock: for the correct action to take, 59 00:07:24.740 --> 00:07:33.539 Daniel Bienstock: and then we compare it to the action recommended by the expert, and thus we obtain a metric. 60 00:07:33.920 --> 00:07:57.070 Daniel Bienstock: Okay? However, the way that this got used was by doing gradient ascent. Gradient ascent. And what do I mean by that? Gradient ascent to maximize. Maximize what? Maximize the probability that we take the correct action given any of the states. 61 00:07:57.190 --> 00:08:09.920 Daniel Bienstock: Okay, so let me write the formula for what would amount to the gradient ascent, and we'll explain what we did. I uploaded the first paper by the AlphaGo people. 62 00:08:09.920 --> 00:08:32.599 Daniel Bienstock: There are more papers now by the AlphaGo people. There's a section toward the end of the paper that's called results — or, pardon, methods — and in it they describe in somewhat more detail what exactly it is that they did. Okay, so it's best to look at what they did in terms of the formulas, which we're going to do, to understand exactly what it is that we are doing. 63 00:08:32.760 --> 00:08:35.239 Daniel Bienstock: Okay, this is gradient ascent 64 00:08:37.580 --> 00:08:39.320 Daniel Bienstock: rather than descent. 65 00:08:41.700 --> 00:08:43.940 Daniel Bienstock: Okay, gradient ascent. 66 00:08:45.570 --> 00:08:48.169 Daniel Bienstock: And let me write the formula, 67 00:08:49.040 --> 00:08:54.529 Daniel Bienstock: and I'll explain what it is that we are doing. So each iteration 68 00:08:56.430 --> 00:09:18.320 Daniel Bienstock: is as follows. So, delta sigma: sigma are the network weights, okay, these are the numbers that go in the network, and they are the variables of our optimization problem. Okay? Alpha — alpha is what we will call the learning rate. It's the step size, basically, for a gradient method. 69 00:09:18.750 --> 00:09:21.720 Daniel Bienstock: But I have this m. Now, I have a sum. 70 00:09:22.900 --> 00:09:49.100 Daniel Bienstock: Okay, of what? And I take the gradient — the notation I'm using is more standard than what they have in their paper, but you'll understand what I mean. Gradient with respect to the network weights, which are our variables. This is the gradient part of the gradient method. Gradient of what function? And I'm going to write the function here.
The log, the log of the probability 71 00:09:50.852 --> 00:09:53.869 Daniel Bienstock: of a_k given s_k. 72 00:09:53.940 --> 00:10:02.699 Daniel Bienstock: So what is all of this? So, a_k comma — I should do it the other way around: a_k, s_k; 73 00:10:03.530 --> 00:10:07.019 Daniel Bienstock: s_k, a_k. This is the training data. 74 00:10:09.390 --> 00:10:12.589 Daniel Bienstock: These are — you know, this is a state, 75 00:10:13.600 --> 00:10:15.450 Daniel Bienstock: and this is the action 76 00:10:16.620 --> 00:10:19.059 Daniel Bienstock: taken by a master of the game. Again, 77 00:10:20.430 --> 00:10:24.809 Daniel Bienstock: okay, and that K is what? K is roughly 30 million. 78 00:10:26.550 --> 00:10:29.690 Daniel Bienstock: Well, roughly — actually, 1 million, 79 00:10:30.440 --> 00:10:32.750 Daniel Bienstock: or 1.5, something like that. 80 00:10:34.600 --> 00:10:37.149 Daniel Bienstock: We write it the standard way. So we have — 81 00:10:37.150 --> 00:10:38.599 matias: Is that M, then? 82 00:10:38.890 --> 00:10:42.650 Daniel Bienstock: No, that's not m. Oh, no — good point, 83 00:10:43.888 --> 00:11:01.830 Daniel Bienstock: good point. Okay. What is m? M is a mini batch. Okay? So let me write out all the data. If you look at the paper, you'll see — the paper has a lot of detail. This is training data. Training data 84 00:11:05.435 --> 00:11:10.310 Daniel Bienstock: is about that large; training data is about that large. M 85 00:11:10.750 --> 00:11:12.270 Daniel Bienstock: is a mini batch, 86 00:11:15.180 --> 00:11:18.379 Daniel Bienstock: which was 16. 87 00:11:19.360 --> 00:11:28.260 Daniel Bienstock: So this is a case of stochastic gradient. Okay, we take a mini batch of size — oh, 2 people are entering the room; they'd be allowed in, 88 00:11:28.490 --> 00:11:30.109 Daniel Bienstock: both of them. 89 00:11:31.330 --> 00:11:41.419 Daniel Bienstock: Okay, my network is kind of slow. Admit all; they are getting admitted. Okay, joining. Fine. We have roughly 1.5 million 90 00:11:43.630 --> 00:12:10.209 Daniel Bienstock: data points. We take mini batches of size — they said size 16. I'm not exactly sure why 16. Okay, but the sum that we see here — I'm taking an average, right? An average of the 16 gradients that I'm computing. And this is stochastic gradient descent. 91 00:12:10.260 --> 00:12:12.060 Daniel Bienstock: Okay, so — 92 00:12:12.160 --> 00:12:34.430 Daniel Bienstock: but really, what are we trying to do? You know, we are trying to maximize the probability that the trained network predicts the correct action. Okay, p of a_k given s_k — a_k is the action that the expert took given state s_k. 93 00:12:35.050 --> 00:12:45.970 Daniel Bienstock: p of a_k given s_k — let me further highlight: this is the probability that our trained system gives to this correct action. 94 00:12:47.190 --> 00:13:07.289 Daniel Bienstock: And so if we didn't have the log, we would have something that looks natural, right, for each sample. If we had only one sample — instead of m equal to 16 we have m equal to one — at that point we are trying to maximize the probability of the action taken by the expert. 95 00:13:07.840 --> 00:13:29.009 Daniel Bienstock: And then we are taking an average, which is the stochastic gradient part. But it's not the p, it's the log of the p. Okay? Why the log? Okay. Now, the log is consistent with the probability.
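A reconstruction of the update being written on the board, in the notation above (sigma the network weights, alpha the learning rate, m the mini batch of training pairs (s_k, a_k)); this is my reading of the spoken description, not a formula copied from the paper:

\[ \Delta\sigma \;=\; \frac{\alpha}{m}\sum_{k=1}^{m} \nabla_{\sigma}\, \log p_{\sigma}(a_k \mid s_k). \]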
If you maximize the log, you're also maximizing the probability. So that's okay. But still, why the log? Okay. So we'll get to that in a minute; we'll get to that in a little bit. 96 00:13:29.490 --> 00:13:42.479 Daniel Bienstock: So this is what these characters did. Okay, they took, let's say, a training data set of one and a half or close to 2 million moves taken by experts. 97 00:13:42.850 --> 00:14:06.139 Daniel Bienstock: And they used that to train the network using gradient ascent exactly with these rules. What else did they have? Hold on, hold on, wait a minute: what is alpha? Alpha is the learning rate. Okay, this is the step size, for those of us who remember first order methods. What did they do? They used alpha equal to 3 times 10 to the minus 3 — very small — 98 00:14:06.770 --> 00:14:08.829 Daniel Bienstock: and reduced 99 00:14:10.150 --> 00:14:12.499 Daniel Bienstock: by a factor of a half 100 00:14:12.770 --> 00:14:18.710 Daniel Bienstock: every — how many steps? Every what? Every 80 million steps. 101 00:14:20.710 --> 00:14:23.250 Daniel Bienstock: This tells you that it took a lot of steps; 102 00:14:25.018 --> 00:14:31.009 Daniel Bienstock: they are not shy. Okay, what else? No momentum. We learned about momentum 103 00:14:32.036 --> 00:14:50.810 Daniel Bienstock: in previous lectures. No momentum: straight, fixed step, fixed learning rate, gradient descent. As simple as possible. Okay. Now, in the paper, they describe some details about the network. 104 00:14:51.692 --> 00:14:54.990 Daniel Bienstock: In a future lecture, in a future lecture, 105 00:14:55.420 --> 00:14:59.970 Daniel Bienstock: we'll do something practical and discuss, you know, what the network architectures are. 106 00:15:00.000 --> 00:15:13.480 Daniel Bienstock: There's some terminology. If you read this Google paper, in that methods section they have a subsection, a few paragraphs, describing the network architecture. The main thing that we can say is, it has 13 layers. 107 00:15:14.080 --> 00:15:29.920 Daniel Bienstock: If you keep reading, they tell you something about the number of units that they have. You'll see the term kernel and the term stride. These are AI, or I should say ML, terms that describe to some extent the architecture 108 00:15:29.970 --> 00:15:43.290 Daniel Bienstock: of the network, what the layers look like. The last layer — or rather the step between the last 2 layers — is not a convolutional layer; it's fully connected. 109 00:15:43.740 --> 00:15:57.700 Daniel Bienstock: Okay, what is the size of these layers? How about the first layer? Well, you have to be able to input a state. Okay? The Go board is like 19 by 19. 110 00:15:58.640 --> 00:16:11.650 Daniel Bienstock: And so you have to have that dimensionality, but a little more, because you have to say where the pieces are. But even 19 by 19 — it's roughly, you know, almost 400 dimensional. 111 00:16:12.380 --> 00:16:18.830 Daniel Bienstock: Okay, 13 layers. So, alright, they say — hold on. 112 00:16:19.180 --> 00:16:26.260 Daniel Bienstock: They say that it took some time to train this. How long did it take? I think they say 3 weeks, 3 weeks, 113 00:16:28.130 --> 00:16:52.760 Daniel Bienstock: to do the training. They have this word in there: it took 3 weeks. This term — I'm not exactly sure what it means.
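As an aside, the step-size schedule just described (alpha starting at 3 times 10 to the minus 3, halved every 80 million steps) fits in one line; a minimal sketch, with the function name my own:

def alpha_at(step, alpha0=3e-3, halve_every=80_000_000):
    # Learning rate halved every `halve_every` gradient steps, as described above.
    return alpha0 * 0.5 ** (step // halve_every)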
Does it mean 3 weeks of continuous running of their computing? That would be the direct interpretation. I don't know if that's what they mean, or if this is the entire length of time that they spent, 114 00:16:52.760 --> 00:17:15.099 Daniel Bienstock: you know, making mistakes and correcting them and changing parameters and hyperparameters. I do not know. Okay? And roughly — oh, they furthermore say — and I cannot see; I'm computing-impaired today because of various issues. Sorry: 14 layers, 14 layers. And in terms of the total runtime, 115 00:17:15.099 --> 00:17:32.019 Daniel Bienstock: it works out to — roughly 100 and — I forget the number of steps that they took; these are gradient steps, okay. But I did the computation, and it works out to roughly, about, 187 steps per second. 116 00:17:34.610 --> 00:17:48.360 Daniel Bienstock: Okay, and I don't know whether this is very fast or not very fast. Okay, and so this is the training, the training data set. 117 00:17:48.760 --> 00:17:56.270 Daniel Bienstock: Okay, they approximately solved this gradient ascent problem to try to maximize the probability 118 00:17:56.707 --> 00:18:10.499 Daniel Bienstock: that you pick the right action given the input state. Just maximize that probability. And then they tested it. Testing data: 119 00:18:14.350 --> 00:18:17.359 Daniel Bienstock: on what? Roughly 29 million 120 00:18:19.764 --> 00:18:20.659 Daniel Bienstock: cases. 121 00:18:21.280 --> 00:18:28.520 Daniel Bienstock: So the testing data was much, much bigger than the training data, and the accuracy that they got — 122 00:18:30.420 --> 00:18:34.580 Daniel Bienstock: the accuracy was about how much? 57%. 123 00:18:36.040 --> 00:18:38.720 Daniel Bienstock: Okay, so a little bit better than half the time. 124 00:18:39.560 --> 00:18:45.109 Daniel Bienstock: Okay? And apparently this was much better than what was available in the state of the art. 125 00:18:45.140 --> 00:18:50.832 Daniel Bienstock: Okay? Now, what other information can we 126 00:18:51.860 --> 00:18:58.139 Daniel Bienstock: provide? I lost the numbers here. So, roughly — 127 00:18:58.160 --> 00:19:05.220 Daniel Bienstock: so this is training. Okay, in deployment, when you want to use it and you want to evaluate the state 128 00:19:05.320 --> 00:19:10.919 Daniel Bienstock: in order to make a prediction — how fast was that? And they said roughly 3 ms. 129 00:19:12.120 --> 00:19:22.369 Daniel Bienstock: They also trained a less accurate network, where the accuracy was a lot less than 57%, only about 25. 130 00:19:22.500 --> 00:19:29.830 Daniel Bienstock: But the advantage is that making a prediction was much faster, only a few microseconds, 131 00:19:30.300 --> 00:19:35.540 Daniel Bienstock: and we will see later on, perhaps in the next lecture, why that is important. 132 00:19:36.040 --> 00:19:47.549 Daniel Bienstock: Okay, so this is supervised learning of policy networks. The word policy has to do with the fact that, given a state, 133 00:19:48.350 --> 00:19:53.160 Daniel Bienstock: the trained network amounts to a policy: it tells you what to do. 134 00:19:54.100 --> 00:19:55.100 Daniel Bienstock: Okay, 135 00:19:55.310 --> 00:20:03.609 Daniel Bienstock: what do you do? That's a policy. This is old language, really, in decision sciences. 136 00:20:04.020 --> 00:20:06.009 Daniel Bienstock: Just a policy. Alright.
137 00:20:06.330 --> 00:20:13.750 Daniel Bienstock: And it's supervised because, well, we had the data provided by the masters; they are the supervisors. 138 00:20:14.410 --> 00:20:17.619 Daniel Bienstock: And it's a network. And we learned it. Okay, 139 00:20:18.390 --> 00:20:26.059 Daniel Bienstock: the next element in what they had, okay, moving on with the machine learning hierarchy, is what 140 00:20:26.100 --> 00:20:32.449 Daniel Bienstock: they call reinforcement learning 141 00:20:34.990 --> 00:20:36.980 Daniel Bienstock: of policy networks. 142 00:20:42.930 --> 00:20:48.520 Daniel Bienstock: Okay? And what do I mean by that? So before, we had supervised learning; 143 00:20:49.580 --> 00:20:56.460 Daniel Bienstock: reinforcement learning means that we will use an algorithm to try to correct our errors. Okay, 144 00:20:57.100 --> 00:20:59.340 Daniel Bienstock: so what is the setting? 145 00:21:02.670 --> 00:21:03.949 Daniel Bienstock: We start 146 00:21:05.530 --> 00:21:07.939 Daniel Bienstock: from the previously trained network. 147 00:21:15.190 --> 00:21:20.510 Daniel Bienstock: Okay, and now we want to improve, want to improve 148 00:21:22.360 --> 00:21:24.240 Daniel Bienstock: on the weights, 149 00:21:24.440 --> 00:21:28.630 Daniel Bienstock: on the trained weights sigma. These are the weights that define the network. 150 00:21:29.010 --> 00:21:38.439 Daniel Bienstock: And what is the goal now? All right. Look, before, we had a system that tries to predict what a master would do. 151 00:21:39.140 --> 00:21:41.369 Daniel Bienstock: You could use that to play a game: 152 00:21:41.950 --> 00:21:45.320 Daniel Bienstock: every time that you are in a certain board position, 153 00:21:45.400 --> 00:21:49.470 Daniel Bienstock: well, you use the network to predict what a master would do, 154 00:21:50.200 --> 00:21:51.709 Daniel Bienstock: and you play that move. 155 00:21:52.160 --> 00:21:53.270 Daniel Bienstock: Okay? 156 00:21:54.530 --> 00:22:00.869 Daniel Bienstock: Well, now, the goal here in reinforcement learning is: we want to get better at that. 157 00:22:01.290 --> 00:22:06.309 Daniel Bienstock: We want to start from the system we trained before, that predicted what a master would do, 158 00:22:06.390 --> 00:22:08.150 Daniel Bienstock: and starting from there, 159 00:22:08.180 --> 00:22:11.390 Daniel Bienstock: improve on that system to win games. 160 00:22:12.540 --> 00:22:14.380 Daniel Bienstock: Win games. 161 00:22:15.130 --> 00:22:18.139 Daniel Bienstock: Okay? And what is the method? 162 00:22:19.860 --> 00:22:23.370 Daniel Bienstock: So the method — we could call it self play. 163 00:22:24.800 --> 00:22:29.796 Daniel Bienstock: Okay, you play — this is how it's going to work — play a game, 164 00:22:30.490 --> 00:22:31.900 Daniel Bienstock: play a game 165 00:22:33.690 --> 00:22:36.309 Daniel Bienstock: against an opponent. 166 00:22:36.970 --> 00:22:47.460 Daniel Bienstock: Actually, an opponent is going to be one of our prior algorithms, and in the very first iteration it'll be the previously trained network. 167 00:22:47.900 --> 00:22:50.209 Daniel Bienstock: But we do it over and over again: 168 00:22:50.330 --> 00:22:55.039 Daniel Bienstock: every time that we run this exercise, we're going to get a better network, 169 00:22:55.360 --> 00:22:58.249 Daniel Bienstock: and now we'll play again against that network.
170 00:22:58.450 --> 00:23:00.990 Daniel Bienstock: Okay, against a prior 171 00:23:01.390 --> 00:23:02.800 Daniel Bienstock: algorithm. 172 00:23:05.390 --> 00:23:09.990 Daniel Bienstock: Okay, and so what are the states that we see? The states 173 00:23:10.660 --> 00:23:18.140 Daniel Bienstock: are going to be s_1, s_2. These are board states, okay, as we play the game. 174 00:23:18.230 --> 00:23:40.417 Daniel Bienstock: And what is the T? They put a finite ending time for the game. Okay, if the game was taking too long, well, it was given value 0. We're going to set up another maximization problem; so value 0 means that we didn't play the game too effectively. I forget what the T was — some maximum number of moves that they allowed. Okay? And 175 00:23:42.078 --> 00:23:45.030 Daniel Bienstock: what are the actions? What are the actions? 176 00:23:45.520 --> 00:23:52.520 Daniel Bienstock: So these are the actions that we took in playing the game. Let's say these are a_1, a_2. These are the moves that we took. 177 00:23:52.940 --> 00:23:55.079 Daniel Bienstock: Okay? And what is the outcome? 178 00:23:57.120 --> 00:24:03.509 Daniel Bienstock: The outcome — I'm going to denote it as z, which is going to be equal to plus one or minus one, win or lose, 179 00:24:05.090 --> 00:24:15.329 Daniel Bienstock: okay, or 0 if the game ends up terminated. And then we apply gradient ascent. So now, again, let me write the formula. Gradient ascent. Ascent. 180 00:24:15.560 --> 00:24:21.370 Daniel Bienstock: We want to basically win games. We want to increase the score that we can get. 181 00:24:21.530 --> 00:24:36.290 Daniel Bienstock: So let me write the formula, and then we'll puzzle it out. It's going to look similar to what we had before. We're using exactly the same network, you know. So the very first time that we do this self play, we play against the previously trained network. 182 00:24:36.760 --> 00:24:53.379 Daniel Bienstock: Okay? And now, this is one game, one game. Okay, we'll change this in a minute to use multiple games — like, in less than a minute. But let's do it for one game. Okay, alpha — this is again that learning rate, the size of the gradient step. 183 00:24:53.410 --> 00:24:55.160 Daniel Bienstock: And now I take the sum, 184 00:24:56.070 --> 00:25:03.440 Daniel Bienstock: and we'll have to explain this again, you know, why this. So, the gradient, with respect to the weights, of the logarithm 185 00:25:04.500 --> 00:25:09.120 Daniel Bienstock: of the probability, pardon, 186 00:25:12.866 --> 00:25:14.599 Daniel Bienstock: of action t 187 00:25:14.800 --> 00:25:16.619 Daniel Bienstock: given state t, 188 00:25:17.050 --> 00:25:18.330 Daniel Bienstock: all of this 189 00:25:18.470 --> 00:25:19.950 Daniel Bienstock: times z. 190 00:25:20.190 --> 00:25:24.429 Daniel Bienstock: Alright. So what does this say? Let's say that z is plus one. 191 00:25:25.190 --> 00:25:26.770 Daniel Bienstock: So we won the game. 192 00:25:27.760 --> 00:25:40.620 Daniel Bienstock: Okay, does this make sense? Yes, we won the game. And so now I'm doing something that, at least superficially, is consistent with maximizing the probabilities of the various moves that we made. 193 00:25:43.150 --> 00:25:44.190 Daniel Bienstock: Okay. 194 00:25:45.990 --> 00:25:46.920 Daniel Bienstock: Alright, 195 00:25:47.363 --> 00:25:56.730 Daniel Bienstock: but why the log again? Okay. Why the log? Again, we have to try to understand this. Well, actually, this is not quite right.
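A reconstruction of the one-game update just written, with z the outcome (+1, -1, or 0) and T the length of the game; again, this is my reading of the board notation, not a quoted formula:

\[ \Delta\sigma \;=\; \alpha \sum_{t=1}^{T} \nabla_{\sigma}\, \log p_{\sigma}(a_t \mid s_t)\; z. \]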
We did 196 00:25:57.780 --> 00:26:00.599 Daniel Bienstock: stochastic gradient descent, actually. 197 00:26:01.800 --> 00:26:03.090 Daniel Bienstock: Stochastic 198 00:26:05.040 --> 00:26:06.370 Daniel Bienstock: gradient descent, 199 00:26:09.730 --> 00:26:27.120 Daniel Bienstock: which is what? Well, it's going to be alpha divided by the number of games. So we play multiple games — we don't play one game, we play multiple games — and now we take this sum. Okay? So it's an average. This is the stochastic part. We take an average 200 00:26:27.250 --> 00:26:29.150 Daniel Bienstock: of steps, 201 00:26:29.852 --> 00:26:33.000 Daniel Bienstock: and then we have something that looks like what I had above: 202 00:26:33.230 --> 00:26:38.070 Daniel Bienstock: t equals one — well, each game ends at a different time, maybe. 203 00:26:38.460 --> 00:27:02.089 Daniel Bienstock: Okay, T_i is at most capital T. And now I have the gradient of the log of p_sigma of a_t of game i given s_t of game i, and now I have a z_i, the outcome of game i. Okay, so this is just an average of terms like what I had before. So really, it's very similar. 204 00:27:02.130 --> 00:27:10.280 Daniel Bienstock: And what was m? I forgot — I mean, it's in the paper. Pardon; it's 128. 205 00:27:11.450 --> 00:27:18.950 Daniel Bienstock: So they play mini batches. They play sequences of mini batches of 128 games 206 00:27:19.170 --> 00:27:21.190 Daniel Bienstock: against a prior opponent. 207 00:27:23.330 --> 00:27:27.000 Daniel Bienstock: From that they compute one gradient step 208 00:27:27.570 --> 00:27:36.759 Daniel Bienstock: for all the network weights. The gradient step is consistent with increasing the probability that we win. 209 00:27:37.260 --> 00:27:41.760 Daniel Bienstock: We are just looking at each sample, each trajectory, as it were. 210 00:27:42.495 --> 00:27:47.919 Daniel Bienstock: Again, when z is one, we want to make the probability big; when z is minus one, we lost, 211 00:27:48.040 --> 00:27:50.959 Daniel Bienstock: and we want to decrease the probability. 212 00:27:51.360 --> 00:27:58.420 Daniel Bienstock: And then, okay, how long did this take? They say that it took — how long did it take? One day. One day 213 00:27:58.910 --> 00:28:01.599 Daniel Bienstock: in terms of the training, 214 00:28:02.075 --> 00:28:04.630 Daniel Bienstock: with 50 GPUs. 215 00:28:05.830 --> 00:28:09.429 Daniel Bienstock: Okay, so think about a GPU as being one big computer. 216 00:28:09.540 --> 00:28:12.709 Daniel Bienstock: So they had 50 of them running in parallel for one day, 217 00:28:13.260 --> 00:28:20.320 Daniel Bienstock: okay, running stochastic gradient descent — they say asynchronously, so they're doing the different steps in parallel — 218 00:28:21.279 --> 00:28:41.349 Daniel Bienstock: in this mini batch computation that we have in here. Alright. But now we want to understand, you know, why the log. Finally, why the log? Okay, why the log. Before we go there, notice that the term that has the gradient inside the sum — this is really the same, this is really the same — 219 00:28:44.925 --> 00:28:57.990 Daniel Bienstock: sorry, this is correct, this is correct. What's inside the sum here? This is the same as the gradient of the sum.
It's a sum of gradients, which is the gradient of the sum, 220 00:28:58.990 --> 00:29:01.049 Daniel Bienstock: t equals one to T_i. 221 00:29:02.982 --> 00:29:05.819 Daniel Bienstock: Sigma — no, the sigma is gone — 222 00:29:06.060 --> 00:29:07.339 Daniel Bienstock: of log 223 00:29:08.570 --> 00:29:14.640 Daniel Bienstock: of p_sigma of action t of game i given state t of game i. 224 00:29:15.370 --> 00:29:17.209 Daniel Bienstock: Okay, that's what that is. 225 00:29:18.160 --> 00:29:23.209 Daniel Bienstock: And a sum of logs is the log of the product. 226 00:29:24.290 --> 00:29:35.459 Daniel Bienstock: Okay, so this is really just the log of the probability of the entire sequence. Okay, because the different moves are independent of one another. 227 00:29:36.390 --> 00:29:48.879 Daniel Bienstock: You know, if we find ourselves in state s_t of game i, well, we'll take the move that the trained network dictates at that point. The past is gone; it's independent. 228 00:29:50.190 --> 00:29:53.810 Daniel Bienstock: And so this is a little bit reminiscent, again, of dynamic programming. 229 00:29:54.530 --> 00:30:00.859 Daniel Bienstock: Okay, so I'm just taking the log of the probability of the entire trajectory, 230 00:30:02.840 --> 00:30:10.639 Daniel Bienstock: and I'm taking the gradient of that. So again, this all makes sense. But why the log? All right. So now let's get to that. 231 00:30:11.380 --> 00:30:17.320 Daniel Bienstock: How much time do I have? Well, we don't have much. All right. So here's some classical literature that they cite. 232 00:30:17.705 --> 00:30:30.069 Daniel Bienstock: Did they cite it? I think they did; I'm not sure how they cited it. Classical literature going back to the dawn of AI. Okay, one of their gods is this guy Williams, 233 00:30:31.580 --> 00:30:33.540 Daniel Bienstock: 1992. 234 00:30:33.820 --> 00:30:45.659 Daniel Bienstock: And then, he and other people today cite even earlier work, very famous, very early work in AI: Barto and Sutton, 235 00:30:46.990 --> 00:30:49.830 Daniel Bienstock: roughly 1983. 236 00:30:49.990 --> 00:30:54.470 Daniel Bienstock: Okay, I can tell you that back then nobody took AI seriously; 237 00:30:54.790 --> 00:30:57.610 Daniel Bienstock: everybody would say, this is such a joke! 238 00:30:58.300 --> 00:31:06.999 Daniel Bienstock: It's a complete joke. These are charlatans, you know, people who should be thrown out of universities and such. 239 00:31:07.250 --> 00:31:08.140 Daniel Bienstock: Okay, 240 00:31:08.640 --> 00:31:12.890 Daniel Bienstock: so now let's look at a generic setup for reinforcement learning. 241 00:31:13.380 --> 00:31:19.119 Daniel Bienstock: I take it that some of you know — I'm told in good faith 242 00:31:19.310 --> 00:31:22.860 Daniel Bienstock: that some of you know what a Markov decision process is, 243 00:31:23.500 --> 00:31:25.430 Daniel Bienstock: but not everybody does. 244 00:31:25.740 --> 00:31:34.340 Daniel Bienstock: So let me give you an example of a Markov decision process and what it is, and then we'll take the conversation elsewhere. 245 00:31:35.070 --> 00:31:37.829 Daniel Bienstock: Okay, so what is a Markov decision process? 246 00:31:45.450 --> 00:31:50.799 Daniel Bienstock: Okay, so we have a number of states. Okay, here we have a little 4 states, 247 00:31:51.080 --> 00:31:54.950 Daniel Bienstock: and out of each state we have possible transitions.
248 00:31:55.440 --> 00:32:02.470 Daniel Bienstock: Let's say from there to there, and there to there, I don't know. And that's it. 249 00:32:03.550 --> 00:32:07.260 Daniel Bienstock: Okay. Now, in each state, in each state, 250 00:32:07.570 --> 00:32:08.803 Daniel Bienstock: we have 251 00:32:09.590 --> 00:32:14.460 Daniel Bienstock: some number of actions, possible actions, you know: action one, 252 00:32:15.610 --> 00:32:21.189 Daniel Bienstock: 2, and 3 possible actions at that state. Let's call it state a. 253 00:32:21.880 --> 00:32:26.890 Daniel Bienstock: And now there are 2 things that happen. One is that we get a reward: 254 00:32:27.620 --> 00:32:31.690 Daniel Bienstock: if we take an action, we get a reward. So here's the reward. 255 00:32:34.220 --> 00:32:39.569 Daniel Bienstock: And let's say it's 5 and minus 10 and 6. Okay, 256 00:32:39.650 --> 00:32:42.590 Daniel Bienstock: so at this state there are 3 actions, 257 00:32:43.070 --> 00:32:46.200 Daniel Bienstock: and then for each action, if we take that action, 258 00:32:46.250 --> 00:32:47.710 Daniel Bienstock: we get a reward. 259 00:32:49.150 --> 00:33:07.600 Daniel Bienstock: What else? Well, in addition to getting a reward, we're going to get a probability distribution. Okay, one third and 2 thirds, or one half and one half, and 0 and one. So what is that probability distribution? Well, notice 260 00:33:07.650 --> 00:33:10.179 Daniel Bienstock: that there are 2 numbers in each case, 261 00:33:10.760 --> 00:33:12.770 Daniel Bienstock: blue and green, 262 00:33:14.100 --> 00:33:15.600 Daniel Bienstock: blue and green. 263 00:33:15.830 --> 00:33:20.030 Daniel Bienstock: Okay, and there are 2 arcs going out, blue and green. 264 00:33:21.300 --> 00:33:29.439 Daniel Bienstock: Okay, so if we take action one, then the probability distribution indicates the probability that we will switch 265 00:33:29.600 --> 00:33:33.710 Daniel Bienstock: to the blue state or the green state from here, 266 00:33:38.160 --> 00:33:39.170 Daniel Bienstock: okay. 267 00:33:39.430 --> 00:33:44.480 Daniel Bienstock: And we have such information for every node of this network. 268 00:33:44.770 --> 00:33:47.640 Daniel Bienstock: So we choose the action that we want to take, 269 00:33:48.060 --> 00:33:50.590 Daniel Bienstock: and then we get the reward immediately, 270 00:33:50.850 --> 00:33:55.739 Daniel Bienstock: and then we transition to the other states according to the probability distribution. 271 00:33:56.130 --> 00:34:04.840 Daniel Bienstock: Another question is, okay, what policy should we have? What do I mean by a policy? A policy tells you, in each state, which action you should take. 272 00:34:05.390 --> 00:34:25.590 Daniel Bienstock: Okay? And what is the goal? The goal is to maximize the total reward. Think about this. Well, if you run this infinitely many times, of course, then you're going to get some kind of infinite reward. But let's make it time discounted, okay, with some discounting factor — let's say a half. 273 00:34:27.040 --> 00:34:37.290 Daniel Bienstock: We want to choose a policy. But is there such a thing as a policy that maximizes the expected reward, the discounted expected reward? 274 00:34:38.159 --> 00:34:40.500 Daniel Bienstock: Okay, a fixed policy. 275 00:34:40.710 --> 00:34:53.469 Daniel Bienstock: And the answer is, yes, there is. There's a whole slew of theorems.
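A minimal sketch of such a Markov decision process and of one standard way to compute an optimal stationary policy, value iteration. The states, rewards, and transition probabilities below are made-up illustrations in the spirit of the board example (rewards 5, -10, 6; a discount of one half), not the lecture's exact diagram, and value iteration is just one of the solution methods alluded to, alongside the linear programming formulation mentioned next.

# Each state maps to a list of actions; each action is (immediate reward, distribution over next states).
mdp = {
    "a": [(5.0,  {"b": 1/3, "c": 2/3}),
          (-10.0, {"b": 1/2, "c": 1/2}),
          (6.0,  {"d": 1.0})],
    "b": [(1.0,  {"a": 1.0})],
    "c": [(2.0,  {"d": 1.0})],
    "d": [(0.0,  {"d": 1.0})],          # absorbing state: once there, you're stuck forever
}
gamma = 0.5                             # discounting factor ("a half")

# Value iteration: repeatedly apply the Bellman optimality update.
V = {s: 0.0 for s in mdp}
for _ in range(100):
    V = {s: max(r + gamma * sum(p * V[s2] for s2, p in dist.items())
                for r, dist in actions)
         for s, actions in mdp.items()}

# The optimal stationary policy picks, in each state, the action achieving the maximum.
policy = {}
for s, actions in mdp.items():
    q = [r + gamma * sum(p * V[s2] for s2, p in dist.items()) for r, dist in actions]
    policy[s] = q.index(max(q))         # index of the best action in that state
print(V)
print(policy)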
And one nice thing that we don't have time to go over today is that this particular problem can be solved as a small linear program. 276 00:34:54.040 --> 00:35:07.880 Daniel Bienstock: Okay, as a small linear program. There are several methods for solving this problem, this Markov decision model, efficiently. And I can see that the blue is not showing up. Do you see the blue, Matthias? 277 00:35:10.070 --> 00:35:10.920 matias: Yes. Done. 278 00:35:11.210 --> 00:35:12.300 Daniel Bienstock: Okay. Good. 279 00:35:12.300 --> 00:35:14.110 matias: Oh, so you have — 280 00:35:14.380 --> 00:35:19.560 matias: we only see 2 colors, like teal and green, or — 281 00:35:20.000 --> 00:35:24.889 Daniel Bienstock: Yeah. Okay, so let me highlight — I mean, whatever the color was, this is the other color. 282 00:35:27.010 --> 00:35:29.510 Daniel Bienstock: Okay, now it's a little more visible. Alright, 283 00:35:29.896 --> 00:35:39.679 Daniel Bienstock: right! And then, for every node — you know, like this node, on the other hand — notice there are no exiting arcs, 284 00:35:40.839 --> 00:35:44.520 Daniel Bienstock: so once you're there, you're stuck there forever. Okay, 285 00:35:44.560 --> 00:35:47.579 Daniel Bienstock: and the reward you get — you don't get anything. 286 00:35:47.680 --> 00:35:50.089 Daniel Bienstock: And so on. This can happen. 287 00:35:50.740 --> 00:35:51.599 Daniel Bienstock: All right, 288 00:35:51.730 --> 00:35:52.610 Daniel Bienstock: now — 289 00:35:54.028 --> 00:36:05.029 Daniel Bienstock: later, during the summer, I have some meetings with the students here; we'll go over Markov decision processes in a little more detail. Okay? 290 00:36:05.040 --> 00:36:06.600 Daniel Bienstock: But now, alright, 291 00:36:07.362 --> 00:36:18.320 Daniel Bienstock: now that we understand what a system is and what a policy is and so on, let's look at a generic setup, not for Markov decision processes, but for reinforcement learning, 292 00:36:22.740 --> 00:36:27.379 Daniel Bienstock: okay, which is basically something that Williams was looking at. 293 00:36:27.630 --> 00:36:35.040 Daniel Bienstock: So we control a system. Okay? And now we can talk in generalities. We control a system 294 00:36:36.180 --> 00:36:38.340 Daniel Bienstock: by a policy. 295 00:36:40.150 --> 00:36:48.299 Daniel Bienstock: Let's call it pi. Okay, in the network learning case, the policy really was the network. Okay, 296 00:36:48.400 --> 00:36:51.060 Daniel Bienstock: the network instantiates a policy, 297 00:36:51.380 --> 00:36:53.330 Daniel Bienstock: and given a state, 298 00:36:53.410 --> 00:36:55.020 Daniel Bienstock: it tells you what to do. 299 00:36:56.800 --> 00:37:03.029 Daniel Bienstock: You pull a lever, a big lever. Given a state, that policy pi tells you what to do. 300 00:37:03.400 --> 00:37:06.299 Daniel Bienstock: Alright. Now the policy 301 00:37:06.540 --> 00:37:07.880 Daniel Bienstock: is controlled 302 00:37:09.550 --> 00:37:10.970 Daniel Bienstock: by parameters 303 00:37:14.070 --> 00:37:21.000 Daniel Bienstock: sigma, which is a high dimensional vector, and in the network case, well, these are the weights of the network. 304 00:37:21.940 --> 00:37:24.450 Daniel Bienstock: Okay. The system, 305 00:37:25.100 --> 00:37:26.979 Daniel Bienstock: given an application 306 00:37:30.710 --> 00:37:32.210 Daniel Bienstock: of the policy —
307 00:37:34.950 --> 00:37:36.540 Daniel Bienstock: this system 308 00:37:38.380 --> 00:37:39.779 Daniel Bienstock: will follow 309 00:37:40.880 --> 00:37:43.050 Daniel Bienstock: a stochastic trajectory 310 00:37:48.920 --> 00:37:52.019 Daniel Bienstock: that I'm going to call tau. 311 00:37:53.810 --> 00:38:02.630 Daniel Bienstock: Okay? In the network case it's stochastic, well, because at the very end, the very last layer, this softmax layer and so on — that's a probability. 312 00:38:02.720 --> 00:38:04.279 Daniel Bienstock: Okay, it's stochastic. 313 00:38:04.670 --> 00:38:11.309 Daniel Bienstock: We sample an action to take according to the probabilities that the network has produced. 314 00:38:12.110 --> 00:38:19.869 Daniel Bienstock: We run the state as the input; it produces these numbers, which are probabilities. 315 00:38:20.160 --> 00:38:24.100 Daniel Bienstock: We sample from the distribution, and we get the desired action. 316 00:38:24.860 --> 00:38:28.088 Daniel Bienstock: Okay? And let's use a notation: 317 00:38:28.730 --> 00:38:31.429 Daniel Bienstock: pi of tau is the probability 318 00:38:31.920 --> 00:38:34.050 Daniel Bienstock: of that trajectory. 319 00:38:35.250 --> 00:38:41.569 Daniel Bienstock: In a more general reinforcement learning setup, there could be stochastics at multiple steps. 320 00:38:42.790 --> 00:38:45.020 Daniel Bienstock: Okay? And so, 321 00:38:45.716 --> 00:38:48.339 Daniel Bienstock: at termination, at termination, 322 00:38:52.500 --> 00:38:54.110 Daniel Bienstock: we get a reward 323 00:38:56.080 --> 00:38:58.920 Daniel Bienstock: that I'm going to call R of tau. 324 00:38:59.720 --> 00:39:19.160 Daniel Bienstock: That depends on the entire trajectory. Okay? In the network setup I was describing before, the reward does not depend on the entire trajectory: we just collect a reward at the very end, whether we win the game, pardon, or not. Okay, 325 00:39:19.480 --> 00:39:23.840 Daniel Bienstock: it was plus one or minus one. Alright. And so, 326 00:39:23.850 --> 00:39:26.430 Daniel Bienstock: maximize — the goal 327 00:39:27.400 --> 00:39:31.559 Daniel Bienstock: is to choose a policy, choose a policy, 328 00:39:32.860 --> 00:39:34.520 Daniel Bienstock: to maximize 329 00:39:34.800 --> 00:39:36.590 Daniel Bienstock: the expected reward, 330 00:39:39.230 --> 00:39:53.520 Daniel Bienstock: which is what? That's the integral. So this will be a little bit sloppy, the integral. Let me write the notation and then we will debug it. Okay, it's just a little sloppy, you know; it's elegantly sloppy. 331 00:39:58.050 --> 00:40:05.409 Daniel Bienstock: Okay, this is our notation for saying what we are saying. Okay, look, we are sampling trajectories 332 00:40:06.350 --> 00:40:18.809 Daniel Bienstock: with some distribution. So this is like a stochastic integral: pi is the probability of sampling trajectory tau, and then we get the reward. So this is the expected reward. And we want to maximize this. 333 00:40:19.730 --> 00:40:22.259 Daniel Bienstock: Okay, it's only a little bit sloppy. 334 00:40:22.500 --> 00:40:27.760 Daniel Bienstock: It's sloppy enough to scandalize somebody who's serious about stochastic processes, 335 00:40:28.125 --> 00:40:31.180 Daniel Bienstock: but it's correct enough that we understand what it says. 336 00:40:31.620 --> 00:40:55.119 Daniel Bienstock: We want to maximize this. How do we maximize this?
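A cleaned-up version of the integral being written here, in the notation above (tau a trajectory, pi_sigma(tau) its probability under the policy with parameters sigma, R(tau) its reward); a reconstruction of the board notation, writing the dependence on sigma explicitly:

\[ J(\sigma) \;=\; \mathbb{E}_{\tau \sim \pi_\sigma}\big[ R(\tau) \big] \;=\; \int \pi_\sigma(\tau)\, R(\tau)\, d\tau . \]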
Well, Williams, back, you know, I don't know, 100 years ago — whatever, 1992 — he said: Look, I have no clue what we are doing. Let's take gradient steps. Okay, this is long before modern AI and ML and GPUs and this and that, before, you know, anything existed. He said: Look, this is so hard to do. 337 00:40:55.120 --> 00:41:04.950 Daniel Bienstock: I know how to take gradients, you know. Let's take small gradient steps: start with a policy and correct it. Okay, and now the thing is, well, 338 00:41:05.140 --> 00:41:09.520 Daniel Bienstock: how do we take the gradient of something like this? 339 00:41:09.830 --> 00:41:12.189 Daniel Bienstock: So we want to take the gradient. 340 00:41:13.140 --> 00:41:21.299 Daniel Bienstock: Oh boy, it seems like we're actually going to make it through the lecture on time? Excellent! The gradient of the expected reward, the gradient of this. 341 00:41:25.910 --> 00:41:27.270 Daniel Bienstock: Let me copy this, 342 00:41:30.470 --> 00:41:31.290 Daniel Bienstock: paste, 343 00:41:31.400 --> 00:41:38.949 Daniel Bienstock: and — well, you know, let's not review what this means; it's in the prior slide. 344 00:41:39.060 --> 00:41:42.939 Daniel Bienstock: Okay? Let's put aside, 345 00:41:43.240 --> 00:41:48.910 Daniel Bienstock: or let's ignore for the time being, the details toward the bottom here. 346 00:41:49.540 --> 00:41:54.690 Daniel Bienstock: Now we are sampling. If we have a policy, if we have a policy, which is a trained network, 347 00:41:55.820 --> 00:42:00.929 Daniel Bienstock: we are sampling games. Okay? Those are our trajectories. A game is a trajectory. 348 00:42:02.080 --> 00:42:14.520 Daniel Bienstock: Okay? And we want to maximize the expected reward that we would get if we keep playing many games over and over and over again. You know, we are sampling from the same distribution for a given trained network. 349 00:42:15.118 --> 00:42:21.469 Daniel Bienstock: You know, we're playing a game; at each move of the game we're getting a certain move with a certain probability, 350 00:42:21.540 --> 00:42:24.929 Daniel Bienstock: and at termination of the game we get a reward. 351 00:42:25.460 --> 00:42:27.910 Daniel Bienstock: And so we keep sampling games, 352 00:42:29.340 --> 00:42:33.180 Daniel Bienstock: and we want to somehow maximize the expected reward 353 00:42:34.240 --> 00:42:36.320 Daniel Bienstock: by first order methods. 354 00:42:36.580 --> 00:42:53.649 Daniel Bienstock: Okay? So we need to be able to compute something like that: a gradient of an integral, you know, which combines 2 things that probably many of my students don't like. Alright, so how do we do that? Okay, so let me copy that and move on to the next page. 355 00:43:00.900 --> 00:43:02.569 Daniel Bienstock: So we want to do this. 356 00:43:03.880 --> 00:43:23.310 Daniel Bienstock: Okay? And let's refresh our memory. The taus — we are sampling trajectories according to our policy; R of tau is the terminal reward that we get, or the reward that we get as we go through the trajectory; and pi of tau is the probability that we sample that particular trajectory. 357 00:43:24.540 --> 00:43:37.100 Daniel Bienstock: Okay, so alright, let's see if we can do some math over here. Okay, you want to take the gradient. So let's break some eggs. We can immediately say, this is equal to this. 358 00:43:45.900 --> 00:44:02.599 Daniel Bienstock: Okay, so I switched the gradient with the integral.
I'm sure there are hundreds of pages of mathematics that say, well, in order for you to do that, you have to satisfy A, B, C, D, and E, and so on. And I'm saying, yes, I satisfy everything. I'm happy with this. Okay, 359 00:44:03.760 --> 00:44:04.690 Daniel Bienstock: alright. 360 00:44:04.720 --> 00:44:06.969 Daniel Bienstock: So how do we do that? Okay, 361 00:44:09.250 --> 00:44:23.740 Daniel Bienstock: alright. So now let's be creative. Okay, there's a proof. What am I doing here? Okay, what am I doing here? I'm doing something that is called the policy gradient theorem, 362 00:44:24.150 --> 00:44:41.849 Daniel Bienstock: which I think was proved by this guy Williams that I had before. But it's possible — some people say that actually these 2 other guys, Barto and Sutton, had already proved it, or outlined it, or something. Okay. But this is the policy gradient theorem. 363 00:44:42.630 --> 00:44:56.310 Daniel Bienstock: Okay, the policy gradient theorem, or gradient policy theorem, is like one of the golden — this is the eleventh commandment of the AI community, the traditional AI community. 364 00:44:56.350 --> 00:45:00.027 Daniel Bienstock: This is what they believe in. So, alright. So let's do this. Okay, 365 00:45:00.780 --> 00:45:14.380 Daniel Bienstock: now, alright, I want you to think about this and what I'm going to be doing here. In the integral that is fully written up at the top on the right, I'm sampling trajectories. 366 00:45:14.530 --> 00:45:18.040 Daniel Bienstock: So let's say that the tau has been sampled. Okay, 367 00:45:18.550 --> 00:45:20.550 Daniel Bienstock: I have a given trajectory. 368 00:45:21.870 --> 00:45:24.019 Daniel Bienstock: It has a certain reward. 369 00:45:25.040 --> 00:45:26.180 Daniel Bienstock: So 370 00:45:26.340 --> 00:45:32.269 Daniel Bienstock: let me write this, and then let's see if we agree. This is equal to the gradient of the probability, 371 00:45:32.710 --> 00:45:33.710 Daniel Bienstock: okay, 372 00:45:34.200 --> 00:45:35.750 Daniel Bienstock: times the reward. 373 00:45:36.740 --> 00:45:45.689 Daniel Bienstock: And again, I want you to think of the integral at the top right as one where — an integral is always like a big sum. Okay, 374 00:45:46.302 --> 00:45:51.930 Daniel Bienstock: for any given tau — it's a trajectory — its reward is, well, given. 375 00:45:52.240 --> 00:46:05.320 Daniel Bienstock: The only functions of the parameters sigma are the probabilities, not the reward of the trajectory, and this is why I can write the integral at the bottom left. 376 00:46:05.360 --> 00:46:10.369 Daniel Bienstock: The rewards do not depend on the probabilities, given the trajectory. 377 00:46:11.930 --> 00:46:20.980 Daniel Bienstock: Okay, it took me like 5 days to understand this. And so what does this equal? So now let me multiply and divide 378 00:46:22.890 --> 00:46:24.120 Daniel Bienstock: by that, 379 00:46:24.140 --> 00:46:25.720 Daniel Bienstock: and I have this. 380 00:46:29.690 --> 00:46:45.250 Daniel Bienstock: Okay, I multiplied and divided. Truth be told, the probability pi depends on the sigmas; I'm skipping that to make the notation lighter. Remember, in the network training case, the probability is output by the network, 381 00:46:45.350 --> 00:46:51.249 Daniel Bienstock: and the network depends on the sigmas. Okay? And that's what we are changing — times the reward,
382 00:46:51.950 --> 00:47:05.170 Daniel Bienstock: times d tau. Okay, d tau. And now, what is this equal to? So this animal here, this animal here — that is the gradient, with respect to sigma, of the log. Finally. 383 00:47:06.010 --> 00:47:09.060 Daniel Bienstock: Okay. Times pi of tau, 384 00:47:09.510 --> 00:47:11.180 Daniel Bienstock: times R of tau, 385 00:47:11.550 --> 00:47:18.310 Daniel Bienstock: d tau. Okay? And now, what is this animal? Okay, remember, pi is the probability of trajectory 386 00:47:18.874 --> 00:47:28.169 Daniel Bienstock: tau. This is the expectation. This is the expectation of the gradient with respect to sigma of the log of the probability, 387 00:47:30.320 --> 00:47:31.990 Daniel Bienstock: times R. 388 00:47:33.720 --> 00:47:34.730 Daniel Bienstock: That's it. 389 00:47:35.090 --> 00:47:43.810 Daniel Bienstock: It's the expectation of that quantity. The pi here entered because we are taking an expectation. That's all. 390 00:47:44.230 --> 00:47:49.739 Daniel Bienstock: Okay, that thing, the integral and the pi, means that I'm taking an expectation. 391 00:47:51.100 --> 00:47:54.530 Daniel Bienstock: And this is this famous gradient policy theorem, 392 00:47:55.510 --> 00:47:58.249 Daniel Bienstock: or policy gradient, policy gradient. 393 00:48:03.500 --> 00:48:04.580 Daniel Bienstock: Okay, 394 00:48:04.810 --> 00:48:09.800 Daniel Bienstock: alright. And so if we go back to what we were doing before, 395 00:48:09.820 --> 00:48:23.580 Daniel Bienstock: you can see here, in this term over here, in the stochastic term, taking this average, this average — let me leave out the alpha. 396 00:48:24.100 --> 00:48:27.940 Daniel Bienstock: Let me leave that out, or rather, let me write that outside. 397 00:48:30.630 --> 00:48:40.969 Daniel Bienstock: Okay, the stuff that is in yellow: this is an estimate of that expectation. Think about the central limit theorem, okay, applied to a function: 398 00:48:41.490 --> 00:48:46.650 Daniel Bienstock: to take an average, you just take a large number of samples, and then you average them. 399 00:48:47.440 --> 00:48:51.779 Daniel Bienstock: And so here I'm taking a large — well, 128 — 400 00:48:51.810 --> 00:48:53.780 Daniel Bienstock: samples of the gradient 401 00:48:54.660 --> 00:48:57.859 Daniel Bienstock: of what? Of the log, times the reward. 402 00:48:58.910 --> 00:48:59.830 Daniel Bienstock: Okay. 403 00:49:00.510 --> 00:49:07.279 Daniel Bienstock: The reward does not depend on anything, really, given the trajectory. Given the trajectory. 404 00:49:07.420 --> 00:49:09.260 Daniel Bienstock: Okay? And, in fact, 405 00:49:09.500 --> 00:49:33.170 Daniel Bienstock: here I could have done that, right? The expectation is with respect to the weights, or the sigma — or the tau, pardon, the trajectory; it's the same thing. But if we think again in terms of the sampling interpretation of an expectation, the rewards don't depend, okay, 406 00:49:33.350 --> 00:49:35.440 Daniel Bienstock: on the weights. 407 00:49:36.450 --> 00:49:41.429 Daniel Bienstock: Alright, but I should put it in there, otherwise it's not clear what it is that I'm doing. 408 00:49:42.110 --> 00:49:44.329 Daniel Bienstock: But I wanted to highlight that. Alright. 409 00:49:44.430 --> 00:49:52.879 Daniel Bienstock: And this is the policy gradient algorithm. And that's what we are doing in here. We are sampling games. Okay, that's the tau. The game — a full game is a tau.
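Putting the steps just described into one chain — a reconstruction of the board derivation, writing the dependence on the parameters sigma explicitly:

\[
\nabla_\sigma \int \pi_\sigma(\tau)\, R(\tau)\, d\tau
= \int \nabla_\sigma \pi_\sigma(\tau)\, R(\tau)\, d\tau
= \int \frac{\nabla_\sigma \pi_\sigma(\tau)}{\pi_\sigma(\tau)}\, \pi_\sigma(\tau)\, R(\tau)\, d\tau
= \int \nabla_\sigma \log \pi_\sigma(\tau)\, \pi_\sigma(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\sigma}\!\big[\, \nabla_\sigma \log \pi_\sigma(\tau)\, R(\tau) \,\big].
\]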
410 00:49:52.900 --> 00:49:59.820 Daniel Bienstock: I'm taking the gradient of each of the samples, and I'm averaging them, and then the alpha is the learning rate, the step size. 411 00:50:00.600 --> 00:50:14.849 Daniel Bienstock: And so they're doing gradient ascent consistent with maximizing — well, maximizing the expected reward, maximizing the expected reward, which is winning the game. 412 00:50:18.210 --> 00:50:26.019 Daniel Bienstock: Alrighty. And this was step 2 of what they did. But, you know, it's something that is standard, this policy gradient theorem. 413 00:50:27.800 --> 00:50:29.880 Daniel Bienstock: Okay? Alright, 414 00:50:31.602 --> 00:50:50.930 Daniel Bienstock: alright. And what was the outcome of all of this, with these 128-game mini batches, one day, and 50 GPUs and all that? So they had started with the original trained system that was supposed to mimic what the masters did. 415 00:50:51.060 --> 00:50:57.960 Daniel Bienstock: Then they took the weights from that, and then they applied this reinforcement learning algorithm 416 00:50:58.010 --> 00:51:02.679 Daniel Bienstock: to actually improve their chances of winning games. 417 00:51:03.160 --> 00:51:11.600 Daniel Bienstock: So now they had a better game playing system. Okay? And what were the statistics 418 00:51:11.630 --> 00:51:14.630 Daniel Bienstock: for this? And so they said that they beat 419 00:51:15.870 --> 00:51:20.540 Daniel Bienstock: the original, the masters- 420 00:51:21.860 --> 00:51:23.050 Daniel Bienstock: trained 421 00:51:24.130 --> 00:51:25.330 Daniel Bienstock: system 422 00:51:26.050 --> 00:51:27.929 Daniel Bienstock: 80% of the time. 423 00:51:30.980 --> 00:51:31.670 matias: Before it was — 424 00:51:31.670 --> 00:51:32.160 Daniel Bienstock: Loud. 425 00:51:32.160 --> 00:51:34.280 matias: Before it was 57, right? 426 00:51:34.280 --> 00:51:39.760 Daniel Bienstock: No, no. So the 57 was accuracy in predicting individual moves. 427 00:51:41.270 --> 00:51:45.320 Daniel Bienstock: Now they used that system; now they played games against that. Correct. 428 00:51:45.320 --> 00:51:46.253 matias: I see. 429 00:51:46.720 --> 00:51:58.870 Daniel Bienstock: Now, the part that I skipped here — sorry, let me go back, because it is important — is that every, how many, every 500 steps or so, okay, 430 00:51:59.150 --> 00:52:02.739 Daniel Bienstock: they would take the network that they had just computed, 431 00:52:03.120 --> 00:52:06.170 Daniel Bienstock: and they would make that into a new opponent. 432 00:52:07.830 --> 00:52:14.599 Daniel Bienstock: Okay? And after a while they would have this set of previously developed opponents. 433 00:52:15.130 --> 00:52:22.230 Daniel Bienstock: And then, the next time that they played a game — like here, play a game against a prior adversary — they would pick a random one 434 00:52:23.050 --> 00:52:25.800 Daniel Bienstock: of the ones that they had previously trained. 435 00:52:26.680 --> 00:52:28.569 Daniel Bienstock: So now they are doing self play. 436 00:52:29.390 --> 00:52:30.290 Daniel Bienstock: Okay? 437 00:52:30.600 --> 00:52:37.529 Daniel Bienstock: So what else? Now, there was a system, an existing open source system, open source 438 00:52:38.210 --> 00:52:39.670 Daniel Bienstock: code. 439 00:52:39.900 --> 00:52:51.900 Daniel Bienstock: I think it was called Pachi.
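Stepping back for a moment, a minimal sketch of the self-play procedure described above: a snapshot of the current network added to an opponent pool every 500 or so steps, a mini batch of 128 games against a randomly chosen prior opponent, and an averaged gradient-of-log-probability-times-outcome step. The functions play_game and grad_log_prob are hypothetical placeholders (for the game simulator and the network's autodiff), and the structure is my reading of the description, not code from the paper.

import random

def reinforcement_self_play(sigma, play_game, grad_log_prob, alpha=1e-3,
                            n_iters=10_000, games_per_batch=128, snapshot_every=500):
    # sigma          : current network weights (any vector-like object supporting + and *)
    # play_game      : hypothetical; plays one game of (policy weights) vs (opponent weights)
    #                  and returns (trajectory, z) with z in {+1, -1, 0}
    # grad_log_prob  : hypothetical; returns sum_t grad_sigma log p_sigma(a_t | s_t) for one trajectory
    opponent_pool = [sigma]                       # start against the SL-trained network
    for it in range(n_iters):
        opponent = random.choice(opponent_pool)   # a random prior opponent
        step = 0
        for _ in range(games_per_batch):
            trajectory, z = play_game(sigma, opponent)
            step = step + grad_log_prob(sigma, trajectory) * z
        sigma = sigma + alpha * step / games_per_batch   # gradient ascent on the expected outcome
        if (it + 1) % snapshot_every == 0:
            opponent_pool.append(sigma)           # freeze a copy as a future opponent
    return sigma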
437 00:52:30.600 --> 00:52:37.529 Daniel Bienstock: So what else? Now, there was a system. It was an existing open-source system, open-source 438 00:52:38.210 --> 00:52:39.670 Daniel Bienstock: code. 439 00:52:39.900 --> 00:52:51.900 Daniel Bienstock: What was it called? I think it was called Pachi. I have no idea what Pachi means. Okay, and I don't know what language that's supposed to be. This was a simulation, Monte Carlo, 440 00:52:52.400 --> 00:52:53.890 Daniel Bienstock: Monte Carlo 441 00:52:54.960 --> 00:52:56.110 Daniel Bienstock: system. 442 00:52:56.380 --> 00:52:58.419 Daniel Bienstock: It would evaluate 443 00:53:00.210 --> 00:53:02.160 Daniel Bienstock: a hundred thousand 444 00:53:02.460 --> 00:53:04.110 Daniel Bienstock: moves per second, 445 00:53:05.740 --> 00:53:07.889 Daniel Bienstock: according to some criterion, 446 00:53:08.360 --> 00:53:12.450 Daniel Bienstock: okay, and choose what it decided was the best. 447 00:53:12.900 --> 00:53:17.890 Daniel Bienstock: Okay? And so it beat this guy. It beat 448 00:53:18.510 --> 00:53:19.720 Daniel Bienstock: Pachi 449 00:53:19.770 --> 00:53:21.920 Daniel Bienstock: 85% of the time. 450 00:53:22.860 --> 00:53:25.549 Daniel Bienstock: Okay, Pachi was, I guess, considered the best 451 00:53:25.990 --> 00:53:31.150 Daniel Bienstock: at that point. Okay? And they beat it 85% of the time, and the previous 452 00:53:33.600 --> 00:53:34.720 Daniel Bienstock: best 453 00:53:35.050 --> 00:53:37.930 Daniel Bienstock: from any other system was 12% of the time. 454 00:53:38.440 --> 00:53:52.349 Daniel Bienstock: So up to then, anybody who played against Pachi, at best they beat Pachi 12% of the time. But this system, this AlphaGo system, I mean, up to this point we are missing, 455 00:53:52.400 --> 00:53:54.330 Daniel Bienstock: we're missing 2 big things. 456 00:53:54.460 --> 00:53:58.290 Daniel Bienstock: It was already beating the best 85% of the time. 457 00:53:59.210 --> 00:54:03.750 Daniel Bienstock: Alright, which is great, but not the best that one can do. 458 00:54:04.210 --> 00:54:25.309 Daniel Bienstock: Okay. Now, there's more that we have to go into in the next lecture. Okay, there are 2 things. Let me outline them both. They are both very important. They go together. Okay, one thing is called, what is it called, reinforcement 459 00:54:26.880 --> 00:54:28.030 Daniel Bienstock: learning. 460 00:54:29.160 --> 00:54:34.189 Daniel Bienstock: So we know what that means by now. Of, something, networks. 461 00:54:36.680 --> 00:54:41.229 Daniel Bienstock: I'm going to call it value. This is new. What do I mean by this? 462 00:54:41.390 --> 00:54:44.039 Daniel Bienstock: So the last thing that I described 463 00:54:44.120 --> 00:54:48.800 Daniel Bienstock: was a system for playing a game. It would play games for you. 464 00:54:49.680 --> 00:54:53.229 Daniel Bienstock: Okay, what does this do? This predicts 465 00:54:55.020 --> 00:54:57.509 Daniel Bienstock: win or loss, win or lose, 466 00:54:58.050 --> 00:54:59.280 Daniel Bienstock: quickly. 467 00:55:01.670 --> 00:55:09.509 Daniel Bienstock: Okay, so this is no longer a system for playing games. This tells you, given a state of the board, 468 00:55:09.860 --> 00:55:12.569 Daniel Bienstock: if you play really well, will you win 469 00:55:12.960 --> 00:55:14.460 Daniel Bienstock: or not? 470 00:55:15.900 --> 00:55:16.626 Daniel Bienstock: Okay. 471 00:55:18.935 --> 00:55:25.449 Daniel Bienstock: We have to describe next lecture what this is. Okay? What are the goals? 472 00:55:26.353 --> 00:55:30.889 Daniel Bienstock: It should be very fast, should be very fast. Goal: 473 00:55:35.510 --> 00:55:37.180 Daniel Bienstock: make it very fast. 474 00:55:38.630 --> 00:55:45.020 Daniel Bienstock: The question is, why? Well, there's going to be another algorithm that's going to use this as a subroutine many times, 475 00:55:45.350 --> 00:55:51.180 Daniel Bienstock: many, many times. Why? Well, we'll have to explain. Okay. And it does not need 476 00:55:53.520 --> 00:55:55.319 Daniel Bienstock: to be super accurate.
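As a rough picture of what such a value function could look like, here is a toy sketch of a small network that maps a board position to a single win-probability estimate. This is not the AlphaGo architecture: the 9x9 board encoding, the single hidden layer, and the untrained random weights are assumptions made only to illustrate the shape of the thing, a position goes in, one fast number comes out.

import numpy as np

rng = np.random.default_rng(1)

BOARD = 9 * 9          # toy 9x9 board; each cell holds -1, 0, or +1
HIDDEN = 64            # arbitrary hidden-layer width for the sketch

# Randomly initialized weights; in practice these would be trained on
# (position, eventual win/loss) pairs collected from sampled games.
W1 = rng.normal(0, 0.1, size=(HIDDEN, BOARD))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(0, 0.1, size=HIDDEN)
b2 = 0.0

def value(board):
    """Map a board position to an estimated probability of winning from it."""
    x = board.reshape(-1).astype(float)
    h = np.maximum(0.0, W1 @ x + b1)      # one hidden layer with ReLU
    z = w2 @ h + b2
    return 1.0 / (1.0 + np.exp(-z))       # squash to (0, 1): estimated P(win)

# Example: evaluate an empty board
print(value(np.zeros((9, 9))))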
477 00:55:58.440 --> 00:56:05.730 Daniel Bienstock: So, for those of you who play any kind of card game or anything like that that requires many moves, 478 00:56:06.060 --> 00:56:07.390 Daniel Bienstock: many moves, 479 00:56:08.120 --> 00:56:10.040 Daniel Bienstock: or the stock market: 480 00:56:10.470 --> 00:56:21.240 Daniel Bienstock: okay, if you know that you're going to be correct at least some fraction of the time, and you can use that to reinforce your decisions when you actually make them, 481 00:56:21.250 --> 00:56:26.650 Daniel Bienstock: then maybe you can actually boost your chances of winning. So this is the spirit. 482 00:56:26.950 --> 00:56:50.780 Daniel Bienstock: Alright. So this is only one thing. This output went into the last thing, and probably the most important thing, where we want to spend a little more time. Okay, maybe the single most important component of the entire setup that they did. Everything that we did until now was a way to prepare, to develop the next, the final phase. 483 00:56:51.210 --> 00:56:57.650 Daniel Bienstock: Everything that we did so far is very, very important, so that the next phase gets off the ground properly. 484 00:56:57.950 --> 00:57:03.489 Daniel Bienstock: But it's going to be the most important phase of all. And that's called Monte Carlo tree search. 485 00:57:10.190 --> 00:57:14.109 Daniel Bienstock: Okay? And usually abbreviated like that, MCTS. 486 00:57:14.450 --> 00:57:19.690 Daniel Bienstock: And what is that? Okay? So we'll definitely go through these 2 things next lecture, 487 00:57:19.970 --> 00:57:25.470 Daniel Bienstock: we have to. So what is that? Okay, what is the tree that we are talking about here? 488 00:57:25.520 --> 00:57:30.759 Daniel Bienstock: So imagine a game like chess or Go, you know, and you play first. 489 00:57:31.430 --> 00:57:34.200 Daniel Bienstock: And now there are many moves that you can make. 490 00:57:35.240 --> 00:57:47.509 Daniel Bienstock: There are many moves that you can make. And so you can have a picture: you know, this is the beginning, and there are all these different moves, a whole bunch of moves that you can make. And now you don't know, okay, which one is a good move to make at the very beginning. 491 00:57:48.160 --> 00:57:57.660 Daniel Bienstock: Well, then you could simulate. For each of these moves, you could simulate what your opponent would do. Okay, and your opponent in each case will have a bunch of moves. 492 00:58:00.260 --> 00:58:06.219 Daniel Bienstock: So if you assume that your opponent is very intelligent, okay, how would they choose? 493 00:58:06.540 --> 00:58:09.700 Daniel Bienstock: Well, they could simulate what you would do in each case. 494 00:58:11.250 --> 00:58:16.910 Daniel Bienstock: And so if you continue like this, you're going to grow this humongous tree. 495 00:58:17.530 --> 00:58:22.490 Daniel Bienstock: It's going to be both very broad and probably way too deep. 496 00:58:24.700 --> 00:58:34.260 Daniel Bienstock: If somehow you could evaluate this tree all the way, eventually each branch, each branch, is going to terminate in some terminal state where somebody wins.
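The humongous tree being described is exactly what an exhaustive minimax evaluation would have to walk. Below is a small sketch on a made-up game (the ToyGame class is a stand-in, not Go) showing that even a modest branching factor and depth already mean hundreds of thousands of positions; Go's branching factor is in the hundreds and its games run far deeper, which is why a heuristic shortcut is needed.

def minimax(state, to_move, game, counter):
    """Exhaustively evaluate a game tree: +1 if the first player can force a win."""
    counter[0] += 1                       # count positions visited
    if game.is_terminal(state):
        return game.winner(state)         # +1 first player wins, -1 otherwise
    values = [minimax(game.next_state(state, m), -to_move, game, counter)
              for m in game.legal_moves(state)]
    return max(values) if to_move == 1 else min(values)

class ToyGame:
    """Stand-in game: fixed branching factor and depth, arbitrary winner rule."""
    def __init__(self, branching=5, depth=8):
        self.branching, self.depth = branching, depth
    def is_terminal(self, state):
        return len(state) == self.depth
    def winner(self, state):
        return 1 if sum(state) % 2 == 0 else -1
    def legal_moves(self, state):
        return range(self.branching)
    def next_state(self, state, move):
        return state + (move,)

game = ToyGame()
counter = [0]
print(minimax((), 1, game, counter), "positions visited:", counter[0])
# Even with branching 5 and depth 8, this visits nearly half a million positions.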
497 00:58:35.170 --> 00:58:44.530 Daniel Bienstock: Okay? And if you could, if you could visualize the entire tree all at once, then you could pick the best move, in some sense, 498 00:58:44.970 --> 00:59:06.250 Daniel Bienstock: right? And the worth of a move, that move, would be dependent upon 2 things, right? It's going to be a blend. You know, in terms of a binary game like chess or Go, you want to know, okay, in the subtree, how likely is it that I win? That is to say, how many times do I win, 499 00:59:06.770 --> 00:59:08.830 Daniel Bienstock: and how many times do I not win? 500 00:59:09.870 --> 00:59:17.849 Daniel Bienstock: So we have to combine the 2 things. There's some kind of a stochastic interpretation and some kind of a value interpretation. Okay, 501 00:59:18.326 --> 00:59:22.389 Daniel Bienstock: and you have to somehow balance the 2 of them. 502 00:59:22.490 --> 00:59:32.969 Daniel Bienstock: So Monte Carlo tree search is a way to heuristically, heuristically, evaluate a subtree of this tree, 503 00:59:33.370 --> 00:59:42.699 Daniel Bienstock: where you are repeatedly evaluating some branch of the tree, and then you're adjusting your estimate of probabilities, in particular 504 00:59:42.760 --> 00:59:44.260 Daniel Bienstock: of success. 505 00:59:45.676 --> 01:00:12.929 Daniel Bienstock: If you had a game that was not a binary game, a game that instead gives you a number, a value, not plus or minus one but some number, okay, then it's a more delicate process, because you don't just want to take the branch that has the single node, the single leaf, with the highest possible value. It's a blend of that and how many times, you know, some kind of expectation, and so on. 506 01:00:13.140 --> 01:00:16.680 Daniel Bienstock: And so Monte Carlo tree search is a collection of heuristics 507 01:00:16.880 --> 01:00:21.339 Daniel Bienstock: to try to narrow down the tree, both in terms of its breadth, 508 01:00:21.390 --> 01:00:25.370 Daniel Bienstock: and also rapidly getting to the bottom, in some sense. 509 01:00:27.055 --> 01:00:37.649 Daniel Bienstock: And so, of these last 2 elements, one is the reinforcement learning of value networks, which builds upon the very first thing that we saw here, or it starts from, 510 01:00:38.430 --> 01:00:41.040 Daniel Bienstock: from that supervised learning, 511 01:00:41.734 --> 01:00:46.829 Daniel Bienstock: and then it also uses this to give you good moves, 512 01:00:46.950 --> 01:01:09.209 Daniel Bienstock: give you good moves that are likely to win games. So the first tool, the prediction of what a good move would be, that, in a fast mode, is going to be used to develop a quick estimate as to whether, given a state of the board, you're likely to win or not, very quickly, 513 01:01:09.390 --> 01:01:17.670 Daniel Bienstock: and the second one, the reinforcement learning of policy networks, will be used to begin the Monte Carlo tree search. 514 01:01:19.142 --> 01:01:26.200 Daniel Bienstock: And as we take a dive, you know, we'll be taking dives down this tree. Okay, 515 01:01:26.280 --> 01:01:33.689 Daniel Bienstock: then occasionally, occasionally, we may terminate the dive early. Okay, the game is not over yet, 516 01:01:33.760 --> 01:01:39.279 Daniel Bienstock: but then we evaluate the chances of winning using this first component, using that, 517 01:01:40.120 --> 01:01:44.029 Daniel Bienstock: very quickly. Okay. And we're going to take many, many such dives.
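As a preview of the balancing act just described, here is a minimal sketch of the kind of selection rule used inside MCTS: an observed win rate (the "stochastic" part, built from visit counts) plus an exploration bonus driven by the policy network's prior. AlphaGo's published rule is a variant of this idea (a PUCT-style formula); the constant C_PUCT, the Node fields, and the toy numbers below are illustrative, not their implementation.

import math

C_PUCT = 1.5   # exploration constant; illustrative value, not tuned

class Node:
    """One position in the search tree."""
    def __init__(self, prior):
        self.prior = prior        # move probability from the policy network
        self.visits = 0           # how many dives have passed through this node
        self.total_value = 0.0    # sum of outcomes / value estimates backed up so far

    def q(self):
        # empirical win estimate: the "stochastic" part
        return self.total_value / self.visits if self.visits else 0.0

def select_child(children):
    """Pick the child that balances observed win rate against the policy prior.
    Rarely visited moves with a high prior receive a large exploration bonus."""
    total_visits = sum(c.visits for c in children.values())
    def score(c):
        u = C_PUCT * c.prior * math.sqrt(total_visits + 1) / (1 + c.visits)
        return c.q() + u
    return max(children, key=lambda move: score(children[move]))

# Toy usage: three candidate moves with priors from a policy network
children = {"a": Node(0.6), "b": Node(0.3), "c": Node(0.1)}
children["a"].visits, children["a"].total_value = 10, 4.0   # already visited, mediocre results
print(select_child(children))

A dive repeatedly applies select_child on the way down the tree, and then backs the outcome, or a value-network estimate if the dive is cut short early, up into visits and total_value.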
518 01:01:44.460 --> 01:01:54.359 Daniel Bienstock: We have to discuss how this reinforcement learning, this reinforcement learning of value networks, actually works, and how Monte Carlo tree search actually works. 519 01:01:54.660 --> 01:01:59.769 Daniel Bienstock: Okay, there's a ton of heuristics in there. This is very, very interesting. 520 01:01:59.830 --> 01:02:01.640 Daniel Bienstock: They got it to work. 521 01:02:02.020 --> 01:02:14.020 Daniel Bienstock: So we'll see next time how they did this, and then, beginning next time and probably finishing the time after, we'll see how the DeepMind people moved from this setup, 522 01:02:14.419 --> 01:02:17.829 Daniel Bienstock: which, incorporating all of these things, they called AlphaGo, 523 01:02:17.980 --> 01:02:22.720 Daniel Bienstock: to something where they let go of the masters. There were no more masters. 524 01:02:23.000 --> 01:02:28.380 Daniel Bienstock: Okay, no more, no more training of their initial step 525 01:02:28.410 --> 01:02:40.079 Daniel Bienstock: using the masters' data. There we go. Instead, they kind of combined the 2 steps, this one and that one, by doing self-play. 526 01:02:40.720 --> 01:02:43.960 Daniel Bienstock: They just kept playing against themselves over and over again. 527 01:02:43.980 --> 01:02:47.770 Daniel Bienstock: And as they improved their algorithms, they would 528 01:02:48.070 --> 01:02:57.219 Daniel Bienstock: send these improved algorithms into this basket of previously developed algorithms, against which they keep playing games, 529 01:02:57.330 --> 01:02:59.059 Daniel Bienstock: okay, and always improving. 530 01:03:00.200 --> 01:03:18.869 Daniel Bienstock: And when they completed that task, and we'll see that, maybe next lecture or the lecture after that, that's when they got this system. They beat everybody. Okay, they beat all the human masters, you know, by some ridiculous margins: like, you know, they would have a championship, I don't know, 5 games, and they would win 5 to nothing. 531 01:03:19.190 --> 01:03:23.119 Daniel Bienstock: They beat all other programs a hundred percent of the time. 532 01:03:23.310 --> 01:03:26.289 Daniel Bienstock: I'm not sure there's any competition today. 533 01:03:26.870 --> 01:03:37.230 Daniel Bienstock: My understanding is that this technology is still getting developed, not necessarily to play silly games, but to do other things, and it's certainly not open source. 534 01:03:38.968 --> 01:03:47.689 Daniel Bienstock: And how much we can divine by reading their papers is probably going to be somewhat limited, but it's still very entertaining. 535 01:03:48.160 --> 01:03:49.250 Daniel Bienstock: Alright, 536 01:03:49.500 --> 01:03:58.489 Daniel Bienstock: that's it for today. And I can see that the electricity company never called me. Hopefully, they're not going to cut power. And 537 01:03:58.906 --> 01:04:05.409 Daniel Bienstock: I'll see you guys on Tuesday. I guess the semester is over, probably, for everybody. 538 01:04:05.470 --> 01:04:13.229 Daniel Bienstock: But we have some ground to cover. You know, there are some of these ML-related lectures, and then we should do something about CUDA, 539 01:04:13.260 --> 01:04:15.169 Daniel Bienstock: because I promised that we would. 540 01:04:16.450 --> 01:04:18.320 Daniel Bienstock: Okay, that's it. 541 01:04:18.790 --> 01:04:20.800 Daniel Bienstock: Hey, yeah, take care.
542 01:04:22.110 --> 01:04:23.579 Daniel Bienstock: and I'll stay on 543 01:04:24.360 --> 01:04:25.909 Daniel Bienstock: with Blake. 544 01:04:26.130 --> 01:04:26.720 Blake: Yeah, we're. 545 01:04:26.720 --> 01:04:28.440 Daniel Bienstock: We are, we are on next, right, Blake? 546 01:04:28.440 --> 01:04:29.550 Blake: Yeah. Yep. 547 01:04:29.550 --> 01:04:31.470 Daniel Bienstock: Alright! Give me 1 min. 548 01:04:31.660 --> 01:04:34.089 Blake: Yeah, that's alright. I was gonna go grab a jacket. 549 01:04:34.210 --> 01:04:35.250 Blake: Sorry. Okay. 550 01:04:35.480 --> 01:04:36.940 Daniel Bienstock: Alright, I'm gonna get a drink. 551 01:04:37.250 --> 01:04:38.330 Blake: Okay. Be right back. 552 01:04:39.830 --> 01:04:41.389 Daniel Bienstock: Let me stop recording.