WEBVTT

1
00:00:01.650 --> 00:00:02.820
Daniel Bienstock: Okay?

2
00:00:07.653 --> 00:00:08.546
Daniel Bienstock: Alright,

3
00:00:09.500 --> 00:00:12.479
Daniel Bienstock: Before we get started there's a chance

4
00:00:12.540 --> 00:00:14.373
Daniel Bienstock: that I'm going to get

5
00:00:15.290 --> 00:00:21.899
Daniel Bienstock: called by the electricity company. They have to come and do some service, and I'm just. I'm waiting for them to call back.

6
00:00:22.660 --> 00:00:30.760
Daniel Bienstock: Okay, let me open this up alright. So now let's review where we were last time. We have a bit of ground to cover today.

7
00:00:30.920 --> 00:00:39.189
Daniel Bienstock: we're going to need at least one more lecture in these Ml tools, if not 2 more lectures

8
00:00:39.390 --> 00:00:42.110
Daniel Bienstock: and a quick review. Last lecture.

9
00:00:46.650 --> 00:00:52.299
Daniel Bienstock: we saw deep learning. I encourage you to. Look at the little Pdf.

10
00:00:52.900 --> 00:00:58.750
Daniel Bienstock: That I had uploaded. This is a deep learning using gradient descent

11
00:00:59.556 --> 00:01:03.530
Daniel Bienstock: to me, minimize mean square error

12
00:01:04.209 --> 00:01:05.949
Daniel Bienstock: use training data

13
00:01:06.020 --> 00:01:16.780
Daniel Bienstock: that the data consists in the example that we considered a vector each. Each element of the data consists of a vector and a value.

14
00:01:16.790 --> 00:01:30.179
Daniel Bienstock: And now we are trying to use the vectors to predict the values, and the network will predict the number. And the error is squared. And now, over all the data samples, we take the average of these square errors, the mean square error

15
00:01:30.586 --> 00:01:35.340
Daniel Bienstock: and the the variables and the optimization problem are the network weights.

16
00:01:36.315 --> 00:01:45.330
Daniel Bienstock: And we were using, or in the community gradient descent. First order methods are used to try to minimize this function

17
00:01:45.350 --> 00:01:52.839
Daniel Bienstock: and thus build a network or put voids on the network, and then that gets used to make predictions.

18
00:01:53.620 --> 00:01:58.339
Daniel Bienstock: Then we described how to use a variation, a variation on this

19
00:01:58.430 --> 00:02:00.789
Daniel Bienstock: to do classification.

20
00:02:00.950 --> 00:02:06.019
Daniel Bienstock: and then classification. What you have is you have a network

21
00:02:06.643 --> 00:02:12.626
Daniel Bienstock: and the the last layer, the last layer of the network.

22
00:02:14.590 --> 00:02:15.910
Daniel Bienstock: outputs

23
00:02:15.920 --> 00:02:21.880
Daniel Bienstock: one number per each class that we trying to use to classify.

24
00:02:22.050 --> 00:02:40.629
Daniel Bienstock: So one number of those one number per class. So if you're trying to use images as inputs, and then, describe what is in the image. It could be different types of animals. And so let's say, there are 5 choices. So the last layer of the network will have 5 neurons outputting numbers.

25
00:02:41.029 --> 00:02:57.190
Daniel Bienstock: And then, for example, you can take the maximum of those numbers, and that is the one that is the that is the prediction. And the the network is trained the same way. More generally you take these 5 numbers, and out of them you fashion a probability.

26
00:02:57.330 --> 00:03:00.150
Daniel Bienstock: discrete probability distribution.

27
00:03:00.390 --> 00:03:20.620
Daniel Bienstock: for example, using this what I call the soft Max, this exponential, divided by the sum of the exponentials. So you take the exponential for each of the numbers you divide by the sum of these 5 exponentials, and that gives you a probability distribution. And now you use that probability distribution to make a stochastic prediction.

28
00:03:21.280 --> 00:03:26.599
Daniel Bienstock: And again, you can, you can, use this metric

29
00:03:26.610 --> 00:03:28.480
Daniel Bienstock: as compared to the truth.

30
00:03:28.530 --> 00:03:33.140
Daniel Bienstock: to set up an optimization problem to train the network.

31
00:03:34.010 --> 00:03:50.859
Daniel Bienstock: Okay? Now, let's move on. Past all of this we were discussing the application which is the application of interest here, which is how to play a game like chess, or in particular play go, which is considered more, more difficult.

32
00:03:51.010 --> 00:03:53.710
Daniel Bienstock: then then tests.

33
00:03:53.910 --> 00:04:06.304
Daniel Bienstock: And so the first topic that we started to look at in the last lecture is what we call supervised learning of policy network. So this is a a very loaded terminology.

34
00:04:06.730 --> 00:04:11.840
Daniel Bienstock: let's see what it what it means, and then we'll see how it goes was used

35
00:04:11.930 --> 00:04:30.970
Daniel Bienstock: by the Alphago people. By the deepmind people to create the first element in this setup that they call Alphago. That proved very successful. Not the most successful we get. We'll get to that in a future lecture, maybe next lecture or the one after that.

36
00:04:31.110 --> 00:04:34.280
Daniel Bienstock: But this is supervised learning

37
00:04:35.760 --> 00:04:37.679
Daniel Bienstock: of policy networks.

38
00:04:41.000 --> 00:04:58.939
Daniel Bienstock: Okay, the learning is what we have been discussing before. This is the AI term networks is because we have a network supervised here is because we have input data which is considered to be a kind of ground truth.

39
00:04:59.140 --> 00:05:05.159
Daniel Bienstock: In the case of the goal playing algorithms, the supervision

40
00:05:05.814 --> 00:05:11.720
Daniel Bienstock: was provided by data coming from games played by masters of the game.

41
00:05:11.790 --> 00:05:26.460
Daniel Bienstock: And policy is yet another loaded term. Okay, will be exposed to this term multiple times today. And we'll come to understand what that means. This is a term that predates to some extent this AI community.

42
00:05:26.630 --> 00:05:32.699
Daniel Bienstock: And so what is the setup? What is the setup? You know we have pairs. This is the data that we have.

43
00:05:34.390 --> 00:05:35.900
Daniel Bienstock: We have pairs

44
00:05:36.550 --> 00:05:44.149
Daniel Bienstock: of the form action state. And so here, state is the state of a system.

45
00:05:44.460 --> 00:05:48.449
Daniel Bienstock: and the system, for example, could be the board for the game.

46
00:05:49.520 --> 00:05:57.110
Daniel Bienstock: My understanding is for for the goal playing algorithms the State was actually a picture

47
00:05:57.210 --> 00:06:03.200
Daniel Bienstock: of the board with the pieces in place. A picture, okay? And a is action

48
00:06:05.401 --> 00:06:07.870
Daniel Bienstock: taken, let's say, by an expert.

49
00:06:08.530 --> 00:06:13.039
Daniel Bienstock: the correct action to take when the system is in this given state?

50
00:06:14.660 --> 00:06:21.690
Daniel Bienstock: And then how does how? What is a supervised learning of a policy network. So here's our network.

51
00:06:24.210 --> 00:06:40.930
Daniel Bienstock: Okay? And we have all these layers and so on multiple layers. We have. These are parameters. let's not call it Theta, actually, because they they chose not to call theta theta. Let's call it that Greek letter. So we have the state

52
00:06:42.580 --> 00:06:44.449
Daniel Bienstock: the state coming in.

53
00:06:44.730 --> 00:06:55.579
Daniel Bienstock: And it's going to be mapped as a vector into the input layer. And out of the output layer comes, not an action con comes up. Probability distribution.

54
00:06:56.500 --> 00:07:01.999
Daniel Bienstock: Okay? The probability that we should take action. A given state. S,

55
00:07:02.830 --> 00:07:09.489
Daniel Bienstock: and we use the the data that we have the pairs that we have to provide the training data.

56
00:07:11.270 --> 00:07:18.053
Daniel Bienstock: Okay? And so we run this classification network.

57
00:07:19.580 --> 00:07:21.629
Daniel Bienstock: it makes a prediction

58
00:07:21.660 --> 00:07:23.740
Daniel Bienstock: for the correct action to take.

59
00:07:24.740 --> 00:07:33.539
Daniel Bienstock: and then we compare it to the recommended action recommended by the expert, and thus we obtain a metric.

60
00:07:33.920 --> 00:07:57.070
Daniel Bienstock: Okay? However, the way that this would got was got used was by doing gradient ascent gradient ascent. And what do I mean by that gradient ascent to maximize maximize. What maximize the probability that we make the correct action given any any of the States?

61
00:07:57.190 --> 00:08:09.920
Daniel Bienstock: Okay, so let me write the formula for what would amount to the gradient ascent, and we'll explain what we did. I uploaded the first paper by the Alphago people. Oop.

62
00:08:09.920 --> 00:08:32.599
Daniel Bienstock: There are more people now by the Alphago people. There's a section toward the end of the paper that's called results or methods, pardon methods, and in it they describe in somewhat more detail what exactly it is that they did. Okay, so let me. And it's best to look at what they did in terms of the formulas which we're going to do to understand exactly what it is that we are doing.

63
00:08:32.760 --> 00:08:35.239
Daniel Bienstock: Okay, this is gradient ascent

64
00:08:37.580 --> 00:08:39.320
Daniel Bienstock: rather than decent.

65
00:08:41.700 --> 00:08:43.940
Daniel Bienstock: Okay, gradient ascent.

66
00:08:45.570 --> 00:08:48.169
Daniel Bienstock: And let me write the formula.

67
00:08:49.040 --> 00:08:54.529
Daniel Bienstock: And I'll explain what it is that that we are doing. So each iteration

68
00:08:56.430 --> 00:09:18.320
Daniel Bienstock: is as follows, so delta of the segment, these are the network weights. Okay, these are the numbers that go in the network. Okay, which is our optimization problem is what? Okay? So it's a number. Okay, Alpha, Alpha is what we will call the learning rate. It's the step size, basically for a gradient method.

69
00:09:18.750 --> 00:09:21.720
Daniel Bienstock: But I have this. M. Now, I have a son.

70
00:09:22.900 --> 00:09:49.100
Daniel Bienstock: Okay of what? And I take the gradient the notation I'm using is more standard that what they have in their paper. But you'll understand what I mean. Gradient, with respect to the network weight is our our variables. This is the gradient part of the gradient descent method. Gradient of what function, and I'm going to write the function here. The log, the log of the probability

71
00:09:50.852 --> 00:09:53.869
Daniel Bienstock: of a K given SK.

72
00:09:53.940 --> 00:10:02.699
Daniel Bienstock: So what is all of this. So ak comma should should do it the other way around. Aksk.

73
00:10:03.530 --> 00:10:07.019
Daniel Bienstock: SKAK. This is the training data.

74
00:10:09.390 --> 00:10:12.589
Daniel Bienstock: These are, you know, this is a state.

75
00:10:13.600 --> 00:10:15.450
Daniel Bienstock: and this is the action

76
00:10:16.620 --> 00:10:19.059
Daniel Bienstock: by a master of it. Again.

77
00:10:20.430 --> 00:10:24.809
Daniel Bienstock: okay. And that K is what K. Is roughly 30 million.

78
00:10:26.550 --> 00:10:29.690
Daniel Bienstock: Well, roughly, actually, 1 million

79
00:10:30.440 --> 00:10:32.750
Daniel Bienstock: or 1.5 something like that.

80
00:10:34.600 --> 00:10:37.149
Daniel Bienstock: We write it the standard way. So we have.

81
00:10:37.150 --> 00:10:38.599
matias: Is that M. Then.

82
00:10:38.890 --> 00:10:42.650
Daniel Bienstock: No, that's not them. Oh, no, no good point

83
00:10:43.888 --> 00:11:01.830
Daniel Bienstock: one so good point. Okay. What is m is M is a mini batch. Okay? So let me write all the data this is. If you look at the paper, you'll see how highly dimensional they paper is too many training wheels. This is training data training data

84
00:11:05.435 --> 00:11:10.310
Daniel Bienstock: is about that large training data is about that large M

85
00:11:10.750 --> 00:11:12.270
Daniel Bienstock: is Mini batch

86
00:11:15.180 --> 00:11:18.379
Daniel Bienstock: batch, which was 16.

87
00:11:19.360 --> 00:11:28.260
Daniel Bienstock: So this is a case of stochastic gradient. Okay, we take a mini batch of size or 2 people enter the room that'd be in that'd be allowed.

88
00:11:28.490 --> 00:11:30.109
Daniel Bienstock: But both of them

89
00:11:31.330 --> 00:11:41.419
Daniel Bienstock: okay, my network is kind of slow. Admit all they are getting admitted. Okay, joining joining fine. we. We have roughly 1.5 million

90
00:11:43.630 --> 00:12:10.209
Daniel Bienstock: data points. We take many batches of size. They said, size 16. I'm not ex exactly sure. Y, 16. Okay, but the the sum that we see here this is I. I'm taking an average right? An average of the of the 16 gradients that I'm computing. And this is stochastic gradient descent.

91
00:12:10.260 --> 00:12:12.060
Daniel Bienstock: Okay, so.

92
00:12:12.160 --> 00:12:34.430
Daniel Bienstock: but really, what are we trying to do? You know, we are trying to maximize the probability that the that the system, the train network predicts the correct action. Okay, P of Ak, given sk is the action that the sorry ak given sk is the action that the expert took given State. Sk

93
00:12:35.050 --> 00:12:45.970
Daniel Bienstock: p of ak, sk, let me let me further highlight. This is the probability that our train system gives to this correct action.

94
00:12:47.190 --> 00:13:07.289
Daniel Bienstock: And so if we didn't have the log we would just that we would have something. It looks natural or right with that for each sample. If we had only one example instead of M. Equals to 16, we have M equal to one we are at that point. We are trying to maximize the probability of the action taken by the expert.

95
00:13:07.840 --> 00:13:29.009
Daniel Bienstock: And then we are taking an average which is a stochastic gradient descent part. But it's not that the P. It's a log of the P. Okay? Why the log? Okay. Now, log is consistent with probability. If you maximize the log, you're also maximizing the probability. So that's okay. But still, why the log? Okay. So we'll get to that in a minute. Okay, we'll get to that in a little bit.

96
00:13:29.490 --> 00:13:42.479
Daniel Bienstock: So this is what these characters did. Okay, they took. Let's say, a training data set of one and a half or close to 2 million moves taken by experts.

97
00:13:42.850 --> 00:14:06.139
Daniel Bienstock: And they use that to train the network using gradient descent exactly with these rules. What else did they have? Hold on! Hold on! Wait a minute what is Alpha? Alpha is the training rate. Okay, this is the step size for those of us who remember first order methods. What did they do? They use Alpha equals to 3 times 10 to the minus 3 very small.

98
00:14:06.770 --> 00:14:08.829
Daniel Bienstock: and reduced

99
00:14:10.150 --> 00:14:12.499
Daniel Bienstock: by a factor of a half

100
00:14:12.770 --> 00:14:18.710
Daniel Bienstock: every. How many, how many steps, every what, every 80 million steps.

101
00:14:20.710 --> 00:14:23.250
Daniel Bienstock: This tells you that it took a lot of steps

102
00:14:25.018 --> 00:14:31.009
Daniel Bienstock: they are not shy. Okay, what else? No momentum we learned about momentum

103
00:14:32.036 --> 00:14:50.810
Daniel Bienstock: in previous lectures. No momentum, straight, fixed step, fixed learning rate grade in the center. Simple as possible. Okay. They say, now in the network, in the, in the network, in the paper. They describe some details about the network

104
00:14:51.692 --> 00:14:54.990
Daniel Bienstock: in a future lecture in a future lecture.

105
00:14:55.420 --> 00:14:59.970
Daniel Bienstock: We'll do something practical. But discuss, you know, what are the network architectures?

106
00:15:00.000 --> 00:15:13.480
Daniel Bienstock: There's some terminology. If you read this Google paper in that methods section. They have a subsection, a few paragraphs describing the network architecture. The main thing that that we can say is, it has 13 layers

107
00:15:14.080 --> 00:15:29.920
Daniel Bienstock: if you keep reading, and that they tell you something about the number of revenues that they have. You'll see the term kernel and the term stride. These are AI, or I should say Ml. Terms that describe to some extent the architecture

108
00:15:29.970 --> 00:15:43.290
Daniel Bienstock: of the network, how the layers, what the layers look like. The last layer was what we called a convolutional layer, or the last between the last 2 layers is a convolutional layer means it's fully connected.

109
00:15:43.740 --> 00:15:57.700
Daniel Bienstock: Okay, what is the size of these layers. How about the first layer? Well, they you have to be able to input a state. Okay? A ste, the go board has an is like 19 by 19.

110
00:15:58.640 --> 00:16:11.650
Daniel Bienstock: And so and you have to have that dimensionality, but a little more, because you have to say where the pieces are. But even 19 by 19. And it's roughly, you know, almost 400 dimensional.

111
00:16:12.380 --> 00:16:18.830
Daniel Bienstock: Okay, 13 layers. So alright it, they say, hold on.

112
00:16:19.180 --> 00:16:26.260
Daniel Bienstock: They say that it took some time to train this. How long did it take? I think they say 3 weeks, 3 weeks

113
00:16:28.130 --> 00:16:52.760
Daniel Bienstock: to do the training they have this. They have this word in there. It took 3 weeks this term. I'm not exactly sure what it means. Does it mean 3 weeks of continuous running of their their computing. That would be the direct interpretation. I don't know if that's what they mean, or if they they, this is the entire length of time that they spent, you know, with the.

114
00:16:52.760 --> 00:17:15.099
Daniel Bienstock: you know, making mistakes and correcting them and changing parameters and hyper parameters. I do not know. Okay? And roughly, roughly, it's it it oh, they furthermore say, Oh, and I I cannot see. I'm computing impaired today because of various issues. Sorry. 14 layers, 14 layers, and in terms of the total runtime

115
00:17:15.099 --> 00:17:32.019
Daniel Bienstock: it works to this works to roughly 100 and I forget. I forget the number of steps that they took. These are gradient steps. Okay? But I did the computation. And it works roughly, about 187 steps per second.

116
00:17:34.610 --> 00:17:48.360
Daniel Bienstock: Okay, and I don't know whether this is very fast or not. Very fast. Okay, and so so this is the training, the training data set.

117
00:17:48.760 --> 00:17:56.270
Daniel Bienstock: Okay, they you. They approximately solve this, a gradient descent problem to try to maximize the probability.

118
00:17:56.707 --> 00:18:10.499
Daniel Bienstock: that you pick the right action in the in the given the the input state just maximize that probability, and then they tested it, testing data

119
00:18:14.350 --> 00:18:17.359
Daniel Bienstock: on what? Roughly 29 million

120
00:18:19.764 --> 00:18:20.659
Daniel Bienstock: cases.

121
00:18:21.280 --> 00:18:28.520
Daniel Bienstock: So the testing data was much, much bigger than the training data and the accuracy that they got

122
00:18:30.420 --> 00:18:34.580
Daniel Bienstock: accuracy was about how much? 57%.

123
00:18:36.040 --> 00:18:38.720
Daniel Bienstock: Okay, so a little bit better than half the time.

124
00:18:39.560 --> 00:18:45.109
Daniel Bienstock: Okay? And apparently this was much better than what was available in the state of the art.

125
00:18:45.140 --> 00:18:50.832
Daniel Bienstock: Okay? Now, what other information can we?

126
00:18:51.860 --> 00:18:58.139
Daniel Bienstock: can we provide? I lost the numbers here. So roughly.

127
00:18:58.160 --> 00:19:05.220
Daniel Bienstock: So this is training. Okay in deployment when you want to use it. And you want to evaluate the state

128
00:19:05.320 --> 00:19:10.919
Daniel Bienstock: in order to make a prediction. How fast was that? And they said roughly, 3 ms.

129
00:19:12.120 --> 00:19:22.369
Daniel Bienstock: They also drained a less accurate network where the accuracy was a lot less than 57% only about 25.

130
00:19:22.500 --> 00:19:29.830
Daniel Bienstock: But the advantage is that the making a prediction was much faster, only a few microseconds.

131
00:19:30.300 --> 00:19:35.540
Daniel Bienstock: and we will see that later on, perhaps in the next lecture. Why, that is important.

132
00:19:36.040 --> 00:19:47.549
Daniel Bienstock: Okay, so this is supervised learning of of policy networks. The word policy has to do with the fact that, given a state.

133
00:19:48.350 --> 00:19:53.160
Daniel Bienstock: the trained network amounts to a policy. It tells you what to do.

134
00:19:54.100 --> 00:19:55.100
Daniel Bienstock: Okay.

135
00:19:55.310 --> 00:20:03.609
Daniel Bienstock: what do you do? That's a policy. This is a a old language really in in in decision sciences.

136
00:20:04.020 --> 00:20:06.009
Daniel Bienstock: Just a policy. Alright.

137
00:20:06.330 --> 00:20:13.750
Daniel Bienstock: and it's supervised. Because well, they we had the data provided by the masters, the other supervisors.

138
00:20:14.410 --> 00:20:17.619
Daniel Bienstock: And it's a network. And we learned it. Okay.

139
00:20:18.390 --> 00:20:26.059
Daniel Bienstock: the next element and what they had okay, moving on with the the machine learning hierarchy is what

140
00:20:26.100 --> 00:20:32.449
Daniel Bienstock: they would call, or they call reinforcement reinforce cement learning

141
00:20:34.990 --> 00:20:36.980
Daniel Bienstock: of policy networks.

142
00:20:42.930 --> 00:20:48.520
Daniel Bienstock: Okay? And what do I mean by that. So before we had supervised learning

143
00:20:49.580 --> 00:20:56.460
Daniel Bienstock: reinforcement, learning means that that will use an algorithm to try to correct our errors. Okay.

144
00:20:57.100 --> 00:20:59.340
Daniel Bienstock: so what is the setting.

145
00:21:02.670 --> 00:21:03.949
Daniel Bienstock: we start

146
00:21:05.530 --> 00:21:07.939
Daniel Bienstock: from the previously trained network.

147
00:21:15.190 --> 00:21:20.510
Daniel Bienstock: Okay, and now we want to want to improve, want to improve

148
00:21:22.360 --> 00:21:24.240
Daniel Bienstock: on the weights

149
00:21:24.440 --> 00:21:28.630
Daniel Bienstock: on the trained weights. Sigma. These are the weights that define the network.

150
00:21:29.010 --> 00:21:38.439
Daniel Bienstock: And what is the goal? Now? All right. Look before we had the system that tries to predict what a master would do.

151
00:21:39.140 --> 00:21:41.369
Daniel Bienstock: You could use that to play a game

152
00:21:41.950 --> 00:21:45.320
Daniel Bienstock: every time that you are in a certain board position.

153
00:21:45.400 --> 00:21:49.470
Daniel Bienstock: Well, you use the network to tell you to predict what a master would do.

154
00:21:50.200 --> 00:21:51.709
Daniel Bienstock: And you play that move.

155
00:21:52.160 --> 00:21:53.270
Daniel Bienstock: Okay?

156
00:21:54.530 --> 00:22:00.869
Daniel Bienstock: well, now, the goal is to here in reinforcement learning. We want to get better at that.

157
00:22:01.290 --> 00:22:06.309
Daniel Bienstock: We want to start from the system we trained before that predicted what a master would do.

158
00:22:06.390 --> 00:22:08.150
Daniel Bienstock: and starting from there.

159
00:22:08.180 --> 00:22:11.390
Daniel Bienstock: improve on that system to win games

160
00:22:12.540 --> 00:22:14.380
Daniel Bienstock: when games

161
00:22:15.130 --> 00:22:18.139
Daniel Bienstock: okay? And what is the method?

162
00:22:19.860 --> 00:22:23.370
Daniel Bienstock: So the method we could call it self play.

163
00:22:24.800 --> 00:22:29.796
Daniel Bienstock: Okay, you play. This is how it's going to work play a game.

164
00:22:30.490 --> 00:22:31.900
Daniel Bienstock: play a game

165
00:22:33.690 --> 00:22:36.309
Daniel Bienstock: against an opponent.

166
00:22:36.970 --> 00:22:47.460
Daniel Bienstock: actually, an opponent, an opponent is going to be one of our prior algorithms. And in the very first iteration it'll be the the previously trained network.

167
00:22:47.900 --> 00:22:50.209
Daniel Bienstock: But because do it over and over again.

168
00:22:50.330 --> 00:22:55.039
Daniel Bienstock: every time that we run this exercise, we're going to get a better network.

169
00:22:55.360 --> 00:22:58.249
Daniel Bienstock: And now we'll play again against that network.

170
00:22:58.450 --> 00:23:00.990
Daniel Bienstock: Okay, against a prior

171
00:23:01.390 --> 00:23:02.800
Daniel Bienstock: algorithm.

172
00:23:05.390 --> 00:23:09.990
Daniel Bienstock: Okay, and so what are the States that we see the States

173
00:23:10.660 --> 00:23:18.140
Daniel Bienstock: going to be? S. One s. 2. These are board states. Okay, as we play the game.

174
00:23:18.230 --> 00:23:40.417
Daniel Bienstock: And what is the T. They put a finite ending time for the game. Okay, if the game was taken too long. Well, it was given value 0. We're going to set up another maximization problem. So value 0 means that we didn't play the game too effectively. I forget what the T was. Some maximum number of moves that they allowed. Okay? And

175
00:23:42.078 --> 00:23:45.030
Daniel Bienstock: What are the actions? What are the actions?

176
00:23:45.520 --> 00:23:52.520
Daniel Bienstock: So these are the actions that that we took in playing the game. And let's say, these are a one, a 2. These are the moves that we took.

177
00:23:52.940 --> 00:23:55.079
Daniel Bienstock: Okay? And what is the outcome.

178
00:23:57.120 --> 00:24:03.509
Daniel Bienstock: the outcome. I'm going to denote it as a Z, which is going to be equal to plus one or minus one win or lose.

179
00:24:05.090 --> 00:24:15.329
Daniel Bienstock: okay, or 0. If the game gets end up terminated, and then we applying gradient descent. So now again, let me write the formula gradient ascent ascent.

180
00:24:15.560 --> 00:24:21.370
Daniel Bienstock: We want to basically win games. We want to increase the score that we can get.

181
00:24:21.530 --> 00:24:36.290
Daniel Bienstock: So let me write the formula, and then we'll puzzle it out. It's going to look similar to what we had before. We're using exactly the same network, you know. So the the very first time that we we do this self play, we play against the previously trained network.

182
00:24:36.760 --> 00:24:53.379
Daniel Bienstock: Okay? And now, this is one game, one game. Okay, we'll change this in a minute to use multiple games like in less than a minute. But let's do it for one game. Okay, Alpha, this is again, that learning rate the gradient the size of the gradient step.

183
00:24:53.410 --> 00:24:55.160
Daniel Bienstock: And now I take the sun.

184
00:24:56.070 --> 00:25:03.440
Daniel Bienstock: and we'll have to explain this again. You know why this? So the gradient, with respect to the weights of the logarithm

185
00:25:04.500 --> 00:25:09.120
Daniel Bienstock: of the the of the probability. Pardon

186
00:25:12.866 --> 00:25:14.599
Daniel Bienstock: action. T

187
00:25:14.800 --> 00:25:16.619
Daniel Bienstock: given state T.

188
00:25:17.050 --> 00:25:18.330
Daniel Bienstock: All of this

189
00:25:18.470 --> 00:25:19.950
Daniel Bienstock: times. Z.

190
00:25:20.190 --> 00:25:24.429
Daniel Bienstock: Alright. So what does this say? Let's say that Z is plus one.

191
00:25:25.190 --> 00:25:26.770
Daniel Bienstock: So we won the game.

192
00:25:27.760 --> 00:25:40.620
Daniel Bienstock: Okay, does this make sense? Yes, we want the game. And so now I'm doing something that at least superficially, is consistent with maximizing the probabilities of the various moves that we did.

193
00:25:43.150 --> 00:25:44.190
Daniel Bienstock: Okay.

194
00:25:45.990 --> 00:25:46.920
Daniel Bienstock: Alright.

195
00:25:47.363 --> 00:25:56.730
Daniel Bienstock: but why the log again? Okay. Why the log? Again, we have to try to understand this. Well, actually, this is not quite right. We did.

196
00:25:57.780 --> 00:26:00.599
Daniel Bienstock: Stochastic, gradient descent, actually

197
00:26:01.800 --> 00:26:03.090
Daniel Bienstock: stochastic

198
00:26:05.040 --> 00:26:06.370
Daniel Bienstock: great dissent.

199
00:26:09.730 --> 00:26:27.120
Daniel Bienstock: which is what? Well, it's going to be alpha divided by the number of games. So we play multiple games. We don't play one game, we play multiple games. And now we take this sum. Okay? So it's an average. This is a stochastic part. We take an average

200
00:26:27.250 --> 00:26:29.150
Daniel Bienstock: of steps

201
00:26:29.852 --> 00:26:33.000
Daniel Bienstock: and then we have something. It looks like what I had above

202
00:26:33.230 --> 00:26:38.070
Daniel Bienstock: T equals one. Well, each game ends up at a different time. Maybe.

203
00:26:38.460 --> 00:27:02.089
Daniel Bienstock: Okay. Ti is at most as capital T. And now I have. Now I have the gradient log of P. Sigma of a T sub. I is, and king game, IST sub i, and now I have a Z sub. I, the outcome output of game. I outcome of game. I, okay, so this is just an average of terms of what I had before. So really, it's like a very similar

204
00:27:02.130 --> 00:27:10.280
Daniel Bienstock: and so and what what was them? I forgot. They they told me. I mean, it's in the paper. Pardon, and there's a hundred 28.

205
00:27:11.450 --> 00:27:18.950
Daniel Bienstock: So they play mini batches. They they, they play sequences of mini batches of a hundred 28 games

206
00:27:19.170 --> 00:27:21.190
Daniel Bienstock: against a prior opponent.

207
00:27:23.330 --> 00:27:27.000
Daniel Bienstock: From that they compute at one gradient step

208
00:27:27.570 --> 00:27:36.759
Daniel Bienstock: for all the network weights. The gradient step is a gradient descent step. It's consistent with increasing the probability that we win again.

209
00:27:37.260 --> 00:27:41.760
Daniel Bienstock: We are just looking at each sample, each trajectory, as it were.

210
00:27:42.495 --> 00:27:47.919
Daniel Bienstock: Again, when Z is one, we want to make the probability big when Z is minus one, we lost

211
00:27:48.040 --> 00:27:50.959
Daniel Bienstock: so and we make we want to decrease the probability.

212
00:27:51.360 --> 00:27:58.420
Daniel Bienstock: And then, okay, how long did this take? They say that it took how long it took? One day one day.

213
00:27:58.910 --> 00:28:01.599
Daniel Bienstock: in terms of the the training

214
00:28:02.075 --> 00:28:04.630
Daniel Bienstock: with a world with 50 gpus.

215
00:28:05.830 --> 00:28:09.429
Daniel Bienstock: Okay, so think about the Gpu as being one big computer.

216
00:28:09.540 --> 00:28:12.709
Daniel Bienstock: So if they had 50 of them running in parallel for one day.

217
00:28:13.260 --> 00:28:20.320
Daniel Bienstock: Okay, running stochastic, gradient descent, they say, asynchronously. So they're doing the different steps in parallel.

218
00:28:21.279 --> 00:28:41.349
Daniel Bienstock: in this Mini batch computation that we have in here alright. But now we want to understand, you know why the log finally, why the log? Okay, why the log before we go there, you know. Notice that this the term with that has the gradient inside the sun. This is really the same. This is really the same

219
00:28:44.925 --> 00:28:57.990
Daniel Bienstock: sorry I I I. This is correct. This is correct. What's inside inside inside the sum here? This is the same as the gradient of the sun. It's a sum of gradients, which is the gradient of the sum

220
00:28:58.990 --> 00:29:01.049
Daniel Bienstock: T equals one to Ti.

221
00:29:02.982 --> 00:29:05.819
Daniel Bienstock: Sigma's no sigma is gone

222
00:29:06.060 --> 00:29:07.339
Daniel Bienstock: of log

223
00:29:08.570 --> 00:29:14.640
Daniel Bienstock: of P. Sigma of action. T sub. I given state Z, sub, I,

224
00:29:15.370 --> 00:29:17.209
Daniel Bienstock: okay, that's what that is.

225
00:29:18.160 --> 00:29:23.209
Daniel Bienstock: And a log. A sum of logs is the log of of the product.

226
00:29:24.290 --> 00:29:35.459
Daniel Bienstock: Okay, so this is really, this is really just the log of the probability of the entire sequence. Okay, because the different moves are independent of one another.

227
00:29:36.390 --> 00:29:48.879
Daniel Bienstock: You know, if we find ourselves in state. Sti. Well, we'll take the move that the network the train network dictates at that point. The past is in is gone, it's independent.

228
00:29:50.190 --> 00:29:53.810
Daniel Bienstock: And so this is a little bit reminiscent. Again, of dynamic programming.

229
00:29:54.530 --> 00:30:00.859
Daniel Bienstock: Okay, so I'm just taking the the the log of the probability of the entire trajectory.

230
00:30:02.840 --> 00:30:10.639
Daniel Bienstock: and I'm taking the gradient of that. So again, this is all very, very makes sense. But why the log? All right. So now let's get to that.

231
00:30:11.380 --> 00:30:17.320
Daniel Bienstock: How much time do I have? Well, they don't have much, all right. So here's some classical literature that they cite.

232
00:30:17.705 --> 00:30:30.069
Daniel Bienstock: Did they cite? They think they sign. I'm not sure how they cited it. Classical literature going back to the dawn of the AI. Okay, there's a one of their gods is this Guy Williams.

233
00:30:31.580 --> 00:30:33.540
Daniel Bienstock: 1992.

234
00:30:33.820 --> 00:30:45.659
Daniel Bienstock: And then even, and he and other people today, they cite even earlier work very famous work in very early work in AI Barton and Sutton.

235
00:30:46.990 --> 00:30:49.830
Daniel Bienstock: and roughly, 1983.

236
00:30:49.990 --> 00:30:54.470
Daniel Bienstock: Okay, I can tell you that back then nobody took AI seriously.

237
00:30:54.790 --> 00:30:57.610
Daniel Bienstock: everybody would say, this is this is such a joke!

238
00:30:58.300 --> 00:31:06.999
Daniel Bienstock: It's a complete joke. These are charlatans, you know. People who are, you know, should be thrown away out of universities and such.

239
00:31:07.250 --> 00:31:08.140
Daniel Bienstock: Okay.

240
00:31:08.640 --> 00:31:12.890
Daniel Bienstock: so now let's look at a generic setup for reinforcement learning.

241
00:31:13.380 --> 00:31:19.119
Daniel Bienstock: I take it that some of you know I'm I'm I'm told in in good faith

242
00:31:19.310 --> 00:31:22.860
Daniel Bienstock: that some of you know what a mark of decision process is.

243
00:31:23.500 --> 00:31:25.430
Daniel Bienstock: but not everybody does.

244
00:31:25.740 --> 00:31:34.340
Daniel Bienstock: So let me give you an example of a markup decision process and what it is. And and then we'll take the conversation elsewhere.

245
00:31:35.070 --> 00:31:37.829
Daniel Bienstock: Okay, so what is a mark of decision process.

246
00:31:45.450 --> 00:31:50.799
Daniel Bienstock: Okay, so we have a number of states. Okay, here we have a little 4 States.

247
00:31:51.080 --> 00:31:54.950
Daniel Bienstock: and out of each State, we have possible transitions.

248
00:31:55.440 --> 00:32:02.470
Daniel Bienstock: Let's say there and they are to there, and they are to there, I don't know. And and and that's it.

249
00:32:03.550 --> 00:32:07.260
Daniel Bienstock: Okay. Now, in, in each state, in each State.

250
00:32:07.570 --> 00:32:08.803
Daniel Bienstock: We have

251
00:32:09.590 --> 00:32:14.460
Daniel Bienstock: some number of actions, possible actions, you know. Action one

252
00:32:15.610 --> 00:32:21.189
Daniel Bienstock: 2, and 3 possible actions at that state. Let's call it State a.

253
00:32:21.880 --> 00:32:26.890
Daniel Bienstock: And now there are 2 things that happen. One is that we get a reward.

254
00:32:27.620 --> 00:32:31.690
Daniel Bienstock: If we take an action, we get a reward. So here's the reward.

255
00:32:34.220 --> 00:32:39.569
Daniel Bienstock: And let's say the it's a 5 and minus 10 and and 6. Okay.

256
00:32:39.650 --> 00:32:42.590
Daniel Bienstock: so at this state there are 3 actions.

257
00:32:43.070 --> 00:32:46.200
Daniel Bienstock: and then for each action. If we take that action.

258
00:32:46.250 --> 00:32:47.710
Daniel Bienstock: we get a reward.

259
00:32:49.150 --> 00:33:07.600
Daniel Bienstock: What else? Well, in addition to getting a reward, we're going to get a probability distribution. Okay, one third and 2 thirds, or one half and one half, and 0 and one. So what is that probability? Distribution? Well, notice

260
00:33:07.650 --> 00:33:10.179
Daniel Bienstock: that, you know there are 2 numbers in each case.

261
00:33:10.760 --> 00:33:12.770
Daniel Bienstock: blue and green.

262
00:33:14.100 --> 00:33:15.600
Daniel Bienstock: blue and green.

263
00:33:15.830 --> 00:33:20.030
Daniel Bienstock: Okay, and there are 2 arcs going out blue and green.

264
00:33:21.300 --> 00:33:29.439
Daniel Bienstock: Okay, so if we take action. The action one, then the probability distribution indicates a probability that we will switch

265
00:33:29.600 --> 00:33:33.710
Daniel Bienstock: to States blue or green. From

266
00:33:38.160 --> 00:33:39.170
Daniel Bienstock: okay.

267
00:33:39.430 --> 00:33:44.480
Daniel Bienstock: And we have such information for every node of this network.

268
00:33:44.770 --> 00:33:47.640
Daniel Bienstock: So we choose the action that we want to take.

269
00:33:48.060 --> 00:33:50.590
Daniel Bienstock: and then we get the reward immediately.

270
00:33:50.850 --> 00:33:55.739
Daniel Bienstock: and then we transition to the other States according to the probability distribution.

271
00:33:56.130 --> 00:34:04.840
Daniel Bienstock: Another question is, okay, what policy should we have? What do I mean by a policy? A policy tells you in each state which action you should take.

272
00:34:05.390 --> 00:34:25.590
Daniel Bienstock: Okay? And what is the goal? The goal is to maximize the total reward. Think about this. Well, if you run this infinitely many times, of course. Then you're going to get some kind of infinite, infinite reward. But let's make it time discounted. Okay, with some discounting factor, you know. Let's say a half.

273
00:34:27.040 --> 00:34:37.290
Daniel Bienstock: You know we want to choose a but is there such a thing as a policy that maximizes them? The the maximum that maximizes the expected reward. Discounted expected reward.

274
00:34:38.159 --> 00:34:40.500
Daniel Bienstock: Okay, a fixed policy.

275
00:34:40.710 --> 00:34:53.469
Daniel Bienstock: And the answer is, yes, there is. There's a whole slew of theorems. And one nice thing that we don't have time to go over today is that this this particular problem can be solved as a small linear program.

276
00:34:54.040 --> 00:35:07.880
Daniel Bienstock: Okay, as a small linear program. There are several methods for solving this problem efficiently. This markup decision model efficiently. And I can see that the blue is not showing up. Do you see the blue Matthias.

277
00:35:10.070 --> 00:35:10.920
matias: Yes. Done.

278
00:35:11.210 --> 00:35:12.300
Daniel Bienstock: Okay. Good.

279
00:35:12.300 --> 00:35:14.110
matias: Oh, so you have.

280
00:35:14.380 --> 00:35:19.560
matias: I I we only see 2 colors like tail and green, or.

281
00:35:20.000 --> 00:35:24.889
Daniel Bienstock: Yeah. Okay, so let me let me highlight. I mean, whatever the color was that this is the other color.

282
00:35:27.010 --> 00:35:29.510
Daniel Bienstock: Okay, now, it's a little more visible. Alright.

283
00:35:29.896 --> 00:35:39.679
Daniel Bienstock: right! And then, for every node, you know, like the this node. On the other hand, you know they are. Notice there are no, no exiting states wrong thing

284
00:35:40.839 --> 00:35:44.520
Daniel Bienstock: so once you're there, you're stuck there forever. Okay.

285
00:35:44.560 --> 00:35:47.579
Daniel Bienstock: and you get the reward. You don't get anything.

286
00:35:47.680 --> 00:35:50.089
Daniel Bienstock: And so on. This can happen

287
00:35:50.740 --> 00:35:51.599
Daniel Bienstock: all right

288
00:35:51.730 --> 00:35:52.610
Daniel Bienstock: now.

289
00:35:54.028 --> 00:36:05.029
Daniel Bienstock: later. During the summer I have some some meetings with with the the students here. We'll go over Markov. Decision processes in a little more detail. Okay?

290
00:36:05.040 --> 00:36:06.600
Daniel Bienstock: But now, alright

291
00:36:07.362 --> 00:36:18.320
Daniel Bienstock: now that we understand what a system is and what a policy is. And so on. Let's look at a generic setup, not for mark of decision processes, for reinforcing reinforcement learning.

292
00:36:22.740 --> 00:36:27.379
Daniel Bienstock: Okay, which is basically something that Williams was looking at.

293
00:36:27.630 --> 00:36:35.040
Daniel Bienstock: So we control a system. Okay? And now we can talk in generalities. We control a system

294
00:36:36.180 --> 00:36:38.340
Daniel Bienstock: by a policy.

295
00:36:40.150 --> 00:36:48.299
Daniel Bienstock: Let's call it pi, okay, in the network learning case, the policy really was the network. Okay.

296
00:36:48.400 --> 00:36:51.060
Daniel Bienstock: the network instantiates a policy

297
00:36:51.380 --> 00:36:53.330
Daniel Bienstock: and given a state.

298
00:36:53.410 --> 00:36:55.020
Daniel Bienstock: It tells you what to do.

299
00:36:56.800 --> 00:37:03.029
Daniel Bienstock: You pull a lever, a big lever. Given a state that policy. Pi tells you what to do.

300
00:37:03.400 --> 00:37:06.299
Daniel Bienstock: Alright. Now the policy

301
00:37:06.540 --> 00:37:07.880
Daniel Bienstock: is controlled

302
00:37:09.550 --> 00:37:10.970
Daniel Bienstock: by parameters.

303
00:37:14.070 --> 00:37:21.000
Daniel Bienstock: Sigma, which is a high dimensional vector and in the network case well, these are the weights of the network.

304
00:37:21.940 --> 00:37:24.450
Daniel Bienstock: Okay, the system.

305
00:37:25.100 --> 00:37:26.979
Daniel Bienstock: Given an application

306
00:37:30.710 --> 00:37:32.210
Daniel Bienstock: of the policy.

307
00:37:34.950 --> 00:37:36.540
Daniel Bienstock: this system

308
00:37:38.380 --> 00:37:39.779
Daniel Bienstock: will follow

309
00:37:40.880 --> 00:37:43.050
Daniel Bienstock: a stochastic trajectory.

310
00:37:48.920 --> 00:37:52.019
Daniel Bienstock: Then I'm going to call Tao.

311
00:37:53.810 --> 00:38:02.630
Daniel Bienstock: Okay? In the network case, it's stochastic. Well, because at the very end the very last layer, this soft Max layer, and so on. That's a probability.

312
00:38:02.720 --> 00:38:04.279
Daniel Bienstock: Okay, it's stochastic.

313
00:38:04.670 --> 00:38:11.309
Daniel Bienstock: We we assemble an action to take according to the probabilities that the network has produced.

314
00:38:12.110 --> 00:38:19.869
Daniel Bienstock: We run as we run the state as the input you run it. It produces these numbers, which are probabilities.

315
00:38:20.160 --> 00:38:24.100
Daniel Bienstock: We sample from the distribution, and we get the desired action.

316
00:38:24.860 --> 00:38:28.088
Daniel Bienstock: Okay? And let's use a notation

317
00:38:28.730 --> 00:38:31.429
Daniel Bienstock: pi of Tau is the probability.

318
00:38:31.920 --> 00:38:34.050
Daniel Bienstock: Oh, that trajectory

319
00:38:35.250 --> 00:38:41.569
Daniel Bienstock: in a more general reinforcement learning setup. There could be stochastics at multiple steps.

320
00:38:42.790 --> 00:38:45.020
Daniel Bienstock: Okay? And so

321
00:38:45.716 --> 00:38:48.339
Daniel Bienstock: at termination, at termination

322
00:38:52.500 --> 00:38:54.110
Daniel Bienstock: we get a reward.

323
00:38:56.080 --> 00:38:58.920
Daniel Bienstock: Then I'm going to call R of Tau.

324
00:38:59.720 --> 00:39:19.160
Daniel Bienstock: That depends on the entire trajectory. Okay? And the network setup I was describing before. The the reward does not depend on the entire trajectory. We just collect a reward at the very end if we match if we win the game. Pardon or not. Okay.

325
00:39:19.480 --> 00:39:23.840
Daniel Bienstock: it was plus or minus or minus one alright. And so

326
00:39:23.850 --> 00:39:26.430
Daniel Bienstock: maximize. The goal goal

327
00:39:27.400 --> 00:39:31.559
Daniel Bienstock: is to choose policy, choose policy

328
00:39:32.860 --> 00:39:34.520
Daniel Bienstock: to maximize

329
00:39:34.800 --> 00:39:36.590
Daniel Bienstock: the expected reward.

330
00:39:39.230 --> 00:39:53.520
Daniel Bienstock: which is what that's the integral. So that'd be a little bit sloppy, the integral. Let me write that the notation and then we will debug it. Okay, it's a little, just a little sloppy, you know, it's a elegantly sloppy

331
00:39:58.050 --> 00:40:05.409
Daniel Bienstock: okay, this is our notation for saying what we are saying. Okay, look, we are sampling trajectories

332
00:40:06.350 --> 00:40:18.809
Daniel Bienstock: with some distribution. So this is like a stochastic integral pi is the probability of sampling trajectory Tau, and then we get the reward. So this is the expected reward. And we want to maximize this.

333
00:40:19.730 --> 00:40:22.259
Daniel Bienstock: Okay, it's only a little bit sloppy.

334
00:40:22.500 --> 00:40:27.760
Daniel Bienstock: It's a sloppy enough to scandalize somebody who's serious about stochastic processes.

335
00:40:28.125 --> 00:40:31.180
Daniel Bienstock: But it's correct enough that we understand what it says.

336
00:40:31.620 --> 00:40:55.119
Daniel Bienstock: We want to maximize this. How do we maximize this? Well, Williams, back, you know, like I don't know. 100 years ago, whatever 1992, he said. Look, I have no clue. What we are doing. Let's take gradient steps. Okay, this is long before modern AI and Ml. And Gpus, and this and that, and you know anything existed, he said. Look, this is so hard to do.

337
00:40:55.120 --> 00:41:04.950
Daniel Bienstock: I know how to take gradients, you know. Let's take a small gradient steps and then start with a policy and correct it. Okay, and now that the thing is well.

338
00:41:05.140 --> 00:41:09.520
Daniel Bienstock: how do we take the gradient of something like this?

339
00:41:09.830 --> 00:41:12.189
Daniel Bienstock: So we want to take the gradient.

340
00:41:13.140 --> 00:41:21.299
Daniel Bienstock: Oh, boy, it seems like we actually going to make it through the lecture on time? Excellent! The gradient of the expected reward, the gradient of this.

341
00:41:25.910 --> 00:41:27.270
Daniel Bienstock: Let me copy this

342
00:41:30.470 --> 00:41:31.290
Daniel Bienstock: paste.

343
00:41:31.400 --> 00:41:38.949
Daniel Bienstock: and I? Well, you know, let's not review how what this means in in in the case of the prior slide, this.

344
00:41:39.060 --> 00:41:42.939
Daniel Bienstock: Okay? let's put aside

345
00:41:43.240 --> 00:41:48.910
Daniel Bienstock: or let's ignore for the time being, the details toward the bottom. Here

346
00:41:49.540 --> 00:41:54.690
Daniel Bienstock: now we are sampling. If we have a policy, if we have a policy which is a trained network.

347
00:41:55.820 --> 00:42:00.929
Daniel Bienstock: We are sampling games. Okay? That's those are our trajectories. A game is a trajectory.

348
00:42:02.080 --> 00:42:14.520
Daniel Bienstock: Okay? And we want to maximize the expected reward that we would get. You know it's a it's a if we if we keep playing many games over and over and over again. You know we are sampling from the same distribution for a given train network.

349
00:42:15.118 --> 00:42:21.469
Daniel Bienstock: It's you know, we're playing a game at each move of the game. We're getting a certain move with a certain probability.

350
00:42:21.540 --> 00:42:24.929
Daniel Bienstock: and at termination of the game we get a reward.

351
00:42:25.460 --> 00:42:27.910
Daniel Bienstock: And so we keep sampling games.

352
00:42:29.340 --> 00:42:33.180
Daniel Bienstock: and we want to somehow maximize the expected reward.

353
00:42:34.240 --> 00:42:36.320
Daniel Bienstock: Buy a first order methods.

354
00:42:36.580 --> 00:42:53.649
Daniel Bienstock: Okay? So we need to be able to compute something like that. A gradient of of an integral, you know, like combines 2 things that probably many of my students don't like. Alright so how do we do that? Okay, so let me copy that and move on to the next page.

355
00:43:00.900 --> 00:43:02.569
Daniel Bienstock: So we want to do this.

356
00:43:03.880 --> 00:43:23.310
Daniel Bienstock: Okay? And let's refresh our memory. The tiles of the the we are sampling trajectories. According to our policy, R. Of Tau is the terminal reward that we get, or or the reward that we get as we go through the trajectory and pi of Tau is the probability that we sample that particular trajectory.

357
00:43:24.540 --> 00:43:37.100
Daniel Bienstock: Okay, so alright, let's see if we can do some math over here. Okay, you want to take the gradient. So let's break some eggs. This is we can immediately say, this is equal to this.

358
00:43:45.900 --> 00:44:02.599
Daniel Bienstock: Okay, so I switch the gradient with the integral. I'm sure that there there's there are hundreds of pages of mathematics that say, well, in order for you to do that, you have to satisfy A, BCD, and E, and so on. And I'm saying, Yes, I satisfy everything. I'm I'm happy with us. Okay.

359
00:44:03.760 --> 00:44:04.690
Daniel Bienstock: alright.

360
00:44:04.720 --> 00:44:06.969
Daniel Bienstock: So how do we do that? Okay.

361
00:44:09.250 --> 00:44:23.740
Daniel Bienstock: alright. So now let's be creative. Okay, there's a a proof. What am I doing here? Okay, what am I doing here. I'm doing something that is called the policy gradient theorem.

362
00:44:24.150 --> 00:44:41.849
Daniel Bienstock: which I think was proved by this guy Williams that I had before. But it's possible. Some people say that actually, these 2 other guys, Barto and Sutton had already proved or outlined or something. Okay. But this a policy gradient. Theorem.

363
00:44:42.630 --> 00:44:56.310
Daniel Bienstock: Okay policy, gradient theorem or gradient policy. Theorem is like one of the golden one of the the. This is the eleventh commandment of the AI community, the traditional AI community.

364
00:44:56.350 --> 00:45:00.027
Daniel Bienstock: This is what they believe in. So alright. So let's let's do this. Okay.

365
00:45:00.780 --> 00:45:14.380
Daniel Bienstock: now, alright, I want you to think about this and and what I'm going to be doing here. I'm sampling in in the integral that is fully written up at the top on the right. I'm sampling trajectories.

366
00:45:14.530 --> 00:45:18.040
Daniel Bienstock: So let's say that the towel has been sampled. Okay.

367
00:45:18.550 --> 00:45:20.550
Daniel Bienstock: I have a given trajectory.

368
00:45:21.870 --> 00:45:24.019
Daniel Bienstock: It has a certain reward.

369
00:45:25.040 --> 00:45:26.180
Daniel Bienstock: So

370
00:45:26.340 --> 00:45:32.269
Daniel Bienstock: let me write this, and then let's see if we agree. This is equal to the gradient of the probability.

371
00:45:32.710 --> 00:45:33.710
Daniel Bienstock: Okay.

372
00:45:34.200 --> 00:45:35.750
Daniel Bienstock: time's a reward.

373
00:45:36.740 --> 00:45:45.689
Daniel Bienstock: And again, I want you to think of the integral at the top, right? As one where you're just. It's like a an integral is always like a big sum. Okay

374
00:45:46.302 --> 00:45:51.930
Daniel Bienstock: for any given cow. It's a trajectory. Its reward is is a well given.

375
00:45:52.240 --> 00:46:05.320
Daniel Bienstock: The only function of the parameters, sigma, are the probabilities, not the not the reward of the of the trajectory, and this is why I can write the the integral at the bottom left.

376
00:46:05.360 --> 00:46:10.369
Daniel Bienstock: The rewards do not depend on the probabilities given the trajectory.

377
00:46:11.930 --> 00:46:20.980
Daniel Bienstock: Okay, it took me like 5 days to understand this. And so what does this equal to? So now let me multiply and divide

378
00:46:22.890 --> 00:46:24.120
Daniel Bienstock: by that.

379
00:46:24.140 --> 00:46:25.720
Daniel Bienstock: And I have this.

380
00:46:29.690 --> 00:46:45.250
Daniel Bienstock: Okay, I multiplied and divided. Truth be told, the pi, the probability pi depends on the Sigmas. I'm I'm I'm I'm skipping that to make the notations lighter. Remember, in the network training case, the probability is output by the network.

381
00:46:45.350 --> 00:46:51.249
Daniel Bienstock: And the network depends on the segments. Okay? And that's what we are changing times the reward.

382
00:46:51.950 --> 00:47:05.170
Daniel Bienstock: Times. D pi, okay, detail. And now equal. And now what is the sequel? So this animal here, this animal here, that is delta sigma of the log. Finally.

383
00:47:06.010 --> 00:47:09.060
Daniel Bienstock: okay. Times, pi of Tau

384
00:47:09.510 --> 00:47:11.180
Daniel Bienstock: times are of Tao

385
00:47:11.550 --> 00:47:18.310
Daniel Bienstock: detail. Okay? And now, what is this animal? Okay, remember, pi is the probability of trajectory

386
00:47:18.874 --> 00:47:28.169
Daniel Bienstock: Tau. This is the expectation. This is the expectation of the log of Sigma of the log, of the probability

387
00:47:30.320 --> 00:47:31.990
Daniel Bienstock: times. R.

388
00:47:33.720 --> 00:47:34.730
Daniel Bienstock: That's it.

389
00:47:35.090 --> 00:47:43.810
Daniel Bienstock: It's the expectation of that quantity. The pi entered this pie here entered because we are taking an expectation. That's all.

390
00:47:44.230 --> 00:47:49.739
Daniel Bienstock: Okay, that thing, the integral and the pi means that I'm taking an expectation.

391
00:47:51.100 --> 00:47:54.530
Daniel Bienstock: And this is this famous gradient policy, theorem

392
00:47:55.510 --> 00:47:58.249
Daniel Bienstock: or policy grading policy grading.

393
00:48:03.500 --> 00:48:04.580
Daniel Bienstock: Okay.

394
00:48:04.810 --> 00:48:09.800
Daniel Bienstock: alright. And so if we go back to what we were doing before

395
00:48:09.820 --> 00:48:23.580
Daniel Bienstock: you can see here in in this term over here in this term, over here in the stochastic term, taking this average, this average. Let me leave out. Let me leave out the alpha.

396
00:48:24.100 --> 00:48:27.940
Daniel Bienstock: Let me leave that out, or rather, let me write that outside.

397
00:48:30.630 --> 00:48:40.969
Daniel Bienstock: Okay, the stuff that is in yellow. This is an estimate of that expectation. Think about the central limit, theorem. Okay? Applied to a function

398
00:48:41.490 --> 00:48:46.650
Daniel Bienstock: to take an average. You just take a large number of samples, and then you average them.

399
00:48:47.440 --> 00:48:51.779
Daniel Bienstock: And so here I'm taking a large well, a hundred 28

400
00:48:51.810 --> 00:48:53.780
Daniel Bienstock: samples of the gradient

401
00:48:54.660 --> 00:48:57.859
Daniel Bienstock: of what of of the Lord? Times a reward?

402
00:48:58.910 --> 00:48:59.830
Daniel Bienstock: Okay.

403
00:49:00.510 --> 00:49:07.279
Daniel Bienstock: The reward does not depend on on on anything really given. Given that the the trajectory. Given the trajectory.

404
00:49:07.420 --> 00:49:09.260
Daniel Bienstock: Okay? And, in fact.

405
00:49:09.500 --> 00:49:33.170
Daniel Bienstock: here I could have done that right if the expectation with respect to the weights, or the sigma or the towel. Pardon the trajectory. It's the same thing but the the if we think again in terms of the sampling interpretation of an expectation there the rewards don't depend okay

406
00:49:33.350 --> 00:49:35.440
Daniel Bienstock: on on the weights.

407
00:49:36.450 --> 00:49:41.429
Daniel Bienstock: Alright, but I should put it in there otherwise it it's not clear what it is that I'm doing.

408
00:49:42.110 --> 00:49:44.329
Daniel Bienstock: But I wanted to highlight that alright.

409
00:49:44.430 --> 00:49:52.879
Daniel Bienstock: And this is the policy gradient algorithm. And that's what we are doing in here. We are sampling games. Okay, that's the towel. The game. A full game is a towel.

410
00:49:52.900 --> 00:49:59.820
Daniel Bienstock: I'm taking the the gradient of each of the samples, and I'm averaging them, and then the alpha is the learning rate, the step size.

411
00:50:00.600 --> 00:50:14.849
Daniel Bienstock: And so they're doing gradient descent consistent, consistent with with maximizing. Well, maximize, Max maximizing the the the expected reward, maximizing the expected reward which is winning the game

412
00:50:18.210 --> 00:50:26.019
Daniel Bienstock: alrighty. And this was a step 2 of what they did. But you know it's something that is standard. This a policy gradient. Theorem.

413
00:50:27.800 --> 00:50:29.880
Daniel Bienstock: Okay? Alright

414
00:50:31.602 --> 00:50:50.930
Daniel Bienstock: alright. And what was the outcome of all of us? With these, you know, 128 Mini batches one day, and 50 gpus and all that. So they had started with the original train system that was supposed to mimic what the masters did.

415
00:50:51.060 --> 00:50:57.960
Daniel Bienstock: Then they took the weights from from that, and then they applied this reinforcement learning algorithm

416
00:50:58.010 --> 00:51:02.679
Daniel Bienstock: to actually improve their chances of winning games.

417
00:51:03.160 --> 00:51:11.600
Daniel Bienstock: So now they had a better game playing system. Okay? And what were the the hitting? The statistics

418
00:51:11.630 --> 00:51:14.630
Daniel Bienstock: for this? And so they said that they beat

419
00:51:15.870 --> 00:51:20.540
Daniel Bienstock: the original. The masters

420
00:51:21.860 --> 00:51:23.050
Daniel Bienstock: trained

421
00:51:24.130 --> 00:51:25.330
Daniel Bienstock: system

422
00:51:26.050 --> 00:51:27.929
Daniel Bienstock: 80% of the time.

423
00:51:30.980 --> 00:51:31.670
matias: Before it was.

424
00:51:31.670 --> 00:51:32.160
Daniel Bienstock: Loud.

425
00:51:32.160 --> 00:51:34.280
matias: Before it was 7, 57, right.

426
00:51:34.280 --> 00:51:39.760
Daniel Bienstock: No, no, so the the 57 was accuracy, and predicting individual moves.

427
00:51:41.270 --> 00:51:45.320
Daniel Bienstock: Now they use that system. Now they played games against that correct.

428
00:51:45.320 --> 00:51:46.253
matias: You see.

429
00:51:46.720 --> 00:51:58.870
Daniel Bienstock: Now the part that I'm that I that I skipped here. Sorry. Let me go back because it is important. Here is that every how many, every 500 steps or so. Okay.

430
00:51:59.150 --> 00:52:02.739
Daniel Bienstock: they would take the network that that they had just computed.

431
00:52:03.120 --> 00:52:06.170
Daniel Bienstock: and they would make that into a new opponent.

432
00:52:07.830 --> 00:52:14.599
Daniel Bienstock: Okay? And after a while they would have this set of previously developed opponents.

433
00:52:15.130 --> 00:52:22.230
Daniel Bienstock: And then, the next time that they played a game like here play a game against a prior adversary. They would pick a random one

434
00:52:23.050 --> 00:52:25.800
Daniel Bienstock: of the ones that they had previously trained.

435
00:52:26.680 --> 00:52:28.569
Daniel Bienstock: So now they are doing self play.

436
00:52:29.390 --> 00:52:30.290
Daniel Bienstock: Okay?

437
00:52:30.600 --> 00:52:37.529
Daniel Bienstock: So what else? Now there was a system. It was an existing open source system, open source.

438
00:52:38.210 --> 00:52:39.670
Daniel Bienstock: Code.

439
00:52:39.900 --> 00:52:51.900
Daniel Bienstock: there was call. I think it was called Pachy. I have no idea what Pachy means. Okay, and I don't know what language that's supposed to be. This was a a simulation Monte Carlo.

440
00:52:52.400 --> 00:52:53.890
Daniel Bienstock: Monte Carlo

441
00:52:54.960 --> 00:52:56.110
Daniel Bienstock: system.

442
00:52:56.380 --> 00:52:58.419
Daniel Bienstock: It would evaluate

443
00:53:00.210 --> 00:53:02.160
Daniel Bienstock: a hundred 1,000

444
00:53:02.460 --> 00:53:04.110
Daniel Bienstock: moves per second.

445
00:53:05.740 --> 00:53:07.889
Daniel Bienstock: according to some criterion.

446
00:53:08.360 --> 00:53:12.450
Daniel Bienstock: Okay? And choose what it decided was the best.

447
00:53:12.900 --> 00:53:17.890
Daniel Bienstock: Okay? And so it beat this guy. It beat

448
00:53:18.510 --> 00:53:19.720
Daniel Bienstock: Pachy

449
00:53:19.770 --> 00:53:21.920
Daniel Bienstock: 85% of the time.

450
00:53:22.860 --> 00:53:25.549
Daniel Bienstock: Okay, Pachy was, I guess, considered the best

451
00:53:25.990 --> 00:53:31.150
Daniel Bienstock: at that point. Okay? And they beat it. 85% of the time and the previous

452
00:53:33.600 --> 00:53:34.720
Daniel Bienstock: best

453
00:53:35.050 --> 00:53:37.930
Daniel Bienstock: from any other system was 12% of the time.

454
00:53:38.440 --> 00:53:52.349
Daniel Bienstock: So up to then, a anybody who played against Pachy. At best. They beat Pachy 12% of the time. But this system, this Alphago system I I mean, up to this point we are missing.

455
00:53:52.400 --> 00:53:54.330
Daniel Bienstock: We're missing 2 big things.

456
00:53:54.460 --> 00:53:58.290
Daniel Bienstock: It was already beating the best 85% of the time.

457
00:53:59.210 --> 00:54:03.750
Daniel Bienstock: Alright, which is great. But not the best that one can do.

458
00:54:04.210 --> 00:54:25.309
Daniel Bienstock: Okay. Now, there's more that we have to go into the next lecture. Okay. there are 2 things. Let me outline them both. They are both very important. They go together. Okay, One thing is called what it's called reinforce reinforcement

459
00:54:26.880 --> 00:54:28.030
Daniel Bienstock: learning.

460
00:54:29.160 --> 00:54:34.189
Daniel Bienstock: So we know what that means by now of something networks.

461
00:54:36.680 --> 00:54:41.229
Daniel Bienstock: I'm going to call it value. This is new. What do I mean by this.

462
00:54:41.390 --> 00:54:44.039
Daniel Bienstock: So the last thing that I described

463
00:54:44.120 --> 00:54:48.800
Daniel Bienstock: was a system for playing a game. It would play games for you.

464
00:54:49.680 --> 00:54:53.229
Daniel Bienstock: Okay, what does this do? This predicts

465
00:54:55.020 --> 00:54:57.509
Daniel Bienstock: win or loss, win or lose

466
00:54:58.050 --> 00:54:59.280
Daniel Bienstock: quickly.

467
00:55:01.670 --> 00:55:09.509
Daniel Bienstock: Okay, so this is no longer a system for playing games. This tells you given a state of the board.

468
00:55:09.860 --> 00:55:12.569
Daniel Bienstock: If you play really? Well, will you win?

469
00:55:12.960 --> 00:55:14.460
Daniel Bienstock: Okay or not?

470
00:55:15.900 --> 00:55:16.626
Daniel Bienstock: Okay.

471
00:55:18.935 --> 00:55:25.449
Daniel Bienstock: We have to describe next lecture what this is. Okay? it has to be, what are the goals?

472
00:55:26.353 --> 00:55:30.889
Daniel Bienstock: It should be very fast, should be very fast. Goal.

473
00:55:35.510 --> 00:55:37.180
Daniel Bienstock: Make it very fast.

474
00:55:38.630 --> 00:55:45.020
Daniel Bienstock: Question is, why? Well, there's going to be another algorithm that's going to use this as a sub routine many times.

475
00:55:45.350 --> 00:55:51.180
Daniel Bienstock: many, many times. Why? Well, we'll have to explain. Okay, does not need

476
00:55:53.520 --> 00:55:55.319
Daniel Bienstock: to be super accurate.

477
00:55:58.440 --> 00:56:05.730
Daniel Bienstock: So for those of you who play any kind of card game or or anything like that, that requires many moves.

478
00:56:06.060 --> 00:56:07.390
Daniel Bienstock: many moves

479
00:56:08.120 --> 00:56:10.040
Daniel Bienstock: or the stock market.

480
00:56:10.470 --> 00:56:21.240
Daniel Bienstock: Okay, if you know that you're going to be correct, at least some fraction of the time, and you can use that to reinforce your decisions that when you, when you actually do, when you reinforce that.

481
00:56:21.250 --> 00:56:26.650
Daniel Bienstock: then maybe you can actually boost your chances of winning. So this is the spirit.

482
00:56:26.950 --> 00:56:50.780
Daniel Bienstock: Alright. So this is only one thing. So this output went into the last thing, and probably the most important thing where we want to spend a little more time. Okay, maybe the single most important component of the entire setup that they did. All of that we did until now was a way to prepare to to develop the next. The final phase.

483
00:56:51.210 --> 00:56:57.650
Daniel Bienstock: Everything that we did so far is very, very important, so that the next phase gets off the ground properly.

484
00:56:57.950 --> 00:57:03.489
Daniel Bienstock: But it's going to be the most important phase of all. And that's called Monte Carlo tree search.

485
00:57:10.190 --> 00:57:14.109
Daniel Bienstock: Okay? And usually abbreviated like that.

486
00:57:14.450 --> 00:57:19.690
Daniel Bienstock: And what is that. Okay? So we'll definitely go through these 2 things next lecture.

487
00:57:19.970 --> 00:57:25.470
Daniel Bienstock: we have to. So what is that? Okay, what is the tree that we are talking about here.

488
00:57:25.520 --> 00:57:30.759
Daniel Bienstock: So imagine a game like chess or go, you know, and you play first.

489
00:57:31.430 --> 00:57:34.200
Daniel Bienstock: And now there are many moves that you can make.

490
00:57:35.240 --> 00:57:47.509
Daniel Bienstock: There are many moves that you can make. And so you can have a picture. You know, this is the beginning. And they all these different moves. There's a whole bunch of moves that you can make. And now you don't know. Okay, which which one is is a good move to make at the very beginning.

491
00:57:48.160 --> 00:57:57.660
Daniel Bienstock: Well, then you could simulate. For each of these moves. You could simulate what your opponent would do. Okay, and your opponent in each case will have a bunch of moves.

492
00:58:00.260 --> 00:58:06.219
Daniel Bienstock: So if you assume that your opponent is very intelligent. Okay, how would they choose it?

493
00:58:06.540 --> 00:58:09.700
Daniel Bienstock: Well, they could simulate what you would do in each case.

494
00:58:11.250 --> 00:58:16.910
Daniel Bienstock: And so if you continue like this, you're you're going to grow this humongous tree.

495
00:58:17.530 --> 00:58:22.490
Daniel Bienstock: It's going to be both very broad and probably way too deep.

496
00:58:24.700 --> 00:58:34.260
Daniel Bienstock: If somehow you could evaluate this tree all the way. Eventually each branch, each branch, is going to terminate in some terminal state where somebody wins.

497
00:58:35.170 --> 00:58:44.530
Daniel Bienstock: Okay? And if you could. If you could visualize the entire tree all at once, then you could pick the best move in some sense

498
00:58:44.970 --> 00:59:06.250
Daniel Bienstock: right, and the the worth of a move. That move would be dependent upon 2 things right? The it's going to be a blend, you know. In in terms of a binary game like chess or go. You know you want to know. Okay, in the subtree, you know. How likely is it that I win? That is to say, ho! How many times do I win.

499
00:59:06.770 --> 00:59:08.830
Daniel Bienstock: and how many times do I not win?

500
00:59:09.870 --> 00:59:17.849
Daniel Bienstock: So we have to combine the 2 things. There's some some kind of a stochastic interpretation and some kind of a value interpretation. Okay.

501
00:59:18.326 --> 00:59:22.389
Daniel Bienstock: and you have to somehow balance the 2 of them.

502
00:59:22.490 --> 00:59:32.969
Daniel Bienstock: So Monte Carlo tree search is a way to to heuristically, heuristically, evaluate a subtree of this tree

503
00:59:33.370 --> 00:59:42.699
Daniel Bienstock: where you're evaluating. You are repeatedly evaluating some branch of the tree, and then you're adjusting your estimate of probabilities in particular.

504
00:59:42.760 --> 00:59:44.260
Daniel Bienstock: of a success.

505
00:59:45.676 --> 01:00:12.929
Daniel Bienstock: If you had a game that was not a binary game, it's the game that, in addition that gives you a a number, a value not plus minus one, but some number. Okay? Then it's a more delicate process, because you don't just want to to take the branch that gives you. You know, th has the single node the single leaf with the highest possible value. It's a blend of that. And how many times you know some kind of expectation, and so on.

506
01:00:13.140 --> 01:00:16.680
Daniel Bienstock: And so, Monte Carlo tree search is a collection of heuristics

507
01:00:16.880 --> 01:00:21.339
Daniel Bienstock: to try to narrow down the tree, both in terms of its breadth.

508
01:00:21.390 --> 01:00:25.370
Daniel Bienstock: and also rapidly getting to the bottom in some sense

509
01:00:27.055 --> 01:00:37.649
Daniel Bienstock: and so these last 2 elements is a reinforcement learning of value networks which builds upon the very first thing that we saw here, or it starts from from

510
01:00:38.430 --> 01:00:41.040
Daniel Bienstock: from that supervised learning

511
01:00:41.734 --> 01:00:46.829
Daniel Bienstock: and then it also uses this to give you good moves.

512
01:00:46.950 --> 01:01:09.209
Daniel Bienstock: Give you good moves that are likely to win games. So the the first tool the the prediction of what a good move would be that in fast, in a fast mode is going to be used to develop a quick estimate as to whether given a state of the board, you're likely to win or not very quickly.

513
01:01:09.390 --> 01:01:17.670
Daniel Bienstock: and the second one, the reinforcement learning of policy networks will be used to begin the Monte Carlo tree, search

514
01:01:19.142 --> 01:01:26.200
Daniel Bienstock: and as we take a dive, you know, we'll be taking Dives down this tree. Okay.

515
01:01:26.280 --> 01:01:33.689
Daniel Bienstock: then, occasionally, occasionally we may terminate the dive early. Okay, we. The game is not over yet.

516
01:01:33.760 --> 01:01:39.279
Daniel Bienstock: But then we evaluate the chances of winning using this first component using that

517
01:01:40.120 --> 01:01:44.029
Daniel Bienstock: very quickly. Okay. And we're going to take many, many such dives.

518
01:01:44.460 --> 01:01:54.359
Daniel Bienstock: We have to discuss how this reinforcement, learning, this reinforcement learning of value networks actually works, and how Monte Carlo Tree search actually works.

519
01:01:54.660 --> 01:01:59.769
Daniel Bienstock: Okay, there's a ton of heuristics in there. This is a very, very interesting.

520
01:01:59.830 --> 01:02:01.640
Daniel Bienstock: They got it to work.

521
01:02:02.020 --> 01:02:14.020
Daniel Bienstock: so we'll see next time how they did this, and then beginning next time, and probably finishing the time after. We'll see how the Deepmind people move from from this setup.

522
01:02:14.419 --> 01:02:17.829
Daniel Bienstock: Which, incorporating all of these things they called Alphago.

523
01:02:17.980 --> 01:02:22.720
Daniel Bienstock: where they moved to something, where they let go. Of the masters there were no more masters.

524
01:02:23.000 --> 01:02:28.380
Daniel Bienstock: Okay, no more, no more training of their initial step.

525
01:02:28.410 --> 01:02:40.079
Daniel Bienstock: using masters data There we go. Instead, they they kind of combine the 2 steps, this one and that one by doing self play.

526
01:02:40.720 --> 01:02:43.960
Daniel Bienstock: They just kept playing against themselves over and over again.

527
01:02:43.980 --> 01:02:47.770
Daniel Bienstock: And as they improved their algorithms they would

528
01:02:48.070 --> 01:02:57.219
Daniel Bienstock: send these improved algorithms into this, a a basket of previously developed algorithms against which they keep playing games.

529
01:02:57.330 --> 01:02:59.059
Daniel Bienstock: okay and always improving.

530
01:03:00.200 --> 01:03:18.869
Daniel Bienstock: And when they completed that task, and we'll see that second, maybe next lecture or the lecture. After that. That's when they got this system. They beat everybody. Okay. They beat all the human masters, you know, by some ridiculous margins like, you know, they would have a championship. I don't know 5 games, and they would win 5 to nothing.

531
01:03:19.190 --> 01:03:23.119
Daniel Bienstock: They beat all their all all other programs, a hundred percent of the time.

532
01:03:23.310 --> 01:03:26.289
Daniel Bienstock: I'm not sure there's any competition today.

533
01:03:26.870 --> 01:03:37.230
Daniel Bienstock: My understanding is that this technology is still getting developed, not necessarily to play silly games, but to do other things, and it's certainly not open source.

534
01:03:38.968 --> 01:03:47.689
Daniel Bienstock: and how much we can divine by reading their papers is probably going to be somewhat limited, but it's still very entertaining.

535
01:03:48.160 --> 01:03:49.250
Daniel Bienstock: Alright.

536
01:03:49.500 --> 01:03:58.489
Daniel Bienstock: that's it for today. And I can see that the electricity company never called me. Hopefully, they're not going to cut power. And

537
01:03:58.906 --> 01:04:05.409
Daniel Bienstock: I'll see you guys on Tuesday. I guess the semester is over, probably for for everybody.

538
01:04:05.470 --> 01:04:13.229
Daniel Bienstock: But we have some ground to cover. You know there are some of these Ml. Related lectures, and then we should do something about Cuda.

539
01:04:13.260 --> 01:04:15.169
Daniel Bienstock: because I promised that we would.

540
01:04:16.450 --> 01:04:18.320
Daniel Bienstock: Okay, that's like.

541
01:04:18.790 --> 01:04:20.800
Daniel Bienstock: Hey, yeah, take care.

542
01:04:22.110 --> 01:04:23.579
Daniel Bienstock: and I'll stay on

543
01:04:24.360 --> 01:04:25.909
Daniel Bienstock: with a blink.

544
01:04:26.130 --> 01:04:26.720
Blake: Yeah, we're.

545
01:04:26.720 --> 01:04:28.440
Daniel Bienstock: We are. We are on next right, Blake.

546
01:04:28.440 --> 01:04:29.550
Blake: Yeah. Yep.

547
01:04:29.550 --> 01:04:31.470
Daniel Bienstock: Alright! Give me 1 min.

548
01:04:31.660 --> 01:04:34.089
Blake: Yeah, that's alright. I was gonna go grab a jacket.

549
01:04:34.210 --> 01:04:35.250
Blake: Sorry. Okay.

550
01:04:35.480 --> 01:04:36.940
Daniel Bienstock: Alright, I'm gonna get a drink.

551
01:04:37.250 --> 01:04:38.330
Blake: Okay. Be right back.

552
01:04:39.830 --> 01:04:41.389
Daniel Bienstock: Let me stop recording.