So the thing that seems kind of amazing to me and to us is that this course was taught just last quarter, and here we are with an enormous number of people again taking this class. I guess that says something; maybe approximately what it says is ChatGPT. But anyway, it's great to have you all. There's lots of exciting content to come, and I hope you'll all enjoy it. So let me get started and tell you a bit about the course before diving straight into today's content. For people still coming in, there are oodles of seats still, right on either side, and especially down near the front there are tons of seats, so do feel empowered to go and seek those out. If people on the corridors are really nice, they could even move towards the edges to make it easier for people, but one way or another, feel free to find a seat. Okay, so this is the plan for what I want to get through today. First of all, I'm going to tell
you about the course for a few minutes, then have a few remarks about human language and word meaning. Then the main technical thing I want to get into today is to start learning about the word2vec algorithm. The word2vec algorithm is slightly over a decade old now; it was introduced in 2013, but it was a wildly successful, simple way of learning vector representations of words, so I want to show you that as a first easy baby system for the kind of neural representations that we're going to talk about in class. We'll then get more concrete with that, looking at its objective function, gradients, and optimization, and then hopefully, if all goes well and I stick to schedule, spend a few minutes playing around in an IPython notebook (I'm going to have to change computers for that) and seeing some of the things you can do with this. Okay, so here are the course logistics in brief. I'm Christopher Manning, hi again everyone. The head TA unfortunately has a bit of a
health problem, so he's not actually here today. We've got a course manager for the course, who is up the back there, and then we've got a whole lot of TAs. If you're a TA who's here, you could stand up and wave or something like that, so people can see a few of the TAs and see some friendly faces. Okay, we've got some TAs here and some others, and you can look at them all on the website. If you're here, you know what time the class is. There's an email list, but preferably don't use it; use the Ed site that you can find on the course website. The main place to go and look for information is the course website, which we've got up here, and that links to Ed, which is what we're going to use as the main discussion board. Please use that rather than sending emails. The first assignment for this class is a sort of easy one; it's the warm-up assignment, but we want to get people busy and doing stuff straight away, so
the first assignment is already live on the web page, and it's due next Tuesday before class, so you have slightly less than seven days left to do it; do get started on that. To help with that, we're immediately starting office hours tomorrow, and they're also described on the website. We also do a few tutorials on Fridays. The first of these is a tutorial on Python and NumPy. Many people don't need that because they've done other classes and covered this, but we try to make this class accessible to everybody, so if you'd like to brush up a bit on Python or how to use NumPy, it's a great thing to go along to, and the TA right over there is going to be teaching it on Friday. Okay, what do we hope to teach? At the end of the quarter, when you get the course evaluation, you'll be asked to rate whether this class met its learning goals. These are my learning goals. What are they? The first one is to teach you about
the foundations of and current methods for deep learning applied to natural language processing. This class tries to build up from the bottom, so we start off doing simple things like word vectors, feed-forward neural networks, recurrent networks, and attention. We then fairly quickly move into the kind of key methods used for NLP in 2024. I wrote down here Transformers and encoder-decoder models; I probably should have written large language models somewhere in this list as well, but then pre-training and post-training of large language models, adaptation, model interpretability, agents, etc. But that's not the only thing we want to do; there are a couple of other things we crucially want to achieve. The second is to give you some understanding of human languages and the difficulties in understanding and producing them on computers. Now, there are a few of you in this class who are Linguistics majors, or perhaps Symbolic Systems majors (yay to the Symbolic Systems majors), but quite a few of the rest of you will never see any linguistics, in the sense of understanding how language works, apart
from this class, so we do want to try and convey a little bit of a sense of what some of the issues are in language structure, and why it has proven quite difficult to get computers to understand human languages even though humans seem very good at learning to understand each other. And then the final thing we want to make it on to is concretely building systems, so that this isn't just a theory class. We want you to leave this class thinking: in my first job, wherever I go, whether it's a startup or big tech or some nonprofit, if there's something they'd like to do, say it would be useful to have a text classification system, or to do information extraction to get some kind of facts out of documents, I know how to build that; I can build that system, because I did CS224N. Okay, here's how you get graded. We have four assignments, mostly one and a half weeks long apart from the first one; they make up almost half the grade. The other half of the grade
is made up of a final project, of which there are two variants, a custom or a default final project, which we'll get to in a minute. And then there are a few percent that go for participation. You have six late days. On the collaboration policy: like all other CS classes, we've had issues with people not doing their own work. We really do want you to learn things in this class, and the way you do that is by doing your own work, so make sure you understand that. For the assignments, everyone is expected to do their own assignments; you can talk to your friends, but you're expected to do your own assignment. For the final project, you can work as a group. Then we have the issue of AI tools. Of course, in this class we love large language models, but nevertheless we don't want you to do your assignments by saying, hey ChatGPT, could you answer question three for me? That is not the way to learn things. If you want to make use of AI as a tool to assist you,
such as for coding assistance, go for it, but we want you to be working out how to answer assignment questions by yourself. Okay, so this is what the assignments look like. Assignment one is meant to be an easy on-ramp, and it's done as a Jupyter notebook. For assignment two, what can I say: here we are at this fine liberal arts and engineering institution, not at a coding boot camp, so we hope that people gain some deep understanding of how things work. So in assignment two we actually want you to do some math and understand how things work in neural networks. For some people, assignment two is the scariest assignment in the whole class, but it's also the place where we introduce PyTorch, which is the software package we use for building neural networks, and we build a dependency parser, which we'll get to later, as something more linguistic. Then for assignments three and four, we move on to larger projects using PyTorch with GPUs, and we'll be making use of Google Cloud. For those two assignments, we look
at doing machine translation and getting information out with Transformers. And then there are the two final project options. Essentially, we have a default final project where we give you a lot of scaffolding and an outline of what to do, but it's still an open-ended project: there are lots of different things you can try to make the system work better, and we encourage you to explore, but nevertheless you're given a leg up by quite a lot of scaffolding. We'll talk about this more, but you can either do that option or come up with totally your own project and do that. Okay, that's the course. Any questions on the course? Yes: for the final project, how are mentors assigned? If you can find your own mentor, that is, you're interested in something and there's someone who's happy to mentor you, then that person can be your mentor. Otherwise, one of the course TAs will be your mentor, and how that person is assigned is that one of the TAs who is in charge of final projects assigns people, and they do the best they can in
terms of finding people with some expertise, while having to divide all the students across the mentors roughly equally. Any other questions? Okay, I'll power ahead. Human language and word meaning. Let me say a little bit about the big picture here. We're in the area of artificial intelligence, and we've got this idea that humans are intelligent, and then there's the question of how language fits into that. This is something there is some argument about, and if you want, you can run off onto social media, read some of the arguments about these things, and contribute to them if you wish. But here is my perhaps biased take as a linguist. You can compare human beings to some of our nearest neighbors, like chimpanzees and bonobos, and one big distinguishing thing is that we have language and they don't. But in most other respects, chimps are very similar to human beings: they can use tools, they can plan how
to solve things, and they've got really good memory; chimps have better short-term memory than human beings do. So in most respects it's hard to show an intelligence difference between chimps and people, except for the fact that we have language. But us having language has been this enormous differentiator. If you look around at what happened on the planet, there are creatures that are stronger than us, faster than us, more venomous than us, that have every possible advantage, but human beings took over the whole place. How did that happen? We had language, so we could communicate, and that communication allowed human ascendancy. So one big role of language is that it allows communication, but I'd like to suggest that's actually not its only role: language, I would argue, has also allowed humans to achieve a higher level of thought. There are various kinds of thoughts you can have without any language involved: you can think about a scene, you can move some bits of furniture around in your mind
and there's no language, and obviously emotional responses like feeling scared or excited happen with no language involved. But I think most of the time when we're doing higher-level cognition, if you're thinking to yourself, oh gee, my friend seemed upset about what I said last night, I should probably work out how to fix that, or maybe I could blah blah blah, we think in language and plan out things, so language has given us a scaffolding to do much more detailed thought and planning. Most recently of all, of course, human beings invented ways to write. Writing is really, really recent. No one really knows how old human languages are; most people think a few hundred thousand years, which is not very long by evolutionary timescales. But writing, we do know, is really recent: writing is about 5,000 years old. And writing proved to be this amazing cognitive tool that gave humanity an enormous leg up, because suddenly it's not only that you could share information and learn from
the people standing within 50 feet of you; you could then share knowledge across time and space. So really, having writing was enough to take us from the Bronze Age, with very simple metalworking, to the kind of mobile phones and all the other technology that we walk around with today, in just a very short amount of time. So language is pretty cool. But one shouldn't only fixate on the knowledge side of language and how that's made human beings great. There's this other side of language, where language is a very flexible system used as a social tool by human beings, so that we can speak with a lot of imprecision and nuance and emotion, and we can get people to understand us; we can set up new ways of thinking about things by using words for them. And languages aren't static: languages change as human beings use them. Languages aren't something that was delivered down on tablets by God; languages are things that humans constructed, and humans change them with each successive generation. Indeed, most of the innovation in language happens among young people, people that are a few years younger than most of you are now, in their early teens going into their twenties. That's a big period of linguistic innovation, where people think up cool new phrases and ways of saying things, and some of those get embedded and extended, and that then becomes the future of the language. Herb Clark, who used to be a psychologist at Stanford and is now retired, had this rather
nice quote: the common misconception is that language use has primarily to do with words and what they mean. It doesn't. It has primarily to do with people and what they mean. Okay, so that's language in two slides for you. Now we'll skip ahead to deep learning. In the last decade or so, we've been able to make fantastic progress in doing more with computers understanding human languages, using deep learning. We'll say a bit more about the history later on, but work on trying to do things with human language started in the 1950s, so it had been going for 60 years or so, and there was some stuff; it's not that nobody could do anything. But the ability to understand and produce language had always been kind of questionable, and it's really in the last decade, with neural networks, that enormous strides of progress have been made, and that's led into the world we have today. One of the first big breakthroughs came in the area of using neural NLP systems for machine translation, and
this started about 2014 and was already deployed live on services like Google Translate by 2016; it was so good that it saw really rapid commercial deployment. Overall, this kind of facility with machine translation means that you're growing up in such a different world to people a few generations back. A few generations back, unless you actually knew the different languages of different people, you had essentially no chance to communicate with them, whereas now we're very close to having something like the Babel fish from The Hitchhiker's Guide to the Galaxy for understanding all languages. It's not a Babel fish, it's a cell phone, but you can have it out between two people and have it do simultaneous translation. It's not perfect, and people keep doing research on this, but by and large it means you can pick anything up from different areas of the world. As you can see, this example is from a couple of years ago, since it's still from the COVID pandemic era, but I can see
this Swahili text from Kenya and say, oh gee, I wonder what that means, stick it into Google Translate, and learn that Malawi lost two ministers to COVID infections, and they died. So we're in a different era of being able to understand stuff. And there are lots of other things we can do with modern NLP. Until a few years ago, we had web search engines where you put in some text; you could write it as a sentence if you wanted to, but it
didn't really matter whether you wrote a sentence or not, because what you got was some keywords that were then matched against an index, and you were shown some pages that might have the answers to your questions. But these days you can put an actual question into a modern search engine, like: when did Kendrick Lamar's first album come out? It can go and find documents that have relevant information, it can read those documents, and it can give you an answer, so it actually becomes an answer engine rather than just something that finds documents that might be relevant to what you're interested in. The way that's done is with big neural networks: for your query, you've got a retrieval neural network that finds passages similar to the query; they might then be re-ranked by a second neural network; and then there'll be a third, reading neural network that reads those passages and synthesizes information from them, which it then returns as the answer. Okay, that gets to about 2018, but then things got more advanced again. It was really around
2019 that people started to see the power of large language models. Back in 2019, those of us in NLP were really excited about GPT-2. It didn't make much of an impact on the nightly news, but it was really exciting in NLP land, because GPT-2, for the first time, meant that here was a large language model that could just generate fluent text. Until then, NLP systems had done a sort of decent job at extracting certain facts out of text, but we'd just never been able to generate fluent text
that was at all good, whereas here's what you could do with GPT-2: you could write something like the start of a story, a train carriage containing controlled nuclear materials was stolen in Cincinnati today, its whereabouts are unknown, and then GPT-2 would just write a continuation: the incident occurred on the downtown train line, which runs from Covington and Ashland stations; in an email to Ohio news outlets, the US Department of Energy said it is working with the Federal Railroad Administration to find the thief, dot dot dot. The way this works is that it conditions on all the past material and, as I show in the very bottom line down here, it generates one word at a time, choosing whatever word it thinks is likely to come next. From that simple method of generating words one after another, it's able to produce excellent text. The thing to notice is that this text is not only formally correct, with the spelling correct and the sentences real sentences, not disconnected garbage.
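To make the one-word-at-a-time idea concrete, here is a toy sketch of that generation loop. This is emphatically not GPT-2: the probability table below is made up purely for illustration, and a real language model conditions a neural network on the entire preceding context rather than just the previous word.

```python
import random

# Toy next-word model: P(next word | previous word), with made-up numbers.
# A real LM like GPT-2 computes this distribution with a neural network
# conditioned on ALL previous words, over a vocabulary of ~50,000 tokens.
next_word_probs = {
    "<s>":       [("the", 0.6), ("a", 0.4)],
    "the":       [("train", 0.5), ("materials", 0.5)],
    "a":         [("train", 1.0)],
    "train":     [("was", 0.7), ("line", 0.3)],
    "materials": [("were", 1.0)],
    "was":       [("stolen", 1.0)],
    "were":      [("stolen", 1.0)],
    "stolen":    [("</s>", 1.0)],
    "line":      [("</s>", 1.0)],
}

def generate(max_words=10, seed=0):
    """Generate text one word at a time, sampling each next word
    from the model's conditional distribution."""
    rng = random.Random(seed)
    words, current = [], "<s>"
    for _ in range(max_words):
        candidates, weights = zip(*next_word_probs[current])
        current = rng.choices(candidates, weights=weights)[0]
        if current == "</s>":          # end-of-text token: stop generating
            break
        words.append(current)
    return " ".join(words)

print(generate())
```

The loop structure (sample a next word, append it, condition on the result, repeat) is the same one GPT-2 uses; only the model computing the distribution differs.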
it actually understands a lot, right? The prompt said there were stolen nuclear materials in Cincinnati, but GPT-2 knows a lot of stuff: it knows that Cincinnati is in Ohio; it knows that in the United States it's the Department of Energy that regulates nuclear materials; it knows that if something is stolen, it's a theft, and that it would make sense for people to get involved with that. The prompt talks about a train carriage, so it talks about the train line and where it goes. It really knows a
lot and can write coherent discourse, like a real story. So that's kind of amazing, but things moved on from there, and now we're in the world of ChatGPT and GPT-4. One of the things we'll talk about later is that this was a huge user success, because now you could ask questions or give it commands and it would do what you wanted, which was further amazing. So here I'm saying: hey, please draft a polite email to my boss Jeremy saying that I would not be able to come into the office for the next two days because my 9-year-old song (that's a misspelling of son, but the system works fine despite it) Peter is angry with me that I'm not giving him much time. And it writes a nice email. It corrects the spelling mistake, because it knows people make spelling mistakes; it doesn't talk about songs, and everything works out beautifully. You can get it to do other things: you can ask it what is unusual about this image. In thinking about meaning, one of the things
that's interesting with these recent models is that they're multimodal and can operate across modalities. A favorite term we coined at Stanford is foundation models, which we use as a generalization of large language models: the same kind of technology used across different modalities, images, sound, various kinds of bioinformatic things, DNA, RNA, seismic waves, any kind of signal, building these same kinds of large models. Another place you can see that is going from text to images. So if I ask for a
picture of a train going over the Golden Gate Bridge (this is now DALL-E 2), it gives me a picture of a train going over the Golden Gate Bridge. This is a perfect time to welcome anyone who's watching this on Stanford Online: if you're on Stanford Online and not in the Bay Area, the important thing to know is that no trains go over the Golden Gate Bridge. But you might not be completely happy with this picture, because it shows the Golden Gate Bridge and a train going over it, but it doesn't show the bay. So maybe I'd like to get the bay in the background, and if I ask for that, look, now I've got a train going over the Golden Gate Bridge with the bay in the background. But this still might not be exactly what you want; maybe you'd prefer something that's a pencil drawing, so I can say a train going over the Golden Gate Bridge, detailed pencil drawing, and I get a pencil drawing. Or maybe it's unrealistic that the
Golden Gate Bridge only has trains going over it now, so maybe it would be good to have some cars as well; I can ask for a train and cars, and we get a train and cars going over it. Now, I actually made these all by myself, so you should be impressed with my generative AI artwork. But these examples are actually a bit old now, because they're done with DALL-E 2, and if you keep up with these things, that's a few years ago; there's now DALL-E 3 and so on, so we can get much fancier things again: an illustration from a graphic novel, a bustling city street under the shine of a full moon, the sidewalks bustling with pedestrians enjoying the nightlife; at the corner stall, a young woman with fiery red hair, dressed in her signature velvet cloak, is haggling with the grumpy old vendor; the grumpy vendor, a tall, sophisticated man wearing a sharp suit and sporting a noteworthy mustache, is animatedly conversing on his steampunk telephone. And pretty much, we're getting all of that. Okay, so let's now get on to starting to
think more about meaning. So what can we do for meaning? If you think of words and their meaning, and you look in a dictionary and ask what meaning means, meaning is defined as the idea that is represented by a word or phrase, the idea that a person wants to express by using words, the idea that is expressed. And in linguistics, if you go and do a semantics class or something, the commonest way of thinking of meaning is somewhat like what's presented up above there: meaning is thought of as a pairing between what's sometimes called a signifier and a signified, but it's perhaps easier to think of as a symbol, a word, and then an idea or thing. This notion is referred to as denotational semantics: the idea or thing is the denotation of the symbol. The same idea of denotational semantics has also been used for programming languages, because in programming languages you have symbols like while and if, and variables, and they have a meaning, and that can be their denotation. So we would say that
the meaning of tree is all the trees you can find around the world. That's a sort of okay notion of meaning, and a popular one, but it's never been very obvious, or at least traditionally it wasn't very obvious, what we could do with that to get it into computers. If you look at the pre-neural world, when people tried to work with meanings inside computers, they had to do something much more primitive: looking at words and their relationships. A very common traditional solution was to make use of WordNet, and WordNet was a kind of fancy thesaurus that showed word relations. It would tell you about synonyms and is-a-kind-of relations: a panda is a kind of carnivore, which is a placental, which is a mammal, and things like that; good has various meanings, the trade good or the sense of goodness, and you could explore with that. But systems like WordNet were never very good for computational meaning. They missed a lot of nuance: WordNet will tell you that proficient is a synonym for good, but if you think about all the things you would say were good, you know, that was a good shot, would you say that was a proficient shot? Sounds kind of weird to me. There's a lot of color and nuance in how words are used. WordNet is also very incomplete: it's missing anything that's cooler, more modern slang (this maybe isn't very modern slang now, but you won't find more modern slang in it either); it's very human-made; etc. It's got a lot of issues. So this led into the
idea of representing meaning differently, and that leads us into word vectors. So when we have words, wicked, badass, nifty, wizard, what do they turn into when we have computers? Well, effectively, words are discrete symbols; they're just some kind of atom or symbol. If we then turn those into something closer to math, the way symbols are normally represented is that you have a vocabulary, and your word is some item in that vocabulary: motel is this word in the vocabulary and hotel is that word in the vocabulary. Commonly, this is what computational systems do: you take all your strings and index them with numbers, and that's the position in a vector that they belong in. And since we have huge numbers of words, we might have a huge vocabulary, so we'll have very big, long vectors. These get referred to as one-hot vectors for representing the meaning of words. But representing words by one-hot vectors turns out to not be a very good way of computing with them.
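As a small illustration of the one-hot idea just described, here is a sketch in NumPy with a toy four-word vocabulary (the words and the vocabulary size are invented for illustration; a real vocabulary might have hundreds of thousands of entries).

```python
import numpy as np

# A tiny toy vocabulary; real systems may have hundreds of thousands of words.
vocab = ["chair", "hotel", "house", "motel"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Localist representation: a vector of zeros with a single 1
    at the word's position in the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel)          # [0. 0. 0. 1.]
print(hotel)          # [0. 1. 0. 0.]
# Any two distinct one-hot vectors are orthogonal: their dot product is 0,
# so this representation carries no notion of word similarity at all.
print(motel @ hotel)  # 0.0
```

That final dot product being zero for every pair of distinct words is exactly the problem discussed next: motel and hotel look no more related than motel and chair.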
it was used for decades, but it turns out to be kind of problematic, and part of why it's problematic is that it doesn't have any natural, inherent sense of the meanings of words; you just have different words, hotel and motel and house and chair. If you think about it in terms of these vector representations, if you have motel and hotel, there's no indication that they're similar; they're just two different symbols with ones in different positions in the vector. Or formally, in math terms, if you take the dot product of these two vectors, it's zero: the two vectors are orthogonal, and they have nothing to do with each other. Now, there are things you can do about that: you can start saying, oh, let me build up some other resource of word similarity, and I'll consult that resource, and it'll tell me that motels and hotels are similar to each other. People did things like that; in web search it was referred to as query expansion. But still, the point is that there's no natural notion of similarity in
one-hot vectors. And so the idea was that maybe we could do better than that, that we could learn to include similarity in the vectors themselves, and that leads into the idea of word vectors, but it also leads into a different way of thinking about semantics. I just realized I forgot to say one thing, back two slides: those kinds of representations are referred to as localist representations, meaning that there's one point at which something is represented, so that here is the representation of motel and here is the representation of hotel; each word is represented in one place in the vector, and that will be different from what we do next. So there's an alternative idea of semantics, which goes back quite a long way. People commonly quote J.R. Firth, a British linguist, who said in 1957, you shall know a word by the company it keeps, but it also goes back to philosophical work by Wittgenstein and others: what you should do is represent a word's meaning by the context in which it appears. So the
words that appear around the word give information about its meaning, and that's the idea of what's called distributional semantics, in contrast to denotational semantics. So if I want to know about the word banking, I say, give me some sentences that use the word banking. Here are some sentences using the word banking: government debt problems turning into banking crises as happened in 2009, etc., etc. The context words that occur around banking, those will become the meaning of banking, and we're going to use those statistics about words and what other words appear around them to learn a new kind of representation of a word. Our new representation of a word is a dense, sort of shorter, dense vector that gives the meaning of the word. Now, my vectors are very short here; these are only eight-dimensional, if I counted right, so I could fit them on my slide. They're not that short in practice; they might be 200 to 2,000 dimensional, but reasonably short. They're not going to be like the half a million dimensions for the half a million
different words in our vocabulary. The idea is that if words have stuff to do with each other, they'll have similar vectors, which corresponds to their dot product being large. So for banking and monetary in my example here, both of them are positive in the first dimension, positive in the second dimension, and negative in the third; on the fourth they've got opposite signs. If we want to work out the dot product, we take the products of the corresponding terms, and the dot product gets bigger to the extent that the corresponding components have the same sign, and bigger if they have large magnitude. Okay, so these are what we call word vectors, which are also known as embeddings or neural word representations or phrases like that. The first thing we want to do is learn good word vectors for different words, and our word vectors will be good if they give us a good sense of the meanings of words: they know which words are similar to other words in meaning. We refer to them as embeddings because we can think of each one as a vector
in a high-dimensional space, so that we're embedding each word as a position in that high-dimensional space, and the dimensionality of the space will be the length of the vector, so it might be something like a 300-dimensional space. Now, that gets problematic, because human beings can't look at 300-dimensional spaces and aren't very good at understanding or visualizing what goes on in them, so the only thing I can show you is two-dimensional spaces. But a thing that is good to have somewhat in your head is that really high-dimensional spaces behave extremely differently from two-dimensional spaces. In a two-dimensional space, you're only near something else if you've got similar x and y coordinates; in a high-dimensional space, things can be very near to all sorts of other things along different dimensions of the space, and so we can capture different senses of words and different ways that words are similar to each other. But here's the kind of picture we end up with: what we're going to do is learn a way to represent all words as
vectors based on the other words that they occur with in context, and we can embed them into this vector space. Of course you can't read anything there, but we can zoom into this space further, and if we zoom in and just show a bit of it, well, here's a part of the space where it's showing country words and some other location words: we've got some countries up the top there, and some nationality terms, British, Australian, American, European, further down. Or we can go to another piece of the space, and here's a bit of the space where we have verbs, and not only have we got verbs, there's actually quite a lot of fine structure here of what's similar that represents things about verbs. So you've got verbs of communication and statement, saying, thinking, expecting, grouping together; come and go group together; down the bottom you've got forms of the verb have; then you've got forms of the verb to be above them; and you've got become and remain, which are actually similar to the verb to be, because they take these sort of complements of state. Just as you can say "I am angry," you can say "he remained angry" or "he became angry," right? So those verbs are, more so than most verbs, similar to the verb to be. We get these kinds of interesting semantic spaces where things that have similar meanings are close to each other, and so the question is how we get to those vectors. And the answer is:
there are various ways of doing it, but the one I want to get through today is word2vec. Okay, I'll pause for 30 seconds for breath. Anyone have a question or anything they want to know? Yes? So this doesn't solve the problem where similar meanings might depend on context, right? The two words each have their own vectors, and we understand similarity from those vectors, but similarity is contextual, right? Because in different contexts those two words may or may not be similar, and this algorithm does not capture that. Yes, correct. So that's a good thought; you can keep it for a few weeks, to some extent. For the first thing we're going to do, we're just going to learn one word vector for a string. So we're going to have a word, let's say it's "star," and we're going to learn one word vector for it, so that absolutely doesn't capture the meaning of a word in context; it won't be saying whether it's meaning a Hollywood star or an astronomical star or something like that. Later on we're going to get on to contextual meaning representations, so wait for that. But, going along with what I said about high-dimensional spaces being weird, the cool thing that we will already find is that our representation for "star" will be very close to the representations for astronomical words like "nebula" and every other astronomical word you know, and simultaneously it'll be very close to words that mean something like a Hollywood star. Help me out, any words that mean something similar? "Celebrity," that's a good
one. Okay, yeah. How are you reducing the embedding to a lower-dimensional space to visualize it? So, the pictures I was showing you used a particular method called t-SNE, which is a nonlinear dimensionality reduction that tends to work better for high-dimensional neural representations than PCA, which you might know, but I'm not going to go into that now. Yes? How do you choose enough dimensions but not too many? I mean, that's something that people have worked on; it depends on how much data you've got to make your representations over. Normally it's worked out either empirically, for what works best, or practically, based on how big a vector you want to work with. To give you some idea, things start to work well when you get to a 100-dimensional space; for a long time people used 300 dimensions because that seemed to work pretty well, but as people have started building huger and huger models with way, way more data, it's now become increasingly common to use numbers like 1,000- or even 2,000-dimensional vectors. Yeah? Okay, so, you mentioned
that there's sort of hidden structure in the small areas as well as the large areas of the embedding, and in different pieces different structures will come up, but generally we seem to use distance as the single metric for closeness, which doesn't seem like it can capture all of that structure, right? So how does that work? Well, we don't only use distance; we also use directions in the space as having semantic meanings, and I'll show an example of that soon. Yeah? The entries seem to be between minus one and one; is there a reason for that, or do we put bounds on them? So, good question. I mean, they don't have to be, and the way we're going to learn them they're not bounded, but you can bound things; sometimes people length-normalize so that the vectors are of length one. At any rate, normally in this work we use some method called regularization that tries to keep the coefficients small, so they're generally not getting huge. Yeah? For a specific word, for example "bank," which we used before in the previous slides, is there a single embedding for each word, or do we have multiple embeddings for each word? For what we're doing at the moment, each word, each string of letters, has a single embedding, and what you can think of that embedding as is kind of an average over all its senses. So for example "bank": it can mean the financial institution, or it can also mean the river bank, and then what I said
before about "star" applies: the interesting thing is you'll find that we're able to come up with a learned representation which, because it's kind of an average of those senses, will end up similar to words that are semantically evoked by both senses. I think I'd probably better go on at this point. Okay, word2vec. So word2vec was this method of learning word vectors that was thought up by Tomas Mikolov and colleagues at Google in 2013. It wasn't the first method; there are other methods of learning word vectors that go back to about the turn of the millennium. It wasn't the last; there are ones that came after it as well. But it was a particularly simple one, and a particularly fast-running one, and so it really caught people's attention. The idea of it is that we start off with a large amount of text, which can just be thought of as a long list of words, and in NLP we refer to that as a corpus. "Corpus" is just Latin for body, so it's exactly the same as if you have a dead person on the floor, right, that's a corpus; it's just a body, but we mean a body of text, not a dead person. If you want to know more about Latin, since there isn't very good classical education these days: "corpus," despite the -us ending, is a fourth-declension neuter noun, and that means the plural of "corpus" is not "corpi"; the plural of "corpus" is "corpora." So I'm sure sometime later in this class I will read a project or assignment that refers to "corpi," and I will know that that person was not paying attention in the first lecture, or else they would have said "corpora"; c-o-r-p-o-r-a is the correct form. Okay, I should move on. So we have our text, and we know that we're going to represent each word, and this is each word type, so "star" or "bank," etc., wherever it occurs, by a single vector
and so what we're going to do in this algorithm is go through each position in the text, and at each position in the text, which is a list of words, we're going to have a center word and words outside it. Then what we're going to do is use the similarity of the word vectors for the center word and the outside words to calculate the probability that they should have co-occurred or not, and then we just keep fiddling, and we learn the word vectors. Maybe I'll just show this more concretely first. So here's the idea: we're going to have a vector for each word type. A word type means the word "problems" wherever it occurs, which is differentiated from a word token, which is this particular instance of the word "problems." So we're going to have a vector for each word type, and I'm going to want to know: look, in this text the word "turning" occurred before the word "into"; how likely should that have been to happen? What I'm going to do is calculate a probability of the word "turning" occurring close to the word "into," and I'm going to do that for each word in a narrow context; in the example here I'm using two words to the left and two words to the right. What I want to do is make those probability estimates as good as possible. In particular, I want the probability of co-occurrence to be high for words that actually do occur within the nearby context of each other. And once I've done it for that word, I'm going to go along and do exactly the same thing for the next word, and so I can continue through the text in that way. So what we want to do is come up with vector representations of words that will let us predict these probabilities "well." Now, there's a huge limit to how well we can do it, because we've got a simple model; obviously, when you see the word "banking," I can't tell you that the word "into" is going to occur before "banking,"
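The sliding-window procedure just described can be sketched in a few lines of Python (toy corpus and window size chosen for illustration only):

```python
# For every position in the corpus, take a center word and up to m words
# on each side of it as the "outside" (context) words.
corpus = "problems turning into banking crises as".split()
m = 2  # window size: two words to the left and two to the right

pairs = []  # (center word, outside word) pairs
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

# e.g. when "into" is the center word, its outside words are
# "problems", "turning", "banking", and "crises"
print([o for c, o in pairs if c == "into"])
```

These (center, outside) pairs are exactly the co-occurrence events whose probabilities the model will try to make high.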
but I want to do it as well as possible. So what I want my model to say is that after the word "banking," "crisis" is pretty likely, but the word "skillet" is not very likely, and if I can do that, I'm doing a good job. So we turn that into a piece of math, and here's how we do it. We're going to go through our corpus, every position in the corpus, and we're going to have a fixed window size m, which was two in my example, and then what we're going to want to do is have the probability of the words in the context be as high as possible. So we want to maximize this likelihood, where we go through every position in the text, and then through every word in the context, wanting to make this big. Conceptually, that's what we're doing, but in practice we never quite do that; we use two little tricks here. The first one is that, for completely arbitrary reasons (it really makes no difference), everyone got into minimizing things rather than maximizing things, and so the algorithms that we use get referred to as gradient descent, as you'll see in a moment. So the first thing we do is put a minus sign in front, so that we can minimize it rather than maximize it; that part's pretty trivial. The second part is that here we have this enormous product, and working with enormous products is more difficult for the math, so the second thing that we do is introduce a logarithm: once we take the log of the likelihood, the products turn into sums, and so now we can sum over each word position in the text, sum over each word in the context window, and add up these log probabilities. We've still got the minus sign in front, so we want to minimize the sum of the log probabilities; that is, we're looking at the negative log likelihood. And the final thing that we do is, since this will get bigger depending on the number of words in the corpus, we divide through by the number of words in the corpus, and so our objective function is the average negative log likelihood. By minimizing this objective function, we're maximizing the probability of words in the context. Okay, we're almost there. We've got a couple more tricks that we want to get through. The next one is: well, I've said we want to maximize this probability, but how do we maximize this probability? What is this probability? We haven't defined how we're going to calculate
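Written out (this is the standard word2vec formulation matching the slides; θ stands for all the word vectors, T for the number of words in the corpus, and m for the window size):

```latex
\text{Likelihood: } L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)
\qquad
\text{Objective: } J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t;\, \theta)
```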
this probability, and this is where the word vectors come in: we're going to define this probability in terms of the word vectors. We're going to say each word type is represented by a vector of real numbers, say 100 real numbers, and we're going to have a formula that works out the probability simply in terms of the vectors for each word; there are no other parameters in this model. So over here I've shown this theta, which is the parameters of our model, and all and only the parameters of our model are these word vectors for each word in the vocabulary. That's a lot of parameters, because we have a lot of words and we've got fairly big word vectors, but they are the only parameters. And how we do that is by using this little trick here: we're going to say the probability of an outside word given a center word is defined in terms of the dot product of the two word vectors. So if things have a high dot product, they'll be similar, and therefore they'll have a high probability of co-occurrence, where I mean similar in kind of a weird sense, right? It is the case that we're going to want to say "hotel" and "motel" are similar, but it's also the case that we're going to want to have the word "the" able to appear easily before the word "student," so in some weird sense "the" also has to be similar to "student"; "the" has to be similar to basically any noun, right? Okay, so we're going to work with dot products, and then we do this funky little bit of math here, and that will
give us our probabilities. Okay, so let's just go through the funky bit of math. Here's our formula for the probabilities. What we're doing here is starting off with this dot product. The dot product is: you take the two vectors, you multiply each pair of corresponding components, and you sum them; if a pair has the same sign, that increases your dot product, and if they're both big, it increases it a lot. So that gives us a similarity between two vectors, and that's unbounded; it's just a real number, it can be either negative or positive. But what we'd like to get out is a probability. So, for our next tricks, we first of all exponentiate, because if we take e to the x, for any x, we're guaranteed to get something positive out, right? That's what exponentiation does. And then, since it's meant to be a probability, we'd like it to be between 0 and 1, and so we turn it into numbers between 0 and 1 in the dumbest way possible, which is we just normalize: we work out the quantity in the numerator for every possible context word, we get the total of all of those numbers, and we divide through by it, and then we're getting a probability distribution over how likely different words are in this context. This little trick that we're doing here is referred to as the softmax function. For the softmax function, you can take unbounded real numbers, put them through this little softmax trick that we just went through the steps of, and what you'll get out is a probability distribution. So I'm now getting, in
this example, a probability distribution over context words: my probability estimates over all the context words in my vocabulary will sum up to one, by definition, by the way that I've constructed this. It's called the softmax function because it amplifies the probabilities of the largest things (that's because of the exp function), but it's "soft" because it still assigns some probability to smaller items. It's sort of a funny name, though, because when you think about max, max normally picks out just one thing, whereas the softmax is turning a bunch of real numbers into a probability distribution. This softmax is used everywhere in deep learning: any time we're wanting to turn things that are just vectors in R^n into probabilities, we shove them through a softmax function. Okay, so in some sense this part, I think, still seems very abstract, and the reason it seems very abstract is that I've sort of said we have vectors for each word, and using these vectors we can then calculate probabilities, but where do the vectors
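The exponentiate-then-normalize recipe just described is easy to write down; here is a minimal numpy sketch with made-up dot-product scores for a tiny four-word "vocabulary":

```python
import numpy as np

def softmax(scores):
    # Exponentiate (everything becomes positive), then normalize to sum to 1.
    # Shifting by the max first is a standard numerical-stability trick;
    # it doesn't change the result.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# made-up scores u_w . v_c for four candidate context words
scores = np.array([2.0, 1.0, 0.0, -1.0])
probs = softmax(scores)

print(probs)  # the largest score gets the largest probability
print(probs.sum())  # sums to 1: a valid probability distribution
```

Note how the largest score takes most of the probability mass ("max"-like), yet every word still gets some probability ("soft").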
come from? And the answer to where the vectors are going to come from is: we're going to turn this into an optimization problem. We have a large amount of text, and so we can hope to find word vectors that make the probabilities of the contexts of the words in our observed text as big as possible. Literally, what we're going to do is start off with random vectors for every word, and then we want to fiddle those vectors so that the calculated probabilities of words in context go up, and we're going to keep fiddling until they stop going up anymore and we're getting the highest probability estimates that we can. The way that we do that fiddling is we use calculus. What we're going to do is conceptually exactly what you'd do in something like a two-dimensional space, like the picture on the right: if you want to find the minimum in this two-dimensional space, and you start off at the top left, what you can do is say, let me work out the derivatives of the function at the top left, and they point sort of down and a bit to the right, so you can walk down and a bit to the right; and then you can say, okay, given where I am now, let me work out the derivatives again, what direction do they point? They're still pointing down, but a bit more to the right, so you can walk a bit further that way, and you can keep on walking, and eventually you'll make it to the minimum of the space. In our case, we've got a lot more
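That walking-downhill procedure is gradient descent. A tiny two-dimensional sketch (the bowl-shaped function and step size here are illustrative only; word2vec applies the same idea over millions of parameters):

```python
import numpy as np

# Walk downhill on f(x, y) = x^2 + 2*y^2, whose gradient is (2x, 4y).
def grad(p):
    return np.array([2 * p[0], 4 * p[1]])

p = np.array([-4.0, 3.0])   # start at the "top left"
lr = 0.1                    # step size (learning rate)
for _ in range(100):
    p = p - lr * grad(p)    # step in the direction of steepest descent

print(p)  # very close to the minimum at (0, 0)
```

Each step reads the local gradient and moves a little way against it, exactly the "work out the derivatives, walk that way, repeat" loop described above.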
than two dimensions. Our parameters for our model are the concatenation of all the word vectors, but it's even slightly worse than I've explained up until now, because actually, for each word, we assume two vectors: one vector for when it's the center word and one vector for when it's an outside word. Doing that just makes the math a bit simpler, which I can explain later. So if we had, say, 100-dimensional vectors, we'll have 100 parameters for "aardvark" as an outside word, 100 parameters for "a" as an outside word, all the way through to 100 parameters for "zebra" as an outside word; then we'd have 100 parameters for "aardvark" as a center word, continuing down. So if we had a vocabulary of 400,000 words and 100-dimensional word vectors, that means we'd have 400,000 × 2 = 800,000 vectors, times 100, so we'd have 80 million parameters. That's a lot of parameters in our space to try and fiddle with to optimize things, but luckily we have big computers, and that's the kind of thing they do. So we simply say: this is our
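The parameter count just mentioned is easy to check (vocabulary size and dimensionality are the figures from the lecture):

```python
# Two vectors per word (one as center word, one as outside word),
# a vocabulary of 400,000 words, and 100 dimensions per vector.
V, d = 400_000, 100
n_params = 2 * V * d
print(n_params)  # 80,000,000 parameters
```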
optimization problem: we're going to compute the gradients with respect to all of these parameters, and that will give us the answer. And you know, this feels like magic; it doesn't really seem like we could just start with nothing, start with random word vectors and a pile of text, and say, do some math, and we will get something useful out. But the miracle of what happens in these deep learning spaces is that we do get something useful out: we can just optimize over all of the parameters, and then we'll get something useful out. So, I guess I'm not going to quite get to the end of what I'd hoped to today, but what I wanted to do is take a few minutes to go through concretely how we do the math of minimization. Now, lots of different people take CS224N, and some of you know way more math than I do, so this next ten minutes might be extremely boring; if that's the case, you can either catch up on Discord or Instagram or something, or else you can leave. But it turns out there are other people that take CS224N who can't quite remember when they last did a math course, and we'd like everybody to be able to learn something about this, so I do actually like, in the first two weeks, to go through it a bit concretely. So let's try to do this. This was our likelihood,
and then we'd already covered the fact that what we were going to do is have an objective function in terms of our parameters, which was the average negative log likelihood across all the words. If I remember the notation for this: it's the sum over each position t in the text (I'll probably have a hard time writing this; I've got a more neatly written-out version of it on the version of the slides that's on the website), and then, for each offset j in the window, we're taking the log of the probability of the word at position t + j, given the word w_t. Okay, trying to write this on my iPad is not working super well, I'll confess; we'll see how I get on. Okay, and so then we had the form of what we wanted to use for the probability: the probability of an outside word given a center word was this softmaxed equation, where we're taking the exp of the dot product of the outside vector and the center vector, over the normalization term, where we sum over the vocabulary. Okay. So, to work out how to change our parameters, where our parameters are all of these word vectors that we summarize inside theta, what we're then going to want to do is work out the partial derivative of this objective function with respect to all the parameters theta. In particular, I'm just going to start here with the partial derivatives with respect to the center word vector v_c, and we can work through the outside word vectors separately. Well, this partial derivative is a big sum of terms like this, and when I have a partial derivative of a big sum of terms, I can work out the partial derivatives of each term independently and then sum them. So what I want to be doing is working out the partial derivative of the log of this probability, log P(o | c), with respect to the center vector. At this point I have a log of two things being divided, and that means I can separate that out into the log of the numerator minus the log of the denominator. So what I'll be doing is working out the partial derivative, with respect to the center vector, of the log of the numerator, log exp(u_o^T v_c), minus the partial derivative, with respect to the center vector, of the log of the denominator, which is the log of the sum, from w = 1 to V, of exp(u_w^T v_c). Okay, I'm having real trouble writing here; look at the slides, where I wrote it neatly at home. So I want to work with these two terms. Now, at this point, part of it is easy, because here I just have a log of an exponential, and those two functions just cancel out and go away, and so then I want to get the partial derivative of u_outside transpose v_center with respect to v_center, and the answer that comes out is just u_o. Maybe you remember that, but if you don't, the thing to think about is: this is a whole vector, right? We've got a vector here and a vector here, so what this is going to look like is u1·v1 + u2·v2 + u3·v3, and so on, and what we're going to want to do is work out the partial derivative with respect to each element v_i. If you think of the single-element derivative with respect to v1, well, it's going to be just u1, because every other term goes to zero; and then if you work it out with respect to v2, it would be just u2, and every other term goes to zero. Since you keep on doing that along the whole vector, what you're going to get out is the vector (u1, u2, u3, …), the whole vector u_o. Okay, so that part is easy. But then we also want to work out the partial derivative of the other term, and at that point I maybe have to go to another slide. So we then want the partial derivative, with respect to v_c,
of the log of the sum, from w = 1 to V, of exp(u_w^T v_c). Right, so at this point things aren't quite so easy, and we have to remember a little bit more calculus. In particular, what we have to remember is the chain rule. Here we have an inside function: we've got a function g of v_c, whose output we might call z, and then we put an extra function f outside that. When we have something like that, what we get for the derivative of f with respect to v_c is the derivative of f with respect to z, times the derivative of z with respect to v_c. Right, that's the chain rule. So we're going to apply that here. First of all, we take the derivative of log, and the derivative of log x is 1/x; you have to remember that, or look it up, or get Mathematica to do it for you, or something like that. So we're going to have one over the inside z part, the sum from w = 1 to V of exp(u_w^T v_c), and then that's going to be multiplied by the derivative of the inside part: the derivative, with respect to v_c, of the sum from w = 1 to V of exp(u_w^T v_c). Okay, so that's made us a little bit of progress, but we've still got something to do here, and what we're going to do is notice: oh wait, we're again in a place to run the chain rule again. So now we've got this function, and first of all we can move the sum to the outside, right, because we've got a sum of terms from w = 1 to V, and so we want to work out the derivative of the inside piece with respect to v_c (sorry, I'm doing this kind of informally, just doing this piece now). Okay, so this again gives us an f of a function g, and we're going to again want to split the pieces up and use the chain rule one more time. So we're going to have the sum, from x = 1 to V, and now we have to know what the derivative of exp is, and the derivative of exp is exp, so that will be exp(u_x^T v_c); and then we're taking the derivative of the inside part with respect to v_c, of u_x^T v_c. Well, luckily, this was the bit that we already knew how to do, because we worked it out before, and so this is going to be the sum, from x = 1 to V, of exp(u_x^T v_c) times u_x. Okay, so then at this point we want to combine these two pieces together: this part that we worked out and this piece here that we've worked out. If we combine them with what we worked out on the first slide for the numerator (we have the u_o, which was the derivative of the numerator), then for the derivative of the denominator we're going to have on top this part, and on the bottom we're going to have that part, and so we can rewrite that as the sum from x = 1 to V of exp(u_x^T v_c) times u_x, over the sum from w = 1 to V of exp(u_w^T v_c). Okay, so we can rearrange things in that form, and then, lo and behold, we find that we've recreated here the form of the softmax equation, so we end up with u_o minus the sum, from x = 1 to V, of the probability of x given c, times
u_x. So what this is saying is that we've got this quantity which takes the actual observed u vector and compares it to the weighted prediction: we're taking the weighted sum of our current u_x vectors, based on how likely we thought they were to occur. This is a form that you see quite a bit in these kinds of derivations: you get observed minus expected, where the expected is the weighted average. What you'd like is for your expectation, the weighted average, to be the same as what was observed, because then you'll get a derivative of zero, which means that you've hit an optimum. So that gives us the form of the derivative with respect to the center vector parameters; to finish it off, you'd have to then work it out also for the outside vector parameters. But hey, it's officially the end of class time, so I'd better wrap up quickly now. The deal is: we're going to work out all of these derivatives for each parameter, and then these derivatives will give us a direction to change the numbers, which will let us find good word vectors automatically. I do want you to understand how this works, but fortunately you'll find out very quickly that computers will do this for you on a regular basis; you don't actually have to do it yourself. More about that on Thursday. Okay, see you, everyone.
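For reference, the derivation sketched on the iPad in the last part of the lecture, written out cleanly in the lecture's notation (v_c the center vector, u_o the observed outside vector, V the vocabulary size):

```latex
\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
  &= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\
  &= \frac{\partial}{\partial v_c}\, u_o^\top v_c
     \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \\
  &= u_o \;-\; \frac{\sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\
  &= u_o \;-\; \sum_{x=1}^{V} P(x \mid c)\, u_x
\end{aligned}
```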