<h1>Our Future with LLMs</h1>
<p>Will Wolf, 2023-08-19</p>
<p>In this post, I explore the evolving world of large language models (LLMs), considering how they learn, the future of human-LLM conversations, the hallucination problem, compensating data providers, the potential lucrativeness of data annotation, and the advent of a new Marxist struggle.</p><h1>What is a Language Model?</h1>
<p>To introduce the language model, let's talk about human speech. If I were to say, "The boy put on his boots and went to the...," what word comes next? Well, there are many that fit. As an English speaker, you could surely list a few: "store," "park," "bar," even "best"—as in "best ice cream spot in town." Conversely, many words could <em>never</em> follow this phrase, like "don't," "photosynthesis," or "trigonometry."</p>
<p>And how do you know this? How can you be sure? It's because you have a "model of language" in your brain, a "language model" that you've acquired over your lifetime. The more language you ingest through interacting with the world, the better your model becomes.</p>
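<p>This intuition can be sketched as a toy "count the continuations" model. The corpus below is invented, and real language models use neural networks over vastly larger datasets; this is only a minimal illustration of predicting the next word from observed frequencies.</p>

```python
from collections import Counter, defaultdict

# A tiny invented corpus, tokenized into words.
corpus = (
    "the boy put on his boots and went to the store . "
    "the boy put on his boots and went to the park . "
    "the girl put on her coat and went to the store ."
).split()

# Count bigram transitions: word -> Counter of the words that follow it.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word` in the corpus."""
    return transitions[word].most_common(1)[0][0]

print(predict_next("went"))  # "to" always follows "went" in this corpus
```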
<h1>How does ChatGPT know so much?</h1>
<p>In the past 30 years, internet users have unwittingly built the largest, broadest, most diverse, most interesting dataset in human history from which to learn machine learning models. These data come in various forms, or "modalities," like images from Instagram, videos from YouTube, audio from various platforms, and text from Wikipedia, Reddit, and more.</p>
<p>ChatGPT is learned from text. Whereas you've trained your language model only on the language you've encountered in the handful of decades for which you've been alive, ChatGPT has been trained on a large chunk of all text ever written on the internet, period. For comparison, Quora<sup id="fnref:27"><a class="footnote-ref" href="#fn:27">27</a></sup> users' estimates of how long it would take a human to read this much text range from roughly "23.8 million years" to "you can't, so it's an impossible question" to "you would be dead long before you even made a start." This makes it very good at predicting the next word in a phrase, such as our earlier example about the boy and his boots. More generally, it can skillfully continue almost <em>any</em> text, such as "Can you recite me the Indonesian national anthem in the style of Shakespeare?" or "What should I make for dinner if I only have salmon and chocolate?" or "What's the best way to get to the moon?"</p>
<h1>How does ChatGPT work?</h1>
<p>Creating ChatGPT involves three steps<sup id="fnref:20"><a class="footnote-ref" href="#fn:20">20</a></sup>:</p>
<ol>
<li><strong>Train a language model:</strong> Given a phrase, teach the model how to output, i.e. "predict," the next word. (Repeating this process ad infinitum, and appending the predicted word to the end of the initial phrase each time, it can generate a complete response.)</li>
<li><strong>Fine-tune on <code>(prompt, response)</code> pairs:</strong> Humans provide both parts of these pairs, giving concrete demonstrations of how to complete the tasks the model will be asked to perform.</li>
<li><strong>Further fine-tune via a model of output quality.</strong> Humans rate the quality of ChatGPT's outputs, then a second model learns these relationships, then ChatGPT learns to output high-quality responses via this second model. This process is known as "Reinforcement Learning from Human Feedback"<sup id="fnref:24"><a class="footnote-ref" href="#fn:24">24</a></sup> (RLHF).</li>
</ol>
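<p>The reward-modeling idea in step 3 can be sketched with a toy Bradley-Terry model: annotators say which of two responses they prefer, and we fit a scalar quality score per response by gradient ascent. The responses and preference pairs below are invented, and real RLHF trains a neural reward model and then optimizes the LLM against it; this is only a minimal sketch of the principle.</p>

```python
import math

responses = ["helpful answer", "vague answer", "rude answer"]
# (winner_index, loser_index) pairs from hypothetical human annotators.
preferences = [(0, 1), (0, 2), (1, 2), (0, 1), (0, 2)]

scores = [0.0] * len(responses)
lr = 0.1
for _ in range(500):
    for win, lose in preferences:
        # P(win preferred over lose) under the current scores (sigmoid of the gap).
        p = 1.0 / (1.0 + math.exp(scores[lose] - scores[win]))
        # Gradient ascent on the log-likelihood of the observed preference.
        scores[win] += lr * (1.0 - p)
        scores[lose] -= lr * (1.0 - p)

best = max(range(len(responses)), key=lambda i: scores[i])
print(responses[best])  # the response annotators consistently preferred
```

<p>Once such a reward model exists, the LLM itself is fine-tuned to produce responses that the reward model scores highly.</p>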
<h1>This is a story about data</h1>
<p>Throughout my career, I've learned that almost every ML problem is a story about data. Where does it come from? What's wrong with it? How much does it cost? How do we de-bias it? How do we get more of it? Which data should we label next? And so on. ChatGPT is no different.</p>
<p>With this in mind, here are a few key points regarding where we stand today:</p>
<p><strong>Static knowledge</strong>: ChatGPT's language model (GPT-4) has been trained on a large chunk of the written internet, dated through September 2021 (a limitation users now know by heart, given how often the system restates it). Encapsulated in these data is the knowledge required to solve a substantial number of <em>static-knowledge</em> tasks. For example, the model can summarize news articles; as the nature of summarization doesn't really evolve year over year, the model does not per se require additional data to accomplish this task. It has enough already.</p>
<p><strong>Dynamic knowledge</strong>: Conversely, imagine that we'd like to translate classical Greek into modern English. Well, what does "modern" mean? Language constantly evolves<sup id="fnref:28"><a class="footnote-ref" href="#fn:28">28</a></sup> to include new vocabulary and modes of expression. So, while the fundamentals of translation don't really change, the contemporary details do. To keep pace with these details, the model needs to be updated with examples of this text. Ten years ago, I surely wasn't saying "that's lit" myself.</p>
<p><strong>Novel knowledge</strong>: Finally, novel knowledge defines the set of tasks or abilities that the model has never encountered. For instance, a novel discovery in physics, e.g. room-temperature superconductivity<sup id="fnref:29"><a class="footnote-ref" href="#fn:29">29</a></sup>, is an example of <em>dynamic knowledge</em> if this work is an <em>extension</em> of the scientific knowledge, logical reasoning, historical expectations, etc. that the model already possesses. Conversely, this discovery is an example of <em>novel knowledge</em> if it is predominantly composed of never-before-seen ways of conceptualizing the world, e.g. "a new mathematics," alien anatomy, etc.</p>
<p>The vast majority of knowledge is either static or dynamic. However, for completeness, we leave a small sliver of space for novel knowledge as well.</p>
<p><strong>Human annotators</strong>: Human annotators (paid and trained by OpenAI) have provided the data required for the supervised fine-tuning and RLHF steps. Should we wish to expand the "foundational" set of tasks that we explicitly want the model to solve, or update our "preference" regarding the way in which the model expresses itself, we'll need more annotations.</p>
<h1>A menu of questions</h1>
<p>In this post, I explore our future with LLMs from the perspective of <em>data</em>. I'll do so by asking and answering a series of questions—touching on the methods, the players, the economics, and the power struggles that potentially lie ahead.</p>
<ol>
<li><strong>How will LLMs learn new information?</strong></li>
<li><strong>What will we do with human-LLM conversations?</strong></li>
<li><strong>How do we solve the "hallucination" problem?</strong></li>
<li><strong>How will we compensate data providers?</strong></li>
<li><strong>Will data annotation be lucrative?</strong></li>
<li><strong>Is this the new Marxist struggle?</strong></li>
</ol>
<p>Let's begin.</p>
<h2>How will LLMs learn new information?</h2>
<p>I work as a software engineer. If you don't, you might believe that all engineers have committed to memory the knowledge and syntax required to solve every task we ever encounter. Alas, we haven't. Instead, we commonly use "question-answer" sites like Stack Overflow<sup id="fnref:30"><a class="footnote-ref" href="#fn:30">30</a></sup> to see how other developers have solved the problem at hand. Before ChatGPT, I used Stack Overflow almost daily.</p>
<p>Several sources posit<sup id="fnref:31"><a class="footnote-ref" href="#fn:31">31</a></sup> that ChatGPT is directly cannibalizing Stack Overflow traffic. Personally, I don't find the statistics they provide particularly convincing. So, let's use an anecdote instead: since I started using ChatGPT ~4 months ago, I have not been on Stack Overflow once. Why wait for human responses when ChatGPT responds instantly?</p>
<p>If other developers do the same, Stack Overflow "freezes." In other words, no new human programming knowledge is published at all. Like language translation, coding implies translating a set of logical expressions in one's brain into machine-readable instructions. Across the spectrum of present and future programming languages, the fundamentals generally don't change. However, the details do. In this vein, being a software engineer (as a human, or an AI model) is a dynamic-knowledge task. When the next programming language hits the market, how will developers know how to use it?</p>
<p>Let's consider Mojo<sup id="fnref:32"><a class="footnote-ref" href="#fn:32">32</a></sup>, a new programming language built specifically for AI developers. As Mojo is a superset of Python, our knowledge about Python will still apply. However, Mojo will bring new features that we haven't seen before. Simply put, how will ChatGPT learn how to read and debug Mojo?</p>
<p>For reading, the answer might be simple: include the Mojo documentation<sup id="fnref:33"><a class="footnote-ref" href="#fn:33">33</a></sup> in the model's training set. This provides a basis for understanding the syntax and semantics of Mojo.</p>
<p>For debugging, it's more subtle. GitHub Copilot X<sup id="fnref:34"><a class="footnote-ref" href="#fn:34">34</a></sup>—the LLM-powered tool that helps you write and debug code—will now capture and send your terminal context back to OpenAI itself. As such, with this beta, the LLM is actively "acquiring new data" on the workflows, questions, patterns, etc. inherent in programming in Mojo.</p>
<p>Taken together, for the model to update its understanding of dynamic and novel knowledge tasks, we must provide it with new data. In what follows, we codify the nature of this provision along three main axes: implicit vs. explicit, quality, and velocity.</p>
<h3>Implicit vs. explicit</h3>
<p>The nature of data provision will range from implicit to explicit. Capturing the Mojo developer's terminal context is an example of implicit data provision. Curating model training examples about how to resolve common Mojo errors is an example of explicit data provision. Answering empirical questions on Stack Overflow has elements of both.</p>
<p>Broadly, implicit data will be easier to collect as there's more of it to go around, and vice versa.</p>
<h3>Quality</h3>
<p>In the three cases just mentioned, we suppose that humans act "rationally," meaning that they earnestly try to produce the "right information" in order to solve their problem. However, the quality of these data varies across each case—as a function of who is providing it, and what their incentives and requirements are.</p>
<p>In the "capturing terminal context" setting—implemented naively—we are capturing information from <em>all</em> developers. Some might be good, others bad. While most developers are likely "trying to solve their problem," or "debugging code until it works," the quality of this information will vary as a function of their skills.</p>
<p>In Stack Overflow, the same "available to all" feature applies. However, there is both an additional social pressure placed on users of the site to provide correct information—people don't want to look silly in front of their peers—as well as an explicit feedback mechanism—answers deemed correct get "upvoted," and vice versa. Nominally, these constraints increase data quality.</p>
<p>Finally, we assume that the "manually curate training set examples" setting gives the highest quality data of the three. Why? A company is paying a human to explicitly teach the model information. Before annotation, they ensure this human has the right qualifications; after annotation, they likely review the results. Taken together, the more training and scrutiny, the higher the quality.</p>
<h3>Velocity</h3>
<p>Finally, across the three settings, the <em>speed</em> with which we can generate a large and diverse dataset varies. In manual curation, it's slowest (single human). On Stack Overflow (many human-human interactions), it's faster. In the terminal (many human-machine interactions), it's fastest (and probably by a lot). A "many machine-machine interactions" setup, e.g. copies of a reinforcement learning system where an AI plays the part of the programmer, gets feedback from the language, and iterates, all running in parallel, would be even faster...</p>
<h3>So where do I get data?</h3>
<p>With the above in mind, companies will seek out data sources that make an optimal trade-off between: the pragmatic implications of collecting implicit vs. explicit data, the quality of the data provided, and the speed with which it is generated. Broadly, implicit data will be easier to collect, of lower quality, and higher velocity. Explicit data will be harder to collect, of higher quality, and lower velocity.</p>
<p><strong>Overall, companies will need to identify the "data provision venue" that makes the right trade-offs for them and their model.</strong> Then, they'll need to be strategic about how to "drop their net" into this stream and catch the passing fish.</p>
<h2>What will we do with human-LLM conversations?</h2>
<p>ChatGPT already boasts a staggering 100 million users<sup id="fnref:35"><a class="footnote-ref" href="#fn:35">35</a></sup>. An enormous quantity of human-LLM conversations are taking place daily. What do we do with these interactions? What valuable data can be gleaned from these conversations, and how can OpenAI use this information?</p>
<h3>Product discovery</h3>
<p>To start, OpenAI could use these conversations to determine how people are using the model, and what new products to therefore build. For instance, we could summarize the conversations, embed the summaries, cluster them, then use each cluster's centroid to generate a cluster label. In this way, we can start to understand what people are using the model for. Furthermore, ChatGPT could be trained to proactively ask users for feedback or suggestions, gauging their interest in potential new features or products.</p>
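<p>This summarize-embed-cluster-label pipeline can be sketched in miniature. The summaries below are invented; a real system would use learned embeddings rather than bag-of-words vectors, and an LLM rather than top centroid terms to name each cluster. This is only a sketch of the shape of the pipeline.</p>

```python
from collections import Counter
import math

# Hypothetical conversation summaries (two topics: debugging, cooking).
summaries = [
    "debugging python code errors",
    "fixing python import errors",
    "salmon dinner recipe ideas",
    "quick dinner recipe with salmon",
]

vocab = sorted({w for s in summaries for w in s.split()})

def embed(text):
    """Bag-of-words vector over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

# Two-cluster k-means, seeded deterministically with one summary per topic.
centroids = [embed(summaries[0]), embed(summaries[2])]
for _ in range(10):
    clusters = [[], []]
    for s in summaries:
        v = embed(s)
        nearest = 0 if math.dist(v, centroids[0]) <= math.dist(v, centroids[1]) else 1
        clusters[nearest].append(v)
    centroids = [
        [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
        for i, c in enumerate(clusters)
    ]

def label(centroid, top=2):
    """Name a cluster by its centroid's heaviest vocabulary terms."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return {w for _, w in ranked[:top]}

print(label(centroids[0]), label(centroids[1]))
```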
<h3>Reinforcement learning</h3>
<p>In another approach, we could use conversation outcomes as signals for reinforcement learning. For instance, annotators could label successful and unsuccessful conversations, and these labels could be used as rewards in algorithms like RLHF. This feedback could also be supplemented by binary labels generated by the model itself.</p>
<h3>Learning optimal prompts</h3>
<p>Lastly, by associating tasks with clusters of conversations and their descriptions, the system could learn to generate optimal prompts for those tasks. For instance, we could generate a prompt, have two LLMs engage in a conversation based on this prompt, score the result using our reward model, then update the prompt-generation policy itself.</p>
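<p>A degenerate version of this loop can be sketched as search over prompt fragments scored by a reward function. Everything below is invented for illustration: instead of two LLMs conversing and a learned reward model, we use a made-up heuristic and exhaustive search over a tiny prompt space. A real system would sample prompts, run conversations, score them, and update a generation policy.</p>

```python
from itertools import product

# Hypothetical prompt "pieces" to combine into candidate prompts.
pieces = {
    "opener": ["summarize", "please summarize", "give me"],
    "target": ["this article", "the following article"],
    "style": ["in one sentence", "briefly", "in detail"],
}

def reward(prompt):
    """Stand-in reward model: favor polite, specific, concise prompts."""
    score = 0.0
    score += 1.0 if "please" in prompt else 0.0        # politeness
    score += 1.0 if "one sentence" in prompt else 0.0  # specificity
    score -= 0.5 if "in detail" in prompt else 0.0     # verbosity penalty
    return score

candidates = (" ".join(combo) for combo in product(*pieces.values()))
best = max(candidates, key=reward)
print(best)  # "please summarize this article in one sentence"
```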
<p><strong>Taken together, companies will use human-LLM conversations to</strong>, among other things, discover and prioritize novel applications and products, improve the model, and improve the experience of using the model itself.</p>
<h2>How do we solve the hallucination problem?</h2>
<p>"Hallucination" is when an LLM says things that have no basis in fact or reality. If we knew <em>when</em> the model did this, we could simply restrict those outputs; if we knew <em>why</em>, we could design better models that hallucinate less. Unfortunately, the answers to these questions remain elusive<sup id="fnref:17"><a class="footnote-ref" href="#fn:17">17</a></sup>.</p>
<p>Retrieval<sup id="fnref:37"><a class="footnote-ref" href="#fn:37">37</a></sup> models select outputs from a fixed "menu" of choices. In this way, we implicitly "solve" the hallucination problem by explicitly restricting a priori what the model can and can't "say." Generative models, on the other hand, make a different trade-off: by allowing the model to generate novel content <em>ex nihilo</em>, we forfeit some of this control.</p>
<p>Paying rational human annotators to "correct" all recorded hallucinations would likely improve this situation. However, as history has shown, policing the actions and behaviors of every constituent is not a scalable strategy. In addition, the question of <em>who</em> decides what "correct" actually means remains open for debate. In the context of ChatGPT, this is OpenAI. Similarly, in the context of the 2020 presidential election, it was Facebook that decided what content was and was not acceptable to promote. Combining the two, an interesting question arises: How do we solve the hallucination problem without a centralized authority? Said differently, how do we build models whose voice represents that of the broader consensus? It is extremely likely that a (heated) discussion surrounding some form of this question will unfold in the coming years.</p>
<p>My technical background is largely in machine learning. However, I've been working in crypto for the past two years. In this section, I'll borrow an idea from the latter and apply it to the former. The following idea may be fanciful and impractical and is not the only way to approach this problem. Nonetheless, it makes for an interesting thought experiment.</p>
<h3>Proof of Stake</h3>
<p>The crypto world has spent the last ~15 years trying to answer a similar question: How do we build a scalable, trustworthy system for the digital transfer of monetary value that does not rely on centralized intermediaries? To date, one of the key mechanisms used to achieve this end is Proof of Stake (PoS). PoS is a consensus algorithm where participants, or "validators," are collectively entrusted to verify the legitimacy of transactions. To incentivize earnest behavior, PoS employs the following mechanism:</p>
<ul>
<li>Participants are paid to validate transactions.</li>
<li>Prior to validating transactions, participants "stake" capital. This "stake" is like a "hold" placed on your credit card when renting a car.</li>
<li>The more capital you "stake," the more likely you are to be selected to validate transactions.</li>
<li>If other participants deem your behavior dishonest, your "stake" is taken (and you do not get paid).</li>
</ul>
<p>Taken together, PoS promotes transaction fidelity with economic incentives and penalties. Dishonest participants may lose their staked tokens, creating a self-regulating system where everyone has a vested interest in success.</p>
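<p>The staking mechanics above can be sketched in a few lines. The validator names, stakes, and reward amount are invented; real PoS systems add randomized committee selection, partial slashing, lock-up periods, and much more. This is only a minimal illustration of the incentive structure.</p>

```python
import random

# name -> staked capital (hypothetical values).
validators = {"alice": 100.0, "bob": 50.0, "carol": 10.0}
REWARD = 5.0

def select_validator(rng=random):
    """Pick a validator with probability proportional to stake."""
    names = list(validators)
    return rng.choices(names, weights=[validators[n] for n in names], k=1)[0]

def validate(name, deemed_honest):
    """Pay validators deemed honest by peers; forfeit the stake of the rest."""
    if deemed_honest:
        validators[name] += REWARD
    else:
        validators[name] = 0.0  # stake is "slashed"

validate("bob", deemed_honest=True)     # bob earns the reward
validate("carol", deemed_honest=False)  # carol loses her stake
print(validators)
```

<p>Note the self-regulating property: once slashed, carol's stake (and thus her selection probability) drops to zero.</p>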
<h3>Applying Proof of Stake to LLMs</h3>
<p>How might we apply PoS to LLMs? In effect, the users are the validators who ensure the legitimacy not of transactions, but of model outputs. Validators would have a stake—such as points, reputation, privileges, or digital currency—at risk, creating a vested interest in the accuracy of feedback. Then, model outputs would be periodically selected for collective review. Validators would be rewarded for proposing valid feedback in consensus with other users. Conversely, those providing inaccurate feedback or acting maliciously would lose their stake.</p>
<p>Much like PoS in the blockchain world, this system is not without its challenges. For instance:</p>
<ul>
<li>How do we ensure that a small number of "high-stake" users don't control the system?</li>
<li>Will the collective expertise of users empirically "satisfy" the model provider? If it's Elon, maybe; if it's Sam Altman, unclear.</li>
<li>What balance between rewards and penalties promotes truthful feedback yet does not stifle participation?</li>
<li>Etc.</li>
</ul>
<p>Ultimately, whether or when this type of system might be introduced hinges on the question of who retains power. Will LLMs simply be services offered by private companies? Will governments mandate their use as a public utility informed by, and built for, the body politic itself? I don't have the answers to any of these questions. However, my popcorn is ready.</p>
<p><strong>Overall, Proof of Stake is but one approach to solving the hallucination problem.</strong> As an algorithm for decentralized consensus, its relevance will evolve with ongoing narratives surrounding scalability, fairness, and the distribution of power in the context of LLMs.</p>
<h2>How will we compensate data providers?</h2>
<p>Much like LLMs, business is a story about data as well. For instance, historical customer purchase data allows a business to prioritize which products and services to sell. Customer attributes like age, location, gender, and political preference enable more targeted advertisements. Data collected in feedback forms hint at new products that customers might buy.</p>
<p>Before the internet, companies employed a variety of traditional methods to collect these data: questionnaires or comment cards, tracking purchases and preferences through loyalty cards, and utilizing demographic information and other publicly available data. With the internet, the game changed overnight. Now, companies can track every page visit, click, and even eye movements<sup id="fnref2:37"><a class="footnote-ref" href="#fn:37">37</a></sup>. In addition, they encourage the creation of data expressing <em>implicit</em> user preferences, such as pictures on Instagram, chats between friends on Messenger, videos on YouTube. Much like before, these data offer clear value to businesses.</p>
<p>Other types of data are valuable as well. Expert replies on question-answer sites like Quora offer informational value to users, and garner reputational value for the author herself. Basic data annotation enables machine learning practitioners to train models. In this<sup id="fnref:38"><a class="footnote-ref" href="#fn:38">38</a></sup> episode of The Daily, Sheera Frenkel discusses how fantasy story writing communities give a sense of purpose and satisfaction to their writers (and entertainment value to their readers). Finally, online courses on sites like <a href="https://platzi.com/">Platzi</a> offer clear educational value to students.</p>
<p>Overall, these data remain valuable to diverse parties in myriad ways.</p>
<h3>Current compensation models</h3>
<p>In exchange for the data above, its creators are compensated through the following three means:</p>
<ul>
<li>
<p><strong>Financial</strong>: Direct payments, royalties, or subscription fees for music, art, literature, educational material, data annotation, and more.</p>
</li>
<li>
<p><strong>Reputation</strong>: On platforms like Quora or Stack Overflow, individuals gain recognition based on the quality of their responses, knowledge, and expertise, enhancing their personal brand within a community.</p>
</li>
<li>
<p><strong>Spiritual</strong>: Individuals derive personal satisfaction from contributing something unique to the world.</p>
</li>
</ul>
<h3>What do LLM providers value?</h3>
<p>LLMs represent an entirely new consumer of data. As such, given the ways these models will be used, LLM providers value different things. Already, "beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance."<sup id="fnref:18"><a class="footnote-ref" href="#fn:18">18</a></sup> Conversely, beyond hallucination, what are these systems <em>not</em> good at? What do they need to be maximally useful and achieve widespread user adoption? Finally, how will model providers compensate those that provide the requisite data?</p>
<p><strong>New information</strong>: First and foremost, LLMs need to learn new information as detailed above. In exchange for this information, model providers could compensate data creators financially. For instance, OpenAI might (continue<sup id="fnref:39"><a class="footnote-ref" href="#fn:39">39</a></sup> to) pay Reddit for its forum data or pay the NYT for the articles its staff writes.</p>
<p><strong>Multi-turn feedback</strong>: Responding to factoid questions is easy; maintaining a coherent and intuitive conversation over multiple "turns" is more difficult. As LLMs are increasingly used for conversational use cases, this type of data becomes more relevant. Ironically, the predominant place where these data will likely be created is in human-LLM conversations themselves. As such, model providers may offer free usage of their services in exchange for these data, neatly mirroring the "free but invasive but no one cares" playbook that Facebook and Google have perfected.</p>
<p><strong>Answering subjective questions</strong>: When answering a subjective question, a model should provide one or more credible viewpoints in response. For instance, "Why did the United States experience inflation in 2021 and onward?" should be addressed with the diverse perspectives of capable economists. Irrespective of who these economists are chosen to be, it's clear that the "airtime" they will receive will be immense. As such, being <em>the</em> person featured by <em>the</em> LLM offers significant reputational benefits, much like being the top search result on Google in years past.</p>
<h3>Future compensation models</h3>
<p>Taken together, future compensation models might look as follows:</p>
<ul>
<li>
<p><strong>Financial</strong>: Direct payment for data dumps, application usage data, etc. In addition, data providers may achieve (micro) royalty<sup id="fnref:19"><a class="footnote-ref" href="#fn:19">19</a></sup> payments<sup id="fnref:25"><a class="footnote-ref" href="#fn:25">25</a></sup> every time their data are referenced in a model output.</p>
</li>
<li>
<p><strong>Reputation</strong>: Becoming <em>the</em> response from <em>the</em> LLM offers similar benefit to being a "top answer" on today's question-answer sites. To wit, GitHub Copilot is already<sup id="fnref:40"><a class="footnote-ref" href="#fn:40">40</a></sup> implementing such "attribution" features.</p>
</li>
<li>
<p><strong>Spiritual</strong>: This form of compensation may really change. As we "share our unique voice" with the LLM, e.g. a short story that we've written, the model can effectively "impersonate" us thereafter—forever. Will this "digital immortality" inspire feelings of personal grandeur? Or despair for the fact that we're "no longer needed?" Similarly, how will people feel interacting with an intelligence superior to their own? These questions are highly uncertain and will find answers in time.</p>
</li>
</ul>
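<p>The (micro) royalty idea in the financial bullet can be sketched as a simple attribution ledger. The provider names and per-reference rate are invented, and attributing a model output to its source data is itself an open research problem; this only illustrates the accounting, not the attribution.</p>

```python
from collections import defaultdict

RATE = 0.001  # hypothetical dollars credited per attributed reference

ledger = defaultdict(float)

def record_output(attributed_sources):
    """Credit every data source attributed in one model output."""
    for source in attributed_sources:
        ledger[source] += RATE

# Three hypothetical model outputs and their attributed sources.
record_output(["reddit", "nyt"])
record_output(["nyt"])
record_output(["stack_overflow", "nyt"])

print(dict(ledger))  # nyt referenced three times -> accrues 3 * RATE
```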
<p><strong>Overall, models for data compensation in a world with LLMs will remain similar to those of today.</strong> However, the spiritual experience of contributing data to and interacting with these models will evolve materially, challenging our perceptions of identity, purpose, and place in the years to come.</p>
<h2>Will data annotation be lucrative?</h2>
<p>Machine learning models are trained to accomplish specific tasks. To do this, humans first "label some data," i.e. perform this task up front. Next, we train the model to learn relationships between its inputs and these labels. For instance, when training a model to distinguish between "cat" and "dog" images, it learns from a dataset where each photo has been explicitly labeled by a human as containing either a "cat" or "dog." Similarly, an autonomous vehicle capable of resolving sophisticated moral dilemmas implies that humans imparted this judgement during training. As we aspire to build models that accomplish increasingly sophisticated tasks, data annotation complexity will increase in turn.</p>
<p>As noted above, ChatGPT relies on two types of data annotation: diverse task demonstrations, e.g. given a prompt, "act out" both sides of the human-machine conversation, as well as model output quality ranking. In effect, these annotations control how the model should respond as well as which types of responses are preferred. In other words, they control a lot! As such, to use this model effectively in increasingly "high stakes" corners of society, e.g. as behavioral therapists, legal counsel, tax attorneys, etc., we may require annotations of increasingly nuanced ethical, moral, "human" substance. Taken to its extreme, we can imagine these annotators as philosophers, legal scholars, and religious authorities, i.e. highly-trained specialists commanding due compensation.</p>
<p>With this in mind, numerous questions arise: Who will appoint these annotators? Will private companies continue to train LLMs exclusively? Will governments mandate a certain participation model, e.g. "democratic" annotation, with each citizen or interest group having a say? Financial compensation aside, will annotators gain status a la Supreme Court justices?</p>
<p><strong>Overall, highly-specialized data annotation will be in demand.</strong> Time will tell if this translates into a financially lucrative career path.</p>
<h2>Is this the new Marxist struggle?</h2>
<p>Viewed through the lens of Marxism, model providers like OpenAI may well ascend to the role of the bourgeoisie of our time—the esteemed custodians of the means of production. Conversely, data providers represent the proletariat; without their digital labor, the "factory lights" would dim. Will their relationship become the next flashpoint of class struggle in our digital age?</p>
<p>To begin, let's examine the relationship between social media platforms and their users over the past ~20 years. In effect: the former amassed vast fortunes through the unpaid efforts of the latter: users supplied data, which the platforms then leveraged for targeted advertising. In return, users received free access to services that allowed them to connect with friends, organize their professional lives, and access information online. While this arrangement sparked outrage among some, dissent eventually faded.</p>
<p>The advent of LLMs, however, introduces a crucial difference: the dramatically expanding range of AI applications. The services promised by LLMs aren't confined to social communication or productivity tools. Instead, they're likely to fill roles as diverse as accountants, therapists, fitness instructors, educators, career coaches, among many others. As AI permeates deeper into our daily lives, the issues of labor and compensation in data provision become that much more salient.</p>
<p>The outcome of this tension will be heavily informed by the answer to the following question: will AI make "all boats rise," or will it swallow them whole? In the former scenario, people lose jobs, retrain, and are re-employed in the next generation of professional careers. In the latter, we devolve into the dystopia depicted in Yuval Noah Harari's "Homo Deus," in which a sizeable portion of the labor force has nothing of "economic relevance" to contribute at all.</p>
<p>What kinds of jobs can be done by LLMs? In effect, the answer to this question is determined by the data on which the model was trained. As such, "knowledge worker" jobs based in text (writing, coding, etc.) stand threatened. In addition—and while this post has covered text-only LLMs alone—these models will invariably trend towards understanding <em>multiple</em> digital modalities simultaneously—accepting inputs and producing outputs in the form of text, image, video, audio, and more. These systems will work because of the volume of such data on which we can train, putting the creators of that data—software engineers, digital artists, YouTube content creators, etc.—at risk.</p>
<p>Crucially, though, this type of work is one among many. To date, jobs in fields such as healthcare, hospitality, craftsmanship, maintenance and repair, agriculture, emergency services, beauty, and education are not at real risk of replacement by AI. As such, we may simply see displaced knowledge workers "diffuse" into these other spheres, postponing real conflict. From there, the cycle might repeat: humans create data, model providers collect that data, then train models, then replace humans. And finally, once there are no "untouched" sectors of the economy left, we'll be able to more clearly perceive an answer to our question.</p>
<p><strong>Taken together, I don't envision a material class conflict anytime soon.</strong> AI has a lot further to go towards Harari's dystopia for this to happen. For the time being, the relationship between model providers and data producers will remain largely peaceful, and endlessly interesting.</p>
<h1>Conclusion</h1>
<p>The LLM story is sure to evolve quickly. It's unclear where it will go. In the best case, human knowledge, satisfaction, and general well-being compound exponentially. My fingers are crossed.</p>
<h1>References</h1>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>@misc{2210.11610,
Author={Jiaxin Huang and Shixiang Shane Gu and Le Hou and Yuexin Wu and Xuezhi Wang and Hongkun Yu and Jiawei Han},
Title={Large Language Models Can Self-Improve},
Year={2022},
Eprint={arXiv:2210.11610},
} <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>@article{dave2023stackoverflow,
title={Stack Overflow Will Charge AI Giants for Training Data},
author={Dave, Paresh},
journal={Wired},
year={2023},
month={Apr},
day={20},
note={The programmer Q&A site joins Reddit in demanding compensation when its data is used to train algorithms and ChatGPT-style bots},
url={https://www.wired.com/story/stack-overflow-will-charge-ai-giants-for-training-data/},
timestamp={5:19 PM}
} <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>@article{anirudh2023stackoverflow,
title={Is This the Beginning of the End of Stack Overflow?},
author={Anirudh, VK},
journal={Analytics India Magazine},
year={2023},
month={Apr},
day={18},
note={Integrating an LLM into Stack Overflow won't make its problems disappear},
url={https://analyticsindiamag.com/is-this-the-beginning-of-the-end-of-stack-overflow/},
publisher={Endless Origins}
} <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>@online{chandrasekar2023community,
title={Community is the future of AI},
author={Chandrasekar, Prashanth},
year={2023},
month={Apr},
day={17},
note={To keep knowledge open and accessible to all, we must come together to build the future of AI.},
url={https://stackoverflow.blog/2023/04/17/community-is-the-future-of-ai/},
publisher={Stack Overflow Blog},
organization={Stack Overflow}
} <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>@misc{2306.15774,
Author={Xiang 'Anthony' Chen and Jeff Burke and Ruofei Du and Matthew K. Hong and Jennifer Jacobs and Philippe Laban and Dingzeyu Li and Nanyun Peng and Karl D. D. Willis and Chien-Sheng Wu and Bolei Zhou},
Title={Next Steps for Human-Centered Generative AI: A Technical Perspective},
Year={2023},
Eprint={arXiv:2306.15774},
} <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>@misc{2306.08302,
Author={Shirui Pan and Linhao Luo and Yufei Wang and Chen Chen and Jiapu Wang and Xindong Wu},
Title={Unifying Large Language Models and Knowledge Graphs: A Roadmap},
Year={2023},
Eprint={arXiv:2306.08302},
} <a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>@article{2305.18339,
Author={Yuntao Wang and Yanghe Pan and Miao Yan and Zhou Su and Tom H. Luan},
Title={A Survey on ChatGPT: AI-Generated Contents, Challenges, and Solutions},
Year={2023},
Eprint={arXiv:2305.18339},
Doi={10.1109/OJCS.2023.3300321},
} <a class="footnote-backref" href="#fnref:7" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>@misc{2108.13487,
Author={Shuohang Wang and Yang Liu and Yichong Xu and Chenguang Zhu and Michael Zeng},
Title={Want To Reduce Labeling Cost? GPT-3 Can Help},
Year={2021},
Eprint={arXiv:2108.13487},
} <a class="footnote-backref" href="#fnref:8" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:9">
<p>@misc{2307.10169,
Author={Jean Kaddour and Joshua Harris and Maximilian Mozes and Herbie Bradley and Roberta Raileanu and Robert McHardy},
Title={Challenges and Applications of Large Language Models},
Year={2023},
Eprint={arXiv:2307.10169},
} <a class="footnote-backref" href="#fnref:9" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:10">
<p>@misc{2212.10450,
Author={Bosheng Ding and Chengwei Qin and Linlin Liu and Yew Ken Chia and Shafiq Joty and Boyang Li and Lidong Bing},
Title={Is GPT-3 a Good Data Annotator?},
Year={2022},
Eprint={arXiv:2212.10450},
} <a class="footnote-backref" href="#fnref:10" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:11">
<p>@misc{2306.11644,
Author={Suriya Gunasekar and Yi Zhang and Jyoti Aneja and Caio César Teodoro Mendes and Allie Del Giorno and Sivakanth Gopi and Mojan Javaheripi and Piero Kauffmann and Gustavo de Rosa and Olli Saarikivi and Adil Salim and Shital Shah and Harkirat Singh Behl and Xin Wang and Sébastien Bubeck and Ronen Eldan and Adam Tauman Kalai and Yin Tat Lee and Yuanzhi Li},
Title={Textbooks Are All You Need},
Year={2023},
Eprint={arXiv:2306.11644},
} <a class="footnote-backref" href="#fnref:11" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:12">
<p>@inproceedings{yoo2021gpt3mix,
title={GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation},
author={Yoo, Kang Min and Park, Dongju and Kang, Jaewook and Lee, Sang-Woo and Park, Woomyeong},
pages={192},
year={2021},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
url={https://aclanthology.org/2021.findings-emnlp.192.pdf}
} <a class="footnote-backref" href="#fnref:12" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:13">
<p>@misc{2302.13007,
Author={Haixing Dai and Zhengliang Liu and Wenxiong Liao and Xiaoke Huang and Yihan Cao and Zihao Wu and Lin Zhao and Shaochen Xu and Wei Liu and Ninghao Liu and Sheng Li and Dajiang Zhu and Hongmin Cai and Lichao Sun and Quanzheng Li and Dinggang Shen and Tianming Liu and Xiang Li},
Title={AugGPT: Leveraging ChatGPT for Text Data Augmentation},
Year={2023},
Eprint={arXiv:2302.13007},
} <a class="footnote-backref" href="#fnref:13" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:14">
<p>@misc{1707.06347,
Author={John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov},
Title={Proximal Policy Optimization Algorithms},
Year={2017},
Eprint={arXiv:1707.06347},
} <a class="footnote-backref" href="#fnref:14" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:15">
<p>@misc{2304.01852,
Author={Yiheng Liu and Tianle Han and Siyuan Ma and Jiayue Zhang and Yuanyuan Yang and Jiaming Tian and Hao He and Antong Li and Mengshen He and Zhengliang Liu and Zihao Wu and Dajiang Zhu and Xiang Li and Ning Qiang and Dingang Shen and Tianming Liu and Bao Ge},
Title={Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models},
Year={2023},
Eprint={arXiv:2304.01852},
} <a class="footnote-backref" href="#fnref:15" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
<li id="fn:16">
<p>@misc{2209.01538,
Author={Xin Mu and Ming Pang and Feida Zhu},
Title={Data Provenance via Differential Auditing},
Year={2022},
Eprint={arXiv:2209.01538},
} <a class="footnote-backref" href="#fnref:16" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:17">
<p>@misc{2304.00612,
Author={Samuel R. Bowman},
Title={Eight Things to Know about Large Language Models},
Year={2023},
Eprint={arXiv:2304.00612},
} <a class="footnote-backref" href="#fnref:17" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
<li id="fn:18">
<p>@misc{2303.12712,
Author={Sébastien Bubeck and Varun Chandrasekaran and Ronen Eldan and Johannes Gehrke and Eric Horvitz and Ece Kamar and Peter Lee and Yin Tat Lee and Yuanzhi Li and Scott Lundberg and Harsha Nori and Hamid Palangi and Marco Tulio Ribeiro and Yi Zhang},
Title={Sparks of Artificial General Intelligence: Early experiments with GPT-4},
Year={2023},
Eprint={arXiv:2303.12712},
} <a class="footnote-backref" href="#fnref:18" title="Jump back to footnote 18 in the text">↩</a></p>
</li>
<li id="fn:19">
<p>@book{lanier2013future,
title={Who owns the future?},
author={Lanier, Jaron},
year={2013},
publisher={Simon & Schuster}
} <a class="footnote-backref" href="#fnref:19" title="Jump back to footnote 19 in the text">↩</a></p>
</li>
<li id="fn:20">
<p>@online{openai2022chatgpt,
title={Introducing ChatGPT},
author={OpenAI},
year={2022},
month={11},
day={30},
url={https://openai.com/blog/chatgpt}
} <a class="footnote-backref" href="#fnref:20" title="Jump back to footnote 20 in the text">↩</a></p>
</li>
<li id="fn:21">
<p>@online{openai2022instruction,
title={Aligning language models to follow instructions},
author={OpenAI},
year={2022},
month={01},
day={27},
url={https://openai.com/research/instruction-following}
} <a class="footnote-backref" href="#fnref:21" title="Jump back to footnote 21 in the text">↩</a></p>
</li>
<li id="fn:22">
<p>@online{openai2017humanprefs,
title={Learning from human preferences},
author={OpenAI},
year={2017},
month={06},
day={13},
url={https://openai.com/research/learning-from-human-preferences}
} <a class="footnote-backref" href="#fnref:22" title="Jump back to footnote 22 in the text">↩</a></p>
</li>
<li id="fn:23">
<p>@online{openai2017ppo,
title={Proximal Policy Optimization},
author={OpenAI},
year={2017},
month={07},
day={20},
url={https://openai.com/research/openai-baselines-ppo}
} <a class="footnote-backref" href="#fnref:23" title="Jump back to footnote 23 in the text">↩</a></p>
</li>
<li id="fn:24">
<p>@online{huggingface2022rlhf,
title={Illustrating Reinforcement Learning from Human Feedback (RLHF)},
author={Nathan Lambert and Louis Castricato and Leandro von Werra and Alex Havrilla},
year={2022},
month={12},
day={9},
url={https://huggingface.co/blog/rlhf},
organization={Hugging Face}
} <a class="footnote-backref" href="#fnref:24" title="Jump back to footnote 24 in the text">↩</a></p>
</li>
<li id="fn:25">
<p>@online{durmonski2023owns,
title={Who Owns The Future? by Jaron Lanier [Actionable Summary]},
author={Ivaylo Durmonski},
year={2023},
month={7},
day={7},
url={https://durmonski.com/book-summaries/who-owns-the-future/#5-lesson-2-ordinary-people-are-not-compensated-for-the-information-taken-from-them},
organization={Durmonski.com},
note={Actionable Book Summaries, Science & Tech Book Summaries}
} <a class="footnote-backref" href="#fnref:25" title="Jump back to footnote 25 in the text">↩</a></p>
</li>
<li id="fn:26">
<p>@article{bubeck2023sparks,
title={Sparks of Artificial General Intelligence: Early experiments with GPT-4},
author={Bubeck, Sébastien and Chandrasekaran, Varun and Eldan, Ronen and Gehrke, Johannes and Horvitz, Eric and Kamar, Ece and Lee, Peter and Lee, Yin Tat and Li, Yuanzhi and Lundberg, Scott and Nori, Harsha and Palangi, Hamid and Ribeiro, Marco Tulio and Zhang, Yi},
year={2023},
month={3},
publisher={Microsoft Research},
url={https://www.microsoft.com/en-us/research/publication/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4/}
} <a class="footnote-backref" href="#fnref:26" title="Jump back to footnote 26 in the text">↩</a></p>
</li>
<li id="fn:27">
<p>@misc{author2023howlong,
title={How long would it take you to read the entire internet?},
author={Stinson, Mark and Chovanek, Chris and Gibson, Jack},
year={2023},
howpublished={\url{https://www.quora.com/How-long-would-it-take-you-to-read-the-entire-internet}}
} <a class="footnote-backref" href="#fnref:27" title="Jump back to footnote 27 in the text">↩</a></p>
</li>
<li id="fn:28">
<p>@book{mcculloch2019because,
title={Because Internet: Understanding the New Rules of Language},
author={McCulloch, Gretchen},
year={2019},
publisher={Riverhead Books}
} <a class="footnote-backref" href="#fnref:28" title="Jump back to footnote 28 in the text">↩</a></p>
</li>
<li id="fn:29">
<p>@article{yirka2023korean,
title={Korean team claims to have created the first room-temperature, ambient-pressure superconductor},
author={Yirka, Bob},
year={2023},
month={July},
day={27},
journal={Phys.org},
url={https://phys.org/news/2023-07-korean-team-room-temperature-ambient-pressure-superconductor.html}
} <a class="footnote-backref" href="#fnref:29" title="Jump back to footnote 29 in the text">↩</a></p>
</li>
<li id="fn:30">
<p>@misc{stackoverflow,
title={Stack Overflow},
year={2008},
url={https://stackoverflow.com/}
} <a class="footnote-backref" href="#fnref:30" title="Jump back to footnote 30 in the text">↩</a></p>
</li>
<li id="fn:31">
<p>@online{carr2023stackoverflow,
author={David F. Carr},
title={Stack Overflow is ChatGPT Casualty: Traffic Down 14% in March},
year={2023},
url={https://www.similarweb.com/blog/insights/ai-news/stack-overflow-chatgpt/},
month={April},
day={19},
note={Updated June 21, 2023}
} <a class="footnote-backref" href="#fnref:31" title="Jump back to footnote 31 in the text">↩</a></p>
</li>
<li id="fn:32">
<p>@online{modular2023mojo,
title={Mojo 🔥 — a new programming language for all AI developers},
year={2023},
url={https://www.modular.com/mojo},
publisher={Modular},
} <a class="footnote-backref" href="#fnref:32" title="Jump back to footnote 32 in the text">↩</a></p>
</li>
<li id="fn:33">
<p>@online{modulardocs2023mojo,
title={Mojo Documentation},
year={2023},
url={https://docs.modular.com/mojo/},
publisher={Modular},
} <a class="footnote-backref" href="#fnref:33" title="Jump back to footnote 33 in the text">↩</a></p>
</li>
<li id="fn:34">
<p>@online{github2023copilotx,
title={Your AI pair programmer is leveling up},
year={2023},
url={https://github.com/features/preview/copilot-x},
publisher={GitHub},
} <a class="footnote-backref" href="#fnref:34" title="Jump back to footnote 34 in the text">↩</a></p>
</li>
<li id="fn:35">
<p>@article{hu2023chatgpt,
title={ChatGPT sets record for fastest-growing user base - analyst note},
author={Hu, Krystal},
journal={Reuters},
year={2023},
month={Feb},
day={2},
url={https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/}
} <a class="footnote-backref" href="#fnref:35" title="Jump back to footnote 35 in the text">↩</a></p>
</li>
<li id="fn:36">
<p>@misc{wikipedia2023information,
title={Information retrieval},
author={Wikipedia, The Free Encyclopedia},
year={2023},
note={Available: \url{https://en.wikipedia.org/wiki/Information_retrieval}}
} <a class="footnote-backref" href="#fnref:36" title="Jump back to footnote 36 in the text">↩</a></p>
</li>
<li id="fn:37">
<p>@online{lecomte2022,
author={Patrick Lecomte},
title={Companies are increasingly tracking eye movements — but is it ethical?},
year={2022},
url={https://theconversation.com/companies-are-increasingly-tracking-eye-movements-but-is-it-ethical-191842},
note={Published: October 16, 2022 8.28am EDT},
organization={The Conversation},
institution={Université du Québec à Montréal (UQAM)},
keywords={eye tracking, ethics}
} <a class="footnote-backref" href="#fnref:37" title="Jump back to footnote 37 in the text">↩</a><a class="footnote-backref" href="#fnref2:37" title="Jump back to footnote 37 in the text">↩</a></p>
</li>
<li id="fn:38">
<p>@misc{frenkel2023,
author={Sheera Frenkel},
title={The Writer's Revolt Against AI Companies},
year={2023},
note={Episode of The Daily},
howpublished={Available at: \url{https://open.spotify.com/episode/26xt8MwmfaBlmU6GFjYumu}},
organization={The New York Times},
abstract={To refine their popular technology, new artificial intelligence platforms like Chat-GPT are gobbling up the work of authors, poets, comedians and actors — without their consent. Sheera Frenkel, a technology correspondent for The Times, explains why a rebellion is brewing.},
keywords={AI, Chat-GPT, rebellion, technology, ethics}
} <a class="footnote-backref" href="#fnref:38" title="Jump back to footnote 38 in the text">↩</a></p>
</li>
<li id="fn:39">
<p>@online{ngila2023reddit,
title={AI bots trained on Reddit conversations. Now Reddit wants to be paid for it.},
author={Ngila, Faustine},
year={2023},
month={April},
day={19},
url={https://qz.com/reddit-ai-bots-training-payment-1850352526},
publisher={Quartz}
} <a class="footnote-backref" href="#fnref:39" title="Jump back to footnote 39 in the text">↩</a></p>
</li>
<li id="fn:40">
<p>@online{salva2023introducing,
author={Ryan J. Salva},
title={Introducing code referencing for GitHub Copilot},
year={2023},
url={https://github.blog/2023-08-03-introducing-code-referencing-for-github-copilot/},
month={August}
} <a class="footnote-backref" href="#fnref:40" title="Jump back to footnote 40 in the text">↩</a></p>
</li>
</ol>
</div>README2022-08-01T17:00:00-04:002022-08-01T17:00:00-04:00Will Wolftag:willwolf.io,2022-08-01:/2022/08/01/readme/<p>A document describing how to work with <em>me.</em> What am I like? How do I work best? And more.</p><p>A README is a document describing how to use a piece of software. Done <a href="https://github.com/pandas-dev/pandas/blob/main/README.md">well</a>, it clearly and succinctly communicates how to work with the code in question.</p>
<p>I recently came across the idea of writing a README not for my project, but for <em>myself</em>; in other words, a document describing how to work with <em>me</em>! What am I like? How do I work best? And more.</p>
<p>In the following text, I endeavour to do exactly this, both as an exercise in self-exploration, and a step towards working more effectively with others in my remote-dominated professional world.</p>
<p>Without further ado, let's begin.</p>
<h2>How I type on Slack</h2>
<p>I generally type in all lowercase letters and with punctuation. For example, I might send the following message:</p>
<div class="highlight"><pre><span></span>heya. how goes? how was your weekend? what are you working on today?
i'm tweaking our model. i should be done this evening. i'll let you know when i'm finished.
</pre></div>
<p>My (textual) tone may seem curt and cold. In reality, it's not. Punctuation helps me keep my thoughts organized and speak succinctly. Capitalization is far too much effort for far too little reward.</p>
<h2>My communication style</h2>
<p>I speak directly in (non-sensitive) professional conversations. I ask questions plainly, e.g. "Why did you make that choice? How many hours will that take? What other solutions did you consider?"</p>
<p>In science, explicit communication of hypotheses, methods, and outcomes is crucial. Clarifying questions are sexy. Sanity checks are king. My professional brain lives in this space; speaking directly keeps my feet planted below, and (hopefully!) moves me and my team forward.</p>
<h2>When I have energy</h2>
<p>I'm a morning person. My most productive technical hours are between 8 am and 12 pm. Ideally, I spend these hours writing code, undistracted. In the afternoons, I prefer talking to others. Ideally, I spend these hours in 1-on-1s, planning, retros, etc.</p>
<h2>How to message me</h2>
<p>Don't overthink it. I don't bite!</p>
<h2>What I look like when I'm thinking</h2>
<p>When I'm trying to understand a complex problem, I divert my gaze and stare at the wall for 5-15 seconds. (Conversely, staring directly at the camera clouds my brain.)</p>
<p>When I'm engaged in a technical conversation, my brow is furrowed. I look intense, exacting, prickly.</p>
<p>In both cases, please don't intuit rudeness! I'm focused on clarifying details for myself and my teammates.</p>
<h2>How to teach me</h2>
<p>At work, I prize learning from others—math, engineering, leadership, you name it—above most else. If you have something to teach me, please do. You aren't bothering me. You're not "preaching." Your wisdom is welcome. Keep it coming.</p>
<h2>How to learn from me</h2>
<p>If you value what I have to say, let me know. Ask for more. I love to share what I've been taught by others. To pass the torch, if you will! If you're open to learn, I'm genuinely excited to teach.</p>
<h2>Anything I missed?</h2>
<p>Let me know in the comments and I'll do my best to address.</p>I Work in Crypto2022-06-11T08:30:00-04:002022-06-11T08:30:00-04:00Will Wolftag:willwolf.io,2022-06-11:/2022/06/11/work-in-crypto/<p>I now work in crypto—building credit scoring models at <a href="https://www.credprotocol.com/">Cred Protocol</a> and simulations of complex systems at <a href="https://block.science/">Block Science</a>.</p><p>I <a href="https://willwolf.io/2021/12/04/leaving-asapp/">left</a> ASAPP in December 2021 to "explore": Rust, complex systems, crypto. After ~4 months of study, podcasts, conversations, toy <a href="https://willwolf.io/2021/12/21/crypto-projects/">projects</a>, Twitter, meetups, conferences, and general experimentation, I've now re-entered the workforce; I now work in crypto!</p>
<p>Per our brave new remote world and the breadth of my technical interests, I recently started <em>two</em> new jobs in this zany new space: building credit scoring models at <a href="https://www.credprotocol.com/">Cred Protocol</a> and simulations of complex systems at <a href="https://block.science/">Block Science</a>. The former is my primary gig.</p>
<h1>Cred Protocol</h1>
<p>Cred aims to enable under-collateralized, capital-efficient borrowing of cryptocurrencies by quantifying wallet risk. In other words, much like FICO's "300-850," we're building a credit score. From a technical stance, credit scoring implies systems and algorithms to quantify risk—of tens of millions of crypto wallets—in near real-time. Per my longstanding interests in building systems and statistical products, my current energy to build from the ground up, and the chance to explore a world of new data, new players, new algorithms, new rules—I jumped at the opportunity to join Cred as one of its first few hires. We're currently ~10 people.</p>
<h1>Block Science</h1>
<p>Ten years ago, I wrote an undergraduate <a href="https://honors.libraries.psu.edu/catalog/1947">thesis</a> about the simulation of iterated Prisoner's Dilemma games on the surface of a geodesic dome. This work has remained an abiding fascination ever since. Last year, I realized that crypto offers a perfect "arena" for the simulation of similar "complex systems": a blockchain gives a graph data structure indexed by time—with copious amounts of structured, <em>public</em> data to boot! In other words, simulating the behavior of systems built on blockchains, e.g. a lending-and-borrowing protocol like <a href="https://aave.com/">Aave</a>, then aligning simulated results with empirical results, is a highly tractable, useful, and interesting line of work.</p>
<p>Block Science is a leader in doing exactly this. As a visiting contributor, I build, analyze, and communicate simulations of "cryptoeconomic" systems for a variety of partners. Block Science comprises ~30 talented and diverse members—with backgrounds spanning economics, social science, natural science, behavioral science, multi-agent systems, and more.</p>
<h1>Join us</h1>
<p>Both companies are hiring. If you're interested in what we do, do reach out!</p>
<p>Will</p>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>Thumbnail image credit goes <a href="https://jingculturecommerce.com/museum-cryptocurrency-donations/">here</a>. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Neural Methods in Simulation-Based Inference2022-01-04T10:00:00-05:002022-01-04T10:00:00-05:00Will Wolftag:willwolf.io,2022-01-04:/2022/01/04/neural-methods-in-sbi/<p>A survey of how neural networks are currently being used in simulation-based inference routines.</p><p>Bayesian inference is the task of quantifying a posterior belief over parameters <span class="math">\(\boldsymbol{\theta}\)</span> given observed data <span class="math">\(\mathbf{x}\)</span>—where <span class="math">\(\mathbf{x}\)</span> was generated from a model <span class="math">\(p(\mathbf{x}\mid{\boldsymbol{\theta}})\)</span>—via Bayes' Theorem:</p>
<div class="math">$$
p(\boldsymbol{\theta}\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathbf{x})}
$$</div>
<p>In numerous applications of scientific interest, e.g. cosmological, climatic or urban-mobility phenomena, the likelihood of the data <span class="math">\(\mathbf{x}\)</span> under the data-generating function <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> is intractable to compute, precluding classical inference approaches. Notwithstanding, <em>simulating</em> new data <span class="math">\(\mathbf{x}\)</span> from this function is often trivial—for example, by coding the generative process in a few lines of Python—</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">generative_process</span><span class="p">(</span><span class="n">params</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># some deterministic logic</span>
<span class="n">data</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># some stochastic logic</span>
<span class="n">data</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># whatever you want!</span>
<span class="k">return</span> <span class="n">data</span>
<span class="n">simulated_data</span> <span class="o">=</span> <span class="p">[</span><span class="n">generative_process</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="p">[</span><span class="o">.</span><span class="mi">2</span><span class="p">,</span> <span class="o">.</span><span class="mi">4</span><span class="p">,</span> <span class="o">.</span><span class="mi">6</span><span class="p">,</span> <span class="o">.</span><span class="mi">8</span><span class="p">,</span> <span class="mi">1</span><span class="p">]]</span>
</pre></div>
<p>—motivating the study of <em>simulation-based</em> Bayesian <em>inference</em> methods, termed SBI.</p>
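<p>To make this concrete, here is a toy instance of such a simulator—my own illustrative choice, not drawn from any particular application: a Gaussian whose mean is the parameter of interest. We only ever <em>sample</em> from it; no likelihood evaluation is required.</p>

```python
import random

def generative_process(theta, n=100):
    # toy simulator: draw n points from a Gaussian with mean theta and unit
    # variance; sampling is trivial even if we pretend the likelihood is
    # unavailable
    return [random.gauss(theta, 1.0) for _ in range(n)]

# simulate datasets at a handful of parameter values
simulated_data = [generative_process(p) for p in [0.2, 0.4, 0.6, 0.8, 1.0]]
```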
<p>Furthermore, the evidence <span class="math">\(p(\mathbf{x}) = \int{p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})}d\boldsymbol{\theta}\)</span> is typically intractable to compute as well. This is because the integral has no closed-form solution; or, were the functional form of the likelihood (which we don't have) and the prior (which we do have) available, expanding these terms would yield a summation over an "impractically large" number of terms, e.g. the number of possible cluster assignment configurations in a mixture of Gaussians <a href="#10.1080/01621459.2017.1285773" id="ref-10.1080/01621459.2017.1285773-1">(Blei et al., 2017)</a>. For this reason, in SBI, we typically estimate the <em>unnormalized</em> posterior <span class="math">\(\tilde{p}(\boldsymbol{\theta}\mid\mathbf{x}) = p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta}) \propto \frac{p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathbf{x})}\)</span>.</p>
<p>Recent work has explored the use of neural networks to perform key density estimation tasks, i.e. subroutines, of the SBI routine itself. We refer to this work as Neural SBI. In the following sections, we detail the various classes of these estimation tasks. For a more thorough analysis of their respective motivations, behaviors, and tradeoffs, we refer the reader to the original work.</p>
<h1>Neural Posterior Estimation</h1>
<p>In this class of models, we estimate <span class="math">\(\tilde{p}(\boldsymbol{\theta}\mid\mathbf{x})\)</span> with a conditional neural density estimator <span class="math">\(q_{\phi}(\boldsymbol{\theta}\mid\mathbf{x})\)</span>. Simply, this estimator is a neural network with parameters <span class="math">\(\phi\)</span> that accepts <span class="math">\(\mathbf{x}\)</span> as input and produces <span class="math">\(\boldsymbol{\theta}\)</span> as output. It is trained on data tuples <span class="math">\(\{\boldsymbol{\theta}_n, \mathbf{x}_n\}_{1:N}\)</span> sampled from <span class="math">\(p(\mathbf{x}, \boldsymbol{\theta}) = p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span>, where <span class="math">\(p(\boldsymbol{\theta})\)</span> is a prior we choose, and <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> is our <em>simulator</em>. For example, we can construct this training set as follows:</p>
<div class="highlight"><pre><span></span><span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_SAMPLES</span><span class="p">):</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">prior</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">generative_process</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="p">))</span>
</pre></div>
<p>Then, we train our network.</p>
<div class="highlight"><pre><span></span><span class="n">q_phi</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</pre></div>
<p>Finally, once trained, we can estimate <span class="math">\(\tilde{p}(\boldsymbol{\theta}\mid\mathbf{x} = \mathbf{x}_o)\)</span>—our posterior belief over parameters <span class="math">\(\boldsymbol{\theta}\)</span> given our <em>observed</em> (not simulated!) data <span class="math">\(\mathbf{x}_o\)</span> as <span class="math">\(\tilde{p}(\boldsymbol{\theta}\mid\mathbf{x}_o) = q_{\phi}(\boldsymbol{\theta}\mid\mathbf{x} = \mathbf{x}_o)\)</span>.</p>
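<p>As a crude, runnable stand-in for this train-then-condition workflow—deliberately <em>not</em> a neural network—one can fit a linear-Gaussian conditional estimator on simulated (summary, parameter) pairs by ordinary least squares. The simulator, prior, and summary statistic below are my own toy choices for illustration only.</p>

```python
import random
import statistics

def generative_process(theta, n=20):
    # toy simulator: Gaussian with mean theta and unit variance
    return [random.gauss(theta, 1.0) for _ in range(n)]

random.seed(1)

# build the training set of (summary(x_n), theta_n) tuples, as in the text
pairs = []
for _ in range(2000):
    theta = random.uniform(0.0, 1.0)        # sample from the prior
    x = generative_process(theta)           # simulation step
    pairs.append((statistics.mean(x), theta))

# fit theta ~ a * summary + b by least squares: a crude, non-neural
# stand-in for training q_phi
s_mean = statistics.mean(s for s, _ in pairs)
t_mean = statistics.mean(t for _, t in pairs)
cov = sum((s - s_mean) * (t - t_mean) for s, t in pairs)
var = sum((s - s_mean) ** 2 for s, _ in pairs)
a = cov / var
b = t_mean - a * s_mean

# "condition" on observed data generated at a held-out true theta = 0.3
x_o = generative_process(0.3)
theta_hat = a * statistics.mean(x_o) + b    # point estimate of theta
```

<p>A real neural posterior estimator replaces the least-squares fit with a flexible conditional density model, but the workflow—simulate, train, condition on <span class="math">\(\mathbf{x}_o\)</span>—is the same.</p>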
<h2>Learning the wrong estimator</h2>
<p>Ultimately, our goal is to perform the following computation:</p>
<div class="math">$$
q_{\phi}(\boldsymbol{\theta}\mid\mathbf{x} = \mathbf{x}_o)
$$</div>
<p>Such that <span class="math">\(q_{\phi}\)</span> produces an <em>accurate</em> estimation of the parameters <span class="math">\(\boldsymbol{\theta}\)</span> given observed data <span class="math">\(\mathbf{x}_o\)</span>, we require that <span class="math">\(q_{\phi}\)</span> be <em>trained</em> on tuples <span class="math">\(\{\boldsymbol{\theta}_n, \mathbf{x}_n\}\)</span> where:</p>
<ol>
<li><span class="math">\(\mathbf{x}_n \sim p(\mathbf{x}\mid\boldsymbol{\theta}_n)\)</span> via our simulation step.</li>
<li><span class="math">\(\mid\mathbf{x}_n - \mathbf{x}_o\)</span> is small, i.e. our simulated data are near our observed data.</li>
</ol>
<p>Otherwise, <span class="math">\(q_{\phi}\)</span> will learn to estimate a posterior over parameters given data <em>unlike</em> our own.</p>
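<p>The requirement that simulated data land near the observed data is exactly what classical rejection ABC enforces with a hard distance threshold. A minimal sketch—again using a toy Gaussian simulator of my own choosing, not a neural method:</p>

```python
import random
import statistics

def generative_process(theta, n=50):
    # toy simulator: Gaussian with mean theta and unit variance
    return [random.gauss(theta, 1.0) for _ in range(n)]

def summarize(xs):
    # a single summary statistic: the sample mean
    return statistics.mean(xs)

random.seed(0)

# "observed" data, generated from a held-out true theta = 0.7
x_o = generative_process(0.7)
s_o = summarize(x_o)

accepted = []
while len(accepted) < 200:
    theta = random.uniform(-2.0, 2.0)       # draw from a flat prior
    x = generative_process(theta)           # simulate
    if abs(summarize(x) - s_o) < 0.1:       # keep only simulations near x_o
        accepted.append(theta)

# the accepted thetas approximate the posterior given x_o
posterior_mean = statistics.mean(accepted)
```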
<h2>Learning a better estimator</h2>
<p>So, how do we obtain parameters <span class="math">\(\boldsymbol{\theta}_n\)</span> that produce <span class="math">\(\mathbf{x}_n \sim p(\mathbf{x}\mid\boldsymbol{\theta}_n)\)</span> near <span class="math">\(\mathbf{x}_o\)</span>? We take those that have high (estimated) posterior density given <span class="math">\(\mathbf{x}_o\)</span>!</p>
<p>In this vein, we build our training set as follows:</p>
<div class="highlight"><pre><span></span><span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_SAMPLES</span><span class="p">):</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">q_phi</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_o</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">generative_process</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="p">))</span>
</pre></div>
<p>Stitching this all together, our SBI routine becomes:</p>
<div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_ROUNDS</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_SAMPLES</span><span class="p">):</span>
<span class="k">if</span> <span class="n">r</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">prior</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">q_phi</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_o</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">generative_process</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="p">))</span>
<span class="n">q_phi</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">posterior_samples</span> <span class="o">=</span> <span class="p">[</span><span class="n">q_phi</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x_o</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ANY_NUMBER</span><span class="p">)]</span>
</pre></div>
<h2>Learning the right estimator</h2>
<p>Unfortunately, we're still left with a problem:</p>
<ol>
<li>In the first round, we learn <span class="math">\(q_{\phi, r=0}(\boldsymbol{\theta}\mid\mathbf{x}) \propto p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span>, i.e. the <strong>right</strong> estimator.</li>
<li>Thereafter, we learn <span class="math">\(q_{\phi, r}(\boldsymbol{\theta}\mid\mathbf{x}) \propto p(\mathbf{x}\mid\boldsymbol{\theta})q_{\phi, r-1}(\boldsymbol{\theta}\mid\mathbf{x})\)</span>, i.e. the <strong>wrong</strong> estimator.</li>
</ol>
<p>So, how do we correct this mistake?</p>
<p>In <a href="#papamakarios2016" id="ref-papamakarios2016-1">Papamakarios and Murray (2016)</a>, the authors adjust the learned posterior <span class="math">\(q_{\phi, r}(\boldsymbol{\theta}\mid\mathbf{x})\)</span> by simply dividing it by <span class="math">\(q_{\phi, r-1}(\boldsymbol{\theta}\mid\mathbf{x})\)</span>, then multiplying it by <span class="math">\(p(\boldsymbol{\theta})\)</span>. Furthermore, since they choose <span class="math">\(q_{\phi}\)</span> to be a <em>Mixture Density Network</em>—a neural network which outputs the parameters of a mixture of Gaussians—and the prior to be a "simple distribution (uniform or Gaussian, as is typically the case in practice)," this adjustment can be done analytically.</p>
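<p>To make the analytic adjustment concrete, here is a minimal sketch—assuming, purely for illustration, a <em>one-dimensional Gaussian</em> (rather than a full mixture) for both the learned posterior and the previous round's proposal, and a uniform prior. The function name <code>correct_posterior</code> and the toy numbers are mine, not the paper's. The division is done in terms of precisions (inverse variances), and is only valid when the proposal is wider than the learned posterior.</p>

```python
def correct_posterior(mu_r, var_r, mu_prop, var_prop):
    """Divide a Gaussian posterior estimate N(mu_r, var_r) by the Gaussian
    proposal N(mu_prop, var_prop) its training data was drawn under. With a
    uniform prior, the corrected posterior is again Gaussian, computed via
    precisions (inverse variances)."""
    precision = 1.0 / var_r - 1.0 / var_prop
    assert precision > 0, "proposal must be wider than the learned posterior"
    var = 1.0 / precision
    mu = var * (mu_r / var_r - mu_prop / var_prop)
    return mu, var

# toy numbers: a sharp learned posterior under a broad proposal
mu, var = correct_posterior(mu_r=1.0, var_r=0.5, mu_prop=0.0, var_prop=2.0)
```

<p>In the paper's actual setting—a mixture of Gaussians divided by a Gaussian or uniform prior—the same precision arithmetic is applied component-wise.</p>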
<p>Conversely, <a href="#lueckmann2017" id="ref-lueckmann2017-1">Lueckmann et al. (2017)</a> <em>train</em> <span class="math">\(q_{\phi}\)</span> on a target <em>reweighted</em> to similar effect: instead of maximizing the total log-likelihood <span class="math">\(\Sigma_{n} \log q_{\phi}(\boldsymbol{\theta}_n\mid\mathbf{x}_n)\)</span>, they maximize the importance-weighted <span class="math">\(\Sigma_{n} w_n \log q_{\phi}(\boldsymbol{\theta}_n\mid\mathbf{x}_n)\)</span>, where <span class="math">\(w_n = \frac{p(\boldsymbol{\theta}_n)}{q_{\phi, r-1}(\boldsymbol{\theta}_n\mid\mathbf{x}_n)}\)</span>.</p>
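<p>A toy sketch of this reweighted objective, assuming Gaussian densities throughout and plain Python for clarity. The helper names (<code>gauss_logpdf</code>, <code>weighted_nll</code>) are illustrative, not from the paper; the key point is that the weight <span class="math">\(w_n\)</span> undoes the bias introduced by sampling <span class="math">\(\boldsymbol{\theta}_n\)</span> from the proposal rather than the prior.</p>

```python
from math import exp, log, pi

def gauss_logpdf(x, mu, var):
    # log density of N(mu, var) evaluated at x
    return -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)

def weighted_nll(thetas, xs, logq, log_prior, log_proposal):
    # importance-weighted negative log-likelihood; w_n = prior / proposal
    loss = 0.0
    for theta, x in zip(thetas, xs):
        w = exp(log_prior(theta) - log_proposal(theta))
        loss -= w * logq(theta, x)
    return loss / len(thetas)

# stand-ins: q_phi and the proposal are fixed Gaussians here, purely for illustration
logq = lambda theta, x: gauss_logpdf(theta, x, 1.0)
log_prior = lambda theta: gauss_logpdf(theta, 0.0, 4.0)
log_proposal = lambda theta: gauss_logpdf(theta, 0.5, 2.0)

loss = weighted_nll([0.0, 1.0], [0.1, 0.9], logq, log_prior, log_proposal)
```

<p>Note that when the proposal equals the prior, all weights are 1 and the objective reduces to the ordinary (unweighted) log-likelihood.</p>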
<p>While both approaches carry further nuance and potential pitfalls, they bring us effective methods for using a neural network to directly estimate a faithful posterior in SBI routines.</p>
<h1>Neural Likelihood Estimation</h1>
<p>In neural likelihood estimation (NLE), we use a neural network to directly estimate the (intractable) likelihood function of the simulator <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> itself. We denote this estimator <span class="math">\(q_{\phi}(\mathbf{x}\mid\boldsymbol{\theta})\)</span>. Finally, we compute our desired posterior as <span class="math">\(\tilde{p}(\boldsymbol{\theta}\mid\mathbf{x}_o) \propto q_{\phi}(\mathbf{x}_o\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span>.</p>
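<p>As a sketch of this final computation—with a hand-written Gaussian standing in for a trained <span class="math">\(q_{\phi}\)</span>, an assumption made purely for illustration—we can evaluate the unnormalized posterior on a one-dimensional grid and normalize it numerically:</p>

```python
from math import exp, pi

def gauss_pdf(x, mu, var):
    # density of N(mu, var) evaluated at x
    return exp(-0.5 * (x - mu) ** 2 / var) / (2 * pi * var) ** 0.5

# stand-in: pretend q_phi learned the likelihood x | theta ~ N(theta, 1)
q_phi = lambda x, theta: gauss_pdf(x, theta, 1.0)
prior = lambda theta: gauss_pdf(theta, 0.0, 4.0)  # N(0, 4) prior

x_o = 0.5
grid = [i * 0.01 - 10.0 for i in range(2001)]      # theta values in [-10, 10]
unnorm = [q_phi(x_o, t) * prior(t) for t in grid]  # q_phi(x_o | theta) p(theta)
Z = sum(unnorm) * 0.01                             # numerical normalizer
posterior = [u / Z for u in unnorm]
```

<p>For this conjugate Gaussian toy case, the grid's posterior mode lands at the known analytic value—a useful sanity check for any such stand-in.</p>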
<p>Similar to Neural Posterior Estimation (NPE) approaches, we'd like to learn our estimator on inputs <span class="math">\(\boldsymbol{\theta}\)</span> that produce <span class="math">\(\mathbf{x}_n \sim p(\mathbf{x}\mid\boldsymbol{\theta}_n)\)</span> near <span class="math">\(\mathbf{x}_o\)</span>. To do this, we again sample them from regions of high approximate posterior density. In each round <span class="math">\(r\)</span>, in NPE, this posterior was <span class="math">\(q_{\phi, r-1}(\boldsymbol{\theta}\mid\mathbf{x} = \mathbf{x}_o)\)</span>; in NLE, it is <span class="math">\(q_{\phi, r-1}(\mathbf{x}_o\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span>. In both cases, we draw samples from our approximate posterior density, then feed them to the simulator to generate novel data for training our estimator <span class="math">\(q_{\phi, r}\)</span>.</p>
<p>For a more detailed treatment, please refer to the original works <a href="#pmlr-v89-papamakarios19a" id="ref-pmlr-v89-papamakarios19a-1">Papamakarios et al. (2019)</a> and <a href="#pmlr-v96-lueckmann19a" id="ref-pmlr-v96-lueckmann19a-1">Lueckmann et al. (2019)</a> (among others).</p>
<h1>Neural Likelihood Ratio Estimation</h1>
<p>In this final class of models, we instead try to directly draw <em>samples</em> from the true posterior itself. However, since we can compute neither <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> nor <span class="math">\(p(\mathbf{x})\)</span>, we first need a sampling algorithm that can operate without them. One such class of algorithms is <em>Markov chain Monte Carlo</em> (MCMC).</p>
<p>In MCMC, we first <em>propose</em> parameter samples <span class="math">\(\boldsymbol{\theta}_i\)</span> from a proposal distribution. Then, we evaluate their <em>fitness</em> by asking the question: "does this sample <span class="math">\(\boldsymbol{\theta}_i\)</span> have higher posterior density than the previous sample <span class="math">\(\boldsymbol{\theta}_j\)</span> we drew?" Generally, this question is answered through comparison, e.g.</p>
<div class="math">$$
\frac{
p(\boldsymbol{\theta}_i\mid\mathbf{x})
} {
p(\boldsymbol{\theta}_{j}\mid\mathbf{x})
} = \frac{
p(\mathbf{x}\mid\boldsymbol{\theta}_i)p(\boldsymbol{\theta}_i) / p(\mathbf{x})
} {
p(\mathbf{x}\mid\boldsymbol{\theta}_j)p(\boldsymbol{\theta}_j) / p(\mathbf{x})
}
$$</div>
<p>Fortunately, the evidence terms <span class="math">\(p(\mathbf{x})\)</span> cancel, and the prior densities <span class="math">\(p(\boldsymbol{\theta})\)</span> are evaluable. Though we cannot compute the likelihood terms outright, we can estimate their <em>ratio</em> and proceed with MCMC as per normal. If <span class="math">\(\frac{p(\boldsymbol{\theta}_i\mid\mathbf{x})}{p(\boldsymbol{\theta}_j\mid\mathbf{x})} \geq 1\)</span>, we <em>accept</em> <span class="math">\(\boldsymbol{\theta}_i\)</span> as a valid sample from our target posterior; otherwise, we accept it with probability equal to the ratio.</p>
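<p>A minimal Metropolis–Hastings sketch with an estimated ratio standing in for the intractable likelihood ratio. Here, <code>ratio_hat</code> is a hand-coded "perfect" ratio for the toy model <span class="math">\(\mathbf{x}\mid\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, 1)\)</span>—an assumption for illustration; in practice it would come from a learned classifier.</p>

```python
import random
from math import exp

random.seed(0)

def mh_with_ratio(ratio_hat, log_prior, x_o, n_steps=5000, step=1.0):
    """Metropolis-Hastings where the intractable likelihood ratio
    p(x_o | proposal) / p(x_o | current) is replaced by an estimate."""
    theta, samples = 0.0, []
    for _ in range(n_steps):
        prop = theta + random.gauss(0.0, step)  # symmetric random-walk proposal
        # acceptance ratio: estimated likelihood ratio times prior ratio
        alpha = ratio_hat(x_o, prop, theta) * exp(log_prior(prop) - log_prior(theta))
        if random.random() < min(1.0, alpha):
            theta = prop
        samples.append(theta)
    return samples

# stand-ins: "perfect" ratio under x | theta ~ N(theta, 1), and a N(0, 4)
# prior written up to an additive constant in log space
ratio_hat = lambda x, ti, tj: exp(-0.5 * (x - ti) ** 2 + 0.5 * (x - tj) ** 2)
log_prior = lambda t: -t * t / 8.0

samples = mh_with_ratio(ratio_hat, log_prior, x_o=0.5)
```

<p>For this conjugate toy case the true posterior is <span class="math">\(\mathcal{N}(0.4, 0.8)\)</span>, so the post-burn-in sample mean should hover near 0.4.</p>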
<h2>Estimating the likelihood ratio</h2>
<p>Let us define the likelihood ratio as</p>
<div class="math">$$
r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j) = \frac{
p(\mathbf{x}\mid\boldsymbol{\theta}_i)
} {
p(\mathbf{x}\mid\boldsymbol{\theta}_j)
}
$$</div>
<p>Ingeniously, <a href="#cranmer2015" id="ref-cranmer2015-1">Cranmer et al. (2015)</a> propose to learn a classifier to discriminate samples <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta}_i)\)</span> from <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta}_j)\)</span>, then use its predictions to estimate <span class="math">\(r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)\)</span>.</p>
<p>To do this, they draw training samples <span class="math">\((\mathbf{x}, y=1) \sim p(\mathbf{x}\mid\boldsymbol{\theta}_i)\)</span> and <span class="math">\((\mathbf{x}, y=0) \sim p(\mathbf{x}\mid\boldsymbol{\theta}_j)\)</span>, then train a binary classifier <span class="math">\(d(y\mid\mathbf{x})\)</span> on this data. A perfect classifier then gives:</p>
<div class="math">$$
\begin{align*}
d^*(y=1\mid\mathbf{x})
&= \frac{
p(\mathbf{x}\mid\boldsymbol{\theta}_i)
} {
p(\mathbf{x}\mid\boldsymbol{\theta}_i) + p(\mathbf{x}\mid\boldsymbol{\theta}_j)
} \\
d^*(y=0\mid\mathbf{x})
&= \frac{
p(\mathbf{x}\mid\boldsymbol{\theta}_j)
} {
p(\mathbf{x}\mid\boldsymbol{\theta}_i) + p(\mathbf{x}\mid\boldsymbol{\theta}_j)
} \\
\end{align*}
$$</div>
<p>Consequently,</p>
<div class="math">$$
\begin{align*}
r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)
&= \frac{
p(\mathbf{x}\mid\boldsymbol{\theta}_i)
} {
p(\mathbf{x}\mid\boldsymbol{\theta}_j)
} \\
&= \frac{
d^*(y=1\mid\mathbf{x})
} {
d^*(y=0\mid\mathbf{x})
} \\
&= \frac{
d^*(y=1\mid\mathbf{x})
} {
1 - d^*(y=1\mid\mathbf{x})
}
\end{align*}
$$</div>
<p>Since our classifier won't be perfect, we simply denote it <span class="math">\(d(y\mid\mathbf{x})\)</span>, where</p>
<div class="math">$$
\begin{align*}
\hat{r}(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)
&= \frac{d(y=1\mid\mathbf{x})}{1 - d(y=1\mid\mathbf{x})}\\
&\approx r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)
\end{align*}
$$</div>
<p>With <span class="math">\(\hat{r}(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)\)</span> in hand, we can compare the posterior density of proposed samples <span class="math">\(\boldsymbol{\theta}_i\)</span> and <span class="math">\(\boldsymbol{\theta}_j\)</span> in our MCMC routine.</p>
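<p>A toy end-to-end sketch of this idea: we draw samples from two known Gaussians (standing in for simulator outputs at fixed <span class="math">\(\boldsymbol{\theta}_i\)</span> and <span class="math">\(\boldsymbol{\theta}_j\)</span>), fit a tiny logistic-regression classifier by gradient descent, then convert its predictions into the ratio estimate. All names and hyperparameters here are illustrative, not canonical.</p>

```python
import random
from math import exp

random.seed(0)

# y = 1: samples from p(x | theta_i) = N(1, 1); y = 0: from p(x | theta_j) = N(-1, 1)
data = [(random.gauss(1.0, 1.0), 1) for _ in range(2000)] + \
       [(random.gauss(-1.0, 1.0), 0) for _ in range(2000)]

# tiny logistic regression d(y=1 | x) = sigmoid(w * x + b), trained by batch GD
w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    gw = gb = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

def ratio_hat(x):
    # d / (1 - d) = exp(w * x + b): the estimated likelihood ratio
    d = 1.0 / (1.0 + exp(-(w * x + b)))
    return d / (1.0 - d)
```

<p>For these two unit-variance Gaussians, the true log-ratio is <span class="math">\(2\mathbf{x}\)</span>, so a well-trained classifier's odds should grow with <span class="math">\(\mathbf{x}\)</span> and cross 1 near zero.</p>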
<h2>Generalizing our classifier</h2>
<p>To use the above classifier in our inference routine, we must <em>retrain</em> a <em>new</em> classifier for every <em>unique</em> set of parameters <span class="math">\(\{\boldsymbol{\theta}_i, \boldsymbol{\theta}_j\}\)</span>. Clearly, this is extremely impractical. How can we generalize our classifier such that we only have to train it once?</p>
<p>In <a href="#cranmer2015" id="ref-cranmer2015-2">Cranmer et al. (2015)</a>, the authors learn a <em>single</em> classifier <span class="math">\(d(y\mid\mathbf{x}, \boldsymbol{\theta})\)</span> to discriminate samples <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> from <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta}_{ref})\)</span>, where <span class="math">\(\boldsymbol{\theta}\)</span> is an <em>arbitrary</em> parameter value, and <span class="math">\(\boldsymbol{\theta}_{ref}\)</span> is a fixed, <em>reference</em> parameter value. It is trained on data <span class="math">\((\mathbf{x}, \boldsymbol{\theta}, y=1) \sim p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> and <span class="math">\((\mathbf{x}, \boldsymbol{\theta}_{ref}, y=0) \sim p(\mathbf{x}\mid\boldsymbol{\theta}_{ref})\)</span>. Once trained, it gives:</p>
<div class="math">$$
r(\mathbf{x}\mid\boldsymbol{\theta}, \boldsymbol{\theta}_{ref})
= \frac{
d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta})
} {
1 - d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta})
}
$$</div>
<p>Consequently,</p>
<div class="math">$$
\begin{align*}
r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)
&= \frac{
r(\mathbf{x}\mid\boldsymbol{\theta}_i, \boldsymbol{\theta}_{ref})
} {
r(\mathbf{x}\mid\boldsymbol{\theta}_j, \boldsymbol{\theta}_{ref})
} \\
&= \frac{
d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta}_i)
} {
1 - d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta}_i)
} \cdot \frac{
1 - d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta}_j)
} {
d^*(y=1\mid\mathbf{x}, \boldsymbol{\theta}_j)
}
\end{align*}
$$</div>
<p>With a <em>single</em> model, we can now compare the density of two proposed posterior samples.</p>
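<p>Sketching this identity with an assumed-<em>perfect</em> reference classifier for the toy model <span class="math">\(\mathbf{x}\mid\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, 1)\)</span> and reference value <span class="math">\(\boldsymbol{\theta}_{ref} = 0\)</span> (my choices, for illustration):</p>

```python
from math import exp

def d_star(x, theta, theta_ref=0.0):
    # a perfect classifier's probability of y=1, whose odds equal
    # p(x | theta) / p(x | theta_ref) for x | theta ~ N(theta, 1)
    odds = exp(-0.5 * (x - theta) ** 2 + 0.5 * (x - theta_ref) ** 2)
    return odds / (1.0 + odds)

def ratio(x, theta_i, theta_j):
    """Likelihood ratio p(x|theta_i) / p(x|theta_j) recovered from the
    single reference classifier, as a ratio of two odds ratios."""
    odds_i = d_star(x, theta_i) / (1.0 - d_star(x, theta_i))
    odds_j = d_star(x, theta_j) / (1.0 - d_star(x, theta_j))
    return odds_i / odds_j
```

<p>The reference value <span class="math">\(\boldsymbol{\theta}_{ref}\)</span> cancels out of the final ratio, which is exactly why one classifier suffices.</p>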
<h2>Improving our generalized classifier</h2>
<p>Once more, our classifier <span class="math">\(d(y\mid\mathbf{x}, \boldsymbol{\theta})\)</span> discriminates samples <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> from <span class="math">\(\mathbf{x} \sim p(\mathbf{x}\mid\boldsymbol{\theta}_{ref})\)</span>. But what should our classifier do when a given <span class="math">\(\mathbf{x}\)</span> was drawn from neither <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> <em>nor</em> <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta}_{ref})\)</span>? In <a href="#hermans2019" id="ref-hermans2019-1">Hermans et al. (2019)</a>, the authors illustrate this problem—</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/neural-sbi/undefined-classifier.png"/></p>
<p>—stressing that "poor inference results occur in the absence of support between <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> and <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta}_{ref})\)</span>."</p>
<p>As a solution, they propose to learn a (neural) classifier that instead discriminates <em>dependent</em> sample-parameter pairs <span class="math">\((\mathbf{x}, \boldsymbol{\theta}) \sim p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span> from <em>independent</em> sample-parameter pairs <span class="math">\((\mathbf{x}, \boldsymbol{\theta}) \sim p(\mathbf{x})p(\boldsymbol{\theta})\)</span>. Since <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})\)</span> and <span class="math">\(p(\mathbf{x})p(\boldsymbol{\theta})\)</span> occupy the same space, they share a common support. In other words, the likelihood of a given <span class="math">\(\mathbf{x}\)</span> will <em>always</em> be positive for <em>some</em> <span class="math">\(\boldsymbol{\theta}\)</span> in the figure above.</p>
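<p>Constructing the training data for such a classifier is simple: pair each simulated <span class="math">\(\mathbf{x}\)</span> with the <span class="math">\(\boldsymbol{\theta}\)</span> that generated it (label 1), then with a randomly shuffled <span class="math">\(\boldsymbol{\theta}\)</span> (label 0). A sketch, again with a toy Gaussian simulator standing in for the real one:</p>

```python
import random

random.seed(0)

# dependent pairs from p(theta) p(x | theta), with x | theta ~ N(theta, 1)
thetas = [random.gauss(0.0, 2.0) for _ in range(1000)]
xs = [random.gauss(t, 1.0) for t in thetas]
joint = [(x, t, 1) for x, t in zip(xs, thetas)]       # y = 1: dependent pairs

# independent pairs: shuffle thetas against xs, i.e. samples from p(x) p(theta)
shuffled = thetas[:]
random.shuffle(shuffled)
marginal = [(x, t, 0) for x, t in zip(xs, shuffled)]  # y = 0: independent pairs

data = joint + marginal
# a classifier d(y=1 | x, theta) trained on `data` has odds approximating
# p(x | theta) / p(x) — the likelihood-to-evidence ratio — which, multiplied
# by p(theta), recovers the (unnormalized) posterior for any x and theta
```

<p>Shuffling rather than re-sampling is a common trick here: it yields marginal pairs without extra simulator calls.</p>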
<h1>Conclusion</h1>
<p>Simulation-based inference is a class of techniques that allows us to perform Bayesian inference where our data-generating model <span class="math">\(p(\mathbf{x}\mid\boldsymbol{\theta})\)</span> lacks a tractable likelihood function, yet permits simulation of novel data. In the above sections, we detailed several SBI approaches, and ways in which neural networks are currently being used in each.</p>
<h2>Credit</h2>
<p>Credit to <a href="https://www.processmaker.com/wp-content/uploads/2021/07/simulation-modeling-process-mining.jpg">ProcessMaker</a> for social card image.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script><hr/>
<h2>Bibliography</h2>
<p id="10.1080/01621459.2017.1285773">David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe.
<span class="bibtex-protected">Variational Inference: A Review for Statisticians</span>.
<em>Journal of the American Statistical Association</em>, 112(518):859–877, 2017.
<a href="https://arxiv.org/abs/1601.00670">arXiv:1601.00670</a>, <a href="https://doi.org/10.1080/01621459.2017.1285773">doi:10.1080/01621459.2017.1285773</a>. <a class="cite-backref" href="#ref-10.1080/01621459.2017.1285773-1" title="Jump back to reference 1">↩</a></p>
<p id="cranmer2015">Kyle Cranmer, Juan Pavez, and Gilles Louppe.
<span class="bibtex-protected">Approximating Likelihood Ratios with Calibrated Discriminative Classifiers</span>.
<em>arXiv</em>, 2015.
<a href="https://arxiv.org/abs/1506.02169">arXiv:1506.02169</a>. <a class="cite-backref" href="#ref-cranmer2015-1" title="Jump back to reference 1">↩</a><a class="cite-backref" href="#ref-cranmer2015-1" title="Jump back to reference 1"> <sup>1</sup> </a><a class="cite-backref" href="#ref-cranmer2015-2" title="Jump back to reference 2"><sup>2</sup> </a></p>
<p id="hermans2019">Joeri Hermans, Volodimir Begy, and Gilles Louppe.
<span class="bibtex-protected">Likelihood-free MCMC with Amortized Approximate Ratio Estimators</span>.
<em>arXiv</em>, 2019.
<a href="https://arxiv.org/abs/1903.04057">arXiv:1903.04057</a>. <a class="cite-backref" href="#ref-hermans2019-1" title="Jump back to reference 1">↩</a></p>
<p id="pmlr-v96-lueckmann19a">Jan-Matthis Lueckmann, Giacomo Bassetto, Theofanis Karaletsos, and Jakob H. Macke.
Likelihood-free inference with emulator networks.
In Francisco Ruiz, Cheng Zhang, Dawen Liang, and Thang Bui, editors, <em>Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference</em>, volume 96 of Proceedings of Machine Learning Research, 32–53. 02 Dec 2019. PMLR.
URL: <a href="http://proceedings.mlr.press/v96/lueckmann19a.html">http://proceedings.mlr.press/v96/lueckmann19a.html</a>. <a class="cite-backref" href="#ref-pmlr-v96-lueckmann19a-1" title="Jump back to reference 1">↩</a></p>
<p id="lueckmann2017">Jan-Matthis Lueckmann, Pedro J Goncalves, Giacomo Bassetto, Kaan Öcal, Marcel Nonnenmacher, and Jakob H Macke.
<span class="bibtex-protected">Flexible statistical inference for mechanistic models of neural dynamics</span>.
<em>arXiv</em>, 2017.
<a href="https://arxiv.org/abs/1711.01861">arXiv:1711.01861</a>. <a class="cite-backref" href="#ref-lueckmann2017-1" title="Jump back to reference 1">↩</a></p>
<p id="papamakarios2016">George Papamakarios and Iain Murray.
Fast ε-free inference of simulation models with bayesian conditional density estimation.
In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, <em>Advances in Neural Information Processing Systems</em>, volume 29. Curran Associates, Inc., 2016.
URL: <a href="https://proceedings.neurips.cc/paper/2016/file/6aca97005c68f1206823815f66102863-Paper.pdf">https://proceedings.neurips.cc/paper/2016/file/6aca97005c68f1206823815f66102863-Paper.pdf</a>. <a class="cite-backref" href="#ref-papamakarios2016-1" title="Jump back to reference 1">↩</a></p>
<p id="pmlr-v89-papamakarios19a">George Papamakarios, David Sterratt, and Iain Murray.
Sequential neural likelihood: fast likelihood-free inference with autoregressive flows.
In Kamalika Chaudhuri and Masashi Sugiyama, editors, <em>Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics</em>, volume 89 of Proceedings of Machine Learning Research, 837–848. PMLR, 16–18 Apr 2019.
URL: <a href="http://proceedings.mlr.press/v89/papamakarios19a.html">http://proceedings.mlr.press/v89/papamakarios19a.html</a>. <a class="cite-backref" href="#ref-pmlr-v89-papamakarios19a-1" title="Jump back to reference 1">↩</a></p>
Things I've Built in Crypto2021-12-21T18:00:00-05:002021-12-21T18:00:00-05:00Will Wolftag:willwolf.io,2021-12-21:/2021/12/21/crypto-projects/<p>A running list of things I've built in and around the crypto space.</p>
<h1>simsim</h1>
<p>A (sim)ple (sim)ulation <a href="https://github.com/cavaunpeu/simsim">framework</a> for generalized dynamical systems, in Rust.</p>
<p><img alt="simsim" class="img-responsive" src="https://willwolf.io/images/simsim.png"/></p>
<h1>solsim</h1>
<p>A <a href="https://github.com/cavaunpeu/solsim">simulator</a> for complex systems on the Solana blockchain.</p>
<p><img alt="solsim results app screenshot" class="img-responsive" src="https://raw.githubusercontent.com/cavaunpeu/solsim/main/img/results_explorer_app.png"/></p>
<h1>Solana Economic Simulator</h1>
<p>A Streamlit <a href="https://share.streamlit.io/cavaunpeu/solana-economics/main/app/main.py">app</a> for simulating Solana validator economics.</p>
<p><img alt="solana economics simulation screenshot" class="img-responsive" src="https://raw.githubusercontent.com/cavaunpeu/solana-economics/main/screenshot.png"/></p>
<h1>Anchor Escrow Program</h1>
<p>A toy implementation of a Solana escrow <a href="https://anchor-escrow-program.netlify.app/">program</a> in Anchor and React.</p>
<p><img alt="escrow program screenshot" class="img-responsive" src="https://anchor-escrow-program.netlify.app/card_image.png"/></p>Exploring, Crypto2021-12-10T10:00:00-05:002021-12-10T10:00:00-05:00Will Wolftag:willwolf.io,2021-12-10:/2021/12/10/exploring-crypto/<p>I'm taking some time to explore: Rust, complex systems, crypto.</p><p>To "explore vs. exploit" is a classic dilemma in decision-making. To illustrate, imagine it’s date-night in downtown Manhattan and you’re choosing where to eat.</p>
<p>To <em>exploit</em> is to dine at your favorite restaurant. The food is well-cooked. The servers are attentive. You'll have a great meal and you know it.</p>
<p>To <em>explore</em> is to try something new. It's to implicitly concede that while your favorite restaurant is great, it may not be the best overall. Your new choice may very well end in vomit and tears. However, you'll never find something better unless you try.</p>
<h2>ASAPP</h2>
<p>Four and a half years at <a href="https://willwolf.io/2021/12/04/leaving-asapp/">ASAPP</a> was my career on "exploit." ASAPP is a company solving important problems, replete with wonderful colleagues and ample opportunity for personal growth. I benefited from this environment each and every day. I kept choosing ASAPP because it unfailingly was, and remains, an excellent place to call my professional home.</p>
<p>I joined ASAPP at age 28. I'm now 32. And I'd like to see what else there is. Where else I can learn, enjoy, create, and grow. In simple terms, I'd like to explore.</p>
<h2>What's next</h2>
<p>On the technical front:</p>
<ul>
<li>Learn <strong>Rust</strong> and smart-contract development in Solana.</li>
<li>Study the theory and applications of <a href="https://en.wikipedia.org/wiki/Complex_system"><strong>complex systems</strong></a> (per my long-standing <a href="https://honors.libraries.psu.edu/files/final_submissions/970">interest</a> in this space).</li>
<li>Apply these tools to the <strong>crypto</strong> space at large.</li>
</ul>
<p>On the personal front:</p>
<ul>
<li>Study <strong>Russian.</strong></li>
<li>Spend time with friends.</li>
<li>Travel.</li>
</ul>
<h2>PhD?</h2>
<p>In addition, I've applied for Ph.D. programs in statistical ML. Hopefully, I'll have a difficult choice to make in March.</p>
<h2>In sum</h2>
<p>In the spirit of exploration, I won't rigidly structure this period of my life. Instead, I'll see where my interests lead and follow them vigorously.</p>
<p>Please don't hesitate to say 👋 if any of the above resonates with you!</p>
<p>Will</p>Leaving ASAPP2021-12-04T12:00:00-05:002021-12-04T12:00:00-05:00Will Wolftag:willwolf.io,2021-12-04:/2021/12/04/leaving-asapp/<p>I left <a href="https://www.asapp.com/">ASAPP</a> after 4+ wonderful years.</p><p>I joined ASAPP in August 2017 as its 78th hire. I'd just finished my Open-Source ML <a href="https://willwolf.io/2016/07/29/my-open-source-machine-learning-masters-in-casablanca-morocco/">"Master's"</a> in Morocco; I wanted to <a href="https://willwolf.io/2017/06/20/my-next-role/">join</a> "a small company with ML at its core." I was referred by an old colleague. We barely had a <a href="https://web.archive.org/web/20170825060205/https://www.asapp.com/">website</a>.</p>
<p>51 months went by quickly. ASAPP was a magical place. To my colleagues, know that you made a profound personal, professional, intellectual, and creative impact on my life. That our time by the whiteboard with fickle black markers, the monsoon swell of positive energy and "is there an extra chair for me?" that were our daily lunches, our unfettered excitement for the field and optimism for the technologies we build are golden memories that I'll hold close to my heart for years to come.</p>
<p>All great stories must one day end. For me, that was yesterday. It was peculiar and bittersweet. I'm sincerely lucky to have worked at ASAPP, and I look forward to seeing all they accomplish in the future.</p>
<p><strong>From the bottom of my heart, thank you.</strong></p>
<p>From here, I'm starting something new! I'll share an update soon.</p>
<p>@will</p>Halford Mackinder on Artificial Intelligence2020-08-16T16:00:00-04:002020-08-16T16:00:00-04:00Will Wolftag:willwolf.io,2020-08-16:/2020/08/16/mackinder-on-ai/<p>What would preeminent 20th century geographer Halford J. Mackinder say about the coming revolution in artificial intelligence and its impact on our current ideological war?</p><p><a href="https://en.wikipedia.org/wiki/Halford_Mackinder">Halford J. Mackinder</a> was an English geographer, academic and politician. Born in the 1860s and deceased shortly after the end of World War II, he is largely known for a reading at the Royal Geographical Society of his 1904 paper, <em><a href="https://www.iwp.edu/wp-content/uploads/2019/05/20131016_MackinderTheGeographicalJournal.pdf">The Geographical Pivot of History</a></em>. In this work, Mackinder describes in sweeping detail the preceding two thousand years of human activity—how geography, along with climate, religion, commerce and war, shaped the world of the time of his writing and facilitated the meteoric rise of the then-dominant Continental European powers. On this foundation, he then delivers, in an inspiring if admonitory crescendo, his famous "Heartland Theory"—a geopolitical analysis of the importance of these powers controlling what is modern Central Asia in the coming (20th) century—for which he is most singularly remembered.</p>
<p>The goal of this post is to reimagine The Pivot in content and form for the 21st century. First, I'll reintroduce the paper's main ideas. Then, I'll play with the following question: <strong>what might Mackinder write if he were alive today?</strong></p>
<h2>The Geographical Pivot of History</h2>
<p>Mackinder begins his famous work by asserting that in 1904, the world stands at the <em>exit</em> of the Columbian epoch: a 400-year period of European expansionism, in which European states—principally French, Spanish, Dutch, English and Portuguese—sent their generals, scientists and engineers to explore, detail, and ultimately subjugate people and land in the furthest reaches of the globe—transforming themselves from provincial, land-based powers into global empires, now dominant at sea, and transforming the "known world map" from an ill-lit, half-built cabin deep in an unnamed wood to an immaculate Grand Hotel with a different European powerhouse in every single room. As such, in 1904, his continent's powers sat atop a now-familiar world, shoulder-to-shoulder, with nowhere to go but <em>down.</em></p>
<p>With the stage set thus, Mackinder invites us to step back, inhale deeply, and reflect—with a bias towards Continental Europe, no doubt, owing implicitly to its then-dominant stature, and perhaps to Mackinder being English himself—on some of history's broadest currents through the lens of geography: what was, what is, and what will be utterly crucial in the coming 20th century—his "Geographical Pivot."</p>
<p>The Continental European landmass comprises an intricate series of mountains, valleys and peninsulas. These features, in turn, provide Europeans with ample physical barriers between which a multitude of politically and culturally distinct nation-states could and did take root. In stark contrast, Mackinder then describes the Russian landmass, one of an altogether different geographic profile: dense mixed and deciduous forest to the North; dry, arid steppe to the South, relatively inhospitable to both life and agriculture; unendingly flat throughout. Unlike Europe, this geography bred a different type of person and society: regional capitals like Kiev, Moscow and Kazan took root just north of this dividing forest-steppe frontier, whereas the sandy steppe itself spawned, well, a thousand years of nomadic invaders—Huns, Avars, Bulgarians, Magyars, Khazars, Patzinaks, Cumans, Kalmuks, Genghis Khan's Mongol hordes, and others—hardened by its brutal conditions, in sync with the land's silent whisper of "keep on moving, life won't survive here."</p>
<p>For its first 1,500 years following the death of Jesus Christ, Continental Europe evolved largely on land: corralled from the West by the Atlantic, as yet uncrossed; from the North by the Scandinavian tundra; from the South by, in effect, the immense sweep of the then-impenetrable Sahara; and from the East by roughly 7,500 kilometers of remaining Eurasia comprised primarily of hostile Central Asian Steppe, separating Berlin from Beijing and the Pacific Ocean. As such, the bulk of Europe's military and political growth remained on its home turf. And implicitly, as the only game in town, <em>land power</em> reigned supreme.</p>
<p>On the heels of the Renaissance around 1500 A.D., things changed. Emboldened and energized by cultural and economic development at home, European explorers took to the seas in zealous pursuit of additional sources of wealth and knowledge abroad. In 1498, <a href="https://en.wikipedia.org/wiki/Vasco_da_Gama">Vasco da Gama</a> rounded the Cape of Good Hope, traveling directly from Europe to India without traversing the Middle East. In 1492, <a href="https://en.wikipedia.org/wiki/Christopher_Columbus">Columbus</a>, the namesake of the Columbian age itself, crossed the Atlantic and reached the Americas. Over the next 400 years, European nation-states swiftly amassed empires of global proportions, subjugating previous inhabitants under their economic, military and political control, and becoming, new in history, the preeminent titans of <em>sea power</em> throughout.</p>
<p>The following map shows European exploration routes during this era. (Though it only covers years 1492-1682, the general trend it shows continued until the time of Mackinder's writing.)</p>
<p><img alt="european exploration map" class="img-responsive" src="http://callisto.ggsrv.com/imgsrv/FastFetch/UBER2/00018293"/></p>
<p>Simply, Europeans went everywhere! Except the Russian Empire. Except Mongolia. Except the Central Asian Steppe.</p>
<p>In 1904, following 400 years of global domination, European historians were wont to consider their own stories, self-development and influence as superior to that of others. In response, in his paper, Mackinder implores us to reconsider the map of pre-Columbian Continental Europe, and the ways in which, owing to this map, the development of European society itself was shaped considerably from the <em>outside</em> in.</p>
<p>Indeed, until Columbus, the Atlantic Ocean remained uncrossed, thereby serving as an impregnable natural border to the West. Likewise, the menacing infinity of the Sahara to the South remained uncrossed until the mid-19th century. Conversely, the cold and cheerless tundra to the North played host to the <a href="https://en.wikipedia.org/wiki/Viking_Age">Viking Age</a>, a 200-year period of disruption and settlement in and around the North Sea—the British Isles, Ireland, Iceland, Greenland, the Faroe Islands, and others—as well as descents down the rivers Dnieper and Volga wherein the Vikings lived and traded with peoples in Kievan Rus and the Eastern Roman Empire.</p>
<p>Finally, east of the Carpathians (in present-day Romania), sweeping north-east to the Ural Mountains (approximated by Chelyabinsk, Russia on the map below) then south-east through both Kazakhstan and Mongolia, is, well, what? What protected and will henceforth protect Europe from an intruding Eastern neighbor? On what natural borders can it count? Here, Mackinder (figuratively) smiles, stands up, coolly eyes the crowd, and concludes: "Nothing." Then, he reminds European historians what actually happened before the recent Columbian epoch that has so warped their perception and egos.</p>
<p><img alt="central asian steppe" class="img-responsive" src="https://willwolf.io/images/mackinder-on-ai/central-asian-steppe.png"/></p>
<p>In reality, the 5th to 16th centuries saw unending progressions of inveterate Asiatic invaders crossing the ceaseless Central Asian Steppe and <em>hammering</em> the European land powers like <em>pestle to mortar.</em> Furthermore, there is evidence that these invasions sparked the formation of many of the modern Continental European states themselves! Mackinder writes:</p>
<blockquote>
<p>A large part of modern history might be written as a commentary upon the changes directly or indirectly ensuing from these raids. The Angles and Saxons, it is quite possible, were then driven to cross the seas to found England in Britain. The Franks, the Goths, and the Roman provincials were compelled, for the first time, to stand shoulder to shoulder on the battlefield of Chalons, making common cause against the Asiatics, who were unconsciously welding together modern France. Venice was founded from the destruction of Aquileia and Padua; and even the Papacy owed a decisive prestige to the successful mediation of Pope Leo with <a href="https://en.wikipedia.org/wiki/Attila">Attila</a> at Milan. Such was the harvest of results produced by a cloud of ruthless and idealess horsemen sweeping over the unimpeded plain—a blow, as it were, from the great Asiatic hammer striking freely through the vacant space.</p>
</blockquote>
<p>In other words, though European society had matured significantly in the preceding 2,000 years, evolving from land-based principalities to sea-faring, nation-state empires of global reach, its <em>geography</em> had stayed, and would always stay, the same. In 1904, transcontinental dominion notwithstanding, the Steppe to Europe's East lay flat, barren, uninterrupted—<em>and un-mastered by Europe</em> as ever.</p>
<p><img alt="european colonial possessions 1900" class="img-responsive" src="https://apworldunit5.weebly.com/uploads/7/6/7/6/76761401/701214.gif?880"/></p>
<p><em>N.B. By 1900, the majority of Latin American states, their own colonial history predominantly that of Spanish, Dutch and Portuguese imperialism, had already gained independence.</em></p>
<p>And finally, in 1904, why was this so significant? Among his esteemed and learned colleagues at the Royal Geographical Society, representing in turn the titans of contemporary global order, why was Mackinder so acutely preoccupied with the final "piece of the puzzle"—the "Heartland"—so influential in Europe's pre-Columbian history?</p>
<p>As a barren steppe, the Heartland lacks the stone and timber necessary for road-making. As such, throughout the majority of history, it was best traversed in the same fashion as that of the Asiatic invaders: on lumbering horse and camel. However, in 1904, a recent technology was poised to <em>radically</em> alter this fact, to substantially accelerate the pace with which the Steppe could be crossed, to transform the perennial Steppe Invasion from a dainty pestle into an earth-fracturing pneumatic drill. This technology, <a href="https://en.wikipedia.org/wiki/History_of_rail_transport">introduced</a> in Britain around 1830, was the <em>railroad</em>, and the <em>steam-powered trains</em> to which it played host.</p>
<p>The following graph shows the growing number of rail passengers per year at the time of Mackinder's writing, which we can understand as a proxy for the growing number of miles of railroad itself:</p>
<p><img alt="rail passengers by year" class="img-responsive" src="https://upload.wikimedia.org/wikipedia/commons/1/15/GBR_rail_passengers_by_year_1830-2015.png"/></p>
<p>Here, Mackinder concludes, delivering his key insight as well as the Heartland Theory itself in one grand sweep:</p>
<blockquote>
<p>A generation ago steam and the Suez canal appeared to have increased the mobility of sea-power relatively to land-power. Railways acted chiefly as feeders to ocean-going commerce. But trans-continental railways are now transmuting the conditions of land-power, and nowhere can they have such effect as in the closed heart-land of Euro-Asia, in vast areas of which neither timber nor accessible stone was available for road-making. Railways work the greater wonders in the steppe, because they directly replace horse and camel mobility, the road stage of development having here been omitted. [...] As we consider this rapid review of the broader currents of history, does not a certain persistence of geographical relationship become evident? Is not the pivot region of the world's politics that vast area of Euro-Asia which is inaccessible to ships, but in antiquity lay open to the horse-riding nomads, as is to-day about to be covered with a network of railways?</p>
</blockquote>
<p><img alt="geographical pivot" class="img-responsive" src="https://willwolf.io/images/mackinder-on-ai/geographical-pivot.png"/></p>
<h2>Zooming out</h2>
<p>Though Mackinder's thesis is specific to 20th century Continental Europe, I think there is a broader framework at play. Personally, I read Mackinder to be saying the following:</p>
<ol>
<li>
<p>In the present day, there is a global hegemon. This hegemon is content, if haughty, with its place in the world. Modern history presented this hegemon with a task, a challenge, an "unclaimed space on which to plant its flags," and the hegemon did just that.</p>
</li>
<li>
<p>Well, not quite: there is still <em>some</em> space it failed to claim. And this particular space, as it goes, facilitated singular and outsized destruction in the past. So, hegemon beware!</p>
</li>
<li>
<p>There is today a novel technology neatly poised to reify this destruction on a much grander scale than ever before.</p>
</li>
</ol>
<p>With this framework in mind, what might Mackinder write if he were alive today?</p>
<h2>The United States</h2>
<p>In the past 100 years, the United States has transformed itself from a non-interventionist island nation of lofty ideals into a heavyweight champion of military might, economic preponderance, and cultural dominance at global scale. To wit, 1918 saw President Woodrow Wilson delivering his <a href="https://www.theworldwar.org/learn/peace/fourteen-points">Fourteen Points</a> touting "freedom of the seas and free trade and the concept of national self-determination," while 2020 America maintains ~1,000 <a href="https://www.thenation.com/article/archive/the-us-has-military-bases-in-172-countries-all-of-them-must-close/">military bases</a> in ~100 countries, controls ~30% of the <a href="https://www.visualcapitalist.com/all-of-the-worlds-wealth-in-one-visualization/#:~:text=Leading%20the%20pack%20is%20the,share%20of%20the%20global%20total.">world's wealth</a>, and cannot count on <a href="https://www.washingtonpost.com/news/monkey-cage/wp/2016/12/23/the-cia-says-russia-hacked-the-u-s-election-here-are-6-things-to-learn-from-cold-war-attempts-to-change-regimes/">seven sets of hands</a> the number of times its CIA has <a href="https://www.amazon.com/Killing-Hope-C-I-Interventions-II-Updated/dp/1567512526">intervened in foreign countries</a> since its <a href="https://www.cia.gov/about-cia/history-of-the-cia">establishment in 1947</a>. Though its detractors are harsh, and though President Donald Trump's foppishness and distinctly isolationist bent may indeed alter this course, the fact remains that the United States <em>is</em> Mackinder's unambiguous contemporary hegemon—his Continental Europe of 1904—eminently comfortable, if <a href="https://www.nytimes.com/2020/02/07/opinion/sunday/western-society-decadence.html">decaying</a>, in its gold-rimmed seat on top of the world.</p>
<p>Continental Europe achieved its own preponderance by winning the war of global <em>land</em>: occupying—with lust, zeal and greed—as much of the map as it could throughout the Columbian era. In turn, the United States reached its current seat by winning the war of global <em>ideas</em>, which Europe lost: from the brutal, hellish ruins of a 77-year Battle Royal from 1914 to 1991 between fascism, communism and free-market democracy—in other words, a fight not over who gets what of an empty map, but over how we live where we currently are in a map already full, encompassing the two World Wars and the Cold War—the United States and the ideas it represented emerged victorious, while Continental Europe fell to ruin. From then on, the United States became the global imperial power, much like the Columbian Europeans, patrolling the space of ideas with weapons and money—both as a beachhead-plus-lighthouse for the ideals of democracy, along with frequent <a href="https://en.wikipedia.org/wiki/Foreign_interventions_by_the_United_States#Post-Cold_War">foreign interventions</a> to this effect—instead of the seas themselves, cementing its role as global hegemon in the process.</p>
<h2>The modern war of ideas</h2>
<p>Whereas the 20th century fought over what I will call "macro-ideology"—how nation-states should self-organize in the broadest of political, economic and social terms—2020 America is home to vicious wars over, in comparison, "micro-ideology": conduct in the workplace, gender identity and bias, police brutality, racial justice, who "belongs" where, and an infinity of other like topics that animate a sizable chunk of our current social discourse.</p>
<p>In this fight, people have taken to the street. To wit, the May 2020 murder of <a href="https://en.wikipedia.org/wiki/George_Floyd">George Floyd</a> at the hands of Minneapolis police officer Derek Chauvin sparked humungous, long-lasting protests around the country and world. In 2017, the day after Donald Trump's inauguration, the <a href="https://en.wikipedia.org/wiki/2017_Women%27s_March">Women's March</a> protesting his anti-women rhetoric became the largest single-day protest in U.S. history. Throughout the Trump presidency, white nationalist groups have variously assembled—the infamous 2017 Unite the Right rally in <a href="https://en.wikipedia.org/wiki/Unite_the_Right_rally">Charlottesville</a> as the most prominent example—declaiming in fire and fury, well, who really knows—to much public attention and even <a href="https://www.bbc.com/news/world-us-canada-40980175">counter-protest</a>.</p>
<p>Nevertheless, for all of the issues over which we currently fight, do we not, as Ross Douthat writes in <a href="https://www.amazon.com/Decadent-Society-Became-Victims-Success/dp/1476785244"><em>The Decadent Society</em></a>, conduct the <em>majority</em> of these battles on our <em>screens</em> rather than in the streets themselves? Are the <a href="https://en.wikipedia.org/wiki/Sturmabteilung">Brownshirts</a> of our time not smashing the heads of their adversaries in the public square, but instead patrolling the forums of Reddit and the "I don't know who needs to hear this; Thread 🧵:"'s of Twitter, throwing flame in the form of an ironic tweet at whoever disagrees? Has physical violence not largely been replaced by <em>reputational</em> violence, i.e. our modern <a href="https://www.dictionary.com/e/pop-culture/cancel-culture/#:~:text=Cancel%20culture%20refers%20to%20the,the%20form%20of%20group%20shaming."><em>cancel culture</em></a>? For the <em>majority</em> of Americans, is the intensity of our political beliefs <em>really</em> commensurate with our corresponding actions in <a href="https://www.urbandictionary.com/define.php?term=meatspace">"meatspace"</a>? Or do we say these things to our like-minded friends, feel these things when we come home from work, and then, out for blood and looking to fight, we take to the internet for twenty minutes to throw a few punches and be punched right back, then ice cream, Netflix, and early to bed?</p>
<p>Protest and demonstration in the streets is not dead. However, in terms of dynamism, large-scale impact, and sheer number of combatants, the Internet truly is the preeminent battlefield of our time.</p>
<h2>The remaining space on the map</h2>
<p>As the dust settled over the world in <a href="https://history.state.gov/milestones/1989-1992/collapse-soviet-union#:~:text=On%20December%2025%2C%201991%2C%20the,the%20newly%20independent%20Russian%20state.">1991</a> following 77 brutal years of physical and psychological war over macro-ideology, the United States emerged on top, much like Mackinder's post-Columbian Europe following its own implicit victory—territorial and economic expansion unprecedented in history. Nevertheless, in our modern day, there exists a <em>new</em> space on the map of human interaction that the United States has <em>not</em> conquered, the space where our current war for micro-ideology is largely waged, and, as it was ultimately a fight for ideology that felled the previous global powers, a space of critical importance. This space, of course, is the Internet: the Mackinderian Digital Heartland.</p>
<h2>The practical nature of battles online</h2>
<p>Mackinder's dictum about the importance of the Central Asian Heartland stemmed from an impending technological revolution and its harmful potential: whereas the invaders used to cross the Steppe on camel and horse—slowly, in other words—they would soon be racing under starry desert skies by steam-powered train.</p>
<p>Presently, ideological battles online take place with text, video and audio authored manually by humans; in other words, slowly! It is humans that write tweets and think-pieces to which other humans respond. It is humans that start digital media campaigns, create YouTube content, stoke the fires that lead to real-life demonstration and violence, and even coordinate large-scale protests themselves.</p>
<p>So, what happens when a technology comes along that stands to turbo-charge the speed and chaos with which these battles take place? What happens when the text, video and audio that comprise these battles are created not by humans, but by <em>algorithms</em>—unaccountable, indefatigable, at times sloppy, and brutally fast? Finally, what happens when human combatants, who we currently "eliminate" from battle via <em>reputational</em> assassination, are replaced by algorithms that are immune to this deadliest weapon?</p>
<h2>Mackinder on artificial intelligence</h2>
<p>A 21st century Halford Mackinder would look at the deafening roll of artificial intelligence innovation in models that generate video, audio and language and be rightfully crapping his pants. Because simply, these models have the ability to do to our already-chaotic online debate not what trains, but 3D-printable intercontinental ballistic missiles, would have done for the Steppe Invaders of centuries past.</p>
<p>To make this claim, we don't have to go much past the recent advent of <a href="https://openai.com/blog/openai-api/">OpenAI's GPT-3</a>, neatly explained in the following video:</p>
<p><a href="http://www.youtube.com/watch?v=SboKeK6FFHQ" target="_blank" title="GPT-3 - explained in layman terms."><img alt="Click me!" class="img-responsive" src="http://img.youtube.com/vi/SboKeK6FFHQ/0.jpg"/></a></p>
<p>In essence, and borrowing from the video, "GPT-3 is an extremely sophisticated text predictor. A human gives it a chunk of text as input, and the model generates its best guess as to what the next chunk of text should be." In other words, and in terms of online battle, it reads what's written, and responds. However, crucially, whereas a human might reply to 1 or 2 posts over a 5-minute period, GPT-3 could create or reply to thousands.</p>
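<p>The core idea of "reading what's written and predicting what comes next" can be sketched in miniature. The following is a deliberately toy bigram model—a hypothetical illustration bearing no resemblance to GPT-3's actual architecture or scale—that simply counts which word tends to follow which in a training corpus, then predicts accordingly:</p>

```python
from collections import Counter, defaultdict

# A tiny "training corpus" -- real language models train on billions of words.
corpus = (
    "the boy put on his boots and went to the store . "
    "the boy put on his coat and went to the park ."
).split()

# Count, for each word, which words follow it and how often.
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent word seen after `word`, or None if unseen."""
    followers = transitions[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))    # "boy" -- seen twice, vs. "store" and "park" once each
print(predict_next("went"))   # "to"
print(predict_next("photosynthesis"))  # None -- never seen in the corpus
```

<p>GPT-3 replaces these raw word counts with a neural network of 175 billion parameters trained on a vast swath of internet text, which is what lets it continue whole passages rather than single words—but the framing as "next-text prediction" is the same.</p>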
<p>To demonstrate the ability of these models, <a href="https://www.theverge.com/2020/8/16/21371049/gpt3-hacker-news-ai-blog">college student Liam Porr recently generated</a> the following <a href="https://adolos.substack.com/p/feeling-unproductive-maybe-you-should">blog post</a> using GPT-3. In effect, he fed a title and prompt (about human productivity) to the model, asked it for a few different drafts, selected the best one, edited it minimally, then hit publish. The post itself quickly earned the top-spot on vaunted <a href="https://news.ycombinator.com/">Hacker News</a>, seemingly on the merits of its content alone. Take a look and see for yourself.</p>
<p>Finally, what happens when our online battles are fueled by such <em>high-quality content generated by models like GPT-3 at warp speed?</em> In a world whose news cycle already changes by the hour? In which social media is already such a dominant force?</p>
<p>Honestly, your guess, and Mackinder's guess, are as good as mine.</p>
<h2>How to control the Digital Heartland, and what's at stake</h2>
<p>Implicit in Mackinder's admonition is that control of the Central Asian Heartland means the same type of control with which European then-empires came to dominate the rest of the world: economic, political or outright military dominion (and to wit, this is in effect what Hitler then tried to do). Unfortunately, he neglects to offer more specific guidance—especially with respect to the impending explosion in railroad construction in which he places so much concern. With this in mind, how might a 21st century Mackinder encourage the United States to "control" the Digital Heartland? By crafting then backing the internet's dominant ideological current, a la victory in the Cold War? By creating a monopoly on the internet's infrastructure and data, i.e. controlling the servers on which online war is waged, and owning the data that it generates? By training, employing and overseeing the bulk of the researchers that build the artificial intelligence itself? Or perhaps by gaining ownership over the specific platforms, like Twitter and Reddit, on which combatants clash?</p>
<p>To me, the answer is not necessarily clear, and a deeper analysis of this question is the subject of a future post. Nonetheless, it is clear that the eventual conclusion of this war—the "1991" of our current age—has the real potential to spawn some sort of parallel, supranational organization or hegemony rooted more in a sense of identity and ideological affiliation than in physical territory, in which your figurative "digital passport" does not have the name of a sovereign physical territory embroidered on its front, but instead, a way of thought or belief picked from amongst the micro-ideological multiplex that is the modern internet.</p>
<h2>A closing warning</h2>
<p>As 1904 Continental Europe stood at, in his view, the exit of the Columbian era, Mackinder leaves us with one final warning about the practical reality of the coming world order. Whereas pre-Columbian Europe was "pent into a narrow region and threatened by external barbarism," and Columbian Europe was characterized by "the expansion of Europe against almost negligible resistances," a now post-Columbian Europe will take an altogether different hue:</p>
<blockquote>
<p>From the present time forth, in the post-Columbian age, we shall again have to deal with a closed political system, and none the less that it will be one of worldwide scope. Every explosion of social forces, instead of being dissipated in a surrounding circuit of unknown space and barbaric chaos, will be sharply re-echoed from the far side of the globe, and weak elements in the political and economic organism of the world will be shattered in consequence.</p>
</blockquote>
<p>To me, his second sentence is utterly jarring: if Mackinder worries thus about the consequences of an "explosion of social forces" in a "closed political system," what would he say about combat on our modern internet—in which social forces, quite literally, can reach and ignite the remotest corners of the world <em>minutes</em> after their birth? In which further explosions of these forces are, as the military-industrial complex is to traditional warfare, bolstered by the advertising industry, wherein wealthy and powerful executives have a direct incentive to throw kerosene on the flames? And how would he feel about the fact that, in a broad sense, this fiery battlefield is increasingly hard to leave—as we learn more, bank more, socialize more, and work more online?</p>
<h2>On fatalism</h2>
<p>To paraphrase Robert Kaplan in Chapter 9 of <a href="https://www.amazon.com/Revenge-Geography-Coming-Conflicts-Against/dp/0812982223"><em>The Revenge of Geography</em></a>, "though Mackinder was often attacked as an arch-determinist, a charge not wholly unfair, no doubt, owing to the fact that geography was his subject, and geography itself can by its very nature be deterministic, Mackinder was no mere fatalist: ultimately, he believed that geography and the environment could be overcome, but only if we treat those subjects with the greatest knowledge and respect."</p>
<p>In other words, a 21st century Mackinder might say that the technology that we currently build does not have to invite an invasion of the Huns. It does not have to poison social order. And finally, as artificial intelligence combatants are increasingly whisked off to wage internet war, he'd hope that we don't inadvertently burn it all down before we get much further.</p>
<h2>Credit</h2>
<p>Many thanks to Abishur Prakash for reviewing an earlier draft of this piece.</p>
<h2>Additional references</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>“Are We on the Road to Civilisation Collapse?” BBC Future, BBC, www.bbc.com/future/article/20190218-are-we-on-the-road-to-civilisation-collapse. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Chan, Amy. “Behind the Lines: When Empires Collapse.” HistoryNet, HistoryNet, 15 May 2017, www.historynet.com/behind-lines-empires-collapse.htm. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Holland, Tom. Why Empires Fall: from Ancient Rome to Putin's Russia, 23 May 2014, www.newstatesman.com/politics/2014/05/why-empires-fall-ancient-rome-putins-russia. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Mackinder, Halford J. “The Geographical Pivot of History.” The Geographical Journal, vol. 23, no. 4, 1904, p. 421., doi:10.2307/1775498. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Mihir Shukla, Automation Anywhere. “The Role of Automation in the Aging Workforce.” VentureBeat, VentureBeat, 19 Apr. 2018, venturebeat.com/2018/04/19/the-role-of-automation-in-the-aging-workforce/. <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>“Why Do Empires Decline?” Quora, www.quora.com/Why-do-empires-decline. <a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
</ol>
</div>Soft Power in the Age of Generative Models2020-03-12T08:00:00-04:002020-03-12T08:00:00-04:00Will Wolftag:willwolf.io,2020-03-12:/2020/03/12/soft-power-in-the-age-of-generative-models/<p>What happens when a post-Trump, reputationally-bruised United States, and improved generative models (the technology behind "deepfakes") collide head-on?</p><p>Nations exert influence in two key ways: hard power, and soft power. Hard power is <a href="https://www.history.com/topics/world-war-ii/bombing-of-hiroshima-and-nagasaki">dropping the atomic bomb</a> on Hiroshima and Nagasaki during World War II, compelling the Japanese to surrender. Conversely, soft power is building American libraries in foreign universities, placing American films in foreign theaters, and sending <a href="https://en.wikipedia.org/wiki/Peace_Corps">United States Peace Corps</a> volunteers to rural communities worldwide, compelling a favorable image of our country abroad.</p>
<p>In other words, hard power is to influence via economic and military violence; soft power is to influence via <em>attraction.</em></p>
<h2>Soft power through media</h2>
<p>A key subset of soft power is that delivered through media.</p>
<p>For example, during the Franco-Prussian War, the French dropped leaflets over enemy territory by <a href="https://en.wikipedia.org/wiki/Airborne_leaflet_propaganda">airborne balloon</a> touting the mutual benefits of ceasefire. During World War I, all parties dropped such leaflets by <a href="https://en.wikipedia.org/wiki/Airborne_leaflet_propaganda#First_World_War">airplane</a>. In World War II, the Nazis pioneered the use of <a href="https://en.wikipedia.org/wiki/Radio_propaganda#Nazi_Germany">radio</a> as a propaganda machine. Finally, in 1969, the Americans live-broadcast Neil Armstrong’s moon landing to <a href="https://www.youtube.com/watch?v=cwZb2mqId0A">television</a> screens worldwide, showcasing the "derring-do and genius of American ingenuity, and putting one over the Soviets at the same time."<sup id="fnref:7"><a class="footnote-ref" href="#fn:7">7</a></sup></p>
<p>Today, such media is more numerous and diverse: American shows on Netflix, American music on Spotify, American “influencers” on YouTube, Hollywood movies and more—all distributed globally.</p>
<h2>The cost of media</h2>
<p>The cost of media can be split into two broad camps: creation and distribution.</p>
<p>Since the days of leaflets by balloons, the marginal cost of distribution has become extremely cheap. Today, Trump tweets, and his words instantaneously appear on the timelines of his 75 million followers. Were he to have 150 million instead, the additional cost of reaching the additional 75 million would be, practically speaking, nothing.</p>
<p>In contrast, the creation of this media itself remains a wholly manual exercise. To create a news broadcast, humans write scripts, set the stage, select the costumes, apply makeup, and read the news itself. Furthermore, creating twice as many news broadcasts costs, crucially, roughly twice as much.</p>
<p>The dwindling marginal cost of its distribution makes soft power media efficient and appealing. However, it is the “creation bottleneck” that stands in the way of true scale.</p>
<h2>Generative models</h2>
<p>Generative models are those that empower <a href="https://en.wikipedia.org/wiki/Deepfake">“deepfakes.”</a> They are algorithms that generate realistic <a href="https://thispersondoesnotexist.com/">images</a> (refresh this link for more), <a href="https://medium.com/syncedreview/deepmind-dvd-gan-impressive-step-toward-realistic-video-synthesis-12027d942e53">video</a>, <a href="https://openai.com/blog/musenet/">audio</a>, <a href="https://talktotransformer.com/">text</a>, or other types of rich media. Presently, generative models excel in creating art, music, literature, news reports, and textiles. <a href="https://news.microsoft.com/apac/features/much-more-than-a-chatbot-chinas-xiaoice-mixes-ai-with-emotions-and-wins-over-millions-of-fans/">Microsoft’s Xiaoice</a>, a multi-purpose <a href="https://en.wikipedia.org/wiki/Chatbot">chatbot</a> (which currently dialogs with over 660 million registered users), pulls all this off and more:</p>
<blockquote>
<p>“Xiaoice’s framework is learning to write literature as well as compose and perform songs. Last year she published a book of poems and helps her followers write their own. She can sing her own songs in styles based on existing popular performers. There are plans to release an album of pop tunes soon. And she is able to author tailor-made stories for children and reads them out in voices suited to each of the characters she has created.<br/><br/>She’s painting images based on keywords and other inputs. She’s also gone into mainstream media as a host of dozens of TV and radio programs that are broadcast across China. She reads news stories and provides commentary. And, she is generating multiple reports based on information from China’s financial markets and used by investors and traders who subscribe to Wind, a major financial information service.”</p>
</blockquote>
<p><em>In short, generative models <strong>automate</strong> the creation of media itself—smashing the “creation bottleneck” outright.</em></p>
<h2>Coming soon</h2>
<p>At present, generative models succeed at tasks like: <a href="https://www.youtube.com/watch?v=PCBTZh41Ris">transferring dance moves</a> from professionals to amateurs, <a href="https://github.com/elliottwu/sText2Image">translating pencil sketches</a> to high-resolution images with the guidance of text, and making Obama lip sync the words of American actor Jordan Peele.</p>
<p><a href="http://www.youtube.com/watch?v=cQ54GDm1eL0" title="Click me!"><img alt="Click me!" class="img-responsive" src="http://img.youtube.com/vi/cQ54GDm1eL0/0.jpg"/></a></p>
<p>These feats are impressive. However, to achieve widespread impact, these algorithms must allow for more “control” over the media they create.</p>
<p>For example, consider a system that, given English text, instantaneously generates a video of this text being recited in an arbitrary language. To be truly effective as a soft power tool, this system must ultimately operate as follows:</p>
<p><strong>System</strong>: “What kind of video would you like?”</p>
<p><strong>User</strong>: “I’d like a video of the President reciting the uploaded text. I'd like him to speak slowly, be wearing green US Army fatigues, and be seated in front of a blue Boeing CH-47 Chinook. Make me two, actually: one in Farsi, the other in Greek.”</p>
<p><strong>System</strong>: “Coming right up.”</p>
<p><em>Such generative models are not quite here. However, given the speed at which this <a href="http://www.arxiv-sanity.com/search?q=video+generation">research</a> moves, they will reliably arrive within 5 years.</em></p>
<h2>The United States used to invest in soft power institutions</h2>
<p>Historically, the United States invested substantially in soft power institutions.</p>
<p>During World War I, Woodrow Wilson established the <a href="https://en.wikipedia.org/wiki/Committee_on_Public_Information">Committee on Public Information</a>, which ultimately trafficked in outright propaganda.</p>
<p>During World War II, Roosevelt established the <a href="https://en.wikipedia.org/wiki/Public_diplomacy_of_the_United_States">Division of Cultural Relations</a> and Office of Inter-American Affairs in response to fascist propaganda in Latin America. Following the bombing of Pearl Harbor, the <a href="https://en.wikipedia.org/wiki/Voice_of_America">Voice of America</a>—a globally-distributed radio broadcast of non-military promotional content—the <a href="https://en.wikipedia.org/wiki/United_States_Office_of_War_Information">Office of War Information</a>—distributing media like “newspapers, posters, photographs, films”<sup id="fnref:8"><a class="footnote-ref" href="#fn:8">8</a></sup> to civilian communities abroad—and the <a href="https://en.wikipedia.org/wiki/Office_of_Strategic_Services">Office of Strategic Services</a>—the predecessor to the CIA—were all established as well.</p>
<p>During the Cold War, the United States flexed its “peacetime” soft power muscle, establishing myriad organizations to promote international cultural and educational exchanges, including the <a href="https://en.wikipedia.org/wiki/Fulbright_Program">Fulbright Program</a> in 1947 and the <a href="https://en.wikipedia.org/wiki/Public_diplomacy_of_the_United_States#U.S._Information_Agency_(USIA)">United States Information Agency</a> (USIA) in 1953.</p>
<h2>Dwindling interest and opportunity missed</h2>
<p>Following the Cold War, however, American soft power institutions began to slowly decline.</p>
<p>In 1999, the USIA was absorbed into the US State Department; its staff, and budget for key projects, were cut roughly in half.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6">6</a></sup> “From 1995 to 2001, academic and cultural exchanges dropped from 45,000 to 29,000 annually, and many accessible cultural centers and libraries were closed.”<sup id="fnref2:6"><a class="footnote-ref" href="#fn:6">6</a></sup> Finally, “while government-funded radio broadcasts reached between 70 and 80 percent of the populace of Eastern Europe during the Cold War, around year 2000, a mere 2 percent of Arabs heard the VOA.”<sup id="fnref3:6"><a class="footnote-ref" href="#fn:6">6</a></sup></p>
<p>At present, the United States has a largely <a href="https://www.nationalreview.com/2017/01/trump-foreign-policy-isolationsim-america-first-allies-nato-trans-pacific-partnership/">isolationist president</a> in Donald Trump. Since the beginning of his term, American soft power has fallen globally according to the Portland <a href="https://softpower30.com/">Soft Power 30</a>. Trump himself is unconcerned with others actively <a href="https://www.rollcall.com/2018/09/25/world-leaders-laugh-as-trump-boasts-of-accomplishments/">laughing</a> at him and his country.</p>
<p>By 2018, Statista estimates that there were 1.67 billion "pay TV households" worldwide.<sup id="fnref:9"><a class="footnote-ref" href="#fn:9">9</a></sup> By 2019, they estimate that half of private households had a computer.<sup id="fnref:10"><a class="footnote-ref" href="#fn:10">10</a></sup> Finally, Pew Research estimates that "more than 5 billion people have mobile devices, and over half of these connections are smartphones."<sup id="fnref:11"><a class="footnote-ref" href="#fn:11">11</a></sup></p>
<p><em>In the <a href="https://en.wikipedia.org/wiki/Information_Age">Information Age</a>, a dwindling interest in soft power institutions is a <strong>major</strong> opportunity missed.</em></p>
<p>In a recent article, Joseph Nye, the Harvard professor who coined the term “soft power” itself, neatly summarizes this phenomenon:</p>
<blockquote>
<p>“U.S. President Donald Trump’s administration has shown little interest in public diplomacy. And yet public diplomacy—a government’s efforts to communicate directly with other countries’ publics—is one of the key instruments policymakers use to generate soft power, and the current information revolution makes such instruments more important than ever.”<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup></p>
</blockquote>
<h2>A renewed interest in soft power</h2>
<p>Truly impactful generative models, and Trump’s exit from office, will soon collide head-on.</p>
<p>At this point, the United States—with its blemished reputation, exorbitant military budget<sup id="fnref:12"><a class="footnote-ref" href="#fn:12">12</a></sup>, longstanding technical leadership<sup id="fnref:13"><a class="footnote-ref" href="#fn:13">13</a></sup> and forever proactive drive—will be sitting on transformative media tech and a clear-cut mission: rebuild its image at home and abroad.</p>
<p><em>To this effect, I predict that the United States will soon make a concerted investment in generative models for soft power at true scale.</em></p>
<p>Here’s what this might look like.</p>
<h2>Soft power through media, in the age of generative models</h2>
<p>The two key areas that generative models will impact are state-to-public and state-to-individual media.</p>
<h3>State-to-public media</h3>
<p>Current examples of state-to-public soft power media, like Presidential addresses to the American people, English-language CNN news broadcasts to foreign viewers, and front-line conflict reporting will see their use both quickened and expanded by generative models.</p>
<p>Presently, Presidential addresses happen once every few weeks or months. Conversely, Donald Trump writes on <a href="https://twitter.com/realDonaldTrump">Twitter</a> hourly. With generative models, the next President will <em>instantaneously</em> generate videos of herself reciting the tweet instead. Americans at large devour the President's tweets; if a generated video legitimately looks and sounds just like her, they’re likely to devour it as well.</p>
<p>Sponsorship deals, like <a href="https://www.espn.com/nba/story/_/id/14314807/lebron-james-signs-life-deal-nike">LeBron’s lifetime deal</a> with Nike, will crucially expand to include the right to generate content with the athlete’s likeness. Politicians will travel less to campaign; instead, they’ll send generated holograms—of themselves, their spouses, their key supporters—each one personalized to the venue in question.</p>
<p>Finally, a generative model will be trained to ingest footage of world events—for example, the <a href="https://www.cnn.com/2019/09/14/middleeast/yemen-houthi-rebels-drone-attacks-saudi-aramco-intl/index.html">attack</a> on the Saudi oil refinery—and produce text describing what happened. This text will then be used to generate a news report. Humans won’t be on the front line to capture this footage either; why not send a drone instead?</p>
<h3>State-to-individual media</h3>
<p>Microsoft’s Xiaoice demonstrates the enormous traction of personalized chatbots. From here, the jump to state-built, propaganda-(subtly)-infused chatbots is small.</p>
<p>In general, the public distrusts governments; to this effect, humanitarian missions are typically run by NGOs. As such, states will deploy these chatbots on behalf of related entities: tourism boards, online universities, cultural centers, etc.</p>
<p>Finally, current state-to-individual media is largely <em>unidirectional</em>: media is sent from the former to the latter, and rarely vice versa. Chatbots powered by generative models, however, will solicit <em>feedback</em> from their viewer in a personalized way; for example, “Are you enjoying this media? What do you like about American schools as compared to your own? What do you think about democracy?”</p>
<p>In the same way that the Chinese government <em>passively</em> learns about its citizens by <a href="http://harvardpolitics.com/world/wechat-the-people-technology-and-social-control-in-china/">monitoring WeChat data</a>, generative models will <em>actively solicit information</em> re the efficacy of their soft power plays. Then, like a savvy marketer, they’ll double down on the stuff that works best.</p>
<h2>Conclusion</h2>
<p>For over a century, states have exercised soft power through media. Throughout, the marginal cost of distributing this media has trended towards zero. Conversely, to date, the marginal cost of its creation remains high.</p>
<p>Generative models are a powerful technology that trends the latter towards zero as well. They're here and they work. However, to be truly impactful, there's still a short way to go.</p>
<p>In 5 years' time, a dazed, reputationally-bruised United States, and robust, flexible, practical generative models, will collide head-on.</p>
<p><em>At this point, soft power media—and human politics in general—are likely to change forever.</em></p>
<h2>Credit</h2>
<p>Many thanks to Abishur Prakash for reviewing earlier drafts of this piece.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>Brands, Hal. “Not Even Trump Can Obliterate America's Soft Power.” Bloomberg.com, Bloomberg, www.bloomberg.com/opinion/articles/2018-01-18/not-even-trump-can-obliterate-america-s-soft-power. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Chakravarti, Sudeshna. Soft Power: The Culture Weapon in The Cold War and South Asia. www.culturaldiplomacy.org/academy/content/pdf/participant-papers/academy/Sudeshna-Khasnobis-Soft-Power-The-Culture-Weapon-in-The-Cold-War-and-South-Asia.pdf. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Khandelwal, Aakash. “Economics of Digital Goods.” LinkedIn SlideShare, 23 Nov. 2016, www.slideshare.net/aakashkhandelwal921/economics-of-digital-goods. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Nye, Joseph S. “No, President Trump: You've Weakened America's Soft Power.” The New York Times, The New York Times, 25 Feb. 2020, www.nytimes.com/2020/02/25/opinion/trump-soft-power.html. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Shah, Ritula. “Is US Monopoly on the Use of Soft Power at an End?” BBC News, BBC, 19 Nov. 2014, www.bbc.com/news/world-29536648. <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Nye, Joseph S. Soft Power: the Means to Success in World Politics. Knowledge World, 2012. <a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text">↩</a><a class="footnote-backref" href="#fnref2:6" title="Jump back to footnote 6 in the text">↩</a><a class="footnote-backref" href="#fnref3:6" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>Harris, Paul. “Man on the Moon: Moment of Greatness That Defined the American Century.” The Guardian, Guardian News and Media, 25 Aug. 2012, www.theguardian.com/science/2012/aug/25/man-moon-american-century. <a class="footnote-backref" href="#fnref:7" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>“United States Office of War Information.” Wikipedia, Wikimedia Foundation, 7 Jan. 2020, en.wikipedia.org/wiki/United_States_Office_of_War_Information. <a class="footnote-backref" href="#fnref:8" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:9">
<p>Watson, Amy. “TV Households Worldwide.” Statista, 4 Dec. 2019, www.statista.com/statistics/268695/number-of-tv-households-worldwide/. <a class="footnote-backref" href="#fnref:9" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:10">
<p>Statista Research Department. “How Many People Have Access to a Computer 2018.” Statista, 2 Mar. 2020, www.statista.com/statistics/748551/worldwide-households-with-computer/. <a class="footnote-backref" href="#fnref:10" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:11">
<p>Silver, Laura. “Smartphone Ownership Is Growing Rapidly Around the World, but Not Always Equally.” Pew Research Center's Global Attitudes Project, Pew Research Center, 30 Dec. 2019, www.pewresearch.org/global/2019/02/05/smartphone-ownership-is-growing-rapidly-around-the-world-but-not-always-equally/. <a class="footnote-backref" href="#fnref:11" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:12">
<p>Cancian, Mark F. “U.S. Military Forces in FY 2020: The Strategic and Budget Context.” U.S. Military Forces in FY 2020: The Strategic and Budget Context | Center for Strategic and International Studies, 12 Mar. 2020, www.csis.org/analysis/us-military-forces-fy-2020-strategic-and-budget-context?gclid=CjwKCAjwgbLzBRBsEiwAXVIygLkZlNLc3zV8EVVhPPwjOoqrUoQ4kjTeaTHN5vFktRWET2wsnHCwRhoCm2QQAvD_BwE. <a class="footnote-backref" href="#fnref:12" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:13">
<p>“Report Shows United States Leads in Science and Technology as China Rapidly Advances.” ScienceDaily, ScienceDaily, 24 Jan. 2018, www.sciencedaily.com/releases/2018/01/180124113951.htm. <a class="footnote-backref" href="#fnref:13" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
</ol>
</div>On Saudi Drone Strikes and Adversarial AI2019-09-22T08:30:00-04:002019-09-22T08:30:00-04:00Will Wolftag:willwolf.io,2019-09-22:/2019/09/22/saudi-drone-strikes-and-adversarial-ai/<p>In a world of weaponized drones piloted by algorithms, what new strategic opportunities arise?</p><p>Last week, Houthi rebels—a militant political group in Yemen—claimed credit for exploding an oil refinery in Saudi Arabia. The explosion was the work of a missile, fired from a drone.</p>
<p>Though the group responsible is closely allied with Iran—and to wit, the American government does indeed hold Iran responsible for facilitating the strike itself—it is by all accounts a non-state actor, notable for the following reasons:</p>
<ol>
<li>As the cost of full-scale war increases due to, at a trivial minimum, the strength of weapons available, small, “surgical” operations become more common. This attack falls in directly with this trend.</li>
<li>You do not have to be a state to desire to affect world politics. As weapons become more potent, and the “density” of potential targets—population density, in the case of traditional military offensives, or density of nodes in an information-sharing network, in the case of cyber offensives, as two examples—increases, the ability to effect outsized change increases as well. As such, incentives for, and frequency of, strategic action from non-state actors will likely continue to rise—this being yet another example.</li>
<li>The technologies required to effectuate the attack in question are increasingly easy to obtain.</li>
</ol>
<p>In sum, last week’s strike is yet another example of an isolated punch by a non-state actor with technologies (drones, at a minimum) ultimately available in the public domain. Moving forward, it seems safe to expect more of the same.</p>
<h2>Enter machine learning</h2>
<p>Throughout history, much of war has been about destroying the enemy’s stuff. There are many ways to do this—each involving a distinct set of tradeoffs.</p>
<p>To this effect, drones are particularly interesting: they allow for precise, close-proximity strikes, without risking the lives of the aggressors. Presently, the majority of such drones are piloted remotely by humans; moving forward, they—in conceptual step with, say, self-driving cars—will pilot themselves, allowing for larger and larger deployments.</p>
<p>To do this, engineers will equip drones with cameras, and machine learning algorithms that make the drone’s operational and tactical decisions conditional on what these cameras show. As such, the defensive objective is still, as per usual, to thwart the aggressor; <strong>however, the aggressor is now the machine learning algorithms controlling the aggressor’s tech.</strong></p>
<p>For drones, what do these algorithms look like? How can they be thwarted? What risks and opportunities do they imply for the defense of critical infrastructure?</p>
<p><strong>While these questions might seem futuristic, that future is indeed approaching fast.</strong></p>
<h2>The supervised classification model</h2>
<p>To begin the discussion, let’s look at the basic paradigm by which self-driving cars operate: the supervised classification model, which works as follows:</p>
<ol>
<li>Engineers collect a large amount of “labeled data” in the form of <code>(cameras_images, what_the_driver_did)</code> pairs. This data is considered “labeled,” as each input, <code>cameras_images</code>, is coupled with an output, <code>what_the_driver_did</code>. An example of the former might be 360-degree images of a curving, barren road in Arizona; the latter, i.e. what the driver did at this moment in time, might be: depress the gas pedal by five degrees, and rotate the steering wheel counter-clockwise by seven.</li>
<li>Engineers then “teach” a computer to predict, given photographic input, the correct driver action.</li>
</ol>
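<p>A minimal sketch of this paradigm, with made-up feature triples standing in for real camera images and a nearest-neighbor rule standing in for a real model (all numbers and feature names here are illustrative assumptions):</p>

```python
import math

# Each labeled pair couples a crude "image summary" (road curvature,
# obstacle distance in meters, speed in mph) with what the human driver
# did at that moment (gas-pedal delta in degrees, wheel delta in degrees).
labeled_data = [
    ((0.8, 50.0, 30.0), (5.0, -7.0)),   # curving open road: gas +5, wheel ccw 7
    ((0.0, 5.0, 40.0), (-10.0, 0.0)),   # obstacle dead ahead: brake hard
    ((0.1, 80.0, 25.0), (8.0, 0.0)),    # straight and clear: accelerate
]

def predict_action(camera_features):
    """Predict the driver action from the most similar labeled example."""
    _, action = min(labeled_data,
                    key=lambda pair: math.dist(pair[0], camera_features))
    return action
```

<p>A real system would swap the nearest-neighbor rule for a deep network and the three-number summaries for raw pixels, but the labeled-pairs structure is identical.</p>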
<p>Roughly speaking, a self-piloting drone would be no different: it’d require equivalent labeled data from human-controlled flights, on which a supervised classification model would be trained.</p>
<h2>A likely progression of offensive drone tech</h2>
<p>Weaponized drone deployment will likely progress as follows:</p>
<ol>
<li>Deploy drones that pilot themselves; however, “finish” decisions are still executed by humans.</li>
<li>Deploy drones that pilot themselves; “finish” decisions are executed by the drone—when it is extremely confident in its choice.</li>
<li>Deploy swarms of drones that work together to more swiftly and efficiently achieve the aforementioned—attacking, exchanging data, self-sacrificing optimally, etc. throughout their attack.</li>
</ol>
<p><strong>In a brave new world of machine-learned aggressors, what new opportunities for defense arise?</strong></p>
<h2>Adversarial examples in machine learning</h2>
<p>Adversarial examples for supervised image classification models are inputs, i.e. images, whose pixels have been perturbed in a way imperceptible to a human eye, yet cause the classifier to change its prediction entirely.</p>
<p>An example<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> looks as follows:</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/images/saudi-adversarial/panda-adversarial-example.png"/></p>
<p>Initially, the classifier is 57.7% confident that the image contains a panda; after adding noise, resulting in the image on the right—still, inarguably, a panda, to the human eye—the classifier has now changed its prediction to "gibbon" with 99.3% confidence.</p>
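<p>The mechanics can be sketched on a toy linear classifier, in the spirit of the fast-gradient-sign method from the paper cited above; the weights and inputs here are invented for illustration, and a real attack does the same thing to pixels in far higher dimension:</p>

```python
def score(w, x):
    """Toy linear classifier: positive score reads 'panda', negative 'gibbon'."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fgsm(w, x, eps):
    """Nudge every input coordinate by eps against the gradient of the score.

    For a linear model, d(score)/dx_i = w_i, so the gradient's sign is just
    the sign of each weight.
    """
    sign = lambda v: (v > 0) - (v < 0)
    return [xi - eps * sign(wi) for xi, wi in zip(x, w)]

w = [0.5, -0.3, 0.8]
x = [1.0, 1.0, 1.0]          # classified 'panda': score(w, x) = 1.0 > 0
x_adv = fgsm(w, x, eps=0.7)  # no coordinate moves by more than 0.7
```

<p>Though no coordinate changes by more than 0.7, every perturbation aligns against the weights, so the score flips sign and the predicted class changes.</p>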
<p><strong>As such, knowledge of the data on which an aggressive drone was trained gives a unique opportunity to algorithmically design adversarial examples, like the above, that confuse the classifier aboard.</strong></p>
<p>Concrete opportunities would include:</p>
<ul>
<li><strong>At “find”</strong>: defensive lasers pointed at critical infrastructure dynamically perturb, by just a few pixels, its appearance, tricking the drone into thinking it has not found what it actually has.</li>
<li><strong>At “fix”</strong>: these lasers dynamically concoct sequences of perturbations, tricking the drone into thinking its target is moving erratically when it’s actually still.</li>
<li><strong>At “finish”</strong>: adversarial examples make the drone believe its strike would, put simply, cause more damage than it intends.</li>
</ul>
<p>As adversarial examples are, once more, built from real examples and perturbed minimally, in a way imperceptible to the human eye, they are in fact difficult to teach our classifier to ignore, even a priori. Said differently, well-designed adversarial examples are extremely effective.</p>
<h2>Out-of-distribution data</h2>
<p>Machine learning algorithms are notoriously bad at “knowing what they don’t know.” Said differently, these algorithms only learn to make decisions about the type of data on which they’re trained. Were a classification model piloting a drone to be presented with an image of an ice-cream cone, it would, barring careful design, attempt to make a decision about this data all the same.</p>
<p>In a world of autonomous fly-and-finish drones, one would hope that their finish decisions are taken with extreme care. Fundamentally, this dovetails quickly into the notion of “out-of-distribution” data, i.e. data that the classifier knows it has not seen before, and about which it therefore declines to make a prediction.</p>
<p><strong>As such, insight into the data on which an enemy’s system was trained naturally implies “defense by what’s different”</strong>: show the drone images you know that it hasn’t seen before, and thereby increase its uncertainty around the decision at hand—buying time, and keeping your stuff intact.</p>
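<p>One simple (and by no means the only) way to operationalize “knowing what you don’t know” is to have the classifier abstain whenever its top softmax probability falls below a threshold; the scores and threshold below are toy assumptions:</p>

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decide(scores, threshold=0.9):
    """Return the predicted class index, or 'abstain' when too uncertain."""
    probs = softmax(scores)
    top = max(probs)
    return probs.index(top) if top >= threshold else "abstain"

familiar_target = [6.0, 0.5, 0.2]      # peaked scores: the drone acts
never_seen_before = [1.1, 1.0, 0.9]    # flat scores: the drone holds fire
```

<p>Thresholded softmax confidence is a crude proxy for true out-of-distribution detection, but it captures the defensive logic: flat, uncertain predictions buy the defender time.</p>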
<h2>Learning optimal defense via reinforcement learning</h2>
<p><a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement learning</a>, though exceedingly powerful, and often overhyped, is a relatively simple idea: given an environment and its state, try an action, observe a reward, and repeat the actions that give high rewards; additionally, periodically explore new actions you’ve never tried before just to see how it goes.</p>
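<p>That loop, stripped to its core, is the epsilon-greedy bandit; below is a minimal sketch in which an invented reward table stands in for the environment (the defensive action names and payoffs are purely hypothetical):</p>

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical mean payoffs of three defensive actions.
mean_payoff = {"jam_sensors": 0.8, "spoof_target": 0.5, "do_nothing": 0.1}

def pull(action):
    """Environment: a noisy reward centered on the action's mean payoff."""
    return mean_payoff[action] + random.gauss(0, 0.05)

estimate = {a: 0.0 for a in mean_payoff}
count = {a: 0 for a in mean_payoff}
for _ in range(500):
    if random.random() < 0.1:                  # explore 10% of the time
        action = random.choice(list(mean_payoff))
    else:                                      # otherwise exploit the best estimate
        action = max(estimate, key=estimate.get)
    reward = pull(action)
    count[action] += 1
    estimate[action] += (reward - estimate[action]) / count[action]  # running mean

best = max(estimate, key=estimate.get)  # the action the defender converges on
```

<p>Full RL adds state and sequential decisions on top of this, but "repeat what earns reward, occasionally explore" is the engine underneath.</p>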
<p>Reinforcement learning, or RL, requires vast amounts of data. As such, as a point of “meta defensive strategy”, trying out different types of adversarial-machine-learning attacks against a drone, seeing which work best, then repeating the best performers, while plausible in theory, would not work in practice if one drone approaches your one oil refinery but once a year.</p>
<p>This said, swarms of drones might change the game; <em>what if, in 20 years, Houthi militants deployed one autonomous, armed drone for every square mile of Saudi territory?</em> And what if we could simulate this scenario a priori, and learn how to optimally defend against the swarm?</p>
<p>To do this, you'd require:</p>
<ul>
<li><strong>A heavy investment in sensors</strong>, on all of: the infrastructure you’re trying to protect, atmospheric conditions in which the drones are flying, tools to monitor the drones’ speeds and movements, etc. In other words, any and all technologies that capture the state of the environment, and the rewards, to as granular a level of detail as possible.</li>
<li><strong>Simulation environments.</strong> In certain RL problems, this one potentially included, one has the delicious ability to generate data from which the algorithm can learn—by letting it play games with itself. <a href="https://deepmind.com/blog/article/alphago-zero-starting-scratch">AlphaGo Zero</a> is one famous such example. Learning to optimally defend against a swarm of drones might fit the bill as well: deploy thousands of your own into a broad swath of desert, instruct them to “capture your flag”, then let your defensive systems get to work: taking actions, observing the subsequent reward—“Did I divert the drone away from my flag?”; “Did the drone hesitate more than usual before acting?”; etc.—and repeat those actions that work best.</li>
</ul>
<h2>Summary</h2>
<p>As the years roll forward, machine learning algorithms will continue to direct more and more of our most critical systems—financial markets, healthcare, transportation, and alas—the offensive technologies of political groups.</p>
<p>To wit, an understanding of the workings of these algorithms sheds light on the new risks these systems will introduce, and the new strategic opportunities that will thereafter arise.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>Goodfellow, I., Shlens, J., Szegedy, C. (2014). Explaining and Harnessing Adversarial Examples https://arxiv.org/abs/1412.6572 <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>(Thumbnail) “Adversarial AI: As New Attack Vector Opens, Researchers Aim to Defend Against It.” DataProtectionCenter.com - Tech and Security, 17 Apr. 2018, dataprotectioncenter.com/malware/adversarial-ai-as-new-attack-vector-opens-researchers-aim-to-defend-against-it/. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Artificial Intelligence and Geopolitics2019-09-21T21:00:00-04:002019-09-21T21:00:00-04:00Will Wolftag:willwolf.io,2019-09-21:/2019/09/21/ai-and-geopolitics/<p>I'm beginning to write about the intersection of artificial intelligence and geopolitics.</p><p>Machine learning and artificial intelligence have long been passions; to date, I've spent many years—my entire professional career included—developing relevant expertise. This has been the historical focus of this blog.</p>
<p>Recently, I've been reading heavily in other areas as well: history, political theory, strategy, economics and geography; geopolitics, in sum.</p>
<p>The geopolitics of yesterday—the subject of my reading—has its experts and central themes: collective security, nuclear armament, and oil, to name a few. Conversely, the geopolitics of tomorrow will take form from entirely new ideas; chief among them: artificial intelligence.</p>
<p>As such, I plan to merge the two: to suffuse, through writing, my exploration of the former with my expertise in the latter. <strong>Specifically, I'll connect deep, technical knowledge of AI algorithms, as well as the likely implications of their widespread use, to the risks—and opportunities—that AI will bring to the geopolitics of tomorrow.</strong></p>
<p>My writing will be objective, i.e. devoid of overt political bias. My intended audience is non-technical, i.e. geopolitics-first. Your comments—especially with respect to my evolving understanding of geopolitics itself—are encouraged as always.</p>
<p>Will</p>Deriving Mean-Field Variational Bayes2018-11-23T10:00:00-05:002018-11-23T10:00:00-05:00Will Wolftag:willwolf.io,2018-11-23:/2018/11/23/mean-field-variational-bayes/<p>A detailed derivation of Mean-Field Variational Bayes, its connection to Expectation-Maximization, and its implicit motivation for the "black-box variational inference" methods born in recent years.</p><p>"Mean-Field Variational Bayes" (MFVB), is similar to <a href="https://willwolf.io/2018/11/11/em-for-lda/">expectation-maximization</a> (EM) yet distinct in two key ways:</p>
<ol>
<li>We do not minimize <span class="math">\(\text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\)</span>, i.e. perform the E-step, as [in the problems in which we employ mean-field] the posterior distribution <span class="math">\(p(\mathbf{Z}\vert\mathbf{X}, \theta)\)</span> "is too complex to work with,"™ i.e. it has no analytical form.</li>
<li>Our variational distribution <span class="math">\(q(\mathbf{Z})\)</span> is a <em>factorized distribution</em>, i.e.</li>
</ol>
<div class="math">$$
q(\mathbf{Z}) = \prod\limits_i^{M} q_i(\mathbf{Z}_i)
$$</div>
<p>for all latent variables <span class="math">\(\mathbf{Z}_i\)</span>.</p>
<p>Briefly, factorized distributions are cheap to compute: if each <span class="math">\(q_i(\mathbf{Z}_i)\)</span> is Gaussian, <span class="math">\(q(\mathbf{Z})\)</span> requires optimization of <span class="math">\(2M\)</span> parameters (a mean and a variance for each factor); conversely, a non-factorized <span class="math">\(q(\mathbf{Z}) = \text{Normal}(\mu, \Sigma)\)</span> would require optimization of <span class="math">\(M\)</span> parameters for the mean and <span class="math">\(\frac{M^2 + M}{2}\)</span> parameters for the covariance. Following intuition, this gain in computational efficiency comes at the cost of decreased accuracy in approximating the true posterior over latent variables.</p>
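<p>As a quick sanity check of that comparison (the choice of <span class="math">\(M = 100\)</span> below is arbitrary):</p>

```python
def factorized_params(m):
    """A mean and a variance for each of M Gaussian factors."""
    return 2 * m

def full_covariance_params(m):
    """M means plus the (M**2 + M) / 2 free entries of a symmetric covariance."""
    return m + (m * m + m) // 2

M = 100  # e.g. 100 latent variables
cheap = factorized_params(M)       # 2M parameters
full = full_covariance_params(M)   # M + (M**2 + M) / 2 parameters
```

<p>The gap grows quadratically in the number of latent variables, which is precisely why the factorized family is so computationally attractive.</p>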
<h2>So, what is it?</h2>
<p>Mean-field Variational Bayes is an iterative maximization of the ELBO. More precisely, it is an iterative M-step with respect to the variational factors <span class="math">\(q_i(\mathbf{Z}_i)\)</span>.</p>
<p>In the simplest case, we posit a variational factor over every latent variable, <em>as well as every parameter</em>. In other words, as compared to the log-marginal decomposition in EM, <span class="math">\(\theta\)</span> is absorbed into <span class="math">\(\mathbf{Z}\)</span>.</p>
<div class="math">$$
\log{p(\mathbf{X}\vert\theta)} = \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg] + \text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\quad \text{(EM)}
$$</div>
<p>becomes</p>
<div class="math">$$
\log{p(\mathbf{X})} = \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}\bigg] + \text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X})\big)\quad \text{(MFVB)}
$$</div>
<p>From there, we simply maximize the ELBO, i.e. <span class="math">\(\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}\bigg]\)</span>, by <em>iteratively maximizing with respect to each variational factor <span class="math">\(q_i(\mathbf{Z}_i)\)</span></em> in turn.</p>
<h2>What's this do?</h2>
<p>Curiously, we note that <span class="math">\(\log{p(\mathbf{X})}\)</span> is a <em>fixed quantity</em> with respect to <span class="math">\(q(\mathbf{Z})\)</span>: updating our variational factors <em>will not change</em> the marginal log-likelihood of our data.</p>
<p>This said, we note that the ELBO and <span class="math">\(\text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X})\big)\)</span> trade off linearly: when one goes up by <span class="math">\(\Delta\)</span>, the other goes down by <span class="math">\(\Delta\)</span>.</p>
<p>As such, (iteratively) maximizing the ELBO in MFVB is akin to minimizing the divergence between the true posterior over the latent variables given data and our factorized variational approximation thereof.</p>
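<p>A minimal sketch of this iteration, on the textbook toy case (Bishop, §10.1.2) of mean-field-approximating a bivariate Gaussian: each Gaussian factor’s optimal mean depends on the other factor’s current mean, so we cycle the two updates until they stop moving. All numbers below are arbitrary choices:</p>

```python
# Mean-field approximation of a bivariate Gaussian p(z1, z2) with mean
# (mu1, mu2) and precision matrix [[L11, L12], [L12, L22]]. The optimal
# factor updates (Bishop, 10.1.2) are:
#   m1 <- mu1 - (L12 / L11) * (m2 - mu2)
#   m2 <- mu2 - (L12 / L22) * (m1 - mu1)

mu1, mu2 = 1.0, -1.0
L11, L12, L22 = 2.0, 0.5, 2.0

m1, m2 = 0.0, 0.0  # arbitrary initialization of the variational means
for _ in range(50):
    m1 = mu1 - (L12 / L11) * (m2 - mu2)  # maximize the ELBO w.r.t. q_1
    m2 = mu2 - (L12 / L22) * (m1 - mu1)  # maximize the ELBO w.r.t. q_2
```

<p>For this model the factorized means recover the true marginal means, though the factorized variances famously underestimate the true marginal variances—a concrete instance of the accuracy we trade away for cheap computation.</p>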
<h2>Derivation</h2>
<p>So, what do these updates look like?</p>
<p>First, let's break the ELBO into its two main components:</p>
<div class="math">$$
\begin{align*}
\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}\bigg]
&= \int{q(\mathbf{Z})\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}}d\mathbf{Z}\\
&= \int{q(\mathbf{Z})\log{p(\mathbf{X, Z})}}d\mathbf{Z} - \int{q(\mathbf{Z})\log{q(\mathbf{Z})}}d\mathbf{Z}\\
&= A + B
\end{align*}
$$</div>
<p>Next, rewrite this expression in a way that isolates a single variational factor <span class="math">\(q_j(\mathbf{Z}_j)\)</span>, i.e. the factor with respect to which we'd like to maximize the ELBO in a given iteration.</p>
<h2>Expanding the first term</h2>
<div class="math">$$
\begin{align*}
A
&= \int{q(\mathbf{Z})\log{p(\mathbf{X, Z})}d\mathbf{Z}}\\
&= \int{\prod\limits_{i}q_i(\mathbf{Z}_i)\log{p(\mathbf{X, Z})}\,d\mathbf{Z}}\\
&= \int{q_j(\mathbf{Z}_j)\bigg[\int{\prod\limits_{i \neq j}q_i(\mathbf{Z}_{i})\log{p(\mathbf{X, Z})}\prod\limits_{i \neq j}d\mathbf{Z}_i}\bigg]}d\mathbf{Z}_j\\
&= \int{q_j(\mathbf{Z}_j){ \mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}] }d\mathbf{Z}_j}\\
\end{align*}
$$</div>
<p>Following Bishop<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>'s derivation, we've introduced the notation:</p>
<div class="math">$$
\int{\prod\limits_{i \neq j}q_i(\mathbf{Z}_{i})\log{p(\mathbf{X, Z})}\prod\limits_{i \neq j}d\mathbf{Z}_i} = \mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]
$$</div>
<p>A few things to note, and in case this looks strange:</p>
<ul>
<li>Were the left-hand side to read <span class="math">\(\int{q(\mathbf{Z})\log{p(\mathbf{X, Z})}}d\mathbf{Z}\)</span>, this would look like the perfectly vanilla expectation <span class="math">\(\mathop{\mathbb{E}}_{q(\mathbf{Z})}[\log{p(\mathbf{X, Z})}]\)</span>.</li>
<li>An expectation maps a function <span class="math">\(f\)</span>, e.g. <span class="math">\(\log{p(\mathbf{X, Z})}\)</span>, to a single real number. As our expression reads <span class="math">\(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\)</span> as opposed to <span class="math">\(\mathop{\mathbb{E}}_{q(\mathbf{Z})}[\log{p(\mathbf{X, Z})}]\)</span>, we're conspicuously unable to integrate over the remaining factor <span class="math">\(q_j(\mathbf{Z}_j)\)</span>.</li>
<li><strong>As such, <span class="math">\(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\)</span> gives a function of the value of <span class="math">\(\mathbf{Z}_j\)</span></strong> which itself maps to the aforementioned real number.</li>
</ul>
<p>To further illustrate, let's employ some toy Python code:</p>
<div class="highlight"><pre><span></span><span class="c1"># Suppose `Z = [Z_0, Z_1, Z_2]`, with corresponding (discrete) variational distributions `q_0`, `q_1`, `q_2`</span>
<span class="n">q_0</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">.</span><span class="mi">2</span><span class="p">),</span> <span class="c1"># q_0(1) = .2</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="o">.</span><span class="mi">3</span><span class="p">),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="o">.</span><span class="mi">5</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">q_1</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="o">.</span><span class="mi">3</span><span class="p">),</span>
<span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="o">.</span><span class="mi">3</span><span class="p">),</span>
<span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="o">.</span><span class="mi">4</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">q_2</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="o">.</span><span class="mi">7</span><span class="p">),</span>
<span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="o">.</span><span class="mi">2</span><span class="p">),</span>
<span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="o">.</span><span class="mi">1</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">dists</span> <span class="o">=</span> <span class="p">(</span><span class="n">q_0</span><span class="p">,</span> <span class="n">q_1</span><span class="p">,</span> <span class="n">q_2</span><span class="p">)</span>
<span class="c1"># Next, suppose we'd like to isolate Z_2</span>
<span class="n">j</span> <span class="o">=</span> <span class="mi">2</span>
</pre></div>
<p><span class="math">\(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\)</span>, written <code>E_i_neq_j_log_p_X_Z</code> below, can be computed as:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">E_i_neq_j_log_p_X_Z</span><span class="p">(</span><span class="n">Z_j</span><span class="p">):</span>
<span class="n">E</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">Z_i_neq_j_dists</span> <span class="o">=</span> <span class="p">[</span><span class="n">dist</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">dist</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dists</span><span class="p">)</span> <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">j</span><span class="p">]</span>
<span class="k">for</span> <span class="n">comb</span> <span class="ow">in</span> <span class="n">product</span><span class="p">(</span><span class="o">*</span><span class="n">Z_i_neq_j_dists</span><span class="p">):</span>
<span class="n">Z</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">prob</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">comb</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dists</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">j</span><span class="p">:</span>
<span class="n">Z</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">Z_j</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">Z_i</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="n">comb</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">Z</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">Z_i</span><span class="p">)</span>
<span class="k">if</span> <span class="n">prob</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">prob</span> <span class="o">=</span> <span class="n">p</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">prob</span> <span class="o">*=</span> <span class="n">p</span>
<span class="n">E</span> <span class="o">+=</span> <span class="n">prob</span> <span class="o">*</span> <span class="n">ln_p_X_Z</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span>
<span class="k">return</span> <span class="n">E</span>
</pre></div>
<ul>
<li>Continuing with our notes, it was not immediately obvious to me why we're able to introduce a second integral sign on line 3 of the derivation above. Nonetheless, the reason is quite simple; an exercise in nested for-loops is illustrative.</li>
</ul>
<p>Before beginning, let's recall the definition of an integral in code. In its simplest form, <span class="math">\(\int{y\,dx}\)</span>, with <span class="math">\(y = f(x)\)</span>, can be approximated as:</p>
<div class="highlight"><pre><span></span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">lower_lim</span><span class="p">,</span> <span class="n">upper_lim</span><span class="p">,</span> <span class="n">n_ticks</span><span class="p">)</span>
<span class="n">integral</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">y</span> <span class="o">*</span> <span class="n">dx</span> <span class="k">for</span> <span class="n">dx</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>
<span class="c1"># ...where `n_ticks` approaches infinity.</span>
</pre></div>
<p>With this in mind, the following confirms that introducing the second integral sign leaves the total unchanged:</p>
<div class="highlight"><pre><span></span><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">ln_p_X_Z</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Z</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="n">X</span> <span class="o">+</span> <span class="n">Z</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="c1"># some dummy expression</span>
<span class="c1"># Line 2 of `Expanding the first term`</span>
<span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">Z_0</span> <span class="ow">in</span> <span class="n">q_0</span><span class="p">:</span>
<span class="k">for</span> <span class="n">Z_1</span> <span class="ow">in</span> <span class="n">q_1</span><span class="p">:</span>
<span class="k">for</span> <span class="n">Z_2</span> <span class="ow">in</span> <span class="n">q_2</span><span class="p">:</span>
<span class="n">val_z_0</span><span class="p">,</span> <span class="n">prob_z_0</span> <span class="o">=</span> <span class="n">Z_0</span>
<span class="n">val_z_1</span><span class="p">,</span> <span class="n">prob_z_1</span> <span class="o">=</span> <span class="n">Z_1</span>
<span class="n">val_z_2</span><span class="p">,</span> <span class="n">prob_z_2</span> <span class="o">=</span> <span class="n">Z_2</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">val_z_0</span><span class="p">,</span> <span class="n">val_z_1</span><span class="p">,</span> <span class="n">val_z_2</span><span class="p">])</span>
<span class="n">total</span> <span class="o">+=</span> <span class="n">prob_z_0</span> <span class="o">*</span> <span class="n">prob_z_1</span> <span class="o">*</span> <span class="n">prob_z_2</span> <span class="o">*</span> <span class="n">ln_p_X_Z</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span>
<span class="n">TOTAL</span> <span class="o">=</span> <span class="n">total</span>
<span class="c1"># Line 3 of `Expanding the first term`</span>
<span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">Z_0</span> <span class="ow">in</span> <span class="n">q_0</span><span class="p">:</span>
<span class="n">_total</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">val_z_0</span><span class="p">,</span> <span class="n">prob_z_0</span> <span class="o">=</span> <span class="n">Z_0</span>
<span class="k">for</span> <span class="n">Z_1</span> <span class="ow">in</span> <span class="n">q_1</span><span class="p">:</span>
<span class="k">for</span> <span class="n">Z_2</span> <span class="ow">in</span> <span class="n">q_2</span><span class="p">:</span>
<span class="n">val_z_1</span><span class="p">,</span> <span class="n">prob_z_1</span> <span class="o">=</span> <span class="n">Z_1</span>
<span class="n">val_z_2</span><span class="p">,</span> <span class="n">prob_z_2</span> <span class="o">=</span> <span class="n">Z_2</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">val_z_0</span><span class="p">,</span> <span class="n">val_z_1</span><span class="p">,</span> <span class="n">val_z_2</span><span class="p">])</span>
<span class="n">_total</span> <span class="o">+=</span> <span class="n">prob_z_1</span> <span class="o">*</span> <span class="n">prob_z_2</span> <span class="o">*</span> <span class="n">ln_p_X_Z</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span>
<span class="n">total</span> <span class="o">+=</span> <span class="n">prob_z_0</span> <span class="o">*</span> <span class="n">_total</span>
<span class="k">assert</span> <span class="n">total</span> <span class="o">==</span> <span class="n">TOTAL</span>
</pre></div>
<p>In effect, isolating <span class="math">\(q_j(\mathbf{Z}_j)\)</span> is akin to the penultimate line <code>total += prob_z_0 * _total</code>, i.e. multiplying <span class="math">\(q_j(\mathbf{Z}_j)\)</span> by an intermediate summation <code>_total</code>. Therefore, the second integral sign is akin to <code>_total += prob_z_1 * prob_z_2 * ln_p_X_Z(X, Z)</code>, i.e. the computation of this intermediate summation itself.</p>
<p>More succinctly, a multi-dimensional integral can be thought of as a nested for-loop which computes a global sum. Herein, we are free to compute intermediate sums at will.</p>
<h2>Expanding the second term</h2>
<p>Next, let's expand <span class="math">\(B\)</span>. We note that this is the entropy of the full variational distribution <span class="math">\(q(\mathbf{Z})\)</span>.</p>
<div class="math">$$
\begin{align*}
B
&= - \int{q(\mathbf{Z})\log{q(\mathbf{Z})}}d\mathbf{Z}\\
&= - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{q(\mathbf{Z})}\bigg]\\
&= - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\prod\limits_{i}q_i(\mathbf{Z}_i)}\bigg]\\
&= - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\sum\limits_{i}\log{q_i(\mathbf{Z}_i)}\bigg]\\
&= - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{q_j(\mathbf{Z}_j)} + \sum\limits_{i \neq j}\log{q_i(\mathbf{Z}_i)}\bigg]\\
&= - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{q_j(\mathbf{Z}_j)}\bigg] - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\sum\limits_{i \neq j}\log{q_i(\mathbf{Z}_i)}\bigg]\\
&= - \mathop{\mathbb{E}}_{q_j(\mathbf{Z}_j)}\bigg[\log{q_j(\mathbf{Z}_j)}\bigg] - \mathop{\mathbb{E}}_{q_{i \neq j}(\mathbf{Z}_i)}\bigg[\sum\limits_{i \neq j}\log{q_i(\mathbf{Z}_i)}\bigg]\\
&= - \mathop{\mathbb{E}}_{q_j(\mathbf{Z}_j)}\bigg[\log{q_j(\mathbf{Z}_j)}\bigg] + \text{const}\\
&= - \int{q_j(\mathbf{Z}_j)\log{q_j(\mathbf{Z}_j)}}d\mathbf{Z}_j + \text{const}\\
\end{align*}
$$</div>
<p>As we'll be maximizing w.r.t. just <span class="math">\(q_j(\mathbf{Z}_j)\)</span>, we can set all terms that don't include this factor to constants.</p>
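<p>The pivotal fact above, that the entropy of a factorized distribution is the sum of the factor entropies, can be checked numerically. The two discrete factors below are hypothetical:</p>

```python
import numpy as np

def entropy(q):
    # Entropy of a discrete distribution: -sum q log q
    return -np.sum(q * np.log(q))

# Two hypothetical mean-field factors
q_0 = np.array([0.2, 0.3, 0.5])
q_1 = np.array([0.3, 0.3, 0.4])

# The full factorized q(Z) = q_0(Z_0) q_1(Z_1), as a joint table
q_joint = np.outer(q_0, q_1)

# The joint entropy decomposes into the sum of factor entropies
assert np.isclose(entropy(q_joint), entropy(q_0) + entropy(q_1))
```

This decomposition is exactly why every entropy term not involving <code>q_j</code> collapses into the constant above.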
<h2>Putting it back together</h2>
<div class="math">$$
\begin{align*}
\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}\bigg]
&= A + B\\
&= \int{q_j(\mathbf{Z}_j){ \mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}] }d\mathbf{Z}_j} - \int{q_j(\mathbf{Z}_j)\log{q_j(\mathbf{Z}_j)}}d\mathbf{Z}_j + \text{const}\\
\end{align*}
$$</div>
<h2>One final pseudonym</h2>
<p>Were we able to replace the expectation in <span class="math">\(A\)</span> with the <span class="math">\(\log\)</span> of some density <span class="math">\(D\)</span>, i.e.</p>
<div class="math">$$
= \int{q_j(\mathbf{Z}_j){ \log{D} }\ d\mathbf{Z}_j} - \int{q_j(\mathbf{Z}_j)\log{q_j(\mathbf{Z}_j)}}d\mathbf{Z}_j + \text{const}
$$</div>
<p><span class="math">\(A + B\)</span> could be rewritten as <span class="math">\(-\text{KL}(q_j(\mathbf{Z}_j)\Vert D)\)</span>.</p>
<p>Acknowledging that <span class="math">\(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\)</span> is an unnormalized log-likelihood written as a function of <span class="math">\(\mathbf{Z}_j\)</span>, we temporarily rewrite it as:</p>
<div class="math">$$
\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}] = \log{\tilde{p}(\mathbf{X}, \mathbf{Z}_j})
$$</div>
<p>As such:</p>
<div class="math">$$
\begin{align*}
\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z})}{q(\mathbf{Z})}}\bigg]
&= \int{q_j(\mathbf{Z}_j){ \log{\tilde{p}(\mathbf{X}, \mathbf{Z}_j}) }d\mathbf{Z}_j} - \int{q_j(\mathbf{Z}_j)\log{q_j(\mathbf{Z}_j)}}d\mathbf{Z}_j + \text{const}\\
&= \int{q_j(\mathbf{Z}_j){ \log{\frac{\tilde{p}(\mathbf{X}, \mathbf{Z}_j)}{q_j(\mathbf{Z}_j)}} }d\mathbf{Z}_j} + \text{const}\\
&= - \text{KL}\big(q_j(\mathbf{Z}_j)\Vert \tilde{p}(\mathbf{X}, \mathbf{Z}_j)\big) + \text{const}\\
\end{align*}
$$</div>
<p>Finally, per this expression, the ELBO reaches its maximum when:</p>
<div class="math">$$
\begin{align*}
q_j(\mathbf{Z}_j)
&= \tilde{p}(\mathbf{X}, \mathbf{Z}_j)\\
&= \exp{\bigg(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\bigg)}
\end{align*}
$$</div>
<p>Or equivalently:</p>
<div class="math">$$
\log{q_j(\mathbf{Z}_j)} = \mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]
$$</div>
<p><strong>Summing up:</strong></p>
<ul>
<li>Iteratively minimizing the divergence between <span class="math">\(q_j(\mathbf{Z}_j)\)</span> and <span class="math">\(\tilde{p}(\mathbf{X}, \mathbf{Z}_j)\)</span> for all factors <span class="math">\(j\)</span> is our mechanism for maximizing the ELBO</li>
<li>In turn, maximizing the ELBO is our mechanism for minimizing the KL divergence between the full factorized posterior <span class="math">\(q(\mathbf{Z})\)</span> and the true posterior <span class="math">\(p(\mathbf{Z}\vert\mathbf{X})\)</span>.</li>
</ul>
<p>Finally, as the optimal density <span class="math">\(q_j(\mathbf{Z}_j)\)</span> relies on those of <span class="math">\(q_{i \neq j}(\mathbf{Z}_{i})\)</span>, this optimization algorithm is necessarily <em>iterative</em>.</p>
<h2>Normalization constant</h2>
<p>Nearing the end, we note that <span class="math">\(q_j(\mathbf{Z}_j) = \exp{\bigg(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\bigg)}\)</span> is not necessarily a normalized density (over <span class="math">\(\mathbf{Z}_j\)</span>). "By inspection," we compute:</p>
<div class="math">$$
\begin{align*}
q_j(\mathbf{Z}_j)
&= \frac{\exp{\bigg(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\bigg)}}{\int{\exp{\bigg(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\bigg)}d\mathbf{Z}_j}}\\
&\propto \exp{\bigg(\mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]\bigg)}\\
\end{align*}
$$</div>
<h2>How to actually employ this thing</h2>
<p>First, plug in values for the right-hand side of:</p>
<div class="math">$$
\log{q_j(\mathbf{Z}_j)} = \mathop{\mathbb{E}}_{i \neq j}[\log{p(\mathbf{X, Z})}]
$$</div>
<p>Then, attempt to rearrange this expression such that, once exponentiated (giving <span class="math">\(\exp{\big(\log{q_j(\mathbf{Z}_j)}\big)} = q_j(\mathbf{Z}_j)\)</span>) and normalized by inspection, it resembles a known density function (e.g. a Gaussian, a Gamma, etc.).</p>
<p>NB: This may require significant algebraic manipulation.</p>
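<p>On a toy discrete model, this recipe (plug in, exponentiate, normalize "by inspection") reduces to a few lines. The joint table and current factor <code>q_0</code> below are hypothetical:</p>

```python
import numpy as np

# Hypothetical joint p(X = x, Z_0, Z_1): rows index Z_0, columns index Z_1
p_joint = np.array([[0.10, 0.25],
                    [0.30, 0.35]])
q_0 = np.array([0.4, 0.6])  # current factor over Z_0

# Plug in: log q_1(z_1) = E_{q_0}[log p(X, Z_0, z_1)] + const
log_q_1 = q_0 @ np.log(p_joint)

# Exponentiate, then normalize "by inspection"
q_1 = np.exp(log_q_1)
q_1 /= q_1.sum()

assert np.isclose(q_1.sum(), 1.0)
```

In a continuous model the same normalization happens symbolically, by recognizing the exponentiated expression as the kernel of a known density.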
<h1>Approximating a Gaussian</h1>
<p>Here, we'll approximate a 2D multivariate Gaussian with a factorized mean-field approximation.</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/mean-field-variational-bayes/mv-gaussian-approx-1.png"/></p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/mean-field-variational-bayes/mv-gaussian-approx-2.png"/></p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/mean-field-variational-bayes/mv-gaussian-approx-3.png"/></p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/mean-field-variational-bayes/mv-gaussian-approx-4.png"/></p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/mean-field-variational-bayes/mv-gaussian-approx-5.png"/></p>
<h1>Summing up</h1>
<p>Mean-Field Variational Bayes is an iterative optimization algorithm for maximizing a lower-bound of the marginal likelihood of some data <span class="math">\(\mathbf{X}\)</span> under a given model with latent variables <span class="math">\(\mathbf{Z}\)</span>. It accomplishes this task by positing a factorized variational distribution over all latent variables <span class="math">\(\mathbf{Z}\)</span> and parameters <span class="math">\(\theta\)</span>, then computes, <em>analytically</em>, the algebraic forms and parameters of each factor which maximize this bound.</p>
<p>In practice, this process can be cumbersome and labor-intensive. As such, in recent years, "black-box variational inference" techniques were born, which <em>fix</em> the form of each factor <span class="math">\(q_j(\mathbf{Z}_j)\)</span>, then optimize their parameters via gradient descent.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>C. M. Bishop. Pattern recognition and machine learning,
page 229. Springer-Verlag New York, 2006. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Deriving Expectation-Maximization2018-11-11T16:00:00-05:002018-11-11T16:00:00-05:00Will Wolftag:willwolf.io,2018-11-11:/2018/11/11/em-for-lda/<p>Deriving the expectation-maximization algorithm, and the beginnings of its application to LDA. Once finished, its intimate connection to variational inference is apparent.</p><p>Consider a model with parameters <span class="math">\(\theta\)</span> and latent variables <span class="math">\(\mathbf{Z}\)</span>; the expectation-maximization algorithm (EM) is a mechanism for computing the values of <span class="math">\(\theta\)</span> that, under this model, maximize the likelihood of some observed data <span class="math">\(\mathbf{X}\)</span>.</p>
<p>The joint probability of our model can be written as follows:</p>
<div class="math">$$
p(\mathbf{X}, \mathbf{Z}\vert \theta) = p(\mathbf{X}\vert \mathbf{Z}, \theta)p(\mathbf{Z}\vert \theta)
$$</div>
<p>where, once more, our stated goal is to maximize the marginal likelihood of our data:</p>
<div class="math">$$
\log{p(\mathbf{X}\vert\theta)} = \log{\sum_{\mathbf{Z}}p(\mathbf{X, Z}\vert\theta)}
$$</div>
<p>An example of a latent variable model is the Latent Dirichlet Allocation<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> (LDA) model for uncovering latent topics in documents of text. Once finished deriving the general EM equations, we'll (begin to) apply them to this model.</p>
<h2>Why not maximum likelihood estimation?</h2>
<p>As the adage goes, computing the MLE with respect to this marginal is "hard." For one, it requires summing over an (implicitly) humongous number of configurations of the latent variables <span class="math">\(\mathbf{Z}\)</span>. Further, as Bishop<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup> states:</p>
<blockquote>
<p>A key observation is that the summation over the latent variables appears inside the logarithm. Even if the joint distribution <span class="math">\(p(\mathbf{X, Z}\vert\theta)\)</span> belongs to the exponential family, the marginal distribution <span class="math">\(p(\mathbf{X}\vert\theta)\)</span> typically does not as a result of this summation. The presence of the sum prevents the logarithm from acting directly on the joint distribution, resulting in complicated expressions for the maximum likelihood solution.</p>
</blockquote>
<p><strong>We'll want something else to maximize instead.</strong></p>
<h2>A lower bound</h2>
<p>Instead of maximizing the log-marginal <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> (with respect to model parameters <span class="math">\(\theta\)</span>), let's maximize a lower-bound with a less-problematic form.</p>
<p>Perhaps, we'd work with <span class="math">\(\log{p(\mathbf{X}, \mathbf{Z}\vert \theta)}\)</span> which, almost tautologically, removes the summation over latent variables <span class="math">\(\mathbf{Z}\)</span>.</p>
<p>As such, let's derive a lower-bound which features this term. As <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> is often called the log-"evidence," we'll call our expression the "evidence lower-bound," or ELBO.</p>
<h2>Jensen's inequality</h2>
<p><a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen's inequality</a><sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup> generalizes the statement that the line secant to a <strong>concave function</strong> lies below this function. An example is illustrative:</p>
<p><img alt="png" class="img-responsive" src="https://alliance.seas.upenn.edu/~cis520/dynamic/2017/wiki/uploads/Lectures/jensen.png"/></p>
<p>First, we note that the red line is below the blue for all points for which it is defined.</p>
<p>Second, working through the example, and assuming:</p>
<ul>
<li><span class="math">\(f(x) = \exp(-(x - 2)^2)\)</span></li>
<li><span class="math">\(v_1 = 1; v_2 = 2.5; \alpha = .3\)</span></li>
</ul>
<div class="math">$$
\begin{align*}
f(v_1) &\approx .3679\\
f(v_2) &\approx .7788\\
\alpha f(v_1) + (1 - \alpha)f(v_2) &\approx \bf{.6555}\\
\end{align*}
$$</div>
<div class="math">$$
\begin{align*}
\alpha v_1 + (1 - \alpha)v_2 &= 2.05\\
f(\alpha v_1 + (1 - \alpha)v_2) &\approx \bf{.9975}\\
\end{align*}
$$</div>
<p>we see that <strong><span class="math">\(\alpha f(v_1) + (1 - \alpha)f(v_2) \leq f(\alpha v_1 + (1 - \alpha)v_2)\)</span></strong>.</p>
<p>Finally, we arrive at a general form:</p>
<div class="math">$$
\mathop{\mathbb{E}_{v}}[f(v)] \leq f(\mathop{\mathbb{E}_{v}}[v])
$$</div>
<p>where <span class="math">\(p(v) = \alpha\)</span>.</p>
<h2>Deriving the ELBO</h2>
<p>In trying to align <span class="math">\(\log{p(\mathbf{X}\vert\theta)}
= \log{\sum\limits_{\mathbf{Z}}p(\mathbf{X, Z}\vert\theta)}\)</span> with <span class="math">\(f(\mathop{\mathbb{E}_{v}}[v])\)</span>, we see a function <span class="math">\(f = \log\)</span> yet no expectation inside. However, given the summation over <span class="math">\(\mathbf{Z}\)</span>, introducing some distribution <span class="math">\(q(\mathbf{Z})\)</span> would give us the expectation we desire.</p>
<div class="math">$$
\begin{align*}
\log{p(\mathbf{X}\vert\theta)}
&= \log{\sum_{\mathbf{Z}}p(\mathbf{X, Z}\vert\theta)}\\
&= \log{\sum_{\mathbf{Z}}q(\mathbf{Z})\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\\
&= \log{\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}\bigg]}
\end{align*}
$$</div>
<p>where <span class="math">\(q(\mathbf{Z})\)</span> is some distribution over <span class="math">\(\mathbf{Z}\)</span> with parameters <span class="math">\(\lambda\)</span> (omitted for cleanliness) and known form (e.g. a Gaussian). It is often referred to as a <strong>variational distribution</strong>.</p>
<p>From here, via Jensen's inequality, we can derive the lower-bound:</p>
<div class="math">$$
\begin{align*}
\log{p(\mathbf{X}\vert\theta)} = \log{\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}\bigg]}
&\geq \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg]\\
&= \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg] + R
\end{align*}
$$</div>
<p><em>Et voilà</em>, we see that this term contains <span class="math">\(\log{p(\mathbf{X, Z}\vert\theta)}\)</span>; the ELBO should now be easier to optimize with respect to our parameters <span class="math">\(\theta\)</span>.</p>
<h1>So, what's <span class="math">\(R\)</span>?</h1>
<div class="math">$$
\begin{align*}
R
&= \log{p(\mathbf{X}\vert\theta)} - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg]\\
&= \log{p(\mathbf{X}\vert\theta)} - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{X, Z}\vert\theta)} - \log{q(\mathbf{Z})}\bigg]\\
&= \log{p(\mathbf{X}\vert\theta)} - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{Z}\vert\mathbf{X}, \theta)} + \log{p(\mathbf{X}\vert\theta)} - \log{q(\mathbf{Z})}\bigg]\\
&= \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{X}\vert\theta)}\bigg] - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{Z}\vert\mathbf{X}, \theta)} + \log{p(\mathbf{X}\vert\theta)} - \log{q(\mathbf{Z})}\bigg]\\
&= \sum_{\mathbf{Z}}q(\mathbf{Z})\bigg(\log{p(\mathbf{X}\vert\theta)} - \log{p(\mathbf{Z}\vert\mathbf{X}, \theta)} - \log{p(\mathbf{X}\vert\theta)} + \log{q(\mathbf{Z})}\bigg)\\
&= \sum_{\mathbf{Z}}q(\mathbf{Z})\bigg( - \log{p(\mathbf{Z}\vert\mathbf{X}, \theta)} + \log{q(\mathbf{Z})}\bigg)\\
&=
\sum_{\mathbf{Z}}q(\mathbf{Z})\log{\frac{q(\mathbf{Z})}{p(\mathbf{Z}\vert\mathbf{X}, \theta)}}\\
&= \text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\\
\end{align*}
$$</div>
<p><strong>Putting it back together:</strong></p>
<div class="math">$$
\log{p(\mathbf{X}\vert\theta)} = \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg] + \text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)
$$</div>
<h2>The EM algorithm</h2>
<p>The algorithm can be described by a few simple observations.</p>
<ol>
<li>
<p><span class="math">\(\text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\)</span> is a divergence metric which is strictly non-negative.</p>
</li>
<li>
<p>As <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> does not depend on <span class="math">\(q(\mathbf{Z})\)</span>, if we decrease <span class="math">\(\text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\)</span> by changing <span class="math">\(q(\mathbf{Z})\)</span>, the ELBO must increase to compensate.</p>
</li>
<li>
<p>If we increase the ELBO by changing <span class="math">\(\theta\)</span>, <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> will increase as well. In addition, as <span class="math">\(p(\mathbf{Z}\vert\mathbf{X}, \theta)\)</span> now (likely) diverges from <span class="math">\(q(\mathbf{Z})\)</span> by a non-zero amount, <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> will increase even more.</p>
</li>
</ol>
<p><strong>The EM algorithm is a repeated alternation between Step 2 (E-step) and Step 3 (M-step).</strong> After each M-Step, <span class="math">\(\log{p(\mathbf{X}\vert\theta)}\)</span> is guaranteed to increase (unless it is already at a maximum)<sup id="fnref2:2"><a class="footnote-ref" href="#fn:2">2</a></sup>.</p>
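<p>To make the alternation concrete, here is a minimal EM sketch for a toy model not covered in this post: a two-component, unit-variance Gaussian mixture in one dimension, where the exact posterior is tractable. The E-step sets <span class="math">\(q(\mathbf{Z})\)</span> to the posterior responsibilities; the M-step maximizes the ELBO with respect to <span class="math">\(\theta = (\pi, \mu_1, \mu_2)\)</span>; the log-evidence never decreases:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def npdf(x, m):
    # unit-variance Gaussian density
    return np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)

def log_evidence(pi, mu):
    return np.log(pi * npdf(x, mu[0]) + (1 - pi) * npdf(x, mu[1])).sum()

pi, mu = 0.5, np.array([-1.0, 1.0])  # initial theta
lls = []
for _ in range(20):
    # E-step: set q(Z) = p(Z | X, theta), the posterior "responsibilities"
    r1 = pi * npdf(x, mu[0])
    r2 = (1 - pi) * npdf(x, mu[1])
    resp = r1 / (r1 + r2)
    # M-step: maximize the ELBO w.r.t. theta, holding q(Z) fixed
    pi = resp.mean()
    mu = np.array([(resp * x).sum() / resp.sum(),
                   ((1 - resp) * x).sum() / (1 - resp).sum()])
    lls.append(log_evidence(pi, mu))

# log-evidence is non-decreasing after each M-step
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```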
<p>A graphic<sup id="fnref3:2"><a class="footnote-ref" href="#fn:2">2</a></sup> is further illustrative.</p>
<h3>Initial decomposition</h3>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/em-for-lda/initial_decomp.png"/></p>
<p>Here, the ELBO is written as <span class="math">\(\mathcal{L}(q, \theta)\)</span>.</p>
<h3>E-step</h3>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/em-for-lda/e_step.png"/></p>
<p>In other words, holding the parameters <span class="math">\(\theta\)</span> constant, minimize <span class="math">\(\text{KL}\big(q(\mathbf{Z})\Vert p(\mathbf{Z}\vert\mathbf{X}, \theta)\big)\)</span> with respect to <span class="math">\(q(\mathbf{Z})\)</span>. Remember, as <span class="math">\(q\)</span> is a distribution with a fixed functional form, this amounts to updating its parameters <span class="math">\(\lambda\)</span>.</p>
<p>The caption implies that we can always compute <span class="math">\(q(\mathbf{Z}) = p(\mathbf{Z}\vert\mathbf{X}, \theta)\)</span>. We will show below that this is not the case for LDA, nor for many interesting models.</p>
<h3>M-step</h3>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/em-for-lda/m_step.png"/></p>
<p>In other words, in the M-step, maximize the ELBO with respect to the model parameters <span class="math">\(\theta\)</span>.</p>
<p>Expanding the ELBO:</p>
<div class="math">$$
\begin{align*}
\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\mathbf{X, Z}\vert\theta)}{q(\mathbf{Z})}}\bigg]
&= \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{X, Z}\vert\theta)}\bigg] - \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{q(\mathbf{Z})}\bigg]\\
&= \mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{p(\mathbf{X, Z}\vert\theta)}\bigg] + \mathbf{H}[q(\mathbf{Z})]
\end{align*}
$$</div>
<p>we see that it decomposes into an expectation of the joint distribution over data and latent variables given parameters <span class="math">\(\theta\)</span> with respect to the variational distribution <span class="math">\(q(\mathbf{Z})\)</span>, plus the entropy of <span class="math">\(q(\mathbf{Z})\)</span>.</p>
<p>As our task is to maximize this expression with respect to <span class="math">\(\theta\)</span>, we can treat the latter term as a constant.</p>
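<p>Again, this decomposition is easy to check numerically, with an arbitrary unnormalized joint and variational distribution over a small discrete latent:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

p_joint = rng.random(4)        # p(X, Z=z | theta) over 4 latent states
q = rng.random(4)
q /= q.sum()                   # arbitrary variational distribution q(Z)

elbo = np.sum(q * np.log(p_joint / q))
expected_joint = np.sum(q * np.log(p_joint))   # E_q[log p(X, Z | theta)]
entropy = -np.sum(q * np.log(q))               # H[q(Z)]

# ELBO = expected joint + entropy of q
assert np.isclose(elbo, expected_joint + entropy)
```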
<h2>EM for LDA</h2>
<p>To give an example of the above, we'll examine the classic Latent Dirichlet Allocation<sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup> paper.</p>
<h3>Model</h3>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/em-for-lda/lda_formulation.png"/></p>
<p>"Given the parameters <span class="math">\(\alpha\)</span> and <span class="math">\(\beta\)</span>, the joint distribution of a topic mixture <span class="math">\(\theta\)</span>, a set of <span class="math">\(N\)</span> topics <span class="math">\(\mathbf{z}\)</span>, and a set of <span class="math">\(N\)</span> words <span class="math">\(\mathbf{w}\)</span> is given by:"<sup id="fnref3:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<div class="math">$$
p(\theta, \mathbf{z}, \mathbf{w}\vert \alpha, \beta) = p(\theta\vert \alpha)\prod\limits_{n=1}^{N}p(z_n\vert \theta)p(w_n\vert z_n, \beta)
$$</div>
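<p>The generative story this joint implies is straightforward to simulate. Below is a sketch for a single document; the sizes <code>K</code>, <code>V</code>, <code>N</code> and the prior values are arbitrary choices for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 10, 8                         # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)                    # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions, shape (K, V)

# generative process for one document
theta = rng.dirichlet(alpha)               # theta ~ Dir(alpha)
z = rng.choice(K, size=N, p=theta)         # z_n ~ Categorical(theta)
w = np.array([rng.choice(V, p=beta[z_n]) for z_n in z])  # w_n ~ Categorical(beta_{z_n})

assert w.shape == (N,)
```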
<h3>Log-evidence</h3>
<p>The (problematic) log-evidence of a single document:</p>
<div class="math">$$
\log{p(\mathbf{w}\vert \alpha, \beta)} = \log{\int p(\theta\vert \alpha)\prod\limits_{n=1}^{N}\sum\limits_{z_n} p(z_n\vert \theta)p(w_n\vert z_n, \beta)d\theta}
$$</div>
<p>NB: The parameters of our model are <span class="math">\(\{\alpha, \beta\}\)</span> and <span class="math">\(\{\theta, \mathbf{z}\}\)</span> are our latent variables.</p>
<h3>ELBO</h3>
<div class="math">$$
\mathop{\mathbb{E}}_{q(\mathbf{Z})}\bigg[\log{\frac{p(\theta\vert \alpha)\prod\limits_{n=1}^{N}p(z_n\vert \theta)p(w_n\vert z_n, \beta)}{q(\mathbf{Z})}}\bigg]
$$</div>
<p>where <span class="math">\(\mathbf{Z} = \{\theta, \mathbf{z}\}\)</span>.</p>
<h3>KL term</h3>
<div class="math">$$
\text{KL}\big(q(\mathbf{Z})\Vert \frac{p(\theta, \mathbf{z}, \mathbf{w}\vert \alpha, \beta)}{p(\mathbf{w}\vert \alpha, \beta)}\big)
$$</div>
<p>Peering at the denominator, we see that it includes an integration over all values of <span class="math">\(\theta\)</span>, which we assume is intractable to compute. As such, the "ideal" E-step solution <span class="math">\(q(\mathbf{Z}) = p(\theta, \mathbf{z}\vert \mathbf{w}, \alpha, \beta)\)</span> will elude us as well.</p>
<p>In the next post, we'll cover how to minimize this KL term with respect to <span class="math">\(q(\mathbf{Z})\)</span> in detail. This effort will begin with the derivation of the mean-field algorithm.</p>
<h2>Summary</h2>
<p>In this post, we motivated the expectation-maximization algorithm, then derived its general form. We then applied this framework to the LDA model.</p>
<p>In the next post, we'll expand this logic into mean-field variational Bayes, and eventually, variational inference more broadly.</p>
<p>Thanks for reading.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref3:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>C. M. Bishop. Pattern recognition and machine learning,
page 229. Springer-Verlag New York, 2006. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a><a class="footnote-backref" href="#fnref2:2" title="Jump back to footnote 2 in the text">↩</a><a class="footnote-backref" href="#fnref3:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Wikipedia contributors. "Jensen's inequality." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 29 Oct. 2018. Web. 11 Nov. 2018. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Additional Strategies for Confronting the Partition Function2018-10-29T22:00:00-04:002018-10-29T22:00:00-04:00Will Wolftag:willwolf.io,2018-10-29:/2018/10/29/additional-strategies-partition-function/<p>Stochastic maximum likelihood, contrastive divergence, negative contrastive estimation and negative sampling for improving or avoiding the computation of the gradient of the log-partition function. (Oof, that's a mouthful.)</p><p>In the <a href="https://willwolf.io/2018/10/20/thorough-introduction-to-boltzmann-machines/">previous post</a> we introduced Boltzmann machines and the infeasibility of computing the gradient of its log-partition function <span class="math">\(\nabla\log{Z}\)</span>. To this end, we explored one strategy for its approximation: Gibbs sampling. Gibbs sampling is a viable alternative because the expression for our gradient simplifies to an expectation over the model distribution, which can be approximated with Monte Carlo samples.</p>
<p>In this post, we'll highlight the imperfections of this approximate approach itself, then present more preferable alternatives.</p>
<h1>Pitfalls of Gibbs sampling</h1>
<p>To refresh, the two gradients we seek to compute in a reasonable amount of time are:</p>
<div class="math">$$
\nabla_{w_{i, j}}\log{Z} = \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]\\
\nabla_{b_{i}}\log{Z} = \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i]
$$</div>
<p>Via Gibbs sampling, we approximate each by:</p>
<ol>
<li>Burning in a Markov chain with respect to our model, then selecting <span class="math">\(n\)</span> samples from this chain</li>
<li>Evaluating both functions (<span class="math">\(x_i x_j\)</span>, and <span class="math">\(x_i\)</span>) at these samples</li>
<li>Taking the average of each</li>
</ol>
<p>Concretely:</p>
<div class="math">$$
\nabla_{w_{i, j}}\log{Z} \approx \frac{1}{N}\sum\limits_{k=1}^N x^{(k)}_i x^{(k)}_j\quad\text{where}\quad x^{(k)} \sim p_{\text{model}}\\
\nabla_{b_{i}}\log{Z} \approx \frac{1}{N}\sum\limits_{k=1}^N x^{(k)}_i\quad\text{where}\quad x^{(k)} \sim p_{\text{model}}
$$</div>
<p><strong>We perform this sampling process at each gradient step.</strong></p>
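<p>These Monte Carlo estimates take one line each given post-burn-in samples. In the sketch below, uniform random binary states stand in for actual Gibbs samples from <span class="math">\(p_{\text{model}}\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, N = 5, 1000
# stand-in for post-burn-in Gibbs samples x^(k) ~ p_model (binary unit states)
samples = rng.integers(0, 2, size=(N, n_units)).astype(float)

# Monte Carlo approximations of the negative-phase gradients
grad_W = samples.T @ samples / N   # entry [i, j] approximates E[x_i x_j]
grad_b = samples.mean(axis=0)      # entry [i] approximates E[x_i]

assert grad_W.shape == (n_units, n_units)
```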
<h2>The cost of burning in each chain</h2>
<p>Initializing a Markov chain at a random sample incurs a non-trivial "burn-in" cost. If paying this cost at each gradient step, it begins to add up. How can we do better?</p>
<p><strong>In the remainder of the post, we'll explore two new directives for approximating the negative phase more cheaply, and the algorithms they birth.</strong></p>
<h1>Directive #1: Cheapen the burn-in process</h1>
<h2>Stochastic maximum likelihood</h2>
<p>SML starts from a simple premise: initialize the chain at a point already close to the model's true distribution, reducing or perhaps eliminating the cost of burn-in altogether. <strong>In this vein, at what sample do we initialize the chain?</strong></p>
<p>In SML, we simply initialize at the terminal value of the previous chain (i.e. the one we manufactured to compute the gradients of the previous mini-batch). <strong>As long as the model has not changed significantly since, i.e. as long as the previous parameter update (gradient step) was not too large, this sample should exist in a region of high probability under the current model.</strong></p>
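<p>A minimal sketch of this persistent chain, assuming a hypothetical <code>gibbs_step</code> that stands in for a real Gibbs sweep under the current model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 4

def gibbs_step(state):
    # hypothetical single Gibbs sweep: flip each unit with fixed probability;
    # a stand-in for sampling x_i ~ p(x_i | x_{-i}) under the current model
    flips = rng.random(n_units) < 0.3
    return np.where(flips, 1 - state, state)

# SML / persistent contrastive divergence: one chain persists across updates,
# continuing from the previous mini-batch's terminal state (no fresh burn-in)
state = np.zeros(n_units)
for gradient_step in range(10):
    for _ in range(2):           # only a few cheap steps per parameter update
        state = gibbs_step(state)
    negative_sample = state      # used for the negative-phase gradient estimate

assert negative_sample.shape == (n_units,)
```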
<h3>Implications</h3>
<p>Per the expression for the full log-likelihood gradient, e.g. <span class="math">\(\nabla_{w_{i, j}}\log{\mathcal{L}} = \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} [x_i x_j] - \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]\)</span>, the negative phase works to "reduce the probability of the points in which the model strongly, yet wrongly, believes".<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> Since we approximate this term at each parameter update with samples <em>roughly from</em> the current model's true distribution, <strong>we do not encroach on this foundational task.</strong></p>
<h2>Contrastive divergence</h2>
<p>Alternatively, in the contrastive divergence algorithm, we initialize the chain at each gradient step with a <em>random sample</em> from the data distribution.</p>
<h3>Implications</h3>
<p>With no guarantee that the data distribution resembles the model distribution, we may systematically fail to sample, and thereafter "suppress," points that are incorrectly likely under the latter (as they do not appear in the former!). <strong>This incurs the growth of "spurious modes"</strong> in our model, aptly named.<sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<h2>In summary</h2>
<p>Cheapening the burn-in phase indeed gives us a more efficient training routine. Moving forward, what are some even more aggressive strategies we might explore?</p>
<h1>Directive #2: Skip the computation of <span class="math">\(Z\)</span> altogether</h1>
<p>Canonically, we write the log-likelihood of our Boltzmann machine as follows:</p>
<div class="math">$$
\begin{align*}
\log{\mathcal{L}(x)}
&= \log{\frac{\exp{(H(x))}}{Z}}\\
&= \log{\big(\exp{(H(x))}\big)} - \log{Z}\\
&= H(x) - \log{Z}
\end{align*}
$$</div>
<p>Instead, what if we simply wrote this as:</p>
<div class="math">$$
\log{\mathcal{L}(x)} = H(x) - c
$$</div>
<p>or, more generally:</p>
<div class="math">$$
\log{p_{\text{model}}(x)} = \log{\tilde{p}_{\text{model}}(x; \theta)} - c
$$</div>
<p>and estimated <span class="math">\(c\)</span> as a parameter?</p>
<p><strong>Immediately, we remark that if we optimize this model with maximum likelihood, our algorithm will, trivially, make <span class="math">\(c\)</span> arbitrarily negative.</strong> In other words, the easiest way to increase <span class="math">\(\log{p_{\text{model}}(x)}\)</span> is to decrease <span class="math">\(c\)</span>.</p>
<p>How might we better phrase this problem?</p>
<h2>Noise contrastive estimation</h2>
<p>Ingeniously, NCE proposes an alternative:</p>
<ol>
<li>Posit two distributions: the model, and a noise distribution</li>
<li>Given a data point, predict the distribution (i.e. binary classification) from which this point was generated</li>
</ol>
<p>Let's unpack this a bit.</p>
<p>Under an (erroneous) MLE formulation, we would optimize the following objective:</p>
<div class="math">$$
\theta, c = \underset{\theta, c}{\arg\max}\ \mathbb{E}_{x \sim p_{\text{data}}} [\log{p_{\text{model}}}(x)]
$$</div>
<p>Under NCE, we're going to replace two pieces so as to perform the binary classification task described above (with 1 = "model", and 0 = "noise").</p>
<p>First, let's swap <span class="math">\(\log{p_{\text{model}}}(x)\)</span> with <span class="math">\(\log{p_{\text{joint}}}(y = 0\vert x)\)</span>, where:</p>
<div class="math">$$
p_{\text{joint}}(x\vert y) =
\begin{cases}
p_{\text{noise}}(x)\quad y = 0\\
p_{\text{model}}(x)\quad y = 1\\
\end{cases}
$$</div>
<div class="math">$$
p_{\text{joint}}(x, y)
= p_{\text{joint}}(y = 0)p_{\text{noise}}(x) + p_{\text{joint}}(y = 1)p_{\text{model}}(x)
$$</div>
<div class="math">$$
p_{\text{joint}}(y = 0\vert x)
= \frac{p_{\text{joint}}(y = 0)p_{\text{noise}}(x)}{p_{\text{joint}}(y = 0)p_{\text{noise}}(x) + p_{\text{joint}}(y = 1)p_{\text{model}}(x)}
$$</div>
<p>Finally:</p>
<div class="math">$$
\theta, c = \underset{\theta, c}{\arg\max}\ \mathbb{E}_{x \sim p_{\text{data}}} [\log{p_{\text{joint}}(y = 0\vert x)}]
$$</div>
<p>From here, we need to update <span class="math">\(x \sim p_{\text{data}}\)</span> to include <span class="math">\(y\)</span>. We'll do this in two pedantic steps.</p>
<p>First, let's write:</p>
<div class="math">$$
\theta, c = \underset{\theta, c}{\arg\max}\ \mathbb{E}_{x, y=0\ \sim\ p_{\text{noise}}} [\log{p_{\text{joint}}(y\vert x)}]
$$</div>
<p>This equation:</p>
<ol>
<li>Builds a classifier that discriminates between samples generated from the model distribution and noise distribution <strong>trained only on samples from the latter.</strong> (Clearly, this will not make for an effective classifier.)</li>
<li>To train this classifier, we note that the equation asks us to maximize the likelihood of the noise samples under the noise distribution, even though the noise distribution itself has no actual parameters we intend to train!</li>
</ol>
<p>As a solution, we trivially expand our expectation to one over both noise samples and data samples. In doing so, in predicting <span class="math">\(p_{\text{joint}}(y = 1\vert x) = 1 - p_{\text{joint}}(y = 0\vert x)\)</span>, <strong>we'll be maximizing the likelihood of the data under the model.</strong></p>
<div class="math">$$
\theta, c = \underset{\theta, c}{\arg\max}\ \mathbb{E}_{x, y\ \sim\ p_{\text{train}}} [\log{p_{\text{joint}}(y \vert x)}]
$$</div>
<p>where:</p>
<div class="math">$$
p_{\text{train}}(x\vert y) =
\begin{cases}
p_{\text{noise}}(x)\quad y = 0\\
p_{\text{data}}(x)\quad y = 1\\
\end{cases}
$$</div>
<p>As a final step, we'll expand our objective into something more elegant:</p>
<div class="math">$$
\begin{align*}
p_{\text{joint}}(y = 0\vert x)
&= \frac{p_{\text{joint}}(y = 0)p_{\text{noise}}(x)}{p_{\text{joint}}(y = 0)p_{\text{noise}}(x) + p_{\text{joint}}(y = 1)p_{\text{model}}(x)}\\
&= \frac{1}{1 + \frac{p_{\text{joint}}(y = 1)p_{\text{model}}(x)}{p_{\text{joint}}(y = 0)p_{\text{noise}}(x)}}\\
\end{align*}
$$</div>
<p>Assuming <em>a priori</em> that <span class="math">\(p_{\text{joint}}(x, y)\)</span> is <span class="math">\(k\)</span> times more likely to generate a noise sample, i.e. <span class="math">\(\frac{p_{\text{joint}}(y = 1)}{p_{\text{joint}}(y = 0)} = \frac{1}{k}\)</span>:</p>
<div class="math">$$
\begin{align*}
p_{\text{joint}}(y = 0\vert x)
&= \frac{1}{1 + \frac{p_{\text{model}}(x)}{p_{\text{noise}}(x)\cdot k}}\\
&= \frac{1}{1 + \exp\big(\log{\frac{p_{\text{model}}(x)}{{p_{\text{noise}}(x)\cdot k}}}\big)}\\
&= \sigma\bigg(-\log{\frac{p_{\text{model}}(x)}{{p_{\text{noise}}(x)\cdot k}}}\bigg)\\
&= \sigma\bigg(\log{k} + \log{p_{\text{noise}}(x)} - \log{p_{\text{model}}(x)}\bigg)\\
p_{\text{joint}}(y = 1\vert x)
&= 1 - \sigma\bigg(\log{k} + \log{p_{\text{noise}}(x)} - \log{p_{\text{model}}(x)}\bigg)
\end{align*}
$$</div>
<p>Given a joint training distribution over <span class="math">\((X_{\text{data}}, y=1)\)</span> and <span class="math">\((X_{\text{noise}}, y=0)\)</span>, this is the target we'd like to maximize.</p>
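<p>As a quick check of this expression in code, with arbitrary made-up log-likelihood values: the sigmoid form should agree with the direct ratio form from which it was derived.</p>

```python
import torch

def p_y0_given_x(log_p_model, log_p_noise, k):
    # sigma(log k + log p_noise(x) - log p_model(x)), per the derivation above
    return torch.sigmoid(torch.log(torch.tensor(float(k))) + log_p_noise - log_p_model)

log_p_model = torch.tensor([-1.0, -4.0])   # arbitrary illustrative values
log_p_noise = torch.tensor([-3.0, -1.5])
k = 5

lhs = p_y0_given_x(log_p_model, log_p_noise, k)
# direct form: 1 / (1 + p_model(x) / (k * p_noise(x)))
rhs = 1 / (1 + torch.exp(log_p_model) / (k * torch.exp(log_p_noise)))
assert torch.allclose(lhs, rhs)
```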
<h2>Implications</h2>
<p>For our training data, <strong>we require the ability to sample from our noise distribution.</strong></p>
<p>For our target, <strong>we require the ability to compute the likelihood of some data under our noise distribution.</strong></p>
<p>Therefore, these criteria do place practical restrictions on the types of noise distributions that we're able to consider.</p>
<h2>Extensions</h2>
<p>We briefly alluded to the fact that our noise distribution is non-parametric. However, there is nothing stopping us from evolving this distribution and giving it trainable parameters, then updating these parameters such that it generates increasingly "optimal" samples.</p>
<p>Of course, we would have to design what "optimal" means. One interesting approach is called <a href="https://arxiv.org/abs/1805.03642">Adversarial Contrastive Estimation
</a>, wherein the authors adapt the noise distribution to generate increasingly "harder negative examples, which forces the main model to learn a better representation of the data."<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup></p>
<h2>Negative sampling</h2>
<p>Negative sampling is the same as NCE except:</p>
<ol>
<li>We consider noise distributions whose likelihood we cannot evaluate</li>
<li>To accommodate, we simply set <span class="math">\(p_{\text{noise}}(x) = 1\)</span></li>
</ol>
<p>Therefore:</p>
<div class="math">$$
\begin{align*}
p_{\text{joint}}(y = 0\vert x)
&= \frac{1}{1 + \frac{p_{\text{model}}(x)}{p_{\text{noise}}(x)\cdot k}}\\
&= \frac{1}{1 + \frac{p_{\text{model}}(x)}{ k}}\\
&= \frac{1}{1 + \exp\big(\log\frac{p_{\text{model}}(x)}{ k}\big)}\\
&=\sigma(-\log\frac{p_{\text{model}}(x)}{ k})\\
&=\sigma(\log{k} - \log{p_{\text{model}}(x)})\\
p_{\text{joint}}(y = 1\vert x)
&= 1 - \sigma(\log{k} - \log{p_{\text{model}}(x)})
\end{align*}
$$</div>
<h2>In code</h2>
<p>Since I learn best by implementing things, let's play around. Below, we train Boltzmann machines via noise contrastive estimation and negative sampling.</p>
<h2>Load data</h2>
<p>For this exercise, we'll fit a Boltzmann machine to the <a href="https://www.kaggle.com/zalando-research/fashionmnist">Fashion MNIST</a> dataset.</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/additional-strategies-partition-function/output_3_0.png"/></p>
<h2>Define model</h2>
<p>Below, as opposed to in the previous post, I offer a vectorized implementation of the Boltzmann energy function.</p>
<p>That said, the code is still imperfect, particularly the line in which I iterate through data points individually to compute the joint likelihood.</p>
<p>Finally, in <code>Model._H</code>, I divide by 1000 to get this thing to train. The following is only a toy exercise (like many of my posts); I did not spend much time tuning parameters.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="nn">nn</span>
<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>
<span class="k">class</span> <span class="nc">Model</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_units</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="n">torch</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_units</span><span class="p">,</span> <span class="n">n_units</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">xavier_uniform_</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">weight</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">c</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">FloatTensor</span><span class="p">([</span><span class="mf">1.</span><span class="p">]))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">diagonal_mask</span> <span class="o">=</span> <span class="p">(</span><span class="o">~</span><span class="n">torch</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span><span class="o">.</span><span class="n">byte</span><span class="p">())</span><span class="o">.</span><span class="n">float</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">n_units</span> <span class="o">=</span> <span class="n">n_units</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="n">log</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_unnormalized_likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_H</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">_H</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">H</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">diagonal_mask</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">weight</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">ger</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">)))</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">bias</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="k">return</span> <span class="n">H</span> <span class="o">/</span> <span class="mi">1000</span>
<span class="k">def</span> <span class="nf">_likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> :param x: a vector of shape (n_units,) or (n, n_units),</span>
<span class="sd"> where the latter is a matrix of multiple data points</span>
<span class="sd"> for which to compute the joint likelihood</span>
<span class="sd"> :return: the likelihood, or log-likelihood if `log=True`</span>
<span class="sd"> """</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_units</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">()</span> <span class="ow">or</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">())</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">'Please pass 1 or more points of `n_units` dimensions'</span><span class="p">)</span>
<span class="c1"># compute unnormalized likelihoods</span>
<span class="n">multiple_samples</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
<span class="k">if</span> <span class="n">multiple_samples</span><span class="p">:</span>
<span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">point</span><span class="p">)</span> <span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">)]</span>
<span class="k">if</span> <span class="n">log</span><span class="p">:</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">torch</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lik</span><span class="p">)</span> <span class="o">-</span> <span class="n">torch</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">lik</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">c</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">init_sample</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">burn_in</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">every_n</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>  <span class="c1"># returns a list of tensors</span>
<span class="k">if</span> <span class="n">burn_in</span> <span class="o">></span> <span class="n">n_samples</span><span class="p">:</span>
<span class="n">n_samples</span> <span class="o">+=</span> <span class="n">burn_in</span>
<span class="n">init_sample</span> <span class="o">=</span> <span class="n">init_sample</span> <span class="k">if</span> <span class="n">init_sample</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">bias</span><span class="p">)</span>  <span class="c1"># `init_sample or ...` would raise on a multi-element tensor</span>
<span class="n">samples</span> <span class="o">=</span> <span class="p">[</span><span class="n">init_sample</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">_gibbs_step</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span>
<span class="n">z</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">weight</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">sample</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span> <span class="k">if</span> <span class="n">j</span> <span class="o">!=</span> <span class="n">i</span><span class="p">])</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">bias</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="k">return</span> <span class="n">p</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_samples</span><span class="p">):</span>
<span class="n">sample</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># make copy</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sample</span><span class="p">):</span>
<span class="n">sample</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">_gibbs_step</span><span class="p">(</span><span class="n">sample</span><span class="o">=</span><span class="n">sample</span><span class="p">,</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
<span class="n">samples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">sample</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">sample</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">samples</span><span class="p">[</span><span class="n">burn_in</span><span class="p">:])</span> <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="n">every_n</span> <span class="o">==</span> <span class="mi">0</span><span class="p">]</span>
</pre></div>
<h2>Noise contrastive estimation</h2>
<p>Train a model using noise contrastive estimation. For our noise distribution, we'll start with a diagonal multivariate Gaussian, from which we can sample, and whose likelihood we can evaluate (as of PyTorch 0.4!).</p>
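<p>The <code>classifier</code> defined below corresponds to the NCE posterior: the probability that a point came from the data distribution rather than the noise distribution, given <code>k</code> noise samples per data sample. As a plain-Python sketch of that quantity (the function name is mine, not from this post's code):</p>

```python
import math

def nce_posterior(log_p_model, log_p_noise, k):
    """Probability that x came from the model/data rather than the noise
    distribution, given k noise samples per data sample:
        P(D=1 | x) = p_model(x) / (p_model(x) + k * p_noise(x))
    computed via the sigmoid identity sigmoid(log p_m - log p_n - log k)."""
    return 1 / (1 + math.exp(-(log_p_model - log_p_noise - math.log(k))))

# when k * p_noise(x) equals p_model(x), the classifier is maximally uncertain
p = nce_posterior(log_p_model=-2.0, log_p_noise=-2.0, k=1)
```

<p>The lambda below computes one minus this quantity, i.e. the probability of the "noise" label.</p>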
<div class="highlight"><pre><span></span><span class="c1"># define model, noise distribution</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">MultivariateNormal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_units</span><span class="p">),</span> <span class="n">covariance_matrix</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">n_units</span><span class="p">))</span>
<span class="c1"># define classifier. we add a multiplicative constant to make training more stable.</span>
<span class="n">k</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">classifier</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">X</span><span class="p">:</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">([</span><span class="n">k</span><span class="p">])</span><span class="o">.</span><span class="n">log</span><span class="p">()</span> <span class="o">+</span> <span class="p">(</span><span class="n">noise</span><span class="o">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">/</span> <span class="mi">10000</span><span class="p">)</span> <span class="o">-</span> <span class="n">model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">())</span>
<span class="c1"># define noise generator</span>
<span class="n">noise_sample</span> <span class="o">=</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">noise</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">sample_shape</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">Size</span><span class="p">([</span><span class="n">BATCH_SIZE</span> <span class="o">*</span> <span class="n">k</span><span class="p">]))</span>
<span class="n">noiseloader</span> <span class="o">=</span> <span class="p">(</span> <span class="p">(</span><span class="n">noise_sample</span><span class="p">(),</span> <span class="n">_</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">trainloader</span><span class="p">))</span> <span class="p">)</span>
<span class="c1"># define optimizer, loss</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=.</span><span class="mi">01</span><span class="p">)</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">BCELoss</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">trainloader</span><span class="p">,</span> <span class="n">noiseloader</span><span class="p">,</span> <span class="n">n_batches</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s1">'inf'</span><span class="p">),</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">noise</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">trainloader</span><span class="p">,</span> <span class="n">noiseloader</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n_batches</span><span class="p">:</span>
<span class="c1"># points from data distribution</span>
<span class="n">X_data</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">data</span>
<span class="n">X_data</span> <span class="o">=</span> <span class="n">X_data</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_units</span><span class="p">)</span>
<span class="n">y_data</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">FloatTensor</span><span class="p">([</span><span class="mf">1.</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X_data</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">))])</span>
<span class="c1"># points from noise distribution</span>
<span class="n">X_noise</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">noise</span>
<span class="n">X_noise</span> <span class="o">=</span> <span class="n">X_noise</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_units</span><span class="p">)</span>
<span class="n">y_noise</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">FloatTensor</span><span class="p">([</span><span class="mf">0.</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X_noise</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">))])</span>
<span class="c1"># stack into single input</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">([</span><span class="n">X_data</span><span class="p">,</span> <span class="n">X_noise</span><span class="p">])</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_units</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">([</span><span class="n">y_data</span><span class="p">,</span> <span class="n">y_noise</span><span class="p">])</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">probs</span> <span class="o">=</span> <span class="n">classifier</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>  <span class="c1"># the classifier outputs probabilities, as BCELoss expects</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Batch: </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s1"> | Loss: </span><span class="si">{</span><span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'weights.mean(): </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">weight</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'bias.mean(): </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">bias</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'c: </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'weights.grad.mean(): </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">weight</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'bias.grad.mean(): </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">bias</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'c.grad: </span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="s1">.3</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</pre></div>
<h2>Train model</h2>
<div class="highlight"><pre><span></span><span class="n">train_model</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">trainloader</span><span class="p">,</span> <span class="n">noiseloader</span><span class="p">,</span> <span class="n">n_batches</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Batch</span><span class="o">:</span> <span class="mi">0</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.305</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">10</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0887</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">20</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0794</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">30</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0603</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">40</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0525</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">50</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0503</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">60</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0414</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">70</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.038</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">80</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.034</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">90</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0312</span>
</pre></div>
<h2>Negative sampling</h2>
<p>Next, we'll try negative sampling, using some actual images as the negative samples.</p>
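<p>Negative sampling can be read as NCE with the noise log-likelihood term dropped (equivalently, the noise distribution is treated as uniform), which is what the simpler <code>classifier</code> below does. A plain-Python sketch, with an illustrative function name of my own:</p>

```python
import math

def negative_sampling_posterior(log_p_model, k):
    """Negative sampling drops the noise log-likelihood term from the NCE
    classifier, i.e. P(D=1 | x) = sigmoid(log p_model(x) - log k)."""
    return 1 / (1 + math.exp(-(log_p_model - math.log(k))))

# when p_model(x) = k, the classifier is maximally uncertain
p = negative_sampling_posterior(log_p_model=math.log(10.0), k=10)
```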
<div class="highlight"><pre><span></span><span class="n">noiseset</span> <span class="o">=</span> <span class="n">dset</span><span class="o">.</span><span class="n">mnist</span><span class="o">.</span><span class="n">MNIST</span><span class="p">(</span><span class="n">root</span> <span class="o">=</span> <span class="s1">'data/mnist'</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">transform</span><span class="o">=</span><span class="n">transform</span><span class="p">)</span>
<span class="n">noiseloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">noiseset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span> <span class="o">*</span> <span class="mi">10</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># get some random training images</span>
<span class="n">dataiter</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">noiseloader</span><span class="p">)</span>
<span class="n">images</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">dataiter</span><span class="p">)</span>
<span class="c1"># show images</span>
<span class="n">imshow</span><span class="p">(</span><span class="n">torchvision</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">make_grid</span><span class="p">(</span><span class="n">images</span><span class="p">))</span>
</pre></div>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/additional-strategies-partition-function/output_12_0.png"/></p>
<h2>Train model</h2>
<div class="highlight"><pre><span></span><span class="c1"># define model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span>
<span class="c1"># define classifier</span>
<span class="n">k</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">classifier</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">X</span><span class="p">:</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">([</span><span class="n">k</span><span class="p">])</span><span class="o">.</span><span class="n">log</span><span class="p">()</span> <span class="o">-</span> <span class="n">model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">())</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=.</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># i had to change this learning rate to get this to train</span>
<span class="c1"># train</span>
<span class="n">train_model</span><span class="p">(</span><span class="n">classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">trainloader</span><span class="p">,</span> <span class="n">noiseloader</span><span class="p">,</span> <span class="n">n_batches</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Batch</span><span class="o">:</span> <span class="mi">0</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.304</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">10</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.027</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">20</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0111</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">30</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.00611</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">40</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.00505</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">50</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.00318</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">60</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.00284</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">70</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0029</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">80</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.0023</span>
<span class="n">Batch</span><span class="o">:</span> <span class="mi">90</span> <span class="o">|</span> <span class="n">Loss</span><span class="o">:</span> <span class="mf">0.00217</span>
</pre></div>
<h1>Sampling</h1>
<p>Once more, the (ideal) goal of this model is to fit a function <span class="math">\(p(x)\)</span> to some data, such that we can:</p>
<ol>
<li>Evaluate its likelihood (where, ideally, the model assigns higher likelihood to the data on which it was fit than to data on which it was not)</li>
<li>Draw realistic samples</li>
</ol>
<p>From a Boltzmann machine, our primary strategy for drawing samples is via Gibbs sampling. It's slow, and I do not believe it's meant to work particularly well. Let's draw 5 samples and see how we do.</p>
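<p>As a self-contained sketch of one Gibbs sweep over binary units (plain Python, hypothetical names; note that this variant draws a binary value at each step, whereas the post's <code>_gibbs_step</code> stores the conditional probability itself):</p>

```python
import math, random

def gibbs_sample_boltzmann(W, b, n_steps, seed=0):
    """Gibbs-sample a vector of binary units from a Boltzmann machine with
    symmetric, zero-diagonal weights W and biases b. Each sweep resamples
    every unit from p(x_i = 1 | x_rest) = sigmoid(sum_j W[i][j] x_j + b[i])."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(n_steps):
        for i in range(n):
            z = sum(W[i][j] * x[j] for j in range(n) if j != i) + b[i]
            p = 1 / (1 + math.exp(-z))
            x[i] = 1.0 if rng.random() < p else 0.0
    return x

# toy 2-unit machine with a strong positive coupling: units tend to agree
W = [[0.0, 5.0], [5.0, 0.0]]
b = [0.0, 0.0]
sample = gibbs_sample_boltzmann(W, b, n_steps=50)
```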
<div class="highlight"><pre><span></span><span class="o">%%</span><span class="n">time</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">burn_in</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">every_n</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">CPU times: user 4min 10s, sys: 4.09 s, total: 4min 14s</span>
<span class="err">Wall time: 4min 17s</span>
</pre></div>
<p>Takes forever!</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/additional-strategies-partition-function/output_18_0.png"/></p>
<p>Nothing great. These samples are highly correlated, if not perfectly identical, as expected.</p>
<p>To generate better images, we'll have to let this run for a lot longer and "thin" the chain, keeping every <code>every_n</code>-th sample, where <code>every_n</code> is on the order of 1, 10, or 100.</p>
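<p>Burn-in and thinning reduce to simple list slicing, which is equivalent to the filtering done at the end of <code>sample</code> above. A minimal illustration:</p>

```python
# discard the first `burn_in` draws (the chain hasn't mixed yet),
# then keep every `every_n`-th draw of what remains (to reduce correlation)
chain = list(range(100))  # stand-in for 100 successive Gibbs draws
burn_in, every_n = 25, 10
kept = chain[burn_in::every_n]
```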
<h1>Summary</h1>
<p>In this post, we discussed four additional strategies for both speeding up, as well as outright avoiding, the computation of the gradient of the log-partition function <span class="math">\(\nabla\log{Z}\)</span>.</p>
<p>While we only presented toy models here, these strategies see successful application in larger undirected graphical models, as well as in directed conditional models for <span class="math">\(p(y\vert x)\)</span>. One key example of the latter is a language model: though its partition function is a sum over distinct values of <span class="math">\(y\)</span> (labels) instead of configurations of <span class="math">\(x\)</span> (inputs), it can still be expensive to compute, since there are as many distinct values of <span class="math">\(y\)</span> as there are tokens in the language's vocabulary, typically tens of thousands or more.</p>
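<p>To make the cost concrete: for a softmax-output language model, every exact likelihood evaluation requires the normalizer <span class="math">\(Z = \sum_y \exp(s_y)\)</span> over the whole vocabulary. A minimal sketch (the 50,000-token vocabulary size is illustrative):</p>

```python
import math

def log_softmax_denominator(logits):
    """log Z for a softmax over labels: log sum_y exp(s_y).
    For a language model this sum ranges over the entire vocabulary,
    so every exact likelihood evaluation costs O(|vocab|)."""
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in logits))

# with uniform scores over a 50,000-token "vocabulary", log Z = log(50,000)
log_Z = log_softmax_denominator([0.0] * 50_000)
```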
<p>Thanks for reading.</p>
<h2>Code</h2>
<p>The <a href="https://github.com/cavaunpeu/boltzmann-machines">repository</a> and <a href="https://nbviewer.jupyter.org/github/cavaunpeu/boltzmann-machines/blob/master/boltzmann-machines-part-2.ipynb">rendered notebook</a> for this project can be found at their respective links.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p>Ian Goodfellow, Yoshua Bengio, and Aaron Courville. <em>Deep Learning</em>. MIT Press, 2016. <a href="http://www.deeplearningbook.org">http://www.deeplearningbook.org</a> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://arxiv.org/abs/1805.03642">Adversarial Contrastive Estimation</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>A Thorough Introduction to Boltzmann Machines2018-10-20T14:00:00-04:002018-10-20T14:00:00-04:00Will Wolftag:willwolf.io,2018-10-20:/2018/10/20/thorough-introduction-to-boltzmann-machines/<p>A pedantic walk through Boltzmann machines, with focus on the computational thorn-in-side of the partition function.</p><p>The principal task of machine learning is to fit a model to some data. In programming terms, this model is an object with two methods:</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Model</span><span class="p">:</span>

    <span class="k">def</span> <span class="nf">likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">pass</span>

    <span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">):</span>
        <span class="k">pass</span>
</pre></div>
<h2>Likelihood</h2>
<p>How likely is the query point <span class="math">\(x\)</span> under our model? In other words, how likely was it that our model produced <span class="math">\(x\)</span>?</p>
<p>Note: The likelihood gives a value proportional to a valid probability, but is not necessarily a valid probability itself.</p>
<h2>Sample</h2>
<p>Draw a sample datum <span class="math">\(x\)</span> from the model.</p>
<h2>Notation</h2>
<p>Canonically, we denote an instance of our <code>Model</code> in mathematical syntax as follows:</p>
<div class="math">$$
x \sim p(x)
$$</div>
<p>Again, this simple notation implies two powerful methods: that we can evaluate the <code>likelihood</code> of having observed <span class="math">\(x\)</span> under our model <span class="math">\(p\)</span>, and that we can <code>sample</code> a new value <span class="math">\(x\)</span> from our model <span class="math">\(p\)</span>.</p>
<p>Often, we work instead with <em>conditional</em> models, e.g. <span class="math">\(y \sim p(y\vert x)\)</span>, in classification and regression tasks. The <code>likelihood</code> and <code>sample</code> methods apply all the same.</p>
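<p>To make this interface concrete, here is a toy model of independent coin flips implementing both methods. The class name and its internals are my own illustrative sketch, not a canonical implementation:</p>

```python
import numpy as np

class IndependentBernoulli:
    """A toy model exposing the two methods above: `likelihood` and `sample`."""

    def __init__(self, p):
        self.p = np.asarray(p)  # per-dimension probability of observing a 1

    def likelihood(self, x):
        # probability of the binary vector x under independent Bernoulli dimensions
        x = np.asarray(x)
        return float(np.prod(self.p ** x * (1 - self.p) ** (1 - x)))

    def sample(self, n_samples, seed=0):
        # draw n_samples binary vectors from the model
        rng = np.random.default_rng(seed)
        return (rng.random((n_samples, len(self.p))) < self.p).astype(int)

model = IndependentBernoulli([0.9, 0.1])
lik = model.likelihood([1, 0])       # 0.9 * (1 - 0.1) = 0.81
samples = model.sample(n_samples=5)  # a (5, 2) array of 0s and 1s
```

Any model worth its salt — a Boltzmann machine included — supports this same two-method contract; only the internals grow more complex.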
<h2>Boltzmann machines</h2>
<p>A Boltzmann machine is one of the simplest mechanisms for modeling <span class="math">\(p(x)\)</span>. It is an undirected graphical model where every dimension <span class="math">\(x_i\)</span> of a given observation <span class="math">\(x\)</span> influences every other dimension. As such, we might use it to model data which we believe to exhibit this property, e.g. an image (where intuitively, pixel values influence neighboring pixel values). For <span class="math">\(x \in \{0, 1\}^3\)</span>, our model would look as follows:</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/boltzmann-machine.svg"/></p>
<p>For <span class="math">\(x \in \{0, 1\}^n\)</span>, a given node <span class="math">\(x_i\)</span> would have <span class="math">\(n - 1\)</span> outgoing connections in total—one to every other node <span class="math">\(x_j\ \forall\ j \neq i\)</span>.</p>
<p>Finally, a Boltzmann machine strictly operates on <em>binary</em> data. This keeps things simple.</p>
<h2>Computing the likelihood</h2>
<p>A Boltzmann machine admits the following formula for computing the <code>likelihood</code> of data points <span class="math">\(x^{(1)}, ..., x^{(n)}\)</span>:</p>
<div class="math">$$
H(x) = \sum\limits_{i \neq j} w_{i, j} x_i x_j + \sum\limits_i b_i x_i
$$</div>
<div class="math">$$
p(x) = \frac{\exp{(H(x))}}{Z}
$$</div>
<div class="math">$$
\mathcal{L}(x^{(1)}, ..., x^{(n)}) = \prod\limits_{k=1}^n p(x^{(k)})
$$</div>
<p>Note:</p>
<ul>
<li>Since our weights can be negative, <span class="math">\(H(x)\)</span> can be negative. Since our likelihood is proportional to a valid probability, we'd prefer it to be non-negative.</li>
<li>To enforce this constraint, we exponentiate <span class="math">\(H(x)\)</span> in the second equation.</li>
<li>To normalize, we divide by the normalization constant <span class="math">\(Z\)</span>, i.e. the sum of the likelihoods of all possible values of <span class="math">\(x\)</span>.</li>
</ul>
<h2>Computing the partition function by hand</h2>
<p>In the case of 2-dimensional binary datum <span class="math">\(x\)</span>, there are 4 possible "configurations": <span class="math">\([0, 0], [0, 1], [1, 0], [1, 1]\)</span>. As such, to compute the likelihood of one of these configurations, e.g.</p>
<div class="math">$$
p([1, 0]) = \frac{\exp{(H([1, 0]))}}{\exp{(H([0, 0]))} + \exp{(H([0, 1]))} + \exp{(H([1, 0]))} + \exp{(H([1, 1]))}}
$$</div>
<p>we see that the normalization constant <span class="math">\(Z\)</span> is a sum of 4 terms.</p>
<p>More generally, given <span class="math">\(d\)</span>-dimensional <span class="math">\(x\)</span>, where each <span class="math">\(x_i\)</span> can assume one of <span class="math">\(v\)</span> distinct values, computing <span class="math">\(p(x)\)</span> implies a summation over <span class="math">\(v^d\)</span> terms. <strong>With a non-trivially large <span class="math">\(v\)</span> or <span class="math">\(d\)</span> this becomes intractable to compute.</strong></p>
<p>Below, we'll demonstrate how "tractability," i.e. "can we actually compute <span class="math">\(Z\)</span> before the end of the universe?" changes with varying <span class="math">\(d\)</span> for our Boltzmann machine (of <span class="math">\(v = 2\)</span>).</p>
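<p>As a sketch of what this brute-force computation looks like in code (my own illustration, simpler than the implementation below), we can enumerate all <span class="math">\(2^d\)</span> binary configurations and sum; note that the number of terms doubles with every added dimension:</p>

```python
import itertools
import numpy as np

def partition_function(weights, biases):
    """Brute-force Z: sum exp(H(x)) over all 2**d binary configurations."""
    d = len(biases)
    Z = 0.0
    for config in itertools.product([0, 1], repeat=d):
        x = np.array(config)
        # H(x) = sum_{i < j} w_ij x_i x_j + sum_i b_i x_i
        h = biases @ x
        for i, j in itertools.combinations(range(d), 2):
            h += weights[i, j] * x[i] * x[j]
        Z += np.exp(h)
    return Z

# with all-zero parameters, every configuration has H(x) = 0, so Z = 2**d
print(partition_function(np.zeros((3, 3)), np.zeros(3)))  # 8.0
```

At <span class="math">\(d = 3\)</span> the loop runs 8 times; at <span class="math">\(d = 30\)</span> it would run over a billion times.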
<h2>The likelihood function in code</h2>
<div class="highlight"><pre><span></span><span class="c1"># methods of `Model` (full class below); assumes `numpy` as `np` and `functools.reduce` are in scope</span>
<span class="k">def</span> <span class="nf">_unnormalized_likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_H</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">_H</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">var_combinations</span><span class="p">:</span>
        <span class="n">h</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
    <span class="n">h</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">@</span> <span class="n">x</span>
    <span class="k">return</span> <span class="n">h</span>

<span class="k">def</span> <span class="nf">likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    :param x: a vector of shape (n_units,) or (n, n_units),</span>
<span class="sd">        where the latter is a matrix of multiple data points</span>
<span class="sd">        for which to compute the joint likelihood.</span>
<span class="sd">    """</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_units</span><span class="p">:</span>
        <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">'Please pass 1 or more points of `n_units` dimensions'</span><span class="p">)</span>
    <span class="c1"># compute unnormalized likelihoods</span>
    <span class="n">multiple_samples</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">if</span> <span class="n">multiple_samples</span><span class="p">:</span>
        <span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">point</span><span class="p">)</span> <span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">)]</span>
    <span class="c1"># compute the partition function once, shared by all points</span>
    <span class="n">Z</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">config</span><span class="p">)</span> <span class="k">for</span> <span class="n">config</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">all_configs</span><span class="p">])</span>
    <span class="k">if</span> <span class="n">log</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lik</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">Z</span><span class="p">)</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">reduce</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">multiply</span><span class="p">,</span> <span class="p">[</span><span class="n">lik</span> <span class="o">/</span> <span class="n">Z</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
</pre></div>
<p>NB: In numerical Python code, explicit for-loops are slow; we should prefer vectorized <code>numpy</code> operations instead. Nevertheless, I've used for-loops here because they are easier to read.</p>
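<p>For the curious, one possible vectorized alternative (my own sketch, not the post's implementation) computes the energy <span class="math">\(H\)</span> for a whole batch at once, keeping only the strict upper triangle of the weight matrix so that each pair is counted exactly once:</p>

```python
import numpy as np

def H_vectorized(X, weights, biases):
    """Energies for a batch X of shape (n, d): for each row x,
    sum_{i < j} w_ij x_i x_j + sum_i b_i x_i, with no Python-level loop."""
    W = np.triu(weights, k=1)  # strict upper triangle: each pair counted once
    return np.einsum('ni,ij,nj->n', X, W, X) + X @ biases
```

The <code>einsum</code> contraction computes <code>sum_{i,j} X[n,i] * W[i,j] * X[n,j]</code> per row, which reduces to the pairwise sum because the lower triangle and diagonal of <code>W</code> are zero.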
<p>This code block is longer than you might expect because it includes a few supplementary behaviors, namely:</p>
<ul>
<li>Computing the likelihood of one or more points</li>
<li>Avoiding redundant computation of <code>Z</code></li>
<li>Optionally computing the log-likelihood</li>
</ul>
<h2>Training the model</h2>
<p>At the outset, the parameters <code>self.weights</code> and <code>self.biases</code> of our model are initialized at random. For the values returned by <code>likelihood</code> and <code>sample</code> to be useful, we must first fit these parameters to observed data.</p>
<p>To do so, we will employ the principle of maximum likelihood: compute the parameters that make the observed data maximally likely under the model, via gradient ascent.</p>
<h2>Gradients</h2>
<p>Since our model is simple, we can derive exact gradients by hand. We will work with the log-likelihood instead of the true likelihood to avoid issues of computational underflow. Below, we simplify this expression, then compute its various gradients.</p>
<h3><span class="math">\(\log{\mathcal{L}}\)</span></h3>
<div class="math">$$
\mathcal{L}(x^{(1)}, ..., x^{(n)}) = \prod\limits_{k=1}^n \frac{\exp{(H(x^{(k)}))}}{Z}
$$</div>
<div class="math">$$
\begin{align*}
\log{\mathcal{L}(x^{(1)}, ..., x^{(n)})}
&= \sum\limits_{k=1}^n \log{\frac{\exp{(H(x^{(k)}))}}{Z}}\\
&= \sum\limits_{k=1}^n \big[ \log{\big(\exp{(H(x^{(k)}))}\big)} - \log{Z} \big]\\
&= \sum\limits_{k=1}^n \big[ H(x^{(k)}) - \log{Z} \big]
\end{align*}
$$</div>
<p>This gives the total log-likelihood. Our aim is to maximize the expected likelihood with respect to the data-generating distribution.</p>
<h3>Expected likelihood</h3>
<div class="math">$$
\begin{align*}
\mathop{\mathbb{E}}_{x \sim p_{\text{data}}}\big[ \mathcal{L}(x) \big]
&= \sum\limits_{k=1}^N p_{\text{data}}(x = x^{(k)}) \mathcal{L}(x^{(k)})\\
&\approx \sum\limits_{k=1}^N \frac{1}{N} \mathcal{L}(x^{(k)})\\
&= \frac{1}{N} \sum\limits_{k=1}^N \mathcal{L}(x^{(k)})\\
\end{align*}
$$</div>
<p>In other words, we wish to maximize the average log-likelihood of our data under the model. Henceforth, we will refer to this quantity as <span class="math">\(\mathcal{L}\)</span>, i.e. <span class="math">\(\mathcal{L} = \frac{1}{N} \sum\limits_{k=1}^N H(x^{(k)}) - \log{Z}\)</span>.</p>
<p>Now, deriving the gradient with respect to our weights:</p>
<h3><span class="math">\(\nabla_{w_{i, j}}\log{\mathcal{L}}\)</span>:</h3>
<div class="math">$$
\begin{align*}
\nabla_{w_{i, j}} \bigg[ \frac{1}{N} \sum\limits_{k=1}^N H(x^{(k)}) - \log{Z} \bigg]
&= \frac{1}{N} \sum\limits_{k=1}^N \nabla_{w_{i, j}} H(x^{(k)}) - \nabla_{w_{i, j}} \log{Z}
\end{align*}
$$</div>
<h3>First term:</h3>
<div class="math">$$
\begin{align*}
\frac{1}{N} \sum\limits_{k=1}^N \nabla_{w_{i, j}} H(x^{(k)})
&= \frac{1}{N} \sum\limits_{k=1}^N \nabla_{w_{i, j}} \bigg[ \sum\limits_{i \neq j} w_{i, j} x_i^{(k)} x_j^{(k)} + \sum\limits_i b_i x_i^{(k)} \bigg]\\
&= \frac{1}{N} \sum\limits_{k=1}^N x_i^{(k)} x_j^{(k)}\\
&\approx \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} [x_i x_j]
\end{align*}
$$</div>
<h3>Second term:</h3>
<p>NB: <span class="math">\(\sum\limits_{x}\)</span> implies a summation over all <span class="math">\(v^d\)</span> possible configurations of <span class="math">\(x\)</span>.</p>
<div class="math">$$
\begin{align*}
\nabla_{w_{i, j}} \log{Z}
&= \nabla_{w_{i, j}} \log{\sum\limits_{x} \exp{(H(x))}}\\
&= \frac{1}{\sum\limits_{x} \exp{(H(x))}} \nabla_{w_{i, j}} \sum\limits_{x} \exp{(H(x))}\\
&= \frac{1}{Z} \sum\limits_{x} \nabla_{w_{i, j}} \exp{(H(x))}\\
&= \frac{1}{Z} \sum\limits_{x} \exp{(H(x))} \nabla_{w_{i, j}} H(x)\\
&= \sum\limits_{x} \frac{\exp{(H(x))}}{Z} \nabla_{w_{i, j}} H(x)\\
&= \sum\limits_{x} p(x) \nabla_{w_{i, j}} H(x)\\
&= \sum\limits_{x} p(x)\, x_i x_j\\
&= \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]
\end{align*}
$$</div>
<h3>Putting it back together</h3>
<p>Combining these constituent parts, we arrive at the following formula:</p>
<div class="math">$$
\nabla_{w_{i, j}}\log{\mathcal{L}} = \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} [x_i x_j] - \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]
$$</div>
<p>Finally, following the same logic, we derive the exact gradient with respect to our biases:</p>
<div class="math">$$
\nabla_{b_i}\log{\mathcal{L}} = \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} [x_i] - \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i]
$$</div>
<p>The first and second terms of each gradient are called, respectively, <strong>the positive and negative phases.</strong></p>
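<p>As a sanity check (my own, not part of the derivation above), we can verify this gradient formula numerically on a tiny <span class="math">\(d = 2\)</span> machine, comparing the analytic expression against a finite-difference approximation of the average log-likelihood:</p>

```python
import itertools
import numpy as np

def H(x, w, b):
    # energy: sum_{i < j} w_ij x_i x_j + sum_i b_i x_i
    h = b @ x
    for i, j in itertools.combinations(range(len(x)), 2):
        h += w[i, j] * x[i] * x[j]
    return h

def avg_log_likelihood(data, w, b):
    # (1/N) sum_k H(x^(k)) - log Z, with Z computed by brute-force enumeration
    configs = [np.array(c) for c in itertools.product([0, 1], repeat=data.shape[1])]
    log_Z = np.log(sum(np.exp(H(c, w, b)) for c in configs))
    return sum(H(x, w, b) for x in data) / len(data) - log_Z

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(50, 2))
w, b = rng.normal(size=(2, 2)), rng.normal(size=2)

# analytic gradient for w_{0,1}: E_data[x_0 x_1] - E_model[x_0 x_1]
configs = [np.array(c) for c in itertools.product([0, 1], repeat=2)]
Z = sum(np.exp(H(c, w, b)) for c in configs)
negative_phase = sum(c[0] * c[1] * np.exp(H(c, w, b)) / Z for c in configs)
grad_analytic = (data[:, 0] * data[:, 1]).mean() - negative_phase

# central finite-difference approximation of the same gradient
eps = 1e-6
w_hi, w_lo = w.copy(), w.copy()
w_hi[0, 1] += eps
w_lo[0, 1] -= eps
grad_numeric = (avg_log_likelihood(data, w_hi, b) - avg_log_likelihood(data, w_lo, b)) / (2 * eps)
assert abs(grad_analytic - grad_numeric) < 1e-6
```

The two quantities agree to within finite-difference error, which is the behavior the derivation above predicts.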
<h2>Computing the positive phase</h2>
<p>In the following toy example, our data are small: we can compute the positive phase using all of the training data, i.e. <span class="math">\(\frac{1}{N} \sum\limits_{k=1}^N x_i^{(k)} x_j^{(k)}\)</span>. Were our data bigger, we could approximate this expectation with a mini-batch of training data (much like SGD).</p>
<h2>Computing the negative phase</h2>
<p>Again, this term asks us to compute an expectation under the model, i.e. a sum over every possible data configuration in the support of our model, which is <span class="math">\(O(v^d)\)</span>. <strong>With non-trivially large <span class="math">\(v\)</span> or <span class="math">\(d\)</span>, this becomes intractable to compute.</strong></p>
<p>Below, we'll begin our toy example by computing the true negative-phase, <span class="math">\(\mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]\)</span>, with varying data dimensionalities <span class="math">\(d\)</span>. Then, once this computation becomes slow, we'll look to approximate this expectation later on.</p>
<h2>Parameter updates in code</h2>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">update_parameters_with_true_negative_phase</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">):</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span>
    <span class="n">model_distribution</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">config</span><span class="p">),</span> <span class="n">model</span><span class="o">.</span><span class="n">likelihood</span><span class="p">(</span><span class="n">config</span><span class="p">))</span> <span class="k">for</span> <span class="n">config</span> <span class="ow">in</span> <span class="n">all_configs</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">var_combinations</span><span class="p">:</span>
        <span class="c1"># positive phase</span>
        <span class="n">positive_phase</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
        <span class="c1"># negative phase</span>
        <span class="n">negative_phase</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">config</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">config</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">likelihood</span> <span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">model_distribution</span><span class="p">])</span>
        <span class="c1"># update weights</span>
        <span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">biases</span><span class="p">):</span>
        <span class="c1"># positive phase</span>
        <span class="n">positive_phase</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
        <span class="c1"># negative phase</span>
        <span class="n">negative_phase</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">config</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">likelihood</span> <span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">model_distribution</span><span class="p">])</span>
        <span class="c1"># update biases</span>
        <span class="n">biases</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">weights</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">biases</span><span class="p">)</span>
</pre></div>
<h2>Train model, visualize model distribution</h2>
<p>Finally, we're ready to train. Using the true negative phase, let's train our model for 100 epochs with <span class="math">\(d=3\)</span> then visualize results.</p>
<div class="highlight"><pre><span></span><span class="n">seed</span> <span class="o">=</span> <span class="mi">42</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">01</span>
<span class="k">def</span> <span class="nf">reset_data_and_parameters</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">n_obs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="p">[</span><span class="o">.</span><span class="mi">8</span><span class="p">,</span> <span class="o">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">.</span><span class="mi">5</span><span class="p">]):</span>
<span class="sd">"""</span>
<span class="sd"> Generate training data, weights, biases, and a list of all data configurations</span>
<span class="sd"> in our model's support.</span>
<span class="sd"> In addition, generate a list of tuples of the indices of adjacent nodes, which</span>
<span class="sd"> we'll use to update parameters without duplication.</span>
<span class="sd"> For example, with `n_units=3`, we generate a matrix of weights with shape (3, 3);</span>
<span class="sd"> however, there are only 3 distinct weights in this matrix that we'll actually</span>
<span class="sd"> want to update: those connecting Node 0 --> Node 1, Node 1 --> Node 2, and</span>
<span class="sd"> Node 0 --> Node 2. This function returns a list containing these tuples</span>
<span class="sd"> named `var_combinations`.</span>
<span class="sd"> :param n_units: the dimensionality of our data `d`</span>
<span class="sd"> :param n_obs: the number of observations in our training set</span>
<span class="sd"> :param p: a vector of the probabilities of observing a 1 in each index</span>
<span class="sd"> of the training data. The length of this vector must equal `n_units`</span>
<span class="sd"> :return: weights, biases, var_combinations, all_configs, data</span>
<span class="sd"> """</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="c1"># initialize data</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_obs</span><span class="p">,</span> <span class="n">n_units</span><span class="p">))</span>
<span class="c1"># initialize parameters</span>
<span class="n">biases</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_units</span><span class="p">,</span> <span class="n">n_units</span><span class="p">)</span>
<span class="c1"># a few other pieces we'll need</span>
<span class="n">var_combinations</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">combinations</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n_units</span><span class="p">),</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">all_configs</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">product</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">repeat</span><span class="o">=</span><span class="n">n_units</span><span class="p">))</span>
<span class="k">return</span> <span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span>
<span class="k">class</span> <span class="nc">Model</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="n">weights</span>
<span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="n">biases</span>
<span class="bp">self</span><span class="o">.</span><span class="n">var_combinations</span> <span class="o">=</span> <span class="n">var_combinations</span>
<span class="bp">self</span><span class="o">.</span><span class="n">all_configs</span> <span class="o">=</span> <span class="n">all_configs</span>
<span class="bp">self</span><span class="o">.</span><span class="n">n_units</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">)</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">_inv_logit</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">_unnormalized_likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_H</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">_H</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">var_combinations</span><span class="p">:</span>
<span class="n">h</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
<span class="n">h</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">@</span> <span class="n">x</span>
<span class="k">return</span> <span class="n">h</span>
<span class="k">def</span> <span class="nf">likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> :param x: a vector of shape (n_units,) or (n, n_units),</span>
<span class="sd"> where the latter is a matrix of multiple data points</span>
<span class="sd"> for which to compute the joint likelihood.</span>
<span class="sd"> """</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_units</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">'Please pass 1 or more points of `n_units` dimensions'</span><span class="p">)</span>
<span class="c1"># compute unnormalized likelihoods</span>
<span class="n">multiple_samples</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
<span class="k">if</span> <span class="n">multiple_samples</span><span class="p">:</span>
<span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">point</span><span class="p">)</span> <span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">likelihood</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">x</span><span class="p">)]</span>
<span class="c1"># compute partition function</span>
<span class="n">Z</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_unnormalized_likelihood</span><span class="p">(</span><span class="n">config</span><span class="p">)</span> <span class="k">for</span> <span class="n">config</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">all_configs</span><span class="p">])</span>
<span class="k">if</span> <span class="n">log</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">lik</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">Z</span><span class="p">)</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">reduce</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">multiply</span><span class="p">,</span> <span class="p">[</span><span class="n">lik</span> <span class="o">/</span> <span class="n">Z</span> <span class="k">for</span> <span class="n">lik</span> <span class="ow">in</span> <span class="n">likelihood</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">init_sample</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">burn_in</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">every_n</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="n">seed</span><span class="p">)</span> <span class="o">-></span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">:</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="k">if</span> <span class="n">burn_in</span> <span class="o">></span> <span class="n">n_samples</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Can't burn in for more samples than there are in the chain"</span><span class="p">)</span>
<span class="n">init_sample</span> <span class="o">=</span> <span class="n">init_sample</span> <span class="ow">or</span> <span class="p">[</span><span class="mi">0</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
<span class="n">samples</span> <span class="o">=</span> <span class="p">[</span><span class="n">init_sample</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">_gibbs_step</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span>
<span class="n">z</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">sample</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">))</span> <span class="k">if</span> <span class="n">j</span> <span class="o">!=</span> <span class="n">i</span><span class="p">])</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">p</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_inv_logit</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_samples</span><span class="p">):</span>
<span class="n">sample</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># make copy</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sample</span><span class="p">):</span>
<span class="n">sample</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">_gibbs_step</span><span class="p">(</span><span class="n">sample</span><span class="o">=</span><span class="n">sample</span><span class="p">,</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
<span class="n">samples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">sample</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">sample</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">samples</span><span class="p">[</span><span class="n">burn_in</span><span class="p">:])</span> <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="n">every_n</span> <span class="o">==</span> <span class="mi">0</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">update_parameters_with_true_negative_phase</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span>
<span class="n">model_distribution</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">config</span><span class="p">),</span> <span class="n">model</span><span class="o">.</span><span class="n">likelihood</span><span class="p">(</span><span class="n">config</span><span class="p">))</span> <span class="k">for</span> <span class="n">config</span> <span class="ow">in</span> <span class="n">all_configs</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">var_combinations</span><span class="p">:</span>
<span class="c1"># positive phase</span>
<span class="n">positive_phase</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># negative phase</span>
<span class="n">negative_phase</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">config</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">config</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">likelihood</span> <span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">model_distribution</span><span class="p">])</span>
<span class="c1"># update weights</span>
<span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">biases</span><span class="p">):</span>
<span class="c1"># positive phase</span>
<span class="n">positive_phase</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># negative phase</span>
<span class="n">negative_phase</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">config</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">likelihood</span> <span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">model_distribution</span><span class="p">])</span>
<span class="c1"># update biases</span>
<span class="n">biases</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">weights</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">biases</span><span class="p">)</span>
</pre></div>
<h2>Train</h2>
<div class="highlight"><pre><span></span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">reset_data_and_parameters</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="p">[</span><span class="o">.</span><span class="mi">8</span><span class="p">,</span> <span class="o">.</span><span class="mi">1</span><span class="p">,</span> <span class="o">.</span><span class="mi">5</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">weights</span><span class="p">,</span> <span class="n">biases</span> <span class="o">=</span> <span class="n">update_parameters_with_true_negative_phase</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">lik</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span><span class="o">.</span><span class="n">likelihood</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">log</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Epoch: </span><span class="si">{</span><span class="n">i</span><span class="si">:</span><span class="s1">2</span><span class="si">}</span><span class="s1"> | Likelihood: </span><span class="si">{</span><span class="n">lik</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Epoch</span><span class="o">:</span> <span class="mi">0</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">209.63758306786653</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">10</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">162.04280784271083</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">20</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">160.49961381649555</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">30</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">159.79539070373576</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">40</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">159.2853717231018</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">50</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">158.90186293631422</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">60</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">158.6084020645482</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">70</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">158.38094343579155</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">80</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">158.20287017780586</span>
<span class="n">Epoch</span><span class="o">:</span> <span class="mi">90</span> <span class="o">|</span> <span class="n">Likelihood</span><span class="o">:</span> <span class="o">-</span><span class="mf">158.06232196551673</span>
</pre></div>
<h2>Visualize samples</h2>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">plot_n_samples</span><span class="p">(</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> NB: We add some jitter to the points so as to better visualize density in a given corner of the model.</span>
<span class="sd"> """</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">projection</span><span class="o">=</span><span class="s1">'3d'</span><span class="p">)</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n_samples</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">samples</span><span class="p">))</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">*</span> <span class="o">.</span><span class="mi">05</span>
<span class="n">y</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="o">*</span> <span class="o">.</span><span class="mi">05</span>
<span class="n">z</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">z</span><span class="p">))</span> <span class="o">*</span> <span class="o">.</span><span class="mi">05</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s1">'Node 0'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s1">'Node 1'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_zlabel</span><span class="p">(</span><span class="s1">'Node 2'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">n_samples</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">plot_n_samples</span><span class="p">(</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">n_samples</span><span class="si">}</span><span class="s1"> Samples from Model'</span><span class="p">)</span>
</pre></div>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/output_7_0.png"/></p>
<p>The plot roughly matches the data-generating distribution: most points assume values of either <span class="math">\([1, 0, 1]\)</span>, or <span class="math">\([1, 0, 0]\)</span> (given <span class="math">\(p=[.8, .1, .5]\)</span>).</p>
<h2>Sampling, via Gibbs</h2>
<p>The second and final method we need to implement is <code>sample</code>. In a Boltzmann machine, we typically do this via <a href="http://www.mit.edu/~ilkery/papers/GibbsSampling.pdf">Gibbs sampling</a>.</p>
<p>To effectuate this sampling scheme, we'll need a model of each data dimension conditional on the other data dimensions. For example, for <span class="math">\(d=3\)</span>, we'll need to define:</p>
<ul>
<li><span class="math">\(p(x_0\vert x_1, x_2)\)</span></li>
<li><span class="math">\(p(x_1\vert x_0, x_2)\)</span></li>
<li><span class="math">\(p(x_2\vert x_0, x_1)\)</span></li>
</ul>
<p>Given that each dimension must assume a 0 or a 1, the above 3 models must necessarily return the probability of observing a 1 (where 1 minus this value gives the probability of observing a 0).</p>
<p>Let's derive these models using the workhorse axiom of conditional probability, starting with the first:</p>
<div class="math">$$
\begin{align*}
p(x_0 = 1\vert x_1, x_2)
&= \frac{p(x_0 = 1, x_1, x_2)}{p(x_1, x_2)}\\
&= \frac{p(x_0 = 1, x_1, x_2)}{\sum\limits_{x_0 \in [0, 1]} p(x_0, x_1, x_2)}\\
&= \frac{p(x_0 = 1, x_1, x_2)}{p(x_0 = 0, x_1, x_2) + p(x_0 = 1, x_1, x_2)}\\
&= \frac{1}{1 + \frac{p(x_0 = 0, x_1, x_2)}{p(x_0 = 1, x_1, x_2)}}\\
&= \frac{1}{1 + \frac{\exp{(H(x_0 = 0, x_1, x_2))}}{\exp{(H(x_0 = 1, x_1, x_2))}}}\\
&= \frac{1}{1 + \exp{(H(x_0 = 0, x_1, x_2) - H(x_0 = 1, x_1, x_2))}}\\
&= \frac{1}{1 + \exp{\big(\sum\limits_{i \neq j:\ i, j \neq 0} w_{i, j} x_i x_j + \sum\limits_{i \neq 0} b_i x_i - \big(\sum\limits_{i \neq j:\ i, j \neq 0} w_{i, j} x_i x_j + \sum\limits_{i \neq 0} b_i x_i + \sum\limits_{j \neq 0} w_{0, j} x_j + b_0\big)\big)}}\\
&= \frac{1}{1 + \exp{\big(-\sum\limits_{j \neq 0} w_{0, j} x_j - b_0\big)}}\\
&= \sigma\bigg(\sum\limits_{j \neq 0} w_{0, j} x_j + b_0\bigg)\\
\end{align*}
$$</div>
<p>Pleasantly enough, this model resolves to a simple Binomial GLM, i.e. logistic regression, involving only its neighboring units and the weights that connect them.</p>
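<p>We can sanity-check this result numerically: for a small, hypothetical set of parameters (the weights, biases, and <code>energy</code> helper below are illustrative, not the ones trained above), the conditional computed by brute force from the unnormalized joint should match the closed-form sigmoid.</p>

```python
import itertools

import numpy as np


def energy(x, W, b):
    # H(x) = sum_{i < j} w_ij * x_i * x_j + b . x, as in the post
    x = np.array(x)
    pairwise = sum(W[i, j] * x[i] * x[j]
                   for i, j in itertools.combinations(range(len(x)), 2))
    return pairwise + b @ x


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# hypothetical 3-unit parameters (upper-triangular weights)
W = np.array([[0., .5, -.3],
              [0., 0., .2],
              [0., 0., 0.]])
b = np.array([.1, -.4, .25])

# brute force: p(x_0 = 1 | x_1 = 1, x_2 = 0) from the unnormalized joint
x1, x2 = 1, 0
num = np.exp(energy([1, x1, x2], W, b))
den = num + np.exp(energy([0, x1, x2], W, b))
brute_force = num / den

# closed form: sigma(w_01 * x_1 + w_02 * x_2 + b_0)
closed_form = sigmoid(W[0, 1] * x1 + W[0, 2] * x2 + b[0])

assert np.isclose(brute_force, closed_form)
```

<p>The two agree for any setting of the neighboring units, confirming that each unit's conditional really is a logistic regression on its neighbors.</p>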
<p>With the requisite conditionals in hand, let's run this chain and compare it with our (trained) model's true probability distribution.</p>
<h2>True probability distribution</h2>
<div class="highlight"><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span>
<span class="n">distribution</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">config</span><span class="p">),</span> <span class="n">model</span><span class="o">.</span><span class="n">likelihood</span><span class="p">(</span><span class="n">config</span><span class="p">))</span> <span class="k">for</span> <span class="n">config</span> <span class="ow">in</span> <span class="n">all_configs</span><span class="p">]</span>
<span class="k">assert</span> <span class="nb">sum</span><span class="p">([</span><span class="n">likelihood</span> <span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">distribution</span><span class="p">])</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">distribution</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">list</span><span class="p">(</span><span class="n">config</span><span class="p">)</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="n">likelihood</span><span class="si">:</span><span class="s1">.4</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">[0, 0, 0]: 0.07327</span>
<span class="err">[0, 0, 1]: 0.09227</span>
<span class="err">[0, 1, 0]: 0.01366</span>
<span class="err">[0, 1, 1]: 0.01938</span>
<span class="err">[1, 0, 0]: 0.3351</span>
<span class="err">[1, 0, 1]: 0.3622</span>
<span class="err">[1, 1, 0]: 0.04693</span>
<span class="err">[1, 1, 1]: 0.05715</span>
</pre></div>
<h2>Empirical probability distribution, via Gibbs</h2>
<div class="highlight"><pre><span></span><span class="n">empirical_dist</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="n">n_samples</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">:</span>
<span class="n">empirical_dist</span><span class="p">[</span><span class="nb">tuple</span><span class="p">(</span><span class="n">sample</span><span class="p">)]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">n_samples</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">empirical_dist</span><span class="o">.</span><span class="n">values</span><span class="p">()),</span> <span class="mi">8</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">config</span><span class="p">,</span> <span class="n">likelihood</span> <span class="ow">in</span> <span class="n">distribution</span><span class="p">:</span>
<span class="n">empirical_probability</span> <span class="o">=</span> <span class="n">empirical_dist</span><span class="p">[</span><span class="nb">tuple</span><span class="p">(</span><span class="n">config</span><span class="p">)]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="nb">list</span><span class="p">(</span><span class="n">config</span><span class="p">)</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="n">likelihood</span><span class="si">:</span><span class="s1">.4</span><span class="si">}</span><span class="s1"> (true), </span><span class="si">{</span><span class="n">empirical_probability</span><span class="si">:</span><span class="s1">.4</span><span class="si">}</span><span class="s1"> (empirical)'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">[0, 0, 0]: 0.07327 (true), 0.05102 (empirical)</span>
<span class="err">[0, 0, 1]: 0.09227 (true), 0.09184 (empirical)</span>
<span class="err">[0, 1, 0]: 0.01366 (true), 0.0102 (empirical)</span>
<span class="err">[0, 1, 1]: 0.01938 (true), 0.02041 (empirical)</span>
<span class="err">[1, 0, 0]: 0.3351 (true), 0.3673 (empirical)</span>
<span class="err">[1, 0, 1]: 0.3622 (true), 0.398 (empirical)</span>
<span class="err">[1, 1, 0]: 0.04693 (true), 0.03061 (empirical)</span>
<span class="err">[1, 1, 1]: 0.05715 (true), 0.03061 (empirical)</span>
</pre></div>
<p>Close enough, ish.</p>
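<p>To put a number on "close," we can compute the total variation distance between the two distributions, plugging in the probabilities from the printout above:</p>

```python
# true and empirical probabilities, copied from the printout above
true_p      = [0.07327, 0.09227, 0.01366, 0.01938, 0.3351, 0.3622, 0.04693, 0.05715]
empirical_p = [0.05102, 0.09184, 0.0102,  0.02041, 0.3673, 0.398,  0.03061, 0.03061]

# total variation distance: half the L1 distance between the two distributions
tv = 0.5 * sum(abs(t - e) for t, e in zip(true_p, empirical_p))
print(f'TV distance: {tv:.4f}')  # roughly 0.07
```

<p>A total variation distance of roughly 0.07, from only ~100 retained Gibbs samples, is a reasonable match.</p>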
<h2>Scaling up, and hitting the bottleneck</h2>
<p>For data of varying dimensionality <code>n_units</code>, the following plot gives the time in seconds it takes to train this model for 10 epochs.</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/output_15_1.png"/></p>
<p>To reduce computational burden, and/or to fit a Boltzmann machine to data of non-trivial dimensionality (e.g. a 28x28 grey-scale image, which implies a random variable with 28x28=784 dimensions), we need to compute the positive and/or negative phase of our gradient faster than we currently are.</p>
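<p>The root of the problem is that both the partition function and the exact negative phase sum over every binary configuration of the units, of which there are <code>2**n_units</code>. A quick sketch of how fast that blows up:</p>

```python
# the exact negative phase (and partition function Z) sum over all
# 2**n_units binary configurations -- intractable for even modest n_units
for n_units in [3, 10, 20, 784]:
    print(f'n_units={n_units:4d}: {float(2**n_units):.3g} configurations')
```

<p>At 784 units, the number of configurations is on the order of <code>1e236</code>: enumerating them is out of the question.</p>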
<p>To compute the former more quickly, we could employ mini-batches as in canonical stochastic gradient descent.</p>
<p>In this post, we'll instead focus on ways to speed up the latter. Revisiting its expression, <span class="math">\(\mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]\)</span>, we readily see that we can create an unbiased estimator for this value by drawing Monte Carlo samples from our model, i.e.</p>
<div class="math">$$
\mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j] \approx \frac{1}{N}\sum\limits_{k=1}^N x^{(k)}_i x^{(k)}_j\quad\text{where}\quad x^{(k)} \sim p_{\text{model}}
$$</div>
<p>So, now we just need a way to draw these samples. Luckily, we have a Gibbs sampler to tap!</p>
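<p>The post's <code>Model.sample</code> handles this for us; for intuition, a minimal standalone Gibbs sampler for a fully-visible Boltzmann machine might look as follows. This is a sketch, assuming a symmetric weight matrix with zero diagonal, not the repository's implementation: under the Boltzmann energy, the conditional of each unit given all others is a sigmoid of that unit's total input.</p>

```python
import numpy as np

def gibbs_sample(weights, biases, n_samples=1000, n_burn_in=100, seed=42):
    """Draw samples from a fully-visible Boltzmann machine via Gibbs sampling.

    `weights` is a symmetric (n, n) matrix with zero diagonal; `biases` is a
    length-n vector. (Hypothetical names; a sketch of what a sampler could do.)
    """
    rng = np.random.RandomState(seed)
    n_units = len(biases)
    x = rng.randint(0, 2, size=n_units).astype(float)
    samples = []
    for step in range(n_burn_in + n_samples):
        for i in range(n_units):
            # p(x_i = 1 | x_{-i}) is a sigmoid of unit i's total input
            activation = weights[i] @ x + biases[i]
            p_on = 1.0 / (1.0 + np.exp(-activation))
            x[i] = float(rng.uniform() < p_on)
        if step >= n_burn_in:
            samples.append(x.copy())
    return np.array(samples)
```

Each full sweep resamples every unit once; we discard an initial burn-in so the chain can forget its random starting state.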
<p><strong>Instead of computing the true negative phase, i.e. summing <span class="math">\(x_i x_j\)</span> over all permissible configurations <span class="math">\(X\)</span> under our model, we can approximate it by evaluating this expression for a few model samples, then taking the mean.</strong></p>
<p>We define this update mechanism here:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">update_parameters_with_gibbs_sampling</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span>
<span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">model_samples</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">biases</span><span class="p">,</span> <span class="n">var_combinations</span><span class="p">,</span> <span class="n">all_configs</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="n">seed</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">var_combinations</span><span class="p">:</span>
<span class="c1"># positive phase</span>
<span class="n">positive_phase</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># negative phase</span>
<span class="n">negative_phase</span> <span class="o">=</span> <span class="p">(</span><span class="n">model_samples</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">model_samples</span><span class="p">[:,</span> <span class="n">j</span><span class="p">])</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># update weights</span>
<span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">biases</span><span class="p">):</span>
<span class="c1"># positive phase</span>
<span class="n">positive_phase</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># negative phase</span>
<span class="n">negative_phase</span> <span class="o">=</span> <span class="n">model_samples</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># update biases</span>
<span class="n">biases</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">positive_phase</span> <span class="o">-</span> <span class="n">negative_phase</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">weights</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">biases</span><span class="p">)</span>
</pre></div>
<p>Next, we'll define a function parameterized by an update algorithm (here, either computing the true negative phase or approximating it via Gibbs sampling) which trains a model for <span class="math">\(n\)</span> epochs and returns the data requisite for plotting.</p>
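<p>The actual function lives in the linked repository; schematically, it might look like the following. This is a hypothetical, generic sketch (with made-up names): the real version also constructs the model, generates the data, and tracks each run's configuration.</p>

```python
import time

import pandas as pd

def train_for_n_epochs(update_fn, init_params, n_epochs, score_fn, **kwargs):
    """Apply `update_fn` once per (full-batch) epoch; record `score_fn`
    (e.g. the data's log-likelihood) and per-epoch wall-clock time."""
    params = init_params
    rows = []
    for epoch in range(n_epochs):
        start = time.time()
        params = update_fn(params, **kwargs)
        rows.append({'epoch': epoch,
                     'score': score_fn(params),
                     'seconds': time.time() - start})
    # one row per epoch, ready for Seaborn to aggregate and plot
    return pd.DataFrame(rows)
```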
<h2>How does training progress for varying data dimensionalities?</h2>
<p>Finally, for data of <code>n_units</code> 3, 4, 5, etc., let’s train models for 100 epochs and plot likelihood curves.</p>
<p>When training with the approximate negative phase, we’ll:</p>
<ul>
<li>Derive model samples from a <strong>1000-sample Gibbs chain. Of course, this is a parameter we can tune, which will affect both model accuracy and training runtime. However, we don’t explore that in this post;</strong> instead, we just pick something reasonable and hold this value constant throughout our experiments.</li>
<li>Train several models for a given <code>n_units</code>; Seaborn will average the results for us, then plot a single line.</li>
</ul>
<div class="highlight"><pre><span></span><span class="n">all_updates</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">1</span>
<span class="n">min_units</span><span class="p">,</span> <span class="n">max_units</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">7</span>
<span class="k">for</span> <span class="n">n_units</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">min_units</span><span class="p">,</span> <span class="n">max_units</span><span class="p">):</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">n_epochs</span> <span class="o">=</span> <span class="mi">100</span>
<span class="k">if</span> <span class="n">n_units</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">alpha</span> <span class="o">/=</span> <span class="mi">10</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">updates</span> <span class="o">=</span> <span class="n">train_model_for_n_epochs</span><span class="p">(</span>
<span class="n">update_parameters_with_gibbs_sampling</span><span class="p">,</span>
<span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">,</span>
<span class="n">epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">run_num</span><span class="o">=</span><span class="n">run</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span>
<span class="n">p</span><span class="o">=</span><span class="n">p</span>
<span class="p">)</span>
<span class="n">all_updates</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">))</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">):</span>
<span class="n">updates</span> <span class="o">=</span> <span class="n">train_model_for_n_epochs</span><span class="p">(</span>
<span class="n">update_parameters_with_true_negative_phase</span><span class="p">,</span>
<span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">,</span>
<span class="n">epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">run_num</span><span class="o">=</span><span class="n">run</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span>
<span class="n">p</span><span class="o">=</span><span class="n">p</span>
<span class="p">)</span>
<span class="n">all_updates</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">))</span>
</pre></div>
<h2>Plot</h2>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/output_23_1.png"/></p>
<p><strong>When we let each algorithm run for 100 epochs, the true negative phase gives a model which assigns higher likelihood to the observed data in all of the above training runs.</strong></p>
<p>Nonetheless, the central point is that 100 epochs with the true negative phase takes a long time to run.</p>
<p>As such, let’s run each for an equal amount of time, and plot results. Below, we define a function to train models for <span class="math">\(n\)</span> seconds (or 1 epoch, whichever comes first).</p>
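<p>Schematically, the time-budgeted harness might look like the following (a hypothetical sketch, checking the clock between epochs; the real version returns the same plotting data as its fixed-epoch counterpart, and evidently enforces its budget strictly enough that a sufficiently slow configuration can complete zero epochs):</p>

```python
import time

def train_for_n_seconds(update_fn, init_params, n_seconds, **kwargs):
    """Run full-epoch updates until the wall-clock budget is spent,
    returning the final parameters and the number of epochs completed."""
    params = init_params
    epochs_completed = 0
    start = time.time()
    while time.time() - start < n_seconds:
        params = update_fn(params, **kwargs)
        epochs_completed += 1
    return params, epochs_completed
```

With a cheap <code>update_fn</code> (the Gibbs approximation), many epochs fit inside the budget; with an expensive one (the true negative phase at large <code>n_units</code>), a single epoch can consume it entirely.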
<div class="highlight"><pre><span></span><span class="n">all_updates</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">n_seconds</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">1</span>
<span class="n">min_units</span><span class="p">,</span> <span class="n">max_units</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">10</span>
<span class="k">for</span> <span class="n">n_units</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">min_units</span><span class="p">,</span> <span class="n">max_units</span><span class="p">):</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">n_units</span><span class="p">)</span>
<span class="n">n_epochs</span> <span class="o">=</span> <span class="mi">100</span>
<span class="k">if</span> <span class="n">n_units</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">alpha</span> <span class="o">/=</span> <span class="mi">10</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">updates</span> <span class="o">=</span> <span class="n">train_model_for_n_seconds</span><span class="p">(</span>
<span class="n">update_parameters_with_gibbs_sampling</span><span class="p">,</span>
<span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">,</span>
<span class="n">n_seconds</span><span class="o">=</span><span class="n">n_seconds</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">run_num</span><span class="o">=</span><span class="n">run</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span>
<span class="n">p</span><span class="o">=</span><span class="n">p</span>
<span class="p">)</span>
<span class="n">all_updates</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">))</span>
<span class="k">for</span> <span class="n">run</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">):</span>
<span class="n">updates</span> <span class="o">=</span> <span class="n">train_model_for_n_seconds</span><span class="p">(</span>
<span class="n">update_parameters_with_true_negative_phase</span><span class="p">,</span>
<span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">,</span>
<span class="n">n_seconds</span><span class="o">=</span><span class="n">n_seconds</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">run_num</span><span class="o">=</span><span class="n">run</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span>
<span class="n">p</span><span class="o">=</span><span class="n">p</span>
<span class="p">)</span>
<span class="n">all_updates</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">n_units</span><span class="o">=</span><span class="n">n_units</span><span class="p">))</span>
</pre></div>
<h2>How many epochs do we actually get through?</h2>
<p>Before plotting results, let’s examine how many epochs each algorithm completes in its allotted time. In fact, for some values of <code>n_units</code>, we couldn’t even complete a single epoch (when computing the true negative phase) in <span class="math">\(\leq 1\)</span> second.</p>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/output_28_1.png"/></p>
<p>Finally, we look at performance. With <code>n_units</code> below 7, we see that 1 second of training with the true negative phase yields the better model. Conversely, <strong>at 7 or more units, the added accuracy of the true negative phase is overshadowed by the amount of time it takes the model to train.</strong></p>
<h2>Plot</h2>
<p><img alt="png" class="img-responsive" src="https://willwolf.io/figures/thorough-introduction-to-boltzmann-machines/output_31_1.png"/></p>
<p>Of course, we re-stress that the exact ablation results are conditional on, among other things, <strong>the number of Gibbs samples we chose to draw. Changing this value will change the numbers, but not what we care about most: the overall trend.</strong></p>
<h2>Summary</h2>
<p>Throughout this post, we've given a thorough introduction to a Boltzmann machine: what it does, how it trains, and some of the computational burdens and considerations inherent.</p>
<p>In the next post, we'll look at cheaper, more inventive algorithms for avoiding the computation of the negative phase, and describe how they're used in common machine learning models and training routines.</p>
<h2>Code</h2>
<p>The <a href="https://github.com/cavaunpeu/boltzmann-machines">repository</a> and <a href="https://nbviewer.jupyter.org/github/cavaunpeu/boltzmann-machines/blob/master/boltzmann-machines-part-1.ipynb">rendered notebook</a> for this project can be found at their respective links.</p>
<h2>References</h2>
<div class="footnote">
<hr/>
<ol>
<li id="fn:1">
<p><a href="http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec19.pdf">CSC321 Lecture 19: Boltzmann Machines</a> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://theclevermachine.wordpress.com/2014/09/23/derivation-maximum-likelihood-for-boltzmann-machines/">Derivation: Maximum Likelihood for Boltzmann Machines</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://www.cs.toronto.edu/~hinton/csc321/readings/boltz321.pdf">Boltzmann Machines</a> <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>