golden state society: CAT

Tuesday, November 22, 2011

New tools for digital humanities

One of the innovative papers I heard at the SSHA last week was presented by Harvard graduate student Ian Miller under the title "Reading 500 Years of Chinese History at Once". (In the end Ian apologized for only getting to the last 188 years of the Qing Dynasty.) I won't mention the details, since Ian hasn't yet published any of this work. But it was a genuinely fascinating exploration of emerging tools in the "digital humanities," applying topic analysis to a 188-year series of Imperial memoranda. Ian's goal was to identify spikes of interest in topics such as rebels and bandits. (Here are a couple of interesting pages on digital humanities; link, link.)

The basic insight that is leading to new research in digital humanities is the fact that vast quantities of texts are now available for quantitative analysis. Humanists typically work with texts, and up till now their approaches have largely taken the form of close readings and semantic interpretations. Now that much of the published corpus of humanity is available in digital form thanks to the Google Books project, and now that many archives are steadily moving their ephemera to digital versions as well, it is possible for humanities researchers to broaden their toolkit and look for patterns among these published and unpublished texts. Google's NGrams tool allows all of us to do some of this kind of work (link, link), but more specialized tools for statistical analysis and presentation are needed if we are to go beyond compiling the changing frequencies of specific terms.
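To make "compiling the changing frequencies of specific terms" concrete, here is a minimal sketch of that kind of count. The tiny corpus and the function names are invented purely for illustration; this is not how Google's tool is implemented, only the basic idea of tracking a term's share of all words year by year.

```python
from collections import Counter
import re

# Hypothetical mini-corpus: (year, text) pairs invented for illustration only.
corpus = [
    (1800, "bandits raided the granary and the bandits fled"),
    (1800, "the harvest was good this year"),
    (1801, "rebels and bandits gathered near the river"),
    (1801, "rebels seized the town and rebels held the bridge"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency_by_year(corpus, term):
    """Return {year: occurrences of term / total tokens in that year}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in corpus:
        tokens = tokenize(text)
        total_counts[year] += len(tokens)
        term_counts[year] += tokens.count(term)
    return {
        year: term_counts[year] / total_counts[year]
        for year in sorted(total_counts)
    }


# Share of all tokens in each year that are the word "rebels".
freq = relative_frequency_by_year(corpus, "rebels")
for year, share in freq.items():
    print(year, round(share, 4))
```

A count like this shows when a word becomes more common, but it says nothing about what the surrounding documents are about, which is the gap that topic methods try to fill.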

Statistical techniques for discovering "topics" in documents represent a crucial step forward in this endeavor. As Nelson Goodman noted in a pre-digital time, knowing what a text is "about" requires more than simply knowing which words are included in the document and in what frequencies (Problems and Projects). We might have said at that point, in the 1960s, that what we need beyond the syntax and the list of terms is "understanding", an irreplaceably human capability. But a central task for web-based search arises from exactly this issue, and a great deal of research has gone into doing a better job of discovering the "topics" that are central in a given document without invoking a human reader. And surprisingly enough, real progress has been made. This progress is at the heart of the digital humanities. The fundamental problem is this: are there statistical methods that can be used to analyze the frequency of the words included in a given document to provide a compressed analysis of the "topics" included in the document? We might then say that this compressed representation is a good approximation to what the document is "about".

A theoretical advance, and a corresponding set of tools, that is frequently invoked in research projects in this field is "latent Dirichlet allocation" (LDA), a statistical technique for using word frequencies in a document to sort out a smaller set of topics. David Blei, Andrew Ng, and Michael Jordan introduced the idea in 2003 (link). (There is a detailed and technical description of the model in Wikipedia; link.) They indicate that this method is similar to algorithms based on "latent semantic indexing". Here is how Blei, Ng, and Jordan describe the approach in the abstract to this paper:

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
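The three-level structure described in this abstract can be written compactly. As a sketch in what I take to be the paper's own notation (α is the Dirichlet parameter, β holds the word probabilities for each topic, θ is the topic mixture for one document, and z_n is the topic assigned to the n-th word w_n of a document with N words), the joint distribution for a single document is:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

Read from left to right: a document first draws its own mixture of topics, then each word draws a topic from that mixture, and finally the word itself is drawn from that topic's word distribution.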

And here is their statement of the goal of LDA analysis:

The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. (993)

And here is a summary assessment of the effectiveness of the LDA representation of a set of documents relative to a less compressive representation:

We see that there is little reduction in classification performance in using the LDA-based features; indeed, in almost all cases the performance is improved with the LDA features. Although these results need further substantiation, they suggest that the topic-based representation provided by LDA may be useful as a fast filtering algorithm for feature selection in text classification. (1013)

Here is a table they provide illustrating the kind of topic analysis that this statistical methodology creates:

In some ways the type of application that Ian Miller is making of these tools seems ideal. This kind of statistical methodology can be applied to very large databases of historical texts in order to discover patterns that the authors of those texts would have been entirely unaware of. So methods like LDA seem well designed to uncover historically shifting patterns of topic emphasis by observers and policy makers over time and space.

This is just a first cut for me on the kind of reasoning and statistical analysis that information theorists are employing to do semantic analysis of documents, and I certainly don't have a good understanding of how this works in detail. The power of these frameworks seems very great, though, and well worth studying in greater detail by historians and humanists.

Wednesday, August 24, 2011

Small cities

A recent post on the suburbs closed with the observation that there is an important "other" social space in the United States beyond the categories of urban, rural, and suburban. These are the small cities throughout the United States where a significant number of people come to maturity and develop their families and careers. I speculated that perhaps there is a distinctive sociology associated with these lesser urban places. Here I will look into this question a bit more fully.

There are about 275 cities in the US with populations of 100,000 or more (Wikipedia link). 201 of these cities are small, with populations between 100,000 and 250,000. There are 30.3 million people living in these cities -- about 10% of the US population. A certain number of these cities fall within the metropolitan areas of larger cities, but a significant number are at least 50 miles from a major city.

Here is a map of 200 cities with populations between 100,000 and 250,000:

And here is a map of 25 cities with population greater than 500,000 (red) and 48 cities with population between 250,000 and 500,000 (green):

Google Maps limits the number of objects that can be placed on a map to 200 items, so it isn't possible to overlay these maps using Google Maps. Google Earth does not have this limitation, and all these points are included on the Google Earth version of the map. Here is what the overlay looks like:

And here is a map of the Metropolitan Statistical Areas in the US in 1999. Wikipedia provides an up-to-date list of the MSAs in the US (link). (Many of the small cities actually constitute an MSA of their own; so determining whether a small city is "metropolitan" really involves the question of whether the place falls within one of the top 25-50 MSAs by population.)

The group of cities I'm interested in here is a subset of the cities on the first map: those that are more than 50 miles from one of the top 25 cities on the second map. This still leaves well over 100 cities in the United States with a couple of interesting characteristics: they are relatively small, so they can be expected to lack a number of higher-level functions and industries; and they are relatively isolated from other larger cities, so their populations are extensively dependent on the resources of the city itself for employment, social services, entertainment, consumption, education, etc.
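The 50-mile criterion can be applied mechanically to the two city lists. What follows is a minimal sketch, assuming each city is given as a latitude/longitude pair and that "isolated" means more than 50 miles, by great-circle (haversine) distance, from every one of the large cities. The function names are hypothetical, the coordinates are approximate, and the three cities are only an illustrative sample, not the spreadsheet data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def isolated_cities(small, large, min_miles=50.0):
    """Names of small cities farther than min_miles from every large city."""
    return [
        name
        for name, (lat, lon) in small.items()
        if all(
            haversine_miles(lat, lon, big_lat, big_lon) > min_miles
            for big_lat, big_lon in large.values()
        )
    ]


# Approximate coordinates, for illustration only.
large_cities = {"Chicago": (41.88, -87.63)}
small_cities = {
    "Naperville": (41.75, -88.15),  # a suburb, well inside the 50-mile radius
    "Peoria": (40.69, -89.59),      # well outside the 50-mile radius
}

isolated = isolated_cities(small_cities, large_cities)
print(isolated)
```

Straight-line distance is a rough proxy for isolation; driving time or commuting patterns might give a somewhat different list.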

So the takeaway question here is this: what is life like in Billings MT, Topeka KS, Norman OK, Pueblo CO, Springfield IL, Knoxville TN, Cary NC, Green Bay WI, Grand Rapids MI, Allentown PA, Shreveport LA, and Killeen TX? What is it like to grow up in these places? Where do young people go for post-secondary education? What percentage of young people leave these places permanently in the course of their careers? Where do the elected officials in these places come from? How are these cities doing, from the perspective of unemployment, neighborhood and business district decline, and social problems?

Further, we can ask whether there are any structural features in common that imply that these places are more similar to each other than they are to larger cities or smaller towns. Are issues of immigration, race relations, drug use, teen pregnancy, or high school dropout rates different in these places?

Finally, we can ask whether growing up in these places gives rise to a specific mentality. Do those of us who grew up in small cities like these -- Peoria, Rock Island, Springfield -- have a different set of values, a different way of looking at the world, or perhaps different ways of relating to people in ordinary social life? Or are regional differences (south, midwest, Pacific Coast) more of a determinant of one's mentality?

(I've placed the lists of cities and MSAs I've used here as spreadsheets at Google Docs; link, link. Both lists come from Wikipedia entries on US Cities and Metropolitan Statistical Areas.)

Friday, July 15, 2011

Food and water

It seems likely enough that one of the largest global security issues in the next fifty years will be food and water. There is a food crisis underway already, with prices for staple grains rising worldwide, and poor countries are beginning to experience the consequences. But a crisis in fresh water seems not too far in the future as well. Both these necessities depend on inherently scarce resources: arable land and large sources of fresh water. Along with energy, these goods are crucial to every person and every country in the world; and this in turn suggests the possibility of serious conflict over these resources in the future.

So what makes food and water a global security crisis? How does the possibility of dearth at the family level get transformed into the possibility of international conflict? Rising food prices create social unrest at the national level long before they lead to famine or malnutrition. International grain markets have been unstable over the past decade, with periodic upward spikes in prices. And grain riots have occurred as a consequence in a number of developing countries. This piece from DemocracyNow from 2008 documents demonstrations and riots across a range of African countries (link), with an interview with Raj Patel, author of Stuffed and Starved: The Hidden Battle for the World Food System. Here are some updates from the Christian Science Monitor (link) and Energy Bulletin (link), including riots in Algeria in 2011.

The bottom line of several of these reports is fairly simple: the international trade system for grains, including especially rice and wheat, periodically undergoes abrupt and prolonged price rises, and these price increases have dire consequences for urban poor and middle class people in the developing world. When a large population mobilizes in protest against rising food prices, national governments are at risk. And this is where the security risk comes in: when countries like Algeria or Morocco suffer serious instability, this has the potential of leading to international instability in the region as well.

Here is another, more distant cause of international tension that comes from the food crisis. Governments are interested in taking steps to provide greater food security for their own populations. And this sometimes involves taking actions that are harmful for other countries or for other populations. One symptom of the pressures mounting on the world food system is a widespread land grab of agricultural land around the developing world. Here is a report by the United Nations Food and Agriculture Organization (FAO) on this issue (link) and a piece from the Guardian that describes the situation a few years ago (link), and here is a FastCompany story about China's land purchases (link). The FAO report documents a significant transfer of land ownership from developing countries to middle and upper income countries; this implies serious future constraints on the development possibilities available to those countries. And it suggests a new form of colonialism -- not direct governance, but substantial absentee ownership. This too has the potential for stimulating international conflict.

So what about water? Here is a recent report by the Council on Foreign Relations on the interconnected consequences of fresh water shortages in different parts of the world (link). In this piece the effects of China's water crisis are traced internationally. Here is an inventory of resources by GlobalPolicy on international conflicts over water (link); it is a long list of potential conflicts. Here is the introduction the editors offer:

As demand for water hits the limits of finite supply, potential conflicts are brewing between nations that share transboundary freshwater reserves. More than 50 countries on five continents might soon be caught up in water disputes unless they move quickly to establish agreements on how to share reservoirs, rivers, and underground water aquifers. The articles and analysis below examine international water disputes, civil disturbances caused by water shortages, and potential regulatory solutions to diffuse water conflict.

Chinese-financed dam projects in Burma and other parts of Southeast Asia indicate how high the stakes are. The great rivers of Southeast Asia now face major challenges as a result of plans for hydroelectric dams regulating flow to downriver users. Here is a piece from the Irrawaddy on the controversies surrounding the Myitsone Dam in northern Burma; link. Similar issues arise on the Mekong River and other great rivers in the region; link. The problems that have arisen with regard to dams in Southeast Asia include displacement of villages and towns, serious environmental damage, and significant lowering of water levels in many of the great rivers of the region. Here is a background piece in Global Policy Forum on China's massive expansion of hydropower; link.

These sources of conflict over the most basic necessities of life suggest the need for serious international planning today to arrive at equitable and sustainable regimes for resolving conflicts over resources in the future.

Sunday, May 8, 2011

Flood courses of the Mississippi River

This fantastic map of the historical twists and turns of the Mississippi River near Cairo, Illinois, was drawn in 1944. It is reproduced in the New York Times today (link). In an age of digitally produced information displays, it is fascinating to see the density of historical information represented in this hand-drafted map. It is reminiscent of the maps Edward Tufte highlights in The Visual Display of Quantitative Information. Here is Charles Joseph Minard's 1869 map of Napoleon's invasion of Russia that Tufte made famous:

While on the subject of great maps, here is one by George Abel Schreiner in 1924, representing the structure of the world's telegraph cable system (link).

Here is a contemporary graphic representing global Internet flow:

And here is a graph of global cities connections, produced by R. Wall and B. v.d. Knaap in "Sustainability within a World City Network" (link).

What these images have in common is a very simple point: the power of graphical representation to capture complex sets of inter-related data.

Thursday, May 5, 2011

The drop-out crisis (II)

We've talked about "wicked problems" before -- problems that involve complex social processes, multiple actors, and murky causal pathways (link, link). A particularly important example of such a problem currently confronting the United States is the high school dropout crisis. The crisis is particularly intense in high-poverty areas, but it is found in all states and all parts of urban, suburban, and rural America. (Here is an earlier discussion of these issues; link.)

The consequences of this crisis are severe. More than a million students a year drop out of high school. Over 50% of these dropouts come from fewer than 20% of high schools. These young people have virtually no feasible pathways to a middle class life or a job in the 21st-century economy. And this in turn means a permanent underclass of unemployed or underemployed young people, with consequences for crime rates, social service budgets, and incarceration rates, and a serious productivity gap for our economy as a whole. So the problem is an enormously important one. (The Alliance for Excellent Education is a national organization devoted to tracking this issue; link. Another important resource is Building a Grad Nation (link) from the America's Promise Alliance.)

Changing this current situation requires change of behavior on the parts of many independent parties -- teenagers, parents, teachers, principals, elected officials, and foundation officers, to name only some of the most obvious participants.

There are many social actors who have an interest in this problem and a commitment to trying to resolve it. Teachers, principals, school boards; mayors and governors; non-profit organizations; foundations; universities and schools of education; citizens' groups -- there are committed and concerned actors throughout the country that are highly motivated to attempt to solve the problem.

But it is very, very hard to marshal these actors into effective attacks on the causes of this crisis. One part of the problem is strategic -- what are the interventions that can work on a large scale? How can a school system introduce changes in behavior and organization that really change the outcomes in a measurable way?

Another part of the problem is a coordination problem. How can we succeed in gaining commitment and cooperation across this range of actors, even if we have some credible strategies at hand? It often seems that every actor has a different theory of the problem, and often it is difficult to gain concerted action across diverse actors. A foundation has one strategy; a school board has a different theory; and the teachers themselves work on the basis of a different understanding of the problem as well. All are well motivated; but there is a clash of efforts.

In this context the Diplomas Now initiative is particularly encouraging. It is a national initiative that targets the "drop out factories" through a clear theory of how to create turn-around schools. It is referred to as a civic Marshall Plan. It is based on careful empirical research. It has developed a clear theory about how interventions with children through the schools can impact persistence through graduation. It has mobilized a strategic group of partners -- CityYear, Communities in Schools, and Talent Development at Johns Hopkins. And it has an ambitious national strategy that is already being implemented.

And the most impressive fact is that Diplomas Now is beginning to work. There are DN schools in some of the toughest urban contexts in America; these schools are showing real measurable progress; and the example is spreading to other cities and systems. Concrete evidence of these successes is highlighted by a wide range of committed leaders, academics, and corps members at the CityYear National Leadership Conference in Washington (link).

So maybe we can have some cautious optimism that our wicked problems can be solved, with sufficient commitment and persistence from a range of actors.