
Topic Modeling Eleanor Roosevelt’s “My Day” Columns


The Project

Methods, Challenges & Lessons Learned

Appendix

The Project

For my final project, I did a topic modeling analysis of Eleanor Roosevelt’s “My Day” column. Mrs. Roosevelt wrote this column six days a week from January of 1936 to August of 1960. She then produced the column three times a week until September 1962. I felt that Mrs. Roosevelt’s columns were a good candidate for topic modeling because of the sheer volume of columns and because my experience with her later columns indicated that she wrote on a wealth of topics.

I initially chose three years for analysis to determine if there were any patterns immediately visible and if a question could be formulated for further investigation. I chose columns from 1936, 1943 and 1950. 1936 was chosen as it was the first year that she wrote the column. 1943 was chosen as it was a wartime year, and she was also still in the White House as First Lady. 1950 was chosen because she was no longer at the White House and was well into her work with the United Nations Commission on Human Rights. After converting the columns to text files (see File Preparation for additional information), I analyzed the texts using Mallet version 2.0.7. The output was analyzed and turned into a visualization comparing the distribution of topics over the three years. The analysis revealed groups of words that comprised topics such as “Home & Family” (#1 for 1936), “War” (#1 in 1943) and “Human Rights Commission” (#8 in 1950).

As you can see from the chart below, the #1 topic from 1936 dominated the top 10 topics, while in 1943 the topics are far more evenly distributed. This led to the question that would guide the rest of the project: can we use topic modeling to determine when Mrs. Roosevelt’s writing shifted from a single dominant topic, as seen in 1936, to the more even distribution of topics seen in 1943? And if we can determine when that shift took place, can we determine what events may have spurred it on?

[Figure: 1936, 1943, 1950 Topic Distribution by Year]

A list of all the topics and associated words for each year is in the Appendix.

Topic # | 1936 | 1943 | 1950
T1 | Home & Family | War | Government & Country
T2 | Youth Education | Day to Day | Friends
T3 | WPA | Family | War, Peace & Democracy
T4 | Train Travel | Community Service | Global Issues
T5 | Entertaining | Armed Forces | Meetings
T6 | Women’s Issues | Global War Issues | Korean War
T7 | Day to Day | Future | Scandinavia
T8 | White House Socials | Community Issues | Human Rights Commission
T9 | Coastal Travel | War Relief | Community Support
T10 | Employment | Soldiers | Family

After converting the files for 1937 through 1942, I once again ran Mallet’s analysis on each individual year.  The reason I looked at each year on its own rather than all eight years was that I wanted to see the distribution of topics by individual year.  Running all eight years as one analysis would have diluted the topics and not allowed for as sharp a year-to-year contrast.

Once the analysis had been run, I created another visualization of the top 10 topics for each year, which is below.  As you can see, while the #1 topic is less dominant in 1939 in her writing, it is not until 1941 that she expands her writing on other topics to push the #1 topic to less than 50% of the top 10 for the year.

[Figure: 1936-1943 Topic Distribution by Year]

Topic # | 1936 | 1937 | 1938 | 1939
T1 | Home & Family | Home & Family | Day to Day | Youth
T2 | Youth Education | Women | Friends | Day to Day
T3 | WPA | Entertaining | Children | Entertaining
T4 | Train Travel | War | WPA | Train Travel
T5 | Entertaining | Train Travel | Train Travel | Family
T6 | Women’s Issues | College Meetings | Family | War
T7 | Day to Day | Women’s Education | Hyde Park | Holidays
T8 | White House Socials | WPA | Air Travel | WPA
T9 | Coastal Travel | Women in Service | Theatre | Monarchy
T10 | Employment | Outdoors | Rural Conditions | Riding

Topic # | 1940 | 1941 | 1942 | 1943
T1 | Day to Day | Day to Day | War | War
T2 | Entertaining | Entertaining | Day to Day | Day to Day
T3 | War | War | Home | Family
T4 | Family | Defense | War | Community Service
T5 | WPA | Civilian Defense | War Relief | Armed Forces
T6 | Outdoors | WPA | War Funding | Global War Issues
T7 | Travel | Pleasantries | Women’s Issues | Future
T8 | Music | People | Youth Education | Community Issues
T9 | White House | Youth Issues | Community Service | War Relief
T10 | Youth Issues | Meetings | Armed Forces | Soldiers
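To make the 50% threshold concrete, here is a minimal sketch that recomputes the #1 topic's share of the top ten from the "# Of Files" counts listed in the Appendix. It assumes the chart shares correspond to those counts, which the report does not state outright; the names `TOP10_FILE_COUNTS`, `top_topic_share` and `first_year_below` are illustrative only.

```python
# "# Of Files" counts for the top 10 topics of each year, copied from the Appendix.
# Assumption: the charted shares are proportional to these counts.
TOP10_FILE_COUNTS = {
    1936: [194, 13, 9, 9, 9, 9, 5, 5, 5, 4],
    1937: [148, 21, 18, 15, 9, 8, 8, 7, 6, 6],
    1938: [160, 18, 16, 12, 11, 10, 7, 7, 6, 6],
    1939: [137, 64, 13, 7, 6, 6, 6, 6, 5, 4],
    1940: [194, 33, 12, 7, 6, 6, 5, 4, 4, 3],
    1941: [108, 49, 22, 15, 13, 11, 11, 6, 5, 5],
    1942: [106, 51, 27, 12, 11, 8, 7, 6, 6, 6],
    1943: [49, 31, 29, 23, 21, 18, 12, 10, 9, 8],
}


def top_topic_share(counts):
    """Return the #1 topic's fraction of all files assigned to the top 10 topics."""
    return counts[0] / sum(counts)


def first_year_below(threshold, counts_by_year):
    """Return the earliest year whose #1 topic share is under the threshold, or None."""
    for year in sorted(counts_by_year):
        if top_topic_share(counts_by_year[year]) < threshold:
            return year
    return None


if __name__ == "__main__":
    for year in sorted(TOP10_FILE_COUNTS):
        print(f"{year}: {top_topic_share(TOP10_FILE_COUNTS[year]):.1%}")
    print("First year under 50%:", first_year_below(0.5, TOP10_FILE_COUNTS))
```

Under that assumption, the #1 topic accounts for roughly 74% of the top ten in 1936 (194 of 262 files) and first falls below half in 1941 (108 of 245, about 44%).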

At this point, I hoped to see the #1 topic trending downward month by month through 1940, leading into 1941, and created the following visualization:

[Figure: 1940 Topic Distribution by Month]

1940

Topic # | Topic Name | Topic # | Topic Name
T1 | Day to Day | T6 | Outdoors
T2 | Entertaining | T7 | Travel
T3 | War | T8 | Music
T4 | Family | T9 | White House
T5 | WPA | T10 | Youth Issues

Unfortunately, there was no clear trend line indicating that her writing topics would become wider in 1941. While the #1 topic is less dominant in December than it was in January, it is still over 60% of the top ten topics for the month. I ran another breakdown by month for 1941 to again see if there was any clear trend over the course of the year, which is below. While it again showed no obvious month-to-month trend, there is a very clear shift in the topic distribution for December of 1941. With Japan’s attack on Pearl Harbor on December 7th, 1941, the increase in the topics of “War,” “Defense” and “Civilian Defense” makes sense.

[Figure: 1941 Topic Distribution by Month]

1941

Topic # | Topic Name | Topic # | Topic Name
T1 | Day to Day | T6 | WPA
T2 | Entertaining | T7 | Pleasantries
T3 | War | T8 | People
T4 | Defense | T9 | Youth Issues
T5 | Civilian Defense | T10 | Meetings

Though the entrance of the United States into World War II understandably prompted an expansion of Mrs. Roosevelt’s writing, December was not the only month in 1941 in which the topics other than #1 made up over half of the top ten. In eight months over the course of the year, the #1 topic comprised less than 50% of her columns.

I will admit I had hoped to see a drastic change where a specific event could be pointed to as a watershed moment. However, it would appear that in 1941 the continued expansion of war around the world also expanded Mrs. Roosevelt’s daily writing topics, and there is no single act that can account for her writing more on various aspects of the war and less on her day-to-day activities. We can see, though, that 1941 as a whole is where her writing became more expansive, and it continued to do so in 1942 and 1943.

Methods, Challenges and Lessons Learned

File Preparation

The columns were taken from the website of the Eleanor Roosevelt Papers Project, where they have produced an electronic version of her columns.

Each column was copied from the website and pasted in a text file using Apple’s TextEdit program.  Datelines were not included in the file, but embedded in the file name.  The file naming structure is as follows:  PublicationDate_City_State/Province_Country.txt

Example:

December 12, 1941 San Francisco:  19411212_SanFrancisco_CA_USA.txt

Exceptions:

Datelines of Washington DC were saved as:

YYYYMMDD_DC_USA.txt

Columns with no printed dateline information were saved as:

YYYYMMDD_ND.txt

Additionally, any notes from United Features Syndicate to newspaper editors such as embargo information or notes on time of mailing were also excluded from the files.
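Because the dateline lives in the file name, it can be recovered later without opening the file. The following is a minimal sketch of how the naming structure and its two exceptions might be parsed; the `Dateline` type and `parse_column_filename` function are hypothetical names, and mapping the DC exception to the city "Washington" is an illustrative choice rather than part of the original workflow.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class Dateline(NamedTuple):
    published: date
    city: Optional[str]
    region: Optional[str]    # state or province
    country: Optional[str]


def parse_column_filename(filename: str) -> Dateline:
    """Split a column file name into its publication date and dateline parts."""
    stem = filename[:-4] if filename.lower().endswith(".txt") else filename
    parts = stem.split("_")
    published = datetime.strptime(parts[0], "%Y%m%d").date()
    rest = parts[1:]

    if rest == ["ND"]:                 # column with no printed dateline
        return Dateline(published, None, None, None)
    if rest == ["DC", "USA"]:          # Washington, DC exception
        return Dateline(published, "Washington", "DC", "USA")
    if len(rest) == 3:                 # City_State/Province_Country
        return Dateline(published, rest[0], rest[1], rest[2])
    raise ValueError(f"Unrecognized file name pattern: {filename}")
```

For the example above, `parse_column_filename("19411212_SanFrancisco_CA_USA.txt")` yields a publication date of December 12, 1941, with the city kept exactly as written in the file name ("SanFrancisco").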

On more than one occasion in class, we have discussed issues with spelling and text mining, and I kept that in mind as I was working.  The files were created in one-month batches, and once a month was finished, I opened all the files in TextEdit and set it to highlight (but not auto-correct) any spelling errors.  The files were quickly scanned and any obvious typos or words that had been run together were corrected.  There are also known idiosyncrasies with Eleanor Roosevelt’s spelling.  While most of these spelling “errors” were left intact in the digitized editions, some were corrected.  In this case I had a choice: I could attempt to verify that all instances of “ER spelling” were intact in all the documents, or I could simply correct them.  To search 2,811 documents to ensure that “traveling” was left as “travelling” along with other words she routinely spelled differently would have taken far too much time.  Correcting the words did not change the meaning of the documents (even though Mallet does not care about definitions), and ensured consistency of spelling throughout the documents.
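The corrections described here were made by hand. As an alternative, a scripted pass could apply the same kind of normalization consistently; the sketch below is only an illustration of that idea. The mapping contains the single variant named in the text ("travelling"), and the names `SPELLING_VARIANTS` and `normalize_spelling` are hypothetical.

```python
import re

# Illustrative mapping only: "travelling" is the one variant named in the text.
# Any further entries would have to come from the documents themselves.
SPELLING_VARIANTS = {"travelling": "traveling"}


def normalize_spelling(text, variants=SPELLING_VARIANTS):
    """Replace whole-word spelling variants, preserving a leading capital letter."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in variants) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        fixed = variants[word.lower()]
        return fixed.capitalize() if word[0].isupper() else fixed

    return pattern.sub(replace, text)
```

Matching on word boundaries keeps the substitution from touching longer words that merely contain a variant.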

After each monthly batch was completed, the file folder was then scanned to ensure that the file names were correct.  The time to process each month was approximately 15 minutes.

Choosing Frequency of Topics

One of the bigger challenges in the analysis was choosing the number of topics to model. In looking at references in an attempt to identify an “optimum” number of topics, the consensus seemed to be that it was a trial and error process until one finds the “right” number of topics to use. During the initial three-year analysis of 1936, 1943 and 1950, I ran Mallet with topics ranging from 5 to 100. 5 gave topics that were far too broad and 100 was, not surprisingly, too narrow. After looking at output from all three years with varying numbers of topics defined, I settled on 30 topics. There is no “hard scientific” reason I used 30 topics; it produced data that made sense and was workable.

Choosing Frequency of Analysis

Choosing the frequency of the analysis was done in concert with choosing the number of topics to model.  I quickly discovered that trying to do the analysis by month or quarter was not feasible, as there was simply not enough data to produce meaningful results.  I found that individual years worked very well – enough data to produce distinct topics, but not so much that the topics would become diluted.

Choosing Duration of Analysis

Choosing the duration of the analysis was actually one of the easier choices once the three year initial analysis was completed.  Knowing that I was looking for a point in time between 1936 and 1943, ending the analysis at 1943 made perfect sense.  Performing the analysis through to 1950 simply because I had included it in my initial analysis would not have contributed to the overall project.

Creating the Visualizations

The charts in this report were all created with the chart building tools in Microsoft Excel 2011 for Mac.  The *_composition and *_keys text files created by Mallet were converted to .csv files for analysis and then .xlsx files, and that data was used to build the charts.
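As a rough illustration of how a *_composition file might be reduced to chartable numbers, the sketch below counts the columns in which each topic has the largest proportion. Two assumptions are worth flagging: that the composition file follows the layout Mallet 2.0.7 is generally understood to write (a "#" header line, then one row per document holding a document index, the file name, and alternating topic/proportion pairs), and that the "# Of Files" figures in the Appendix are dominant-topic counts of this kind. It also assumes file paths contain no whitespace. The function name is hypothetical.

```python
from collections import Counter


def dominant_topic_counts(lines):
    """Count, for each topic id, the documents in which it has the largest proportion."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                        # skip blank lines and the header row
        fields = line.split()
        pairs = fields[2:]                  # everything after doc index and file name
        topics = [int(t) for t in pairs[0::2]]
        weights = [float(w) for w in pairs[1::2]]
        if not weights:
            continue                        # row without topic data
        best = max(range(len(weights)), key=weights.__getitem__)
        counts[topics[best]] += 1
    return counts
```

An open file can be passed in directly, since the function only iterates over lines; `counts.most_common(10)` then gives the ten most frequent dominant topics.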

Actual Processing of Files

As stated before, Mallet 2.0.7 was used for the topic modeling.  In order to ensure consistency in processing each year, and minimize processing errors due to typing errors, I wrote out all the commands in a saved text file.  When I would perform the analysis on a new year, I would run a “find and replace” in the text file to update all the references to the year being processed.  I would then copy and paste each command into Terminal.
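The find-and-replace step could also be scripted so that the year is a parameter. The sketch below builds the command lines as text for pasting into Terminal; it does not run anything. This report does not list the exact commands, so everything here is an assumption: `import-dir` and `train-topics` with the options shown are standard Mallet commands, but the particular option set, the relative `data_root`, the intermediate `.mallet` file name, and the output names (modeled on the Analysis_1936-FULL-30_composition naming in the Appendix links) are illustrative.

```python
import shlex


def mallet_commands(year, num_topics=30, data_root="../DigitalFinal"):
    """Build import and training command lines for one year of columns.

    Paths are relative to the Mallet directory; all names are illustrative.
    """
    corpus = f"{year}-FULL"
    prefix = f"Analysis_{corpus}-{num_topics}"
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", f"{data_root}/{corpus}",
        "--output", f"{corpus}.mallet",
        "--keep-sequence",
        "--remove-stopwords",
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", f"{corpus}.mallet",
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"{prefix}_keys.txt",
        "--output-doc-topics", f"{prefix}_composition.txt",
    ]
    return [shlex.join(import_cmd), shlex.join(train_cmd)]


if __name__ == "__main__":
    for command in mallet_commands(1941):
        print(command)
```

Generating the text from one function removes the risk of a missed replacement when moving from one year to the next, which was the purpose of the saved command file.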

There were problems at the beginning of processing with hidden files in the directories holding the data files, specifically the .DS_Store file that Apple’s OS X creates when Finder windows are opened. To ensure that these files would not interfere with the processing, I made sure the Finder window was closed before each year was processed. (If you delete the .DS_Store file while its folder is open in a Finder window, it immediately recreates itself.) Then I ran the following commands in Terminal, replacing the year as appropriate.

cd ~

cd desktop/DigitalFinal/1936-FULL/

rm .DS_Store

cd ~

cd desktop/mallet-2.0.7/

After those commands were run, the topic model commands were then run.  Once the files were processed, I also double-checked the number of files in the *_composition files to ensure no extra files had been processed.
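The file-count check can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure actually used: it treats any name beginning with a dot as a hidden file, counts only .txt files as data, and assumes the composition file has a "#" header followed by one row per document. Both function names are hypothetical.

```python
from pathlib import Path


def visible_text_files(folder):
    """Return the .txt files in a folder, ignoring hidden files such as .DS_Store."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and not p.name.startswith(".") and p.suffix == ".txt"
    )


def composition_row_count(lines):
    """Count document rows in a composition file, skipping the '#' header and blanks."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))
```

If the two numbers differ for a given year, an extra or hidden file was probably picked up during import.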

Creating this Presentation

I made the conscious decision to present this as a posting at NotJustBooks.org rather than trying to create a new, separate webpage for it. WordPress’ built-in stylesheets and formatting allowed me to concentrate on the report itself rather than creating an additional digital project. (Though creating bookmarks within the post was far more of a pain than it should have been.)

Other Lessons Learned

This project has been both challenging and enjoyable.  The most time consuming part was preparing the files for processing.  While I am sure I could have written a script to scrape the files from the website, I think it would have taken longer to get a script to do exactly what I wanted to do than it did to simply copy & paste the texts.  (It also quickly became a muscle memory exercise.)

Once the data was processed, making sense of it was the real challenge.  I will admit I was shocked that the first year of texts was so heavily weighted to one topic – my initial work with Mrs. Roosevelt’s column had been her 1953 writings and she seemed to write about everything.  It seems that it just took a little while and a World War for her to get there.

The visualizations were very easy – once I figured out the best way to graphically present the data.  There were some graphical non-starters in the beginning, such as this attempt to look at the frequency of topics by quarter:

[Figure: Not All Visualizations Are Helpful]

Overall the project has been a good experience in distant reading, data analysis and wrangling Mallet.

Appendix

Topic Breakdowns

Please note that the topic “names” I have assigned are simply my interpretation of the word groupings. Mallet does not identify word meanings, only patterns. Also, Mallet removes all capitalization and punctuation as part of the analysis process.

1936

Topic # | Topic Name | # Of Files | Topic Words
T1 | Home & Family | 194 | time morning day night back good house made home husband children york long washington left afternoon told young years
T2 | Youth Education | 13 | people work interesting young group country interested youth girls education public city point government conditions view future individual national
T3 | WPA | 9 | work wpa project west school projects virginia arthurdale visited breakfast teachers sewing piece hundred skill workers set mine making
T4 | Train Travel | 9 | clock thirty train boys arrived drove hour breakfast minutes back station hotel eleven ready drive ten reached twelve tonight
T5 | Entertaining | 9 | mrs mr miss scheider early brought daughter asked started sat dr half roosevelt interesting lady beautiful street cook hand
T6 | Women’s Issues | 9 | women state country york committee miss meeting conference county college home parts club league rural working democratic train audience
T7 | Day to Day | 5 | people life world great things feel make real makes today fact kind living live thinking read deal person difficult
T8 | White House Socials | 5 | white dinner tea guests house luncheon morning colonel ride pleasure show staying secretary spring social deep played high sir
T9 | Coastal Travel | 5 | island beach village st weather captain ferry dickerman river boat miss houses bay fog prince ahead minutes side looked
T10 | Employment | 4 | human problem beings power unemployed ability employees waste solution happiness find people drury awake lesson fall emergency method towns

Note: There is no file for March 11, 1936

1937

Topic # | Topic Name | # Of Files | Topic Words
T1 | Home & Family | 148 | time morning day back people good husband yesterday told home long feel hope thought today evening days made night
T2 | Women | 21 | people work young women interested fact good real make great state years present kind woman important interesting youth today
T3 | Entertaining | 18 | mrs mr house miss afternoon dinner white group visit washington yesterday friends held lady states ladies meeting talk small
T4 | War | 15 | find country world give things people number great men thing read man friends made question bring mind experience war
T5 | Train Travel | 9 | train found morning hour great thirty clock minutes breakfast mrs finally station ten scheider looked left half bed office
T6 | College Meetings | 8 | college president governor state car fort lecture speech high home flowers elliott met mayor gas texas negro ruth chandler
T7 | Women’s Education | 8 | school girls houses work project interesting city small room community living virginia spent west dollars job private arthurdale hundred
T8 | WPA | 7 | building work boys wpa exhibition made industry district rural workers variety low scouts buildings arts months space public dam
T9 | Women in Service | 6 | woman service county blind serve court fair judge exhibit jury atlantic clubs sale subject showing hospital law bobby sit
T10 | Outdoors | 6 | beautiful water river hudson mountains beauty glorious lived air drove mountain coffee charleston bridge streets lovely place flowers hills

1938

Topic # | Topic Name | # Of Files | Topic Words
T1 | Day to Day | 160 | people day work time good country things york great find today interesting yesterday city feel make morning part give
T2 | Friends | 18 | mrs mr house back morning washington lunch pleasant miss scheider beautiful left gray decided hotel delightful trip john started
T3 | Children | 16 | young children told night back woman life man feeling long talk world asked met boy real sat live spent
T4 | WPA | 12 | wpa project work school boys projects state building girls nya visited made built farm room train schools buildings high
T5 | Train Travel | 11 | president train arrived dinner left morning time found late afternoon days finally enjoyed car clock short hour leave told
T6 | Family | 10 | home place law mother yesterday rest house returned bed hospital food supper remember big hours afternoon spent daughter knew
T7 | Hyde Park | 7 | park time made yesterday looked afternoon hyde rain early morning days good side set sitting waiting rooms west held
T8 | Air Travel | 7 | plane trip airport flight seattle chicago miles james crowd reached atlanta flying fog lake mountains leave stayed spot north
T9 | Theatre | 6 | play theatre washington evening pleasant amusing plays stage judge perfectly guest amused habit story mind cast action seats unpleasant
T10 | Rural Conditions | 6 | education interest housing rural public living national conditions south county making point books teachers plan federal question training important

1939

Topic # | Topic Name | # Of Files | Topic Words
T1 | Youth | 137 | people work young great country interesting time made find group today years yesterday interest night part number give men
T2 | Day to Day | 64 | time morning day good yesterday city york things told long found home world lunch back thought afternoon place make
T3 | Entertaining | 13 | mrs mr dinner evening friends guests delightful party gave enjoyed tea ladies clock wife luncheon back conference press glad
T4 | Train Travel | 7 | train reached hotel drive drove car station texas back kind breakfast late visit beautiful stopped real lecture town girl
T5 | Family | 6 | house president washington mr white mother mrs family hyde law big arrived saturday afternoon tomorrow friday clock park small
T6 | War | 6 | war world peace nation nations read leaders countries difficulties suffering abroad thinking problems rest man influence responsibility desire european
T7 | Holidays | 6 | christmas year family remember eve head tree cards church day fourth season party pass hard custom friends grieve july
T8 | WPA | 6 | project boys wpa nya projects state girls college training lecture buildings workers homes school needed communities business colored resident
T9 | Monarchy | 5 | queen king majesties royal crowd diana crowds immediately heat arrival feeling potomac picnic rain swimming opportunity royalty canada embassy
T10 | Riding | 4 | wife words horses speech horse ride fort show opinion lower senator bridge column general today question madame riding corner

1940

Topic # | Topic Name | # Of Files | Topic Words
T1 | Day to Day | 194 | people country time city york young great hope work yesterday day night life make things home years made group
T2 | Entertaining | 33 | mrs mr miss back lunch dinner afternoon time yesterday morning work pleasant clock thought enjoyed good gave washington enjoy
T3 | War | 12 | war today world nation democracy peace people force human future desire feel live individual nations courage economic face government
T4 | Family | 7 | president train park left morning drove house hyde day back arrived nice air reached return hours children front decided
T5 | WPA | 6 | work state community nya training wpa week boys valuable education program young projects drive government school schools communities center
T6 | Outdoors | 6 | green drive trees houses lovely blue color miss mountains sun drove road looked flowers gardens haven lake space beautiful
T7 | Travel | 5 | car hour long plane late started wanted arrived reached airport san plans trip stop hands man gentleman stopped texas
T8 | Music | 4 | music orchestra played sang evening symphony dance singing concert lovely songs negro union delightful hear song dancing costume art
T9 | White House | 4 | house white reception guests members soft cold wise room birthday lawn party band numbers cabinet future museum end words
T10 | Youth Issues | 3 | youth national groups congress meeting defense situation problems administration american states program interested general problem meet social agencies conference

1941

Topic # | Topic Name | # Of Files | Topic Words
T1 | Day to Day | 108 | mrs house morning time miss washington back day lunch yesterday work dinner president white york clock today afternoon left
T2 | Entertaining | 49 | people mr young evening night group made interesting yesterday day time service afternoon dr opportunity home small days play
T3 | War | 22 | people country great life future world make year time nation today interest told hope situation feel bring democracy sense
T4 | Defense | 15 | work defense community part people program groups number meeting week effort job important needed agencies labor morning made meet
T5 | Civilian Defense | 13 | defense civilian office women meeting state volunteer bureau organization participation staff local services met regional federal working information mayor
T6 | WPA | 11 | training girls school nya schools boys project year wpa program center projects art nurses hospital pictures colored houses opportunities
T7 | Pleasantries | 11 | good things yesterday find morning read man return make deal called thought put night long difficult water days summer
T8 | People | 6 | women live men make land human times long years read understand friend build grow end generation reason peace short
T9 | Youth Issues | 5 | national education conference youth communities service great press working areas educational housing development physical rural administration problem recreation cities
T10 | Meetings | 5 | meeting government thought democratic present interest hands attended questions citizens gave country organizations form countries members activities women good

Note: There are no files for July 24th and 25th, 1941

1942

Topic # | Topic Name | # Of Files | Topic Words
T1 | War | 106 | people war country great things young time work find present life made make give hope men good part house
T2 | Day to Day | 51 | mrs yesterday mr city york evening miss afternoon morning day train washington president lunch interested meeting night dinner state
T3 | Home | 27 | time back day good days home long morning found left hours thought hour feel man find put family visit
T4 | War | 12 | world future war today kind united peace countries nations courage hope fight fighting responsibility understanding freedom nation end forward
T5 | War Relief | 11 | american british red cross states britain united lady lord london club minister headquarters center tea great sir thirty prime
T6 | War Funding | 8 | stamps bonds relief buy washington money defense save campaign sale savings helping charitable university national gifts fund send feel
T7 | Women’s Issues | 7 | children home child woman day schools family services families care homes mother leave provide mothers older money medical good
T8 | Youth Education | 6 | school training clothes letter wear girls nya high adequate teacher jenks mama clothing month rationing variety basic learning mountain
T9 | Community Service | 6 | work women community girls workers food great run school trained services activities home factory full boys weeks meal service
T10 | Armed Forces | 6 | boys men army boy navy training service number officers soldiers cases forces visited group camp officer young armed camps

1943

Topic # | Topic Name | # Of Files | Topic Words
T1 | War | 49 | people great country day make time home things life war find group part work feel long hope days interesting
T2 | Day to Day | 31 | mrs york city mr washington good miss night train young evening yesterday left morning play afternoon back give time
T3 | Family | 29 | time house morning president found afternoon lunch thought husband spent great white hope number short small kind yesterday talk
T4 | Community Service | 23 | women work meeting services training opportunity war girls community workers group service evening britain college great groups labor activities
T5 | Armed Forces | 21 | men made good army told man home day boys work back war fighting service today night places general asked
T6 | Global War Issues | 18 | war united states country nations people nation interest world future make plans question responsibility military postwar relief national part
T7 | Future | 12 | world people future human year live give real peace country day ago words living important thought reading bring fight
T8 | Community Issues | 10 | young people effort housing youth government industry mr conditions organizations district white situation older making ways numbers increase shortage
T9 | War Relief | 9 | hospital red cross hospitals nurses visited patients area boy doctors island club boys wards cases wounded months light daughter
T10 | Soldiers | 8 | boys soldiers written boy long place show letters home feel soldier warm write houses families back grateful enjoy american

1950

Topic # | Topic Name | # Of Files | Topic Words
T1 | Government & Country | 92 | people country great time good find day work long make today government made part young things give important countries
T2 | Friends | 33 | mrs morning mr time evening house home night lunch young number york afternoon day dinner meeting visit city gave
T3 | War, Peace & Democracy | 21 | world people war men life peace man live make thought nation force free hope effort living things rest democracy
T4 | Global Issues | 16 | united nations states state time present hope nation greater organization security understanding congress peace bring making past day fact
T5 | Meetings | 14 | general work committee made yesterday feel session hope success assembly afternoon questions meeting delegates asked morning subject change lake
T6 | Korean War | 13 | korea chinese soviet war communist ussr koreans north government china aggression south union forces free korean communists military fighting
T7 | Scandinavia | 10 | oslo swedish farm building great large small visited norway houses rooms drove prince sweden beautiful modern ships scandinavian charge
T8 | Human Rights Commission | 10 | rights human commission covenant declaration social freedoms articles article hope group exceptions weeks economic quickly members finished award document
T9 | Community Support | 9 | public work community social dr association group activities year organization training national groups council red education department similar cross
T10 | Family | 7 | park hyde library back sunday year car franklin weekend stand husband cottage memorial day guests garden labor joy rain

Mallet Output Files

1936 – http://notjustbooks.org/FinalFiles/Analysis_1936-FULL-30_composition.xlsx

1937 – http://notjustbooks.org/FinalFiles/Analysis_1937-FULL-30_composition.xlsx

1938 – http://notjustbooks.org/FinalFiles/Analysis_1938-FULL-30_composition.xlsx

1939 – http://notjustbooks.org/FinalFiles/Analysis_1939-FULL-30_composition.xlsx

1940 – http://notjustbooks.org/FinalFiles/Analysis_1940-FULL-30_composition.xlsx

1941 – http://notjustbooks.org/FinalFiles/Analysis_1941-FULL-30_composition.xlsx

1942 – http://notjustbooks.org/FinalFiles/Analysis_1942-FULL-30_composition.xlsx

1943 – http://notjustbooks.org/FinalFiles/Analysis_1943-FULL-30_composition.xlsx

1950 – http://notjustbooks.org/FinalFiles/Analysis_1950-FULL-30_composition.xlsx

