Behind the Scenes of Really Big Data: Computing on the Whole World

For our April Meetup, we're thrilled to have Kalev Leetaru, Yahoo! Fellow in Residence at Georgetown University, talk about data mining at a global scale. What does it take to build a system that monitors the entire world, analyzing global news media in realtime, compiling catalogs of everything happening in the world, and making that data accessible for analysis, visualization, forecasting, and operational use? What does it take to support querying of a quarter-billion-record-by-58-column database in near-realtime? How do you visualize networks with hundreds of millions of nodes, tease structure from chaotic real-world observational graphs, or explore networks in the multi-petabyte range? How do you process and geographically visualize the emotion of the live Twitter Decahose in realtime? How do you rethink tone mining from scratch to power a flagship new reality television show? How do you adapt systems to work with machine translation, OCR and closed captioning error, and the messiness of real-world data? How do you process half a million hours of television news, five billion pages of historic books, or 60 million images dating back 500 years?

• 6:30pm -- Networking, Empanadas, and Refreshments

• 7:00pm -- Introduction

• 7:15pm -- Presentation and Discussion

• 8:30pm -- Data Drinks (Tonic, 2036 G St NW, Patio)


This talk will pull back the curtain and present a behind-the-scenes view of what it's really like to work with really big data. How does one blend the world’s most powerful supercomputers, virtual machines, cloud storage, infrastructure as a service, plus a ton of software, into a single end-to-end environment that supports all of this research? I’ll be deep-diving on the GDELT Project, a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day. What does it take to build and run a system that monitors the entire world each day and delivers a quantitative model that increasingly powers operational conflict watchboards across the world?


Kalev H. Leetaru is the [masked] Yahoo! Fellow in Residence for International Values, Communications Technology and the Global Internet at the Institute for the Study of Diplomacy in the Edmund A. Walsh School of Foreign Service at Georgetown University. He holds three US patents (cited by a combined 44 other issued US patents) and his work has been profiled in Nature, the New York Times, The Economist, BBC, Discovery Channel and the media of more than 100 countries. His most recent work includes the first in-depth study of the geography of social media and the changing role of distance and location in online communicative behavior around the world (named by Harvard’s Nieman Lab as the top social media study of 2013), the creation of the GDELT Project, a database of more than a quarter-billion georeferenced global events 1979-present and the people, organizations, locations, and themes connecting the world, and the creation of the SyFy Channel’s Twitter Popularity Index, the first realtime character “leaderboard” created for television. Most recently he was named as one of Foreign Policy Magazine’s Top 100 Global Thinkers of 2013. More on his latest projects can be found on his website.


This event is sponsored by the GWU Dept. of Decision Sciences, Cloudera, Statistics.com, IBM Analytics Solution Center, Elder Research, and InformIT. Would you like to sponsor too? Please get in touch!


  • Harlan H.

    There's been some interesting pushback against GDELT in recent days, related to news articles that used it as a source. See here: Bottom line seems to be that it's important to understand the _actual_ source and processes that generate data, and not just to trust the name of a data source.

    May 16

  • Harlan H.

    Kalev asked me to share a couple of recent interesting things about GDELT. Data now available by BigQuery: And, data now on globes in many museums:

    1 · June 1

    • Greg T.

      I saw that a couple days ago. Great move to increase usage. I'm a frequent user and fan of BigQuery. It is very powerful once you understand how it works. Ask me if you want to know more.

      June 1

    • Brand N.

      Greg, Please give us a presentation on this at the Federal Big Data Working Group Meetup. Brand

      June 5

    • Eric

      Interesting fact: playing audio at half speed and adjusting for pitch makes it sound normal. :-p

      3 · April 25

  • Miriam H.

    What I want now is 4D analytics, with two feet pedal time manipulation and augmented reality db probing using hand tracking and Kinects à la Minority Report (now available via Smart Vision) to drill down to source articles, TV broadcasts, etc. These tools need to be extended to collaborative analysis teams. The Big Data summary stats of evolving news scream for this.

    2 · April 23

  • Nevin H.

    Amazing that the world is so small; imagine analyzing it from your desktop.

    1 · April 23

  • Steve

    Certainly a huge vat of information to absorb in so little time. The speaker definitely knows his stuff, but spoke way too quickly, and in hindsight I would prefer to have this event broken into smaller chunks. When I stopped to fully appreciate and contemplate one bullet point, I was already 1 slide behind the rest. All that being said, great information!

    1 · April 23

    • A former member

      Absolutely, any chance we can get the slides from the talk?

      April 23

  • Brand N.

    My Announcement at the Meetup was: The Federal Big Data Working Group Meetup meets on the first and third Tuesdays of the month in Tysons Corner and we are mentoring students and professionals with data science tutorials, preparation of presentations, and writing proposals.

    April 23

  • Brand N.

    Kudos to the organizers for hosting such a large group. The speaker should slow down and be more interactive with the audience and present some real data science results on "the whole world" like Facebook's analysis showing the average degree of separation is down from 6 to about 4.2, Recorded Future's analysis of protests and web intelligence, and Marc Smith's uses of NodeXL network graphs in treemaps to discover patterns and what might be done to change them. I also suggest the author look at the presentations we have had in the Federal Big Data Working Group Meetup on the state of the art in big graph computing.

    2 · April 23

  • Jim B.

    A complex subject made cogent. Unique insights and observations pointing out the challenges and tools for big analytics.

    2 · April 22

  • Rachael

    Interesting topic. Nice food. Well organized. I didn't like the 30 minutes of commercials at the beginning. Speaker went over time.

    April 22

  • Matthew R M.

    Speed talker but amazing amount of information and thought provoking research.

    1 · April 22

  • Greg T.

    Government Data is the most widely used type of data by folks attending tonight. Scientific/Medical Data is second followed closely by Social/Web Data.

    April 22

  • Valerie

    I would have liked to see less time spent on motivation (we didn't need to start with the formation of the earth) and more time spent in depth on the actual projects.

    5 · April 22

  • Greg T.

    About 1/3 of folks here tonight have worked with data sets larger than 1 Terabyte.

    April 22

  • MeL

    I can no longer make it and I'm wondering is there any chance this could be recorded and shared later?

    April 22

  • Harlan H.

    I've created a flier for this event. Please print copies and post on cork boards, office doors, or anywhere else!

    April 7, 2014

  • lee de c.

    not the whole world, but (just) the people; there's an environment out there, too.

    1 · April 7, 2014

  • Teferra A.

    I am interested in clinical informatics and data analysis. I feel that these two fields have great influence in changing the trend of medicine by improving patient care. Careful analysis of clinical data will lead to more personalized patient care, which is best at addressing patients individually based on their specific needs.

    April 7, 2014

165 went
