- <?xml version="1.0" encoding="utf-8"?>
- <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
- "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
- <book lang="en_US" xml:base="../">
- <title>Six Degrees of Kevin Bacon</title>
- <bookinfo>
- <title>Six Degrees of Kevin Bacon: ECL Programming Example</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/redswooshWithLogo3.jpg" />
- </imageobject>
- </mediaobject>
- <author>
- <surname>Boca Raton Documentation Team</surname>
- </author>
- <legalnotice>
- <para>We welcome your comments and feedback about this document via
- email to <email>docfeedback@hpccsystems.com</email></para>
- <para>Please include <emphasis role="bold">Documentation
- Feedback</emphasis> in the subject line and reference the document name,
- page numbers, and current Version Number in the text of the
- message.</para>
- <para>LexisNexis and the Knowledge Burst logo are registered trademarks
- of Reed Elsevier Properties Inc., used under license.</para>
- <para>HPCC Systems is a registered trademark of LexisNexis Risk Data
- Management Inc.</para>
- <para>Other products, logos, and services may be trademarks or
- registered trademarks of their respective companies.</para>
- <para>All names and example data used in this manual are fictitious. Any
- similarity to actual persons, living or dead, is purely
- coincidental.</para>
- <para></para>
- </legalnotice>
- <xi:include href="common/Version.xml" xpointer="FooterInfo"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <xi:include href="common/Version.xml" xpointer="DateVer"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <corpname>HPCC Systems</corpname>
- <xi:include href="common/Version.xml" xpointer="Copyright"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <mediaobject role="logo">
- <imageobject>
- <imagedata fileref="images/LN_Rightjustified.jpg" />
- </imageobject>
- </mediaobject>
- </bookinfo>
- <chapter id="Working_with_Data">
- <title><emphasis role="bold">Working with Data</emphasis></title>
- <sect1 id="Working_with_data_Intro" role="nobrk">
- <title>Introduction</title>
- <para>This exercise shows a methodology for extracting useful
- information from data. Finding interesting links and relationships in
- large or massive datasets is a typical use of the HPCC Systems High
- Performance Computing Cluster (HPCC) platform.</para>
- <para>In this example, we will download data files from the Internet
- Movie Database (IMDB) and demonstrate one technique to extract links and
- find relationships.</para>
- <para>Because the relationship between actors and movies is simple,
- everyone should understand the data intuitively. However, the data is
- comprehensive enough to provide a solid example and to inspire new users
- to gain the skills to attack their own real-world problems with an
- HPCC.</para>
- <para>In this example, we will:</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Download raw data files and supporting documentation about the
- data</para>
- </listitem>
- <listitem>
- <para>Analyze the data file to understand its format and
- contents</para>
- </listitem>
- <listitem>
- <para>Spray the file to a Data Refinery (Thor) cluster</para>
- </listitem>
- <listitem>
- <para>Examine the data and determine the pre-processing
- needed</para>
- </listitem>
- <listitem>
- <para>Pre-process the data to produce a new data file</para>
- </listitem>
- </itemizedlist>
- <para><informaltable colsep="1" frame="all">
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <colspec colwidth="52.60pt" />
- <colspec />
- <tbody>
- <row>
- <entry><inlinegraphic fileref="images/OSSgr3.png" /></entry>
- <entry>While this example will run on a single-node HPCC, you
- will see a dramatic difference in performance on a multi-node
- system. The true power of an HPCC is its ability to work on
- different portions of the data file in parallel. This is known
- as Massively Parallel Processing (MPP).</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
- </sect1>
- <sect1 id="Processing_the_data">
- <title>Processing the Data</title>
- <sect2 id="Get_a_data_file">
- <title><emphasis>We get a data file</emphasis></title>
- <para>The Internet Movie Database (IMDB) is a freely downloadable set
- of data files about movies.</para>
- <para>It can be downloaded in many formats, including plain text. The
- set includes approximately 48 files about actors, actresses, directors,
- producers, and other aspects of motion pictures.</para>
- <para>At roughly 400 MB, it is large enough to exercise an HPCC
- platform but not too big to download.</para>
- <para>The plain text data files are available from the following ftp
- sites:</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para><ulink
- url="ftp://ftp.fu-berlin.de/pub/misc/movies/database/">ftp://ftp.fu-berlin.de/pub/misc/movies/database/</ulink>
- (Germany)</para>
- </listitem>
- <listitem>
- <para><ulink
- url="ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/">ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/</ulink>
- (Finland)</para>
- </listitem>
- <listitem>
- <para><ulink
- url="ftp://ftp.sunet.se/pub/tv+movies/imdb/">ftp://ftp.sunet.se/pub/tv+movies/imdb/</ulink>
- (Sweden)</para>
- </listitem>
- </itemizedlist>
- <para>The files are compressed using gzip (GNU zip) to save space and
- bandwidth.</para>
- <para>We will focus initially on two of the larger data sets in the
- IMDB database:</para>
- <blockquote>
- <itemizedlist mark="bullet">
- <listitem>
- <para>The Actors Dataset (Approximately 4 million
- Records)</para>
- </listitem>
- <listitem>
- <para>The Actresses Dataset (Approximately 2 million
- Records)</para>
- </listitem>
- </itemizedlist>
- </blockquote>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Download the plain text data files
- (<emphasis>actors.list.gz</emphasis> and
- <emphasis>actresses.list.gz</emphasis>) to your local drive using
- any FTP client you choose.</para>
- </listitem>
- <listitem>
- <para>Extract the two data files (<emphasis>actors.list</emphasis>
- and <emphasis>actresses.list</emphasis>) using any gzip-compatible
- tool.</para>
- </listitem>
- </itemizedlist>
- </sect2>
- <sect2 id="Analyze_the_data">
- <title><emphasis>Analyze the data file to understand its format and
- its contents</emphasis></title>
- <para>Here is a sample of the data in the actors.list file from
- IMDB:</para>
- <para><programlisting>Koolout' Starks, Johnny Nothing Like the Holidays (2008) [Alexis' Thug] <35>
- Subtle Seduction (2008) [Officer Ward]
- The Godfather of Green Bay (2005) (as Johnny Starks) [Marcus] <18>
- La Chispa', Tony Caceria de judiciales (1997) <11>
- Violencia en la sierra (1995) [Victoriano] <4></programlisting></para>
- <para>Notice that the actors text file is structured as follows:</para>
- <para><programlisting>Blankline
- Actorname_i Moviename (year) [role] <listing position>
- Moviename (year) [role] <listing position>
- Moviename (year) [role] <listing position>
- :
- Blankline
- Actorname_j \t Moviename (year) [role] <listing position>
- :
- Blankline</programlisting></para>
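- <para>The structure above means that a movie line without an actor
- name belongs to the most recently named actor. As a minimal sketch of
- how such lines could be normalized in ECL, the code below splits each
- line at its first tab and carries the actor name down with ITERATE. The
- sample rows, record layouts, and transform names are invented for
- illustration and are not the code in the IMDB folder.</para>
- <para><programlisting>IMPORT Std;
- // Hypothetical sample lines; a tab separates the actor name from the movie
- RawRec := RECORD
-   STRING txt;
- END;
- rawLines := DATASET([{'Starks, Johnny\tSubtle Seduction (2008)'},
-                      {'\tNothing Like the Holidays (2008)'},
-                      {''},
-                      {'Chispa, Tony\tCaceria de judiciales (1997)'}], RawRec);
- ParsedRec := RECORD
-   STRING actorname;
-   STRING moviename;
- END;
- // Split each non-blank line at the first tab
- ParsedRec SplitLine(RawRec L) := TRANSFORM
-   tabPos := Std.Str.Find(L.txt, '\t', 1);
-   SELF.actorname := IF(tabPos > 0, L.txt[1..tabPos-1], '');
-   SELF.moviename := IF(tabPos > 0, TRIM(L.txt[tabPos+1..], LEFT, RIGHT), '');
- END;
- parsed := PROJECT(rawLines(txt != ''), SplitLine(LEFT));
- // Copy the previous actor name onto continuation lines
- ParsedRec FillActor(ParsedRec L, ParsedRec R) := TRANSFORM
-   SELF.actorname := IF(R.actorname = '', L.actorname, R.actorname);
-   SELF := R;
- END;
- OUTPUT(ITERATE(parsed, FillActor(LEFT, RIGHT)));
- </programlisting></para>
- <para>In the result, every movie row should carry the name of the actor
- it belongs to, which is the actor:movie shape that the later steps rely
- on.</para>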
- </sect2>
- <sect2 id="Load_the_Incoming_Data" role="brk">
- <title><emphasis>Load the Incoming Data File to your Landing
- Zone</emphasis></title>
- <para>In this step, you will copy the data files to a location from
- which they can be sprayed to your HPCC cluster. A Landing Zone is a
- storage location attached to your HPCC. It has a utility running to
- facilitate file spraying to a cluster.</para>
- <para>For smaller data files (up to 2 GB), you can use the
- upload/download file utility in ECL Watch. The sample data files are
- approximately 400 MB.</para>
- <para>Next you will distribute (or Spray) the dataset to all the nodes
- in the HPCC cluster. The power of the HPCC comes from its ability to
- assign multiple processors to work on different portions of the data
- file in parallel.</para>
- <orderedlist>
- <listitem>
- <para>Download the sample data files from the ftp sites as
- described in the previous section, if you have not done so
- already.</para>
- </listitem>
- <listitem>
- <para>Extract them to a folder on your local machine.</para>
- </listitem>
- <listitem>
- <para>In your browser, go to the <emphasis role="bold">ECL
- Watch</emphasis> URL. For example, http://nnn.nnn.nnn.nnn:8010,
- where nnn.nnn.nnn.nnn is your ESP Server's IP address.</para>
- <para><informaltable colsep="1" frame="all" rowsep="1">
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <colspec colwidth="49.50pt" />
- <colspec />
- <tbody>
- <row>
- <entry><inlinegraphic
- fileref="images/caution.png" /></entry>
- <entry>Your IP address could be different from the ones
- provided in the example images. Please use the IP
- address provided by <emphasis
- role="bold">your</emphasis> installation.</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>From the ECL Watch page, click on the <emphasis
- role="bold">Files</emphasis> icon, then on the <emphasis
- role="bold">Landing Zones</emphasis> link.</para>
- <para><figure>
- <title>Upload/download</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/LZimg03-1.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- <para>Once you click on the <emphasis
- role="bold">Upload</emphasis> file link, a file dialog
- displays.</para>
- <para></para>
- </listitem>
- <listitem>
- <para>Browse the files on your local machine, then use
- multi-select to choose the files to upload and then press the
- <emphasis role="bold">Open</emphasis> button.</para>
- <para>The files you selected should appear. The data files are
- named <emphasis>actors.list</emphasis> and
- <emphasis>actresses.list</emphasis>.</para>
- <figure>
- <title>Dropzones and Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_upload.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- <listitem>
- <para>Press the <emphasis role="bold">Start</emphasis> button to
- upload the files.</para>
- <para>You can monitor progress as it uploads.</para>
- <figure>
- <title>Upload Progress</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_uploadProgress.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- </orderedlist>
- </sect2>
- <sect2 id="Spray_the_Data_to_THOR">
- <title>Spray the Data File to your <emphasis>Data Refinery (Thor)
- Cluster</emphasis></title>
- <para>To use the data file in our HPCC system, we must "spray" it to
- all the nodes. A <emphasis>spray</emphasis> or
- <emphasis>import</emphasis> is the relocation of a data file from one
- location (such as a Landing Zone) to multiple file parts on nodes in a
- cluster.</para>
- <para>The distributed or sprayed file is given a
- <emphasis>logical-file-name</emphasis> as follows: <emphasis
- role="bold">~thor::in::IMDB::actors.list</emphasis>. The system
- maintains a list of logical files and the corresponding physical file
- locations of the file parts.</para>
- <para></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open ECL Watch using the following URL:</para>
- <para><emphasis role="bold">http://nnn.nnn.nnn.nnn:pppp</emphasis>
- (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is
- the port. The default port is 8010.)</para>
- </listitem>
- <listitem>
- <para>Click on the <emphasis role="bold">Files</emphasis> icon,
- then click the <emphasis role="bold">Landing Zones</emphasis> link
- from the navigation.</para>
- </listitem>
- <listitem>
- <para>Select the two files (actors.list and actresses.list), then
- press the <emphasis role="bold">Delimited</emphasis> button.</para>
- <para>The <emphasis role="bold">Spray Delimited</emphasis> dialog
- displays.</para>
- <para><figure>
- <title>Spray Delimited</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_01.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- <para></para>
- </listitem>
- <listitem>
- <para>Select mythor in the <emphasis role="bold">Group</emphasis>
- drop-list.</para>
- <para>The IP Address is automatically filled and the Local Path is
- partially filled with the default folder on your landing zone.
- Note: The VM and Community Edition typically have only one landing
- zone defined.</para>
- </listitem>
- <listitem>
- <para>Set the Name prefix to <emphasis
- role="bold">~thor::in::IMDB</emphasis>.</para>
- </listitem>
- <listitem>
- <para>Fill in the rest of the parameters (if they are not filled
- in already).</para>
- <para><itemizedlist>
- <listitem>
- <para>Max Record Length: 8192</para>
- </listitem>
- <listitem>
- <para>Separator: \,</para>
- </listitem>
- <listitem>
- <para>Line Terminator: \n,\r\n</para>
- </listitem>
- <listitem>
- <para>Quote: '</para>
- </listitem>
- </itemizedlist></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Make sure the <emphasis role="bold">Overwrite</emphasis> and
- <emphasis role="bold">Replicate</emphasis> boxes are checked.</para>
- <para><emphasis role="bold">Note:</emphasis> The Replicate option
- is only available on systems where replication has been
- enabled.</para>
- <para></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Press the <emphasis role="bold">Spray</emphasis>
- button.</para>
- <para>A tab opens for each file. On these tabs, you can monitor
- the progress of each DFU Spray.</para>
- <para><figure>
- <title>View Progress</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_03a.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <para>After both sprays are complete, we can query the logical
- files on the HPCC to see the files we sprayed.</para>
- </listitem>
- <listitem>
- <para>Click on the <emphasis role="bold">Logical Files</emphasis>
- link.</para>
- <para>The files display in the Logical Files list:</para>
- <para><figure>
- <title>Display Logical Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_05.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- </itemizedlist>
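- <para>The presence of the sprayed files can also be checked from ECL
- instead of the Logical Files page. The following is a small sketch that
- assumes the Name prefix <emphasis>~thor::in::IMDB</emphasis> was used
- as described above; the result names are arbitrary labels chosen for
- this illustration.</para>
- <para><programlisting>IMPORT Std;
- // Each output is TRUE when the logical file is known to the system
- OUTPUT(Std.File.FileExists('~thor::in::IMDB::actors.list'),
-        NAMED('ActorsSprayed'));
- OUTPUT(Std.File.FileExists('~thor::in::IMDB::actresses.list'),
-        NAMED('ActressesSprayed'));
- </programlisting></para>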
- </sect2>
- <sect2 id="Working_with_the_Data">
- <title>Working With the Data</title>
- <para>In this portion of the example, we will write ECL code to make
- sure we can read the sprayed data file. We will define and execute
- simple queries on it so we can evaluate it and determine any necessary
- pre-processing.</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Start the ECL IDE (Start >> All Programs >> HPCC
- Systems >> ECL IDE).</para>
- </listitem>
- <listitem>
- <para>Log in to your environment.</para>
- </listitem>
- <listitem>
- <para>Expand the <emphasis role="bold">examples</emphasis> ECL
- folder in the Repository toolbox.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Expand the <emphasis role="bold">IMDB </emphasis>folder
- inside.</para>
- <para>All the ECL files needed to complete this tutorial are
- located in the IMDB folder.</para>
- <figure>
- <title>IMDB ECL Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_06_new.jpg" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- <listitem>
- <para>Open the CleanActor ECL file and examine the code.</para>
- <para>This function cleans an actor name taken from the raw text
- file. The comments in the code describe each step:</para>
- <para><programlisting>IMPORT Std;
- EXPORT STRING CleanActor(STRING infld) := FUNCTION
- //this can be refined later
- s1 := Std.Str.FindReplace(infld, '\'',''); // replace apostrophe
- s2 := Std.Str.FindReplace(s1, '\t',''); //replace tabs
- s3 := Std.Str.FindReplace(s2, '----',''); // replace multiple -----
- return TRIM(s3, LEFT, RIGHT);
- END;
- </programlisting></para>
- </listitem>
- </itemizedlist>
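- <para>To see what CleanActor does to a name, it can be called on a
- literal string in a Builder window. The input below is made up for
- illustration; it combines apostrophes, a tab, and a run of dashes so
- that each replacement in the function has something to remove.</para>
- <para><programlisting>IMPORT IMDB;
- // Apostrophes, the tab, and the dashes are removed, then the result is trimmed
- OUTPUT(IMDB.CleanActor('\'Koolout\' Starks, Johnny\t----'));
- // Result: Koolout Starks, Johnny
- </programlisting></para>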
- <sect3 role="brk">
- <title>Examine the Data</title>
- <para>In this section, we will look at the data and determine if
- there is any pre-processing we want to perform. This is the step in
- the development process where we convert the raw data into a form we
- can actually use.</para>
- <variablelist>
- <varlistentry>
- <term>Note:</term>
- <listitem>
- <para>The IMDB.FileActors.ecl file specifies the size of the
- header in the files (actors.list and actresses.list). The
- HEADING() value in the example code was accurate at the time
- we downloaded the IMDB data, but could change at any time. We
- suggest opening each file in a text editor and checking the
- line number where the header ends and actual data begins (as
- shown below).</para>
- </listitem>
- </varlistentry>
- </variablelist>
- <figure>
- <title>actors.list in text editor</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_fileheading.jpg" />
- </imageobject>
- </mediaobject>
- </figure>
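- <para>For orientation, a HEADING() value appears inside the CSV options
- of a DATASET declaration. The sketch below is not the contents of
- IMDB.FileActors; the record layout, the pipe separator (chosen on the
- assumption that the character does not occur in the data), and the value
- 239 are placeholders. Replace 239 with the header line count you find
- in your own copy of the file.</para>
- <para><programlisting>// Sketch only: layout, separator, and HEADING value are assumptions
- RawRec := RECORD
-   STRING rawtext;
- END;
- rawActors := DATASET('~thor::in::IMDB::actors.list', RawRec,
-                      CSV(HEADING(239),       // hypothetical header line count
-                          SEPARATOR('|'),     // keeps each line in one field
-                          QUOTE(''),          // treat apostrophes as plain text
-                          TERMINATOR(['\n', '\r\n'])));
- // The first rows returned should be actor data, not header text
- OUTPUT(CHOOSEN(rawActors, 20));
- </programlisting></para>
- <para>If the first rows still show header text, or the first actors are
- missing, the HEADING() value needs adjusting.</para>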
- <para></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open a new Builder window (CTRL+N) and write the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- OUTPUT(IMDB.FileActors);
- </programlisting></para>
- </listitem>
- <listitem>
- <para>Press the syntax check button on the main toolbar (or
- press F7).</para>
- <para>It is always a good idea to check syntax before
- submitting.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Make sure the selected cluster is your
- <emphasis>thor</emphasis> cluster, then press the <emphasis
- role="bold">Submit</emphasis> button.</para>
- <para><figure>
- <title>Submit to Thor</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_10.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <para>When the Workunit completes it displays a green checkmark.
- <inlinegraphic fileref="images/DT173-15.jpg" /></para>
- <para><emphasis role="bold">Note:</emphasis> Depending on the
- size of your cluster and the speed of your server(s), this
- process could take several minutes. If you are running this on a
- virtual machine, it could take as long as 45 minutes to
- complete.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Select the Workunit tab (the one with the number and the
- checkmark next to it) and select the <emphasis
- role="bold">Result 1</emphasis> tab.</para>
- <para><figure>
- <title>Select Workunit</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_07.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Scroll down to see more records.</para>
- <para><figure>
- <title>See more records</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_08.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- </itemizedlist>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Close the Builder Window.</para>
- </listitem>
- </itemizedlist>
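- <para>Beyond scrolling through the result, a few simple queries can
- help decide what pre-processing is needed. The following is a minimal
- sketch; it assumes IMDB.FileActors exposes the fields
- <emphasis>actorname</emphasis> and <emphasis>moviename</emphasis>,
- which are the names the ActorsInMovies code filters on. The result
- names are arbitrary labels.</para>
- <para><programlisting>IMPORT IMDB;
- // How many records were read in total
- OUTPUT(COUNT(IMDB.FileActors), NAMED('TotalRecords'));
- // How many records lack an actor name or a movie name
- OUTPUT(COUNT(IMDB.FileActors(actorname = '' OR moviename = '')),
-        NAMED('IncompleteRecords'));
- // A small sample of complete records
- OUTPUT(CHOOSEN(IMDB.FileActors(actorname != '' AND moviename != ''), 100),
-        NAMED('CompleteSample'));
- </programlisting></para>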
- </sect3>
- </sect2>
- <sect2 id="Processing_the_Data_E-T-L">
- <title><emphasis role="bold">Processing the Data: Extract,
- </emphasis><emphasis>Transform, and Load</emphasis></title>
- <para><emphasis>In this section, we will write code to transform the
- original actor data as follows:</emphasis></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>From the raw actors data, we will do an ETL operation
- (Extract, Transform, Load) to build an <emphasis
- role="bold">actor_movie </emphasis>relation set.</para>
- </listitem>
- <listitem>
- <para>We will also construct a Kevin Bacon degrees of separation
- lookup set. This is the structure we will query to answer the
- question:</para>
- </listitem>
- </itemizedlist>
- <para><emphasis>How many degrees of separation exist between Actor X
- and Kevin Bacon?</emphasis></para>
- <para></para>
- <para></para>
- <para></para>
- <para><emphasis role="bold">For example: </emphasis>Using Jon Lovitz
- as the actor, we want information as follows:</para>
- <para>Jon Lovitz → (was in) Movie X → (with) Actor2 → (who was in)
- Movie Y → (with) Kevin Bacon</para>
- <para>We will then write this new file to our Thor cluster so it can
- be used in parameterized queries.</para>
- <para></para>
- <para></para>
- <para><itemizedlist mark="bullet">
- <listitem>
- <para>In the ECL IDE, go to the Repository panel and expand the
- IMDB folder.</para>
- </listitem>
- <listitem>
- <para>Open the ECL File ActorsInMovies.</para>
- <para>The code in this ECL file looks like this:</para>
- </listitem>
- </itemizedlist></para>
- <para><programlisting>/* ******************************************************************************
- ## Copyright 2011 HPCC Systems. All rights reserved.
- ******************************************************************************* */
- /**
- * Produce a slimmed down version of the IMDB actor AND actress files to
- * permit more efficient join operations.
- * Filter out the movie records we do not want in building our KBacon Number sets.
- *
- */
- IMPORT $ AS IMDB;
- IMPORT Std;
- // Filter out TV movies, Videos AND some documentary type collections
- ds_IMDB := IMDB.FileActors(actorname!='' AND moviename != '' AND
- Std.Str.Find(moviename,'Boffo',1) = 0 AND
- Std.Str.Find(moviename,'Slasher Film',1) = 0 AND
- movie_type != 'Video' AND isTVseries = 'N' AND
- movie_type != 'For TV');
- //Slim the records down to bare essentials for searching AND joining
- slim_IMDB_rec := RECORD
- STRING50 actor;
- STRING150 movie;
- END;
- slim_IMDB_rec slim_it(ds_IMDB L):= TRANSFORM
- SELF.actor := Std.Str.FindReplace(L.actorname,'(I)','');
- SELF.movie := L.moviename;
- END;
- IMDB_names := PROJECT(ds_IMDB, slim_it(LEFT));
- EXPORT ActorsInMovies := IMDB_names : PERSIST('~temp::IMDB::ActorsInMovies');
- </programlisting></para>
- <para>This defines a relation that pairs each actor with a movie
- (actor:movie). We will use this definition later.</para>
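- <para>As a quick check of the actor:movie relation, the number of
- movies per actor can be tabulated. This is only a sketch: the names
- <emphasis>perActor</emphasis> and <emphasis>movie_cnt</emphasis> are
- invented for illustration, and the first run also builds the persisted
- file, so it may take some time.</para>
- <para><programlisting>IMPORT IMDB;
- // One row per actor with the number of movie records for that actor
- perActor := TABLE(IMDB.ActorsInMovies,
-                   {actor, UNSIGNED4 movie_cnt := COUNT(GROUP)},
-                   actor);
- // Show the 20 actors with the most movie records
- OUTPUT(TOPN(perActor, 20, -movie_cnt), NAMED('MostCredits'));
- </programlisting></para>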
- </sect2>
- </sect1>
- <sect1 id="Getting_Useful_Info_from_Data">
- <title>Getting Useful Information from Data</title>
- <sect2 id="Links_and_Degrees_of_Separation">
- <title><emphasis>Links and Degrees of Separation</emphasis></title>
- <para>Now that we have our data in a useful format, have a relation
- defined, and the file is in place, we can write code to use the new
- data file.</para>
- <para>We want to know how many actors are a distance
- <emphasis>N</emphasis> from Kevin Bacon. To accomplish this, we will
- construct sets of actors grouped by their Kevin Bacon number, that is,
- by the number of costar links that separate them from Kevin
- Bacon.</para>
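- <para>The idea can be tried on a handful of rows before running it on
- the full data. The sketch below uses invented actors A through D and
- movies M1 through M3, with A standing in for Kevin Bacon. It applies
- the same pattern of DEDUP and SET filters used for the real sets, which
- is practical here only because the toy sets are tiny.</para>
- <para><programlisting>// Toy data, invented for illustration
- PairRec := RECORD
-   STRING actor;
-   STRING movie;
- END;
- toyPairs := DATASET([{'A', 'M1'}, {'B', 'M1'}, {'B', 'M2'},
-                      {'C', 'M2'}, {'D', 'M3'}], PairRec);
- // Distance 0: movies of the starting actor
- seedMovies := DEDUP(toyPairs(actor = 'A'), movie, ALL);
- // Distance 1: everyone else who appeared in those movies
- coStars := DEDUP(toyPairs(movie IN SET(seedMovies, movie) AND actor != 'A'),
-                  actor, ALL);
- // Movies of the costars, excluding the starting actor's own movies
- coStarMovies := DEDUP(toyPairs(actor IN SET(coStars, actor) AND
-                                movie NOT IN SET(seedMovies, movie)),
-                       movie, ALL);
- // Distance 2: people in those movies who are not already counted
- co2Stars := DEDUP(toyPairs(movie IN SET(coStarMovies, movie) AND
-                            actor NOT IN SET(coStars, actor) AND
-                            actor != 'A'),
-                   actor, ALL);
- OUTPUT(coStars, NAMED('Distance1'));   // one row, for actor B
- OUTPUT(co2Stars, NAMED('Distance2'));  // one row, for actor C
- </programlisting></para>
- <para>Actor D shares no movie with anyone linked to A, so D appears in
- neither set.</para>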
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open the KevinBaconNumberSets ECL file.</para>
- </listitem>
- </itemizedlist>
- <para>This ECL code counts the number of actors with <emphasis>"Bacon
- numbers"</emphasis> from 1 through 7, that is, up to 7 levels of
- separation. We will use this later to build an index for
- searches.</para>
- <para><programlisting>/* ******************************************************************************
- ATTRIBUTE PURPOSE:
- Produce a series of sets for Actors and Movies that are: distance-0
- away (KBacon's own movies), distance-1 away (KBacon's Costars' Movies),
- distance-2 away (Movies of Costars of Costars), etc., all the way up to level 7
-
- The nested attributes below are shown here together for the benefit of the reader.
-
- Notes on variable naming convention used for costars and movies
- KBMovies : Movies Kevin Bacon Worked in (distance 0)
- KBCoStars : Stars who worked in KBMovies (distance 1)
- KBCoStarMovies : Movies worked in by KBCoStars
- except KBMovies (distance 1)
- KBCo2Stars : Stars(Actors) who worked in KBCoStarMovies (distance 2)
- KBCo2StarMovies : Movies worked in by KBCo2Stars
- except KBCoStarMovies (distance 2)
- KBCo3Stars : Stars(Actors) who worked in KBCo2StarMovies (distance 3)
- KBCo3StarMovies : Movies worked in by KBCo3Stars
- except KBCo2StarMovies (distance 3)
- etc..
- ******************************************************************************* */
- IMPORT Std;
- IMPORT IMDB;
- EXPORT KevinBaconNumberSets := MODULE
- // Constructing a proper name match function is an art within itself
- // For simplicity we will define a name as matching if both first and last name
- //are found within the string
- NameMatch(string full_name, string fname,string lname) :=
- Std.Str.Find(full_name,fname,1) > 0 AND
- Std.Str.Find(full_name,lname,1) > 0;
- //------ Get KBacon Movies
- AllKBEntries := IMDB.ActorsInMovies(NameMatch(actor,'Kevin','Bacon'));
- EXPORT KBMovies := DEDUP(AllKBEntries, movie, ALL); // Each movie should ONLY occur once
- //------ Get KBacon CoStars
- CoStars := IMDB.ActorsInMovies(Movie IN SET(KBMovies,Movie));
- EXPORT KBCoStars := DEDUP( CoStars(actor<>'Kevin Bacon'), actor, ALL);
- //------ Get KBacon Costars' Movies
- // CSM = First find all of the movies that a KBCoStar has been in
- CSM := DEDUP(JOIN(IMDB.ActorsInMovies,KBCoStars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie,ALL);
- // Now we need to remove all of those that KB was in himself
- // We can use a set; KB has not been in (quite!) that many movies
- EXPORT KBCoStarMovies := CSM(movie NOT IN SET(KBMovies,movie));
- //------ Bacon # 2 Actors
- // To be a Co2Star of Kevin Bacon you must have appeared in a movie with a
- //CoStar of Kevin Bacon
- // This corresponds to having a Bacon number of 2
- // We are now getting towards the expensive part of the process
- KBCo2S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCoStarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- // KBCo2S = ALL Actors appearing in Movies of KBacon's CoActors
- // The above is all the people in the movies; but some will have been co-stars of KB
- //directly - these must be removed
- // The LEFT ONLY join removes items in one list from another
- EXPORT KBCo2Stars := JOIN(KBCo2S, KBCoStars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LEFT ONLY);
- //------- bacon # 2 Movies
- // Co2SM = what movies have all the Co2Stars been in?
- Co2SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co2SM = ALL Movies KBCo2Stars have been in
- // Of course some of these movies will have CoStars in too and thus will already have
- //been listed. Note this list will not contain any Kevin Bacon movies OR the movie would
- //have been reached earlier!
- Export KBCo2StarMovies := JOIN(Co2SM, KBCoStarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //------ bacon #3 Actors
- // Find people with a Bacon number of 3
- // This code is very similar to KBCo2Stars; one might be tempted to common up into a
- // function or macro. However it is worth looking at the attribute counts first; we may be
- // down to a small enough set that we can start using in-memory functions (e.g.,SET) again.
- KBCo3S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- // KBCo3S = ALL CoStars in KBCo2Star Movies
- // The above is all the people in the movies; but some will have been co2stars of KB
- // directly - these must be removed. The LEFT ONLY join removes items in one list from
- // another. There should not be any direct CoStars in this list (or the movie would have
- // been a CoStarMovie not a CoCoStarMovie)
- EXPORT KBCo3Stars := JOIN(KBCo3S, KBCo2Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #3 Movies
- // So what movies have all the KBCo3Stars been in?
- Co3SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co3SM = ALL Movies KBCo3Stars have been in
- // Of course some of these movies will have KBCo2Stars in too and thus will already have
- // been listed. Note We ONLY have to remove one level back from the list; previous levels
- // cannot be reached by definition
- EXPORT KBCo3StarMovies := JOIN(Co3SM, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //------bacon #4 Actors
- KBCo4S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- EXPORT KBCo4Stars := JOIN(KBCo4S, KBCo3Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #4 Movies
- // So what movies have all the Co4Stars been in?
- Co4SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co4SM = ALL Movies KBCo4Stars have been in
- // Of course some of these movies will have Co3Stars in too and thus will already have
- // been listed. Note We ONLY have to remove one level back from the list; previous levels
- // cannot be reached by definition
- EXPORT KBCo4StarMovies := JOIN(Co4SM, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #5 Stars
- KBCo5S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- EXPORT KBCo5Stars := JOIN(KBCo5S, KBCo4Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #5 Movies
- Co5SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo5Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie,ALL);
- EXPORT KBCo5StarMovies := JOIN(Co5SM, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #6 Stars
- // Find people with a Bacon number of 6
- // KBCo5 is getting small again - can move back down to the SET?
- KBCo6S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo5StarMovies, movie)),
- actor, ALL);
- EXPORT KBCo6Stars := JOIN(KBCo6S, KBCo5Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #6 Movies
- Co6SM := DEDUP(IMDB.ActorsInMovies(actor IN SET(KBCo6Stars, actor)), movie, ALL);
- EXPORT KBCo6StarMovies := Co6SM(movie NOT IN SET(KBCo5StarMovies, movie));
- //----- bacon #7 Stars
- // Find people with a Bacon number of 7
- KBCo7S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo6StarMovies,movie)), actor, ALL);
- EXPORT KBCo7Stars := KBCo7S(actor NOT IN SET(KBCo6Stars, actor));
- //----- We just have to count them all !! (How many holes in Albert Hall?)
- EXPORT doCounts := PARALLEL(
- OUTPUT(COUNT(KBMovies), NAMED('KBMovies')),
- OUTPUT(COUNT(KBCoStars), NAMED('KBCoStars')),
- OUTPUT(COUNT(KBCoStarMovies), NAMED('KBCoStarMovies')),
- OUTPUT(COUNT(KBCo2Stars), NAMED('KBCo2Stars')),
- OUTPUT(COUNT(KBCo2StarMovies), NAMED('KBCo2StarMovies')),
- OUTPUT(COUNT(KBCo3Stars), NAMED('KBCo3Stars')),
- OUTPUT(COUNT(KBCo3StarMovies), NAMED('KBCo3StarMovies')),
- OUTPUT(COUNT(KBCo4Stars), NAMED('KBCo4Stars')),
- OUTPUT(COUNT(KBCo4StarMovies), NAMED('KBCo4StarMovies')),
- OUTPUT(COUNT(KBCo5Stars), NAMED('KBCo5Stars')),
- OUTPUT(COUNT(KBCo5StarMovies), NAMED('KBCo5StarMovies')),
- OUTPUT(COUNT(KBCo6Stars), NAMED('KBCo6Stars')),
- OUTPUT(COUNT(KBCo6StarMovies), NAMED('KBCo6StarMovies')),
- OUTPUT(COUNT(KBCo7Stars), NAMED('KBCo7Stars')),
- OUTPUT(KBCo7Stars)
- );
- END;</programlisting></para>
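- <para>The Stars steps in the listing above all repeat one pattern: find every
- actor in the previous level's movies, then remove the actors already found
- one level back. As a minimal sketch only, that pattern could be wrapped in a
- function. <emphasis>NextStars</emphasis> and <emphasis>Layout</emphasis> are
- hypothetical names that are not part of the supplied IMDB folder, and the
- sketch assumes that every level's dataset shares the record layout of
- IMDB.ActorsInMovies:</para>
- <para><programlisting>IMPORT IMDB;
- // Hypothetical helper; not part of the tutorial's IMDB folder.
- // Assumes IMDB.ActorsInMovies has the fields actor and movie, as used above.
- Layout := RECORDOF(IMDB.ActorsInMovies);
- // Returns the actors who appear in prevMovies but are not in prevStars
- NextStars(DATASET(Layout) prevMovies, DATASET(Layout) prevStars) := FUNCTION
- // Everyone who appeared in any of the previous level's movies
- inMovies := DEDUP(JOIN(IMDB.ActorsInMovies, prevMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- // Drop the actors that were already reached one level back
- RETURN JOIN(inMovies, prevStars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LEFT ONLY);
- END;</programlisting></para>
- <para>Under those assumptions, calling NextStars(KBCo3StarMovies, KBCo3Stars)
- from inside the module would be expected to return the same rows as
- KBCo4Stars. The listing keeps the steps separate because, as its comments
- note, the smaller levels may be able to use in-memory functions such as SET
- instead of a JOIN.</para>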
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open a new Builder Window and type:</para>
- </listitem>
- </itemizedlist>
- <para><programlisting>IMPORT IMDB;
- IMDB.KevinBaconNumberSets.doCounts;</programlisting></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Check the syntax then press the <emphasis
- role="bold">Submit</emphasis> button.</para>
- <para><emphasis role="bold">Note:</emphasis> Depending on the size
- of your cluster and the speed of your server(s), this process
- could take several minutes. If you are running this on a virtual
- machine, it could take as long as an hour to complete.</para>
- </listitem>
- <listitem>
- <para>When the process completes, each row shown below appears on
- its own result tab. Your output should resemble the
- following:</para>
- <para><emphasis role="bold">Note:</emphasis> The data files for
- this tutorial change frequently, so your results may vary from those
- shown in this document.</para>
- </listitem>
- </itemizedlist>
- <para><informaltable>
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <tbody>
- <row>
- <entry>KB Movies</entry>
- <entry>71</entry>
- </row>
- <row>
- <entry>KB Co Stars</entry>
- <entry>3520</entry>
- </row>
- <row>
- <entry>KB Co Star Movies</entry>
- <entry>33504</entry>
- </row>
- <row>
- <entry>KB Co 2 Stars</entry>
- <entry>430145</entry>
- </row>
- <row>
- <entry>KB Co 2 Star Movies</entry>
- <entry>251867</entry>
- </row>
- <row>
- <entry>KB Co 3 Stars</entry>
- <entry>896009</entry>
- </row>
- <row>
- <entry>KB Co 3 Star Movies</entry>
- <entry>51650</entry>
- </row>
- <row>
- <entry>KB Co 4 Stars</entry>
- <entry>102729</entry>
- </row>
- <row>
- <entry>KB Co 4 Star Movies</entry>
- <entry>2634</entry>
- </row>
- <row>
- <entry>KB Co 5 Stars</entry>
- <entry>6080</entry>
- </row>
- <row>
- <entry>KB Co 5 Star Movies</entry>
- <entry>190</entry>
- </row>
- <row>
- <entry>KB Co 6 Stars</entry>
- <entry>450</entry>
- </row>
- <row>
- <entry>KB Co 6 Star Movies</entry>
- <entry>14</entry>
- </row>
- <row>
- <entry>KB Co 7 Stars</entry>
- <entry>22</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
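- <para>The counts alone do not show who is in each set. As a hedged example,
- a single exported set can be sampled from a new Builder Window. The row
- limit of 25 and the result name <emphasis>KBCo6StarsSample</emphasis> are
- arbitrary choices made for this illustration:</para>
- <para><programlisting>IMPORT IMDB;
- // Show the first 25 rows of the actors with a Bacon number of 6
- OUTPUT(CHOOSEN(IMDB.KevinBaconNumberSets.KBCo6Stars, 25), NAMED('KBCo6StarsSample'));</programlisting></para>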
- </sect2>
- </sect1>
- </chapter>
- <chapter id="Next_Steps">
- <title><emphasis role="bold">Next Steps</emphasis></title>
- <para>Now that you have successfully processed the data and established
- links, what's next?</para>
- <para>Two more ECL files are included in the IMDB folder that you can use
- in conjunction with the examples you have already worked through in this
- tutorial:</para>
- <para>• KeysKevinBacon -- Builds an index of actors/actresses and the
- movies they have starred in.</para>
- <para>You must build this index before you can run queries to find the
- degree of separation between Kevin Bacon and an actor of your
- choice.</para>
- <para>To build the index, open a builder window and type the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- IMDB.KeysKevinBacon.BuildAll;</programlisting></para>
- <para>Press the <emphasis role="bold">Submit</emphasis> button to run the
- ECL code and build the index.</para>
- <para>• SearchKevinBaconLinks -- Searches the index you built to give you
- the degree of separation between an actor and Kevin Bacon.</para>
- <para>For example, to find the degree of separation between Kevin Bacon
- and Andi Everingham, open a builder window and type the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- IMDB.SearchKevinBaconLinks('Everingham, Andi');</programlisting></para>
- <para>Make sure the selected cluster is your <emphasis>hThor</emphasis>
- cluster, then press the <emphasis role="bold">Submit</emphasis> button to
- run the query.</para>
- <para>When it has completed, click on the Workunit ID tab.</para>
- <para>Two results are shown.</para>
- <para><emphasis role="bold">Result1</emphasis> shows the degree of
- separation between the actor and Kevin Bacon.</para>
- <para>Interpret the results as follows:</para>
- <para>Actor is at level 1 - The actor you chose and Kevin Bacon starred in
- a movie together.</para>
- <para>Actor is at level 2 - The actor you chose starred in a movie with an
- actor who starred in a movie with Kevin Bacon.</para>
- <para>The higher the level, the greater the degree of separation between
- the actor you chose and Kevin Bacon.</para>
- <para>In this example, the actor is at level 6, indicating that there are
- 6 degrees of separation between Andi Everingham and Kevin Bacon.</para>
- <para><emphasis role="bold">Result2</emphasis> shows the level (degree of
- separation), the name of the actor, and the movie they starred in.</para>
- <para>Each line shows an actor and a movie they starred in. Read together,
- the lines form the chain that links the actor you chose to Kevin Bacon.</para>
- <para>Have fun finding the degrees of separation between any actor and
- Kevin Bacon.</para>
- <para>Remember to build the index first.</para>
- </chapter>
- </book>
|