- <?xml version="1.0" encoding="utf-8"?>
- <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
- "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
- <book lang="en_US" xml:base="../">
- <title>Six Degrees of Kevin Bacon</title>
- <bookinfo>
- <title>Six Degrees of Kevin Bacon: ECL Programming Example</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/redswooshWithLogo3.jpg" />
- </imageobject>
- </mediaobject>
- <author>
- <surname>Boca Raton Documentation Team</surname>
- </author>
- <legalnotice>
- <para>We welcome your comments and feedback about this document via
- email to <email>docfeedback@hpccsystems.com</email></para>
- <para>Please include <emphasis role="bold">Documentation
- Feedback</emphasis> in the subject line and reference the document name,
- page numbers, and current Version Number in the text of the
- message.</para>
- <para>LexisNexis and the Knowledge Burst logo are registered trademarks
- of Reed Elsevier Properties Inc., used under license.</para>
- <para>HPCC Systems<superscript>®</superscript> is a registered trademark
- of LexisNexis Risk Data Management Inc.</para>
- <para>Other products, logos, and services may be trademarks or
- registered trademarks of their respective companies.</para>
- <para>All names and example data used in this manual are fictitious. Any
- similarity to actual persons, living or dead, is purely
- coincidental.</para>
- <para></para>
- </legalnotice>
- <xi:include href="common/Version.xml" xpointer="FooterInfo"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <xi:include href="common/Version.xml" xpointer="DateVer"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <corpname>HPCC Systems<superscript>®</superscript></corpname>
- <xi:include href="common/Version.xml" xpointer="Copyright"
- xmlns:xi="http://www.w3.org/2001/XInclude" />
- <mediaobject role="logo">
- <imageobject>
- <imagedata fileref="images/LN_Rightjustified.jpg" />
- </imageobject>
- </mediaobject>
- </bookinfo>
- <chapter id="Working_with_Data">
- <title>Working with Data</title>
- <sect1 id="Working_with_data_Intro" role="nobrk">
- <title>Introduction</title>
- <para>This exercise shows a methodology for extracting useful information
- from data. Finding interesting links and relationships in large or
- massive datasets is a typical use of the HPCC Systems High Performance
- Computing Cluster (HPCC) platform.</para>
- <para>In this example, we will download the data files from the Internet
- Movie Database (IMDB) and see one technique to extract links and find
- relationships.</para>
- <para>Because the concept of actors and movies is simple, everyone
- should understand the data and relationships intuitively. However, the
- data is comprehensive enough to provide a solid example and inspiration
- for new users to gain the skills to attack their own real-world problems
- with an HPCC.</para>
- <para>In this example, we will:</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Download raw data files and supporting documentation about the
- data</para>
- </listitem>
- <listitem>
- <para>Analyze the data file to understand its format and
- contents</para>
- </listitem>
- <listitem>
- <para>Spray the file to a Data Refinery (Thor) cluster</para>
- </listitem>
- <listitem>
- <para>Examine the data and determine the pre-processing
- needed</para>
- </listitem>
- <listitem>
- <para>Pre-process the data to produce a new data file</para>
- </listitem>
- </itemizedlist>
- <para><informaltable colsep="1" frame="all">
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <colspec colwidth="52.60pt" />
- <colspec />
- <tbody>
- <row>
- <entry><inlinegraphic fileref="images/OSSgr3.png" /></entry>
- <entry>While this example will run on a single-node HPCC, you
- will see a dramatic difference in performance on a multi-node
- system. The true power of an HPCC is its ability to work on
- different portions of the data file in parallel. This is known
- as Massively Parallel Processing (MPP).</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
- </sect1>
- <sect1 id="Processing_the_data">
- <title>Processing the Data</title>
- <sect2 id="Get_a_data_file">
- <title><emphasis>We get a data file</emphasis></title>
- <para>The Internet Movie Database (IMDB) is a freely downloadable set
- of data files about movies.</para>
- <para>It can be downloaded in many formats, including text file
- format. The set includes approximately 48 files about Actors,
- Actresses, Directors, Producers, and other aspects of motion
- pictures.</para>
- <para>At approximately 400 MB, the set is large enough to exercise an
- HPCC platform but small enough to download easily.</para>
- <para>The plain text data files are available from the following ftp
- sites:</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para><ulink
- url="ftp://ftp.fu-berlin.de/pub/misc/movies/database/">ftp://ftp.fu-berlin.de/pub/misc/movies/database/</ulink>
- (Germany)</para>
- </listitem>
- <listitem>
- <para><ulink
- url="ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/">ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/</ulink>
- (Finland)</para>
- </listitem>
- <listitem>
- <para><ulink
- url="ftp://ftp.sunet.se/pub/tv+movies/imdb/">ftp://ftp.sunet.se/pub/tv+movies/imdb/</ulink>
- (Sweden)</para>
- </listitem>
- </itemizedlist>
- <para>The files are compressed using gzip (GNU zip) to save space and
- bandwidth.</para>
- <para>We will focus initially on two of the larger data sets in the
- IMDB database:</para>
- <blockquote>
- <itemizedlist mark="bullet">
- <listitem>
- <para>The Actors Dataset (Approximately 4 million
- Records)</para>
- </listitem>
- <listitem>
- <para>The Actresses Dataset (Approximately 2 million
- Records)</para>
- </listitem>
- </itemizedlist>
- </blockquote>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Download the plain text data files
- (<emphasis>actors.list.gz</emphasis> and
- <emphasis>actresses.list.gz</emphasis>) to your local drive using
- any FTP client you choose.</para>
- </listitem>
- <listitem>
- <para>Extract the two data files (<emphasis>actors.list</emphasis>
- and <emphasis>actresses.list</emphasis>) using any gzip-compatible
- tool.</para>
- </listitem>
- </itemizedlist>
- </sect2>
- <sect2 id="Analyze_the_data">
- <title><emphasis>Analyze the data file to understand its format and
- its contents</emphasis></title>
- <para>Here is a sample of the data in the actors.list file from
- IMDB:</para>
- <para><programlisting>Koolout' Starks, Johnny Nothing Like the Holidays (2008) [Alexis' Thug] <35>
- Subtle Seduction (2008) [Officer Ward]
- The Godfather of Green Bay (2005) (as Johnny Starks) [Marcus] <18>
- La Chispa', Tony Caceria de judiciales (1997) <11>
- Violencia en la sierra (1995) [Victoriano] <4></programlisting></para>
- <para>Notice that the actors text file is structured as follows:</para>
- <para><programlisting>Blankline
- Actorname_i Moviename (year) [role] <listing position>
- Moviename (year) [role] <listing position>
- Moviename (year) [role] <listing position>
- :
- Blankline
- Actorname_j \t Moviename (year) [role] <listing position>
- :
- Blankline</programlisting></para>
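- <para>The layout above implies that only the first line of each group
- carries the actor name, so a parser has to carry that name forward onto
- the continuation lines. The following is a minimal sketch of that idea
- in ECL, not the code used by this tutorial. It assumes that a tab
- separates the actor name from the movie and that continuation lines
- begin with tabs. The names RawRec, ParsedRec, SplitLine, and CarryActor
- are hypothetical, and the two inline lines are modelled on the sample
- above.</para>
- <para><programlisting>IMPORT Std;
- // Hypothetical layout: one field holding a whole raw line
- RawRec := RECORD
-   STRING rawtext;
- END;
- // Two inline lines modelled on the sample above (tab positions are assumed)
- sampleLines := DATASET([{'Starks, Johnny\tNothing Like the Holidays (2008)'},
-                         {'\t\t\tSubtle Seduction (2008)'}], RawRec);
- ParsedRec := RECORD
-   STRING actor;
-   STRING movie;
- END;
- // Split each line at the first tab; continuation lines get an empty actor
- ParsedRec SplitLine(RawRec L) := TRANSFORM
-   tabPos := Std.Str.Find(L.rawtext, '\t', 1);
-   SELF.actor := IF(tabPos > 1, L.rawtext[1..tabPos-1], '');
-   SELF.movie := TRIM(Std.Str.FindReplace(L.rawtext[tabPos+1..], '\t', ''), LEFT, RIGHT);
- END;
- splitLines := PROJECT(sampleLines, SplitLine(LEFT));
- // Carry the most recent actor name forward onto continuation lines
- ParsedRec CarryActor(ParsedRec L, ParsedRec R) := TRANSFORM
-   SELF.actor := IF(R.actor = '', L.actor, R.actor);
-   SELF := R;
- END;
- OUTPUT(ITERATE(splitLines, CarryActor(LEFT, RIGHT)));
- </programlisting></para>
- <para>ITERATE is used here because it processes records in order and
- passes the previous result record to each call, which is what lets the
- actor name flow from one line to the next.</para>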
- </sect2>
- <sect2 id="Load_the_Incoming_Data" role="brk">
- <title><emphasis>Load the Incoming Data File to your Landing
- Zone</emphasis></title>
- <para>In this step, you will copy the data files to a location from
- which they can be sprayed to your HPCC cluster. A Landing Zone is a
- storage location attached to your HPCC. It has a utility running to
- facilitate file spraying to a cluster.</para>
- <para>For smaller data files (up to 2 GB), you can use the
- upload/download file utility in ECL Watch. The sample data files are
- approximately 400 MB.</para>
- <para>Next you will distribute (or Spray) the dataset to all the nodes
- in the HPCC cluster. The power of the HPCC comes from its ability to
- assign multiple processors to work on different portions of the data
- file in parallel.</para>
- <orderedlist>
- <listitem>
- <para>Download the sample data files from the ftp sites as
- described in the previous section, if you have not done so
- already.</para>
- </listitem>
- <listitem>
- <para>Extract them to a folder on your local machine.</para>
- </listitem>
- <listitem>
- <para>In your browser, go to the <emphasis role="bold">ECL
- Watch</emphasis> URL. For example, http://nnn.nnn.nnn.nnn:8010,
- where nnn.nnn.nnn.nnn is your ESP Server's IP address.</para>
- <para><informaltable colsep="1" frame="all" rowsep="1">
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <colspec colwidth="49.50pt" />
- <colspec />
- <tbody>
- <row>
- <entry><inlinegraphic
- fileref="images/caution.png" /></entry>
- <entry>Your IP address could be different from the ones
- provided in the example images. Please use the IP
- address provided by <emphasis
- role="bold">your</emphasis> installation.</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>From the ECL Watch page, click the <emphasis
- role="bold">Files</emphasis> icon, then click the <emphasis
- role="bold">Landing Zones</emphasis> link.</para>
- <para><figure>
- <title>Upload/download</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/LZimg03-1.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- <para>Click the <emphasis
- role="bold">Upload</emphasis> link. A file dialog
- displays.</para>
- <para></para>
- </listitem>
- <listitem>
- <para>Browse the files on your local machine, then use
- multi-select to choose the files to upload and then press the
- <emphasis role="bold">Open</emphasis> button.</para>
- <para>The files you selected should appear. The data files are
- named <emphasis>actors.list</emphasis> and
- <emphasis>actresses.list</emphasis>.</para>
- <figure>
- <title>Dropzones and Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_upload.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- <listitem>
- <para>Press the <emphasis role="bold">Start</emphasis> button to
- upload the files.</para>
- <para>You can monitor progress as the files upload.</para>
- <figure>
- <title>Upload Progress</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_uploadProgress.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- </orderedlist>
- </sect2>
- <sect2 id="Spray_the_Data_to_THOR">
- <title>Spray the Data File to your <emphasis>Data Refinery (Thor)
- Cluster</emphasis></title>
- <para>To use the data file in our HPCC system, we must "spray" it to
- all the nodes. A <emphasis>spray</emphasis> or
- <emphasis>import</emphasis> is the relocation of a data file from one
- location (such as a Landing Zone) to multiple file parts on nodes in a
- cluster.</para>
- <para>The distributed or sprayed file is given a
- <emphasis>logical-file-name</emphasis>, for example <emphasis
- role="bold">~thor::in::IMDB::actors.list</emphasis>. The system
- maintains a list of logical files and the corresponding physical file
- locations of the file parts.</para>
- <para></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open ECL Watch using the following URL:</para>
- <para><emphasis role="bold">http://nnn.nnn.nnn.nnn:pppp</emphasis>
- (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the
- port. The default port is 8010.)</para>
- </listitem>
- <listitem>
- <para>Click on the <emphasis role="bold">Files</emphasis> icon,
- then click the <emphasis role="bold">Landing Zones</emphasis> link
- from the navigation.</para>
- </listitem>
- <listitem>
- <para>Select the two files (actors.list and actresses.list), then
- press the <emphasis role="bold">Delimited</emphasis> button.</para>
- <para>The <emphasis role="bold">Spray Delimited</emphasis> dialog
- displays.</para>
- <para><figure>
- <title>Spray Delimited</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_01.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- <para></para>
- </listitem>
- <listitem>
- <para>Select mythor in the <emphasis role="bold">Group</emphasis>
- drop-list.</para>
- <para>The IP Address is automatically filled and the Local Path is
- partially filled with the default folder on your landing zone.
- Note: The VM and Community Edition typically have only one landing
- zone defined.</para>
- </listitem>
- <listitem>
- <para>Complete the Target Scope <emphasis
- role="bold">~thor::in::IMDB</emphasis></para>
- </listitem>
- <listitem>
- <para>Fill in the rest of the parameters (if they are not filled
- in already).</para>
- <para><itemizedlist>
- <listitem>
- <para>Max Record Length: 8192</para>
- </listitem>
- <listitem>
- <para>Separator: \,</para>
- </listitem>
- <listitem>
- <para>Line Terminator: \n,\r\n</para>
- </listitem>
- <listitem>
- <para>Quote: '</para>
- </listitem>
- </itemizedlist></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Make sure the <emphasis role="bold">Overwrite</emphasis> box
- is checked.</para>
- <para>If available, make sure the <emphasis
- role="bold">Replicate</emphasis> box is checked. (The Replicate
- option is only available on systems where replication has been
- enabled.)</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Press the <emphasis role="bold">Spray</emphasis>
- button.</para>
- <para>A tab opens for each file. On these tabs, you can monitor
- the progress of each DFU Spray.</para>
- <para><figure>
- <title>View Progress</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_03a.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <para>After both sprays are complete, we can query the logical
- files on the HPCC to see the files we sprayed.</para>
- </listitem>
- <listitem>
- <para>Click the <emphasis role="bold">Logical Files</emphasis>
- link.</para>
- <para>The files display in the Logical Files list:</para>
- <para><figure>
- <title>Display Logical Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_05.jpg"
- vendor="eclwatchSS" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- </itemizedlist>
- </sect2>
- <sect2 id="Working_with_the_Data">
- <title>Working With the Data</title>
- <para>In this portion of the example, we will write ECL code to make
- sure we can read the sprayed data file. We will define and execute
- simple queries on it so we can evaluate it and determine any necessary
- pre-processing.</para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Start the ECL IDE (Start >> All Programs >> HPCC
- Systems >> ECL IDE).</para>
- </listitem>
- <listitem>
- <para>Log in to your environment.</para>
- </listitem>
- <listitem>
- <para>Expand the <emphasis role="bold">examples</emphasis> ECL
- folder in the Repository toolbox.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Expand the <emphasis role="bold">IMDB</emphasis> folder
- inside.</para>
- <para>All the ECL files needed to complete this tutorial are
- located in the IMDB folder.</para>
- <figure>
- <title>IMDB ECL Files</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_06_new.jpg" />
- </imageobject>
- </mediaobject>
- </figure>
- </listitem>
- <listitem>
- <para>Open the CleanActor ECL file and examine the code.</para>
- <para>This function cleans an actor name taken from the raw text
- file. The comments in the code describe the processing:</para>
- <para><programlisting>IMPORT Std;
- EXPORT STRING CleanActor(STRING infld) := FUNCTION
- // This can be refined later
- s1 := Std.Str.FindReplace(infld, '\'', ''); // remove apostrophes
- s2 := Std.Str.FindReplace(s1, '\t', ''); // remove tabs
- s3 := Std.Str.FindReplace(s2, '----', ''); // remove runs of four dashes
- RETURN TRIM(s3, LEFT, RIGHT); // strip leading and trailing spaces
- END;
- </programlisting></para>
- </listitem>
- </itemizedlist>
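- <para>As a quick check of what CleanActor does, the function can be
- called on a literal string from a Builder window. This is only a sketch:
- it assumes the function is reachable as IMDB.CleanActor in your
- repository, and the input string is made up for illustration.</para>
- <para><programlisting>IMPORT IMDB;
- // '\t' is a tab and '\'' is an escaped apostrophe inside an ECL string literal
- OUTPUT(IMDB.CleanActor('\tO\'Brien, Pat ----'));
- </programlisting></para>
- <para>With the function as listed above, the apostrophe, the tab, and
- the run of dashes are removed and the remaining text is trimmed, so the
- result should be OBrien, Pat.</para>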
- <sect3 id="Examine_The_Data" role="brk">
- <title>Examine the Data</title>
- <para>In this section, we will look at the data and determine if
- there is any pre-processing we want to perform. This is the step in
- the development process where we convert the raw data into a form we
- can actually use.</para>
- <variablelist>
- <varlistentry>
- <term>Note:</term>
- <listitem>
- <para>The IMDB.FileActors.ecl file specifies the size of the
- header in the files (actors.list and actresses.list). The
- HEADING() value in the example code was accurate at the time
- we downloaded the IMDB data, but could change at any time. We
- suggest opening each file in a text editor and checking the line
- number where the header ends and actual data begins (as shown
- below).</para>
- </listitem>
- </varlistentry>
- </variablelist>
- <figure>
- <title>actors.list in text editor</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_fileheading.jpg" />
- </imageobject>
- </mediaobject>
- </figure>
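- <para>To illustrate how a HEADING() value is applied, the sketch below
- reads the sprayed file as one text field per line and skips the header
- lines. It is not the IMDB.FileActors definition, which is not reproduced
- here. The record name RawLineRec, the CSV options, and the value 200 are
- assumptions for illustration only; use the line count you find in your
- own copy of the file.</para>
- <para><programlisting>// Hypothetical layout: each raw line is read into a single field
- RawLineRec := RECORD
-   STRING rawtext;
- END;
- // 200 is a placeholder: replace it with the number of header lines in your file
- rawActors := DATASET('~thor::in::IMDB::actors.list', RawLineRec,
-                      CSV(HEADING(200), SEPARATOR(''), QUOTE('')));
- // Show the first 100 lines after the header
- OUTPUT(CHOOSEN(rawActors, 100));
- </programlisting></para>
- <para>If the first rows of the result still contain header text, or the
- first actor is missing, the HEADING() value needs adjusting.</para>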
- <para></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open a new Builder window (CTRL+N) and write the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- OUTPUT(IMDB.FileActors);
- </programlisting></para>
- </listitem>
- <listitem>
- <para>Press the syntax check button on the main toolbar (or
- press F7).</para>
- <para>It is always a good idea to check syntax before
- submitting.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Make sure the selected cluster is your
- <emphasis>thor</emphasis> cluster, then press the <emphasis
- role="bold">Submit</emphasis> button.</para>
- <para><figure>
- <title>Submit to Thor</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_10.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <para>When the Workunit completes, it displays a green checkmark.
- <inlinegraphic fileref="images/DT173-15.jpg" /></para>
- <para><emphasis role="bold">Note:</emphasis> Depending on the
- size of your cluster and the speed of your server(s), this
- process could take several minutes. If you are running this on a
- virtual machine, it could take as long as 45 minutes to
- complete.</para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Select the Workunit tab (the one with the number and the
- checkmark next to it) and select the <emphasis
- role="bold">Result 1</emphasis> tab.</para>
- <para><figure>
- <title>Select Workunit</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_07.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- <listitem>
- <?dbfo keep-together="always"?>
- <para>Scroll down to see more records.</para>
- <para><figure>
- <title>See more records</title>
- <mediaobject>
- <imageobject>
- <imagedata fileref="images/IMDB_08.jpg" />
- </imageobject>
- </mediaobject>
- </figure></para>
- </listitem>
- </itemizedlist>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Close the Builder Window.</para>
- </listitem>
- </itemizedlist>
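- <para>Beyond viewing the raw records, a few simple aggregate queries
- can help decide what pre-processing is needed. The following is a
- sketch, assuming IMDB.FileActors exposes the fields actorname,
- moviename, and movie_type, which are the field names used by the
- ActorsInMovies code later in this document. The result names are
- arbitrary.</para>
- <para><programlisting>IMPORT IMDB;
- // Total number of parsed records
- OUTPUT(COUNT(IMDB.FileActors), NAMED('TotalRecords'));
- // Record count per movie_type value, largest first
- typeCounts := TABLE(IMDB.FileActors,
-                     {movie_type, UNSIGNED cnt := COUNT(GROUP)},
-                     movie_type);
- OUTPUT(SORT(typeCounts, -cnt), NAMED('ByMovieType'));
- // Records with an empty actor or movie name
- OUTPUT(COUNT(IMDB.FileActors(actorname = '' OR moviename = '')), NAMED('EmptyNames'));
- </programlisting></para>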
- </sect3>
- </sect2>
- <sect2 id="Processing_the_Data_E-T-L">
- <title><emphasis>Processing the Data: Extract, Transform, and
- Load</emphasis></title>
- <para><emphasis>In this section, we will write code to transform the
- original actor data as follows:</emphasis></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>From the raw actors data, we will do an ETL operation
- (Extract, Transform, Load) to build an <emphasis
- role="bold">actor_movie</emphasis> relation set.</para>
- </listitem>
- <listitem>
- <para>We will also construct a Kevin Bacon degrees of separation
- lookup set. This is the structure we will query to answer the
- question:</para>
- </listitem>
- </itemizedlist>
- <para><emphasis>How many degrees of separation
- exist between Actor X and Kevin Bacon?</emphasis></para>
- <para></para>
- <para></para>
- <para></para>
- <para><emphasis role="bold">For example:</emphasis> Using Jon Lovitz
- as the actor, we want information as follows:</para>
- <para>Jon Lovitz → (was in) Movie X → (with) Actor2 → (who was in)
- Movie Y → (with) Kevin Bacon</para>
- <para>We will then write this new file to our Thor cluster so it can
- be used in parameterized queries.</para>
- <para></para>
- <para></para>
- <para><itemizedlist mark="bullet">
- <listitem>
- <para>In the ECL IDE, go to the Repository panel and expand the
- IMDB folder.</para>
- </listitem>
- <listitem>
- <para>Open the ECL File ActorsInMovies.</para>
- <para>The code in this ECL file looks like this:</para>
- </listitem>
- </itemizedlist></para>
- <para><programlisting>/* ******************************************************************************
- ## Copyright 2011 HPCC Systems®. All rights reserved.
- ******************************************************************************* */
- /**
- * Produce a slimmed down version of the IMDB actor AND actress files to
- * permit more efficient join operations.
- * Filter out the movie records we do not want in building our KBacon Number sets.
- *
- */
- IMPORT $ AS IMDB;
- IMPORT Std;
- // Filter out TV movies, Videos AND some documentary type collections
- ds_IMDB := IMDB.FileActors(actorname!='' AND moviename != '' AND
- Std.Str.Find(moviename,'Boffo',1) = 0 AND
- Std.Str.Find(moviename,'Slasher Film',1) = 0 AND
- movie_type != 'Video' AND isTVseries = 'N' AND
- movie_type != 'For TV');
- //Slim the records down to bare essentials for searching AND joining
- slim_IMDB_rec := RECORD
- STRING50 actor;
- STRING150 movie;
- END;
- slim_IMDB_rec slim_it(ds_IMDB L):= TRANSFORM
- SELF.actor := Std.Str.FindReplace(L.actorname,'(I)','');
- SELF.movie := L.moviename;
- END;
- IMDB_names := PROJECT(ds_IMDB, slim_it(LEFT));
- EXPORT ActorsInMovies := IMDB_names : PERSIST('~temp::IMDB::ActorsInMovies');
- </programlisting></para>
- <para>This defines a relational data set of actor:movie pairs. We will
- use this definition later.</para>
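- <para>Before building on this definition, it can be useful to confirm
- that it produces sensible data. A minimal sketch, assuming the
- definition is reachable as IMDB.ActorsInMovies; the result names are
- arbitrary:</para>
- <para><programlisting>IMPORT IMDB;
- // Number of actor:movie pairs that survived the filter
- OUTPUT(COUNT(IMDB.ActorsInMovies), NAMED('PairCount'));
- // Number of distinct actors in the slimmed data
- OUTPUT(COUNT(DEDUP(IMDB.ActorsInMovies, actor, ALL)), NAMED('ActorCount'));
- // A small sample of the pairs
- OUTPUT(CHOOSEN(IMDB.ActorsInMovies, 50), NAMED('SamplePairs'));
- </programlisting></para>
- <para>Because the definition uses PERSIST, the first run computes and
- stores the result, and later runs can reuse it instead of repeating the
- filter and PROJECT.</para>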
- </sect2>
- </sect1>
- <sect1 id="Getting_Useful_Info_from_Data">
- <title>Getting Useful Information from Data</title>
- <sect2 id="Links_and_Degrees_of_Separation">
- <title><emphasis>Links and Degrees of Separation</emphasis></title>
- <para>Now that we have our data in a useful format, have a relation
- defined, and the file is in place, we can write code to use the new
- data file.</para>
- <para>We want to know how many actors are a distance
- <emphasis>N</emphasis> from Kevin Bacon. To accomplish this, we will
- construct sets of actors grouped by their Bacon number, that is, by
- their distance from Kevin Bacon.</para>
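- <para>The idea of building one set per distance can be tried on a tiny
- inline dataset first. The sketch below is a toy, not the tutorial code:
- the data and the names PairRec, toyPairs, level1, and level2 are made
- up. It uses the same building blocks (DEDUP, SET, and a LEFT ONLY JOIN)
- to find the actors at distance 1 and distance 2 from actor A.</para>
- <para><programlisting>PairRec := RECORD
-   STRING20 actor;
-   STRING20 movie;
- END;
- // Made-up data: A and B share M1; B and C share M2
- toyPairs := DATASET([{'A','M1'},{'B','M1'},{'B','M2'},{'C','M2'}], PairRec);
- // Distance 0: movies of the starting actor
- startMovies := DEDUP(toyPairs(actor = 'A'), movie, ALL);
- // Distance 1: other actors who appear in those movies
- level1 := DEDUP(toyPairs(movie IN SET(startMovies, movie), actor != 'A'), actor, ALL);
- // Movies of the level-1 actors, excluding movies already visited
- level1All := DEDUP(JOIN(toyPairs, level1, LEFT.actor = RIGHT.actor,
-                         TRANSFORM(LEFT), LOOKUP), movie, ALL);
- level1Movies := level1All(movie NOT IN SET(startMovies, movie));
- // Distance 2: actors in those movies, minus actors already at distance 1
- cand2 := DEDUP(JOIN(toyPairs, level1Movies, LEFT.movie = RIGHT.movie,
-                     TRANSFORM(LEFT), LOOKUP), actor, ALL);
- level2 := JOIN(cand2, level1, LEFT.actor = RIGHT.actor,
-                TRANSFORM(LEFT), LEFT ONLY);
- OUTPUT(level1, NAMED('Level1'));
- OUTPUT(level2, NAMED('Level2'));
- </programlisting></para>
- <para>For this toy data, Level1 should contain actor B and Level2
- should contain actor C. The LEFT ONLY join keeps only the candidates
- that have no match in the previous level, which is how each level
- excludes actors that were already reached.</para>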
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open the KevinBaconNumberSets ECL file.</para>
- </listitem>
- </itemizedlist>
- <para>This ECL code counts the number of actors with <emphasis>"Bacon
- numbers"</emphasis> from 1 through 7, that is, up to 7 levels of
- separation. We will use this later to do searches by building an
- index.</para>
- <para><programlisting>/* ******************************************************************************
- ATTRIBUTE PURPOSE:
- Produce a series of sets for Actors and Movies that are: distance-0
- away (KBacon's direct movies), distance-1 away (KBacon's costars' movies),
- distance-2 away (movies of costars of costars), etc., all the way up to level 7
-
- The nested attributes below are shown here together for the benefit of the reader.
-
- Notes on variable naming convention used for costars and movies
- KBMovies : Movies Kevin Bacon Worked in (distance 0)
- KBCoStars : Stars who worked in KBMovies (distance 1)
- KBCoStarMovies : Movies worked in by KBCoStars
- except KBMovies (distance 1)
- KBCo2Stars : Stars(Actors) who worked in KBCoStarMovies (distance 2)
- KBCo2StarMovies : Movies worked in by KBCo2Stars
- except KBCoStarMovies (distance 2)
- KBCo3Stars : Stars(Actors) who worked in KBCo2StarMovies (distance 3)
- KBCo3StarMovies : Movies worked in by KBCo3Stars
- except KBCo2StarMovies (distance 3)
- etc..
- ******************************************************************************* */
- IMPORT Std;
- IMPORT IMDB;
- EXPORT KevinBaconNumberSets := MODULE
- // Constructing a proper name match function is an art in itself.
- // For simplicity we will define a name as matching if both first and last name
- // are found within the string
- NameMatch(string full_name, string fname,string lname) :=
- Std.Str.Find(full_name,fname,1) > 0 AND
- Std.Str.Find(full_name,lname,1) > 0;
- //------ Get KBacon Movies
- AllKBEntries := IMDB.ActorsInMovies(NameMatch(actor,'Kevin','Bacon'));
- EXPORT KBMovies := DEDUP(AllKBEntries, movie, ALL); // Each movie should ONLY occur once
- //------ Get KBacon CoStars
- CoStars := IMDB.ActorsInMovies(Movie IN SET(KBMovies,Movie));
- EXPORT KBCoStars := DEDUP( CoStars(actor<>'Kevin Bacon'), actor, ALL);
- //------ Get KBacon Costars' Movies
- // CSM = First find all of the movies that a KBCoStar has been in
- CSM := DEDUP(JOIN(IMDB.ActorsInMovies,KBCoStars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie,ALL);
- // Now we need to remove all of those that KB was in himself
- // We can use a set; KB has not been in (quite!) that many movies
- EXPORT KBCoStarMovies := CSM(movie NOT IN SET(KBMovies,movie));
- //------ Bacon # 2 Actors
- // To be a Co2Star of Kevin Bacon you must have appeared in a movie with a
- //CoStar of Kevin Bacon
- // This corresponds to having a Bacon number of 2
- // We are now getting towards the expensive part of the process
- KBCo2S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCoStarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- // KBCo2S = ALL Actors appearing in Movies of KBacon's CoActors
- // The above is all the people in the movies; but some will have been co-stars of KB
- //directly - these must be removed
- // The LEFT ONLY join removes items in one list from another
- EXPORT KBCo2Stars := JOIN(KBCo2S, KBCoStars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LEFT ONLY);
- //------- bacon # 2 Movies
- // Co2SM = what movies have all the Co2Stars been in?
- Co2SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co2SM = ALL Movies KBCo2Stars have been in
- // Of course some of these movies will have CoStars in too and thus will already have
- //been listed. Note this list will not contain any Kevin Bacon movies OR the movie would
- //have been reached earlier!
- EXPORT KBCo2StarMovies := JOIN(Co2SM, KBCoStarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //------ bacon #3 Actors
- // Find people with a Bacon number of 3
- // This code is very similar to KBCo2Stars; one might be tempted to common up into a
- // function or macro. However it is worth looking at the attribute counts first; we may be
- // down to a small enough set that we can start using in-memory functions (e.g.,SET) again.
- KBCo3S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- // KBCo3S = ALL CoStars in KBCo2Star Movies
- // The above is all the people in the movies; but some will have been co2stars of KB
- // directly - these must be removed. The LEFT ONLY join removes items in one list from
- // another. There should not be any direct CoStars in this list (or the movie would have
- // been a CoStarMovie not a CoCoStarMovie)
- EXPORT KBCo3Stars := JOIN(KBCo3S, KBCo2Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #3 Movies
- // So what movies have all the KBCo3Stars been in?
- Co3SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co3SM = ALL Movies KBCo3Stars have been in
- // Of course some of these movies will have KBCo2Stars in too and thus will already have
- // been listed. Note We ONLY have to remove one level back from the list; previous levels
- // cannot be reached by definition
- EXPORT KBCo3StarMovies := JOIN(Co3SM, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //------bacon #4 Actors
- KBCo4S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- EXPORT KBCo4Stars := JOIN(KBCo4S, KBCo3Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #4 Movies
- // So what movies have all the Co4Stars been in?
- Co4SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie, ALL);
- // Co4SM = ALL Movies KBCo4Stars have been in
- // Of course some of these movies will have Co3Stars in too and thus will already have
- // been listed. Note We ONLY have to remove one level back from the list; previous levels
- // cannot be reached by definition
- EXPORT KBCo4StarMovies := JOIN(Co4SM, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #5 Stars
- KBCo5S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT), LOOKUP),
- actor, ALL);
- EXPORT KBCo5Stars := JOIN(KBCo5S, KBCo4Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #5 Movies
- Co5SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo5Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT), LOOKUP),
- movie,ALL);
- EXPORT KBCo5StarMovies := JOIN(Co5SM, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #6 Stars
- // Find people with a Bacon number of 6
- // KBCo5 is getting small again - can move back down to the SET?
- KBCo6S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo5StarMovies, movie)),
- actor, ALL);
- EXPORT KBCo6Stars := JOIN(KBCo6S, KBCo5Stars, LEFT.actor=RIGHT.actor,
- TRANSFORM(LEFT),LEFT ONLY);
- //----- bacon #6 Movies
- Co6SM := DEDUP(IMDB.ActorsInMovies(actor IN SET(KBCo6Stars, actor)), movie, ALL);
- EXPORT KBCo6StarMovies := Co6SM(movie NOT IN SET(KBCo5StarMovies, movie));
- //----- bacon #7 Stars
- // Find people with a Bacon number of 7
- KBCo7S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo6StarMovies,movie)), actor, ALL);
- EXPORT KBCo7Stars := KBCo7S(actor NOT IN SET(KBCo6Stars, actor));
- //----- We just have to count them all !! (How many holes in Albert Hall?)
- EXPORT doCounts := PARALLEL(
- OUTPUT(COUNT(KBMovies), NAMED('KBMovies')),
- OUTPUT(COUNT(KBCoStars), NAMED('KBCoStars')),
- OUTPUT(COUNT(KBCoStarMovies), NAMED('KBCoStarMovies')),
- OUTPUT(COUNT(KBCo2Stars), NAMED('KBCo2Stars')),
- OUTPUT(COUNT(KBCo2StarMovies), NAMED('KBCo2StarMovies')),
- OUTPUT(COUNT(KBCo3Stars), NAMED('KBCo3Stars')),
- OUTPUT(COUNT(KBCo3StarMovies), NAMED('KBCo3StarMovies')),
- OUTPUT(COUNT(KBCo4Stars), NAMED('KBCo4Stars')),
- OUTPUT(COUNT(KBCo4StarMovies), NAMED('KBCo4StarMovies')),
- OUTPUT(COUNT(KBCo5Stars), NAMED('KBCo5Stars')),
- OUTPUT(COUNT(KBCo5StarMovies), NAMED('KBCo5StarMovies')),
- OUTPUT(COUNT(KBCo6Stars), NAMED('KBCo6Stars')),
- OUTPUT(COUNT(KBCo6StarMovies), NAMED('KBCo6StarMovies')),
- OUTPUT(COUNT(KBCo7Stars), NAMED('KBCo7Stars')),
- OUTPUT(KBCo7Stars)
- );
- END;</programlisting></para>
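- <para>The listing above expresses the same step in two ways: a LOOKUP JOIN
- (or a LEFT ONLY JOIN for exclusion) while the intermediate sets are large,
- and an in-memory SET filter once they are small again. The following is a
- minimal sketch of that equivalence on a hypothetical four-row dataset. The
- record layout, attribute names, and data are assumptions for illustration
- and are not part of the IMDB module.</para>
- <para><programlisting>// Hypothetical stand-in for IMDB.ActorsInMovies (illustration only)
- Rec := RECORD
-   STRING20 actor;
-   STRING20 movie;
- END;
- Credits := DATASET([{'Actor A', 'Movie 1'},
-                     {'Actor A', 'Movie 2'},
-                     {'Actor B', 'Movie 2'},
-                     {'Actor C', 'Movie 3'}], Rec);
- // Rows for the actor we already know about, and one row per movie they were in
- Known := Credits(actor = 'Actor A');
- KnownMovies := DEDUP(Known, movie, ALL);
- // Inclusion: everyone who appeared in one of those movies
- ViaSet := DEDUP(Credits(movie IN SET(KnownMovies, movie)), actor, ALL);
- ViaJoin := DEDUP(JOIN(Credits, KnownMovies, LEFT.movie=RIGHT.movie,
-                       TRANSFORM(LEFT), LOOKUP),
-                  actor, ALL);
- // Exclusion: drop the actors already known
- NewViaSet := ViaSet(actor NOT IN SET(Known, actor));
- NewViaJoin := JOIN(ViaJoin, Known, LEFT.actor=RIGHT.actor,
-                    TRANSFORM(LEFT), LEFT ONLY);
- // On this data both forms should return the single row for Actor B
- OUTPUT(NewViaSet, NAMED('NewViaSet'));
- OUTPUT(NewViaJoin, NAMED('NewViaJoin'));</programlisting></para>
- <para>The SET form is shorter to write, but it holds the whole set in
- memory, which is presumably why the listing reserves it for the smallest
- levels.</para>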
- <itemizedlist mark="bullet">
- <listitem>
- <para>Open a new Builder Window and type:</para>
- </listitem>
- </itemizedlist>
- <para><programlisting>IMPORT IMDB;
- IMDB.KevinBaconNumberSets.doCounts;</programlisting></para>
- <itemizedlist mark="bullet">
- <listitem>
- <para>Check the syntax then press the <emphasis
- role="bold">Submit</emphasis> button.</para>
- <para><emphasis role="bold">Note:</emphasis> Depending on the size
- of your cluster and the speed of your server(s), this process
- could take several minutes. If you are running this on a virtual
- machine, it could take as long as an hour to complete.</para>
- </listitem>
- <listitem>
- <para>When the process completes, each row shown below appears in
- its own result tab. A sample of the output follows:</para>
- <para><emphasis role="bold">Note:</emphasis> The data files for
- this tutorial change frequently, so your results may vary from those
- shown in this document.</para>
- </listitem>
- </itemizedlist>
- <para><informaltable>
- <?dbfo keep-together="always"?>
- <tgroup cols="2">
- <tbody>
- <row>
- <entry>KB Movies</entry>
- <entry>71</entry>
- </row>
- <row>
- <entry>KB Co Stars</entry>
- <entry>3520</entry>
- </row>
- <row>
- <entry>KB Co Star Movies</entry>
- <entry>33504</entry>
- </row>
- <row>
- <entry>KB Co 2 Stars</entry>
- <entry>430145</entry>
- </row>
- <row>
- <entry>KB Co 2 Star Movies</entry>
- <entry>251867</entry>
- </row>
- <row>
- <entry>KB Co 3 Stars</entry>
- <entry>896009</entry>
- </row>
- <row>
- <entry>KB Co 3 Star Movies</entry>
- <entry>51650</entry>
- </row>
- <row>
- <entry>KB Co 4 Stars</entry>
- <entry>102729</entry>
- </row>
- <row>
- <entry>KB Co 4 Star Movies</entry>
- <entry>2634</entry>
- </row>
- <row>
- <entry>KB Co 5 Stars</entry>
- <entry>6080</entry>
- </row>
- <row>
- <entry>KB Co 5 Star Movies</entry>
- <entry>190</entry>
- </row>
- <row>
- <entry>KB Co 6 Stars</entry>
- <entry>450</entry>
- </row>
- <row>
- <entry>KB Co 6 Star Movies</entry>
- <entry>14</entry>
- </row>
- <row>
- <entry>KB Co 7 Stars</entry>
- <entry>22</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable></para>
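- <para>The comments in the KevinBaconNumberSets code note that the code for
- each level is very similar and that one might be tempted to common it up
- into a function. The following is a minimal sketch of that idea on a
- hypothetical six-row dataset. The functions NextStars and NextMovies, the
- record layout, and the data are assumptions for illustration; they are not
- attributes of the IMDB module.</para>
- <para><programlisting>Rec := RECORD
-   STRING20 actor;
-   STRING20 movie;
- END;
- // Hypothetical chain: Bacon - Actor A - Actor B - Actor C
- Credits := DATASET([{'Bacon, Kevin', 'Movie 1'},
-                     {'Actor A', 'Movie 1'},
-                     {'Actor A', 'Movie 2'},
-                     {'Actor B', 'Movie 2'},
-                     {'Actor B', 'Movie 3'},
-                     {'Actor C', 'Movie 3'}], Rec);
- // Actors in the previous level's movies, minus the previous level's actors
- NextStars(DATASET(Rec) prevMovies, DATASET(Rec) prevStars) := FUNCTION
-   reached := DEDUP(JOIN(Credits, prevMovies, LEFT.movie=RIGHT.movie,
-                         TRANSFORM(LEFT), LOOKUP),
-                    actor, ALL);
-   RETURN JOIN(reached, prevStars, LEFT.actor=RIGHT.actor,
-               TRANSFORM(LEFT), LEFT ONLY);
- END;
- // Movies of this level's actors, minus the previous level's movies
- NextMovies(DATASET(Rec) stars, DATASET(Rec) prevMovies) := FUNCTION
-   reached := DEDUP(JOIN(Credits, stars, LEFT.actor=RIGHT.actor,
-                         TRANSFORM(LEFT), LOOKUP),
-                    movie, ALL);
-   RETURN JOIN(reached, prevMovies, LEFT.movie=RIGHT.movie,
-               TRANSFORM(LEFT), LEFT ONLY);
- END;
- Level0Stars := Credits(actor = 'Bacon, Kevin');
- Level0Movies := DEDUP(Level0Stars, movie, ALL);
- Level1Stars := NextStars(Level0Movies, Level0Stars);
- Level1Movies := NextMovies(Level1Stars, Level0Movies);
- Level2Stars := NextStars(Level1Movies, Level1Stars);
- Level2Movies := NextMovies(Level2Stars, Level1Movies);
- Level3Stars := NextStars(Level2Movies, Level2Stars);
- // On this data: Actor A at level 1, Actor B at level 2, Actor C at level 3
- OUTPUT(Level1Stars, NAMED('Level1Stars'));
- OUTPUT(Level2Stars, NAMED('Level2Stars'));
- OUTPUT(Level3Stars, NAMED('Level3Stars'));</programlisting></para>
- <para>As in the tutorial code, each step removes only the level immediately
- before it. A function like this always uses a JOIN, so it gives up the
- option of switching to SET filters for the smallest levels.</para>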
- </sect2>
- </sect1>
- </chapter>
- <chapter id="Next_Steps">
- <title>Next Steps</title>
- <para>Now that you have successfully processed the data and established
- links, what's next?</para>
- <para>Two more ECL files are included in the IMDB folder that you can use
- in conjunction with the examples you have already worked through in this
- tutorial:</para>
- <para>• KeysKevinBacon -- Builds an index of actors/actresses and the
- movies they have starred in.</para>
- <para>You must build this index before you can run queries to find the
- degree of separation between Kevin Bacon and an actor of your
- choice.</para>
- <para>To build the index, open a builder window and type the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- IMDB.KeysKevinBacon.BuildAll;</programlisting></para>
- <para>Press the <emphasis role="bold">Submit</emphasis> button to run the
- ECL code and build the index.</para>
- <para>• SearchKevinBaconLinks -- Searches the index you built to give you
- the degree of separation between an actor and Kevin Bacon.</para>
- <para>For example, to find the degree of separation between Kevin Bacon
- and Andi Everingham, open a builder window and type the following
- code:</para>
- <para><programlisting>IMPORT IMDB;
- IMDB.SearchKevinBaconLinks('Everingham, Andi');</programlisting></para>
- <para>Make sure the selected cluster is your <emphasis>hThor</emphasis>
- cluster, then press the <emphasis role="bold">Submit</emphasis> button to
- run the query.</para>
- <para>When it has completed, click on the Workunit ID tab.</para>
- <para>Two results are shown.</para>
- <para><emphasis role="bold">Result1</emphasis> shows the degree of
- separation between the actor and Kevin Bacon.</para>
- <para>Interpret the results as follows:</para>
- <para>Actor is at level 1 - The actor you chose and Kevin Bacon starred in
- a movie together.</para>
- <para>Actor is at level 2 - The actor you chose starred in a movie with an
- actor who starred in a movie with Kevin Bacon.</para>
- <para>The higher the level, the greater the degree of separation between
- the actor you chose and Kevin Bacon.</para>
- <para>In this example, the actor is at level 6, indicating that there are
- 6 degrees of separation between Andi Everingham and Kevin Bacon.</para>
- <para><emphasis role="bold">Result2</emphasis> shows the level (degree of
- separation), the name of the actor and the movie they starred in.</para>
- <para>Each line shows an actor and a movie they starred in. Taken together,
- the lines form the chain that links the actor you chose to Kevin
- Bacon.</para>
- <para>Have fun finding the degrees of separation between any actor and
- Kevin Bacon.</para>
- <para>Remember to build the index first.</para>
- </chapter>
- </book>
|