  1. <?xml version="1.0" encoding="utf-8"?>
  2. <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  3. "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
  4. <book lang="en_US" xml:base="../">
  5. <title>Six Degrees of Kevin Bacon</title>
  6. <bookinfo>
  7. <title>Six Degrees of Kevin Bacon: ECL Programming Example</title>
  8. <mediaobject>
  9. <imageobject>
  10. <imagedata fileref="images/redswooshWithLogo3.jpg" />
  11. </imageobject>
  12. </mediaobject>
  13. <author>
  14. <surname>Boca Raton Documentation Team</surname>
  15. </author>
  16. <legalnotice>
  17. <para>We welcome your comments and feedback about this document via
  18. email to <email>docfeedback@hpccsystems.com</email></para>
  19. <para>Please include <emphasis role="bold">Documentation
  20. Feedback</emphasis> in the subject line and reference the document name,
  21. page numbers, and current Version Number in the text of the
  22. message.</para>
  23. <para>LexisNexis and the Knowledge Burst logo are registered trademarks
  24. of Reed Elsevier Properties Inc., used under license.</para>
  25. <para>HPCC Systems<superscript>®</superscript> is a registered trademark
  26. of LexisNexis Risk Data Management Inc.</para>
  27. <para>Other products, logos, and services may be trademarks or
  28. registered trademarks of their respective companies.</para>
  29. <para>All names and example data used in this manual are fictitious. Any
  30. similarity to actual persons, living or dead, is purely
  31. coincidental.</para>
  32. <para></para>
  33. </legalnotice>
  34. <xi:include href="common/Version.xml" xpointer="FooterInfo"
  35. xmlns:xi="http://www.w3.org/2001/XInclude" />
  36. <xi:include href="common/Version.xml" xpointer="DateVer"
  37. xmlns:xi="http://www.w3.org/2001/XInclude" />
  38. <corpname>HPCC Systems<superscript>®</superscript></corpname>
  39. <xi:include href="common/Version.xml" xpointer="Copyright"
  40. xmlns:xi="http://www.w3.org/2001/XInclude" />
  41. <mediaobject role="logo">
  42. <imageobject>
  43. <imagedata fileref="images/LN_Rightjustified.jpg" />
  44. </imageobject>
  45. </mediaobject>
  46. </bookinfo>
  47. <chapter id="Working_with_Data">
  48. <title>Working with Data</title>
  49. <sect1 id="Working_with_data_Intro" role="nobrk">
  50. <title>Introduction</title>
  51. <para>This exercise shows a methodology for extracting useful information
  52. from data. Finding interesting links and relationships in large or
  53. massive datasets is a typical use of the HPCC Systems High Performance
  54. Computing Cluster (HPCC) platform.</para>
  55. <para>In this example, we will download the data files from the Internet
  56. Movie Database (IMDB) and see one technique to extract links and find
  57. relationships.</para>
  58. <para>Because the concept of actors and movies is simple,
  59. everyone should understand the data and relationships intuitively.
  60. However, the data is comprehensive enough to provide a solid example and
  61. inspiration for new users to gain the skills to attack their own real-world
  62. problems with an HPCC.</para>
  63. <para>In this example, we will:</para>
  64. <itemizedlist mark="bullet">
  65. <listitem>
  66. <para>Download raw data files and supporting documentation about the
  67. data</para>
  68. </listitem>
  69. <listitem>
  70. <para>Analyze the data file to understand its format and
  71. contents</para>
  72. </listitem>
  73. <listitem>
  74. <para>Spray the file to a Data Refinery (Thor) cluster</para>
  75. </listitem>
  76. <listitem>
  77. <para>Examine the data and determine the pre-processing
  78. needed</para>
  79. </listitem>
  80. <listitem>
  81. <para>Pre-process the data to produce a new data file</para>
  82. </listitem>
  83. </itemizedlist>
  84. <para><informaltable colsep="1" frame="all">
  85. <?dbfo keep-together="always"?>
  86. <tgroup cols="2">
  87. <colspec colwidth="52.60pt" />
  88. <colspec />
  89. <tbody>
  90. <row>
  91. <entry><inlinegraphic fileref="images/OSSgr3.png" /></entry>
  92. <entry>While this example will run on a single-node HPCC, you
  93. will see a dramatic difference in performance on a multi-node
  94. system. The true power of an HPCC is its ability to work on
  95. different portions of the data file in parallel. This is known
  96. as Massively Parallel Processing (MPP).</entry>
  97. </row>
  98. </tbody>
  99. </tgroup>
  100. </informaltable></para>
  101. </sect1>
  102. <sect1 id="Processing_the_data">
  103. <title>Processing the Data</title>
  104. <sect2 id="Get_a_data_file">
  105. <title><emphasis>We get a data file</emphasis></title>
  106. <para>The Internet Movie Database (IMDB) is a freely
  107. downloadable set of data files about movies.</para>
  108. <para>It can be downloaded in many formats, including text file
  109. format. The set includes approximately 48 files about actors,
  110. actresses, directors, producers, and other aspects of motion
  111. pictures.</para>
  112. <para>At roughly 400 MB, it is large enough to
  113. exercise an HPCC platform but not too big to
  114. download.</para>
  115. <para>The plain text data files are available from the following FTP
  116. sites:</para>
  117. <itemizedlist mark="bullet">
  118. <listitem>
  119. <para><ulink
  120. url="ftp://ftp.fu-berlin.de/pub/misc/movies/database/">ftp://ftp.fu-berlin.de/pub/misc/movies/database/</ulink>
  121. (Germany)</para>
  122. </listitem>
  123. <listitem>
  124. <para><ulink
  125. url="ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/">ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/</ulink>
  126. (Finland)</para>
  127. </listitem>
  128. <listitem>
  129. <para><ulink
  130. url="ftp://ftp.sunet.se/pub/tv+movies/imdb/">ftp://ftp.sunet.se/pub/tv+movies/imdb/</ulink>
  131. (Sweden)</para>
  132. </listitem>
  133. </itemizedlist>
  134. <para>The files are compressed using GNU zip (gzip) to save space and
  135. bandwidth.</para>
  136. <para>We will focus initially on two of the larger data sets in the
  137. IMDB database:</para>
  138. <blockquote>
  139. <itemizedlist mark="bullet">
  140. <listitem>
  141. <para>The Actors Dataset (Approximately 4 million
  142. Records)</para>
  143. </listitem>
  144. <listitem>
  145. <para>The Actresses Dataset (Approximately 2 million
  146. Records)</para>
  147. </listitem>
  148. </itemizedlist>
  149. </blockquote>
  150. <itemizedlist mark="bullet">
  151. <listitem>
  152. <para>Download the plain text data files
  153. (<emphasis>actors.list.gz</emphasis> and
  154. <emphasis>actresses.list.gz</emphasis>) to your local drive using
  155. any FTP client you choose.</para>
  156. </listitem>
  157. <listitem>
  158. <para>Extract the two data files (<emphasis>actors.list</emphasis>
  159. and <emphasis>actresses.list</emphasis>) using any gzip-compatible
  160. utility.</para>
  161. </listitem>
  162. </itemizedlist>
  163. </sect2>
  164. <sect2 id="Analyze_the_data">
  165. <title><emphasis>Analyze the data file to understand its format and
  166. its contents</emphasis></title>
  167. <para>Here is a sample of the data in the actors.list file from
  168. IMDB:</para>
  169. <para><programlisting>Koolout' Starks, Johnny Nothing Like the Holidays (2008) [Alexis' Thug] &lt;35&gt;
  170. Subtle Seduction (2008) [Officer Ward]
  171. The Godfather of Green Bay (2005) (as Johnny Starks) [Marcus] &lt;18&gt;
  172. La Chispa', Tony Caceria de judiciales (1997) &lt;11&gt;
  173. Violencia en la sierra (1995) [Victoriano] &lt;4&gt;</programlisting></para>
  174. <para>Notice that the actors text file is structured as follows:</para>
  175. <para><programlisting>Blankline
  176. Actorname_i \t Moviename (year) [role] &lt;listing position&gt;
  177. Moviename (year) [role] &lt;listing position&gt;
  178. Moviename (year) [role] &lt;listing position&gt;
  179. :
  180. Blankline
  181. Actorname_j \t Moviename (year) [role] &lt;listing position&gt;
  182. :
  183. Blankline</programlisting></para>
  184. </sect2>
  185. <sect2 id="Load_the_Incoming_Data" role="brk">
  186. <title><emphasis>Load the Incoming Data File to your Landing
  187. Zone</emphasis></title>
  188. <para>In this step, you will copy the data files to a location from
  189. which they can be sprayed to your HPCC cluster. A Landing Zone is a
  190. storage location attached to your HPCC. It has a utility running to
  191. facilitate file spraying to a cluster.</para>
  192. <para>For smaller data files (up to 2 GB), you can use the
  193. upload/download file utility in ECL Watch. The sample data files are
  194. ~400 MB.</para>
  195. <para>Next you will distribute (or Spray) the dataset to all the nodes
  196. in the HPCC cluster. The power of the HPCC comes from its ability to
  197. assign multiple processors to work on different portions of the data
  198. file in parallel.</para>
  199. <orderedlist>
  200. <listitem>
  201. <para>Download the sample data files from the ftp sites as
  202. described in the previous section, if you have not done so
  203. already.</para>
  204. </listitem>
  205. <listitem>
  206. <para>Extract them to a folder on your local machine.</para>
  207. </listitem>
  208. <listitem>
  209. <para>In your browser, go to the <emphasis role="bold">ECL
  210. Watch</emphasis> URL. For example, http://nnn.nnn.nnn.nnn:8010,
  211. where nnn.nnn.nnn.nnn is your ESP Server's IP address.</para>
  212. <para><informaltable colsep="1" frame="all" rowsep="1">
  213. <?dbfo keep-together="always"?>
  214. <tgroup cols="2">
  215. <colspec colwidth="49.50pt" />
  216. <colspec />
  217. <tbody>
  218. <row>
  219. <entry><inlinegraphic
  220. fileref="images/caution.png" /></entry>
  221. <entry>Your IP address could be different from the ones
  222. provided in the example images. Please use the IP
  223. address provided by <emphasis
  224. role="bold">your</emphasis> installation.</entry>
  225. </row>
  226. </tbody>
  227. </tgroup>
  228. </informaltable></para>
  229. </listitem>
  230. <listitem>
  231. <?dbfo keep-together="always"?>
  232. <para>From the ECL Watch page, click on the <emphasis
  233. role="bold">Files</emphasis> icon, then on the <emphasis
  234. role="bold">Landing Zones</emphasis> link.</para>
  235. <para><figure>
  236. <title>Upload/download</title>
  237. <mediaobject>
  238. <imageobject>
  239. <imagedata fileref="images/LZimg03-1.jpg"
  240. vendor="eclwatchSS" />
  241. </imageobject>
  242. </mediaobject>
  243. </figure></para>
  244. <para>Once you click on the <emphasis
  245. role="bold">Upload</emphasis> file link, a file dialog
  246. displays.</para>
  247. <para></para>
  248. </listitem>
  249. <listitem>
  250. <para>Browse the files on your local machine, then use
  251. multi-select to choose the files to upload and then press the
  252. <emphasis role="bold">Open</emphasis> button.</para>
  253. <para>The files you selected should appear. The data files are
  254. named <emphasis>actors.list</emphasis> and
  255. <emphasis>actresses.list</emphasis>.</para>
  257. <figure>
  258. <title>Dropzones and Files</title>
  259. <mediaobject>
  260. <imageobject>
  261. <imagedata fileref="images/IMDB_upload.jpg"
  262. vendor="eclwatchSS" />
  263. </imageobject>
  264. </mediaobject>
  265. </figure>
  266. </listitem>
  267. <listitem>
  268. <para>Press the <emphasis role="bold">Start</emphasis> button to
  269. upload the files.</para>
  270. <para>You can monitor progress as it uploads.</para>
  271. <figure>
  272. <title>Upload Progress</title>
  273. <mediaobject>
  274. <imageobject>
  275. <imagedata fileref="images/IMDB_uploadProgress.jpg"
  276. vendor="eclwatchSS" />
  277. </imageobject>
  278. </mediaobject>
  279. </figure>
  280. </listitem>
  281. </orderedlist>
  282. </sect2>
  283. <sect2 id="Spray_the_Data_to_THOR">
  284. <title>Spray the Data File to your <emphasis>Data Refinery (Thor)
  285. Cluster</emphasis></title>
  286. <para>To use the data file in our HPCC system, we must "spray" it to
  287. all the nodes. A <emphasis>spray</emphasis> or
  288. <emphasis>import</emphasis> is the relocation of a data file from one
  289. location (such as a Landing Zone) to multiple file parts on nodes in a
  290. cluster.</para>
  291. <para>The distributed or sprayed file is given a
  292. <emphasis>logical-file-name</emphasis> such as <emphasis
  293. role="bold">~thor::in::IMDB::actors.list</emphasis>. The system
  294. maintains a list of logical files and the corresponding physical file
  295. locations of the file parts.</para>
  296. <para></para>
  297. <itemizedlist mark="bullet">
  298. <listitem>
  299. <para>Open ECL Watch using the following URL:</para>
  300. <para><emphasis role="bold">http://nnn.nnn.nnn.nnn:pppp</emphasis> (where
  301. nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the
  302. port. The default port is 8010.)</para>
  303. </listitem>
  304. <listitem>
  305. <para>Click on the <emphasis role="bold">Files</emphasis> icon,
  306. then click the <emphasis role="bold">Landing Zones</emphasis> link
  307. from the navigation.</para>
  308. </listitem>
  309. <listitem>
  310. <para>Select the two files (actors.list and actresses.list), then
  311. press the <emphasis role="bold">Delimited</emphasis> button.</para>
  312. <para>The <emphasis role="bold">Spray Delimited</emphasis> dialog
  313. displays.</para>
  314. <para><figure>
  315. <title>Spray Delimited</title>
  316. <mediaobject>
  317. <imageobject>
  318. <imagedata fileref="images/IMDB_01.jpg"
  319. vendor="eclwatchSS" />
  320. </imageobject>
  321. </mediaobject>
  322. </figure></para>
  323. <para></para>
  324. </listitem>
  325. <listitem>
  326. <para>Select mythor in the <emphasis role="bold">Group</emphasis>
  327. drop-list.</para>
  328. <para>The IP Address is automatically filled and the Local Path is
  329. partially filled with the default folder on your landing zone.
  330. Note: The VM and Community Edition typically have only one landing
  331. zone defined.</para>
  332. </listitem>
  333. <listitem>
  334. <para>Complete the Target Scope <emphasis
  335. role="bold">~thor::in::IMDB</emphasis></para>
  336. </listitem>
  337. <listitem>
  338. <para>Fill in the rest of the parameters (if they are not filled
  339. in already).</para>
  340. <para><itemizedlist>
  341. <listitem>
  342. <para>Max Record Length 8192</para>
  343. </listitem>
  344. <listitem>
  345. <para>Separator \,</para>
  346. </listitem>
  347. <listitem>
  348. <para>Line Terminator \n,\r\n</para>
  349. </listitem>
  350. <listitem>
  351. <para>Quote: '</para>
  352. </listitem>
  353. </itemizedlist></para>
  354. </listitem>
  355. <listitem>
  356. <?dbfo keep-together="always"?>
  357. <para>Make sure the <emphasis role="bold">Overwrite</emphasis> box
  358. is checked.</para>
  359. <para>If available, make sure the <emphasis
  360. role="bold">Replicate</emphasis> box is checked. (The Replicate
  361. option is only available on systems where replication has been
  362. enabled.)</para>
  363. </listitem>
  364. <listitem>
  365. <?dbfo keep-together="always"?>
  366. <para>Press the <emphasis role="bold">Spray</emphasis>
  367. button.</para>
  368. <para>A tab opens for each file. On these tabs, you can monitor
  369. the progress of each DFU Spray.</para>
  370. <para><figure>
  371. <title>View Progress</title>
  372. <mediaobject>
  373. <imageobject>
  374. <imagedata fileref="images/IMDB_03a.jpg"
  375. vendor="eclwatchSS" />
  376. </imageobject>
  377. </mediaobject>
  378. </figure></para>
  379. </listitem>
  380. <listitem>
  381. <para>After both sprays are complete, we can query the logical
  382. files on the HPCC to see the files we sprayed.</para>
  383. </listitem>
  384. <listitem>
  385. <para>Click on the <emphasis role="bold">Logical Files</emphasis>
  386. link</para>
  387. <para>The files display in the Logical Files list:</para>
  388. <para><figure>
  389. <title>Display Logical Files</title>
  390. <mediaobject>
  391. <imageobject>
  392. <imagedata fileref="images/IMDB_05.jpg"
  393. vendor="eclwatchSS" />
  394. </imageobject>
  395. </mediaobject>
  396. </figure></para>
  397. </listitem>
  398. </itemizedlist>
  399. </sect2>
  400. <sect2 id="Working_with_the_Data">
  401. <title>Working With the Data</title>
  402. <para>In this portion of the example, we will write ECL code to make
  403. sure we can read the sprayed data file. We will define and execute
  404. simple queries on it so we can evaluate it and determine any necessary
  405. pre-processing.</para>
  406. <itemizedlist mark="bullet">
  407. <listitem>
  408. <para>Start the ECL IDE (Start &gt;&gt; All Programs &gt;&gt; HPCC
  409. Systems &gt;&gt; ECL IDE).</para>
  410. </listitem>
  411. <listitem>
  412. <para>Log in to your environment.</para>
  413. </listitem>
  414. <listitem>
  415. <para>Expand the <emphasis role="bold">examples</emphasis> ECL
  416. folder in the Repository toolbox.</para>
  417. </listitem>
  418. <listitem>
  419. <?dbfo keep-together="always"?>
  420. <para>Expand the <emphasis role="bold">IMDB </emphasis>folder
  421. inside.</para>
  422. <para>All the ECL files needed to complete this tutorial are
  423. located in the IMDB folder.</para>
  424. <figure>
  425. <title>IMDB ECL Files</title>
  426. <mediaobject>
  427. <imageobject>
  428. <imagedata fileref="images/IMDB_06_new.jpg" />
  429. </imageobject>
  430. </mediaobject>
  431. </figure>
  432. </listitem>
  433. <listitem>
  434. <para>Open the CleanActor ECL file and examine the code.</para>
  435. <para>This code reads and processes the raw text file. The
  436. comments below describe the processing:</para>
  437. <para><programlisting>IMPORT Std;
  438. EXPORT STRING CleanActor(STRING infld) := FUNCTION
  439. //this can be refined later
  440. s1 := Std.Str.FindReplace(infld, '\'',''); // replace apostrophe
  441. s2 := Std.Str.FindReplace(s1, '\t',''); //replace tabs
  442. s3 := Std.Str.FindReplace(s2, '----',''); // replace multiple -----
  443. return TRIM(s3, LEFT, RIGHT);
  444. END;
  445. </programlisting></para>
  446. </listitem>
  447. </itemizedlist>
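<para>To get a feel for what <emphasis>CleanActor</emphasis> does, a small test such as the sketch below could be run in a Builder window. It assumes the IMDB folder can be imported as in the other examples in this document. The input string and the name <emphasis>rawName</emphasis> are made up for illustration.</para>
<para><programlisting>IMPORT IMDB;
// Hypothetical input: a leading tab, an embedded apostrophe, and a run of dashes
rawName := '\tLa Chispa\', Tony ----';
// CleanActor removes the apostrophe, the tab, and the '----' run, then trims spaces
OUTPUT(IMDB.CleanActor(rawName));
</programlisting></para>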
  448. <sect3 id="Examine_The_Data" role="brk">
  449. <title>Examine the Data</title>
  450. <para>In this section, we will look at the data and determine if
  451. there is any pre-processing we want to perform. This is the step in
  452. the development process where we convert the raw data into a form we
  453. can actually use.</para>
  454. <variablelist>
  455. <varlistentry>
  456. <term>Note:</term>
  457. <listitem>
  458. <para>The IMDB.FileActors.ecl file specifies the size of the
  459. header in the files (actors.list and actresses.list). The
  460. HEADING() value in the example code was accurate at the time
  461. we downloaded the IMDB data, but could change at any time. We
  462. suggest opening each file in a text editor and checking the line number
  463. where the header ends and actual data begins (as shown
  464. below).</para>
  465. </listitem>
  466. </varlistentry>
  467. </variablelist>
  468. <figure>
  469. <title>actors.list in text editor</title>
  470. <mediaobject>
  471. <imageobject>
  472. <imagedata fileref="images/IMDB_fileheading.jpg" />
  473. </imageobject>
  474. </mediaobject>
  475. </figure>
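<para>For orientation, a declaration that skips header lines might look roughly like the sketch below. This is not the contents of IMDB.FileActors. The record layout, the separator, and the HEADING value of 239 are placeholders that you would replace after inspecting your own copy of the file.</para>
<para><programlisting>// Hypothetical layout: treat each line of the file as one text field
raw_rec := RECORD
  STRING rawtext;
END;
// HEADING(n) skips the first n lines; 239 is only a placeholder value.
// SEPARATOR('|') assumes the pipe character does not appear in the data,
// with the intent that each line is read as a single field.
ds_raw := DATASET('~thor::in::IMDB::actors.list', raw_rec,
                  CSV(HEADING(239), SEPARATOR('|'), TERMINATOR(['\n','\r\n'])));
OUTPUT(CHOOSEN(ds_raw, 50)); // look at the first 50 lines after the header
</programlisting></para>
<para>If the first records returned still look like header text, the HEADING value is probably too small for your download.</para>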
  476. <para></para>
  477. <itemizedlist mark="bullet">
  478. <listitem>
  479. <para>Open a new Builder window (CTRL+N) and write the following
  480. code:</para>
  481. <para><programlisting>IMPORT IMDB;
  482. OUTPUT(IMDB.FileActors);
  483. </programlisting></para>
  484. </listitem>
  485. <listitem>
  486. <para>Press the syntax check button on the main toolbar (or
  487. press F7).</para>
  488. <para>It is always a good idea to check syntax before
  489. submitting.</para>
  490. </listitem>
  491. <listitem>
  492. <?dbfo keep-together="always"?>
  493. <para>Make sure the selected cluster is your
  494. <emphasis>thor</emphasis> cluster, then press the <emphasis
  495. role="bold">Submit</emphasis> button.</para>
  496. <para><figure>
  497. <title>Submit to Thor</title>
  498. <mediaobject>
  499. <imageobject>
  500. <imagedata fileref="images/IMDB_10.jpg" />
  501. </imageobject>
  502. </mediaobject>
  503. </figure></para>
  504. </listitem>
  505. <listitem>
  506. <para>When the Workunit completes, it displays a green checkmark.
  507. <inlinegraphic fileref="images/DT173-15.jpg" /></para>
  508. <para><emphasis role="bold">Note:</emphasis> Depending on the
  509. size of your cluster and the speed of your server(s), this
  510. process could take several minutes. If you are running this on a
  511. virtual machine, it could take as long as 45 minutes to
  512. complete.</para>
  513. </listitem>
  514. <listitem>
  515. <?dbfo keep-together="always"?>
  516. <para>Select the Workunit tab (the one with the number and the
  517. checkmark next to it) and select the <emphasis
  518. role="bold">Result 1</emphasis> tab.</para>
  519. <para><figure>
  520. <title>Select Workunit</title>
  521. <mediaobject>
  522. <imageobject>
  523. <imagedata fileref="images/IMDB_07.jpg" />
  524. </imageobject>
  525. </mediaobject>
  526. </figure></para>
  527. </listitem>
  528. <listitem>
  529. <?dbfo keep-together="always"?>
  530. <para>Scroll down to see more records.</para>
  531. <para><figure>
  532. <title>See more records</title>
  533. <mediaobject>
  534. <imageobject>
  535. <imagedata fileref="images/IMDB_08.jpg" />
  536. </imageobject>
  537. </mediaobject>
  538. </figure></para>
  539. </listitem>
  540. </itemizedlist>
  541. <itemizedlist mark="bullet">
  542. <listitem>
  543. <para>Close the Builder Window.</para>
  544. </listitem>
  545. </itemizedlist>
  546. </sect3>
  547. </sect2>
  548. <sect2 id="Processing_the_Data_E-T-L">
  549. <title><emphasis>Processing the Data: Extract,
  550. Transform, and Load</emphasis></title>
  551. <para><emphasis>In this section, we will write code to transform the
  552. original actor data as follows:</emphasis></para>
  553. <itemizedlist mark="bullet">
  554. <listitem>
  555. <para>From the raw actors data, we will do an ETL operation
  556. (Extract, Transform, Load) to build an <emphasis
  557. role="bold">actor_movie </emphasis>relation set.</para>
  558. </listitem>
  559. <listitem>
  560. <para>We will also construct a Kevin Bacon degrees of separation
  561. lookup set. This is the structure we will query to answer the
  562. question:</para>
  563. </listitem>
  564. </itemizedlist>
  565. <para><emphasis>How many degrees of separation
  566. exist between Actor X and Kevin Bacon?</emphasis></para>
  567. <para></para>
  568. <para></para>
  569. <para></para>
  570. <para><emphasis role="bold">For example: </emphasis>Using Jon Lovitz
  571. as the actor, we want information as follows:</para>
  572. <para>Jon Lovitz → (was in) Movie X → (with) Actor2 → (who was in)
  573. Movie Y → (with) Kevin Bacon</para>
  574. <para>We will then write this new file to our Thor cluster so it can
  575. be used in parameterized queries.</para>
  576. <para></para>
  577. <para></para>
  578. <para><itemizedlist mark="bullet">
  579. <listitem>
  580. <para>In the ECL IDE, go to the Repository panel and expand the
  581. IMDB folder.</para>
  582. </listitem>
  583. <listitem>
  584. <para>Open the ECL File ActorsInMovies.</para>
  585. <para>The code in this ECL file looks like this:</para>
  586. </listitem>
  587. </itemizedlist></para>
  588. <para><programlisting>/* ******************************************************************************
  589. ## Copyright 2011 HPCC Systems®. All rights reserved.
  590. ******************************************************************************* */
  591. /**
  592. * Produce a slimmed down version of the IMDB actor AND actress files to
  593. * permit more efficient join operations.
  594. * Filter out the movie records we do not want in building our KBacon Number sets.
  595. *
  596. */
  597. IMPORT $ AS IMDB;
  598. IMPORT Std;
  599. // Filter out TV movies, Videos AND some documentary type collections
  600. ds_IMDB := IMDB.FileActors(actorname!='' AND moviename != '' AND
  601. Std.Str.Find(moviename,'Boffo',1) = 0 AND
  602. Std.Str.Find(moviename,'Slasher Film',1) = 0 AND
  603. movie_type != 'Video' AND isTVseries = 'N' AND
  604. movie_type != 'For TV');
  605. //Slim the records down to bare essentials for searching AND joining
  606. slim_IMDB_rec := RECORD
  607. STRING50 actor;
  608. STRING150 movie;
  609. END;
  610. slim_IMDB_rec slim_it(ds_IMDB L):= TRANSFORM
  611. SELF.actor := Std.Str.FindReplace(L.actorname,'(I)','');
  612. SELF.movie := L.moviename;
  613. END;
  614. IMDB_names := PROJECT(ds_IMDB, slim_it(LEFT));
  615. EXPORT ActorsInMovies := IMDB_names : PERSIST('~temp::IMDB::ActorsInMovies');
  616. </programlisting></para>
  617. <para>This defines a relational dataset of actor:movie pairs. We will use
  618. this definition later.</para>
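<para>Before moving on, it may help to confirm that the slimmed-down dataset looks reasonable. The sketch below assumes the ActorsInMovies definition above is saved in the IMDB folder. The result names <emphasis>PairCount</emphasis> and <emphasis>FirstPairs</emphasis> are arbitrary labels chosen for this illustration.</para>
<para><programlisting>IMPORT IMDB;
// Number of actor:movie pairs that survived the filter
OUTPUT(COUNT(IMDB.ActorsInMovies), NAMED('PairCount'));
// First 100 records, for a quick visual check
OUTPUT(CHOOSEN(IMDB.ActorsInMovies, 100), NAMED('FirstPairs'));
</programlisting></para>
<para>Because the definition uses PERSIST, the first run builds the file and later runs can reuse it as long as the code and the input data are unchanged.</para>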
  619. </sect2>
  620. </sect1>
  621. <sect1 id="Getting_Useful_Info_from_Data">
  622. <title>Getting Useful Information from Data</title>
  623. <sect2 id="Links_and_Degrees_of_Separation">
  624. <title><emphasis>Links and Degrees of Separation</emphasis></title>
  625. <para>Now that we have our data in a useful format, have a relation
  626. defined, and the file is in place, we can write code to use the new
  627. data file.</para>
  628. <para>We want to know how many actors are a distance
  629. <emphasis>N</emphasis> from Kevin Bacon. To accomplish this, we will
  630. construct sets of actors grouped by their Kevin Bacon number, that is,
  631. their distance from Kevin Bacon.</para>
  632. <itemizedlist mark="bullet">
  633. <listitem>
  634. <para>Open the KevinBaconNumberSets ECL file.</para>
  635. </listitem>
  636. </itemizedlist>
  637. <para>This ECL code counts the number of actors with <emphasis>"bacon
  638. numbers"</emphasis> from 1 through 7, that is, up to 7 levels of
  639. separation. We will use this later to do searches by building an
  640. index.</para>
  641. <para><programlisting>/* ******************************************************************************
  642. ATTRIBUTE PURPOSE:
  643. Produce a series of sets for Actors and Movies that are: distance-0
  644. away (KBacon's direct movies), distance-1 away (KBacon's CoStars' movies),
  645. distance-2 away (movies of CoStars of CoStars), etc., all the way up to level 7
  646. The nested attributes below are shown here together for the benefit of the reader.
  647. Notes on variable naming convention used for costars and movies
  648. KBMovies : Movies Kevin Bacon Worked in (distance 0)
  649. KBCoStars : Stars who worked in KBMovies (distance 1)
  650. KBCoStarMovies : Movies worked in by KBCoStars
  651. except KBMovies (distance 1)
  652. KBCo2Stars : Stars(Actors) who worked in KBCoStarMovies (distance 2)
  653. KBCo2StarMovies : Movies worked in by KBCo2Stars
  654. except KBCoStarMovies (distance 2)
  655. KBCo3Stars : Stars(Actors) who worked in KBCo2StarMovies (distance 3)
  656. KBCo3StarMovies : Movies worked in by KBCo3Stars
  657. except KBCo2StarMovies (distance 3)
  658. etc..
  659. ******************************************************************************* */
  660. IMPORT Std;
  661. IMPORT IMDB;
  662. EXPORT KevinBaconNumberSets := MODULE
  663. // Constructing a proper name match function is an art within itself
  664. // For simplicity we will define a name as matching if both first and last name
  665. //are found within the string
  666. NameMatch(string full_name, string fname,string lname) :=
  667. Std.Str.Find(full_name,fname,1) &gt; 0 AND
  668. Std.Str.Find(full_name,lname,1) &gt; 0;
  669. //------ Get KBacon Movies
  670. AllKBEntries := IMDB.ActorsInMovies(NameMatch(actor,'Kevin','Bacon'));
  671. EXPORT KBMovies := DEDUP(AllKBEntries, movie, ALL); // Each movie should ONLY occur once
  672. //------ Get KBacon CoStars
  673. CoStars := IMDB.ActorsInMovies(Movie IN SET(KBMovies,Movie));
  674. EXPORT KBCoStars := DEDUP( CoStars(actor&lt;&gt;'Kevin Bacon'), actor, ALL);
  675. //------ Get KBacon Costars' Movies
  676. // CSM = First find all of the movies that a KBCoStar has been in
  677. CSM := DEDUP(JOIN(IMDB.ActorsInMovies,KBCoStars, LEFT.actor=RIGHT.actor,
  678. TRANSFORM(LEFT), LOOKUP),
  679. movie,ALL);
  680. // Now we need to remove all of those that KB was in himself
  681. // We can use a set; KB has not been in (quite!) that many movies
  682. EXPORT KBCoStarMovies := CSM(movie NOT IN SET(KBMovies,movie));
  683. //------ Bacon # 2 Actors
  684. // To be a Co2Star of Kevin Bacon you must have appeared in a movie with a
  685. //CoStar of Kevin Bacon
  686. // This corresponds to having a Bacon number of 2
  687. // We are now getting towards the expensive part of the process
  688. KBCo2S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCoStarMovies, LEFT.movie=RIGHT.movie,
  689. TRANSFORM(LEFT), LOOKUP),
  690. actor, ALL);
  691. // KBCo2S = ALL Actors appearing in Movies of KBacon's CoActors
  692. // The above is all the people in the movies; but some will have been co-stars of KB
  693. //directly - these must be removed
  694. // The LEFT ONLY join removes items in one list from another
  695. EXPORT KBCo2Stars := JOIN(KBCo2S, KBCoStars, LEFT.actor=RIGHT.actor,
  696. TRANSFORM(LEFT), LEFT ONLY);
  697. //------- bacon # 2 Movies
  698. // Co2SM = what movies have all the Co2Stars been in?
  699. Co2SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2Stars, LEFT.actor=RIGHT.actor,
  700. TRANSFORM(LEFT), LOOKUP),
  701. movie, ALL);
  702. // Co2SM = ALL Movies KBCo2Stars have been in
  703. // Of course some of these movies will have CoStars in too and thus will already have
  704. //been listed. Note this list will not contain any Kevin Bacon movies OR the movie would
  705. //have been reached earlier!
  706. EXPORT KBCo2StarMovies := JOIN(Co2SM, KBCoStarMovies, LEFT.movie=RIGHT.movie,
  707. TRANSFORM(LEFT),LEFT ONLY);
  708. //------ bacon #3 Actors
  709. // Find people with a Bacon number of 3
  710. // This code is very similar to KBCo2Stars; one might be tempted to common up into a
  711. // function or macro. However it is worth looking at the attribute counts first; we may be
  712. // down to a small enough set that we can start using in-memory functions (e.g.,SET) again.
  713. KBCo3S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
  714. TRANSFORM(LEFT), LOOKUP),
  715. actor, ALL);
  716. // KBCo3S = ALL CoStars in KBCo2Star Movies
  717. // The above is all the people in the movies; but some will have been co2stars of KB
  718. // directly - these must be removed. The LEFT ONLY join removes items in one list from
  719. // another. There should not be any direct CoStars in this list (or the movie would have
  720. // been a CoStarMovie not a CoCoStarMovie)
  721. EXPORT KBCo3Stars := JOIN(KBCo3S, KBCo2Stars, LEFT.actor=RIGHT.actor,
  722. TRANSFORM(LEFT),LEFT ONLY);
  723. //----- bacon #3 Movies
  724. // So what movies have all the KBCo3Stars been in?
  725. Co3SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3Stars, LEFT.actor=RIGHT.actor,
  726. TRANSFORM(LEFT), LOOKUP),
  727. movie, ALL);
  728. // Co3SM = ALL Movies KBCo3Stars have been in
  729. // Of course some of these movies will have KBCo2Stars in too and thus will already have
  730. // been listed. Note We ONLY have to remove one level back from the list; previous levels
  731. // cannot be reached by definition
  732. EXPORT KBCo3StarMovies := JOIN(Co3SM, KBCo2StarMovies, LEFT.movie=RIGHT.movie,
  733. TRANSFORM(LEFT),LEFT ONLY);
  734. //------bacon #4 Actors
  735. KBCo4S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
  736. TRANSFORM(LEFT), LOOKUP),
  737. actor, ALL);
  738. EXPORT KBCo4Stars := JOIN(KBCo4S, KBCo3Stars, LEFT.actor=RIGHT.actor,
  739. TRANSFORM(LEFT),LEFT ONLY);
  740. //----- bacon #4 Movies
  741. // So what movies have all the Co4Stars been in?
  742. Co4SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4Stars, LEFT.actor=RIGHT.actor,
  743. TRANSFORM(LEFT), LOOKUP),
  744. movie, ALL);
  745. // Co4SM = ALL Movies KBCo4Stars have been in
  746. // Of course some of these movies will have Co3Stars in too and thus will already have
  747. // been listed. Note We ONLY have to remove one level back from the list; previous levels
  748. // cannot be reached by definition
  749. EXPORT KBCo4StarMovies := JOIN(Co4SM, KBCo3StarMovies, LEFT.movie=RIGHT.movie,
  750. TRANSFORM(LEFT),LEFT ONLY);
  751. //----- bacon #5 Stars
  752. KBCo5S := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
  753. TRANSFORM(LEFT), LOOKUP),
  754. actor, ALL);
  755. EXPORT KBCo5Stars := JOIN(KBCo5S, KBCo4Stars, LEFT.actor=RIGHT.actor,
  756. TRANSFORM(LEFT),LEFT ONLY);
  757. //----- bacon #5 Movies
  758. Co5SM := DEDUP(JOIN(IMDB.ActorsInMovies, KBCo5Stars, LEFT.actor=RIGHT.actor,
  759. TRANSFORM(LEFT), LOOKUP),
  760. movie,ALL);
  761. EXPORT KBCo5StarMovies := JOIN(Co5SM, KBCo4StarMovies, LEFT.movie=RIGHT.movie,
  762. TRANSFORM(LEFT),LEFT ONLY);
  763. //----- bacon #6 Stars
  764. // Find people with a Bacon number of 6
  765. // KBCo5 is getting small again - can move back down to the SET?
  766. KBCo6S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo5StarMovies, movie)),
  767. actor, ALL);
  768. EXPORT KBCo6Stars := JOIN(KBCo6S, KBCo5Stars, LEFT.actor=RIGHT.actor,
  769. TRANSFORM(LEFT),LEFT ONLY);
  770. //----- bacon #6 Movies
  771. Co6SM := DEDUP(IMDB.ActorsInMovies(actor IN SET(KBCo6Stars, actor)), movie, ALL);
  772. EXPORT KBCo6StarMovies := Co6SM(movie NOT IN SET(KBCo5StarMovies, movie));
//----- bacon #7 Stars
// Find people with a Bacon number of 7
  775. KBCo7S := DEDUP(IMDB.ActorsInMovies(movie IN SET(KBCo6StarMovies,movie)), actor, ALL);
  776. EXPORT KBCo7Stars := KBCo7S(actor NOT IN SET(KBCo6Stars, actor));
  777. //----- We just have to count them all !! (How many holes in Albert Hall?)
  778. EXPORT doCounts := PARALLEL(
  779. OUTPUT(COUNT(KBMovies), NAMED('KBMovies')),
  780. OUTPUT(COUNT(KBCoStars), NAMED('KBCoStars')),
  781. OUTPUT(COUNT(KBCoStarMovies), NAMED('KBCoStarMovies')),
  782. OUTPUT(COUNT(KBCo2Stars), NAMED('KBCo2Stars')),
  783. OUTPUT(COUNT(KBCo2StarMovies), NAMED('KBCo2StarMovies')),
  784. OUTPUT(COUNT(KBCo3Stars), NAMED('KBCo3Stars')),
  785. OUTPUT(COUNT(KBCo3StarMovies), NAMED('KBCo3StarMovies')),
  786. OUTPUT(COUNT(KBCo4Stars), NAMED('KBCo4Stars')),
  787. OUTPUT(COUNT(KBCo4StarMovies), NAMED('KBCo4StarMovies')),
  788. OUTPUT(COUNT(KBCo5Stars), NAMED('KBCo5Stars')),
  789. OUTPUT(COUNT(KBCo5StarMovies), NAMED('KBCo5StarMovies')),
  790. OUTPUT(COUNT(KBCo6Stars), NAMED('KBCo6Stars')),
  791. OUTPUT(COUNT(KBCo6StarMovies), NAMED('KBCo6StarMovies')),
  792. OUTPUT(COUNT(KBCo7Stars), NAMED('KBCo7Stars')),
  793. OUTPUT(KBCo7Stars)
  794. );
  795. END;</programlisting></para>
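<para>The comments in the code above mention that the repeated "actors"
step could be commoned up into a function. The following is only a minimal
sketch of that idea, not part of the IMDB folder: the names
<emphasis>ActorRec</emphasis>, <emphasis>NextLevelStars</emphasis>, and
<emphasis>KB</emphasis> are hypothetical, and it assumes that the level
attributes (such as KBCo2Stars and KBCo2StarMovies) are exported and share
the record layout of IMDB.ActorsInMovies.</para>
<para><programlisting>IMPORT IMDB;

// Hypothetical helper: one "actors" step of the Bacon-number walk.
// ASSUMPTION: every level dataset has the layout of IMDB.ActorsInMovies.
ActorRec := RECORDOF(IMDB.ActorsInMovies);

NextLevelStars(DATASET(ActorRec) prevMovies,
               DATASET(ActorRec) prevStars) := FUNCTION
  // Everyone who appeared in a movie from the previous level
  allStars := DEDUP(JOIN(IMDB.ActorsInMovies, prevMovies,
                         LEFT.movie = RIGHT.movie,
                         TRANSFORM(LEFT), LOOKUP),
                    actor, ALL);
  // Remove actors already found at the previous level
  RETURN JOIN(allStars, prevStars,
              LEFT.actor = RIGHT.actor,
              TRANSFORM(LEFT), LEFT ONLY);
END;

// Usage: this count should match KBCo3Stars if the assumptions hold.
KB := IMDB.KevinBaconNumberSets;   // local alias, for brevity only
OUTPUT(COUNT(NextLevelStars(KB.KBCo2StarMovies, KB.KBCo2Stars)),
       NAMED('KBCo3StarsViaHelper'));</programlisting></para>
<para>As the original comments point out, a shared function is not always
the best choice: once a level is small enough, the SET-based filter may be
preferable to the JOIN used in this sketch.</para>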
  796. <itemizedlist mark="bullet">
  797. <listitem>
  798. <para>Open a new Builder Window and type:</para>
  799. </listitem>
  800. </itemizedlist>
  801. <para><programlisting>IMPORT IMDB;
  802. IMDB.KevinBaconNumberSets.doCounts;</programlisting></para>
  803. <itemizedlist mark="bullet">
  804. <listitem>
  805. <para>Check the syntax then press the <emphasis
  806. role="bold">Submit</emphasis> button.</para>
  807. <para><emphasis role="bold">Note:</emphasis> Depending on the size
  808. of your cluster and the speed of your server(s), this process
  809. could take several minutes. If you are running this on a virtual
  810. machine, it could take as long as an hour to complete.</para>
  811. </listitem>
  812. <listitem>
<para>When the process completes, each row shown below becomes
its own result tab. You will get a sample of the output as
follows:</para>
<para><emphasis role="bold">Note:</emphasis> The data files for
this tutorial change frequently, so your results may vary from those
shown in this document.</para>
  819. </listitem>
  820. </itemizedlist>
  821. <para><informaltable>
  822. <?dbfo keep-together="always"?>
  823. <tgroup cols="2">
  824. <tbody>
  825. <row>
  826. <entry>KB Movies</entry>
  827. <entry>71</entry>
  828. </row>
  829. <row>
  830. <entry>KB Co Stars</entry>
  831. <entry>3520</entry>
  832. </row>
  833. <row>
  834. <entry>KB Co Star Movies</entry>
  835. <entry>33504</entry>
  836. </row>
  837. <row>
  838. <entry>KB Co 2 Stars</entry>
  839. <entry>430145</entry>
  840. </row>
  841. <row>
  842. <entry>KB Co 2 Star Movies</entry>
  843. <entry>251867</entry>
  844. </row>
  845. <row>
  846. <entry>KB Co 3 Stars</entry>
  847. <entry>896009</entry>
  848. </row>
  849. <row>
  850. <entry>KB Co 3 Star Movies</entry>
  851. <entry>51650</entry>
  852. </row>
  853. <row>
  854. <entry>KB Co 4 Stars</entry>
  855. <entry>102729</entry>
  856. </row>
  857. <row>
  858. <entry>KB Co 4 Star Movies</entry>
  859. <entry>2634</entry>
  860. </row>
  861. <row>
  862. <entry>KB Co 5 Stars</entry>
  863. <entry>6080</entry>
  864. </row>
  865. <row>
  866. <entry>KB Co 5 Star Movies</entry>
  867. <entry>190</entry>
  868. </row>
  869. <row>
  870. <entry>KB Co 6 Stars</entry>
  871. <entry>450</entry>
  872. </row>
  873. <row>
  874. <entry>KB Co 6 Star Movies</entry>
  875. <entry>14</entry>
  876. </row>
  877. <row>
  878. <entry>KB Co 7 Stars</entry>
  879. <entry>22</entry>
  880. </row>
  881. </tbody>
  882. </tgroup>
  883. </informaltable></para>
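<para>Because the full doCounts run can take a long time, it may help to
check only the first levels on their own. The following is a minimal sketch,
assuming that KBMovies and KBCoStars are exported from KevinBaconNumberSets;
the alias <emphasis>KB</emphasis> is a convenience introduced here and is not
part of the tutorial code.</para>
<para><programlisting>IMPORT IMDB;

KB := IMDB.KevinBaconNumberSets;   // hypothetical local alias

// Only the attributes referenced here need to be evaluated,
// so the deeper levels are not computed by this workunit.
PARALLEL(
  OUTPUT(COUNT(KB.KBMovies), NAMED('KBMovies')),
  OUTPUT(COUNT(KB.KBCoStars), NAMED('KBCoStars'))
);</programlisting></para>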
  884. </sect2>
  885. </sect1>
  886. </chapter>
  887. <chapter id="Next_Steps">
  888. <title>Next Steps</title>
  889. <para>Now that you have successfully processed the data and established
  890. links, what's next?</para>
  891. <para>Two more ECL files are included in the IMDB folder that you can use
  892. in conjunction with the examples you have already worked through in this
  893. tutorial:</para>
  894. <para>• KeysKevinBacon -- Builds an index of actors/actresses and the
  895. movies they have starred in.</para>
  896. <para>You must build this index before you can run queries to find the
  897. degree of separation between Kevin Bacon and an actor of your
  898. choice.</para>
  899. <para>To build the index, open a builder window and type the following
  900. code:</para>
  901. <para><programlisting>IMPORT IMDB;
  902. IMDB.KeysKevinBacon.BuildAll;</programlisting></para>
  903. <para>Press the <emphasis role="bold">Submit</emphasis> button to run the
  904. ECL code and build the index.</para>
<para>• SearchKevinBaconLinks -- Searches the index you built to give you
the degree of separation between an actor and Kevin Bacon.</para>
  907. <para>For example, to find the degree of separation between Kevin Bacon
  908. and Andi Everingham, open a builder window and type the following
  909. code:</para>
  910. <para><programlisting>IMPORT IMDB;
  911. IMDB.SearchKevinBaconLinks('Everingham, Andi');</programlisting></para>
  912. <para>Make sure the selected cluster is your <emphasis>hThor</emphasis>
  913. cluster, then press the <emphasis role="bold">Submit</emphasis> button to
  914. run the query.</para>
  915. <para>When it has completed, click on the Workunit ID tab.</para>
  916. <para>Two results are shown.</para>
  917. <para><emphasis role="bold">Result1</emphasis> shows the degree of
  918. separation between the actor and Kevin Bacon.</para>
  919. <para>Interpret the results as follows:</para>
  920. <para>Actor is at level 1 - The actor you chose and Kevin Bacon starred in
  921. a movie together.</para>
  922. <para>Actor is at level 2 - The actor you chose starred in a movie with an
  923. actor who starred in a movie with Kevin Bacon.</para>
  924. <para>The higher the level, the greater the degree of separation between
  925. the actor you chose and Kevin Bacon.</para>
  926. <para>In this example, the actor is at level 6, indicating that there are
  927. 6 degrees of separation between Andi Everingham and Kevin Bacon.</para>
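<para>One way to cross-check a level is to look the actor up in the matching
set from KevinBaconNumberSets. This is only a sketch: it assumes that
KBCo6Stars is exported and that its <emphasis>actor</emphasis> field holds
names in the same 'Surname, Forename' form used in the search above. The
alias <emphasis>KB</emphasis> is hypothetical.</para>
<para><programlisting>IMPORT IMDB;

KB := IMDB.KevinBaconNumberSets;   // hypothetical local alias

// TRUE if the actor appears among the actors with a Bacon number of 6
OUTPUT(EXISTS(KB.KBCo6Stars(actor = 'Everingham, Andi')),
       NAMED('InKBCo6Stars'));</programlisting></para>
<para>Because the data files change frequently, the result may differ from
the level reported in this document.</para>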
  928. <para><emphasis role="bold">Result2</emphasis> shows the level (degree of
  929. separation), the name of the actor and the movie they starred in.</para>
<para>Each line shows an actor and a movie they starred in. Read together,
the lines form the chain of links from the actor you chose to Kevin
Bacon.</para>
  932. <para>Have fun finding the degrees of separation between any actor and
  933. Kevin Bacon.</para>
  934. <para>Remember to build the index first.</para>
  935. </chapter>
  936. </book>