
CSE 6242 / CX 4242: Data and Visual Analytics Georgia Tech, Spring 2019

Homework 1: Analyzing The MovieDB data; SQLite; D3 Warmup; Gephi; OpenRefine

Prepared by our 30+ wonderful TAs of CSE6242A,Q,OAN,O01,O3/CX4242A for our 1200+ students

Submission Instructions and Important Notes:

It is important that you read the following instructions carefully and also those about the deliverables at the end of each question or you may lose points.

  • Always check to make sure you are using the most up-to-date assignment (version number at bottom right of this document).
  • Submit a single zipped file, called “HW1-{GT username}.zip”, containing all the deliverables including source code/scripts, data files, and readme. Example: “HW1-jdoe3.zip” if GT account username is “jdoe3”. Only .zip is allowed (no other format will be accepted). Your GT username is the one with letters and numbers.
  • You may discuss high-level ideas with other students at the “whiteboard” level (e.g., how cross validation works, use hashmap instead of array) and review any relevant materials online. However, each student must write up and submit his or her own answers.
  • All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures (e.g., reported to and directly handled by the Office of Student Integrity (OSI)). Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.
  • At the end of this assignment, we have specified a folder structure you must use to organize your files in a single zipped file. 5 points will be deducted for not following this strictly.

  • In your final zip file, do not include any intermediate files you may have generated to work on the task, unless your script is absolutely dependent on it to get the final result (which it ideally should not be).
  • We may use auto-grading scripts to grade some of your deliverables, so it is extremely important that you strictly follow our requirements.
  • Wherever you are asked to write down an explanation for the task you perform, stay within the word limit or you may lose points.
  • Every homework assignment deliverable and every project deliverable comes with a 48-hour “grace period”. Such deliverable may be submitted (and resubmitted) up to 48 hours after the official deadline without penalty. You do not need to ask before using this grace period.
  • Any deliverable submitted after the grace period will get 0 credit.
  • We will not consider late submission of any missing parts of a deliverable. To make sure you have submitted everything, download your submitted files to double check. If you are submitting large files, you are responsible for making sure they get uploaded to the system in time. You have the whole grace period to verify your submissions!

Download the HW1 Skeleton before you begin. (The Q3 folder is initially empty.)

Grading

The maximum possible score for this homework is 100 points.


Q1 [40 points] Collecting and visualizing The Movie DB (TMDb) data

Q1.1 [25 points] Collecting Movie Data

You will use “The Movie DB” API version 3 to: (1) download data about movies and (2) for each movie, download its 5 similar movies.

You will write code using Python 3.7.x in script.py in this question. You will need an API key to use the TMDb data. Your API key will be an input to script.py so that we can run your code with our own API key to check the results. Running the following command should generate the CSV files specified in part b and part c:

python3 script.py <API_KEY>

Please refer to this tutorial to learn how to parse command line arguments. DO NOT leave your API key written in the code.
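The key can be read with nothing more than sys.argv (a minimal sketch; argparse works just as well, and the function name below is our own, not part of the skeleton):

```python
import sys

def get_api_key(argv):
    # Expect exactly one argument: python3 script.py <API_KEY>.
    # Exiting with a usage message beats hard-coding the key.
    if len(argv) != 2:
        sys.exit("usage: python3 script.py <API_KEY>")
    return argv[1]
```

In script.py you would call `get_api_key(sys.argv)` once at startup and pass the returned key to every request.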

Note:

You may only use the modules and libraries provided at the top of the script.py file included in the skeleton for Q1 and modules from the Python Standard Library. Python wrappers (or modules) for the TMDb API may NOT be used for this assignment. Pandas and NumPy also may NOT be used — while we understand that they are useful libraries to learn, completing this question is not critically dependent on their functionality. In addition, to make grading more manageable and to enable our TAs to provide better, more consistent support to our students, we have decided to restrict the libraries accordingly.
a. How to use TheMovieDB API:

  • Create a TMDb account and request an API key – https://www.themoviedb.org/account/signup. Refer to this document for detailed instructions.
  • Refer to the API documentation https://developers.themoviedb.org/3/getting-started/introduction , as you work on this question.
  • Use v3 of the API. You will need to use the v3 API key in your requests.
Note:
  • The API allows you to make 40 requests every 10 seconds. Set appropriate timeout intervals in your code while making requests. We recommend that you estimate how long your script will take to run when solving this question, so you will complete it on time.
  • The API endpoint may return different results for the same request.
  • You will be penalized for a runtime exceeding 5 minutes.
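One way to stay under the 40-requests-per-10-seconds ceiling is to space consecutive calls at least 0.25 s apart. A sketch, assuming only the Standard Library (the helper names and structure are ours, not part of the skeleton):

```python
import json
import time
import urllib.parse
import urllib.request

MIN_INTERVAL = 10.0 / 40.0  # 40 requests per 10 s => 0.25 s between calls

def throttle_delay(last_call, now, min_interval=MIN_INTERVAL):
    # Seconds to sleep so consecutive requests stay >= min_interval apart.
    return max(0.0, min_interval - (now - last_call))

_last_request = [0.0]  # timestamp of the previous request

def get_json(url, params, timeout=10):
    # Sleep just long enough, then issue the request with a timeout.
    time.sleep(throttle_delay(_last_request[0], time.time()))
    _last_request[0] = time.time()
    full_url = url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(full_url, timeout=timeout) as resp:
        return json.load(resp)
```

Because the endpoint may return different results for the same request, avoid re-fetching the same page twice; collect each response once and reuse it.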

b. [10 points] Search for movies in the ‘Drama’ genre released in the year 2004 or later. Retrieve the 350 most popular movies in this genre. The movies should be sorted from most popular to least popular. Hint: Sorting based on popularity can be done in the API call.

  • Documentation for discovering movies and listing genres: https://developers.themoviedb.org/3/discover/movie-discover https://developers.themoviedb.org/3/genres/get-movie-list
  • Save the results in movie_ID_name.csv.
    Each line in the file should describe one movie, in the following format — NO space after the comma, and do not include any column headers:

    movie-ID,movie-name

    For example, a line in the file could look like:

    353486,Jumanji: Welcome to the Jungle

Note:
  • You may need to make multiple API calls to retrieve all movies. For example, the results may be returned in “pages,” so you may need to retrieve them page by page.
  • Please use the “primary_release_date” parameter instead of the “release_date” parameter in the API when retrieving movies released in the time period specified above. The “release_date” parameter will incorrectly return a movie if any of its release dates fall within the years listed.
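Concretely, a discover request might be parameterized as below. This is a sketch: the parameter names follow the documented /discover/movie endpoint, but the helper functions are ours, and the Drama genre ID must first be looked up via the genre-list endpoint.

```python
import math

DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"
PER_PAGE = 20  # the discover endpoint returns 20 results per page

def build_discover_params(api_key, genre_id, page):
    return {
        "api_key": api_key,
        "with_genres": genre_id,                   # Drama, from /genre/movie/list
        "primary_release_date.gte": "2004-01-01",  # NOT release_date.gte
        "sort_by": "popularity.desc",              # most to least popular
        "page": page,
    }

def pages_needed(total, per_page=PER_PAGE):
    # 350 movies at 20 per page => 18 requests (the last page is partial).
    return math.ceil(total / per_page)
```

Looping `page` from 1 to `pages_needed(350)` and truncating the combined list to 350 entries covers the pagination note above.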

c. [15 points] Step 1: similar movie retrieval. For each movie retrieved in part b, use the API to find its 5 similar movies. If a movie has fewer than 5 similar movies, the API will return as many as it can find. Your code should be flexible enough to handle however many movies the API returns.

Step 2: deduplication. After all similar movies have been found, remove all duplicate movie pairs. That is, if both the pairs A,B and B,A are present, only keep A,B where A < B. If you remove a pair due to duplication, there is no need to fetch additional similar movies for the given movie (that is, you should NOT re-run any part of Step 1). For example, if movie A has three similar movies X, Y, and Z, and movie X has two similar movies A and B, then there should be only four lines in the file:

A,X

A,Y

A,Z

X,B

  • Documentation for obtaining similar movies: https://developers.themoviedb.org/3/movies/get-similar-movies
  • Save the results in movie_ID_sim_movie_ID.csv.
    Each line in the file should describe one pair of similar movies — NO space after comma, and do not include any column headers:

    movie-ID,similar-movie-ID
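The deduplication rule above can be implemented with a dictionary keyed on the unordered pair. A sketch on toy data (the function name is ours):

```python
def dedupe_pairs(pairs):
    # Key each pair on its unordered form; when both orientations
    # (A,B) and (B,A) occur, keep the one whose first ID is smaller.
    seen = {}
    order = []
    for a, b in pairs:
        if a == b:
            continue  # defensively drop self-pairs
        key = (min(a, b), max(a, b))
        if key not in seen:
            seen[key] = (a, b)   # first orientation encountered
            order.append(key)
        else:
            seen[key] = key      # duplicate seen: enforce A < B
    return [seen[k] for k in order]
```

On the example above, `dedupe_pairs([("A","X"), ("A","Y"), ("A","Z"), ("X","A"), ("X","B")])` keeps exactly the four lines A,X A,Y A,Z X,B.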

Deliverables: Place all the files listed below in the Q1 folder.

  • movie_ID_name.csv: The text file that contains the output to part b.
  • movie_ID_sim_movie_ID.csv: The text file that contains the output to part c.
  • script.py: The Python 3.7 script you write that generates both movie_ID_name.csv and movie_ID_sim_movie_ID.csv.

Note: Q1.2 builds on the results of Q1.1. Specifically, Q1.2 asks that a “Source,Target” header be added to the file produced in Q1.1. If you have completed both Q1.1 and Q1.2, your csv will have the header row — please submit this file. If you have completed only Q1.1, but not Q1.2 (for any reason), then please submit the csv file without the header row.

Q1.2 [15 points] Visualizing Movie Similarity Graph

Using Gephi, visualize the network of similar movies obtained. You can download Gephi here. Ensure your system fulfills all requirements for running Gephi.

  1. Go through the Gephi quick-start guide.
  2. [2 points] Manually insert Source,Target as the first line in movie_ID_sim_movie_ID.csv. Each line now represents a directed edge with the format Source,Target. Import all the edges contained in the file using Data Laboratory in Gephi.

Note: Ensure that “create missing nodes” option is selected while importing since we do not have an explicit nodes file.

3. [8 points] Using the following guidelines, create a visually meaningful graph:

  • Keep edge crossings to a minimum, and avoid as much node overlap as possible.
  • Keep the graph compact and symmetric if possible.
  • Whenever possible, show node labels. If showing all node labels creates too much visual complexity, try showing labels only for the “important” nodes. We recommend that you first run Gephi’s built-in stat functions to gain more insight about a given node.
  • Use nodes’ spatial positions to convey information (e.g., “clusters” or groups).

Experiment with Gephi’s features, such as graph layouts, changing node size and color, edge thickness, etc. The objective of this task is to familiarize yourself with Gephi; therefore this is a fairly open-ended task.

  4. [5 points] Using Gephi’s built-in functions, compute the following metrics for your graph:
  • Average node degree (run the function called “Average Degree”)
  • Diameter of the graph (run the function called “Network Diameter”)
  • Average path length (run the function called “Avg. Path Length”)

Briefly explain the intuitive meaning of each metric in your own words.
You will learn about these metrics in the “graphs” lectures.

Deliverables: Place all the files listed below in the Q1 folder.

  • For part 2: movie_ID_sim_movie_ID.csv (with Source,Target as its first line).
  • For part 3: an image file named “graph.png” (or “graph.svg”) containing your visualization and a text file named “graph_explanation.txt” describing your design choices, using no more than 50 words.
  • For part 4: a text file named “metrics.txt” containing the three metrics and your intuitive explanation for each of them, using no more than 100 words.


Q2 [35 points] SQLite

SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file which doesn’t need to be parsed explicitly (unlike CSV files, which have to be loaded and parsed).

You will modify the given Q2.SQL.txt file by adding SQL statements and SQLite commands to it.

 

We will autograde your solution by running the following command that generates Q2.db and Q2.OUT.txt (assuming the current directory contains the data files).

$ sqlite3 Q2.db < Q2.SQL.txt > Q2.OUT.txt

Since no auto-grader is bullet-proof, we ask that you be mindful of all the important points and notes below, which can cause the auto-grader to return an error. Our goal is to efficiently grade your assignment and return it as quickly as we can, so you can receive feedback and learn from the experience that you’ll gain in this course.

  • You will not receive any points if we are unable to generate the two output files above.
  • You will lose points if you do not strictly follow the output format specified in each question below. The output format corresponds to the headers/column names for your SQL command output.

We have added some lines of code to the Q2.SQL.txt file for autograding purposes. DO NOT REMOVE/MODIFY THESE LINES. You will not receive any points if these statements are modified in any way (our autograder will check for changes). There are clearly marked regions in the Q2.SQL.txt file where you should add your code.

Examples of modifying the autograder code that can cause you to lose points:
  • Putting any code or text of any kind, intentionally or unintentionally, outside the designated regions.
  • Modifying, updating, or removing the provided statements / instructions / text in any way.
  • Leaving in unnecessary debug/print statements in your submission. You may desire to print out more output than required during your development and debugging, but make sure to remove all extra code and text before submission.
Regrettably, we will not be releasing the auto-grader for your testing, since that would likely invite unwanted attempts to game it. However, we have provided Q2.OUT.SAMPLE.txt with sample data that gives an example of how your final Q2.OUT.txt should look after running the above command. Note that the sample data should not be submitted or relied upon for any purpose other than output reference and format checking. Avoid printing unnecessary output in your final submission, as it will affect autograding and you will lose points.

WARNING: Do not copy and paste any code/command from this PDF for use in the sqlite command prompt, because PDFs sometimes introduce hidden/special characters, causing SQL errors. This might cause the autograder to fail, and you will lose points in such a case. You should manually type out the commands instead.

NOTE:

For the questions in this section, you must only use INNER JOIN when performing a join between two tables. Other types of joins may result in incorrect results.

NOTE: Do not use .mode csv in your Q2.SQL.txt file. This will cause quotes to be printed in the output of each SELECT … ; statement.

a. Create tables and import data.

i. [2 points] Create two tables named ‘movies’ and ‘movie_cast’ with columns having the indicated data types:

  • movies
    • id (integer)
    • name (text)
    • score (integer)
  • movie_cast
    • movie_id (integer)
    • cast_id (integer)
    • cast_name (text)

ii. [1 point] Import the provided movie-name-score.txt file into the movies table, and movie-cast.txt into the movie_cast table. Use SQLite’s .import command for this. Only use relative paths while importing files, since absolute/local paths are specific locations that exist only on your computer and will cause the autograder to fail.

b. [2 points] Create indexes. Create the following indexes for the tables specified below. This step increases the speed of subsequent operations; though the improvement in speed may be negligible for this small database, it is significant for larger databases.

i. scores_index for the score column in the movies table

ii. cast_index for the cast_id column in the movie_cast table

iii. movie_index for the id column in the movies table

c. [3 points] Calculate a proportion. Find the proportion of movies having a score > 50. Treat each row as a different movie. The proportion should be based only on the total number of rows in the movies table.

Output format and sample value:

prop

77.7
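The shape of such a query can be prototyped with Python's built-in sqlite3 module on toy rows (an illustration of the SQL idea only; the graded answer belongs in Q2.SQL.txt). Multiplying by 100.0, a float, keeps SQLite from doing integer division:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (id integer, name text, score integer)")
con.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                [(1, "A", 80), (2, "B", 60), (3, "C", 40), (4, "D", 30)])

# score > 50 evaluates to 0/1 in SQLite, so SUM() counts matching rows.
prop, = con.execute(
    "SELECT 100.0 * SUM(score > 50) / COUNT(*) AS prop FROM movies"
).fetchone()
print(prop)  # 50.0 for these toy rows (2 of 4 movies above 50)
```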

d. [3 points] Find the highest scoring movies. List the seven best movies (highest scores). Sort your output by score from highest to lowest, then by name in alphabetical order.

Output format and sample value:

id,name,score

7878,Kirk Krandall,44

e. [3 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances (movie_count).

  • Sort the number of appearances from highest to lowest.
  • In case of a tie in the number of appearances, sort the results by cast_name in alphabetical order.

Output format and sample value:

cast_id,cast_name,movie_count

68686,Harrison Ford,2
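The appearance count is a plain GROUP BY with a two-key ORDER BY. Sketched below on toy rows via Python's sqlite3 module (illustrative data, not the graded files):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movie_cast "
            "(movie_id integer, cast_id integer, cast_name text)")
con.executemany("INSERT INTO movie_cast VALUES (?, ?, ?)", [
    (1, 10, "Bob"), (2, 10, "Bob"),      # 2 appearances
    (1, 20, "Alice"), (2, 20, "Alice"),  # 2 appearances, wins the tie
    (3, 30, "Carol"),                    # 1 appearance
])

# Ties on movie_count fall back to alphabetical cast_name.
rows = con.execute("""
    SELECT cast_id, cast_name, COUNT(*) AS movie_count
    FROM movie_cast
    GROUP BY cast_id, cast_name
    ORDER BY movie_count DESC, cast_name ASC
    LIMIT 5
""").fetchall()
```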

f. [6 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores.

  • Sort your output by average score (from high to low).
  • In case of a tie in the average score, sort the results by cast_name in alphabetical order.
  • Do not include movies with score < 50 in the average score calculation.
  • Exclude cast members who have appeared in two or fewer movies.

Output format:

cast_id,cast_name,average_score

8822,Julia Roberts,53

g. [8 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who have appeared in at least 3 movies together AND the average score of these movies is >= 40.

The view should have the format:

good_collaboration(

cast_member_id1,

cast_member_id2,

movie_count,

average_movie_score)

For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any self pairs (cast_member_id1 == cast_member_id2).
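The "keep only the lower-ID-first row" requirement falls out of a self-join condition `a.cast_id < b.cast_id`, which removes self pairs and mirror pairs in one stroke. A sketch on toy data via Python's sqlite3 module (illustrative only; the graded answer is a CREATE VIEW in Q2.SQL.txt):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (id integer, name text, score integer)")
con.execute("CREATE TABLE movie_cast "
            "(movie_id integer, cast_id integer, cast_name text)")
# Actors 1 and 2 co-star in three movies averaging a score of 45.
con.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                [(100, "M1", 50), (101, "M2", 40), (102, "M3", 45)])
con.executemany("INSERT INTO movie_cast VALUES (?, ?, ?)",
                [(m, c, "actor%d" % c)
                 for m in (100, 101, 102) for c in (1, 2)])

rows = con.execute("""
    SELECT a.cast_id AS cast_member_id1,
           b.cast_id AS cast_member_id2,
           COUNT(*) AS movie_count,
           AVG(m.score) AS average_movie_score
    FROM movie_cast a
    INNER JOIN movie_cast b
            ON a.movie_id = b.movie_id
           AND a.cast_id < b.cast_id   -- drops self and mirror pairs
    INNER JOIN movies m ON m.id = a.movie_id
    GROUP BY a.cast_id, b.cast_id
    HAVING COUNT(*) >= 3 AND AVG(m.score) >= 40
""").fetchall()
```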

Full points will only be awarded for queries that use joins for part g.

Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the autograding.

NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing.

Optional Reading: Why create views?

h. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view made in the last part, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors appearing as cast_member_id1 as well as cast_member_id2.

  • Sort your output in descending order of this score.
  • In case of a tie in the score, sort alphabetically by cast_name.

Output format:

cast_id,cast_name,collaboration_score

i. SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). Import movie overview data from movie-overview.txt into a new FTS table called movie_overview with the schema:

movie_overview (

id integer,

name text,

year integer,

overview text,

popularity decimal)

NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOT are case sensitive in FTS queries.
  1. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings.

e.g., Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’. Disallowed: ‘gunfight’, ‘fighting’, etc.

Output format:

count_overview

  2. [2 points] List the id’s of the movies that contain the terms ‘love’ and ‘story’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in part i.1, match full words, not word parts/sub-strings.

Output format:

id
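Both FTS questions can be prototyped with the sqlite3 module bundled with Python, whose SQLite build normally includes fts4 (toy rows for illustration only). Note how the default tokenizer gives case-insensitive, whole-word matching, and how NEAR/5 bounds the gap between terms:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE movie_overview USING fts4("
            "id integer, name text, year integer, "
            "overview text, popularity decimal)")
con.executemany("INSERT INTO movie_overview VALUES (?, ?, ?, ?, ?)", [
    (1, "A", 2010, "A boxer trains for the big fight.", 1.0),
    (2, "B", 2011, "A gunfight in the desert.", 2.0),  # sub-string: no match
    (3, "C", 2012, "They Fight for glory.", 3.0),      # case folds: match
    (4, "D", 2013, "A love that became a legendary story.", 4.0),
])

# Whole-word, case-insensitive match on the overview column.
count, = con.execute("SELECT COUNT(*) AS count_overview "
                     "FROM movie_overview "
                     "WHERE overview MATCH 'fight'").fetchone()

# 'love' and 'story' at most 5 tokens apart (NEAR itself is case sensitive).
# FTS columns have text affinity, so cast the stored ids back to int.
ids = [int(r[0]) for r in con.execute(
    "SELECT id FROM movie_overview "
    "WHERE overview MATCH 'love NEAR/5 story'")]
```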

Deliverables: Place all the files listed below in the Q2 folder

  • Q2.SQL.txt: Modified file containing all the SQL statements and SQLite commands you have used to answer parts a – i in the appropriate sequence.


Q3 [15 points] D3 Warmup and Tutorial

  • Go through the online D3 v3 tutorial here.
  • Complete steps 01-16 (Complete through “16. Axes”).
  • Ensure that you are using v3 of the D3 lib.

Note: This is a simple and important tutorial which lays the groundwork for Homework 2. The latest D3 version is v5, but only the v3 tutorial is available online. What you learn in this v3 tutorial is transferable to v5. In Homework 2, you will work with D3 a lot; we’re upgrading Homework 2 to v5. All Georgia Tech students have FREE access to Safari Books, which includes this and its updated v4 tutorial. https://www.safaribooksonline.com. Just log in with your GT account.

Note: We recommend using Mozilla Firefox or Google Chrome, since they have relatively robust built-in developer tools.

Deliverables: Place all the files/folders listed below in the Q3 folder

  • A folder named d3 containing file d3.v3.min.js (download)
  • index.html : When run in a browser, it should display a scatterplot with the following specifications:

  1. [4 points] There should be 100 randomly generated points, drawn as circles, placed on the plot. Each point’s x coordinate should be a random number between 10 and 250 inclusive (i.e., [10, 250]), and so should each point’s y coordinate. A point’s x and y coordinates should be computed independently.
  2. [2 points] The plot must have visible X and Y axes that scale according to the generated points. The ticks on these axes should adjust automatically based on the randomly generated scatter-plot points.
  3. [2 points] Use a single linear scale and apply it to both the X and Y coordinates of each circle to map the domain of X and Y values to the range of [1,5]. Set each circle’s radius attribute to be the Euclidean distance between the points (X, 0) and (0, Y), where X and Y denote the circle’s scaled X and Y values.
  4. [3 points] All points with a scaled X value greater than the average of the scaled X values of all scatter-plot points should be outlined in blue. All other points should be outlined in green. The points should use a transparent fill to enable visualization of overlapping points.
    Hint: Modify the ‘stroke’ parameter of the ‘circle’.
  5. [3 points] It is often desirable to emphasize some important data points in a dataset. Accomplish this by displaying a text annotation for the point that contains the smallest y value. The annotation should read “Min Y: <value>” where <value> is the minimum y. If there are multiple points that contain the same minimum y value, you can pick whichever point you like to annotate. Optional: experiment with different approaches to style the annotation text, e.g., try using different positioning settings, font-sizes, font-weights, and/or color to make it stand out. Use unscaled Y values here rather than scaled Y values.
  6. [1 point] Your GT username, in lower-case, (e.g., jdoe3) should appear above the scatterplot. Also set the HTML title tag (e.g., <title>jdoe3</title>) to your GT username (also in lower-case).
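The scale-and-radius arithmetic in step 3 above is just a linear map followed by a Euclidean norm. The Python sketch below mirrors what d3.scale.linear() computes (the actual plot must of course be drawn with D3 in index.html; the seed is only for reproducibility here):

```python
import math
import random

random.seed(42)  # illustrative; the real plot uses fresh random data
pts = [(random.uniform(10, 250), random.uniform(10, 250))
       for _ in range(100)]

# One shared linear scale: domain is the min/max over ALL x and y
# values, range is [1, 5].
lo = min(v for p in pts for v in p)
hi = max(v for p in pts for v in p)

def scale(v, lo=lo, hi=hi, out_lo=1.0, out_hi=5.0):
    return out_lo + (v - lo) * (out_hi - out_lo) / (hi - lo)

# radius = distance between (X, 0) and (0, Y) = sqrt(X^2 + Y^2)
radii = [math.hypot(scale(x), scale(y)) for x, y in pts]
```

Because both scaled coordinates land in [1, 5], every radius falls between sqrt(2) and sqrt(50), roughly 1.41 to 7.07.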

The scatterplot should appear similar to but not exactly equivalent to the sample plot provided below. Remember that the plot will contain random data.

Note: No external libraries should be used. The index.html file can only refer to d3.v3.min.js within the d3 folder using the js file’s relative path. Absolute/local paths are specific locations that exist only on your computer, which means your code won’t run on the machines we use for grading (and you will lose points).


Q4 [10 pt] OpenRefine可视化分析代写

  a. Watch the videos on OpenRefine’s homepage for an overview of its features.

Download and install OpenRefine (latest release: 3.1).

b. Import Dataset:

  • Launch OpenRefine. It opens in a browser (127.0.0.1:3333).
  • We use a products dataset from Mercari, derived from a competition on Kaggle (Mercari Price Suggestion Challenge). If you are interested in the details, please refer to the data description page. We have sampled a subset of the dataset as the given “properties.csv”.
  • Choose “Create Project” -> This Computer -> “properties.csv”. Click “Next”.
  • You will now see a preview of the dataset. Click “Create Project” in the upper right corner.

c. Clean/Refine the data:

Note: OpenRefine maintains a log of all changes. You can undo changes. See the “Undo/Redo” button on the upper left corner.

i.a [1 pt] Select the “category_name” column and choose ‘Facet by Blank’ (Facet -> Customized Facets -> Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True. Remove these rows.

i.b [1 pt] Split the column “category_name” into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’). Provide the number of columns that are created in this operation. Remove any newly created columns that do not have values in all rows.
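To preview what the separator in i.b produces, the split can be mimicked with an ordinary string split (plain JavaScript, purely for illustration — inside OpenRefine you would use the built-in “Split into several columns” dialog, not code):

```javascript
// Illustration only: how the '/' separator divides a category_name
// value into the pieces that become the new columns.
var categoryName = "Kids/Toys/Dolls & Accessories";
var parts = categoryName.split("/");
// parts → ["Kids", "Toys", "Dolls & Accessories"]
```

Rows with fewer separators than the maximum will leave some of the new columns empty, which is why the question then asks you to remove columns that are not filled in all rows.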

ii. [2 pt] Select the column “name” and apply the Text Facet (Facet -> Text Facet). Click the Cluster button, which opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the highest number of clusters under the “Key Collision” method. Provide the number of clusters found using this keying function. Click on ‘Select All’ and ‘Merge Selected & Close’.

iii. [2 pt] Replace the null values in the “brand_name” column with the text “Unbranded” (Edit Cells -> Transform). Provide the General Refine Expression Language (GREL) expression used.

iv. [2 pt] Create a new column “high_priced” with the values 0 or 1 based on the “price” column with the following conditions: If the price is greater than 100, “high_priced” should be set as 1, else 0. Provide the GREL expression used to perform this.

v. [2 pt] Create a new column “has_offer” with the values 0 or 1 based on the “item_description” column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in “has_offer” as 1, else 0. Provide the GREL expression used to perform this.

Note: There has been a slight confusion with c) v. and thus, we will be giving full credit to GREL statements that fulfill either of the following:

a) Look for the words “sale”, “offer”, “discount” without converting the original value in the “item_description” column to lowercase

b) Look for the words “sale”, “offer”, “discount” after converting the original value in the “item_description” column to lowercase
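The conditional columns in iv and v boil down to simple expressions. A rough JavaScript equivalent of that logic is sketched below — you still need to express it in GREL yourself, and the function names here are invented for illustration only (variant (b) from the note above is the one shown for has_offer):

```javascript
// Illustration of the conditions only, not the GREL answers.
function highPriced(price) {
  // price > 100 → 1, otherwise 0 (question iv)
  return price > 100 ? 1 : 0;
}

function hasOffer(itemDescription) {
  // Variant (b): lowercase the description before matching (question v).
  var text = String(itemDescription).toLowerCase();
  var words = ["discount", "offer", "sale"];
  return words.some(function (w) { return text.indexOf(w) !== -1; }) ? 1 : 0;
}

// highPriced(150) → 1, highPriced(99.5) → 0
// hasOffer("Big SALE this week") → 1, hasOffer("brand new") → 0
```

Note the boundary in iv: exactly 100 is not “greater than 100”, so it maps to 0.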

Deliverables: Place all the files listed below in the Q4 folder

  • properties_clean.csv : Export the final table as a comma-separated values (.csv) file.
  • changes.json : Submit a list of changes made to file in json format. Use the “Extract Operation History” option under the Undo/Redo tab to create this file.
  • Q4Observations.txt : A text file with answers to parts c.i.a, c.i.b, c.ii, c.iii, c.iv and c.v. Provide each answer in a new line.


Extremely Important: folder structure and content of submission zip file

Extremely Important: We understand that some of you may work on this assignment until just prior to the deadline, rushing to submit your work before the submission window closes. Take the time to validate that all files are present in your submission and that you do not forget to include any deliverables! If a deliverable is not submitted, you will receive zero credit for the affected portion of the assignment — this is a very sad way to lose points, since you’ve already done the work!

You are submitting a single zip file named HW1-{GT username}.zip. The files included in each question’s folder have been clearly specified at the end of the question’s problem description.

The zip file’s folder structure must exactly be (when unzipped):

HW1-{GT username}/

Q1/

movie_ID_name.csv

movie_ID_sim_movie_ID.csv

graph.png / graph.svg

graph_explanation.txt

metrics.txt

script.py

Q2/

movie-cast.txt

movie-name-score.txt

movie-overview.txt

Q2.SQL.txt

Q3/

index.html

d3/

d3.v3.min.js

Q4/

properties_clean.csv

changes.json

Q4Observations.txt

Version 6
