
Information Retrieval — Problem Set II - Part 2

Programming Assignment 1

Due Date: Mon. Sep 11, 2023

In this work, you are going to scrape webpages within the KSU domain and analyze the fetched

pages. The objectives are:

1. Fetch and store the webpages.

2. Analyze how many webpages contain at least one email address in a raw textual format

(e.g., ashaw8@kennesaw.edu, not an obfuscated form such as netid at kennesaw dot edu).

3. Build a vocabulary of the collected webpages, and plot word frequency charts to see if

the distribution obeys Zipf's law.

Crawling the Web using Scrapy

We are going to use a Python package, Scrapy, for web crawling. Scrapy is a fast high-level web

crawling and web scraping framework, used to crawl websites and extract structured data from

their pages.

Follow the official installation guide to install Scrapy. We recommend installing Scrapy in a

Python virtualenv.

Python and Virtualenv

If you are not familiar with Python virtual environments, read this.

You will need to take the following steps to install your virtual environment.

Step 1: Install the latest version of Python on your PC or your Mac.

Step 2: Install the Virtual Environment with the following command:

On a PC, from the Command Prompt after Python is installed:

python -m venv virtualworkspace

virtualworkspace\Scripts\activate.bat

python -m pip install --upgrade pip

Then, with the virtual environment activated:

pip install Scrapy

On a Mac, after Python is installed:

sudo pip3 install virtualenv virtualenvwrapper

Edit your ~/.zshrc to enable the virtualenv plugin and set Python 3 as the default for virtualenvwrapper, like so:

...

plugins=(...virtualenv)

...

# Virtualenvwrapper things

export VIRTUALENVWRAPPER_PYTHON='/usr/bin/python3'

export WORKON_HOME=$HOME/.virtualenvs

export PROJECT_HOME=$HOME/Workspace

source /usr/local/bin/virtualenvwrapper.sh

Create a virtual environment for this project and activate it. For example,

mkvirtualenv ir --python=python3

workon ir

Now you can install a Python package in the ir virtualenv.

pip install Scrapy

Scrapy Tutorial

We are going to extract textual data (e.g., title, body texts) from webpages and store them.

Follow this tutorial to learn the essentials.

Crawler for the KSU Web

Implement a spider subclass as described in the tutorial.

Specify the rules that your spider will follow. The rules provided to a spider govern how to

extract the links from a page and which callbacks should be called for those links. See

scrapy.spiders.Rule. The LinkExtractor of your spider must obey the following rules:

• Only the links with the domain name 'kennesaw.edu' should be extracted for the next

request.

• Duplicate URLs should not be revisited.

Your spider must obey the following (a settings sketch is given after this list):

• Let the web servers know who your spider is associated with. Include a phrase in the USER_AGENT value to let them know this is part of the course experiments. Use a name such as "KSU CS7263-IRbot/0.1".

• Be polite:

o Wait at least 2.0 sec before downloading consecutive pages from the same

domain

o Respect robots.txt policies

• Run in breadth-first search order (i.e., FIFO)

• Once you reach the desired number of fetched pages, terminate crawling (set CLOSESPIDER_PAGECOUNT to a reasonable number; 1,000 is a good choice).
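These requirements map onto standard Scrapy settings. Below is a minimal sketch of the relevant entries; the values follow the bullets above, and putting them in the project's settings.py (rather than in a custom_settings attribute on the spider class) is just one option.

# settings.py -- one way to satisfy the requirements above
USER_AGENT = "KSU CS7263-IRbot/0.1"   # identify the crawler as part of the course experiments
DOWNLOAD_DELAY = 2.0                  # wait at least 2.0 sec between requests to the same domain
ROBOTSTXT_OBEY = True                 # respect robots.txt policies

# breadth-first (FIFO) crawl order, per the Scrapy documentation
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# stop crawling once roughly 1,000 pages have been fetched
CLOSESPIDER_PAGECOUNT = 1000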

Your spider should start crawling with the following three URLs (a spider skeleton follows the list):

1. KSU home (www.kennesaw.edu)

2. CCSE home (ccse.kennesaw.edu)

3. And any KSU webpage of your choice
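Putting these pieces together, a skeleton of the spider might look like the sketch below. The class name, spider name, and file name are placeholders, and parse_items is filled in in the next step; the settings sketched above can live in settings.py or in a custom_settings attribute on this class.

# [scrapy_project]/spiders/ksu_spider.py -- a possible skeleton
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KSUSpider(CrawlSpider):
    name = "ksu"
    # only follow links within the kennesaw.edu domain
    allowed_domains = ["kennesaw.edu"]
    start_urls = [
        "https://www.kennesaw.edu/",
        "https://ccse.kennesaw.edu/",
        # add any other KSU webpage of your choice as the third start URL
    ]

    rules = (
        # extract kennesaw.edu links, parse every fetched page with parse_items,
        # and keep following links from those pages; Scrapy's duplicate filter
        # ensures already-seen URLs are not revisited
        Rule(
            LinkExtractor(allow_domains=["kennesaw.edu"]),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        entry = dict.fromkeys(["pageid", "url", "title", "body", "emails"])
        # TODO: extract the corresponding information and fill the entry
        yield entry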

Implement a parse function that extracts information from the URL response. Yield a dictionary containing the data of interest:

def parse_items(self, response):
    entry = dict.fromkeys(['pageid', 'url', 'title', 'body', 'emails'])
    # TODO. Extract corresponding information and fill the entry
    yield entry

• pageid: str, A unique identifier for the page. You may use a hash function (e.g., md5) to

create a unique ID for a URL.

• url: str, URL from which the page is fetched

• title: str, Title of the page (if exists)

• body: str, Body text of the page. get_text from a Python package BeautifulSoup might

be useful to extract all the text in a document

• emails: list, A list of email addresses found in the document.

Use the function above as a callback function of your LinkExtractor.
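One possible way to fill in parse_items, using BeautifulSoup's get_text for the body, an md5 hash of the URL for pageid, and a simple regular expression for raw email addresses (the regex shown here is only illustrative and will miss some edge cases):

import hashlib
import re

from bs4 import BeautifulSoup

# a deliberately simple pattern for raw email addresses
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_items(self, response):
    soup = BeautifulSoup(response.text, "html.parser")
    # all text nodes in the document (you may want to drop <script>/<style> first)
    body_text = soup.get_text(separator=" ", strip=True)

    entry = {
        # md5 of the URL gives a stable, unique page identifier
        "pageid": hashlib.md5(response.url.encode("utf-8")).hexdigest(),
        "url": response.url,
        "title": (response.css("title::text").get() or "").strip(),
        "body": body_text,
        # de-duplicate the addresses found in this page's text
        "emails": sorted(set(EMAIL_RE.findall(body_text))),
    }
    yield entry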

Run your crawler and save the scraped items

Start crawling by running a command at the project root directory. Syntax:

scrapy crawl <spider_name>

If you use the -O option, the scraped items will be dumped to a file. Syntax:

scrapy crawl <spider_name> -O ksu1000.json

Text Statistics

Now, you should have downloaded textual data from KSU webpages stored in a file (e.g., a

JSON file). We want to compute the following statistics:

• Average length of webpages in tokens (use simple split() for tokenization)

• Top ten most frequent email addresses

• Percentage of webpages that contain at least one email address
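One possible sketch of text_stats.py for computing these three statistics, assuming the crawl was saved with -O ksu1000.json (a JSON array of the entry dictionaries described above, with the field names defined earlier):

import json
import sys
from collections import Counter


def main(path):
    with open(path, encoding="utf-8") as f:
        pages = json.load(f)  # list of entry dicts written by `scrapy crawl ... -O`

    # average page length in tokens, using a simple whitespace split()
    avg_tokens = sum(len((p.get("body") or "").split()) for p in pages) / len(pages)

    # email statistics
    email_counts = Counter(e for p in pages for e in (p.get("emails") or []))
    pages_with_email = sum(1 for p in pages if p.get("emails"))

    print(f"Average tokens per page: {avg_tokens:.3f}")
    print("Most Frequent Emails:")
    for pair in email_counts.most_common(10):
        print(pair)
    print(f"Percentage with at least one email: {100 * pages_with_email / len(pages):.3f}%")


if __name__ == "__main__":
    main(sys.argv[1])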

Your output should be similar to the following:

% python text_stats.py ksu1000.json

Average tokens per page: 523.352

Most Frequent Emails:

('counseling@kennesaw.edu', 8)

('finaid@kennesaw.edu', 4)

('nricha26@kennesaw.edu', 4)

('brt@kennesaw.edu', 4)

('aventu11@kennesaw.edu', 3)

('kingra14@kennesaw.edu', 3)

('elp@kennesaw.edu', 3)

('police@kennesaw.edu', 3)

('registrar@kennesaw.edu', 2)

('scholarshipapps@kennesaw.edu', 2)

Percentage with at least one email: 0.061%

Word Frequencies:

We also want to analyze the word frequencies of the scraped KSU webpages. Build a vocabulary and count the word frequencies, then list the top 30 most common words before and after removing stopwords. We expect lists similar to the following (a counting sketch is given after the table):

rank  term       freq.  perc.    rank  term         freq.  perc.
----  ---------  -----  -----    ----  -----------  -----  -----
   1  and        15539  0.031      16  on            2991  0.006
   2  the        12164  0.025      17  university    2928  0.006
   3  of          9315  0.019      18  contact       2603  0.005
   4  to          7990  0.016      19  about         2558  0.005
   5  &           6512  0.013      20  search        2430  0.005
   6  /           5743  0.012      21  information   2351  0.005
   7  for         5333  0.011      22  faculty       2316  0.005
   8  in          5178  0.01       23  student       2217  0.004
   9  campus      4566  0.009      24  you           2203  0.004
  10  ksu         4496  0.009      25  is            2201  0.004
  11  a           4314  0.009      26  with          2161  0.004
  12  kennesaw    4156  0.008      27  community     2014  0.004
  13  students    3361  0.007      28  programs      2013  0.004
  14  research    3146  0.006      29  global        1978  0.004
  15  state       3065  0.006      30  marietta      1885  0.004
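One way to build the vocabulary and produce a ranking like the one above (a sketch; lowercasing the tokens is an assumption, and the column formatting is left out):

from collections import Counter


def word_frequencies(pages):
    # count whitespace-separated, lowercased tokens over all page bodies
    counts = Counter()
    for page in pages:
        counts.update((page.get("body") or "").lower().split())
    return counts


def print_top(counts, n=30):
    total = sum(counts.values())
    for rank, (term, freq) in enumerate(counts.most_common(n), start=1):
        print(f"{rank:>4}  {term:<15} {freq:>7}  {freq / total:.3f}")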

Now, remove any stopwords and punctuation, then print another ranking. For stopwords, you may use nltk.corpus.stopwords. For removing punctuation, you may use string.punctuation or the regular expression "[^\w\s]". A filtering sketch is given after the table below.

rank  term         freq.  perc.    rank  term        freq.  perc.
----  -----------  -----  -----    ----  ----------  -----  -----
   1  campus        4566  0.009      16  marietta     1885  0.004
   2  ksu           4496  0.009      17  resources    1873  0.004
   3  kennesaw      4156  0.008      18  home         1855  0.004
   4  students      3361  0.007      19  staff        1773  0.004
   5  research      3146  0.006      20  program      1677  0.003
   6  state         3065  0.006      21  diversity    1665  0.003
   7  university    2928  0.006      22  ga           1564  0.003
   8  contact       2603  0.005      23  ©            1445  0.003
   9  search        2430  0.005      24  2021         1441  0.003
  10  information   2351  0.005      25  college      1354  0.003
  11  faculty       2316  0.005      26  online       1346  0.003
  12  student       2217  0.004      27  alumni       1308  0.003
  13  community     2014  0.004      28  us           1303  0.003
  14  programs      2013  0.004      29  safety       1247  0.003
  15  global        1978  0.004      30  financial    1169  0.002
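A possible way to filter the counts before re-ranking, using nltk.corpus.stopwords and the punctuation regex mentioned above (this sketch strips punctuation characters from each token and then drops empty tokens and stopwords):

import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # fetch the stopword list once
STOPWORDS = set(stopwords.words("english"))
PUNCT_RE = re.compile(r"[^\w\s]")        # the regular expression suggested above


def remove_stopwords_and_punct(counts):
    # return a new Counter with punctuation stripped and stopwords removed
    cleaned = Counter()
    for term, freq in counts.items():
        term = PUNCT_RE.sub("", term)
        if term and term not in STOPWORDS:
            cleaned[term] += freq
    return cleaned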

Next, plot the word distribution for the data before removing stopwords and punctuation, and for the data after removing them. Use a log-log plot; if the distribution follows Zipf's law, it should appear approximately linear on the log-log scale.
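A small matplotlib sketch for the log-log plots, assuming counts is one of the Counter objects built above (the output file name is a placeholder):

import matplotlib.pyplot as plt


def plot_zipf(counts, label, outfile="zipf.png"):
    # rank vs. frequency on log-log axes; a Zipf-like distribution looks roughly linear
    freqs = sorted(counts.values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    plt.loglog(ranks, freqs, label=label)
    plt.xlabel("rank (log scale)")
    plt.ylabel("frequency (log scale)")
    plt.legend()
    plt.savefig(outfile)
    plt.clf()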

What to submit

1. The code for your spider class, which should be located at [scrapy_project]/spiders/

2. The code for generating the text statistics and plots

3. Outputs similar to the ones shown above. Please write all the outputs in one single file (e.g., docx, pdf, md). This file should include the following:

1. email statistics

2. two word frequency rankings (before and after removing stopwords and punctuation)

3. two frequency plots (one with stopwords and punctuation, one without)

Please submit all artifacts via D2L Assignments.

