Video metadata in web-based applications.
Boicea, Alexandru ; Capitanescu, Iulia ; Radoi, Codrut Dumitru et al.
Abstract: This paper analyses a technique for extracting metadata
regarding YouTube videos that are embedded in various websites. Web
pages are defined by their URL and the results are stored in a
relational database. The algorithm uses a web crawler for page browsing
and HTML tags for metadata discovery.
Key words: metadata, video, embed, web, crawler
1. INTRODUCTION
As the World-Wide-Web is incessantly expanding, the quantity of
information one can find on the Internet is continually growing. As a
result, relevant information regarding a specific topic is harder to
find (Safari, 2004). This is why the concept of metadata was introduced.
Metadata is 'data about data', which means that it describes a
certain resource on the Internet.
There are a few things regarding metadata that need to be
considered. First of all, the information that is captured by the
metadata needs to be defined. This depends on the type of resource and
the purpose of the metadata. The second aspect concerns the way metadata
is produced. The final problem regards the way metadata is accessed and
used (Iannella & Waugh, 2011).
YouTube offers web site developers the
possibility of inserting videos into their web sites by just copying a
few lines of code into their HTML source. This process leads to the
generation of a dedicated area of the page where a user can watch the
video directly without having to navigate to the YouTube video page.
This approach exposes no information about the video other than the
player itself. As a result, a search on a given subject rarely surfaces
pages whose relevant content is an embedded YouTube video.
The idea of this paper is to implement an algorithm that extracts
metadata regarding YouTube videos embedded in a certain web page.
Information is extracted from the HTML tags of the corresponding YouTube
page and is stored in a customized database.
The end purpose is to easily locate relevant videos. The
application can also be used for data mining purposes, to compute statistics or establish relationships between various web resources.
2. EMBEDDING YOUTUBE VIDEOS
There are two ways of embedding YouTube videos. The old way begins
with the <object> tag and only supports Flash playback
(http://www.google.com/support/youtube/bin/answer.py ...):
<object width="960" height="750">
<param name="movie" value="http://www.youtube.com/v/WApx6lXAwMQ?fs=1&hl=ro_RO&rel=0"></param>
<param name="allowFullScreen" value="true"></param>
<param name="allowscriptaccess" value="always"></param>
<embed src="http://www.youtube.com/v/WApx6lXAwMQ?fs=1&hl=ro_RO&rel=0"
type="application/x-shockwave-flash" allowscriptaccess="always"
allowfullscreen="true" width="960" height="750"></embed>
</object>
A newer version uses the <iframe> tag and supports both Flash
and HTML5 video (http://apiblog.youtube.com/ ...):
<iframe title="YouTube video player" width="960"
height="750"
src="http://www.youtube.com/embed/WApx6lXAwMQ"
frameborder="0" allowfullscreen></iframe>
Some services only support the older version so the application
will search for both types of embedded video code.
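A minimal Python sketch of how both embed styles could be detected in a page source is given below. It uses only the standard html.parser module. The names EmbedFinder and find_embeds are hypothetical; they illustrate the idea and are not taken from the application described in this paper.

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Collects YouTube URLs from <iframe> and <param name="movie"> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []  # (embed_type, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "iframe":
            # Newer embed code: the player URL is the iframe source.
            self._add("iframe", a.get("src"))
        elif tag == "param" and a.get("name") == "movie":
            # Older embed code: the player URL is the "movie" parameter.
            self._add("object", a.get("value"))

    def _add(self, kind, url):
        # Keep only links that point to YouTube (assumed filter).
        if url and "youtube.com/" in url:
            self.urls.append((kind, url))


def find_embeds(html):
    """Return the list of (embed_type, url) pairs found in an HTML string."""
    finder = EmbedFinder()
    finder.feed(html)
    return finder.urls
```

Calling find_embeds on a page source returns one entry per embedded player, tagged with the embed style so that later stages can apply the matching URL rewrite.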
3. STAGES OF THE APPLICATION
The user is prompted to specify a URL that identifies the page
containing embedded YouTube videos. The application will extract
metadata for each of these videos. Based on the input address, it scans
the page source in search of relevant tags (<iframe> for newer
versions of embedded videos and <object> <param
name="movie"> for previous ones). The links to YouTube
pages containing the videos are discovered in this manner. Identified
links are accessed by using a web crawler.
Each YouTube page source is parsed in order to find specific tags,
knowing that they contain useful information that will later be added to
the metadata database.
[FIGURE 1 OMITTED]
4. WEB CRAWLER
The web crawler algorithm has the following phases (Blum et al.,
1998):
1. getHtmlSource(url)
1.1 createHttpWebRequest to access the given URL
1.2 getHttpWebResponse
1.3 create streamReader from HttpWebResponse
1.4 while ((line = streamReader.ReadLine()) != null)
1.5 write line into local file
2. parseURLPageSourceLocally
2.1 get relevant embedded url:
2.1.1 <iframe>
2.1.2 <object> <param name="movie">
3. for each embedded YouTube video
3.1 if (found link type <iframe>)
3.1.1 replace "embed/" with "watch?v="
3.1.2 browse to www.youtube.com
3.2 if (found link type <object> <param name="movie">)
3.2.1 replace "v/" with "watch?v="
3.2.2 browse to www.youtube.com
3.3 repeat step 1 using the identified link, writing to a different file
3.4 repeat step 2 to identify tags like:
+ <span id="eow-title">
+ <a id="watch-username">
+ <span class="watch-view-count">
+ etc.
3.4.1 write data into database
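The download of step 1 and the URL rewriting of step 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation (the phase names above suggest .NET classes). The function names to_watch_url and get_html_source are hypothetical, and dropping the player options (fs, hl, rel) before rewriting is a choice made here so that the resulting watch URL is well formed.

```python
import urllib.request


def to_watch_url(embed_url):
    """Rewrite an embed URL into the matching watch-page URL (step 3)."""
    base = embed_url.split("?", 1)[0]  # assumed: discard player options
    if "/embed/" in base:
        # <iframe> style: .../embed/ID -> .../watch?v=ID
        return base.replace("/embed/", "/watch?v=", 1)
    if "/v/" in base:
        # <object> style: .../v/ID -> .../watch?v=ID
        return base.replace("/v/", "/watch?v=", 1)
    return None  # not a recognised YouTube embed URL


def get_html_source(url, path):
    """Download a page and copy it line by line into a local file (step 1)."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        for line in response:
            out.write(line)
```

Keeping the rewrite separate from the download makes the string handling easy to check without network access.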
5. YOUTUBE PAGE PARSING
The YouTube page source is parsed to find specific tags, in order
to extract metadata regarding the video. Tags and the specific
information are listed below:
* Title → <span id="eow-title">
* Author → <a id="watch-username">
* Watch count → <span class="watch-view-count">
* Likes → <span class="likes">
* Dislikes → <span class="dislikes">
* Upload date → <span id="eow-date-short">
* Description → <span id="eow-description">
* Category → <span id="eow-category">
* Tags → <span id="eow-tags">
Information about various user comments can also be extracted from
the YouTube web page by searching for the tag:
<div class="comments-section">
This is a list of all the comments for a certain video. This list
can be ordered by the number of likes. Another option would be to only
store the most popular comment.
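The tag lookup described in this section can be sketched in Python with the standard html.parser module. The id and class values are the ones listed above; they reflect the YouTube markup at the time of writing and may not match other versions of the page. The names FIELDS, MetadataParser and extract_metadata, as well as the field keys, are hypothetical.

```python
from html.parser import HTMLParser

# Field name -> (attribute, value), following the list in this section.
FIELDS = {
    "title": ("id", "eow-title"),
    "author": ("id", "watch-username"),
    "views": ("class", "watch-view-count"),
    "likes": ("class", "likes"),
    "dislikes": ("class", "dislikes"),
    "upload_date": ("id", "eow-date-short"),
    "description": ("id", "eow-description"),
    "category": ("id", "eow-category"),
    "tags": ("id", "eow-tags"),
}


class MetadataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._field = None  # field currently being captured
        self._tag = None    # tag name that opened the capture
        self._depth = 0     # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1
            return
        a = dict(attrs)
        for field, (attr, value) in FIELDS.items():
            # split() lets a class attribute carry several class names
            if value in (a.get(attr) or "").split():
                self._field, self._tag, self._depth = field, tag, 1
                self.metadata[field] = ""
                return

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace once the element is complete.
                text = self.metadata[self._field]
                self.metadata[self._field] = " ".join(text.split())
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.metadata[self._field] += data


def extract_metadata(html):
    """Return a dict with the text of every listed tag found in the page."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Fields whose tag is absent from the page are simply missing from the returned dictionary, so the caller can decide how to store incomplete records.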
6. DATABASE STRUCTURE
The database consists of four tables linked together by auxiliary
tables that resolve the "many-to-many" relationship problem,
as shown in Fig.2.
"Video" holds all the necessary metadata regarding a
video: a direct link, its title, who uploaded it, when it was uploaded,
the number of likes and dislikes, how many times it was watched, a
description of the video and the category that it belongs to.
If more information is needed, additional data regarding a video
can be found in the following tables: "Comment" contains the
most-liked comments of the video, "Tag" holds the tags associated
with each video and "Source_Page" holds the URL and the
description of the input page.
Although the URL could have served as the identifier for the
Source_Page table, a separate numeric id column was added in order to
simplify the connection to the Source_Page_Video table.
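One possible layout of this database is sketched below in Python using the standard sqlite3 module. Only the four main table names and the stored attributes come from the description above. The column names, the choice of SQLite, the function name create_database and the three auxiliary tables are assumptions, since the exact relationships appear only in Fig. 2.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Source_Page (
    id INTEGER PRIMARY KEY,          -- numeric id instead of the URL
    url TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE Video (
    id INTEGER PRIMARY KEY,
    link TEXT NOT NULL UNIQUE,
    title TEXT, author TEXT, upload_date TEXT,
    likes INTEGER, dislikes INTEGER, views INTEGER,
    description TEXT, category TEXT
);
CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE Comment (id INTEGER PRIMARY KEY, body TEXT NOT NULL, likes INTEGER);

-- Auxiliary tables resolving many-to-many relationships (names assumed).
CREATE TABLE Source_Page_Video (
    source_page_id INTEGER NOT NULL REFERENCES Source_Page(id),
    video_id INTEGER NOT NULL REFERENCES Video(id),
    PRIMARY KEY (source_page_id, video_id)
);
CREATE TABLE Video_Tag (
    video_id INTEGER NOT NULL REFERENCES Video(id),
    tag_id INTEGER NOT NULL REFERENCES Tag(id),
    PRIMARY KEY (video_id, tag_id)
);
CREATE TABLE Video_Comment (
    video_id INTEGER NOT NULL REFERENCES Video(id),
    comment_id INTEGER NOT NULL REFERENCES Comment(id),
    PRIMARY KEY (video_id, comment_id)
);
"""


def create_database(path=":memory:"):
    """Create the tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With a numeric key on Source_Page, each auxiliary row is a pair of integers, which keeps joins between pages and videos short.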
[FIGURE 2 OMITTED]
7. CONCLUSIONS
This paper emphasizes the need for metadata regarding embedded
YouTube videos for search optimization. Its purpose was to obtain as
much information as possible in order to describe the videos on a
certain web page and store them in a database for future usage or
processing. The main idea is that all the required information is
contained in the HTML source of the YouTube page.
Using this approach on a large scale could lead to further
standardization of video embedding and metadata sets for video hosting
web sites. In addition to this, Internet browsers could be modified in
order to display such metadata for videos embedded in the currently
displayed web page.
This approach has one very important limitation: it is restricted
to YouTube videos by the URL format and the extracted metadata set.
Hence, the next step is to extend metadata extraction for other
types of web resources (images, audio etc.) in order to easily locate
relevant pages related to a certain subject.
Future development may include extending the application to support
other types of video embedding offered by different video hosting web
sites. The application can be further extended with data-mining
algorithms for computing statistics (Ungureanu & Boicea, 2008).
8. REFERENCES
Blum, T., Keislar, D., Wheaton, J. & Wold, E. (2011) Writing a Web
Crawler in the Java Programming Language, Available on:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
Accessed: 2011-05-13
Iannella, R. & Waugh, A. (2011) Metadata: Enabling the Internet,
Available on:
http://archive.ifla.org/documents/libraries/cataloging/metadata/ianr1.pdf
Accessed: 2011-05-13
Safari, M. (2004) Metadata and the Web, Available on:
http://www.webology.ir/2004/v1n2/a7.html Accessed: 2011-05-13
Ungureanu, D. & Boicea, A. (2008) A Depth First Search Algorithm for
Mining Intertransaction Association Rules, The 3rd International
Conference on Software and Data Technologies, ICSOFT '08, pp.
148-153, Porto 2008
*** (2011) Embed a YouTube video, Available on:
http://www.google.com/support/youtube/bin/answer.py?answer=171780&expand=UseOldEmbedCode#oldcode
Accessed: 2011-05-13