A Brief Survey on Sequence Classification

File type: Book
Language: Persian
Publisher: ACM
Edition and year / country: 2010

Description

Published in the SIGKDD Explorations journal.
Related fields: computer engineering, algorithms and computation engineering, and software engineering.

1. INTRODUCTION

Sequence classification has a broad range of real-world applications. In genomic research, classifying protein sequences into existing categories is used to learn the functions of a new protein [13]. In health informatics, classifying ECG time series (the time series of heart rates) tells whether the data comes from a healthy person or from a patient with heart disease [59]. In anomaly detection and intrusion detection, the sequence of a user's system access activities on Unix is monitored to detect abnormal behavior [33]. In information retrieval, classifying documents into different topic categories has attracted a lot of attention [51]. Other interesting examples include classifying query log sequences to distinguish web robots from human users [58; 18] and classifying transaction sequence data in a bank to combat money laundering [42].

Generally, a sequence is an ordered list of events. An event can be represented as a symbolic value, a numerical real value, a vector of real values or a complex data type. In this paper, we categorize sequence data into the following subtypes.

• Given an alphabet of symbols {E1, E2, E3, …, En}, a simple symbolic sequence is an ordered list of symbols from the alphabet. For example, a DNA sequence is composed of the four nucleotides A, C, G and T, and a DNA segment such as ACCCCCGT is a simple symbolic sequence.
• A complex symbolic sequence is an ordered list of vectors. Each vector is a subset of the alphabet [34]. For example, for a sequence of items bought by a customer over one year, treating each transaction as a vector, a sequence can be ⟨(milk, bread)(milk, egg)···(potatoes, cheese, coke)⟩.

4.2 Time Series Data

Time series data is an important type of sequence data. The Time Series Data Library [4] collects time series data across 22 domains, such as agriculture, chemistry, health, finance and industry.
The UCR time series data archive [27] provides a set of time series datasets as a benchmark for evaluating time series classification methods.

For simple time series data, feature selection is a challenging step in applying feature based methods, since features cannot be enumerated on numeric data. Therefore, distance based methods are widely adopted to classify time series [61; 26; 59; 48]. Compared with a wide range of classifiers, such as neural networks, SVM and HMM, the 1-nearest neighbor classifier with dynamic time warping distance has been shown to be usually superior in classification accuracy [61]. To apply feature based methods to simple time series, the data usually needs to be transformed into symbolic sequences through discretization or symbolic transformation before feature selection [40]. Ye et al. [65] avoid discretization: they propose a method to find time series shapelets and use a decision tree to classify time series. Compared with distance based methods, feature based methods may speed up the classification process and can generate interpretable results. Model based methods are also applied to classify simple time series, such as HMM, which is widely used in speech recognition [47].

Multivariate time series classification has been used for gesture recognition [24] and motion recognition [38]. The multivariate data is generated by a set of sensors that measure the movements of objects in different locations and directions. For multivariate time series classification, Kadous et al. [24] propose a feature based classifier. A set of user-defined meta-features is constructed, and a multivariate time series is transformed into a feature vector. Some universal meta-features describe increasing and decreasing trends and local maximum or minimum values. Using those features, multivariate time series with additional non-temporal attributes can be classified by a decision tree.

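As an illustration of the distance based approach mentioned earlier in this section, the following is a minimal sketch of a 1-nearest neighbor classifier with dynamic time warping (DTW) distance. It is not taken from the surveyed papers: the function names dtw_distance and classify_1nn, the absolute-difference local cost, the absence of a warping window, and the toy training data are all assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW between two numeric sequences.

    Assumption: the local cost is the absolute difference |a[i] - b[j]|.
    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    n, m = len(a), len(b)
    # prev[j] is the best cumulative cost of aligning a[:i-1] with b[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def classify_1nn(query, training_set):
    """Return the label of the training series closest to `query` under DTW.

    `training_set` is a list of (series, label) pairs (hypothetical format).
    """
    best_label, best_dist = None, math.inf
    for series, label in training_set:
        d = dtw_distance(query, series)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Toy data, invented for this example only.
training = [
    ([0, 0, 1, 2, 1, 0, 0], "bump"),
    ([0, 1, 0, 1, 0, 1, 0], "oscillating"),
]
predicted = classify_1nn([0, 1, 2, 2, 1, 0], training)
```

Because DTW lets one point align with several points of the other series, the query above matches the "bump" series even though the two differ in length and timing. Practical implementations often add a warping window and lower-bound pruning to reduce the quadratic cost; both are omitted here for brevity.
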
A multivariate time series can be viewed as a matrix. Li et al. [31] propose a method to transform a multivariate time series into a vector through singular value decomposition and other transformations. SVM is then used to classify the vectors.

4.3 Text Data

Sequence classification is also widely used in information retrieval to categorize text and documents. Widely used methods for document classification include Naive Bayes [29] and SVM [43]. Text classification has various extensions, such as multi-label text classification [67], hierarchical text classification [57] and semi-supervised text classification [46]. Sebastiani et al. [51] provide a more detailed survey on text classification.

5. CONCLUSION

In this paper, we provide a brief survey on sequence classification. We categorize sequence data into five subtypes. We group sequence classification methods into feature based methods, sequence distance based methods and model based methods. We also present several extensions of conventional sequence classification. Finally, we compare sequence classification methods applied in different application domains.

We notice that most works focus on the classification of simple symbolic sequences and simple time series data. Although there are a few works on multivariate time series and complex symbolic sequences, the problem of classifying complex sequence data is still largely open. Furthermore, most methods are devoted to the conventional sequence classification task. Streaming sequence classification, early classification, semi-supervised classification on sequence data, and combinations of those problems on complex sequence data all have practical applications and present challenges for future studies.
