German Credit Data Set Arff Rescue
German credit fraud dataset in Weka's ARFF format (German Credit Data data set; ARFF datasets; other WEKA dataset collections). The ELF reader for ARFF files supports only categorical features, where all entries are defined in the attribute section.
Subject: [Wekalist] credit card fraud datasets

2010/5/9 Jaime Hablutzel Egoavil wrote:
Hi, I'm new to Weka and data mining. I have to present a monograph about data mining and machine learning for fraud detection, and I would like to know if someone can point me to where I can find datasets for this purpose, so that I can analyze them further with Weka and use them as examples in my monograph.
Thank you
--
Jaime Hablutzel
(accents intentionally omitted)
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@list.scms.waikato.ac.nz
List info and subscription status:
List etiquette:

On Mon, May 10, 2010 at 1:25 AM, Harri Saarikoski wrote:
E.g. the UCI repository and Tunedit.org have large searchable collections of datasets (the keyword 'fraud' should yield several).
Harri
--
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland

On Mon, May 10, 2010 at 9:20 AM, Jaime Hablutzel Egoavil wrote:
I just found a dataset with almost 4 million training instances; I'll analyze it later.

On Mon, May 10, 2010 at 8:26 AM, Uday Kamath wrote:
Can you send me the link or a zipped version of the dataset? Appreciate your help.
Thanks
Uday
Ph.D Student
GMU
USA

These are the two links I found
> On Mon, May 10, 2010 at 8:26 AM, Uday Kamath wrote:
> Can you send me the link or a zipped version of the dataset?
For algorithms that need numerical attributes, Strathclyde University produced the file 'german.data-numeric'. This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
Number of Attributes german: 20 (7 numerical, 13 categorical)
Number of Attributes german.numer: 24 (24 numerical)
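The two encodings described above can be sketched in base R. This is only an illustration on made-up values: the data frame `applicants`, its columns `purpose` and `job`, and their levels are hypothetical and are not taken from 'german.data' or 'german.data-numeric', and the actual recoding used to produce the numeric file may differ.

```r
# Toy data: hypothetical values, not rows from german.data
applicants <- data.frame(
  purpose = factor(c("car", "furniture", "car", "education")),
  job = factor(c("unskilled", "skilled", "management", "skilled"),
               levels = c("unskilled", "skilled", "management"),
               ordered = TRUE)
)

# Unordered categorical attribute: one 0/1 indicator column per level
indicators <- model.matrix(~ purpose - 1, data = applicants)

# Ordered categorical attribute: integer codes that follow the level order
job_code <- as.integer(applicants$job)

# All-numeric version of the toy data
numeric_version <- cbind(indicators, job = job_code)
print(numeric_version)
```

Each row has exactly one indicator set to 1, while the ordered attribute keeps its ranking as a single integer column.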
On Fri, Jun 19, 2015 at 3:21 AM, Guillaume MULLER wrote:
Hi, I'm sure many here would be interested in possessing a copy of 'Data Mining - Practical Machine Learning Tools and Techniques.pdf' (included in your pdf), but it's '(more than) probably' copyrighted :) Since buying a copy of the book is surely one of the ways to finance the authors/developers of Weka and ensure its quality/future, it's surely better to let people make their own investment. Also, even if I don't really love this (I would prefer that knowledge be free, as in software, not beer :), probably all the scientific papers are of limited distribution.
Cheers
GM

Thanks for pointing that out. I've removed that PDF and quickly checked all the other contents, and it seems to me that they are all freely available on their respective web sites, so maybe there is no infringement in distributing them. But maybe I'm wrong; in that case let me know, and I'll quickly update the package. Anyway, I really do encourage everybody to visit the original web sites and credit the original authors. I'm providing this package just to make searching easier for those new to AI and data mining, as it can be hard to find this information when you are really new to the subject.
Modeling is one of the topics I will be writing a lot about on this blog. Because of that, I thought it would be nice to introduce some datasets that I will use to illustrate models and methods later on.
In this post I describe the German credit dataset, which is very popular within the machine learning literature. This dataset contains 1000 rows, where each row has information about the credit status of an individual, which can be good or bad. It also has qualitative and quantitative information about the individuals. Examples of qualitative information are the purpose of the loan and sex, while examples of quantitative information are the duration of the loan and the installment rate as a percentage of disposable income. This dataset has also been described and used in Applied Predictive Modeling (Kuhn and Johnson) and is available in R through the caret package.
    library(caret)
    data(GermanCredit)
The version above has all the categorical predictors converted to dummy variables (see, for example, Section 3.6 of Kuhn and Johnson) and can be displayed using the str function:
    str(GermanCredit, list.len=5)
    'data.frame': 1000 obs. of 62 variables:
     $ Duration                 : int 6 48 12 ...
     $ Amount                   : int 1169 5951 2096 ...
     $ InstallmentRatePercentage: int 4 2 2 ...
     $ ResidenceDuration        : int 4 2 3 ...
     $ Age                      : int 67 22 49 ...
     [list output truncated]
For data exploration purposes, I also like to keep a dataset where the categorical predictors are stored as factors rather than converted to dummy variables. This sometimes facilitates exploration, since it provides a grouping effect for the levels of the categorical variable.
This grouping effect is lost when we convert them to dummy variables, especially when a non-full rank parametrization of the predictors is used. The response (or target) variable here indicates the credit status of an individual and is stored in the column Class of the GermanCredit dataset as a factor with two levels, “Bad” and “Good”. The German credit data is a case of an unbalanced dataset, with the majority of individuals classified as having good credit. Therefore, the accuracy of a classification model should be superior to the proportion of individuals with good credit, which would be the accuracy of a naive model that classifies every individual as having good credit. The nice thing about this dataset is that it has a lot of the challenges faced by data scientists on a daily basis.
For example, it is unbalanced, has predictors that are constant within groups, and has collinearity among predictors. In order to fit some models to this dataset, we must deal with these challenges first.
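As a rough illustration of two of these points, the sketch below computes the accuracy of the naive majority-class model and flags predictors that take a single value within some class. It runs on a small made-up data frame: `toy`, `flag`, `amount` and the helper `is_constant_in_some_group` are hypothetical names, and the class proportions here are not those of the German credit data. With caret installed, GermanCredit$Class could presumably be used in place of the toy labels.

```r
# Toy data: hypothetical values, not the German credit data
toy <- data.frame(
  class  = factor(c("Good", "Bad", "Good", "Good", "Bad",
                    "Good", "Bad", "Good", "Bad", "Good"),
                  levels = c("Bad", "Good")),
  flag   = c(0, 0, 1, 0, 0, 1, 0, 0, 0, 1),
  amount = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)

# Share of each class in the data
class_share <- prop.table(table(toy$class))

# A naive model predicts the majority class for everyone, so its
# accuracy equals the largest class share
baseline_accuracy <- max(class_share)
majority_class <- names(which.max(class_share))

# TRUE when a predictor takes a single value within at least one class
is_constant_in_some_group <- function(x, g) {
  any(tapply(x, g, function(v) length(unique(v)) == 1))
}
constant_within <- sapply(toy[c("flag", "amount")],
                          is_constant_in_some_group, g = toy$class)

print(baseline_accuracy)
print(constant_within)
```

Any model worth keeping should beat `baseline_accuracy`, and predictors flagged in `constant_within` are candidates for removal before fitting methods that estimate within-group variances.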
More on that later.
Kuhn, M., and Johnson, K. Applied Predictive Modeling.
Data mining is a critical step in knowledge discovery, involving theories, methodologies and tools for revealing patterns in data. It is important to understand the rationale behind the methods so that the tools and methods chosen fit both the data and the objective of pattern recognition. Several tools may be available for a given data set.
When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether or not to approve the loan. Two types of risk are associated with the bank’s decision:
• If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan results in a loss of business to the bank.
• If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan results in a financial loss to the bank.
Objective of Analysis: Minimization of risk and maximization of profit on behalf of the bank. To minimize loss from the bank’s perspective, the bank needs a decision rule regarding whom to approve for a loan and whom not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. The German Credit Data contains data on 20 variables, together with the classification of each of 1000 loan applicants as a Good or a Bad credit risk.
A predictive model developed on this data is expected to give a bank manager guidance on whether to approve a loan to a prospective applicant based on his/her profile.
Data Files for this case:
• German Credit data
• Training dataset
• Test dataset
The following analytical approaches are taken:
• Logistic regression: The response is binary (Good credit risk or Bad) and several predictors are available.
• Discriminant Analysis
• Tree-based method and Random Forest
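The logistic regression approach can be sketched as follows. This is a minimal illustration on simulated data, not on the case's training and test files: the variables `duration` and `amount`, the coefficients used to simulate the response, the 150/50 split and the `cutoff` value are all assumptions made for the example.

```r
set.seed(1)  # reproducible simulated data (not the German Credit data)
n <- 200
duration <- round(runif(n, 6, 60))       # hypothetical loan duration
amount   <- round(runif(n, 500, 10000))  # hypothetical loan amount

# Simulated response: longer and larger loans are more often bad risks
p_bad <- plogis(-3 + 0.04 * duration + 0.0002 * amount)
loans <- data.frame(bad = rbinom(n, 1, p_bad), duration, amount)

# Hypothetical training/test split
train <- loans[1:150, ]
test  <- loans[151:200, ]

# Logistic regression for the binary response
fit <- glm(bad ~ duration + amount, data = train, family = binomial)

# Predicted probability of being a bad credit risk on the test set
pred_bad <- predict(fit, newdata = test, type = "response")

# Hypothetical decision rule: reject when the predicted risk exceeds the cutoff
cutoff <- 0.5
decision <- ifelse(pred_bad > cutoff, "reject", "approve")
confusion <- table(actual = test$bad, decision = decision)
print(confusion)
```

Because approving a bad credit risk and rejecting a good one cost the bank different amounts, a cutoff other than 0.5 may suit the bank better; the value above is only a placeholder.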