Best data mining books according to redditors

We found 160 Reddit comments discussing the best data mining books. We ranked the 58 resulting products by number of redditors who mentioned them. Here are the top 20.

Top Reddit comments about Data Mining:

u/syntonicC · 13 points · r/datascience

I used R for about 4 years before I moved to Python to use it for deep learning. I have been using Python for about 2 years now.

>Are R and Python considered redundant, or are there some situations where one will be preferred over the other? If I become proficient at using Python for data wrangling, analysis, and visualization, will I have any reason to continue using R?

It depends. I haven't really found anything that I can do in Python that I could not already do in R. I still use R because I like it better as a functional programming language and because it has a wide variety of more specific statistical packages (many for biology) that are just not available for Python yet. There are some specific cases where I just find it more intuitive and simpler to implement a solution in R. And generally, I just prefer ggplot2 over any of the various Python plotting packages. Also, R has a high-level API for things like TensorFlow, so it's not like you can't do deep learning in R.

The biggest advantage for Python is its speed and ability to work within a larger programming framework. A lot of companies tend to use Python because the models they build are integrated into a larger system that needs the capabilities of a fully-fledged programming language. Python is generally faster and has better management of big data sets in memory. R is actually moving more in the direction to fix these issues but there are still limitations.

>Where should I start? I'm looking for a resource that isn't aimed at complete beginners, since I've been using R for a few years, and took a C class before that. At the same time I wouldn't claim to be an experienced programmer. I'm interested in learning Python both for data analysis and for general programming.

I learned Python syntax using Learn Python 3 the Hard Way. I learned about Pandas, data wrangling, etc. using Pandas for Everyone and Pandas Cookbook. If I were to suggest just one book, it would be Pandas for Everyone. You can learn Python syntax from YouTube, MOOCs, or online tutorials. The Pandas Cookbook is just extra practice. To be honest though, the general conventions used by Pandas for data analysis and manipulation are very similar to R in many ways, especially if you've used anything in Hadley Wickham's Tidyverse. Finally, I made a Pandas cheatsheet while I was learning and included equivalent R functions in some places. I would be happy to share this Google Sheets file with you if you are interested.

>What IDE(s) should I use, and what are some must learn packages? I'm hoping to find something similar to RStudio.

I started off using PyCharm. I've heard good things about Spyder. But now, I actually still use RStudio! It is fully integrated with Python thanks to the Reticulate package. You can pass data structures between the languages and use both in RMarkdown. You can also use virtual environments which are popular with Python. Once you install the package:

library(reticulate)
use_virtualenv("path_to_my_virtual_env") # Start virtual environment

# You can now run Python scripts directly in the RStudio console

# If you want a Python REPL to use interactively just like in R run:
repl_python()


It's really easy to use and even comes with auto-complete and everything else.

Hope that helped.

u/greenspans · 13 points · r/DataHoarder

I'm pretty sure this book talked about how easy it was to scrape Facebook before they locked down their API.

https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/

A lot of people probably did this. I remember a talk given in my city: the guy had a few thousand people sign up for his app and got millions of entries in his graph database.

https://maxdemarzi.com/2013/01/28/facebook-graph-search-with-cypher-and-neo4j/

Popular game devs probably got oodles of data. Must have been awesome having a social graph of the US

u/xamdam · 8 points · r/MachineLearning
u/wolf2600 · 7 points · r/cscareerquestions

https://smile.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sql+server+performance

Also this is a great book to understand the fundamentals of SQL performance tuning. I'd recommend starting with it, then later getting a SQL Server-specific book if the tips you learn from it aren't enough:

https://smile.amazon.com/SQL-Tuning-Generating-Optimal-Execution/dp/0596005733/ref=sr_1_1?s=books&ie=UTF8&qid=1491914487&sr=1-1&keywords=sql+tuning

u/shaggorama · 6 points · r/MLQuestions
u/dr1fter · 5 points · r/IWantToLearn

Hoowhee, how did this text get so... wally.

Bots are usually (fairly) simple programs, so Python will make it easy to get at all the common functionality you want (maybe looking for pattern matches in a piece of text, some math/analytics, saving files to your hard drive, converting images...) and in practice you'll mostly only be limited by what you can figure out how to do in your language of choice, regardless of the bot you're writing.

Wherever possible, you should use official APIs (which will often support Python these days), or at least third-party APIs that are built on top of the official ones. The APIs are sort of like a mediator between your bot and the service, or a menu of remotely-accessible functionality -- for Twitter it might include things like "get the list of tweet IDs posted by this user ID in the last month" and "get the full text and metadata for this tweet ID." The set of functionality in that API determines what is and isn't possible for your bot to do (and depending on the service, it might actually hide a lot of complexity around sending messages to multiple servers, authenticating the request, etc.).

When there's no API (or if the official API doesn't let you do something that you know should be possible) you usually have to switch to scraping. It's error-prone (could break any time) and frowned on by a lot of services (which is why you have to think about rate limiting and bans -- you may well be violating their terms of service, and either way you're using the service in unintended ways that might interfere with its normal functioning). "Unofficial APIs" are often just scrapers under the hood, tidied up into something that looks more like a normal API. I've written a ton of little scrapers in Python -- it really is a great tool for the job.
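
To make the rate-limiting point above a bit more concrete, here is a minimal Python sketch of a "polite" fetch loop. Everything in it is illustrative: `RateLimiter`, `fetch_all`, and the `fetch` callable are hypothetical names, not part of any real scraping library, and the right delay depends entirely on the service you're talking to.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds required between requests
        self._clock = clock               # injectable so it can be faked in tests
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, limiter):
    """Call fetch(url) for each URL, pausing between calls as needed.

    `fetch` is whatever function actually downloads a page; it is passed in
    here so the sketch does not assume any particular HTTP library.
    """
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

The clock and sleep functions are parameters only so the limiter can be exercised without real waiting; in normal use you would just write `RateLimiter(2.0)` and pass your own download function as `fetch`.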

I suppose the other case is that some services can be built in standardized ways, so you don't need an official API from that particular company, because anyone else's API for that standard should be interoperable. That's common for databases, for example, but probably not the services you're talking about -- the popular web services are usually either proprietary, or a "standard" they invented that no one else actually uses, so you're basically stuck with the official API anyways.

For a lot of examples of integrating with public APIs, you can try Mining the Social Web from O'Reilly. I didn't actually spend a lot of time with that book personally (I wasn't expecting the sort of "cookbook" format with lots of examples and code) but it might cover some of the APIs that you're interested in.

u/RobMagus · 5 points · r/statistics

This is a fairly useful review that I believe is available via Google Scholar for free: Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual Review of Psychology, 32, 191–241.

Tufte is useful for a historical overview and for inspiration, but he has a particular style that doesn't necessarily match up with the way that you or your audience think.

Hadley Wickham developed ggplot2 and his site is a good place to start browsing for guides to using it.

There's a pretty good O'Reilly book on visualization as well, and Stephen Few's book does a really good job of enumerating the various ways you can express trends in data.

u/ictatha · 5 points · r/BusinessIntelligence

I'd recommend Guerrilla Analytics: A Practical Approach to Working with Data. I bought it on a recommendation from somewhere on reddit (possibly this sub). It provides good software-neutral guidance for building data capabilities within an organization.

u/equinox932 · 5 points · r/Romania

Also check out fast.ai; they have 4 very good courses. Then this one is good too. Hugo Larochelle had a neural networks course, a bit older.

As for books, I'd also add The Hundred Page Machine Learning Book and this one, probably the best practical book, but wait for the 2nd edition, which covers TensorFlow 2.0: it has tf.keras.layers and the sequential model, so TF 2 basically includes Keras and you get rid of that sessions crap. There's also this one, this one, and this one. Don't waste time on Bengio's deep learning book; it's superficial junk. Happy studying, and let's see as many Romanians as possible publishing ML and DL papers!

u/klaxion · 5 points · r/statistics

Recommendation - don't learn statistics through "statistics for biology/ecology".

Go straight to statistics texts, the applied ones aren't that hard and they usually have fewer of the lost-in-translation errors (e.g. the abuse of p-values in all of biology).

Try Gelman and Hill -

http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X/ref=sr_1_1?ie=UTF8&qid=1427768688&sr=8-1&keywords=gelman+hill

Faraway - Practical Regression and Anova using R (free)

http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

Categorical data analysis

http://www.amazon.com/Categorical-Data-Analysis-Alan-Agresti/dp/0470463635/ref=sr_1_1?ie=UTF8&qid=1427768746&sr=8-1&keywords=categorical+data+analysis

u/howsyourweird · 4 points · r/compsci

Two Python ML resources below. The former mixes math with working code. The latter is new and appears to be more of a guide to applying scikit-learn and/or milk specifically.

u/homebeer · 4 points · r/hadoop
u/daturkel · 4 points · r/datascience

The book Doing Data Science, cowritten by Cathy O'Neil (of Weapons of Math Destruction), may be of interest to you.

> In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

I haven't read the whole thing yet, but it's well-written and has a nice survey of topics.

u/techwizrd · 3 points · r/AskStatistics

You have ordinal categorical data. I would suggest building a straightforward regression model (like an ordinal logit model). Check out Categorical Data Analysis by Agresti for more detailed information.

You may also want to treat the hour of the day as a "circular predictor".
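
One common way to build a circular predictor, sketched here in Python under the assumption that the hour is recorded as a number from 0 to 23, is to replace it with a sine/cosine pair so that 23:00 and 00:00 end up close together. The function names are made up for illustration, and this is only one of several possible encodings.

```python
import math


def encode_hour(hour, period=24):
    """Map an hour onto the unit circle and return (sin, cos) of its angle."""
    angle = 2 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)


def circular_distance(h1, h2, period=24):
    """Straight-line distance between two hours after the circular encoding."""
    s1, c1 = encode_hour(h1, period)
    s2, c2 = encode_hour(h2, period)
    return math.hypot(s1 - s2, c1 - c2)
```

Both columns (the sine and the cosine) would then go into the regression as predictors in place of the raw hour. With the raw hour, 23 and 0 look as far apart as possible; after the encoding they are neighbours, while 0 and 12 sit on opposite sides of the circle.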

Since your data is a Likert scale, as other commenters have said, you can simply take the average to turn your response variable from an ordinal category to a continuous response. However, this changes what interpretations you can make, so it's not always the best policy. Nevertheless, it's not a bad place to start.

u/QuestionableQuestion · 3 points · r/Rlanguage

I just bought R in Action on Amazon. Seems to come well-regarded!

Edit: Also ordered R for Spatial Analysis and Mapping.

u/monumentshorts · 3 points · r/programming

Don't be. Basically everything here is either about conditional probability (naive Bayes) or weights*inputs + bias > threshold calculations. The weights*inputs stuff is basic neural networks. Support vector machines, perceptrons, and kernel transformations are all about finding linearly separable classes (in some appropriate dimension).
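
For anyone who wants to see the "weights*inputs + bias > threshold" idea spelled out, here is a minimal perceptron sketch in plain Python. It assumes a threshold of 0, binary 0/1 labels, and the classic perceptron update rule; the function names and the learning rate are illustrative choices, not taken from the linked article.

```python
def predict(weights, bias, inputs):
    """Return 1 if weights . inputs + bias is above the threshold (0), else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Fit weights and bias with the perceptron learning rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, labels):
            # error is -1, 0, or +1; weights only move on a misclassification
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule settles on a separating line.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]
and_weights, and_bias = train_perceptron(and_inputs, and_labels)
```

This only works when the classes really are linearly separable; something like XOR would never converge with a single perceptron, which is where the kernel transformations mentioned above come in.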

For more info this will help http://en.m.wikipedia.org/wiki/Perceptron

If you are interested I really have enjoyed reading http://www.amazon.com/gp/aw/d/1420067184/ref=redir_mdp_mobile which explains really well a lot of machine learning stuff and demystifies the math

For some practical examples here are some blog posts I've written:

K-means step by step in F#: http://onoffswitch.net/k-means-step-by-step-in-f/

Automatic FogBugz triage with naive Bayes: http://onoffswitch.net/fogbugz-priority-prediction-naive-bayes/

I share the links mostly because it's nice to see a worked-through, practical example with code. In the end, a lot of these algorithms aren't that hard to implement with some matrix math.
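
To give a feel for how little code that can be, here is a rough k-means sketch in plain Python. It is not the F# implementation from the linked post; it assumes points are tuples of numbers and, to keep things deterministic, naively uses the first k points as starting centroids (real implementations usually pick them at random or with k-means++).

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # leave an empty cluster where it was
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

Calling `kmeans(points, 2)` returns one cluster label per point plus the final centroids. `math.dist` needs Python 3.8 or newer.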

Anyways, hope this helps!

u/deong · 3 points · r/compsci

I haven't personally read it, but the table of contents of this book looks like it has some potential to be a more modern version of what the Mitchell book aimed to provide.

I may be teaching a course in ML next year; I may order a copy to see if it's any good. Has anyone else looked at it?

u/wscottsanders · 3 points · r/learnpython

You might look at the O'Reilly book "Mining the Social Web". I've found it very helpful, and the author has even responded to my questions about how to get the virtual machine with the IPython notebooks up and running. It has code examples, explanations of the underlying technologies, and intros to things like natural language processing.

http://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615

u/MhilPickleson · 2 points · r/java

Cassandra High Availability is a solid new book I'd recommend. It's very up to date with a section on CQL and Spark connector.

u/Archawn · 2 points · r/MachineLearning

Pick up a numerical analysis book of some sort. I learned out of Sauer, "Numerical Analysis" which I wasn't incredibly happy with and so I can't recommend it, but it covered the basics with example code in Matlab. Someone else here can probably recommend a better numerical methods textbook.

Some great free resources:

u/pulsetsar · 2 points · r/rstats

This should get you started to port over some existing skills. You can then fill in the rest with one of the suggested references.

R for SAS and SPSS Users (Statistics and Computing) https://www.amazon.com/dp/1461406846/ref=cm_sw_r_cp_awd_2doPwbMVQTRYH

u/efrique · 2 points · r/rstats

R is not like SPSS (in most ways I think it's better), but you can bridge the gap some. You could run a menu-driven system on top of it (there are several) or you can leverage your SPSS knowledge via a book like Bob Muenchen's R for SAS and SPSS Users.

u/sazken · 2 points · r/GetStudying

Yo, I'm not getting that image, but at a base level I can tell you this -

  1. I don't know if you know any R or Python, but there are good NLP (Natural Language Processing) libraries available for both

    Here's a good book for Python: http://www.nltk.org/book/

    A link to some more: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/DH2011-Manning.pdf

    And for R, there's http://www.springer.com/us/book/9783319207018
    and
    https://www.amazon.com/Analysis-Students-Literature-Quantitative-Humanities-ebook/dp/B00PUM0DAA/ref=sr_1_9?ie=UTF8&qid=1483316118&sr=8-9&keywords=humanities+r

    There's also this https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/ref=asap_bc?ie=UTF8 for web scraping with Python

    I know the R context better, and using R, you'd want to do something like this:

  2. Scrape a bunch of sites using the R library 'rvest'
  3. Put everything into a 'Corpus' using the 'tm' library
  4. Use some form of clustering (k-nearest neighbor, LDA, or Structural Topic Model using the libraries 'knn', 'lda', or 'stm' respectively) to draw out trends in the data

    And that's that!
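
For readers who'd rather stay in Python, the "put everything into a corpus and look for trends" part of the steps above can be sketched with just the standard library. This is only a stand-in for the idea: it does no scraping, it is not the `tm` package's Corpus, and it swaps topic modelling for plain bag-of-words cosine similarity. All function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_corpus(documents):
    """Turn raw strings into a list of term-count vectors (one Counter each)."""
    return [Counter(tokenize(doc)) for doc in documents]


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(corpus, index):
    """Index of the document whose word counts are closest to corpus[index]."""
    others = [i for i in range(len(corpus)) if i != index]
    return max(others, key=lambda i: cosine_similarity(corpus[index], corpus[i]))
```

From here, a real project would typically drop stop words and hand the term counts to a proper topic model or clustering routine rather than comparing documents pairwise.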

u/Mandelliant · 2 points · r/learnprogramming

It really depends on the book. Some books, like Automate the Boring Stuff with Python, are broken up into a series of projects that get progressively more advanced. On the other hand, O'Reilly's Hadoop: The Definitive Guide is more of a general walkthrough of principles and fundamentals. u/Updatebjarni gave really good advice.

u/amazon-converter-bot · 2 points · r/FreeEBOOKS

Here are all the local Amazon links I could find:


amazon.com

amazon.co.uk

amazon.ca

amazon.com.au

amazon.in

amazon.com.mx

amazon.de

amazon.it

amazon.es

amazon.com.br

amazon.nl

amazon.co.jp

amazon.fr

Beep bloop. I'm a bot to convert Amazon ebook links to local Amazon sites.
I currently look here: amazon.com, amazon.co.uk, amazon.ca, amazon.com.au, amazon.in, amazon.com.mx, amazon.de, amazon.it, amazon.es, amazon.com.br, amazon.nl, amazon.co.jp, amazon.fr, if you would like your local version of Amazon adding please contact my creator.

u/Wrennnn_n · 2 points · r/IWantToLearn

I'm reading a book called Beautiful Visualization that I strongly recommend.

http://www.amazon.com/Beautiful-Visualization-Looking-through-Practice/dp/1449379869

Think about any time someone tells a story with stats. Is it misleading? How could it be more objective? What are the trade-offs? How could people misinterpret your visual?

u/tolbertam · 2 points · r/cassandra

"Cassandra High Availability" by Robbie Strickland has been highly recommended by others. It's on the top of my list of tech books to check out.

I'd recommend reading the Dynamo and Bigtable papers if you haven't already as C* uses a lot of ideas from those papers.

The Last Pickle has a great blog that has a lot of nice deep dive articles.

The Cassandra wiki has some nice detailed pages describing the design and read/write paths.

u/Bayes_the_Lord · 2 points · r/statistics

How far in are you? I've been wanting to do more stuff with PyMC3. I recently bought Bayesian Methods for Hackers to do so but a Slack study group sounds very helpful if I were to order this book as well.

Edit: I'd just like to confirm this group does plenty of Python despite the book only using R and whatever Stan is.

u/rsoccermemesarecance · 2 points · r/datascience

Mining the Social Web

Not exactly what you're looking for but it's very helpful, imo

u/MrFromEurope · 2 points · r/hadoop

Well that sucks. I already have this one: Hadoop Guide

There is a whole chapter on HBase in there but nothing else.

u/dunesidebee · 2 points · r/datascience
u/trystanr · 2 points · r/graphic_design

Beautiful Visualisation Here

O'Reilly Designing Interfaces Here

u/Mirber · 1 point · r/datamining

From my experience this is an excellent book and I'm eagerly awaiting the second edition--which comes out in Feb 2017? :(

http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367

u/CanYouPleaseChill · 1 point · r/Python

I highly recommend Pandas for Everyone: Python Data Analysis by Daniel Chen. It's extremely practical and well-organized.

u/anonypanda · 1 point · r/consulting

I picked this up on the recommendation of a colleague. Very useful.


https://www.amazon.co.uk/Guerrilla-Analytics-Practical-Approach-Working/dp/0128002182

u/berf · 1 point · r/statistics

You have an ordered categorical (Likert) response variable and one quantitative predictor variable? You need to read up on ordered categorical data analysis. There are discussions of this in Agresti and in Venables and Ripley and, of course, lots of other places.

u/Aidtor · 1 point · r/datascience

If you want to be valuable to companies post graduation you should learn more about programming (design templates, how to write tests, how to go from a paper to code). I recommend this book as a good starting place. Once you're comfortable with how the different methods work, pick up this book.

u/Westarmy · 1 point · r/AskProgramming

Python and NumPy.

I will link some tutorials soon (Using a phone)

There's a book called:
Machine Learning: An Algorithmic Perspective

https://www.amazon.co.uk/Machine-Learning-Algorithmic-Perspective-Recognition/dp/1466583282

A useful book to have.

u/kerosion · 1 point · r/technology

Correct. I felt this would be a fairly accessible recent example of data mining I could reference for sake of brevity, rather than going into detail on classification and prediction algorithms. The Target story was close enough to examples in the Introduction to Data Mining textbook we used at my University.

u/eyesay · 1 point · r/GradSchool

OP, it's great that you're recognizing this need early! I was in the same position as you (0 programming experience), so perhaps I can offer you my strategy:

I started off learning R, because a lot of biological data analysis packages have already been written for it (see Bioconductor, http://www.bioconductor.org). R and its corresponding IDE (RStudio) are easy and free to download:
http://cran.us.r-project.org
https://www.rstudio.com

For a super basic introduction to R (this will take you only a few hours), see Code School's 'Try R' tutorial:
http://tryr.codeschool.com

All Bioconductor packages come with a reference manual and sample data that you can use to practice. Also, most have a corresponding publication that explains the algorithm/stats behind it. I just went through and picked a few packages that seemed like they would be useful for the type of data my lab analyzes.

When I finished going through those and running practice data sets, I decided I wanted to actually learn R from scratch so I could understand what each function does. To that end, I bought a few O'Reilly books:
Learning R (http://shop.oreilly.com/product/mobile/0636920028352.do)
R for Everyone (http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1398333875&robot_redir=1)

I've been dedicating my Mondays to going through the books, chapter by chapter, and doing all the exercises. It's been really helpful, and I'm finding it easier and easier to understand the Bioconductor packages.

Finally, I enlisted some other grad students that had no experience but also wanted to learn, and together, we started a weekly meetup in which we each select and demonstrate a Bioconductor package. This basically forces us to keep up with learning and master the packages, since we don't want to lose face :)

Next, I'm planning on diving into Python, but for now, learning R has proven very useful.

Hope this helps a bit, and good luck!


u/mtelesha · 1 point · r/statistics

I am a book guy myself. Ebooks to be specific. I really liked R for Everyone: http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1405401911&robot_redir=1

Pick one and stick with it. I recommend R for a number of reasons, but Python is also a good choice; R just works for me.

You need a personal project to work on. Mine was an end of the year report.

u/ogrisel · 1 point · r/Python

You can find it here on amazon.com:

http://www.amazon.com/gp/product/1420067184?ie=UTF8&tag=oliviergrisel-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1420067184

(Please feel free to strip the reference to oliviergrisel-20 in the URL should you want not to tip me through the amazon affiliates program)

u/Magical_Destroyer · 1 point · r/rstats

The author of R in Action has his website statmethods.net that is all statistics done in R.

u/j6keey · 1 point · r/computerscience

In terms of AI, I used this text for one of my AI/machine learning topics. Would recommend Artificial-Intelligence-Modern-Approach

Some other suggested readings from that topic:

Introduction Data Mining

Artificial Intelligence

u/machinedunlearned · 1 point · r/datascience

I teach applied math, stats, and computation courses to B.S. degree seeking students. Two observations:

  • First, the theory is not pointless. I know, you didn't say or imply that. But I just have to get that out there.

  • Second, my observation is that many applied mathematics/statistics courses at the undergraduate level do an astoundingly poor job of connecting mathematics to real world data. I can go on about this for days, but I'll stop with saying that (a) this is an incredible disservice to students across all disciplines that require any math, and (b) there are mathematicians and statisticians who are aware of the problem and working to make it better.

    Okay, rant concluded, book recommendations! First, try Doing Data Science by O'Neil and Schutt. This assumes some knowledge of linear algebra, stats, and programming. Examples are given in R. I think this book is very good at bringing out the idea that data science involves both theory and experience, and is good at bringing out the "feel" of working on data science problems.

    Second, if your math background is calling for plenty of math, you might take a look at Machine Learning from a Probabilistic Perspective. This takes a closer look at the data modeling process, which is somewhat lacking in more CS-oriented texts on ML. Requires good knowledge of probability, obviously.

u/adcqds · 1 point · r/datascience

The pymc3 documentation is a good place to start if you enjoy reading through mini-tutorials: pymc3 docs

Also these books are pretty good: the first is a nice soft introduction to programming with pymc & Bayesian methods, and the second is quite nice too, albeit targeted at R/Stan.

u/zith · 1 point · r/MachineLearning

I'm afraid I can't answer your question specifically, but Stephen Marsland has written a great introductory level machine learning book that has all of the example code written in Python.



u/glancedattit · 1 point · r/visualization

I would check out Ben Fry's book first.

Then Beautiful Visualization.

There is another good McCandless eyecandy.

Manuel Lima did an amazing book on network visualization with excellent essays from other people: Visual Complexity. Network vis is very difficult, and if you want to "game up", understanding the taxonomy he built for network vis will give you a real perspective on the taxonomy in other types of vis.

There are things outside of the "take data and render visualization" world that are critical to data vis, imo. For moving data vis, start with the godfather, Muybridge.

And look way way back for the long human history of data vis in cartography with stuff like Cartographia.

Hope to see some more books and discover a reading list on this thread! Great idea for a post.

u/pointy · 0 points · r/programming

The article does not discuss any performance particulars at all. I would suggest that the author buy and study the O'Reilly book SQL Tuning, which explains SQL query performance techniques in a database-neutral way.