Computers & technology books>Books>Databases & big data books

Best data warehousing books according to redditors

We found 70 Reddit comments discussing the best data warehousing books. We ranked the 24 resulting products by number of redditors who mentioned them. Here are the top 20.

1. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition

13 mentions

John Wiley Sons

Read Reddit comments View price

2. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data

7 mentions

Wiley

Read Reddit comments View price

3. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

4 mentions

Read Reddit comments View price

4. Practical Statistics for Data Scientists: 50 Essential Concepts

4 mentions

Read Reddit comments View price

5. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence

4 mentions

Read Reddit comments View price

6. Kimball's Data Warehouse Toolkit Classics, 3 Volume Set

3 mentions

Wiley

Read Reddit comments View price

7. The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2 and the Microsoft Business Intelligence Toolset

3 mentions

John Wiley Sons

Read Reddit comments View price

8. Database Reliability Engineering: Designing and Operating Resilient Database Systems

2 mentions

O Reilly Media

Read Reddit comments View price

9. Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

2 mentions

Read Reddit comments View price

13. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection

1 mention

Read Reddit comments View price

14. Graph Analysis and Visualization: Discovering Business Opportunity in Linked Data

1 mention

Read Reddit comments View price

15. Google BigQuery Analytics

1 mention

Read Reddit comments View price

16. Look Smarter Than You Are with Smart View and Essbase 11.1.2.2: An End User's Guide

1 mention

Read Reddit comments View price

17. Learning Elastic Stack 7.0: Distributed search, analytics, and visualization using Elasticsearch, Logstash, Beats, and Kibana, 2nd Edition

1 mention

Read Reddit comments View price

18. PostgreSQL High Availability Cookbook - Second Edition

1 mention

Read Reddit comments View price

19. DAX Patterns 2015

1 mention

Read Reddit comments View price

20. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection

1 mention

Wiley

Read Reddit comments View price

Top Reddit comments about Data Warehousing:

u/gfody · 84 pointsr/programming

First don't think of this as "DBA" stuff - you're a developer, you need to know database technology, period. Read this rant by Dennis Forbes in response to Digg's CTO's complaints about databases it's very reminiscent of TFA.

Read Data and Reality by the late William Kent (here's a free copy) and get a fundamental understanding of "information" vs. "data". Then read Information Modeling and Relational Databases to pickup a couple practical approaches to modeling (ER & OR). Now read The Datawarehouse Toolkit to learn dimensional modeling and when to use it. Practice designing effective models, build some production databases from scratch, inherit some, revisit your old designs, learn from your mistakes, write lots and lots and lots of SQL (if you want to get tricky with SQL I suggest to pickup Celko's SQL for smarties - it's like the Hacker's Delight for SQL).

Many strange models you may encounter in the wild are actually optimizations. Some are premature, some outright stupid, and some brilliant, if you want to be able to tell one from the other then you're going to dive deep into internals. Do this generically with Modern Information Retrieval and Managing Gigabytes then for specific RDBMSs Pro SQL Server Internals, PostgreSQL Internals, Oracle CORE, etc.

Reflect on how awesome practically every modern RDBMS is as a great technological achievement for mankind and how wonderful it is to be standing on the shoulders of giants. Roll your eyes a little bit whenever you overhear another twenty-something millenial fresh CS graduate who skipped every RDBMS elective bleat about NoSQL, Mongo, whatever. Try not to fly into murderous rage when another loud-mouthed know-nothing writes at length about how bad RDBMS technology is when they couldn't be bothered to learn the most basic requisite skills and knowledge to use one competently.

u/LittleOlaf · 32 pointsr/humblebundles

Maybe this table can help some of you to gauge how worth the bundle is.

| | | Amazon | | | Goodreads | |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|------------------|---------|--------------|-----------|--------------|
| Tier | Title | Kindle Price ($) | Average | # of Ratings | Average | # of Ratings |
| 1 | Painting with Numbers: Presenting Financials and Other Numbers So People Will Understand You | 25.99 | 3.9 | 20 | 4.05 | 40 |
| 1 | Presenting Data: How to Communicate Your Message Effectively | 26.99 | 2.9 | 4 | 4.25 | 8 |
| 1 | Stories that Move Mountains: Storytelling and Visual Design for Persuasive Presentations | - | 4.0 | 13 | 3.84 | 56 |
| 1 | Storytelling with Data: A Data Visualization Guide for Business Professionals (Excerpt) | 25.99 | 4.6 | 281 | 4.37 | 1175 |
| 2 | 101 Design Methods: A Structured Approach for Driving Innovation in Your Organization | 22.99 | 4.2 | 70 | 3.98 | 390 |
| 2 | Cool Infographics: Effective Communication with Data Visualization and Design | 25.99 | 4.3 | 39 | 3.90 | 173 |
| 2 | The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions | 31.71 | 3.8 | 43 | 3.03 | 35 |
| 2 | Visualize This: The FlowingData Guide to Design, Visualization, and Statistics | 25.99 | 3.9 | 83 | 3.88 | 988 |
| 3 | Data Points: Visualization That Means Something | 25.99 | 3.9 | 34 | 3.87 | 362 |
| 3 | Infographics: The Power of Visual Storytelling | 19.99 | 4.0 | 38 | 3.79 | 221 |
| 3 | Graph Analysis and Visualization: Discovering Business Opportunity in Linked Data | 40.99 | 4.2 | 3 | 3.59 | 14 |
| 3 | Tableau Your Data!: Fast and Easy Visual Analysis with Tableau Software, 2nd Edition | 39.99 | 4.0 | 66 | 4.14 | 111 |
| 3 | Visualizing Financial Data | 36.99 | 4.7 | 4 | 3.83 | 6 |

u/Yehosua · 22 pointsr/programming

Whenever I get an email about a new programming Humble Bundle, I hop over to Reddit to see if anyone else thinks it's worth buying. In this case, Reddit has failed me, because no one has shared their opinions. All is not lost, however, for I can share mine!

These are probably the most commonly recommended DevOps books:

The Phoenix Project - written in the form of a novel to teach DevOps principles. (I've read it. It's awesome.)
The DevOps Handbook - a non-fiction book from the authors of The Phoenix Project. (I've started it.)
Site Reliability Engineering - "SRE" is more or less Google's term for DevOps. This book is more or less how Google does DevOps. (I've not read it. It's available online for free.)
Accelerate: The Science of Lean Software and DevOps - Martin Fowler calls this the best programming book of this year. (I've not read it.)
The Site Reliability Workbook - a sequel of sorts to Site Reliability Engineering. Probably less popular than the others I just listed. (I've not read it.)

The Site Reliability Workbook is the only one of these that's included in this bundle. So the first question I ask myself regarding this bundle is, "Do I want to spend the time and money on this bundle's books, or should I spend that on one of the highly recommended books instead?" (Personally, I'm going with the latter.)

Otherwise, most of the books here are technology-specific, so the second question is, "Do I want to learn any of these specific technologies now, and are e-books a good way of doing it?" (Personally, my answer is no.)

Depending on how you answer the first two questions, the last question is, "Are the non-technology-specific books worth getting?" To answer that, here are Amazon links to the non-technology-specific books, for reviews and sales rankings:
The Site Reliability Workbook
Designing Distributed Systems
Database Reliability Engineering
Seeking SRE
Cloud Native Infrastructure
Practical Monitoring
Effective DevOps

u/camelrow · 19 pointsr/BusinessIntelligence

The Data Warehouse Toolkit by Kimball was recommended to me as "The Source" for DW. I just started reading it, so no experience yet.

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_apa_i_LZ-7CbHQTXGRM

u/NAMOS · 10 pointsr/onions

Basically any SRE advice for a normal service but replace/compliment HAproxy / nginx / ingress controller / ELB with the Tor daemon / OnionBalance.

I run Ablative Hosting and we have a few people who value uptime over anonymity etc and so we follow the usual processes for keeping stuff online.

Have multiples of everything (especially stuff that doesn't keep state), ensure you have monitoring of everything from connections, memory pressure, open files, free RAM etc etc.

Just think of the Tor daemon onion service as just a TCP reverse proxy, with load-balancing capability and then follow any other advice when it comes to building reliable infrastructure;

https://www.amazon.co.uk/Site-Reliability-Engineering-Production-Systems/dp/149192912X/
https://www.amazon.co.uk/Infrastructure-Code-Managing-Servers-Cloud/dp/1491924357/
https://www.amazon.co.uk/Architecting-Scale-Lee-Atchinson/dp/1491943394/
https://www.amazon.co.uk/Building-Microservices-Sam-Newman/dp/1491950358/
https://www.amazon.co.uk/Practical-Monitoring-Mike-Julian/dp/1491957352/
https://www.amazon.co.uk/Cloud-Native-Infrastructure-Justin-Garrison/dp/1491984309/
https://www.amazon.co.uk/Designing-Distributed-Systems-Brendan-Burns/dp/1491983647/
https://www.amazon.co.uk/Databases-at-Scale-Operations-Engineering/dp/1491925949/

Once you've got to grips with running a reliable service then you can start layering your Onion reverse proxy / load balancing on top.
Isolate your daemons from the clearnet (clearnet example; CloudFlare 'Access' suggests you only allow CloudFlare's IPs through your firewall guaranteeing that only traffic scrubbed by them reaches your servers)
Configure Onionbalance (clearnet example; using keepalive-d / CARP / F5 LTMs to have redundancy for the ingress controller)
Harden your software Unlike the clearnet examples of security you'll want to ensure that your software can't be 'tricked' into making a call to a hostile server (e.g. "Please fetch and attach content at this URL") - you can do this with tricks like Ablative's "QuadHop" setup or simply ensure that your software doesn't have such functionality.

All of this aside, check /u/alecmuffett's "Onions that don't suck" repo for examples that are both well setup and stable.

TL;DR; Tor is just a TCP reverse proxy with load balancer capabilities go learn some DevOps dodads

Edit: As per Alec's comment - clarify that Tor is technically a reverse proxy with load-balancing capabilities rather than a straight up TCP load balancer.

u/VaporDotWAV · 8 pointsr/BusinessIntelligence

If OP is heading down the data warehouse path, The Kimball Group Reader is fantastic for learning the concepts behind dimensional modeling.

u/thibaut_barrere · 6 pointsr/rails

It's hard to provide a full answer just based on available information, but roughly you have many different ways to achieve what you have in mind.

Some families of ways to handle this:

ETL (Extract Transform Load) -> extract from CSV, during the process transform / clean / remap any field, if you don't do incremental reload you could also dedupe in the transform step, then load a "clean dataset" into either Postgres, ElasticSearch, etc
ELT (Extract Load Transform) -> extract from CSV, dump right into ES or PG (mostly unchanged), then modify there (or query a "temporary dataset" to do a sort of ETL, clean / filter etc, and pour the data into a more "final" destination in the same datastore

What's the most adequate way to do this depends on various factors:
Do you want to deduplicate inside a single CSV (which can be achieved in memory before loading), or cross-CSVs (in which case you need a business key, with unique constraint, and do "upserts", or at least verify if you must drop the rows by checking id presence before)
Do you have many different CSV formats or are them quite different? (if they are quite different, it's often more easy to go ETL, to have a very flexible & well tested way to verify the mappings & conversions etc)
Are the outputs mostly largely similar with a bit of different fields, or mostly completely different?

Finally, here are some tools which can help:
My own gem https://www.kiba-etl.org (which I use both for ETL & ELT scenarios)
Ruby Sequel https://github.com/jeremyevans/sequel (which is actually used by my upcoming Kiba Pro offer for database related tasks)
pgloader https://github.com/dimitri/pgloader
embulk http://www.embulk.org/docs/

If you are into this for the long term, it can be worth reading a book that I often mention, which is the ETL book by Ralph Kimball. While quite old, it provides interesting patterns.

Happy to detail this more if you provide more input!

u/pooerh · 5 pointsr/SQLServer

SQL Server Developer, BI Backend Developer, Data Architect, stuff like that.

The Kimball Group Reader (Amazon link, without a referral or whatever) is a must read in the business of DWH and BI. Coincidentally, I haven't read it (impossible to get where I live), so don't ask me for an opinion if it's any good!

> I try to use the same principles of code cleanliness, separation of concerns, etc from .Net dev.

The first thing I imagine when I hear something like that is the person happily converting code into user defined scalar functions (I hope you don't use them). Not all rules from procedural development apply to SQL, remember that!

u/Grovbolle · 5 pointsr/BusinessIntelligence

https://www.amazon.com/Data-Warehouse-Lifecycle-Toolkit/dp/0470149779

https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional-ebook/dp/B00DRZX6XS

Get them to read the classics

u/PrettyMuchAVegetable · 5 pointsr/datascience

Alright, Example time. I hope this isn't TMI but I'm open to answering more questions.

I decided to use Oracle 11g R2 since I am familiar with it but a lot of my code would be portable to another vendor's implementation of SQL.

You can use www.sqlfiddle.com and choose Oracle 11g R2 from the list of Database Vendors.

First we have to put your data into the database, use the left "build schema" box with the following code to create 2 database tables and 1 primary key object:

CREATE TABLE STG_TABLENAME (
ID1 VARCHAR2(2) ,
ID2 VARCHAR2(2) ,
ID3 VARCHAR2(2) ,
"2018" INT,
"2019" INT,
"2020" INT
)
;
--Create a core table, for retaining historical data
CREATE TABLE CORE_TABLENAME (
ID1 VARCHAR2(2) ,
ID2 VARCHAR2(2) ,
ID3 VARCHAR2(2) ,
MEASURE_YEAR INT,
MEASURE INT,
CONSTRAINT CDT_PK PRIMARY KEY (ID1,ID2,ID3,MEASURE_YEAR)
) ;

--Give our stg table your sample data
INSERT INTO STG_TABLENAME VALUES ('X', 'AA', 'T1', 20, 30, 48);
INSERT INTO STG_TABLENAME VALUES ('X', 'AA', 'T2', 33, 6666, 66);
INSERT INTO STG_TABLENAME VALUES ('X', 'B', 'T1', 2, 555, 5);
INSERT INTO STG_TABLENAME VALUES ('Y', 'C', 'T2', 5, 6, 6);

--PIVOT THE STG TABLE AND INSERT THE RESULTS INTO THE CORE TABLE
INSERT INTO CORE_TABLENAME (ID1,ID2,ID3,MEASURE_YEAR,MEASURE)
SELECT ID1,ID2,ID3,"2018" as "MEASURE_YEAR","MEASURE" FROM STG_TABLENAME
UNPIVOT ( "MEASURE" FOR "2018" IN (
"2018" as '2018',
"2019" as '2019',
"2020" as '2020'
)
)
;

The STG_TABLENAME is built to accept your raw data without any questions or constraints as to duplicate keys. The idea is that you will keep this table empty most of the time, except when there is new data to be pushed to the CORE_TABLENAME table. You can use your application/load script/ETL to check for constraints later, but that is beyond the scope of what I am explaining right now. This table design is not very good in the long term, it will get wider (cols) instead of long (rows) and is confusing
The INSERT commands load the sample data one row at a time into the table. This is not how we would load a large set of data into a DB but it suits for small datasets (or programmatic entries of small sets etc).
I used a Primary KEY on the CORE_DATATABLE for multiple reasons, first it explicitly defines the table key for us, second it automatically creates an index on the key fields, and third it ensures these fields are NOT NULL and never duplicated, you don't want duplicate keys in your table, and finally this table is designed with keeping historical records in mind so it may get quite large (long due to rows).

The final Insert pivots your data into a more useful long term format like u/seanv507 suggested. The code might look a bit complicated but pivoting/unpivoting is normal when data is coming in from the real world where long term db storage is not always top of mind so you would be wise to learn it if the data ingestion is part of your role. Really this pivoting could be done in any language, SQL, Python, or even powershell. An ETL Tool like Informatica or open source Talend or Pentaho can handle this as well. (Sorry I might be way outside the scope of your question now), anyway lets look at querying both of these tables we've made. In the Right hand side SQL Query box you can put all or any of these individual queries:

--See everything in both tables
SELECT FROM STG_TABLENAME;
SELECT FROM CORE_TABLENAME;

--Make use of the select list to see only columns you want
SELECT ID1,ID2,ID3,2018 FROM STG_TABLENAME;
SELECT ID1,ID2,ID3,MEASURE_YEAR,MEASURE FROM CORE_TABLENAME;

--Make use of the WHERE Clause to select only records you want to see
SELECT ID1,ID2,ID3,"2018" FROM STG_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1';

SELECT ID1,ID2,ID3,MEASURE_YEAR,MEASURE FROM CORE_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
AND MEASURE_YEAR=2018;

--Summing up the year 2018 under both table structures
SELECT ID1,ID2,ID3,SUM("2018") as SUM_2018 FROM STG_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
GROUP BY ID1,ID2,ID3;

SELECT ID1,ID2,ID3,MEASURE_YEAR,SUM(MEASURE) FROM CORE_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
AND MEASURE_YEAR=2018
GROUP BY ID1,ID2,ID3,MEASURE_YEAR;

--Summing and grouping multiple years under both table designs
SELECT ID1,ID2,ID3,SUM("2018") as SUM_2018, SUM("2019") as SUM_2019 FROM STG_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
GROUP BY ID1,ID2,ID3;

SELECT ID1,ID2,ID3,MEASURE_YEAR,SUM(MEASURE) FROM CORE_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
AND MEASURE_YEAR IN (2018,2019) -- Note the use of the IN operator
GROUP BY ID1,ID2,ID3,MEASURE_YEAR;

--SUMMING Multiple years but grouping only by ID1,ID2 and ID3
SELECT ID1,ID2,ID3,SUM("2018" + "2019") as SUM_18_19 FROM STG_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
GROUP BY ID1,ID2,ID3;

SELECT ID1,ID2,ID3,SUM(MEASURE) FROM CORE_TABLENAME
WHERE ID1='X'
AND ID2='AA'
AND ID3='T1'
AND MEASURE_YEAR IN (2018,2019) -- Note the use of the IN operator
GROUP BY ID1,ID2,ID3;

I chose a variety of different queries to highlight different aspects of SQL querying for you. The -- comments out the line that follows it, so I made short notes as to what the queries are doing.

To learn more about SQL I always recommend people take the Stanford DB5 open course and the DB 8 course on relational modelling.

https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about

To learn more about data warehousing I reccomend reading some Kimball:
https://www.amazon.ca/Kimballs-Data-Warehouse-Toolkit-Classics/dp/1118875184/ref=sr_1_3?s=books&amp;ie=UTF8&amp;qid=1537275891&amp;sr=1-3

Yes there are other design paradigms, this is the one I find works best in real world situations that I tend to run into in my practice.

Feel free to ask me any questions that pop up, I'm happy to talk SQL or DW. Also don't get discouraged if you find any of this overwhelming or complex. It's actually very approachable and with a bit of effort, a course or some reading combined with practice you'll be humming along in SQL in no time. Keep in mind a couple of these statements/queries (I'm looking at you UNPIVOT)took me 2-3 times to get the syntax right, and I've been doing SQL for 9 years and have an Oracle cert in SQL.

u/randumnumber · 4 pointsr/oracle

ohh "set things up" is a very very wide term. OBIEE can do a ton of stuff. First do you have a data warehouse? What is the source of your data? I can give you the basics. OBIEE uses a metadata repository its called and RPD this is the source of all queries. You pull metadata from your source and then build out the RPD through a physical -> Business -> Presentation layer. The Business layer can do quite a bit of work for you in terms of combining dimensions and joins but you want as much of a star schema as possible from the source. Read Kimballs book listed below to understand star schema and warehousing concepts.

Inside of the OBI admin tool there is also some user management, user management isa whole nother aspect. Are you using some ldap authentiacaiton or will you be managing users though obiee? There are USERS, GROUPS, & ROLES. This is another aspect to deal with.

There is also the EM web portal, Enterprise Manager from here you do other management of users and roles and the actual services. This is another thing, where is this hosted? Do you already have OBIEE 11g set up on a server? If so you will need access to that box to do services management. Also may need to modify config files here.

Then there is the actual reporting service, OBIEE uses dimensions and a fact to create charts, pivot tables etc. Here you will log into the web front end this would be accessed by going to http://servername:port/analytics From here you log in as your development user by default its weblogic i beileve. And here is where you would create dashboards etc.

This is just one aspect of the tool set, there is also BIP (bi publisher) used to develop reports from various sources by creating a template and filling the template out by using XML.

Oracle offers classes, which if your managment is throwing you into OBIEE they should be giving you at least 1 class. The report building stuff is easy enough to pick up, but if you are responsible for the management of the server, you need a class.. there is just so much to know about it.

I have worked in the RPD and reports/dashboard building side of things for 2 years. and im still learning stuff (usually the limitations of OBIEE). We have a whole nother TEAM(TEAM) of people who manage the databases and server side.

Resources:

Get a subscription to METALINK from oracle to issue service requests and look up bug fixes etc.

https://login.oracle.com/mysso/signon.jsp

Blogs:

http://www.rittmanmead.com/
http://gerardnico.com/

There are also youtube videos to explain simple stuff for setting up and RPD etc. You can also download an entire sample setup of OBIEE 11g from oracle.. its a huge download 50gb or something like that, but it has database, RPD, sample reports. all in a virtual machine. You can spend a week setting it up just to have examples to work from.

There is plenty of resources, but to give 1 generalized resource is difficult, you need to search for specific things you need to do. "Installing obiee11g on linux" "importing meta data into RPD"

If you need books on Data Warehousing and explanations of STAR schema and data denormalization I suggest reading up on kimball method:

http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/1118530802/ref=sr_1_1?ie=UTF8&amp;qid=1377213718&amp;sr=8-1&amp;keywords=kimball

and

Inmon

http://www.amazon.com/Building-Data-Warehouse-W-Inmon/dp/0764599445/ref=sr_1_2?ie=UTF8&amp;qid=1377213827&amp;sr=8-2&amp;keywords=inmon

They have different philosophies for data warehousing i personally subscribe to the Kimball method because it supports rapid development better.

I'd like you to know but not discourage you, this is a large undertaking for 1 person. We manage 2 RPD's and 2 sets of dashboards for a custom reporting application we also do the ETL and warehousing. The whole warehouse was set up by a team, then we moved in ETL is handled by another team of people and we have a team doing reporting, then there is management and functional. So building out an OBIEE implementation from the ground up doing warehousing is a huge undertaking. There is another team of people doing server management and upgrades, and migrations.

This is at least a 3 man job, with each person being specialized. Push for RPD traning, Server managment Traning, and dashboard design Training. Warehousing methods and ETL work is another story.

u/elus · 4 pointsr/SQLServer

Grab a copy of dimensional modeling book from the Kimball group. This one deals with MS technology specifically.

To get a better understanding of data visualization theory and techniques I suggest reading all the material from Stephen Few and Edward Tufte. Few's book on dashboard is fantastic.

Network and meet people in the industry through events thrown by The Data Warehouse Institute. Download free datasets online and use those to gain insight into subjects you're interested in as well as to practice the skills you learn as you play around with SQL Server.

I can add more later but I have to go for the day. Feel free to post more specific questions in this thread and I'll try to answer.

u/bbowler86 · 4 pointsr/BusinessIntelligence

The Data Warehouse ETL Toolkit

u/yahelc · 4 pointsr/dataengineering

The most important reading from a database design perspective, IMO, is one of Kimball’s books:

https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802

It’s less technically focused, and more focused on how to build good datasets. It’s an older text so it’s references to specific technologies are a bit out of date, but when it comes to describing how to design particular schemas (or at least speak the language of people who design schemas), it’s pretty much canon.

u/johngabbradley · 4 pointsr/BusinessIntelligence

Dimensional modeling is under valued in today's climate. Any complex models on a large scale will be more effective when modeled optimally.

https://www.amazon.com/Data-WarehouseÂ-ETL-Toolkit-Techniques-Extracting/dp/0764567578

u/Autoexec_bat · 4 pointsr/BusinessIntelligence

Assuming you've already read the Data Warehouse Toolkit? If not, do. http://www.amazon.com/dp/1118530802

u/Rehd · 3 pointsr/learnSQL

Anything by Kimball

u/muraii · 3 pointsr/datascience

Look up the DMBOK and Ralph Kimball’s The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_api_r4XMBbY0729K9 .

u/CarbonChauvinist · 3 pointsr/PowerBI

Agree completely the Purple Book is the Bible! And yes, wait for the 2nd edition at this point as it's very close. They also just released version 2 of their Mastering DAX video class on SQLBI as well which I'm looking forward to going back over.

DAX patterns is amazing as well, but maybe easier to access via web and I'm sure that will be seeing a new edition soon.

I cant' stress enough how useful the Analyzing Data/Orange Book was for me personally. It was more of a high-level overview of best practices for constructing data models which was crucial for me coming from unrelated backgrounds.

u/SQLSavant · 3 pointsr/datascience

If you're working in an enterprise environment, then most likely your data will live - at the source, in a transaction-based database (OLTP). For this, I'd recommend Database Design for Mere Mortals - it's a well written book that is more heavily based on the practical application of how your data is architected, designed and stored and less on the theoretical side of things - but it's written in a way I feel most any learned person can understand. For theoretical review, there's always the seminal work of E.F. Codd's A Relational Model For Large Shared Data Banks and also some of his follow up work The Relational Model for Database Management

From the analytical database side of things (Data Warehouses/BI Solutions) and, where hopefully you'll actually be pulling and manipulating your data from there is The Definitive Guide to Dimensional Modeling - this is a more verbose read - and not practical, but more thought experiment provoking and includes the business reasons why dimensional modeling should be used so that Data Science/Data Analytics professionals can get at their data - nevertheless - for most large companies this is the "foundation" by which your data sits on if you're a Data Scientist. I, unfortunately, do not have a good recommendation for the practical application of OLAP databases as I've never found one that generally tickled my fancy.

Just skimming through these and periodically reading through them should at least give you an idea about how your data is stored, which more importantly gives you an idea around how it can be pulled and manipulated by the systems within your company.

As an example, I had a hard time explaining once to a research assistant why I couldn't 100% match two free-text string fields with names in them to one another in a large data set. I tried explaining to him that while there is fuzzy string matching algorithms I can apply to a given data set (Like Jaro-Winkler or Levenshtein), that it wasn't always 100% and was an approximation - I guess he wanted me to further the field of Computer Science by making fuzzy string matching 100% and therefore doing what many CS and Stats gurus haven't been able to do -shrugs-.

u/elliotbot · 3 pointsr/cscareerquestions

Resources I used for my DE interviews:

Kimball's Data Warehouse Toolkit
Mode Analytics SQL Tutorial

Less useful:
Database Design for Mere Mortals
SQL Cookbook

u/StrafeReddit · 3 pointsr/Database

Start with the source, Kimball:

http://www.amazon.com/Data-Warehouse-Toolkit-Dimensional-ebook/dp/B00DRZX6XS/ref=sr_1_1?ie=UTF8&amp;qid=1375223918&amp;sr=8-1&amp;keywords=data+warehouse+kimball

u/wolf2600 · 3 pointsr/SQL

Kimball's dimensional modeling. It's the standard for data warehousing.

https://smile.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802/

u/savoy9 · 2 pointsr/PowerBI

You should consider reading this book. Its a great primer on Dimensional Modeling

https://smile.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802/ref=sr_1_1?s=instant-video&ie=UTF8&qid=1549073081&sr=8-1&keywords=kimball+dimensional+modeling

u/ichp · 2 pointsr/SQL

It seems like you can use a date dimension here.

To read about multi-dimensional concepts: http://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1415023036&amp;sr=1-1&amp;keywords=the+data+warehouse+toolkit

u/sathley90 · 2 pointsr/databases

Also The Data WarehouseÂ ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data https://www.amazon.com/dp/0764567578/ref=cm_sw_r_cp_apa_lnUSAbDC5NK4X

u/[deleted] · 2 pointsr/BusinessIntelligence

I know this post is kinda stale, but do yourself a favor and pick up the Kimball Group Reader book. Fantastic for understanding the concepts behind designing and building dimensions, facts, and how they represent various business processes.

From what you say, you sound like a "senior" data analyst: you write queries that give correct results in a performant manner. You could probably look for a "junior" DWH engineer role which should pay about the same or maybe 10% less per year. However, as you grow in experience, the top end for relational DWH engineers is around US$110,000 per year while I would expect a data analyst to top out around US$70,000 per year.

Aside from learning DWH concepts, I'd say learn Tableau or Power BI as a data analyst should know not just how to get to the data, but how to present it in a meaningful manner. Learn about the "concepts" of ETL and the patterns they implement. Again, the book I referenced above will give some good examples of what should be done within each letter of E-T-L. These are the most logical and attainable next steps in your career. From there, once you understand more about ETL, the next logical step is to start thinking about how all the various systems integrate with one-another.

u/thephilski · 2 pointsr/SQL

>data warehouse toolkit

Can you confirm that this is the book you are referring to? Amazon Link

u/arbiter_of_tastes · 2 pointsr/datascience

Definitely. Don't forget atomic modeling, either. As I said, I'm not an architect, but this book I see commonly referenced around data warehouse modeling:
https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional-ebook/dp/B00DRZX6XS/ref=sr_1_1?ie=UTF8&amp;qid=1539091831&amp;sr=8-1&amp;keywords=kimball+modeling

It's probably also worth pointing out that data warehouse design can be simplified by obtaining a pre-built data model (either purchased from someone like IBM for specific sectors, or obtained open-source, such as OMOP for healthcare), or data warehouse design and modeling can be outsourced to consultants that do it for you. Those may not be great options for you, but if you have a mission critical need for your business, it might be nice to have access to an experienced architect.

u/flipstables · 2 pointsr/cscareerquestions

BI? How much about data warehousing theory do you know? I hope you have a thorough understanding of Kimball's methodology.

For ETL, focus on specific ETL tools (e.g. SSIS) but also know how to custom build your own tool from the ground up using a scripting/programming language. You could strictly specialize with one vendor like Microsoft or you could branch out to other BI stacks.

If you want to be more of a "full stack" BI developer, again you have to figure out whether you want to be a Microsoft specialist or know the range of technologies out there. If you don't know, I would focus my energy on learning vendor neutral skills for now and figure the rest out later. For instance, you're going to want to learn MDX very, very well no matter which platform(s) you decide to pursue.

u/nakun · 2 pointsr/datascience

Hi,

I am starting out studying Data Science myself (not employed in the field). Here is what has been useful to me:

Data Quest has some free lessons and they are good. (They also have a weekly newsletter that has learning resources/articles).

Practical Statistics for Data Scientists has been very helpful in getting me up to speed on statistics (note that the code here uses R, not Python).

u/datadude · 2 pointsr/statistics

Data Analysis using SQL and Excel

u/welshfargo · 2 pointsr/Database

Informatica is widely used for ETL tools, but more important is understanding the challenges. How to design staging tables, what update strategy to use, how to design restartable jobs, and so on. This book refers to data warehousing, but the techniques are applicable to most scenarios.

u/fullofbones · 1 pointr/PostgreSQL

The first chapter of PostgreSQL High Availability Cookbook has a couple sections on picking out hardware, with example spreadsheets to calculate everything. Unfortunately the sizing makes the assumption that you know roughly how many concurrent users you might have, the number of queries that will be running, your expected database size, and so on.

u/phl12 · 1 pointr/datascience

Practical Statistics for Data Science is on sale right now on Amazon. Only $13!

u/babypng · 1 pointr/BusinessIntelligence

There is also a Microsoft specific version that shows how to apply Kimball methodology with SSxS stack.

u/fhoffa · 1 pointr/bigquery

The best book around (a little outdated, but it comes straight from the engineers who made BigQuery):

https://smile.amazon.com/Google-BigQuery-Analytics-Jordan-Tigani-ebook/dp/B00JUUZIZI

Many partners offer classes, including official GCP trainings:
https://cloud.google.com/training/data-ml

You could also take the Google Certified Professional exam:
https://cloud.google.com/certification/data-engineer

u/abcoolynr · 1 pointr/learnprogramming

https://www.amazon.in/MySQL-Tutorials-Beginners-Advanced-Languages-ebook/dp/B079QH8SQP/ref=sr_1_3?keywords=mysql&qid=1558427485&s=gateway&sr=8-3

&#x200B;

http://www.luciopanasci.it/Ebooks/MySQL%20Cookbook,%203rd%20Edition.pdf [In depth]

&#x200B;

https://downloads.mysql.com/docs/refman-8.0-en.pdf [Direct from MYSQL itself]

&#x200B;

https://www.hackerrank.com/domains/databases [Practice problems]

&#x200B;

http://iips.icci.edu.iq/images/exam/databases-ramaz.pdf [How to design a database]

u/onetwosex · 1 pointr/datascience

OP, so that you know, you mention uncle Bob's "Clean Code", but your link redirects to the book "Clean Coder". They're both great, but different.

I've ordered the book Practical Statistics for Data Scientists: 50 Essential Concepts. Looks great to brush up the basics of statistics and machine learning. Since I haven't actually read it yet, take my input with a grain of salt.

u/seadave77 · 1 pointr/BusinessIntelligence

http://www.amazon.com/The-Kimball-Group-Reader-Relentlessly/dp/0470563109

This a great book. Bite size chunks explaining why to do a warehouse and how. Pretty much any Kimball book it seems good.

u/hagemajr · 1 pointr/AskReddit

Awesome! I kind of fell into the job. I was initially hired as a web developer, and didn't even know what BI was, and then got recruited by one of the BI managers and fell in love. To me, it is one of the few places in IT where what you create will directly impact the choices a business will make.

Most of what I do is ETL work (taking data from multiple systems, and loading them into a single warehouse), with a few cubes (multidimensional data analaysis) and SSRS report models (logical data model built on top of a relational data store used for ad hoc report creation). I also do a bit of report design, and lots of InfoPath 2010 + SharePoint 2010 custom development.

We use the entire Microsoft BI stack here, so SQL Server Integration (SSIS), Analysis (SSAS), and Reporting Services (SSRS). Microsoft is definitely up and coming in the BI world, but you might want to try to familiarize yourself with Oracle BI, Business Objects, or Cognos. Unfortunately, most of these tools are very expensive and not easy to get up and running. I would suggest you familiarize yourself with the concepts, and then you will be able to use any tool to apply them.

For data warehousing, check out the Kimball books:

Here and here and here

For reporting, get good with data visualizations, anything by Few or Tufte, like:

Here and here

For integration, check these out:

Here and here

Also, if you're interested in Microsoft BI (SSIS, SSAS, SSRS) check out this site. It has some awesome videos around SSAS that are easy to follow along with.

Also, check out the MSDN BI Blog: http://blogs.msdn.com/b/bi/

Currently at work, but if you have more questions, feel free to shoot me a message!

u/vededju · 1 pointr/Accounting

I taught myself how to use Essbase in 1 day using this book.

Learning essbase isn't difficult, you can pick it up pretty easily.

u/FabianKiel · 1 pointr/SQLServer

I like my own book. ;) You find a free excerpt here: http://bi-solutions.gaussling.com/sql_basics_excerpt/

If you want to go directly to amazon: https://www.amazon.com/dp/B0733D7MCR/ref=cm_sw_r_cp_dp_T2_fXmtzb2XB2WRY

u/evolving6000 · 1 pointr/PowerBI

amazon has some great books on this. You can also search YouTube for dimensional modeling. Follow William McKnight for great info on this too.

u/heroicjunk · 1 pointr/mentors

Believe it or not, attitude is half (if not more) of the battle. Keep doing what you are doing to continuously expand and build upon your skills.

With that said, I cannot stress highly enough taking a look at the Kimball Group Reader. It is chock full of real-world scenarios and is structured in a way that lends itself great to newcomers and seasoned pros alike. I think this will help to supplement what you are learning with the AdventureWorks system at the moment.

Once you familiarize yourself with SQL Server, I would recommend researching Oracle (Database, OLAP, OBIEE) and other technologies to better understand how they are alike, but also how they differ.

A cool way to learn more about data analysis is to try and apply it to something that interests you and expand from there. For example, if you like sports, you can mine football, baseball, etc... statistics for analysis. Find something that interests you and it may help reinforce your learning, especially on the parts that you may not find as satisfying to learn.

My background: I have been in Data Warehousing for about 8 years now and in my current role I flex between an Oracle DBA and DW Architect.

Best of luck to you and continue to strive learning new things!

u/story-of-your-life · 1 pointr/statistics

The book Practical Statistics for Data Scientists is worth a look. https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/1491952962

u/CalvinLawson · 0 pointsr/SQL

Read this book:

http://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802

Seriously.

Best data warehousing books according to redditors

1. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition

2. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data

3. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

4. Practical Statistics for Data Scientists: 50 Essential Concepts

5. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence

6. Kimball's Data Warehouse Toolkit Classics, 3 Volume Set

7. The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2 and the Microsoft Business Intelligence Toolset

8. Database Reliability Engineering: Designing and Operating Resilient Database Systems

9. Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

10. Data Analysis Using SQL and Excel

11. MySQL: MySQL Tutorials for Beginners Basic to Advanced MySQL Languages

12. SQL: Learn SQL the easy way

13. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection

14. Graph Analysis and Visualization: Discovering Business Opportunity in Linked Data

15. Google BigQuery Analytics

16. Look Smarter Than You Are with Smart View and Essbase 11.1.2.2: An End User's Guide

17. Learning Elastic Stack 7.0: Distributed search, analytics, and visualization using Elasticsearch, Logstash, Beats, and Kibana, 2nd Edition

18. PostgreSQL High Availability Cookbook - Second Edition

19. DAX Patterns 2015

20. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection

Top Reddit comments about Data Warehousing: