Sunday, 11 August 2019

91 job interview questions for data scientists

  1. What is the biggest data set that you processed, and how did you process it, what were the results?
  2. Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
  3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
  4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
  5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
  6. How would you come up with a solution to identify plagiarism?
  7. How to detect individual paid accounts shared by multiple users?
  8. Should click data be handled in real time? Why? In which contexts?
  9. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?
  10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? 
  11. How do you handle missing data? What imputation techniques do you recommend?
  12. What is your favorite programming language / vendor? why?
  13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
  14. Compare SAS, R, Python, Perl
  15. What is the curse of big data?
  16. Have you been involved in database design and data modeling?
  17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
  18. What features of Teradata do you like?
  19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)
  20. Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? What would you do to increase speed by a factor of 10, and be able to handle far bigger outputs? 
  21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
  22. What are hash table collisions? How is it avoided? How frequently does it happen?
  23. How to make sure a mapreduce application has good load balance? What is load balance?
  24. Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering a hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?
  25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
  26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
  27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
  28. What is star schema? Lookup tables? 
  29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it's very interactive)
  30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
  31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
  32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
  33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
  34. Do you think 50 small decision trees are better than a large one? Why?
  35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
  36. Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?
  37. Why is mean square error a bad measure of model performance? What would you suggest instead?
  38. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
  39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
  40. Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?
  41. Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?
  42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
  43. Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
  45. How would you define and measure the predictive power of a metric?
  46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
  47. How to create a keyword taxonomy?
  48. What is a Botnet? How can it be detected?
  49. Any experience with using API's? Programming API's? Google or Amazon API's? AaaS (Analytics as a service)?
  50. When is it better to write your own code than using a data science software package?
  51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
  52. What is POC (proof of concept)?
  53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
  54. Are you familiar with software life cycle? With IT project life cycle - from gathering requests to maintenance? 
  55. What is a cron job? 
  56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
  57. Is it better to have too many false positives, or too many false negatives?
  58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples. 
  59. How does Zillow's algorithm work? (to estimate the value of any home in US)
  60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
  61. How would you create a new anonymous digital currency?
  62. Have you ever thought about creating a startup? Around which idea / concept?
  63. Do you think that typed login / password will disappear? How could they be replaced?
  64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
  65. Which data scientists do you admire most? which startups?
  66. How did you become interested in data science?
  67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
  68. What is a recommendation engine? How does it work?
  69. What is an exact test? How and when can simulations help us when we do not use an exact test?
  70. What do you think makes a good data scientist?
  71. Do you think data science is an art or a science?
  72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
  73. Give a few examples of "best practices" in data science.
  74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
  75. Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?
  76. What are your top 5 predictions for the next 20 years?
  77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
  78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
  79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations? 
  80. More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).  
  81. How many "useful" votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location - two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as "Stop Forum Spam". Create honeypot to catch fraudsters.  Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users).  Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
  82. What did you do today? Or what did you do this week / last week?
  83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
  84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
  85. What/when/where is the last data science blog post you wrote? 
  86. In your opinion, what is data science? Machine learning? Data mining?
  87. Who are the best people you recruited and where are they today?
  88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
  89. What's wrong with this picture?
  90. Should removing stop words be Step 1 rather than Step 3, in the search engine algorithm described here? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become Step 1. 
  91. Experimental design and a bit of computer science with Lego's

Thursday, 22 March 2018

Association, Aggregation, Composition, Abstraction, Generalization, Realization, Dependency
→ What is Association?
→ What is Aggregation?
→ What is Composition?
→ Difference between Aggregation VS Composition?
→ Inheritance, IS-A and Has-A
→ What is Abstraction?
→ What is Generalization?
→ What is Realization?
→ What is Dependency?
What is Association?
Association: Association is a relationship between two separate classes that is established through their objects. It also defines the multiplicity between those objects: one-to-one, one-to-many, many-to-one, or many-to-many.

→ Aggregation is a special form of Association.
→ Composition is a special form of Aggregation.
→ Both are unidirectional (one-way) relationships.

Example: A Student and a Faculty are associated.
Example With Java Code:
// Java program to illustrate the
// concept of Association
import java.io.*;

// class bank
class Bank {
     private String name;

     // bank name
     Bank(String name) {
          this.name = name;
     }

     public String getBankName() {
          return this.name;
     }
}

// employee class
class Employee {
     private String name;

     // employee name
     Employee(String name) {
          this.name = name;
     }

     public String getEmployeeName() {
          return this.name;
     }
}

// Association between both the
// classes in main method
class Association {
     public static void main(String[] args) {
          Bank bank = new Bank("Axis");
          Employee emp = new Employee("Sitansu");

          System.out.println(emp.getEmployeeName() + " is employee of " + bank.getBankName());
     }
}

Output:
Sitansu is employee of Axis
Bank and Employee are associated through their objects. A bank can have many employees (a one-to-many relationship).
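
The one-to-many relationship mentioned above can be made explicit by letting the bank keep references to its employees. The following is only a minimal sketch under that assumption: the `employees` list and the `hire` and `getEmployees` methods are hypothetical additions, not part of the original example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical variant of the Employee class from the example above.
class Employee {
    private final String name;

    Employee(String name) {
        this.name = name;
    }

    String getEmployeeName() {
        return name;
    }
}

// Hypothetical variant of Bank that references many employees.
class Bank {
    private final String name;
    private final List<Employee> employees = new ArrayList<>();

    Bank(String name) {
        this.name = name;
    }

    // Associate an already existing employee with this bank.
    void hire(Employee employee) {
        employees.add(employee);
    }

    String getBankName() {
        return name;
    }

    // Read-only view, so callers cannot modify the bank's list directly.
    List<Employee> getEmployees() {
        return Collections.unmodifiableList(employees);
    }
}

class OneToManyDemo {
    public static void main(String[] args) {
        Bank bank = new Bank("Axis");
        bank.hire(new Employee("Sitansu"));
        bank.hire(new Employee("Kuldeep"));

        for (Employee e : bank.getEmployees()) {
            System.out.println(e.getEmployeeName() + " is employee of " + bank.getBankName());
        }
    }
}
```

Each Employee is created outside the Bank and merely referenced by it, which is what keeps this a plain association rather than a composition.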

What is Aggregation?
Aggregation: It is a special form of Association where:
·        It represents a Has-A relationship.
·        It is a unidirectional association, i.e. a one-way relationship. For example, a department can have students but not vice versa, so the relationship is unidirectional in nature.
·        In Aggregation, both entities can survive individually, which means ending one entity will not affect the other.

Example:

// Java program to illustrate
// the concept of Aggregation.
import java.io.*;
import java.util.*;

// student class
class Student {
     String name;
     int id;
     String dept;

     Student(String name, int id, String dept) {

          this.name = name;
          this.id = id;
          this.dept = dept;

     }
}

/*
 * Department class contains list of student Objects. It is associated with
 * student class through its Object(s).
 */
class Department {

     String name;
     private List<Student> students;

     Department(String name, List<Student> students) {

          this.name = name;
          this.students = students;

     }

     public List<Student> getStudents() {
          return students;
     }
}

/*
 * Institute class contains list of Department Objects. It is associated with
 * Department class through its Object(s).
 */
class Institute {

     String instituteName;
     private List<Department> departments;

     Institute(String instituteName, List<Department> departments) {
          this.instituteName = instituteName;
          this.departments = departments;
     }

     // count total students of all departments
     // in a given institute
     public int getTotalStudentsInInstitute() {
          int noOfStudents = 0;
          List<Student> students;
          for (Department dept : departments) {
              students = dept.getStudents();
              for (Student s : students) {
                   noOfStudents++;
              }
          }
          return noOfStudents;
     }

}

// main method
class GFG {
     public static void main(String[] args) {
          Student s1 = new Student("Sitansu", 1, "CSE");
          Student s2 = new Student("Kuldeep", 2, "CSE");
          Student s3 = new Student("Bimal", 1, "EE");
          Student s4 = new Student("Goutham", 2, "EE");

          // making a List of
          // CSE Students.
          List<Student> cse_students = new ArrayList<Student>();
          cse_students.add(s1);
          cse_students.add(s2);

          // making a List of
          // EE Students
          List<Student> ee_students = new ArrayList<Student>();
          ee_students.add(s3);
          ee_students.add(s4);

          Department CSE = new Department("CSE", cse_students);
          Department EE = new Department("EE", ee_students);

          List<Department> departments = new ArrayList<Department>();
          departments.add(CSE);
          departments.add(EE);

          // creating an instance of Institute.
          Institute institute = new Institute("BITS", departments);

          System.out.print("Total students in institute: ");
          System.out.print(institute.getTotalStudentsInInstitute());
     }
}

Output:
Total students in institute: 4
In this example, an Institute has a number of departments, such as CSE and EE, and every department has a number of students. So the Institute class holds a reference to a list of Department objects, which means it is associated with the Department class through its objects. Likewise, the Department class holds a reference to a list of Student objects, so it is associated with the Student class through its objects.
When do we use Aggregation?
Aggregation is a good fit for code reuse: an existing object can be shared by several containing objects without being copied or subclassed.
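
To make the reuse and the independent lifetimes concrete, here is a small sketch in which one object is shared by two aggregates. The `Member` and `Club` classes are hypothetical and do not appear in the original example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical part: exists on its own, independent of any Club.
class Member {
    final String name;

    Member(String name) {
        this.name = name;
    }
}

// Hypothetical aggregate: holds references to members it did not create.
class Club {
    final String name;
    private final List<Member> members = new ArrayList<>();

    Club(String name) {
        this.name = name;
    }

    void add(Member member) {
        members.add(member);
    }

    int size() {
        return members.size();
    }
}

class AggregationLifetimeDemo {
    public static void main(String[] args) {
        Member mia = new Member("Mia");

        Club chess = new Club("Chess");
        Club robotics = new Club("Robotics");
        chess.add(mia);     // the same Member object is reused
        robotics.add(mia);  // by two independent aggregates

        chess = null;       // dropping one club ...

        // ... leaves the member and the other club untouched.
        System.out.println(mia.name + " is still in " + robotics.name
                + " (" + robotics.size() + " member)");
    }
}
```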


What is Composition?
Composition: Composition is a restricted form of Aggregation in which two entities are highly dependent on each other.
§  It represents the part-of relationship.
§  In composition, both the entities are dependent on each other.
§  When there is a composition between two entities, the composed object cannot exist without the other entity.

Let's take the example of Library.

// Java program to illustrate
// the concept of Composition.
import java.util.*;

// Book class: a Book is a part of a Library.
class Book {

     final String title;
     final String author;

     Book(String title, String author) {
          this.title = title;
          this.author = author;
     }
}

/*
 * Library class creates and owns its Book Objects, so a Book
 * does not exist outside the Library that holds it.
 */
class Library {

     private final List<Book> books = new ArrayList<Book>();

     // the Library itself creates the Book
     void addBook(String title, String author) {
          books.add(new Book(title, author));
     }

     public List<Book> getBooks() {
          return Collections.unmodifiableList(books);
     }
}

// main method
class GFG {
     public static void main(String[] args) {

          // creating an instance of Library and filling it with books.
          Library library = new Library();
          library.addBook("Effective Java", "Joshua Bloch");
          library.addBook("Thinking in Java", "Bruce Eckel");
          library.addBook("Java: The Complete Reference", "Herbert Schildt");

          for (Book book : library.getBooks()) {
               System.out.println("Title : " + book.title + " and Author : " + book.author);
          }
     }
}
Output:
Title : Effective Java and Author : Joshua Bloch
Title : Thinking in Java and Author : Bruce Eckel
Title : Java: The Complete Reference and Author : Herbert Schildt
So, if the Library is destroyed, all the books within that particular library are destroyed too; that is, a book cannot exist without its library. That is why this is composition.





Difference between Aggregation and Composition?

Aggregation VS Composition
1.     Dependency: Aggregation implies a relationship where the child can exist independently of the parent. For example, with Bank and Employee, if the Bank is deleted the Employee still exists. Composition implies a relationship where the child cannot exist independently of the parent. For example, with Human and Heart, a heart does not exist separately from a human.
2.     Type of relationship: Aggregation is a "has-a" relationship and Composition is a "part-of" relationship.
3.     Type of association: Composition is a strong association whereas Aggregation is a weak association.

Example:

// Java program to illustrate the
// difference between Aggregation
// and Composition.

import java.io.*;

// Engine class which will
// be used by car. so 'Car'
// class will have a field
// of Engine type.
class Engine {
     // starting an engine.
     public void work() {

          System.out.println("Engine of car has been started ");

     }

}

// Car class
final class Car {

     // For a car to move,
     // it needs to have an engine.
     private final Engine engine; // Composition
     // private Engine engine; // Aggregation

     Car(Engine engine) {
          this.engine = engine;
     }

     // car start moving by starting engine
     public void move() {

          // if(engine != null)
          {
              engine.work();
              System.out.println("Car is moving ");
          }
     }
}

class GFG {
     public static void main(String[] args) {

          // making an engine by creating
          // an instance of Engine class.
          Engine engine = new Engine();

          // Making a car with an engine,
          // so we pass an engine
          // instance as an argument while
          // creating an instance of Car.
          Car car = new Car(engine);
          car.move();

     }
}


Output:
Engine of car has been started 
Car is moving 


In the case of aggregation, the Car also performs its functions through an Engine, but the Engine is not always an internal part of the Car. An engine can be swapped out or even removed from the car. That is why the Engine field is made non-final in the aggregation case.
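
The aggregation variant can be sketched as follows. This is an illustration under assumptions, not the original author's code: the class names `SwappableEngine` and `AggregatedCar` and the `setEngine` method are hypothetical, and the methods return strings instead of printing so the behavior is easy to check.

```java
// Hypothetical engine that identifies itself by a label.
class SwappableEngine {
    private final String label;

    SwappableEngine(String label) {
        this.label = label;
    }

    String work() {
        return label + " engine started";
    }
}

// Hypothetical aggregation variant of the Car class above.
class AggregatedCar {
    // Non-final: the engine can be replaced or removed after construction.
    private SwappableEngine engine;

    AggregatedCar(SwappableEngine engine) {
        this.engine = engine;
    }

    void setEngine(SwappableEngine engine) {
        this.engine = engine;
    }

    String move() {
        // The null check matters here because the engine may have been removed.
        if (engine == null) {
            return "Car cannot move without an engine";
        }
        return engine.work() + ", car is moving";
    }
}

class AggregatedCarDemo {
    public static void main(String[] args) {
        AggregatedCar car = new AggregatedCar(new SwappableEngine("Petrol"));
        System.out.println(car.move());

        car.setEngine(new SwappableEngine("Electric")); // swap the engine
        System.out.println(car.move());

        car.setEngine(null); // remove the engine
        System.out.println(car.move());
    }
}
```

With a non-final field, the commented-out null check in the original `move()` method becomes necessary rather than optional.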

What is Generalization?
Generalization (inheritance), IS-A and HAS-A: Which relationship to use depends on the logical relation between the classes. It just needs to make sense.
Example:
Let's say you have these classes: Animal, Dog, Cat, Leopard, Fur, Feet.
Cat and Dog each IS-A Animal.
Leopard IS-A Cat.
Animal HAS-A Fur and HAS-A Feet.
In a nutshell:
An IS-A relationship means you inherit and extend the functionality of the base class.
A HAS-A relationship means the class uses another class, so it holds it as a member.
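
The Animal example can be written out roughly like this. It is a minimal sketch: the `sound` method, the `color` field, and the constructors are assumptions added for illustration, and Feet is omitted because it would be modelled the same way as Fur.

```java
// Fur is a separate class that an Animal holds as a member.
class Fur {
    final String color;

    Fur(String color) {
        this.color = color;
    }
}

class Animal {
    private final Fur fur; // Animal HAS-A Fur

    Animal(Fur fur) {
        this.fur = fur;
    }

    Fur getFur() {
        return fur;
    }

    String sound() {
        return "...";
    }
}

class Dog extends Animal { // Dog IS-A Animal
    Dog(Fur fur) {
        super(fur);
    }

    @Override
    String sound() {
        return "Woof";
    }
}

class Cat extends Animal { // Cat IS-A Animal
    Cat(Fur fur) {
        super(fur);
    }

    @Override
    String sound() {
        return "Meow";
    }
}

class Leopard extends Cat { // Leopard IS-A Cat, and therefore IS-A Animal
    Leopard(Fur fur) {
        super(fur);
    }

    @Override
    String sound() {
        return "Growl";
    }
}

class IsAHasADemo {
    public static void main(String[] args) {
        Animal animal = new Leopard(new Fur("spotted"));
        System.out.println(animal instanceof Cat);   // IS-A check
        System.out.println(animal.getFur().color);   // HAS-A access
        System.out.println(animal.sound());
    }
}
```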

An IS-A relationship is inheritance. The classes that inherit are known as subclasses or child classes. A HAS-A relationship, on the other hand, is composition.
In OOP, an IS-A relationship means that the child class is a type of the parent class. For example, an apple is a fruit, so you extend Fruit to get Apple.
class Apple extends Fruit {
     // ...
}
On the other hand, composition means creating objects that hold references to other objects. For example, a room has a table, so you create a class Room and, inside it, a field of type Table.

class Room {
     // ...
     Table table = new Table();
     // ...
}

A HAS-A relationship can be changed at run time, because the referenced object can be replaced, whereas an inheritance relationship is fixed at compile time. If you just want to reuse code and you know that the two classes are not of the same kind, use composition. For example, you cannot derive an oven from a kitchen; a kitchen HAS-A oven. When you feel there is a natural relationship, such as Apple is a Fruit, use inheritance.
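
The kitchen and oven case might look like the sketch below, which shows the member object being replaced while the program runs. The `Oven` interface, the two oven classes, and the `replaceOven` method are hypothetical names chosen for illustration.

```java
// Hypothetical abstraction for anything a kitchen can bake with.
interface Oven {
    String bake();
}

class GasOven implements Oven {
    @Override
    public String bake() {
        return "baking with gas";
    }
}

class ElectricOven implements Oven {
    @Override
    public String bake() {
        return "baking with electricity";
    }
}

class Kitchen {
    private Oven oven; // Kitchen HAS-A Oven

    Kitchen(Oven oven) {
        this.oven = oven;
    }

    // The member can be exchanged at run time; a superclass cannot.
    void replaceOven(Oven oven) {
        this.oven = oven;
    }

    String cook() {
        return "Kitchen is " + oven.bake();
    }
}

class KitchenDemo {
    public static void main(String[] args) {
        Kitchen kitchen = new Kitchen(new GasOven());
        System.out.println(kitchen.cook());

        kitchen.replaceOven(new ElectricOven());
        System.out.println(kitchen.cook());
    }
}
```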

What is Abstraction?
Abstraction: Data abstraction is the property by virtue of which only the essential details are shown to the user. The trivial or non-essential details are hidden. Example: a car is viewed as a car rather than as its individual components.
Data abstraction may also be defined as the process of identifying only the required characteristics of an object while ignoring the irrelevant details. The properties and behaviors of an object differentiate it from other objects of a similar type and also help in classifying and grouping objects.
Consider a real-life example of a man driving a car. He knows that pressing the accelerator increases the speed of the car and that applying the brakes stops it, but he does not know how pressing the accelerator actually increases the speed, nor does he know the inner mechanism of the car or how the accelerator and brakes are implemented. This is what abstraction is.

In Java, abstraction is achieved with interfaces and abstract classes. An interface that declares only abstract methods gives 100% abstraction, because it exposes no implementation.
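
The driver example can be expressed with an interface, as in this sketch. The names `Drivable` and `SimpleCar` and the fixed speed increment of 10 are assumptions made purely for illustration.

```java
// What the driver sees: only the controls, no implementation.
interface Drivable {
    void accelerate();

    void brake();

    int speed();
}

// Hypothetical implementation whose inner workings stay hidden from the driver.
class SimpleCar implements Drivable {
    private int speed; // internal state, not accessible through the interface

    @Override
    public void accelerate() {
        speed += 10; // assumed increment per press of the accelerator
    }

    @Override
    public void brake() {
        speed = 0;
    }

    @Override
    public int speed() {
        return speed;
    }
}

class DriverDemo {
    public static void main(String[] args) {
        // The driver works against the interface only.
        Drivable car = new SimpleCar();
        car.accelerate();
        car.accelerate();
        System.out.println("Speed after accelerating: " + car.speed());

        car.brake();
        System.out.println("Speed after braking: " + car.speed());
    }
}
```

Because the caller holds a `Drivable` reference, the implementation class can change without any change to the calling code.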
Abstract classes and abstract methods:
1.     An abstract class is a class that is declared with the abstract keyword.
2.     An abstract method is a method that is declared without an implementation.
3.     An abstract class may or may not have all abstract methods. Some of them can be concrete methods.
4.     An abstract method must be overridden in the subclass, which makes overriding compulsory, unless the subclass is itself declared abstract.
5.     Any class that contains one or more abstract methods must also be declared with the abstract keyword.
6.     There can be no object of an abstract class. That is, an abstract class cannot be directly instantiated with the new operator.
7.     An abstract class can have parameterized constructors. As with any class, the compiler supplies a default constructor only when the class declares no constructor of its own.



When to use abstract classes and abstract methods with an example

There are situations in which we will want to define a superclass that declares the structure of a given abstraction without providing a complete implementation of every method. That is, sometimes we will want to create a superclass that only defines a generalized form that will be shared by all of its subclasses, leaving it to each subclass to fill in the details.
Consider a classic "shape" example, perhaps used in a computer-aided design system or game simulation. The base type is "shape" and each shape has a color, size and so on. From this, specific types of shapes are derived (inherited), such as circle, square and triangle, each of which may have additional characteristics and behaviors. For example, certain shapes can be flipped. Some behaviors may be different, such as when you want to calculate the area of a shape. The type hierarchy embodies both the similarities and differences between the shapes.


// Java program to illustrate the
// concept of Abstraction
abstract class Shape {
     String color;

     // these are abstract methods
     abstract double area();

     public abstract String toString();

     // abstract class can have constructor
     public Shape(String color) {
           System.out.println("Shape constructor called");
           this.color = color;
     }

     // this is a concrete method
     public String getColor() {
           return color;
     }
}

class Circle extends Shape {
     double radius;

     public Circle(String color, double radius) {

           // calling Shape constructor
           super(color);
           System.out.println("Circle constructor called");
           this.radius = radius;
     }

     @Override
     double area() {
           return Math.PI * Math.pow(radius, 2);
     }

     @Override
     public String toString() {
           return "Circle color is " + super.color + " and area is : " + area();
     }

}

class Rectangle extends Shape {

     double length;
     double width;

     public Rectangle(String color, double length, double width) {
           // calling Shape constructor
           super(color);
           System.out.println("Rectangle constructor called");
           this.length = length;
           this.width = width;
     }

     @Override
     double area() {
           return length * width;
     }

     @Override
     public String toString() {
           return "Rectangle color is " + super.color + " and area is : " + area();
     }

}

public class Test {
     public static void main(String[] args) {
           Shape s1 = new Circle("Red", 2.2);
           Shape s2 = new Rectangle("Yellow", 2, 4);

           System.out.println(s1.toString());
           System.out.println(s2.toString());
     }
}

Output:
Shape constructor called
Circle constructor called
Shape constructor called
Rectangle constructor called
Circle color is Red and area is : 15.205308443374602
Rectangle color is Yellow and area is : 8.0

Advantages of Abstraction

1.     It reduces complexity by hiding the details the user does not need to see.
2.     Avoids code duplication and increases reusability.
3.     Helps to increase security of an application or program as only important details are provided to the user.

What is Realization?
Realization: Realization is a relationship between the blueprint class and the object containing its respective implementation level details. This object is said to realize the blueprint class. In other words, you can understand this as the relationship between the interface and the implementing class.

Example: A particular model of a car ‘GTB Fiorano’ that implements the blueprint of a car realizes the abstraction.


public interface MyRunnable {

}

public class RunnableTask implements MyRunnable {

}
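
Since the interface and class above are empty, a slightly fuller sketch of realization may help. It follows the car blueprint example; the names `CarBlueprint` and `GtbFiorano` and both methods are hypothetical.

```java
// The blueprint: declares what every car must provide.
interface CarBlueprint {
    String modelName();

    int numberOfWheels();
}

// A concrete model realizes the blueprint by implementing every method.
class GtbFiorano implements CarBlueprint {
    @Override
    public String modelName() {
        return "GTB Fiorano";
    }

    @Override
    public int numberOfWheels() {
        return 4;
    }
}

class RealizationDemo {
    public static void main(String[] args) {
        // The variable has the blueprint type; the object is the realization.
        CarBlueprint car = new GtbFiorano();
        System.out.println(car.modelName() + " has " + car.numberOfWheels() + " wheels");
    }
}
```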

What is Dependency?

Dependency: If a change in the structure or behavior of one class affects another related class, there is a dependency between those two classes. The reverse need not hold. This happens, for example, when one class contains or uses the other.


Example: The relationship between Shape and Circle is a dependency: Circle depends on Shape, so a change to Shape affects Circle.
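
A dependency can also arise when one class merely uses another, for instance as a method parameter. The sketch below assumes two hypothetical classes, `CircleShape` and `AreaReport`, to show the one-way nature of the relationship.

```java
// Hypothetical supplier class: knows nothing about AreaReport.
class CircleShape {
    private final double radius;

    CircleShape(double radius) {
        this.radius = radius;
    }

    double area() {
        return Math.PI * radius * radius;
    }
}

// Hypothetical client class: depends on CircleShape because it calls area().
// It holds no field of that type; the dependency exists only through the parameter.
class AreaReport {
    String describe(CircleShape circle) {
        return "Area: " + circle.area();
    }
}

class DependencyDemo {
    public static void main(String[] args) {
        AreaReport report = new AreaReport();
        System.out.println(report.describe(new CircleShape(1.0)));
    }
}
```

Renaming or removing `area()` in `CircleShape` would break `AreaReport`, while any change to `AreaReport` leaves `CircleShape` unaffected.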